Move startup context into system reminders#4053
Conversation
E2E Test Summary ReportLocal verification target: the current PR branch was built with ResultAll planned E2E groups passed against the local built CLI after the latest merge Raw API logs were inspected locally during the run. Since those local files are
Latest Focused Validationcd packages/core && npx vitest run src/core/client.test.ts src/utils/environmentContext.test.ts
npm run build && npm run typecheck && npm run bundleObserved results:
Notes
|
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
The refactor of getInitialChatHistory now calls warmAll() and getMcpServerInstructions() on the tool registry even when startup context is skipped, so the resume test stub needs those methods to avoid throwing inside the resume catch block.
…-startup-context # Conflicts: # packages/core/src/core/geminiChat.test.ts # packages/core/src/core/geminiChat.ts # packages/core/src/tools/mcp-client.ts
…-startup-context # Conflicts: # packages/core/src/core/client.test.ts # packages/core/src/core/client.ts
wenshao
left a comment
There was a problem hiding this comment.
CodeRabbit Review (gemini-3-pro)
6 suggestions found. Build and tests pass (712 passed, 0 failures).
Note: The existing Critical finding by @wenshao about MCP server instruction escaping (
environmentContext.ts:195) is the highest-priority issue and should be addressed first. The suggestions below are supplementary.
Summary of Findings
| # | File | Line | Severity | Issue |
|---|---|---|---|---|
| 1 | environmentContext.ts | 148 | Suggestion | serverName not escaped with escapeSystemReminderTags |
| 2 | environmentContext.ts | 262 | Suggestion | getStartupContextLength: prefix-only check, no closing-tag verification |
| 3 | client.ts | 485 | Suggestion | refreshStartupContextReminder does not propagate includeDeferredToolsReminder option |
| 4 | prompts.ts | 369 | Suggestion | buildDeferredToolsSection is dead code (no callers pass deferredTools anymore) |
| 5 | environmentContext.ts | 227 | Suggestion | warmAll() unconditionally called even for sub-agents with includeDeferredToolsReminder: false |
Deterministic Analysis
TSC and ESLint passed with 0 errors.
Test Results
| Suite | Result |
|---|---|
| Build | ✅ Pass |
| Core tests | ✅ 630 passed |
| CLI tests | ✅ 82 passed |
| Total | 712 passed, 2 skipped, 0 failures |
Reviewed by CodeRabbit using gemini-3-pro model.
wenshao
left a comment
There was a problem hiding this comment.
Review: Move startup context into system reminders
Overall this is a well-thought-out architectural change — moving startup context from the system instruction into chat history entries, and adding progressive MCP tool reminders. The migration is clean and test coverage is solid. I found 4 actionable issues (2 critical, 1 major, 1 suggestion) and 9 minor/nit items.
Critical / Major Issues
1. [Critical] Startup context permanently lost after auto-compaction
tryCompress() replaces history via setHistory(newHistory) where newHistory = [summary, model_ack, ...historyToKeep]. The startup context (working directory, date/time, OS, directory structure, MCP instructions) that was at position 0 is consumed into historyToCompress and replaced by the summary.
Before this PR, environment context lived in the system instruction (sent every API request, unaffected by compaction). After this PR it lives only in a chat history user entry — compaction removes it permanently. The comment at line 533-536 explicitly acknowledges "no env-context refresh here".
Manual /compress correctly recovers (via startChat() → getInitialChatHistory()), but auto-compaction does not.
Suggestion: Call refreshStartupContextReminder() after setHistory(newHistory) in the COMPRESSED branch.
2. [Critical] Reconnected MCP tools silently skipped from re-announcement
When an MCP server disconnects, removeMcpToolsByServer() clears tools from the registry and revealedDeferred, but announcedDeferredToolNames retains the old tool names. On reconnect, queueAddedMcpToolsReminder checks !this.announcedDeferredToolNames.has(tool.name) — the stale name is still present, so the tool is silently skipped. The model sees updated tool declarations but the user never gets a "new tools available" reminder.
Suggestion: After line 567, reconcile announcedDeferredToolNames against currentDeferredNames to remove stale entries.
3. [Major] stripOrphanedUserEntriesFromHistory can pop MCP tool reminders on retry
The function pops all trailing user entries. If drainPendingAddedMcpToolsReminder injects a reminder, then the user's turn fails (no model response), both the user message AND the MCP reminder get popped as trailing orphans. Since pendingAddedMcpTools was already cleared, the announcement is permanently lost — queueAddedMcpToolsReminder won't re-queue because the tool is already in announcedDeferredToolNames.
4. [Suggestion] Unnecessary filesystem I/O in refreshStartupContextReminder
When called after revealing a deferred tool, this rebuilds the entire startup context including getFolderStructure() (directory-scanning I/O). Only the deferred tools list changed; the environment context is static within a session. Consider caching the environment context or splitting the rebuild.
Minor / Nit Items
- tool-search.ts:469 — Comment still says "deferred-tools section of the system prompt" but deferred tools are now in a startup reminder.
- prompts.ts:82, 118, 369 —
buildDeferredToolsSectionanddeferredToolsparameter are dead code (no production callers pass it). - client.ts:656-680 —
warmAll()called 3× duringstartChat: explicitly, insidegetInitialChatHistory, and insidesetTools(). The 2nd and 3rd are no-ops. - environmentContext.ts:106 —
JSON.stringifyescapes JSON injection but not XML-like markup. A tool name likefoo</system-reminder>barwould inject a closing tag inside the wrapper. - geminiChat.ts:359 —
appendCuratedContentmerges consecutive user entries unconditionally. WhenextractCuratedHistorydrops an invalid model response, the startup reminder and real user message become adjacent and get merged, removing the structural boundary. - geminiChat.ts:1226 — Guard only checks
parts[0].text.startsWith(SYSTEM_REMINDER_OPEN). IfappendCuratedContentmerges entries, the reminder could move toparts[1]and the guard fails. - tool-registry.ts:703 —
ToolRegistry.getMcpServerInstructions()delegates toMcpClientManager.getServerInstructions(). Consider renaming the McpClientManager method for a 1:1 match. - environmentContext.test.ts — Missing: adversarial server-name test for
buildMcpServerInstructionsReminder, direct unit tests forgetStartupContextLength. - client.test.ts:1042 — Missing: test for the replacement path of
refreshStartupContextReminder(existing test only covers the removal path).
wenshao
left a comment
There was a problem hiding this comment.
Review: Move startup context into system reminders
Overall a well-designed architectural change. The migration from system-instruction to chat-history user entries is clean, and progressive MCP tool reminders are a good addition. I found 4 actionable issues beyond the existing comments from @wenshao:
Critical / Major
-
[Critical] Startup context lost after auto-compaction —
tryCompress()replaces history viasetHistory(newHistory)(geminiChat.ts:537). The startup context (wd, date/time, OS, directory structure, MCP instructions) at position 0 is consumed intohistoryToCompressand replaced by the summary. Before this PR, env context lived in the system instruction (unaffected by compaction); now it only exists in chat history. The comment at L533-536 acknowledges "no env-context refresh here". Manual/compressrecovers (viastartChat()), but auto-compaction does not. -
[Critical] Reconnected MCP tools silently skipped —
announcedDeferredToolNamesis never cleaned up when an MCP server disconnects. On reconnect, tools are inannouncedDeferredToolNames→ skipped byqueueAddedMcpToolsReminder→ no "new tools" reminder. (Inline comment on client.ts:571) -
[Major]
stripOrphanedUserEntriesFromHistorycan pop MCP reminders on retry — The guard only fires whenhistory.length === 1. When an MCP reminder + user turn are both trailing, both get popped. SincependingAddedMcpToolswas already cleared, the announcement is permanently lost. (Inline comment on geminiChat.ts:1231)
Suggestion
- [Suggestion] Unnecessary filesystem I/O in
refreshStartupContextReminder— Rebuilds entire startup context includinggetFolderStructure()when only deferred tools changed. Consider caching the environment context. (Inline comment on client.ts:485)
Minor / Nit (not commented inline)
- tool-search.ts:469 — Stale comment still says "deferred-tools section of the system prompt" (now in startup reminder).
- prompts.ts:82, 118, 369 —
buildDeferredToolsSectionanddeferredToolsparameter are dead code (no production callers). - client.ts:656-680 —
warmAll()called 3× duringstartChat(explicitly, insidegetInitialChatHistory, insidesetTools()). 2nd and 3rd are no-ops. - environmentContext.ts:106 —
JSON.stringifydoesn't escape</>. A tool name likefoo</system-reminder>barwould inject a closing tag. - geminiChat.ts:359 —
appendCuratedContentmerges consecutive user entries unconditionally. IfextractCuratedHistorydrops an invalid model response, the startup reminder and user message become adjacent and get merged. - tool-registry.ts:703 —
getMcpServerInstructions()delegates togetServerInstructions()(naming mismatch). - Test gaps: direct unit tests for
getStartupContextLength, replacement-path test forrefreshStartupContextReminder, adversarial server-name test forbuildMcpServerInstructionsReminder.
Resume safety (verified OK)
buildApiHistoryFromConversationnever includes startup context (it wasn't chat record before either);startChat(resumedHistory)prepends a fresh one. No duplication.STARTUP_CONTEXT_MODEL_ACKremoval is clean (zero remaining references).
- Centralize escapeSystemReminderTags() in wrapSystemReminder() so MCP server instructions, server headings, and JSON.stringify'd tool names/descriptions can no longer break out of the reminder envelope with an embedded </system-reminder>. - Re-prepend a fresh startup prelude after auto-compaction (restoreStartupContextAfterCompaction on the ChatCompressed path); previously only manual /compress rebuilt it, so auto-compaction silently lost env/tool/MCP context for the rest of the session. - Prune stale names from announcedDeferredToolNames when a tool leaves the deferred set so a reconnecting MCP server is re-announced. - Preserve trailing <system-reminder> entries in stripOrphanedUserEntriesFromHistory so a failed-prompt retry can't drop a mid-history MCP added-tool reminder. - Require a closing tag in getStartupContextLength to avoid misclassifying a user message that merely starts with the literal tag. - Remove dead buildDeferredToolsSection() and the unused deferredTools params from getCoreSystemPrompt/getCustomSystemPrompt. - Fix a stale comment in tool-search.ts. Adds regression tests for auto-compaction restore, MCP reconnect re-announcement, and mid-history reminder preservation.
Review response — e40aef7Thanks for the thorough multi-pass review. Addressed in e40aef7; all inline threads replied to and resolved individually. Summary of the substantive items: [Critical] Startup context lost after auto-compaction (from the review body)
Confirmed and fixed. I traced both triggers end-to-end:
The fix adds PR description note: the body currently says reminders are "rebuilt at explicit lifecycle boundaries like Other actionable items (all fixed in e40aef7)
Deferred (replied inline with rationale)
|
wenshao
left a comment
There was a problem hiding this comment.
Reviewer Test Plan
-
Run CLI tests and verify the 3 startup-context-related failures are resolved:
cd packages/cli && npx vitest run src/acp-integration/session/Session.test.ts src/ui/utils/historyMapping.test.ts
These fail because
getStartupContextLengthnow requires both<system-reminder>and</system-reminder>, but the teststartupEntry()helper only includes the opening tag. -
Verify
restoreStartupContextAfterCompactionsurvives filesystem failure by deleting the working directory mid-session while auto-compaction triggers. -
Verify a saved session from the pre-PR build is correctly loaded after upgrade — subagent forwarding should not leak old env context as conversation history.
-
(Existing code, not in this diff)
isSystemReminderTagIgnorableinxml.ts:51misses Unicode format control characters likeU+061C(Arabic Letter Mark). A malicious MCP server could bypass<system-reminder>tag escaping by inserting invisible Unicode characters into tag names. Consider aligning to the full UnicodeDefault_Ignorable_Code_Pointlist or applying NFKC normalization.
The startup-context detector now requires both the opening and closing <system-reminder> tags, but the rewind tests still seeded open-tag-only prelude entries, so getStartupContextLength returned 0 and the rewind truncation index was off by one. Export SYSTEM_REMINDER_CLOSE and seed realistic closed reminders in the affected tests. Also rewrite the two polynomial-ReDoS regexes in xml.ts flagged by CodeQL: the tag-candidate scanner now excludes '<' (canonical fix that also stops a stray '<' from swallowing a following real closing tag), and the system-reminder tag matcher is reimplemented as a linear, backtracking-free parser with identical behavior.
…scaping - Wrap restoreStartupContextAfterCompaction in try/catch so a filesystem failure during auto-compaction can't crash the event stream. - Detect the legacy pre-reminder startup pair in getStartupContextLength so resumed old sessions still strip for subagents and index correctly for rewind. - Extend isSystemReminderTagIgnorable to cover U+061C and the other realistically abusable Default_Ignorable code points so MCP content can't smuggle invisible characters past system-reminder tag escaping.
Review response — 2c0261aAddressing the latest Test-plan item 1 — "3 startup-context test failures, That review was performed on static code from an earlier state ("Review performed on static code only"); the failures were already repaired in Test-plan item 4 — Items 2 and 3 of the test plan describe manual scenarios already covered defensively by the try/catch and legacy-format detection added in this commit (see the resolved inline threads). |
…-startup-context # Conflicts: # packages/cli/src/acp-integration/session/Session.test.ts
PR #4053 Verification ReportBranch: Build & Type Check
Unit Tests
Architecture & Code ReviewProblem solved: Startup context (workspace env, deferred-tool metadata, MCP server instructions) previously lived in Key architectural changes:
Security observations:
Legacy compatibility: VerdictReady to merge. The architectural change is well-motivated (prompt-cache stability), thoroughly tested (578 tests across 14 files), and the security hardening around system-reminder tag escaping is robust. Legacy session detection ensures smooth migration. — wenshao |
…-startup-context # Conflicts: # packages/core/src/core/client.test.ts # packages/core/src/core/client.ts # packages/core/src/core/geminiChat.test.ts # packages/core/src/core/geminiChat.ts # packages/core/src/core/prompts.ts
…e prompt Per-turn reminders (plan mode, subagent list, recalled memory) are prepended as an extra part to the SAME user Content as the prompt, so a failed turn is recorded as [<system-reminder>…, prompt]. stripOrphanedUserEntriesFromHistory matched only parts[0], treating that as a structural reminder entry and preserving it — the stale prompt then leaked into the next turn via appendCuratedContent. Add isSystemReminderContent (a pure entry = every part is a <system-reminder>) and break only on those. A genuine turn with a prepended reminder still has a non-reminder prompt part, so it is correctly popped. The helper is shared with the CLI rewind-truncation fix.
isUserTextContent (historyMapping and Session) counted mid-history user-role <system-reminder> entries — the MCP added-tool reminders injected by drainPendingAddedMcpToolsReminder — as real user prompts. computeApiTruncation- Index then landed a turn early and silently dropped a turn's context on rewind. Skip pure system-reminder entries via isSystemReminderContent. Checking every part keeps a real plan-mode turn (prompt + prepended reminder) counted as a prompt.
refreshStartupContextReminder guarded on getStartupContextLength (which returns
2 for the legacy [user(env), model("Got it…")] prelude) but then sliced a
hardcoded 1, leaving the orphaned model ack behind on restored pre-PR sessions.
Slice by the detected length instead.
|
@pomelo-nwu thanks — both are fair questions. Answering each: Cache validation across providers (your question)The cache goal here is prefix stability, which is provider-agnostic at the principle level: keep the cached prefix (system prompt + tool declarations) byte-stable for the session so nothing invalidates it mid-stream. Concretely, MCP tools are deferred ( On measurement: the cross-provider work in this PR was converter-level (confirming Gemini / Anthropic / OpenAI all accept the multi-part user-role reminder payload), not a cache-hit-rate delta. We don't yet emit per-provider cache-hit telemetry, so I'm not claiming a measured latency/token number — I'll open a follow-up to log cache-hit headers on one provider and quantify it. Weaker models and user-role
|
Does this PR mitigate #4218 (MCP
|
| Model | Deferral | qwen v0.17.0 | PR #4053 |
|---|---|---|---|
deepseek-v4-flash |
off (inline) | 2/3 | 2/3 |
deepseek-v4-pro |
off (inline) | 3/3 | 3/3 |
qwen3.7-max |
on | 0/3 | 2/3 |
qwen3.7-plus |
on | 0/3 | 0/3 |
qwen3.6-max-preview |
on | 0/3 | 3/3 |
glm-5.1 |
on | 0/3 | 3/3 |
kimi-k2.6 |
on | 0/3 | 1/3 |
Table 2 — built-ins masked (--exclude-tools list_directory,read_file,write_file,edit,glob,grep_search,run_shell_command):
| Model | Deferral | qwen v0.17.0 | PR #4053 |
|---|---|---|---|
deepseek-v4-flash |
off (inline) | 3/3 | 3/3 |
deepseek-v4-pro |
off (inline) | 3/3 | 3/3 |
qwen3.7-max |
on | 3/3 | 3/3 |
qwen3.7-plus |
on | 3/3 | 3/3 |
Reading:
- Table 1: with built-ins present, the deferral-on models ignore the MCP server on stock v0.17.0 (0/3). This PR's effect is model-dependent — strong for
qwen3.6-max-previewandglm-5.1(0 → 3/3), partial forqwen3.7-max(0 → 2/3) andkimi-k2.6(0 → 1/3), none forqwen3.7-plus(0/3). - Table 2: masking the built-in shadows flips every model to 3/3, on both engines. So the model can reliably reach a deferred MCP tool via
tool_search; [Bug Report] MCP Server "filesystem" shows connected on UI, but tools are not available to the model #4218 is shadowing, not a deferral-reachability bug. (Masking re-run for the four primary models.) - Deepseek matches the
/deepseek-(v3|v4|chat)/gate → deferral disabled → MCP tools inline → it picks MCP vs built-in freely regardless of this PR.
In short, #4053 does what it sets out to do — keep the prompt prefix cache-stable — and as a welcome side effect it nudges some models toward the deferred MCP tools. What it cannot do is guarantee that a user-configured MCP tool wins out over a built-in that does the same job, and that guarantee is exactly what #4218 needs. A dependable fix has to act on the deferral itself: either disable tool-search deferral so MCP tools are sent inline ({ "tools": { "toolSearch": { "enabled": false } } }), or stop deferring user-configured MCP tools altogether by marking them alwaysLoad.
A note on the numbers: each ratio is only three samples and a model's tool choice is probabilistic, so they are best read as directional rather than exact. Every run was on the macOS CLI, whereas the original report comes from the Windows Desktop client on a Qwen model — a build I was not able to exercise directly.
| const ackText = secondEntry?.parts?.[0]?.text; | ||
| if (secondEntry?.role === 'model' && ackText === STARTUP_CONTEXT_MODEL_ACK) { | ||
| return history.slice(2); | ||
| export function getStartupContextLength(history: Content[]): number { |
There was a problem hiding this comment.
[Suggestion] getStartupContextLength and isSystemReminderContent (below) are core architectural helpers — they determine prelude detection, rewind indexing (computeApiTruncationIndex), structural-entry classification (stripOrphanedUserEntriesFromHistory), and subagent history stripping (stripStartupContext). Both are consumed from client.ts, geminiChat.ts, historyMapping.ts, and Session.ts, but neither has direct unit tests in environmentContext.test.ts. They are exercised only indirectly via mocked Config/ToolRegistry in client.test.ts (which duplicates the production logic) and integration tests in geminiChat.test.ts.
A regression in either function would silently corrupt multiple downstream paths — and the indirect tests would not pinpoint the cause. The doc comment at lines 298-314 explicitly calls out the parts.every(...) vs parts[0] distinction as load-bearing, but no test asserts that distinction.
Consider adding direct describe blocks covering the edge cases:
getStartupContextLength: new-format prelude (→ 1), legacy[user(env), model("Got it. Thanks for the context!")]pair (→ 2), empty/non-prelude history (→ 0), user message starting with literal<system-reminder>text but no closing tag (→ 0), multi-partContentwhereparts[0]is a reminder butparts[1]is not.isSystemReminderContent: pure single-part reminder (true), multi-part all-reminder (true), mixed[reminder, prompt]plan-mode case (false), empty parts (false), user message quoting the literal<system-reminder>text without closing tag (false).
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Reasonable — direct unit tests would localize regressions better than the current indirect coverage. Deferring it out of this PR as non-blocking test-hardening, though: both helpers are already exercised — indirectly via client.test.ts/geminiChat.test.ts, plus the regression tests added in earlier rounds for the rewind-truncation, plan-mode, and mid-history MCP-reminder cases — so there is no behavioral gap here, just a coverage-granularity improvement. At this review round I would rather let the PR converge than keep expanding it for non-blocking suggestions. Direct describe blocks for both helpers will go into a follow-up.
| * Mirrors `getStartupContextLength`'s open+close requirement so a user message | ||
| * that merely quotes the literal tag text isn't misclassified. | ||
| */ | ||
| export function isSystemReminderContent(content: Content): boolean { |
There was a problem hiding this comment.
[Suggestion] Same test-coverage gap as getStartupContextLength above. The parts.every(...) requirement is the load-bearing invariant that distinguishes a pure startup prelude entry from a per-turn reminder prepended to a real user prompt (the plan-mode / subagent case documented in the comment at lines 298-314). No direct test asserts that parts[0] being a reminder while parts[1] is a real prompt returns false.
A regression here would cause stripOrphanedUserEntriesFromHistory to either preserve orphaned failed prompts (leaking them into the next turn via appendCuratedContent) or pop structural MCP reminders (losing announcements permanently), and would cause isUserTextContent in historyMapping.ts / Session.ts to miscount rewind boundaries.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Same call as the getStartupContextLength thread above. Agreed the parts.every(...) invariant deserves a direct assertion, but it is already covered indirectly and by the plan-mode regression tests, so there is no current gap — deferring it to the same follow-up as non-blocking test-hardening rather than expanding the PR at this round.
| * prelude and no-ops when absent), this prepends when absent. No-ops if a | ||
| * prelude is already present so it can't double-prepend. | ||
| */ | ||
| async restoreStartupContextAfterCompaction(): Promise<void> { |
There was a problem hiding this comment.
[Suggestion] restoreStartupContextAfterCompaction has no unit-test coverage at all (grep restoreStartupContextAfterCompaction in client.test.ts returns zero matches). The no-op guard below — if (getStartupContextLength(currentHistory) !== 0) return; — prevents double-prepending when a prelude is already present after compaction, but is untested. The happy path (prepending a fresh prelude) is also untested.
This is a non-trivial method that mutates chat history on the auto-compaction path. A regression here would silently lose workspace folder structure, deferred-tool metadata, and MCP server instructions for the rest of the session — with no user-visible error and no structured telemetry (the catch at line ~1838 only emits debugLogger.warn).
Consider adding two tests triggered via the ChatCompressed event path:
getHistory()returns a history with no prelude → assertssetHistoryprepends the rebuilt prelude.getHistory()returns a history that already starts with a<system-reminder>entry → assertssetHistoryis NOT called (or called without a prepended entry).
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fair — restoreStartupContextAfterCompaction has no direct unit test; the compaction-restore path is exercised at the integration level but the no-op guard and happy path are not pinned at the unit level. Deferring to the follow-up as non-blocking test-hardening at this round.
| return; | ||
| } | ||
|
|
||
| const currentHistory = this.getChat().getHistory(); |
There was a problem hiding this comment.
[Suggestion] Both refreshStartupContextReminder (this line) and restoreStartupContextAfterCompaction (line 679) call this.getChat().getHistory(), which internally runs structuredClone(history) — O(n·m) for n entries with m parts each. Both methods only inspect the first 1-2 entries via getStartupContextLength and then pass the remainder to setHistory without mutating individual Content objects. GeminiChat already exposes getHistoryShallow() (O(n), copyContentContainer per entry) which is used on similar read-then-replace hot paths elsewhere (lines 343, 1152, 1216, 1497, 2257).
For long conversations with hundreds of entries containing large tool-call / function-response payloads, the unnecessary deep clone adds measurable latency and memory pressure on the ToolSearch-reveal path and the auto-compaction recovery path.
| const currentHistory = this.getChat().getHistory(); | |
| const currentHistory = this.getChat().getHistoryShallow(); |
(and the same change at line 679 in restoreStartupContextAfterCompaction).
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Good spot — both methods only read the prelude then replace history wholesale, so getHistoryShallow() is safe here and avoids the structuredClone. It is a non-correctness optimization, though, so I would rather not land a code change this late in review; I will fold it into the same follow-up as the test-hardening.
| } | ||
| } | ||
|
|
||
| private drainPendingAddedMcpToolsReminder(): void { |
There was a problem hiding this comment.
[Suggestion] drainPendingAddedMcpToolsReminder has no test coverage at all (grep drainPending in client.test.ts returns zero matches). The call site at line 1678 is gated on two non-trivial conditions that are untested:
messageType === SendMessageType.Cron(in addition toUserQuery) triggers the drain — no test verifies cron-fired prompts actually drain.!hasPendingToolCallsuppresses the drain when the last model entry contains afunctionCall— no test verifies this guard. If it were accidentally removed, the MCP reminder would be injected between a model'sfunctionCalland the corresponding tool response, violating provider role alternation.
The drain itself also mutates raw history by injecting a role: 'user' entry mid-history (producing [..., model, user_reminder, user_prompt, ...]) which extractCuratedHistory(curated=true) must then coalesce via appendCuratedContent. That mid-history coalescing path is also untested (separate finding).
Consider adding tests that:
- Call
runTurn(SendMessageType.Cron)after queuing MCP tools → assertsbuildAddedMcpToolsReminderwas called andaddHistoryreceived the reminder. - Queue an MCP tool, mock a
functionCallresponse → asserts the reminder is NOT drained; then aUserQueryturn → asserts it IS drained.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Agreed the Cron-drain path and the !hasPendingToolCall guard are not pinned by a direct test (the drain itself is covered by the "queues and drains" and "does not drain on tool-result" tests). The guard is in place and working, so a test here would protect against future refactors rather than close a current gap — deferring it to the follow-up as non-blocking test-hardening at this round.
|
Thanks for this PR @tanzhenxin — it's a well-motivated architectural change with a clear narrative. Template: the PR body doesn't follow the template headings exactly (uses "Summary" / "Rationale" / "Implementation" instead of "What this PR does" / "Why it's needed" / "Reviewer Test Plan"), but the content is thorough — the reviewer focus areas are well-scoped and the validation section is detailed. Not blocking, but worth aligning with the template on future PRs. On direction: the motivation is solid and well-timed. Keeping the system prompt byte-stable for the session lifetime so provider prefix caching isn't invalidated by progressive MCP discovery is a real win. The PR description traces the accretion path across three prior PRs (#3589 → #4022 → #4166) honestly — the dynamic-catalogue-in-static-prompt design was fine until MCP discovery became async, at which point it had to move. Claude Code's CHANGELOG shows similar investment in prompt-cache stability (cache TTL fixes, sub-agent cache miss fixes, resume cache miss with deferred tools), confirming this is a first-class concern. On approach: the scope is large (1736 additions, 34 files) but proportionate — every lifecycle path that builds or trims history (rewind, compression, retry cleanup, subagents, progressive MCP) must be updated when the shape of startup entries changes, and the test coverage tracks each one. A few observations:
One question before diving into code: have you tested how weaker models handle user-role Moving on to code review. 🔍 中文说明感谢 @tanzhenxin 的贡献——方向明确、叙述清晰的架构改动。 模板: PR body 没有严格遵循模板标题(用了 Summary/Rationale/Implementation 而非 What this PR does/Why it's needed/Reviewer Test Plan),但内容充实——reviewer focus 范围合理,validation 部分详细。不 block,但建议后续 PR 对齐模板。 方向: 动机扎实。保持 system prompt 在整个 session 生命周期内字节级稳定,避免 progressive MCP 发现导致 provider 前缀缓存失效,这是真实的性能收益。PR 描述诚实追溯了三步演进路径(#3589 → #4022 → #4166)——在 MCP 发现变成异步之前,动态目录放在静态 prompt 里没问题,之后就必须搬了。Claude Code 的 CHANGELOG 也显示类似投入(缓存 TTL 修复、sub-agent 缓存 miss 修复、带 deferred tools 的 resume 缓存 miss),确认这是行业一等关注点。 方案: 范围大(1736 行新增,34 文件)但成比例——启动条目的形状变了,每个构建或裁剪 history 的生命周期路径(rewind、compression、retry cleanup、subagent、progressive MCP)都必须同步更新,测试覆盖了每条路径。几点观察:
一个想在 code review 前提出的问题:弱模型对 user-role 进入代码审查。🔍 — Qwen Code · qwen3.7-max |
Code ReviewIndependent proposal vs PR approach: my baseline design matched the PR closely — single user-role No critical blockers found. The implementation is correct across every lifecycle path I traced:
Security hardening is solid: the Minor notes (not blockers):
Real-Scenario TestingUnit tests: 668 tests passed across all affected files:
Build + typecheck: passed. Before (installed build)After (this PR — non-interactive)After (this PR — interactive, startup + /clear + post-clear)The model correctly received startup context in both sessions (verified by the model's thinking output listing MCP tools, workspace dir, OS, agents). Post- 中文说明代码审查独立方案 vs PR 方案: 我的基线设计与 PR 高度一致——单个 user-role 未发现 critical blocker。 追溯的每条生命周期路径实现正确:
安全加固扎实: regex 改动防止二次扫描,手动解析器替代 regex 是线性无回溯的。对抗性空白测试(50k tabs + 50k 小问题(非 blocker):
真实场景测试单元测试: 668 项测试全部通过。Build + typecheck: 通过。 Before/after 非交互测试均正常返回。交互式测试:启动正常接收系统上下文, — Qwen Code · qwen3.7-max |
ReflectionStepping back, this PR does what good infrastructure work should: it makes a conceptual distinction (static vs dynamic session context) concrete in the code, and every downstream consequence follows naturally from that split. The motivation is real — prompt-cache stability directly affects response latency and cost for users with MCP servers. The three-PR accretion path (#3589 → #4022 → #4166) that created the problem is honestly documented, and the fix is the minimal structural change needed: move dynamic content out of the cached prefix into appended history where it extends rather than invalidates. My independent proposal covered the core move (single reminder entry, builder functions, lifecycle hooks, legacy detection). The PR exceeds it with the progressive MCP announcement mechanism — a clean queue-and-drain pattern that only injects on UserQuery/Cron turns, avoiding tool-call loop disruption. This is genuinely new value that wasn't in my baseline, and it solves a real UX gap (MCP tools arriving mid-session without requiring restart). The security hardening is another area where the PR goes beyond what the core refactor demanded. The ReDoS-free tag parser, the expanded zero-width character coverage, and the envelope breakout prevention are all real defenses against adversarial MCP servers. These aren't speculative — the After seeing it run: startup context is received correctly (the model's thinking explicitly lists MCP tools, workspace dir, OS), The scope is large (1736 additions, 34 files) but proportionate. When the shape of history entries changes, every lifecycle path that builds or trims history must be updated — that's not over-engineering, it's thoroughness. The test coverage tracks each path individually. One thing I'd flag for future consideration (not a blocker): the three UI files with unrelated Prettier reformatting add noise. A separate formatting commit or excluding them from this PR would have kept the diff focused. Minor. Verdict: approve. This is well-motivated, correctly implemented, thoroughly tested infrastructure work. The code is straightforward — the complexity lives in the problem domain, not in unnecessary abstraction. If I had to maintain this in six months, the builder functions, detection helpers, and lifecycle hooks would each tell me exactly what they do and why. 中文说明反思退一步看,这个 PR 做了好的基础设施工作应该做的事:把一个概念区分(静态 vs 动态 session 上下文)落实到代码中,所有下游后果自然跟随这个拆分。 动机真实——prompt-cache 稳定性直接影响使用 MCP server 的用户的响应延迟和成本。三步演进路径(#3589 → #4022 → #4166)诚实记录,修复方案是所需的最小结构变更:把动态内容从缓存前缀移到追加 history 中,在那里它扩展而非失效缓存。 我的独立方案覆盖了核心迁移(单 reminder 条目、builder 函数、生命周期钩子、legacy 检测)。PR 超出基线的地方是 progressive MCP 通知机制——一个干净的 queue-and-drain 模式,只在 UserQuery/Cron turn 注入,避免打断 tool-call 循环。这是真正的新价值,解决了 MCP 工具 session 中途到达时不需重启的真实 UX 缺口。 安全加固是另一个超出核心重构需求的领域。无 ReDoS 的标签解析器、扩展零宽字符覆盖、envelope breakout 防御——这些都是对抗性 MCP server 的真实防御。 实测确认:启动上下文正确接收(模型思维链明确列出 MCP tools、工作目录、OS), 范围大(1736 行新增,34 文件)但成比例。history 条目形状变了,每个构建或裁剪 history 的生命周期路径都必须更新——这不是过度工程,是彻底性。 判定:批准。 动机充分、实现正确、测试彻底。代码直观——复杂度在问题域中,不在不必要的抽象中。如果六个月后我来维护它,builder 函数、检测辅助函数和生命周期钩子各自都会告诉我它们做什么和为什么。 — Qwen Code · qwen3.7-max |
pomelo-nwu
left a comment
There was a problem hiding this comment.
LGTM, looks ready to ship. ✅
Independent verification report — runtime, LinuxVerified at PR head Gates
Reviewer-Focus claims — all confirmed at runtime
Progressive MCP (Phase 2A) — full lifecycleA stdio MCP server (
Security — reminder envelope can't be broken out of (untrusted MCP content)An MCP tool whose description was The injected closing tag is escaped ( Captured evidence (baseline first request)First user message, coalesced (system prompt verified clean separately): roles: Notes for reviewers (not blockers)
Conclusion: every Reviewer-Focus item reproduces on Linux against the real built CLI, the build/bundle/typecheck gates and PR-touched unit suites are green, and no regression or envelope-breakout was observed. No blocking issues from this verification. |
isSystemReminderContent and getStartupContextLength detected the closing tag with .includes(), so an IDE-mode user turn — where the editor-context reminder is merged into the prompt's first text part — was misclassified as a pure structural reminder. That corrupted orphan-stripping (a failed turn was preserved instead of popped, risking a consecutive-user 400 on retry) and rewind indexing in both the CLI and ACP paths. Require the close tag at the end of the text, not merely present. Adds direct unit tests for both helpers, including the IDE-merged case.
…-startup-context Resolve conflicts after merging origin/main: 1. packages/core/src/core/prompts.test.ts — both sides deleted a function the other still tested. main (#4587) removed getSubagentSystemReminder (function + injection sites + tests) to stop excessive subagent spawning; this branch removed buildDeferredToolsSection when moving startup context into system reminders. Neither function exists in the merged prompts.ts, so both obsolete describe blocks are dropped. 2. packages/cli/src/acp-integration/session/Session.test.ts — main switched rewindToTurn from getHistory() to getHistoryShallow() and updated its rewind tests to mock both. This branch's new MCP-reminder rewind test predated that change and mocked only getHistory(); add the getHistoryShallow() mock so it exercises the post-merge code path. Verified: npm install + build + typecheck pass; prompts.test.ts (56) and Session.test.ts (79) green.
wenshao
left a comment
There was a problem hiding this comment.
[Critical] npm run typecheck fails: packages/cli/src/ui/components/InputPrompt.test.tsx:205 — TS2345: mockedUseBackgroundTaskViewState mock is missing two required properties on BackgroundTaskViewState: livePanelFocused and livePanelSelectedIndex. These were added to the interface in PR #4476, which this branch merged in via d36efa615.
mockedUseBackgroundTaskViewState.mockReturnValue({
entries: [],
selectedIndex: 0,
dialogMode: 'closed',
dialogOpen: false,
pillFocused: false,
livePanelFocused: false,
livePanelSelectedIndex: 0,
});
This blocks the typecheck gate; the PR cannot merge green without it. (Posted as a body-level comment because the error is on a line not touched by this PR's diff.)
— qwen3.7-max via Qwen Code /review
|
Re: the [Critical] typecheck review (review):
Checked on current HEAD (
The only typecheck failure reproducible locally is unrelated build-ordering noise (TS2305 "no exported member", core Completing the mock to match the interface is harmless hygiene and I'm happy to fold it into the same non-blocking follow-up as the other test-hardening suggestions, but it does not block this PR. |
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — qwen3.7-max via Qwen Code /review
|
Maintainer sign-off / merge rationale. Goal & architecture. This removes the progressive-MCP prompt-cache break (#4166) at the source: workspace/env context, the deferred-tool (ToolSearch) catalogue, and MCP server instructions move out of the cached system prompt into a single The one high-severity concern is resolved. Earlier review flagged that the pure- Remaining items are non-blocking:
On that basis this is good to merge. |
Summary
This PR moves startup scaffolding into user-role
<system-reminder>history and removes the synthetic startup acknowledgement turn. Workspace startup context, deferred-tool discovery metadata, and MCP server instructions now share the same reminder channel instead of being split between chat history and the provider system prompt.After the latest merge, this PR also includes the narrow Phase 2A behavior needed for progressive MCP tools: when MCP deferred tools become available after startup, the client appends an added-tool reminder instead of rewriting the system prompt.
Builds On
This change is the direct follow-on to the deferred-tool / MCP work that made it necessary:
systemInstructioninsetTools()so progressive MCP tools reach the model (made the catalogue mutate mid-session — the prompt-cache break this PR removes).Rationale
The underlying direction of this change is to separate static request context from dynamic session context. The provider-cached prefix — the system prompt and tool declarations — should stay byte-stable for the life of a session, while anything that varies per session or arrives mid-session should live in appended conversation history, where adding it extends the cached prefix instead of invalidating it.
The previous design violated that split, for a reason that accreted across three prior changes. ToolSearch (#3589) introduced the deferred-tool mechanism and placed its catalogue in
systemInstruction; deferring low-frequency built-ins (#4022) grew that catalogue. While the tool set was fixed at startup this was harmless — the system prompt was built once and never changed. Progressive MCP discovery (#4166) broke that assumption: MCP servers now finish connecting after the first request, and that fix surfaced their late tools by rebuildingsystemInstructionmid-session insidesetTools(). That rebuild invalidates the provider's prompt-prefix cache on every discovery — so the moment the catalogue became dynamic, keeping it in the cache-sensitive prefix stopped being viable. This PR is the direct consequence: the dynamic catalogue moves into the appended<system-reminder>channel, where MCP tools (deferred, hence absent from the function-declaration array) are disclosed by deltas that extend the cached prefix instead of breaking it.The previous design also split startup context across two channels. Workspace context lived in synthetic chat history, while deferred-tool metadata and MCP details lived in the system prompt. That made startup behavior harder to reason about because reset, compression, retry cleanup, subagents, and ToolSearch reveal each had to preserve or rebuild different pieces of context in different places.
The synthetic startup acknowledgement was another source of complexity. It inserted assistant text that the model never actually produced, so history-management code had to distinguish real conversation turns from startup scaffolding.
Using user-role
<system-reminder>entries gives us one lifecycle-aware channel for this class of context. The client can seed reminders at session start, rebuild them at explicit lifecycle boundaries like/clearand compression, and append late MCP added-tool reminders without changing the stable system prompt.Implementation
Startup reminders are now ordered user-role history entries. The provider-facing history still preserves role alternation by coalescing adjacent user entries where needed, while raw history keeps the startup prelude separately manageable for rewind, retry cleanup, compression, and subagent handling.
Deferred-tool reminders now live in reminder history and refresh after ToolSearch reveal. Tools already revealed through ToolSearch are no longer advertised as deferred in later reminder content.
MCP server initialize instructions are captured and included in reminder composition. Newly available MCP deferred tools are queued as append-only added-tool reminders and drained at the start of the next user or cron turn. They are not drained on tool-result turns, so tool-call loops keep their expected shape.
Subagents suppress the ToolSearch catalogue in their startup reminder because they receive their allowed tool schemas directly.
Reviewer Focus
Please check that the system prompt remains stable and no longer carries workspace startup context, deferred-tool summaries, MCP server instructions, or late MCP tool deltas.
Please check that startup reminders preserve ordering and provider role alternation, while raw history still keeps the startup prelude separately manageable.
Please check that
model.skipStartupContextskips only workspace context, not deferred-tool or MCP reminders.Please check the progressive MCP path end to end: late MCP reminder, ToolSearch reveal, then MCP tool invocation.
Validation
Commands run:
Prompts and flows used:
Mock OpenAI-compatible headless and interactive E2E flows used the local built CLI. The E2E plan covered ToolSearch reveal,
/clear,/compress, subagent startup, progressive MCP added-tool reminders, and a full progressive MCP reveal-and-call loop.Expected result:
Startup, deferred-tool, and MCP instruction context should be sent through ordered user-role
<system-reminder>content instead of system prompt mutation. Deferred tools should be advertised only until ToolSearch reveals them. Late MCP tools should be announced through append-only reminder deltas without changing the system prompt./clear, compression, retry cleanup, and subagents should rebuild or preserve the appropriate startup reminder shape.Observed result:
Focused core tests for client startup reminders and environment-context reminders passed:
155 passed. Build, typecheck, and bundle completed successfully. E2E groups A-H passed against the local built CLI. The separate E2E report comment summarizes reviewer-readable assertions instead of local log paths: #4053 (comment)Quickest reviewer verification path:
Scope / Risk
This changes the provider-facing startup request shape. The implementation keeps raw history and curated/provider history responsibilities separate so startup reminders remain manageable while providers still receive valid role alternation.
Moving dynamic tool metadata out of the system prompt protects prompt-cache stability, but it touches several lifecycle paths that build or trim history.
This is the first step of the static/dynamic split;
userMemorystill ridessystemInstructionand is a candidate to move into the reminder channel in future work.Removed-tool reminder deltas are intentionally not implemented in this PR. Reveal-driven reminder deltas are intentionally not implemented in this PR. Full end-to-end validation was run on macOS with mock providers, not on Windows/Linux terminals.
No user-facing migration is expected. The synthetic startup acknowledgement text is removed from chat history.
Testing Matrix
macOS was covered with focused
npx vitestruns and rootnpm run build && npm run typecheck && npm run bundle.Windows and Linux were not tested locally. Container sandbox paths are not affected by this change.
Linked Issues / Bugs
Related design: system reminder channel for startup context.
Builds on #3589, #4022, #4166 (see the Builds On section above for how each contributed to the necessity).