feat(core): persist oversized tool results to disk (#4095 Phase 4)#5042
Conversation
Large tool outputs (>28K chars) are now saved to disk as tool-results/<callId>.txt and replaced with a 2KB preview stub in the LLM context, preventing OOM and context pollution. Key mechanics: - Triple-skip gate: read_file exempt → already-truncated skip → threshold+3K headroom - Budget: 50MB per-file cap, 500MB per-session cumulative (Buffer.byteLength) - Security: atomicWriteFile with mode 0o600, noFollow, forceMode; path.basename sanitization - Cleanup: 24h expiry on startup and /clear - Error branch: large stderr also persisted - Fallback: budget exhausted → preview-only stub; write failure → legacy truncateAndSaveToFile
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
Hi @doudouOUC, thanks for the PR! The implementation looks interesting, but the PR body doesn't follow our PR template. Could you update it to match? Key sections missing:
- Reviewer Test Plan — the "How to verify" steps, "Evidence (Before & After)" with tmux output, and "Tested on" table. This is the most important one — without it, maintainers can't quickly verify the change locally. Your "Test plan" section is a good start, but it needs to be structured as reviewer-facing verification steps rather than a checkbox list.
- Risk & Scope — main risk/tradeoff, what's out of scope, any breaking changes. For a feature that writes to disk and introduces budget controls, this context matters.
- Linked Issues — the body references #4095 Phase 4, but there's no
Closes #Nor explicit link in the standard section. - 中文说明 — the bilingual summary block.
Your "Summary" and "Design" sections are solid — just need to restructure into the template format. Thanks! 🙏
中文说明
感谢贡献!PR 正文没有按照 PR 模板 的格式来写,缺少以下关键部分:
- Reviewer Test Plan — 需要包含 "How to verify" 验证步骤、"Evidence (Before & After)" tmux 输出截图、以及 "Tested on" 测试平台表格。这是最重要的部分——没有它,维护者无法快速在本地验证你的改动。
- Risk & Scope — 主要风险/权衡、不在范围内的内容、是否有破坏性变更。对于一个涉及磁盘写入和预算控制的功能,这些信息很重要。
- Linked Issues — 正文提到了 #4095 Phase 4,但缺少标准的
Closes #N链接。 - 中文说明 — 双语摘要块。
"Summary" 和 "Design" 部分写得很清楚,只需要按模板格式重新组织即可。谢谢!
— Qwen Code · qwen3.7-max
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
The \x00 regex is intentional security hardening to strip null bytes from callIds before using them as filenames.
|
Thanks — PR body has been restructured to match the template. Lint fix pushed in d486448. |
wenshao
left a comment
There was a problem hiding this comment.
[Critical] 3 existing tests fail (coreToolScheduler.test.ts:860, 987, 1063) because the mock Config in createSchedulerForLegacyToolTests (line ~706) does not include the new methods introduced by this PR: getToolResultBytesWritten(), trackToolResultBytes(), and storage.getToolResultsDir(). When test outputs exceed the gate threshold (threshold + GATE_HEADROOM), maybePersistLargeToolResult() throws a TypeError, routing the tool result to the error path.
Fix: add the missing mocks to the test helper:
getToolResultBytesWritten: () => 0,
trackToolResultBytes: vi.fn(),
// on storage mock:
getToolResultsDir: () => '/tmp/tool-results',— qwen3.7-max via Qwen Code /review
🧪 Local runtime verification (built CLI in tmux, real model + real tools) — ✅ PASS (gate works end-to-end; 2 findings worth a look before merge)Built the PR head ( Steps (all against the running app)
Findings
Verdict (merge reference)PASS. The feature does what it claims at the surface: oversized results from budget-less tools spill to a 🇨🇳 中文版(点击展开)🧪 本地运行时验证(tmux 中运行构建的 CLI,真实模型 + 真实工具)— ✅ 通过(门控端到端可用;合并前有 2 个值得一看的发现)构建 PR head( 步骤(全部针对运行中的程序)
发现
结论(合并参考)**通过。**功能在表面达成所宣称的:无预算工具的超大结果溢出到 |
- Fix mock Config in coreToolScheduler.test.ts: add getToolResultBytesWritten, trackToolResultBytes, and storage.getToolResultsDir to all 4 mock instances - isAlreadyTruncated: change includes to startsWith for <persisted-output> to avoid false positives from tool output containing the literal string - Remove dead code: recalcContentLength function and its call site (contentLength is unconditionally overwritten downstream) - buildStub: non-persisted stubs no longer use <persisted-output> tag to avoid misleading model into wasted read_file calls - Add GATE_HEADROOM rationale comment
Review round 1 — response summary
Test plan Finding 1 (sentinel bypass): fixed via |
…ance The new <persisted-output> model guidance added to prompts.ts changed the system prompt output, requiring snapshot updates.
…5042) * feat(core): persist oversized tool results to disk (#4095 Phase 4) Large tool outputs (>28K chars) are now saved to disk as tool-results/<callId>.txt and replaced with a 2KB preview stub in the LLM context, preventing OOM and context pollution. Key mechanics: - Triple-skip gate: read_file exempt → already-truncated skip → threshold+3K headroom - Budget: 50MB per-file cap, 500MB per-session cumulative (Buffer.byteLength) - Security: atomicWriteFile with mode 0o600, noFollow, forceMode; path.basename sanitization - Cleanup: 24h expiry on startup and /clear - Error branch: large stderr also persisted - Fallback: budget exhausted → preview-only stub; write failure → legacy truncateAndSaveToFile * fix(core): suppress no-control-regex lint for null-byte sanitization The \x00 regex is intentional security hardening to strip null bytes from callIds before using them as filenames. * fix(core): address wenshao review round 1 - Fix mock Config in coreToolScheduler.test.ts: add getToolResultBytesWritten, trackToolResultBytes, and storage.getToolResultsDir to all 4 mock instances - isAlreadyTruncated: change includes to startsWith for <persisted-output> to avoid false positives from tool output containing the literal string - Remove dead code: recalcContentLength function and its call site (contentLength is unconditionally overwritten downstream) - buildStub: non-persisted stubs no longer use <persisted-output> tag to avoid misleading model into wasted read_file calls - Add GATE_HEADROOM rationale comment * fix(core): update prompts.test.ts snapshots for persisted-output guidance The new <persisted-output> model guidance added to prompts.ts changed the system prompt output, requiring snapshot updates.
What this PR does
Persists oversized tool results (>28K chars) to disk as
tool-results/<callId>.txt, replacing the full content in the LLM context with a<persisted-output>stub containing a 2KB preview and a file pointer. The model can recover the full output viaread_file. Also adds 24-hour file cleanup on startup and/clear, and a per-session disk budget (500MB cumulative, 50MB per-file).Why it's needed
Large tool outputs (e.g.
find /,caton a big file, MCP tools returning massive JSON) can cause OOM and pollute the LLM context window, degrading response quality. The existingtruncateAndSaveToFileonly covers Shell/MCP tools and uses random filenames with no cleanup. This PR adds a universal gate that catches all tool types, with organized storage, budget controls, and automatic cleanup.Reviewer Test Plan
How to verify
find / -name "*.ts" 2>/dev/null | head -50000<persisted-output>stub showing file path + 2KB previewls -laconfirming0o600permissionsread_fileon the persisted file to confirm full content is recoverable/clearand check debug logs for cleanup activityecho hello) and confirm output passes through unchangedEvidence (Before & After)
N/A — non-UI change. Core truncation infrastructure, no visual impact.
Tested on
Environment (optional)
Local dev:
npm run devon macOS. TypeScript compilation verified. Unit tests deferred (worktree environment lacks full node_modules).Risk & Scope
ToolResultPersistedEventtelemetry is defined but not wired to emit (deferred to follow-up PR). Unit tests for the new functions are not included in this PR.Linked Issues
Partial fix for #4095 (Phase 4: Tool Result Disk Overflow)
中文说明
这个 PR 做了什么
将超大工具结果(>28K 字符)持久化到磁盘的
tool-results/<callId>.txt文件中,在 LLM 上下文中用包含 2KB 预览和文件指针的<persisted-output>桩替换完整内容。模型可以通过read_file恢复完整输出。同时添加了启动和/clear时的 24 小时文件清理,以及每会话磁盘预算(累计 500MB,单文件 50MB)。为什么需要
大型工具输出(如
find /、对大文件执行cat、MCP 工具返回大量 JSON)可能导致 OOM 并污染 LLM 上下文窗口,降低响应质量。现有的truncateAndSaveToFile仅覆盖 Shell/MCP 工具,使用随机文件名且无清理机制。此 PR 添加了通用截断门,覆盖所有工具类型,具有有序存储、预算控制和自动清理功能。审阅者测试计划
如何验证
find / -name "*.ts" 2>/dev/null | head -50000<persisted-output>桩,显示文件路径和 2KB 预览ls -la确认权限为0o600read_file确认完整内容可恢复/clear并检查调试日志中的清理活动echo hello)确认输出不变通过证据(前后对比)
N/A — 非 UI 改动。核心截断基础设施,无视觉影响。
测试平台
风险和范围
ToolResultPersistedEvent遥测已定义但未接线发射(推迟到后续 PR)。新函数的单元测试未包含在此 PR 中。关联 Issue
部分修复 #4095(Phase 4: Tool Result Disk Overflow)
🤖 Generated with Qwen Code