feat(core): layered tool-output truncation, per-message budget, per-tool limits#4880
Conversation
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
da28e3e to
d55d31d
Compare
|
Thanks for the PR @LaZzyMan! Template looks good ✓ On direction: This solves a real and important problem — unbounded tool output blowing past the model's context window and permanently stuck sessions. It directly mirrors Claude Code's layered truncation model (per-tool budgets, global thresholds, batch budgets), which is a well-validated approach. The CHANGELOG has multiple references to truncation fixes (MCP, Read tool paging, tool error truncation markers) confirming this area is actively maintained and relevant. On approach: The three-layer design feels right — per-tool
Overall the scope feels appropriate for the problem. Moving on to code review. 🔍 中文说明感谢贡献 @LaZzyMan! 模板完整 ✓ 方向: 这解决了一个真实且重要的问题——不受控的工具输出撑爆模型上下文窗口,导致会话永久卡死。直接复刻了 Claude Code 的分层截断模型(per-tool 预算、全局阈值、批预算),这是经过验证的方案。CHANGELOG 中有多处截断相关修复(MCP、Read 分页、工具错误截断标记),确认这个领域在持续维护中。 方案: 三层设计合理——基类上的
范围与问题匹配,进入代码审查。🔍 — Qwen Code · qwen3.7-max |
Code ReviewThe implementation is solid — well-structured, well-tested, and follows the project's conventions. After reading the full diff (16 files, +1111/-39): What works well:
No critical blockers. A few observations worth noting:
Real-Scenario TestingTest 1: Shell large output (
|
|
Stepping back to look at the whole picture: This PR solves a real problem that any user of this CLI will eventually hit — a shell command dumps 100KB+, a giant JSON comes back from an MCP server, and the session gets permanently stuck re-sending an over-budget history. The fix mirrors Claude Code's well-proven layered truncation model, which is the right reference architecture. The implementation is clean and well-thought-out. The 220 unit tests pass, covering the core truncation logic, the scheduler integration, the batch budget offload, the self-managed exemption, media preservation, and edge cases like disk-write failures and token-aware fallback. Real-scenario testing confirms shell truncation works correctly with proper spill files, and read_file passes through the 3000-line case without unnecessary truncation. The observations from Stage 2 (dual shell truncation path, Part[] sentinel inconsistency, This is a well-executed feature PR. Approving. ✅ 中文说明退一步看全貌: 这个 PR 解决了每个 CLI 用户最终都会遇到的问题——shell 命令输出 100KB+、MCP 服务器返回巨型 JSON,会话永久卡死在反复重发超额历史。修复方案复刻了 Claude Code 经过验证的分层截断模型,是正确的参考架构。 实现干净且深思熟虑。基类上的 220 个单元测试通过,覆盖核心截断逻辑、scheduler 集成、批预算 offload、自管豁免、媒体保留,以及磁盘写入失败和 token 感知回退等边缘情况。真实场景测试确认 shell 截断正确工作并生成正确的落盘文件,read_file 在 3000 行场景下无多余截断。 Stage 2 的观察(shell 双重截断路径、Part[] sentinel 不一致、 这是一个执行良好的功能 PR。批准。✅ — Qwen Code · qwen3.7-max |
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
LGTM, looks ready to ship. ✅
d55d31d to
1a0f14f
Compare
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
No new issues found. LGTM! ✅ — qwen3.7-max via Qwen Code /review
1a0f14f to
a33b678
Compare
Round 7 — addressed in a33b678Fixed (2 Critical):
Declined as out of scope (5 Suggestions) — this PR bounds oversized tool output before it overflows the context window; none of these affect that behavior:
|
a33b678 to
2279a6a
Compare
|
Qwen Code review did not complete successfully: Qwen review timed out after 55 minutes. See workflow logs. |
…ool limits Bound tool output before it enters conversation history, mirroring Claude Code's three-layer model: - Single-result: a per-tool/global char threshold spills oversized output to a temp file with a recoverable preview + read_file pointer; media parts are kept, with a token-aware fallback, 0600 owner-only perms on the spilled file, and a sentinel re-entrancy guard. - Per-message: when one batch of tool results exceeds toolOutputBatchBudget (default 200k chars), the largest results are offloaded to disk; a warning is logged if it still can't get under budget. - Per-tool: DeclarativeTool.maxOutputChars/truncateKeep getters (shell 30k, grep 20k, read-file self-managed, mcp 500k, agent 32k/tail). Truncation runs in CoreToolScheduler, deferring PostToolUse/skill metadata injection until after truncation so reminders are never bisected. Per-tool budgets are char-only so the global line cap can't undercut a self-managed tool (read-file) or a declared char budget (grep). The re-entrancy guard matches the truncation prefix at the start of the output so an injected copy of the phrase can't bypass the budget. Adds the toolOutputBatchBudget setting (core + CLI schema) and supporting tests.
2279a6a to
cdfef51
Compare
Round 8 — addressed in cdfef51Fixed (2):
Documented (1):
Declined (1):
|
DragonnZhang
left a comment
There was a problem hiding this comment.
No issues found. LGTM! The layered truncation design is well-structured, the deferred metadata pattern correctly prevents system-reminder bisection, and the test coverage is thorough across keep-direction, token-aware fallback, per-tool budgets via legacy aliases, batch offload, and Part[] media preservation. CI is all green (30 checks). -- qwen3-coder-plus via Qwen Code /review
| this.config, | ||
| call.request.name, | ||
| output, | ||
| { threshold: BATCH_OFFLOAD_PREVIEW_CHARS }, |
There was a problem hiding this comment.
[Suggestion] offloadCallOutput omits lines here, falling through to config.getTruncateToolOutputLines() (default 1000). Both the shell tool and MCP tool explicitly pass lines: Number.POSITIVE_INFINITY to prevent the global line cap from undercutting their char budgets — the same contract this PR enforces in the first pass. The batch offload breaks that contract: a user-configured small line cap would silently truncate the 2000-char preview.
| { threshold: BATCH_OFFLOAD_PREVIEW_CHARS }, | |
| { threshold: BATCH_OFFLOAD_PREVIEW_CHARS, lines: Number.POSITIVE_INFINITY }, |
— qwen3.7-max via Qwen Code /review
| content: wrappedMessage, | ||
| outputFile, | ||
| }; | ||
| } catch (_error) { |
There was a problem hiding this comment.
[Suggestion] The disk-write failure path returns truncatedContent + '\n[Note: Could not save full output to file]' without the TOOL_OUTPUT_TRUNCATED_PREFIX sentinel. Every other truncation outcome carries the sentinel, which truncateLlmContent, offloadCallOutput, and the combined second pass use to skip already-bounded content. If a failure-path preview exceeds a downstream budget (e.g., BATCH_OFFLOAD_PREVIEW_CHARS), it will be re-truncated — doubly-truncated output with no spill file at either level.
| } catch (_error) { | |
| } catch (_error) { | |
| return { | |
| content: | |
| `${TOOL_OUTPUT_TRUNCATED_PREFIX} (but could not save full output to file).\n` + | |
| truncatedContent, | |
| }; | |
| } |
— qwen3.7-max via Qwen Code /review
Independent runtime verification (Linux) — code looks merge-ready; PR evidence section needs correctionsVerified by building two real esbuild bundles — BEFORE ( Verdict: the implementation works as designed in everything I could exercise — but three claims in the PR's Evidence/Breaking-changes sections do not reproduce on either side of the A/B, and should be corrected (or their provenance explained) before merge so release notes don't inherit them. What entered history (captured provider requests, A/B)
Also confirmed AFTER: the in-tool shell truncation and the scheduler pass don't double-truncate (sentinel idempotency works — no nested headers), and
|
|
@wenshao You're right on all three — the read_file evidence was stale (from an intermediate dev build before the
The two pre-existing issues (short-lived-process stdout race; shell ~60KB buffer cap making the spill non-full) reproduce identically on both A/B builds, so they are out of scope for this PR — not introduced here. |
…ool limits (#4880) Bound tool output before it enters conversation history, mirroring Claude Code's three-layer model: - Single-result: a per-tool/global char threshold spills oversized output to a temp file with a recoverable preview + read_file pointer; media parts are kept, with a token-aware fallback, 0600 owner-only perms on the spilled file, and a sentinel re-entrancy guard. - Per-message: when one batch of tool results exceeds toolOutputBatchBudget (default 200k chars), the largest results are offloaded to disk; a warning is logged if it still can't get under budget. - Per-tool: DeclarativeTool.maxOutputChars/truncateKeep getters (shell 30k, grep 20k, read-file self-managed, mcp 500k, agent 32k/tail). Truncation runs in CoreToolScheduler, deferring PostToolUse/skill metadata injection until after truncation so reminders are never bisected. Per-tool budgets are char-only so the global line cap can't undercut a self-managed tool (read-file) or a declared char budget (grep). The re-entrancy guard matches the truncation prefix at the start of the output so an injected copy of the phrase can't bypass the budget. Adds the toolOutputBatchBudget setting (core + CLI schema) and supporting tests.
What this PR does
Bounds tool-call output before it enters conversation history, mirroring Claude Code's three-layer model. (1) Single-result truncation: a per-tool or global character threshold spills oversized output to a temp file and hands the model a recoverable preview plus a
read_filepointer — media parts are preserved, a token-aware fallback ensures truncation never makes output larger, spilled files get0600perms, and a prefix-matched sentinel guards against double-truncation. (2) Per-message batch budget: when one batch of tool results collectively exceedstoolOutputBatchBudget(default 200k chars), the largest results are offloaded to disk, with a warning logged if it still can't get under budget. (3) Per-tool thresholds declared on each tool:shell30k,grep20k,read-fileself-managed,mcp500k,agent32k keeping the tail. Truncation runs centrally inCoreToolScheduler, deferring PostToolUse/skill metadata injection until after truncation so<system-reminder>envelopes are never bisected. Adds atoolOutputBatchBudgetsetting wired through both core and the CLI settings schema.Why it's needed
Large tool output — a shell command dumping 100KB+, a giant JSON, a many-line file read — currently flows verbatim into history and can blow past the model's input window; in the worst case the session gets permanently stuck re-sending an over-budget history. This centralizes and generalizes output bounding so every string-returning tool is capped before it reaches the model, while keeping the full output recoverable on disk. The per-tool thresholds let each tool declare a budget that fits its typical useful output instead of a one-size-fits-all cap.
Reviewer Test Plan
How to verify
Unit:
npm run test --workspace @qwen-code/qwen-code-core— the truncation and scheduler suites cover keep-direction, the token-aware fallback, per-tool budgets, the batch-budget offload,Part[]media preservation, deferred hook-metadata not being bisected, and the "self-managed tool exempt from the line cap" case. End-to-end (headlessnode dist/cli.js ... --openai-logging):read_fileis unchanged — itsmaxOutputChars: Infinityexemption means the scheduler never truncates it, so it keeps paginating through fileUtils as before (defaultlimit= 1000 lines with a "Showing lines 1-1000" header; an explicitlimitreturns the full content) — whilerun_shell_commandonseq 1 20000is bounded to a ~30k char-only preview and spilled to a*.outputtemp file.Evidence (Before & After)
Non-UI change; verified by an A/B of two real bundles (BEFORE =
origin/main, AFTER = this PR merged into the same main), driving headless runs against a mock provider that captures the exact tool results entering history.read_fileis unchanged across the A/B —origin/mainhas no scheduler-side truncation at all, so both builds page identically: default params return the first 1000 lines with a "Showing lines 1-1000 of 3001" header, and an explicitlimit: 3000returns the complete 13,940 chars. The behavior change is in shell:seq 1 20000goes from 6,389 chars on BEFORE (the global 1000-line cap bit first: head ~200 + tail ~800 lines) to 30,474 chars on AFTER (the 30k char-only budget: head 6,000 + tail 23,961, line cap no longer undercutting), carrying theTool output was too large and has been truncatedpointer and a0600*.outputspill file. Batch offload greedily spills the largest result to a0600preview;-1disables the budget.Tested on
Environment (optional)
npm run build && npm run bundle, then headlessnode dist/cli.jswith--openai-loggingagainst the configured provider; plus the core unit suites.Risk & Scope
CoreToolScheduleron the success path of every tool call (previously only shell and MCP truncated locally), and hook/skill metadata injection is deferred to after truncation — a moderate reordering covered by a dedicated test. Per-tool budgets are char-only so a self-managed tool (read-file) and a declared char budget (grep) are no longer undercut by the global line cap.toolOutputBatchBudgetsetting (default 200000;-1disables). Shell output truncation is now char-only (the 30k budget is no longer undercut by the global line cap) and keeps both head and tail instead of head-only. Noread_filebehavior change — itsmaxOutputChars: Infinityis a defensive exemption guaranteeing the new scheduler pass can never cap a self-managed (paginating) tool; read_file's pagination is identical to before.Linked Issues
Related to #4049 (large tool output overflowing the context window). This does not fully close it — the live context-usage indicator and compaction prompts requested there are out of scope here.
中文说明
这个 PR 做了什么
在工具调用输出进入对话历史之前对其收口,复刻 Claude Code 的三层模型。(1) 单结果截断:按 per-tool 或全局字符阈值,把超大输出落盘到临时文件,给模型一段可恢复的 preview 加
read_file指针——保留媒体 part,token 感知回退保证截断不会让输出反而更大,落盘文件用0600权限,并用 sentinel(按前缀匹配)防止重复截断。(2) per-message 批预算:一批工具结果总字符超过toolOutputBatchBudget(默认 200k)时,把最大的几个落盘;仍降不到预算下时记一条 warning。(3) 每个工具声明的 per-tool 阈值:shell30k、grep20k、read-file自管、mcp500k、agent32k 保留尾部。截断集中在CoreToolScheduler执行,并把 PostToolUse/skill metadata 的注入延后到截断之后,避免<system-reminder>被切断。新增toolOutputBatchBudget配置,core 和 CLI settings schema 都接通。为什么需要
大工具输出——dump 100KB+ 的 shell 命令、巨型 JSON、多行文件读取——目前会原样进入历史,可能撑爆模型输入窗口;最坏情况下会话会永久卡死在反复重发超额历史。本 PR 把输出收口集中化、通用化,让每个返回字符串的工具在到达模型前都被截断,同时完整输出在磁盘上可恢复。per-tool 阈值让每个工具按自身典型有用输出量声明预算,而不是一刀切。
Reviewer 测试计划
如何验证
单测:
npm run test --workspace @qwen-code/qwen-code-core——truncation 与 scheduler 套件覆盖了 keep 方向、token 感知回退、per-tool 预算、批预算 offload、Part[]媒体保留、延后的 hook metadata 不被切断、以及"自管工具豁免行数上限"这个 case。端到端(headlessnode dist/cli.js ... --openai-logging):read_file行为不变——其maxOutputChars: Infinity豁免使 scheduler 永不截断它,仍按 fileUtils 分页(默认limit=1000 行带 "Showing lines 1-1000" header;显式limit返回完整内容);而对seq 1 20000执行run_shell_command被收口成约 30k 的 char-only preview 并落盘到*.output临时文件。证据(Before & After)
非 UI 改动;通过两个真实 bundle 的 A/B 验证(BEFORE=
origin/main,AFTER=本 PR 干净合入同一 main),headless 跑 mock provider 捕获进入历史的工具结果。read_file在 A/B 两侧行为一致——origin/main根本没有 scheduler 侧截断,两个 build 分页相同:默认参数返回前 1000 行带 "Showing lines 1-1000 of 3001" header,显式limit: 3000返回完整 13,940 字符。行为变化在 shell:seq 1 20000从 BEFORE 的 6,389 字符(全局 1000 行上限先生效:head ~200 + tail ~800 行)变为 AFTER 的 30,474 字符(30k char-only 预算:head 6,000 + tail 23,961,行数上限不再 undercut),带Tool output was too large and has been truncated指针和0600*.output落盘文件。批预算 greedy 把最大结果落盘成0600preview;-1禁用。测试平台
见上方 OS 表(🍏 macOS 已测;🪟 Windows / 🐧 Linux 未测,交由 CI)。
运行环境(可选)
npm run build && npm run bundle,然后用 headlessnode dist/cli.js加--openai-logging跑配置好的 provider;外加 core 单测套件。风险与范围
CoreToolScheduler的每次工具调用成功路径上执行(之前只有 shell 和 MCP 在本地截断),且 hook/skill metadata 注入被延后到截断之后——这是一处中等规模的重排,有专门测试覆盖。per-tool 预算是 char-only,因此自管工具(read-file)和声明了字符预算的工具(grep)不再被全局行数上限 undercut。toolOutputBatchBudget配置(默认 200000,-1禁用)。shell 输出截断现在是 char-only(30k 预算不再被全局行数上限 undercut)且同时保留首尾,而非只保留 head。read_file行为无变化——其maxOutputChars: Infinity是防御性豁免,保证新 scheduler pass 永不 cap 自管(分页)工具;read_file 的分页与之前完全一致。