Skip to content

fix(cli,core): harden OOM prevention — idempotent compaction tests, explicit GC, debug log defaults#4914

Merged
wenshao merged 13 commits into
QwenLM:mainfrom
zzhenyao:fix/microcompact-oom-V2.5
Jun 14, 2026
Merged

fix(cli,core): harden OOM prevention — idempotent compaction tests, explicit GC, debug log defaults#4914
wenshao merged 13 commits into
QwenLM:mainfrom
zzhenyao:fix/microcompact-oom-V2.5

Conversation

@zzhenyao

Copy link
Copy Markdown
Contributor

What this PR does

Closes: #4815
Follow-up to #4824

1. Regression tests for compactOldItems idempotency

PR: #4824 Commit: 5957010 fixed a counting bug where already-compacted tool groups were treated as having real output. No test covered that fix. This PR adds three tests:

  • Already-compacted groups (resultDisplay === UI_COMPACT_CLEARED_MESSAGE) are skipped; second call is a no-op
  • All groups already compacted → no-op
  • Mixed group (one tool with output, one already cleared) → only the group with real output gets compacted

2. Enable explicit GC at critical pressure

enableExplicitGC defaults to true now. --expose-gc is added to start.js, dev.js, and the npm start script. Two safety tests added:

  • trigger_gc only appears in the critical tier (≥80%), not soft or hard
  • global.gc() is only called from the trigger_gc case in memoryPressureMonitor.ts — source-level guard against future misuse

This is safe because global.gc() only reclaims unreachable objects. The cleanup steps (compact_history, clear_file_cache) run first and break references; trigger_gc runs last to immediately free old_space.

3. Disable debug log file by default

isDebugLogFileEnabled() now returns false when QWEN_DEBUG_LOG_FILE is unset. PR: #4824 memory monitoring writes a debug line every 30s, that's unnecessary I/O for normal usage. Users can opt in with QWEN_DEBUG_LOG_FILE=1.

Why it's needed

  1. The idempotency fix from 5957010 had no test. A future refactor could silently drop the !== UI_COMPACT_CLEARED_MESSAGE guard and reintroduce over-compaction.
  2. With enableExplicitGC: false, V8 may not reclaim old_space fast enough after compact_history breaks references, which can lead to OOM in that window.
  3. Debug logs write continuously by default, adding I/O overhead for no user benefit.

Reviewer Test Plan

Before / After

Before After
Idempotent compaction test None 3 regression tests
enableExplicitGC false true
--expose-gc in startup scripts Missing Added to start.js, dev.js, npm start
trigger_gc tier guard test None Asserts only in critical
global.gc() call-site guard test None Source scan, exactly 1 call site
Debug log default Enabled Disabled, opt-in via QWEN_DEBUG_LOG_FILE=1

How to verify

npm run build && npm run bundle && npm start

# Tests
cd packages/cli && npx vitest run src/ui/hooks/useHistoryManager.test.ts
cd packages/core && npx vitest run src/services/memoryPressureMonitor.test.ts

Manual: trigger critical memory pressure → debug log should contain global.gc().
Verify debug log off by default: after startup, no new file under ~/.qwen/debug/. Set QWEN_DEBUG_LOG_FILE=1, restart, file appears.

中文

Closes: #4815

改动

1. compactOldItems 幂等性回归测试

PR: #4824 commit #595701096 修了已压缩 tool group 被重复计数的 bug,但没补测试。新增三个:已压缩跳过、全已压缩跳过、混合 group 只压缩有真实输出的。

2. critical 压力下默认开启 global.gc()

enableExplicitGC 改为 true--expose-gc 加到 start.js/dev.js/npm start。加了两个安全测试:trigger_gc 只在 critical 级别出现;global.gc() 只在 memoryPressureMonitor.ts 的 trigger_gc case 中调用。

安全:global.gc() 只回收无引用对象。compact_history/clear_file_cache 先断引用,trigger_gc 最后立即回收 old_space。

3. debug 日志默认关闭

isDebugLogFileEnabled() 未设置 QWEN_DEBUG_LOG_FILE 时返回 false。V2 的 30 秒内存监控每 30 秒写一行,没必要。用户用 QWEN_DEBUG_LOG_FILE=1 手动开启。

验证

npm run build && npm run bundle && npm start
cd packages/cli && npx vitest run src/ui/hooks/useHistoryManager.test.ts
cd packages/core && npx vitest run src/services/memoryPressureMonitor.test.ts

手动:触发 critical 内存压力,看日志里有没有 global.gc() freed N bytes。启动后 ~/.qwen/debug/ 下无新文件;设 QWEN_DEBUG_LOG_FILE=1 重启后有。

zzhenyao added 2 commits June 10, 2026 08:48
Cover the scenario fixed in commit 5957010 where already-compacted
tool groups (resultDisplay === UI_COMPACT_CLEARED_MESSAGE) were
incorrectly counted as having real output, causing over-compaction.

Three new test cases:
- Already-compacted groups are not re-compacted; second call is a no-op
- All tool groups already compacted → no-op
- Mixed tool group (some tools real, some cleared) → only groups with
  real output are compacted
- enableExplicitGC defaults to true, --expose-gc added to start/dev scripts
- isDebugLogFileEnabled() defaults to false (opt-in via QWEN_DEBUG_LOG_FILE=1)
- Add safety tests: trigger_gc only in critical tier, global.gc() only in
  memoryPressureMonitor.ts trigger_gc case
@zzhenyao zzhenyao marked this pull request as ready for review June 10, 2026 01:09

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] packages/core/src/config/config.ts:935-937QWEN_MEMORY_ENABLE_GC=1 env var check is now redundant since DEFAULT_PRESSURE_CONFIG.enableExplicitGC defaults to true. No =0 opt-out path exists. Either add QWEN_MEMORY_ENABLE_GC=0 handling or remove the redundant code.

[Suggestion] packages/core/src/services/memoryPressureMonitor.ts:494-506 — Silent warning chain: when global.gc is unavailable in production, the trigger_gc warn log is swallowed because isDebugLogFileEnabled() now defaults to false. The GC unavailability is invisible to users. Consider using console.warn or stderr for this diagnostic so it is not gated on debug log file settings.

— DeepSeek/deepseek-v4-pro via Qwen Code /review

Comment thread packages/core/src/services/memoryPressureMonitor.ts
Comment thread package.json Outdated
Comment thread packages/core/src/services/memoryPressureMonitor.test.ts Outdated
Comment thread packages/cli/src/ui/hooks/useHistoryManager.test.ts
- Replace brittle source-parsing test with behavioral tests for global.gc()
- Export UI_COMPACT_CLEARED_MESSAGE constant and use in tests
- Remove redundant NODE_OPTIONS override from start script
- Add production bin wrapper with --expose-gc for OOM protection
- Remove unused path import from memoryPressureMonitor.test.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
@zzhenyao

Copy link
Copy Markdown
Contributor Author

Thanks for the review! @wenshao All R1 comments addressed:

  • Replaced brittle source-parsing test with behavioral tests that mock global.gc and verify it is only called under critical pressure with enableExplicitGC: true
  • Exported UI_COMPACT_CLEARED_MESSAGE constant and updated tests to import it instead of hardcoding the string value
  • Fixed NODE_OPTIONS="--expose-gc" in package.json start script that overwrote user-set NODE_OPTIONS (e.g., --max-old-space-size); start.js already adds --expose-gc via nodeArgs so the npm script override was redundant and potentially harmful
  • Added production bin wrapper (scripts/cli-entry.js) that spawns dist/cli.js with --expose-gc so global.gc() is available in production, not just development
  • Changed bin entry in package.json from dist/cli.js to scripts/cli-entry.js and added the wrapper to published files. --expose-gc is non-invasive, it only triggers during critical memory pressurewhere the brief pause prevents an OOM kill

@zzhenyao zzhenyao requested a review from wenshao June 10, 2026 05:05

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] --expose-gc does not reach all deployment modes. Two execution paths were not updated:

  • scripts/create-standalone-package.js (lines 494, 503) — standalone package shims (Unix + Windows) launch node cli.js without --expose-gc
  • AcpBridge.ts:61 and httpAcpBridge.ts:4165 — daemon-spawned sessions use [cliEntry, '--acp'] without forwarding process.execArgv (contrast with relaunch.ts:44 which correctly spreads ...process.execArgv)

With enableExplicitGC: true, these paths hit the "global.gc is not available" fallback at critical pressure — GC won't actually run, undermining OOM prevention for standalone and daemon deployments.

— qwen3.7-plus via Qwen Code /review

Comment thread scripts/cli-entry.js Outdated
DragonnZhang
DragonnZhang previously approved these changes Jun 10, 2026

@DragonnZhang DragonnZhang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the PR-specific changes (OOM prevention: idempotency tests, explicit GC, debug log defaults). The existing review comments from @wenshao cover the significant issues — signal forwarding in cli-entry.js and the --expose-gc coverage gap. The R1 fixes (behavioral GC tests, UI_COMPACT_CLEARED_MESSAGE import) addressed prior feedback well. No additional findings. — qwen-code-reviewer via Qwen Code /review

zzhenyao and others added 2 commits June 10, 2026 18:48
Standalone package shims and daemon-spawned sessions (AcpBridge,
httpAcpBridge) were missing --expose-gc, causing explicit GC to
silently fail under critical memory pressure.

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
@zzhenyao

Copy link
Copy Markdown
Contributor Author

Thanks for the review! @wenshao R2 comments addressed:

  • Forward --expose-gc to standalone package shims and daemon-spawned sessions (AcpBridge, httpAcpBridge)
  • Forward child process signal in cli-entry wrapper for correct OOM exit code (128+signal)

@zzhenyao

Copy link
Copy Markdown
Contributor Author

Pushed a new commit.
@DragonnZhang Could you please take another look when you have a moment? Thanks!

@zzhenyao zzhenyao requested review from DragonnZhang and wenshao June 10, 2026 11:24
Comment thread packages/cli/src/serve/httpAcpBridge.ts Outdated
Comment thread scripts/cli-entry.js

@DragonnZhang DragonnZhang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Reviewed PR #4914 which hardens OOM prevention through idempotent compaction tests, explicit GC enablement, and debug log defaults.

Changes Overview

  • Added regression tests for compactOldItems idempotency (3 test cases)
  • Enabled explicit GC by default (enableExplicitGC: true) with --expose-gc in startup scripts
  • Disabled debug logging by default (opt-in via QWEN_DEBUG_LOG_FILE=1)
  • Added production bin wrapper (scripts/cli-entry.js) with --expose-gc
  • Forwarded --expose-gc to standalone packages and daemon-spawned sessions

Findings

1 Critical issue found:

  • scripts/cli-entry.js is not marked as executable (mode 100644) despite being a bin entry point with shebang

Assessment

The PR addresses the OOM prevention gaps identified in prior reviews. The implementation is sound:

  • Test coverage is comprehensive and behavioral (not brittle source-parsing)
  • Signal forwarding in cli-entry.js correctly handles child process termination
  • The --expose-gc flag forwarding is now consistent across all deployment modes
  • Removing the redundant QWEN_MEMORY_ENABLE_GC env var check simplifies the config

The executable bit issue on cli-entry.js should be fixed before merge to ensure proper bin behavior across all installation methods.

Verdict: Request Changes (1 critical finding)

@wenshao

wenshao commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Local runtime verification report (head 74b690e)

Verdict: FAIL — one deployment-channel gap; everything else verified working at runtime.

Every behavior this PR claims was exercised in real tmux-driven sessions and works: --expose-gc reaches the process (and survives the self-relaunch), global.gc() actually fires at the critical tier in a live session, debug logging is off by default and opt-in works, the new bin wrapper propagates exit codes and signals, and serve/ACP children inherit execArgv. But the artifact the release pipeline actually publishes to npm bypasses all of it: release.yml publishes from working-directory: dist using a package.json generated by scripts/prepare-package.js, which hardcodes bin: { qwen: 'cli.js' } and does not include cli-entry.js. So for npm install -g @qwen-code/qwen-code — the primary distribution channel — the flagship change is inert (runtime-proven below, graceful warn, no crash). One small prepare-package.js change closes the gap.

Environment / method

  • Linux x64, Node v22.22.2, fresh git worktree at PR head 74b690ede, real npm ci + npm run build && npm run bundle
  • Real sessions driven in tmux with a live model endpoint and real tool calls; non-interactive runs via -y -p "…"
  • Critical pressure forced legitimately: NODE_OPTIONS=--max-old-space-size=256 (baseline heapUsed ≈103 MB → ratio ≈0.34) + QWEN_MEMORY_PRESSURE_SOFT/HARD/CRITICAL=0.3/0.31/0.32 (the minimum values validateMemoryPressureConfig accepts)
  • Unit tests intentionally not re-run locally (CI already covers them on 3 OSes); this report is runtime observation only

Steps (running app, captured output)

  1. npm start (TUI in tmux) → process tree shows the flag on both levels — relaunch preserves it via process.execArgv (relaunch.ts:44):
    2284011  node --expose-gc /root/git/wt-pr4914/packages/cli        ← start.js child
    2284074  node --expose-gc /root/git/wt-pr4914/packages/cli        ← self-relaunched app
    
  2. ✅ Debug log off by default → TUI ran 80+ s (covers the 30 s ticker), zero new files in ~/.qwen/debug/ (6,549 pre-existing files from the old default illustrate the I/O this saves); latest symlink mtime unchanged.
  3. QWEN_DEBUG_LOG_FILE=1 → new session file + latest updated; [MEMORY_USAGE] lines at exact 30 s cadence (21:12:41.98221:13:11.982).
  4. Explicit GC end-to-end at critical tier — real session (read_file tool call, answer correct, exit 0):
    21:14:35 [INFO]  [MEMORY_PRESSURE] Effective memory limit: 15204 MiB; V8 heap limit: 304 MiB
    21:14:41 [INFO]  [MEMORY_DUMP] Phase 1 diagnostics written … (trigger=critical, dump #1)
    21:14:41 [DEBUG] [MEMORY_PRESSURE] global.gc() freed 143360 bytes
    21:14:41 [INFO]  [MEMORY_PRESSURE] Cleanup "aggressive" completed; RSS delta -118784 bytes
    
  5. ✅ Repo-root tarball (npm pack → install into a temp prefix): bin resolves to cli-entry.js; --version0.17.1, exit 0; TUI renders; tree is wrapper (no flag) → --expose-gc child → --expose-gc relaunched app.
  6. 🔍 Unknown flag through the wrapper → exit 1, identical output and exit code as invoking dist/cli.js directly (exit-code propagation holds on the error path).
  7. ✅ Signal forwarding (the 74b690ede fix): kill -TERM on the wrapped child → wrapper re-raises, shell observes exit 143.
  8. ✅ Double Ctrl-C in TUI → "Agent powering down. Goodbye!", exit 0, no leftover processes.
  9. qwen serve (wrapper install) + POST /session (bearer auth) → spawned ACP child inherits the flag (the httpAcpBridge.ts change, observed at the HTTP surface):
    2351087 ppid=2348935  node --expose-gc …/dist/cli.js --acp
    
  10. npm-publish artifact: built the real release layout with npm run prepare:package, installed it, ran it under the same forced-critical setup. No process has --expose-gc (with and without the self-relaunch path), and the critical tier logs:
    21:20:39 [WARN] [MEMORY_PRESSURE] trigger_gc requested but global.gc is not available; start Node.js with --expose-gc
    
    Cross-checks: generated dist/package.jsonbin: { qwen: "cli.js" }, no cli-entry.js in dist/ or its files; npm view @qwen-code/qwen-code bin{ qwen: 'cli.js' } (live package has the same shape).

The gap, precisely

scripts/prepare-package.jswriteDistPackageJson() is the package.json that ships (release.yml "Publish @qwen-code/qwen-code" runs npm publish in dist/). The PR updates the repo-root package.json bin (which only affects packing from the repo root) and the standalone shims, but not this generator. Suggested fix: copy the wrapper into dist/ and point the generated bin at it — note cli-entry.js resolves join(__dirname, '..', 'dist', 'cli.js'), which is wrong for the dist-root layout where cli.js sits next to the wrapper, so the copy needs a path tweak (e.g. resolve ./cli.js relative to itself, which would work in both layouts if kept in a scripts/ subdir, or simplest: emit a small dist-specific wrapper from writeDistPackageJson).

Other observations (non-blocking)

  • ⚠️ kill -TERM <wrapper pid> orphans the app: child and grandchild re-parent to pid 1 and the TUI keeps running while the wrapper exits 143. Interactive Ctrl-C is unaffected (signal goes to the foreground process group — verified). This is the same pre-existing pattern the relaunch layer already has, now one level deeper; it matters for process supervisors and targeted kills. Installing TERM/INT handlers in the wrapper that forward to the child (requires async spawn instead of spawnSync) would close it.
  • The wrapper adds a third resident node process: measured RSS wrapper 45 MB + relauncher 193 MB + app 194 MB. ~45 MB of permanent overhead per session is a real cost for an OOM-prevention PR — worth a conscious sign-off.
  • --expose-gc is legal inside NODE_OPTIONS on Node 22 (verified: NODE_OPTIONS="--expose-gc" node -e 'typeof global.gc'function), so the dev.js approach is sound.
  • With logging enabled, each npm start run produces two session files (launcher process + relaunched app each get a session id). Cosmetic, but doubles file count for the opt-in case.
  • QWEN_MEMORY_ENABLE_GC was removed and no opt-out replaced it: with the default now true, the only way explicit GC stays off is the absence of --expose-gc. Probably fine (the guard degrades gracefully), but it means no supported knob to disable the critical-tier pause.
  • Code-read note: QWEN_DEBUG_LOG_FILE=" " (whitespace-only) enables logging — !value passes, then ' '.trim()'' is not in the deny-list. Degenerate value, but contradicts the "unset → off" spirit.
  • Pre-existing, not this PR: checkpointing's Refresh index: 100% (3086/3086) progress leaks into -p mode output.

Repro (forced critical tier)

QWEN_DEBUG_LOG_FILE=1 QWEN_CODE_NO_RELAUNCH=true \
NODE_OPTIONS=--max-old-space-size=256 \
QWEN_MEMORY_PRESSURE_SOFT=0.3 QWEN_MEMORY_PRESSURE_HARD=0.31 QWEN_MEMORY_PRESSURE_CRITICAL=0.32 \
npm start -- -y -p "Use the read_file tool to read package.json, reply with only the version."
# then: grep -E 'trigger_gc|global.gc' ~/.qwen/debug/$(readlink ~/.qwen/debug/latest)
中文摘要

结论:FAIL — 仅一个发布渠道缺口,其余全部真机验证通过。

  • npm start/dev:进程树两级均带 --expose-gc(relaunch 经 process.execArgv 保留)
  • ✅ critical 档真实触发 global.gc() freed 143360 bytes(真实会话 + 真实工具调用,压低阈值 0.30/0.31/0.32 + 256MB 堆)
  • ✅ debug 日志默认关闭(TUI 跑 80 秒零新文件);QWEN_DEBUG_LOG_FILE=1 时 30 秒节奏正常
  • ✅ 仓库根 tarball 的 cli-entry.js wrapper:退出码透传(坏参数 1=1)、信号重发(TERM→143)、Ctrl-C 干净退出
  • qwen servePOST /session 生成的 ACP 子进程继承 --expose-gc
  • npm 正式发布产物没有 wrapper:release.ymldist/ 目录发布,prepare-package.js 生成的 package.json 硬编码 bin: { qwen: 'cli.js' } 且不含 cli-entry.js。实测该产物 critical 时输出 trigger_gc requested but global.gc is not available。即 npm install -g 用户(主渠道)拿不到本 PR 的核心特性。修复很小:让 writeDistPackageJson() 输出 wrapper(注意 cli-entry.js../dist/cli.js 相对路径在 dist 布局下需调整为 ./cli.js)。
  • ⚠️ 次要:kill -TERM wrapper 会让真正的应用进程变孤儿(交互式 Ctrl-C 不受影响);wrapper 增加约 45MB 常驻内存(三个 node 进程)。

@zzhenyao

Copy link
Copy Markdown
Contributor Author

Thanks for the review! @wenshao All R3 comments addressed:

  • --inspect flag filtering (ee4733a): Filter --inspect/--inspect-brk from process.execArgv before forwarding to daemon children in both httpAcpBridge.ts and AcpBridge.ts, preventing EADDRINUSE when debugging qwen serve with --inspect
  • npm-publish artifact gap (6aef782): Updated scripts/prepare-package.js to generate a dist-specific cli-entry.js wrapper with correct relative path (./cli.js) and set bin: { qwen: 'cli-entry.js' } in the generated dist package.json, so npm install -g users get the --expose-gc wrapper
  • Whitespace-only QWEN_DEBUG_LOG_FILE (9bb7fa2): Added '' to the deny-list in isDebugLogFileEnabled() so QWEN_DEBUG_LOG_FILE=" " no longer silently enables logging
  • QWEN_MEMORY_ENABLE_GC=0 opt-out (9bb7fa2): Restored env var support in loadMemoryPressureConfig() with 0/false/off/no values to disable explicit GC

@zzhenyao

Copy link
Copy Markdown
Contributor Author

Thanks for the review! @DragonnZhang R4 comments addressed:

  • cli-entry.js executable bit (cfc134e): Set mode to 100755 via git update-index --chmod=+x so the bin entry works correctly from git clones and npm link

@zzhenyao zzhenyao requested review from DragonnZhang and wenshao June 10, 2026 22:21
Comment thread scripts/cli-entry.js
zzhenyao added 2 commits June 12, 2026 06:01
…m-V2.5

# Conflicts:
#	packages/cli/src/serve/httpAcpBridge.ts
…nnel

- Add --expose-gc to getAcpMemoryArgs() so daemon-spawned ACP children
  have global.gc() available for critical memory pressure cleanup
- Filter --inspect/-brk flags from process.execArgv to prevent port
  conflicts in multi-session daemon mode
- Update spawnChannel.test.ts for new getAcpMemoryArgs() return shape

This change was previously in httpAcpBridge.ts but lost during the
daemon refactor merge (QwenLM#4490) that moved spawn logic to acp-bridge.
@zzhenyao zzhenyao force-pushed the fix/microcompact-oom-V2.5 branch from 70aec2c to 7a1a5bf Compare June 11, 2026 22:22
DragonnZhang
DragonnZhang previously approved these changes Jun 11, 2026

@DragonnZhang DragonnZhang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the full diff (33 files, +1216/-435). The three substantive changes are clean and well-tested:

  1. Idempotent compaction tests — Three focused regression tests that cover already-compacted groups, all-compacted, and mixed groups. Exporting UI_COMPACT_CLEARED_MESSAGE is the minimum surface change needed.

  2. Explicit GC by defaultenableExplicitGC: true is correct; the env-var semantics flip from opt-in to opt-out cleanly. --expose-gc propagation through start.js, dev.js, cli-entry.js, spawnChannel.ts, and AcpBridge.ts is consistent. The inspect-flag filter on process.execArgv correctly prevents debug-port conflicts in child processes. The new GC safety tests (tier guard + call-site guard) provide good regression coverage.

  3. Debug log default off — Straightforward inversion with sensible addition of empty string to the deny list.

No bugs, security issues, or logic errors found. The bulk of the line count is Prettier reformatting and markdown table alignment, which is noise but harmless.

@zzhenyao

Copy link
Copy Markdown
Contributor Author

@DragonnZhang Thank you for the thorough review and approval! 🙏

@zzhenyao

Copy link
Copy Markdown
Contributor Author

I've just merged the latest upstream changes from main. @wenshao Could you take another look when you have a chance?

@zzhenyao zzhenyao requested a review from wenshao June 11, 2026 22:48
wenshao
wenshao previously approved these changes Jun 12, 2026
@zzhenyao zzhenyao dismissed stale reviews from wenshao and DragonnZhang via 1c2a79e June 12, 2026 06:06
@zzhenyao zzhenyao force-pushed the fix/microcompact-oom-V2.5 branch from 1c2a79e to ddf7147 Compare June 12, 2026 06:19
@zzhenyao

Copy link
Copy Markdown
Contributor Author

Thanks for the detailed reviews and approval! @wenshao @DragonnZhang I've resolved the merge conflicts with main. I kept both sets of tests in memoryPressureMonitor.test.ts.

@zzhenyao zzhenyao requested a review from DragonnZhang June 12, 2026 21:30
@wenshao

wenshao commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

@qwen-code /triage

@qwen-code-ci-bot

Copy link
Copy Markdown
Collaborator

Thanks for the PR! Solid follow-up to the OOM work in #4824 — these are practical hardening changes.

Template is mostly complete ✓ — has "What this PR does", "Why it's needed", and a Reviewer Test Plan with verification steps. Missing the "Risk & Scope" and "Tested on" sections from the template, which would be useful here given that enableExplicitGC and debug-log defaults are behavior changes users will notice.

On direction: Aligned. The three changes directly address the OOM bug from #4815. Regression tests for the compaction idempotency fix are important — that kind of guard is exactly what prevents silent reintroduction. Enabling explicit GC at critical pressure is a reasonable safety net (zero cost when not triggered, and the tier guards are well-tested). Disabling debug log by default is a sensible I/O reduction for normal usage.

On approach: The core logic is tight, but the PR is noisier than it needs to be. About 10 of the 32 changed files are pure Prettier reformatting (markdown tables in docs/design/daemon-acp-http/README.md, code style in packages/sdk-typescript/src/daemon-mcp/serve-bridge/tools/*.ts, ChannelBase.ts, extensionsCommand.ts, etc.) that have nothing to do with OOM prevention. These should be in a separate formatting PR — mixing them in makes the substantive changes harder to review and adds blame noise to git log. Would you consider splitting those out?

One thing worth thinking about: cli-entry.js exists in two places — as scripts/cli-entry.js and as an inline string in scripts/prepare-package.js. If one changes without the other, they'll drift silently. Could prepare-package.js just copy the file from scripts/ instead of inlining the content?

Not blockers — moving on to code review. 🔍

中文说明

感谢贡献!这是 #4824 OOM 修复的扎实后续工作。

模板基本完整 ✓,有 "What this PR does"、"Why it's needed" 和 Reviewer Test Plan。缺少 "Risk & Scope" 和 "Tested on" 部分,考虑到 enableExplicitGC 和 debug log 默认值是用户可感知的行为变更,建议补上。

方向: 对齐。三项改动直接针对 #4815 的 OOM 问题。压缩幂等性回归测试很重要,能防止将来无意中重新引入。critical 压力下启用显式 GC 是合理的安全网(未触发时零开销,压力层级守护测试覆盖充分)。debug 日志默认关闭减少正常使用的 I/O。

方案: 核心逻辑紧凑,但 PR 包含约 10 个纯 Prettier 格式化文件(docs/design/ 的 markdown 表格、sdk-typescript 的代码格式等),与 OOM 无关。建议拆成单独的格式化 PR,避免干扰实质改动的审查。另外 cli-entry.jsscripts/cli-entry.jsscripts/prepare-package.js 中各存一份,有漂移风险——prepare-package.js 可以直接复制文件而非内联内容。

非阻塞项,进入代码审查 🔍

Qwen Code · qwen3.7-max

@qwen-code-ci-bot

Copy link
Copy Markdown
Collaborator

Code Review

The core logic changes are clean — no correctness bugs or security issues found.

Substantive changes (all sound):

  • memoryPressureMonitor.ts — single-line flip enableExplicitGC: false → true. Straightforward.
  • config.ts — inverts the QWEN_MEMORY_ENABLE_GC env var from opt-in to opt-out. Correct: when the env var is unset or truthy, GC stays enabled (the new default); only explicit disable values (0, false, off, no) turn it off.
  • debugLogger.ts — two-line change: return true → return false when env var is unset. Adds '' to the deny list (dead code since the !value guard catches it, but harmless).
  • useHistoryManager.ts — exports UI_COMPACT_CLEARED_MESSAGE so tests can reference it. Clean.
  • cli-entry.js — subprocess wrapper using spawnSync to inject --expose-gc. Signal handling is correct (re-sends child signal to self). Exit code propagation is correct. This is the right approach for the npm/npx bin case where there's no other way to add Node flags.
  • spawnChannel.ts — adds --expose-gc to getAcpMemoryArgs() and propagates process.execArgv (filtering --inspect*). The execArgv propagation ensures daemon-spawned CLI children inherit memory settings.
  • AcpBridge.ts — same execArgv propagation pattern for the AcpBridge.start() path.

Formatting-only changes (~10 files): As flagged in Stage 1 — docs/design/daemon-acp-http/README.md, packages/sdk-typescript/src/daemon-mcp/serve-bridge/tools/*.ts, ChannelBase.ts, extensionsCommand.ts, nonInteractiveCliCommands.ts, server.ts, server.test.ts, client.ts, bridgeErrors.ts are all pure Prettier reformatting. These inflate the diff from ~800 semantic lines to ~1600 and add blame noise.

One maintenance concern: cli-entry.js content is duplicated between scripts/cli-entry.js (checked in) and the inline string in scripts/prepare-package.js (generated at build time). If one is updated without the other, they'll drift silently.

Real-Scenario Testing

Before (installed qwen 0.18.0)

runner@runnervm1li68:~/work/qwen-code/qwen-code$ qwen -p 'say hello in exactly 3 words' 2>&1 | tee /tmp/triage-test-232229/before.log
Hello there friend
runner@runnervm1li68:~/work/qwen-code/qwen-code$

Debug log enabled by default — session file created (16,908 bytes):

$ ls -la ~/.qwen/debug/
-rw-r--r-- 1 runner runner 16908 Jun 12 23:22 4a4463cc-d1a7-4234-a7e0-a45b6e2acc19.txt
lrwxrwxrwx 1 runner runner    40 Jun 12 23:22 latest -> 4a4463cc-d1a7-4234-a7e0-a45b6e2acc19.txt

After (PR branch via npm run dev)

runner@runnervm1li68:~/work/qwen-code/qwen-code$ npm run dev -- -p 'say hello in exactly 3 words' 2>&1 | tee /tmp/triage-test-232229/after.log

> @qwen-code/qwen-code@0.17.1 dev
> node scripts/dev.js -p say hello in exactly 3 words

DEV is set to true, but the React DevTools server is not running. Start it with:

$ npx react-devtools

Hello there friend
runner@runnervm1li68:~/work/qwen-code/qwen-code$

Debug log disabled by default — no session file created (only the unrelated system log remains at 571 bytes):

$ ls -la ~/.qwen/debug/
-rw-r--r-- 1 runner runner 571 Jun 12 23:40 2c9a17c6-1161-4286-b7ee-d539f573cb7a.txt

Behavior change confirmed: debug log file creation is suppressed when QWEN_DEBUG_LOG_FILE is unset. --expose-gc in dev.js works (CLI started successfully).

Unit Tests

All test suites from the PR pass:

Test file Result
useHistoryManager.test.ts 22/22 passed ✅
memoryPressureMonitor.test.ts 72/72 passed ✅
config.test.ts 211/211 passed ✅
spawnChannel.test.ts 20/20 passed ✅
loggers.test.ts 50/50 passed ✅
中文说明

代码审查

核心逻辑改动干净,无正确性 bug 或安全问题。

实质改动(均合理):

  • memoryPressureMonitor.ts — 单行翻转 enableExplicitGC
  • config.tsQWEN_MEMORY_ENABLE_GC 从 opt-in 改为 opt-out。
  • debugLogger.ts — 未设置环境变量时默认返回 false
  • cli-entry.js — 使用 spawnSync 注入 --expose-gc,信号和退出码处理正确。
  • spawnChannel.ts / AcpBridge.ts — 传递 process.execArgv(过滤 --inspect*)。

维护隐患: cli-entry.js 内容在 scripts/cli-entry.jsscripts/prepare-package.js 中各存一份,存在漂移风险。

格式化改动(~10 个文件): 如 Stage 1 所述,建议拆分。

实际测试

  • Before(已安装 qwen 0.18.0): 正常运行,debug 日志默认创建(16,908 字节)。
  • After(PR 分支 npm run dev): 正常运行,debug 日志默认未创建。行为变更已验证。
  • 单元测试: 5 个测试套件共 375 个测试全部通过。

Qwen Code · qwen3.7-max

@qwen-code-ci-bot

Copy link
Copy Markdown
Collaborator

This PR does what it says on the tin — three focused hardening changes that directly address the OOM pain from #4815, backed by solid regression tests. The before/after testing confirms the debug log default change works, and all 375 unit tests pass across the five affected suites.

My independent proposal for this problem would have been nearly identical to the PR's core changes. The one thing the PR does that I wouldn't have thought of is the cli-entry.js subprocess wrapper — it's a clever and correct solution to the "how do we pass --expose-gc when npm just runs node dist/cli.js" problem. The execArgv propagation in spawnChannel.ts and AcpBridge.ts is a slightly heavier approach than just adding --expose-gc to each spawn site, but it's defensible for daemon scenarios and the --inspect* filter is the right safety valve.

The two concerns from earlier stages are real but not blockers: the ~10 formatting-only files dilute the diff and add git blame noise (worth splitting next time, not worth blocking this PR over), and the cli-entry.js duplication between scripts/cli-entry.js and the inline string in prepare-package.js is a maintenance risk that a future refactor will likely catch.

Approving. ✅

中文说明

这个 PR 实现了它所承诺的——三项针对 #4815 OOM 问题的专注加固改动,配合扎实的回归测试。Before/after 测试验证了 debug 日志默认关闭的行为变更,5 个测试套件共 375 个测试全部通过。

我独立提出的方案与 PR 的核心改动几乎一致。PR 比我多想一步的是 cli-entry.js 子进程包装器——巧妙地解决了 npm 直接运行 node dist/cli.js 时如何传递 --expose-gc 的问题。spawnChannel.tsAcpBridge.ts 中的 execArgv 传递比逐个添加 --expose-gc 更重一些,但对 daemon 场景合理,--inspect* 过滤是必要的安全阀。

两个非阻塞顾虑:~10 个纯格式化文件增加了 diff 噪音(建议下次拆分);cli-entry.js 在两处存在内容重复(维护隐患,但不阻塞合并)。

批准。✅

Qwen Code · qwen3.7-max

@qwen-code-ci-bot qwen-code-ci-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, looks ready to ship. ✅

@zzhenyao

Copy link
Copy Markdown
Contributor Author

@wenshao Thanks for the reviews!

Regarding the ~10 formatting-only files flagged by triage: these were leftover formatting issues from other contributors' commits that I accidentally staged when running npm run preflight.

Should I remove these formatting changes, or is it fine to keep them? Once merged, future contributors running npm run preflight won't produce diffs for these files anymore.

@wenshao

wenshao commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

🔁 Re-verification — head ddf714784: new daemon GC-forwarding path verified; LGTM stands

Third pass, updating my FAIL report (head 74b690ede — npm-publish artifact gap) and my LGTM (head 94cb9fc — gap fixed, all verified). Since that LGTM the branch added 7a1a5bf45 (acp-bridge spawnChannel — forward --expose-gc, filter --inspect) plus two main merges. Re-verified the current head on Linux 6.12 / Node v22.22.2 (fresh worktree at ddf714784, npm ci + tsc/bundle, real qwen serve driven in tmux).

Verdict: LGTM stands. The new daemon ACP-child GC-forwarding path works (both --expose-gc delivery and --inspect filtering), my original npm-publish blocker remains fixed, and the regression coverage is load-bearing. One prior non-blocking test-hygiene nit (#1) is still open.

1. NEW — daemon-spawned ACP child GC forwarding (7a1a5bf45)

getAcpMemoryArgs() now always appends --expose-gc, and createSpawnChannelFactory filters --inspect/-brk from process.execArgv. This restores the forwarding "lost during the #4490 daemon refactor" — I'd verified the old httpAcpBridge path before; this is the acp-bridge spawnChannel path qwen serve now uses. Driven live (serve + POST /session, inspecting the child's /proc/<pid>/cmdline):

Parent daemon launched as Spawned ACP child cmdline Result
node … serve (no --expose-gc) node --max-old-space-size=16384 --expose-gc … --acp ✅ child gets --expose-gc anyway (unconditional getAcpMemoryArgs); no --inspect
node --inspect=127.0.0.1:9559 --expose-gc … serve node --expose-gc --max-old-space-size=16384 --expose-gc … --acp --inspect filtered (no debugger-port collision); --expose-gc kept
  • spawnChannel.test.ts: 20/20. Mutation check: deleting '--expose-gc' from getAcpMemoryArgs()always includes --expose-gc … fails — the test is load-bearing.
  • Minor (cosmetic, non-blocking): when the parent does have --expose-gc, the child carries it twice (once via forwarded execArgv, once via the unconditional memory args). Harmless duplicate; could dedupe.

2. Original FAIL (06-10) stays fixed — npm-publish artifact

Ran npm run bundle && npm run prepare:package on this head and inspected the published dist/ layout:

  • dist/cli-entry.js present; dist/package.jsonbin: { qwen: "cli-entry.js" }; files includes cli-entry.js.
  • The generated dist wrapper resolves join(__dirname, 'cli.js') — the correct dist-root path (exactly the tweak I flagged; the repo-root wrapper still uses ../dist/cli.js).
  • It spawns node --expose-gc cli.js …; node dist/cli-entry.js --version0.17.1, exit 0. (My LGTM already did the full npm pack → install -g → global.gc() under critical pressure run; this confirms the merges didn't regress it.)

3. Regression tests — green & load-bearing

Suite Result
useHistoryManager.test.ts (compaction idempotency) 22/22
memoryPressureMonitor.test.ts 72/72
config.test.ts "explicit GC is enabled by default"
spawnChannel.test.ts 20/20

Mutation: dropping the !== UI_COMPACT_CLEARED_MESSAGE guard in compactOldItemsshould not re-compact already-compacted tool groups (idempotent) fails. (Note: the PR's net useHistoryManager.ts change is now just export const UI_COMPACT_CLEARED_MESSAGE — the guard itself already lives in main; the PR's value here is the new regression test, which is what the mutation confirms.)

4. Status of my earlier findings

5. CI

ddf714784: Lint ✅, CodeQL ✅, Test ubuntu/macOS/Windows ✅, Classify/review-config ✅.

🇨🇳 中文版(点击展开)

🔁 复核 — head ddf714784:新增的 daemon GC 转发路径已验证;维持 LGTM

第三次复核,更新我的 FAIL 报告(head 74b690ede——npm 发布产物缺口)与 LGTM(head 94cb9fc——缺口已修、全部验证)。自该 LGTM 后,分支新增了 7a1a5bf45(acp-bridge spawnChannel——转发 --expose-gc、过滤 --inspect)以及两次 main 合并。在 Linux 6.12 / Node v22.22.2 上复核当前 head(ddf714784 全新 worktree、npm ci + tsc/bundle、tmux 驱动真实 qwen serve)。

结论:维持 LGTM。 新的 daemon ACP 子进程 GC 转发路径有效(--expose-gc 投递 + --inspect 过滤),我最初的 npm 发布阻塞项仍处于已修复状态,回归覆盖具有实际约束力。此前一个不阻塞的测试卫生小问题(#1)仍未处理。

1. 新增——daemon 派生的 ACP 子进程 GC 转发(7a1a5bf45
getAcpMemoryArgs() 现在总是追加 --expose-gccreateSpawnChannelFactoryprocess.execArgv 过滤 --inspect/-brk。这恢复了"在 #4490 daemon 重构合并中丢失"的转发——我之前验证的是旧的 httpAcpBridge 路径,这次是 qwen serve 现在使用的 acp-bridge spawnChannel 路径。实测(serve + POST /session,读子进程 /proc/<pid>/cmdline):

父 daemon 启动方式 派生的 ACP 子进程 cmdline 结果
node … serve --expose-gc node --max-old-space-size=16384 --expose-gc … --acp ✅ 子进程仍获得 --expose-gc(无条件 getAcpMemoryArgs);无 --inspect
node --inspect=127.0.0.1:9559 --expose-gc … serve node --expose-gc --max-old-space-size=16384 --expose-gc … --acp --inspect过滤(无调试端口冲突);--expose-gc 保留
  • spawnChannel.test.ts20/20。变异检查:从 getAcpMemoryArgs() 删除 '--expose-gc'always includes --expose-gc … 失败——测试有约束力。
  • 次要(外观、不阻塞):父进程本身带 --expose-gc 时,子进程会带两次(一次来自转发的 execArgv,一次来自无条件的 memory args)。无害重复,可去重。

2. 最初的 FAIL(06-10)仍处于已修复状态——npm 发布产物
在本 head 上跑 npm run bundle && npm run prepare:package 并检查发布的 dist/ 布局:

  • 存在 dist/cli-entry.jsdist/package.jsonbin: { qwen: "cli-entry.js" }files 包含 cli-entry.js
  • 生成的 dist wrapper 解析 join(__dirname, 'cli.js')——dist 根目录下的正确路径(正是我提示的那处调整;仓库根 wrapper 仍用 ../dist/cli.js)。
  • 它派生 node --expose-gc cli.js …node dist/cli-entry.js --version0.17.1,退出 0。(我的 LGTM 已做过完整的 npm pack → install -g → critical 压力下 global.gc();本次确认这些合并未导致回归。)

3. 回归测试——全绿且有约束力

套件 结果
useHistoryManager.test.ts(压缩幂等) 22/22
memoryPressureMonitor.test.ts 72/72
config.test.ts "explicit GC is enabled by default"
spawnChannel.test.ts 20/20

变异:去掉 compactOldItems 里的 !== UI_COMPACT_CLEARED_MESSAGE 卫语句 → should not re-compact already-compacted tool groups (idempotent) 失败。(注意:本 PR 对 useHistoryManager.ts 的净改动现在只是 export const UI_COMPACT_CLEARED_MESSAGE——卫语句本身已在 main;PR 在此处的价值是新增的回归测试,变异正是验证了它。)

4. 我此前findings的状态

5. CI
ddf714784:Lint ✅、CodeQL ✅、Test ubuntu/macOS/Windows ✅、Classify/review-config ✅。

@wenshao wenshao merged commit f9080e4 into QwenLM:main Jun 14, 2026
28 checks passed
@yiliang114

yiliang114 commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Two notes — first one's a real blocker, second is just a design thought (non-blocking).

Release is broken right now. The standalone packager enforces a strict dist allowlist and the new dist/cli-entry.js isn't on it, so the release job fails with Unexpected dist asset: .../dist/cli-entry.js (run). Quick fix, same as #5049 did for fzfWorker.js — I've put up #5153 to unblock it.

One thing I'd love your take on (non-blocking): the OOM hardening here is clearly worth doing. I'm just not sure the bin wrapper is the cheapest way to get --expose-gc into production — it re-spawns a child node for the whole CLI lifetime, plus the arg/stdio/signal forwarding. Since the compaction is doing the real work and the code already degrades gracefully when gc() isn't available, I wonder whether the wrapper is buying enough to justify the extra process. Could well be missing something — just felt worth a second look when you get a chance.

Opened #5154 to track this so it doesn't get lost here — no rush, purely for discussion.

pomelo-nwu pushed a commit that referenced this pull request Jun 15, 2026
The OOM-prevention work in #4914 added a dist/cli-entry.js bin wrapper
(re-spawns node --expose-gc cli.js) via prepare-package.js, but did not
register it in the standalone packager's strict dist allowlist. The
release job then fails with:

  Error: Unexpected dist asset: .../dist/cli-entry.js

Add cli-entry.js to DIST_ALLOWED_ENTRIES, same fix as #5049 did for
fzfWorker.js.
doudouOUC pushed a commit that referenced this pull request Jun 15, 2026
…xplicit GC, debug log defaults (#4914)

* test(cli): add compactOldItems idempotency regression tests

Cover the scenario fixed in commit 5957010 where already-compacted
tool groups (resultDisplay === UI_COMPACT_CLEARED_MESSAGE) were
incorrectly counted as having real output, causing over-compaction.

Three new test cases:
- Already-compacted groups are not re-compacted; second call is a no-op
- All tool groups already compacted → no-op
- Mixed tool group (some tools real, some cleared) → only groups with
  real output are compacted

* fix(cli,core): enable explicit GC and disable debug log by default

- enableExplicitGC defaults to true, --expose-gc added to start/dev scripts
- isDebugLogFileEnabled() defaults to false (opt-in via QWEN_DEBUG_LOG_FILE=1)
- Add safety tests: trigger_gc only in critical tier, global.gc() only in
  memoryPressureMonitor.ts trigger_gc case

* fix: address R1 review comments for memory pressure monitor

- Replace brittle source-parsing test with behavioral tests for global.gc()
- Export UI_COMPACT_CLEARED_MESSAGE constant and use in tests
- Remove redundant NODE_OPTIONS override from start script
- Add production bin wrapper with --expose-gc for OOM protection
- Remove unused path import from memoryPressureMonitor.test.ts

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* fix: forward --expose-gc to all deployment modes

Standalone package shims and daemon-spawned sessions (AcpBridge,
httpAcpBridge) were missing --expose-gc, causing explicit GC to
silently fail under critical memory pressure.

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* fix: forward child process signal in cli-entry wrapper

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>

* fix(cli,channels): filter --inspect flags when forwarding execArgv to daemon children

* fix: make cli-entry.js executable (mode 100755)

* fix(core): reject whitespace-only QWEN_DEBUG_LOG_FILE and add QWEN_MEMORY_ENABLE_GC=0 opt-out

* fix(scripts): include cli-entry.js wrapper in dist package for npm publish

* fix(acp-bridge): forward --expose-gc and filter --inspect in spawnChannel

- Add --expose-gc to getAcpMemoryArgs() so daemon-spawned ACP children
  have global.gc() available for critical memory pressure cleanup
- Filter --inspect/-brk flags from process.execArgv to prevent port
  conflicts in multi-session daemon mode
- Update spawnChannel.test.ts for new getAcpMemoryArgs() return shape

This change was previously in httpAcpBridge.ts but lost during the
daemon refactor merge (#4490) that moved spawn logic to acp-bridge.

---------

Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Severe OOM with qwen --resume and Escape key broken

5 participants