fix(cli,core): prevent memory monitor starvation during autonomous loops via heartbeat fallback#5097
fix(cli,core): prevent memory monitor starvation during autonomous loops via heartbeat fallback#5097zzhenyao wants to merge 5 commits into
Conversation
…ops via heartbeat fallback
|
Thanks for the review! @wenshao All R1 comments addressed:
|
qqqys
left a comment
There was a problem hiding this comment.
Prior critical issue is resolved at d32acbc: the core→CLI starvation callback is now guarded with try/catch, so a callback failure will not mask the original scheduler/tool error. I did not find any remaining critical issue in this pass.
|
Thank @qqqys for reviewing and approving! |
| this.onStarvationCallback?.(); | ||
| } catch (err) { | ||
| debugLogger.error( | ||
| `onStarvation callback failed: ${err instanceof Error ? err.message : String(err)}`, |
There was a problem hiding this comment.
[Suggestion] The new catch block uses the inline pattern err instanceof Error ? err.message : String(err), but getErrorMessage is already imported (line 12) and used 8 other times in this file. getErrorMessage additionally handles error.cause chains and guards against String(error) throwing.
| `onStarvation callback failed: ${err instanceof Error ? err.message : String(err)}`, | |
| `onStarvation callback failed: ${getErrorMessage(err)}`, |
— qwen3.7-max via Qwen Code /review
|
| Binary | Result |
|---|---|
pre-fix (merge-base f9080e44) |
✅ * Type your message or @path/to/file (starts normally) |
PR (d32acbc1a3) |
❌ ERROR useConfig must be used within a ConfigProvider (crashes on mount) |
Reproduced on a minimal clean launch (node dist/cli.js --auth-type openai --approval-mode yolo), no special env — it's a React context error, so it happens on every start.
Root cause
useMemoryMonitor now calls useConfig() (packages/cli/src/ui/hooks/useMemoryMonitor.ts), and AppContainer invokes that hook in its component body:
AppContainer.tsx:314 useMemoryMonitor(historyManager); // hook runs here (body)
AppContainer.tsx:3714 <ConfigContext.Provider value={config}> // provider only rendered here (return)
gemini.tsx:347 renders <AppContainer> with no ConfigProvider above it, so when useMemoryMonitor → useConfig() runs at line 314 there is no ConfigContext in scope and useConfig() throws (ConfigContext.tsx).
The fix is local: AppContainer already has config in scope — it's destructured from props at AppContainer.tsx:306 (const { settings, config, … } = props;), and used at line 3714. Thread that config into useMemoryMonitor (param/option) instead of calling useConfig() inside the hook. Note: simply swapping useConfig() for a non-throwing useContext(ConfigContext) would stop the crash but leave config undefined in the body, so getMemoryPressureMonitor() would be skipped and the heartbeat would silently never register — the config must actually be threaded.
Why the green tests didn't catch it
Both test edits in this PR mock away the exact integration that breaks:
AppContainer.test.tsx(+3) addsvi.mock('./hooks/useMemoryMonitor.js', () => ({ useMemoryMonitor: () => {} }))— the hook is stubbed out, so the test never runs the realuseConfigcall underAppContainer.useMemoryMonitor.test.ts(+9) adds avi.mock('../contexts/ConfigContext.js', …)returning a fakegetMemoryPressureMonitor, so the hook test never hits the real provider lookup.
So the suite passes while the binary crashes.
Secondary: the new behavior is untested
Mutation test — reverting both source files to merge-base (removing the entire starvation/heartbeat fix) and keeping the PR's tests:
core memoryPressureMonitor.test.ts → 72 passed (unchanged)
cli useMemoryMonitor.test.ts + AppContainer.test.tsx → 97 passed (unchanged)
Nothing fails — so the starvation fallback and the heartbeat have zero behavioral coverage. memoryPressureMonitor.test.ts is untouched by the PR; the cli test edits are mock-plumbing only.
The mechanism itself is sound (when reachable)
I exercised the real built MemoryPressureMonitor directly (throwaway probe): an explicit starvation state (a pending check + Date.now() advanced >60 s) does run performCheck() synchronously and fire onStarvationCallback once, and a callback that throws is swallowed (the R1 try/catch works). The renamed setOnStarvationCallback and the [STARVATION] warn log are present. So the core logic is fine — it just never gets a chance to run because the app crashes at mount.
Minor: cold-start threshold nuance (fix while you're in here)
lastCheckTime initializes to 0, so now - lastCheckTime > 60_000 is true immediately with real Date.now(). On a fresh monitor, if the first microtask is starved, the 2nd scheduleCheck triggers the synchronous fallback right away rather than after 60 s. The existing dedup test still passes only because the duplicate performCheck's eviction is coincidentally suppressed by the same-millisecond cleanup cooldown (it counts evictNotAccessedSince, not performCheck; my probe shows performCheck actually runs twice). Initializing lastCheckTime = Date.now() in the constructor would make the 60 s threshold hold from process start. (Non-blocking, and moot until the crash is fixed.)
Verdict
Do not merge. This PR crashes the CLI on every startup; the feature it adds never initializes. Needs: (1) thread config into useMemoryMonitor instead of useConfig(); (2) a test that mounts AppContainer with the real (un-mocked) useMemoryMonitor so this regression is actually guarded; ideally (3) the lastCheckTime init tweak. Happy to re-verify once pushed.
中文版(Chinese version)
⚠️ 本地验证:此 PR 会导致 CLI 启动即崩溃 —— 请勿按现状合并
我在 Linux(Node 22.22.2)上将 d32acbc1a3 构建为真实二进制运行。CLI 无法启动:在 mount 阶段抛错,根本到不了输入提示。 单元测试全绿、CI/approval 看起来没问题,但测试改动恰好把这个崩溃掩盖了。
阻断性:无条件启动崩溃
ERROR useConfig must be used within a ConfigProvider
- useConfig → useMemoryMonitor → AppContainer → renderWithHooks → …
A/B(启动命令、环境、工作区完全相同,唯一差别是 PR 代码):
| 二进制 | 结果 |
|---|---|
修复前(merge-base f9080e44) |
✅ * Type your message or @path/to/file(正常启动) |
PR(d32acbc1a3) |
❌ ERROR useConfig must be used within a ConfigProvider(mount 即崩溃) |
用最简洁的干净命令(node dist/cli.js --auth-type openai --approval-mode yolo、无特殊环境)也能复现——这是 React context 错误,每次启动都会发生。
根因
useMemoryMonitor 现在调用了 useConfig()(packages/cli/src/ui/hooks/useMemoryMonitor.ts),而 AppContainer 在其组件函数体里调用该 hook:
AppContainer.tsx:314 useMemoryMonitor(historyManager); // hook 在此(函数体)执行
AppContainer.tsx:3714 <ConfigContext.Provider value={config}> // provider 仅在 return 里渲染
gemini.tsx:347 渲染 <AppContainer> 时其上方没有 ConfigProvider,因此 useMemoryMonitor → useConfig() 在第 314 行执行时作用域内没有 ConfigContext,useConfig() 抛错(ConfigContext.tsx)。
修复就在本地: AppContainer 其实已经持有 config —— 它在 AppContainer.tsx:306 从 props 解构(const { settings, config, … } = props;),并在 3714 行使用。把这个 config 作为参数/选项传给 useMemoryMonitor,不要在 hook 内部调用 useConfig()。注意:仅把 useConfig() 换成不抛错的 useContext(ConfigContext) 能避免崩溃,但函数体里 config 会是 undefined,于是 getMemoryPressureMonitor() 被跳过、心跳静默地永远不会注册——必须真正把 config 传进去。
为什么全绿的测试没抓到
本 PR 的两处测试改动恰好把会崩溃的那段集成 mock 掉了:
AppContainer.test.tsx(+3)新增vi.mock('./hooks/useMemoryMonitor.js', () => ({ useMemoryMonitor: () => {} }))——hook 被打桩,测试根本不会在AppContainer下跑真实的useConfig。useMemoryMonitor.test.ts(+9)新增vi.mock('../contexts/ConfigContext.js', …)返回假的getMemoryPressureMonitor,于是 hook 测试不会触发真实的 provider 查找。
所以测试通过、二进制却崩溃。
次要:新行为没有测试覆盖
变异测试——把两个源文件都回退到 merge-base(移除整个 starvation/heartbeat 修复),保留 PR 的测试:
core memoryPressureMonitor.test.ts → 72 通过(不变)
cli useMemoryMonitor.test.ts + AppContainer.test.tsx → 97 通过(不变)
没有任何测试失败——说明 starvation 兜底与心跳零行为覆盖。memoryPressureMonitor.test.ts 本 PR 未改动;cli 测试改动仅是 mock 管线。
机制本身是对的(只要能跑到)
我直接驱动了真实构建的 MemoryPressureMonitor(一次性探针):显式构造 starvation(有 pending check + Date.now() 推进 >60s)确实会同步执行 performCheck() 并触发一次 onStarvationCallback;回调抛错会被吞掉(R1 的 try/catch 生效)。重命名后的 setOnStarvationCallback 与 [STARVATION] warn 日志都在。所以 core 逻辑没问题——只是因为 app 在 mount 阶段崩溃,它根本没机会运行。
小问题:冷启动阈值(顺手一起修)
lastCheckTime 初始为 0,因此在真实 Date.now() 下 now - lastCheckTime > 60_000 立即为真。新建 monitor 时若第一个 microtask 被饿死,第 2 次 scheduleCheck 会立刻触发同步兜底,而非等满 60s。既有 dedup 测试仍通过,只是因为重复的 performCheck 的 eviction 恰好被同毫秒的 cleanup cooldown 抑制(它统计的是 evictNotAccessedSince 而非 performCheck;我的探针显示 performCheck 实际跑了两次)。在构造函数里 lastCheckTime = Date.now() 可让 60s 阈值从进程启动时起算。(不阻断,且在崩溃修好前无意义。)
结论
不要合并。 此 PR 每次启动都会让 CLI 崩溃,它新增的功能根本不会初始化。需要:(1) 把 config 传给 useMemoryMonitor,不要用 useConfig();(2) 增加一个真正 mount AppContainer + 真实(不 mock)useMemoryMonitor 的测试,以守护此回归;最好再加上 (3) lastCheckTime 初始化的小修。推送后我可以再验证一次。
…nd add behavioral test coverage - Thread `config` into `useMemoryMonitor` as a parameter instead of calling `useConfig()` inside the hook, fixing the unconditional startup crash caused by hook execution before `ConfigContext.Provider` was mounted. - Initialize `lastCheckTime = Date.now()` so the 60 s starvation threshold holds from process start. - Add 5 behavioral tests: 2 for starvation fallback in core, 3 for heartbeat callback in CLI.
|
Thanks for the review! @wenshao All feedback from R2 and the local verification report has been addressed.
|
|
@zzhenyao — a process ask for this PR and for future ones, so review cycles aren't spent on things one local run would catch. Why I'm raising it. In R2 I built this PR as the real binary and it crashed on every startup — You've since pushed a startup-crash fix ( The ask — before pushing, not after review:
This isn't about the idea: the starvation/heartbeat mechanism itself checks out, and your OOM series (#4824 / #4914 / #4982) is solid, landed work. It's specifically the "did I actually run it?" step. If every round needs a reviewer to build and launch the binary just to catch a deterministic startup crash, that's a lot of borrowed time per PR. 🙏 🇨🇳 中文@zzhenyao —— 一个关于这个 PR、也关于今后的流程要求,免得 review 的精力花在本地跑一次就能发现的问题上。 为什么提这个。 在 R2,我把这个 PR 构建成真实二进制运行,结果每次启动都崩溃—— 你之后提交了启动崩溃的修复( 要求——在 push 之前,而不是 review 之后:
这不是针对想法本身:starvation/heartbeat 机制本身是成立的,你的 OOM 系列(#4824 / #4914 / #4982)也都是扎实落地的工作。问题专门在「我到底跑过没有」这一步。如果每一轮都要 reviewer 构建并启动二进制,才能抓出一个确定性的启动崩溃,那每个 PR 借用的时间就太多了。🙏 |
Verification: prevent memory-monitor starvation via heartbeat fallbackVerdict: PASS for the observable refactor, wiring, and no-regression — verified at runtime. Transparent caveat: the novel 60s-starvation fallback branch itself could not be reproduced at the real surface (a normal async agentic workload does not starve the event loop), so that specific path rests on the PR's unit tests. Details below. I built the PR in an isolated worktree (real Claim (my read of the diff): under autonomous loops with zero idle time, Core's Method: worktree Steps
Sample (debug log — monitor continuity under load)Findings
中文版本(点击展开)验证:通过心跳兜底防止内存监控器饿死结论:通过(PASS)—— 可观察的重构、接线与无回归均已在运行时验证。透明说明: 新增的 60s 饿死兜底分支本身无法在真实界面复现(正常的异步 agent 工作负载不会让事件循环饿死),因此该具体路径依赖 PR 的单元测试。详见下文。 我在隔离 worktree 中构建本 PR(真实 被验证的声明: 零空闲的自主循环下,Core 的 方法: worktree 步骤
样例(调试日志——负载下监控器持续性)观察
|
|
@qwen-code /triage |
|
Thanks for the PR, @zzhenyao! Template looks good ✓ — all required sections present. Minor: "Evidence (Before & After)" is blank rather than explicitly marked N/A for this non-UI fix, and "Not validated / out of scope" bullet is absent. Not blocking. Direction: This is squarely in-scope. OOM crashes during autonomous loops are a real reliability problem — the event loop starving Approach: The scope is tight and proportional. Three changes: (1) starvation detection in Moving on to code review and testing. 🔍 中文说明感谢贡献,@zzhenyao! 模板完整 ✓ —— 所有必需章节均在。小瑕疵:"Evidence (Before & After)" 留空而非标注 N/A(这是非 UI 修复),"Not validated / out of scope" 条目缺失。不阻断。 方向: 完全在项目范围内。自主循环中的 OOM 崩溃是真实的可靠性问题——事件循环饿死 方案: 范围紧凑且成比例。三个改动:(1) 进入代码审查和测试 🔍 — Qwen Code · qwen3.7-max |
Code ReviewThe implementation is clean and well-structured. Six files, one coherent theme, no scope creep. Core ( CLI (
Tests: Behavioral coverage is real, not mock-plumbing. The starvation test in core advances No blockers found. Real-Scenario TestingBuilt the PR ( Startup (no crash)CLI launched cleanly — no Debug log — memory monitor activeMonitor initialized correctly and Unit tests中文说明代码审查实现干净且结构清晰。6 个文件,同一主题,无范围膨胀。 Core( CLI(
测试: 行为覆盖是真实的,不是 mock 管线。Core 的饿死测试推进 未发现阻断性问题。 真实场景测试构建 PR( CLI 干净启动——无 调试日志显示监控器正确初始化, 单元测试全部通过:core 74 + cli hook 11 + AppContainer 91 = 176 测试 ✅ — Qwen Code · qwen3.7-max |
|
This PR solves a real and painful problem — OOM crashes during autonomous loops when memory monitors get starved — with a minimal, well-targeted fix. The starvation detection via a 60s timestamp check in The R2→R3 iteration is worth noting: the original submission had a startup crash ( My independent take: I wouldn't have structured this differently. The cross-package callback is the natural coupling for core→CLI, the Build is clean, all 176 relevant tests pass, CLI starts without errors under tmux, and the memory monitor initializes and runs correctly. Verdict: Approve. ✅ 中文说明本 PR 解决了一个真实且痛苦的问题——自主循环中内存监控器被饿死导致的 OOM 崩溃——用了最小化、精准的修复。通过 R2→R3 的迭代值得注意:原始提交有一个启动崩溃( 我的独立判断:不会用不同的方式构建。跨包回调是 core→CLI 的自然耦合, 构建干净,176 个相关测试全部通过,CLI 在 tmux 下无错误启动,内存监控器正确初始化和运行。 结论:通过。 ✅ — Qwen Code · qwen3.7-max |
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
LGTM, looks ready to ship. ✅
| const monitor = config.getMemoryPressureMonitor(); | ||
| if (!monitor) return; | ||
|
|
||
| monitor.setOnStarvationCallback(() => { |
There was a problem hiding this comment.
[Critical] This heartbeat — the CLI half of the fix — is never registered in the real interactive CLI, so the starvation fallback is dead in production. The unit + AppContainer tests pass only because they stub the monitor.
config.getMemoryPressureMonitor() returns undefined here at mount: the root config's monitor is created inside config.initialize() (config.ts:1776), which is awaited in a post-mount effect (AppContainer.tsx:521-524). At mount, getMemoryPressureMonitor() finds no own and no inherited monitor and returns undefined (config.ts:3780-3798), so if (!monitor) return (line 129) bails. The effect's deps [config, runMemoryCheck] are both stable (config is a fixed prop; compactOldItems is useCallback(…, []) → runMemoryCheck stable), so the effect never re-runs after initialize() later creates the monitor. Net: setOnStarvationCallback is never called, and the core monitor's this.onStarvationCallback?.() is always a no-op. Both tests inject a getMemoryPressureMonitor stub that returns a monitor synchronously at render, which never reproduces the "undefined until initialize() resolves" timing.
Fix: gate registration on init readiness so the effect re-runs once the monitor exists — e.g. thread the existing configInitialized state into the hook and add it to this effect's deps (or register from the same effect that awaits config.initialize()), plus a test where getMemoryPressureMonitor() returns undefined first and a real monitor only after an awaited initialize().
Two related gaps to fix alongside (so the feature helps the autonomous-loop scenario it targets):
- Per-child monitor: subagent/background tool completions call
scheduleCheck()on a separateMemoryPressureMonitorthatgetMemoryPressureMonitor()lazily clones for prototype-inheriting configs (config.ts:3781-3795); that child monitor has noonStarvationCallback, so starvation during subagent loops won't reach the CLI even after the fix above. Propagate/route the callback to child monitors (or register against the root and have children delegate). - Core branch reachability: the core fallback
if (this.pendingCheck && now - this.lastCheckTime > 60_000)(memoryPressureMonitor.ts:330) appears unreachable from its only caller —scheduleCheck()runs in thefinallyof theawaitedexecuteSingleToolCall(coreToolScheduler.ts:2927), and the queued microtask drains between awaited completions, sopendingCheckisfalseat the next call. Worth a test that drives the real awaited sequence rather than two back-to-back synchronousscheduleCheck()calls.
— claude-opus-4-8[1m] via Qwen Code /qreview
@wenshao Thanks for the detailed review, sorry for wasting your time. |
|
@zzhenyao Thanks for working on the memory-monitor starvation fix — flagging the key blocker so you can pick it back up when you continue (I see it's still a draft). The main one is a Suggested fix: gate registration on init-readiness so the effect re-runs once the monitor exists — e.g. thread the existing Two related gaps worth fixing in the same pass, so the feature actually helps the autonomous-loop scenario it targets:
Full detail is in the inline thread on 中文说明@zzhenyao 感谢推进内存监控 starvation 的修复——把主要卡点标一下,方便你继续(看到还是 draft)。 最主要的是一条 建议修法: 把注册门控在初始化就绪上,让 monitor 存在后 effect 能重跑——比如把现有的 顺带建议同一轮一起修的两个相关缺口,让这个特性真正能帮到它针对的 autonomous-loop 场景:
完整细节在 |
What this PR does
Under autonomous agent/goal loops, the event loop has zero idle time —
queueMicrotaskandsetIntervalcallbacks never fire, so both memory monitors are completely starved. UI history grows until OOM.Fix: Core's
scheduleCheck()detects starvation (≥60s since last successful check) and falls back to synchronous execution. On fallback, it fires a heartbeat callback to CLI. CLI checks if its ownsetIntervalhas been silent for ≥60s, and if so runs the full memory check inline.Why it's needed
During autonomous agent/goal loops, the event loop has no idle time. Both monitors get starved:
scheduleCheck()usesqueueMicrotask— no microtask gap, never executesuseMemoryMonitorusessetInterval(30s/60s) — timer phase never reachedUI history grows unbounded
Reviewer Test Plan
Before / After
Before: Both monitors completely starved during autonomous loops. Memory grows until OOM.
After: Core detects starvation after 60s, falls back to synchronous check, notifies CLI via heartbeat. CLI detects its
setIntervalhasn't run in 60s, runs full check inline.How to verify
npm run build && npm run typecheck/goalwith continuous tool operations, confirm[MEMORY_USAGE]logs keep appearing (previously stopped after a few minutes).Tested on
Risk & Scope
setOnToolCompleteCallbackis additive.Linked Issues
Closes: #4815
Follow-up to #4824 #4892
中文
本 PR 做了什么
自主 agent/goal 循环下事件循环零空闲,
queueMicrotask和setInterval回调完全饿死,两个内存监控器都不执行。UI 历史无限增长。修复:Core 的
scheduleCheck()检测饿死(距上次成功检查 ≥60s)后降级为同步执行,同时通过心跳通知 CLI。CLI 发现自己的setInterval超过 60s 没跑,主动执行完整内存检查。为什么需要
自主 agent/goal 循环期间事件循环没有空闲,两个监控器都被饿死:
scheduleCheck()用queueMicrotask— 没有微任务间隙,不执行useMemoryMonitor用setInterval(30s/60s)— timer 阶段到不了UI 历史无限增长
审查者测试计划
改前 / 改后
改前: 自主循环期间两个监控器完全饿死,内存持续增长直到 OOM。
改后: Core 60s 后检测到饿死,降级同步检查,通过心跳通知 CLI。CLI 发现
setInterval超 60s 没跑,主动执行完整检查。如何验证
npm run build && npm run typecheck/goal,确认[MEMORY_USAGE]日志持续输出(之前几分钟后就停了)。测试环境
风险与范围
setOnToolCompleteCallback是新增接口。