feat(supervise): meter driver inference on ONE journal ledger + A++ spend observability #319
…ed pool + A++ spend observability
The driver's chat-LLM tokens — the largest single consumer in an agentic loop — were
invisible: routerDriverChat discarded the router's usage, and driver-executor summed only
spawned children's spend. So equal-k under-counted any driver arm and maxTurns=0 had no real
inference bound. This meters them, end to end, with a clean spend breakdown.
- BudgetPool.observe(spend) / observedTotal(): a direct free→committed debit (no reserve/
reconcile ticket) for the drivers' own inference. Preserves total ≡ free + reserved +
committed; free may go negative on overspend (honest exhaustion the in-loop guard reads).
- Scope.meter(spend, detail): observes the spend AND emits an agent.turn trace event (turn
index, tool calls, per-turn tokens/cost) — the live A++ view. All scopes share ONE pool, so
root and nested driver inference both land in observedTotal.
- DriverTurn gains usage/costUsd; routerDriverChat forwards the router's real usage; the driver
meters each turn (when usage is present — a scripted/offline turn meters nothing, so equal-k
stays exact in tests). This debit makes maxTurns=0 genuinely pool-bounded: a thinking driver
drains the pool → poolStarved halts it (proven by a never-stopping-driver test).
- SupervisedResult.winner gains spentBreakdown { driverInference, childWork } and spentTotal now
includes inference (spentTotalFromJournal + observedTotal). driverInference + childWork ===
spentTotal.
- Driver turns are NOT charged to the conserved iteration channel (turnSpend.iterations = 0) —
maxIterations budgets child rounds, not driver turns; counting them would conflate the two and
skew an equal-k iteration count. Turn count stays observable via the agent.turn events.
- trajectoryReport gains extraRootSpend: a coordination-driver equal-k arm passes
result.spentBreakdown.driverInference so report.total (→ equalKOnCost) matches spentTotal —
the journal ledger and the pool ledger agree (closes a latent cross-run equal-k divergence).
Adversarially reviewed: conservation SOUND and each token counted exactly once at root AND across
nested depth (proven by arithmetic + executed probes). Full suite 1002 pass; lint/typecheck/build
green. Built in an isolated worktree.
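The conservation claim in the `BudgetPool.observe` bullet can be illustrated with a minimal sketch. It assumes a single token channel and a hypothetical `createTokenPool` factory; the real `BudgetPool` tracks more channels and its actual signatures are not shown in this PR body.

```typescript
// Hypothetical, simplified pool: one channel (tokens) only.
// Names other than reserve/reconcile/observe/observedTotal are illustrative.
interface PoolState {
  total: number;
  free: number;
  reserved: number;
  committed: number;
}

function createTokenPool(total: number) {
  const s: PoolState = { total, free: total, reserved: 0, committed: 0 };
  let observed = 0;
  return {
    // Child work: hold tokens up front; fail closed if the pool cannot cover them.
    reserve(amount: number): boolean {
      if (amount > s.free) return false;
      s.free -= amount;
      s.reserved += amount;
      return true;
    },
    // Child work: settle a reservation against actual spend and release the rest.
    reconcile(reservedAmount: number, actual: number): void {
      s.reserved -= reservedAmount;
      s.committed += actual;
      s.free += reservedAmount - actual;
    },
    // Driver inference: the spend already happened, so debit free -> committed
    // directly. free may go negative, which signals exhaustion.
    observe(amount: number): void {
      s.free -= amount;
      s.committed += amount;
      observed += amount;
    },
    observedTotal: (): number => observed,
    readout: (): PoolState => ({ ...s }),
    conserved: (): boolean => s.total === s.free + s.reserved + s.committed,
  };
}

const pool = createTokenPool(1000);
pool.reserve(400); // a child reserves 400 tokens
pool.reconcile(400, 250); // the child actually spent 250
pool.observe(900); // driver inference overspends the remaining free tokens
const state = pool.readout();
console.log(state, pool.conserved());
```

In this sketch the overspend leaves `free` negative while the sum of the three buckets still equals `total`, which is the property an in-loop guard can read as exhaustion.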
tangletools
left a comment
✅ Auto-approved PR — 47e995c5
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-16T20:42:08Z
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 2 (1 low, 1 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 241.6s (2 bridge agents) |
| Total | 241.6s |
💰 Value — sound-with-nits
Adds metering of the driver's own LLM inference against the conserved budget pool — previously the largest token consumer was invisible to equal-k and the in-loop guard. Well-designed, in the codebase's grain, no better approach exists.
- What it does: Meters the coordination driver's own LLM inference tokens against the conserved budget pool. Previously, `routerDriverChat` discarded the router's `usage`/`costUsd`, and only spawned children's spend was tracked via reserve/reconcile, so the largest single token consumer in an agent loop was invisible to equal-k and the in-loop budget guard. This change adds: (1) `BudgetPool.observe(spend)`: a d…
- Goals it achieves: 1. Make the driver's own inference (the largest token consumer) visible to the conserved pool, so equal-k counts driver tokens and the in-loop guard (`poolStarved`) can halt a thinking driver when the pool runs dry. 2. Make `maxTurns=0` genuinely pool-bounded (previously only a `runawayTripwireTurns=2000` safety net; now tokens drain the pool → `poolStarved` halts). 3. Provide a clean A++ spend o…
- Assessment: The design is coherent and respects the codebase's grain at every layer: `observe()` mirrors the existing free/committed/reserved channel structure in `createBudgetPool` without breaking the invariant; `meter()` on `Scope` extends the same object that provides `spawn()`/`next()`, the natural extension point for "emit a trace event and debit the pool"; the event taxonomy (`agent.turn`) is an exist…
- Better / existing approach: none, this is the right approach. Two alternative paths were considered and rejected by the codebase's existing architecture: (a) Modeling driver turns as spawned pseudo-children with reserve/reconcile: would require a separate executor lifecycle per turn, fighting the driver-is-the-agent grain and adding enormous complexity for a per-turn metering need. (b) Journaling driver inference as events…
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🎯 Usefulness — sound
Closes a real accounting hole: driver LLM inference tokens were invisible to the conserved pool, the single-largest costed consumer in an agentic loop; the parallel observe channel fits naturally alongside reserve/reconcile and correctly debits the pool at all nesting levels.
- Integration: `Scope.meter()` is called from `coordinationDriverAgent.act()` (coordination-driver.ts:177), which is the general-purpose driver consumers wire with `routerDriverChat`. `routerDriverChat` (router-driver-chat.ts:27-30) now forwards the router's real `usage`/`costUsd`. The supervisor assembles `spentTotal` by summing journal-settled child-work + `pool.observedTotal()` (supervisor.ts:177-190). traje…
- Fit with existing patterns: Extends the existing two-channel pool design (reserve/reconcile for spawned children) with a parallel `observe` channel for non-reserved spend. The invariant `total ≡ free + reserved + committed` is preserved by construction (budget.ts:239-255: `observe` moves `free→committed` directly; `free` may go negative on overspend, which `poolStarved` reads as exhaustion). `Scope.meter` (scope.ts:363-382)…
- Real-world viability: Thread safety: JavaScript's event-loop model means `pool.observe()` (budget.ts:239) never races with `pool.reserve()`: meter happens during the driver's synchronous turn processing, and `poolStarved` reads `freeTokens` after the debit. Missing usage (mock/scripted turn): the `if (res.usage || res.costUsd !== undefined)` guard (coordination-driver.ts:166) skips metering exactly when there is nothi…
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: magic number added tests/loops/driver-inference-metering.test.ts
+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }
💰 Value Audit
🟡 extraRootSpend is manual caller-plumbing, not automatic reconciliation [maintenance]
`trajectoryReport` requires the caller to explicitly pass `result.spentBreakdown?.driverInference` as `extraRootSpend` for the trajectory ledger to agree with `spentTotal`. If the caller forgets, `equalKOnCost` would under-count the driver arm's tokens. This is documented in `wave-types.ts:544-551`, and the test at `tests/loops/driver-inference-metering.test.ts:234-283` shows correct usage. No better automatic approach exists without either journal bloat (pseudo-events for non-child spend) or bi…
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 86 | 79 | 79 |
| Confidence | 70 | 70 | 70 |
| Correctness | 86 | 79 | 79 |
| Security | 86 | 79 | 79 |
| Testing | 86 | 79 | 79 |
| Architecture | 86 | 79 | 79 |
Full multi-shot audit completed 2/2 planned shots over 10 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM poolStarved in-loop guard ignores usd channel — driver can overspend usd without halting — src/runtime/supervise/coordination-driver.ts
poolStarved (line 102-105) checks only `b.tokensLeft < perWorker.maxTokens && b.reservedTokens <= 0`. Before this PR, driver inference was never metered, so usd was only drained by child spawns (which fail-closed via reserve). Now that meter() debits usd via pool.observe() when usdCapped, the driver's own inference can drain freeUsd negative without poolStarved catching it. Impact: with maxTurns=0, a usd-capped pool (e.g., maxUsd: 1.0, maxTokens: 10_000_000), and per-turn cost of ~$0.01 / 200 tokens, the driver runs until either tokens drain (unlikely with a huge token ceiling) or the 2000-turn runaway tripwire fires, potentially overspend…
🟡 LOW isNonEmptySpend does not check ms channel — src/runtime/supervise/supervisor.ts
Line 452: isNonEmptySpend returns s.iterations > 0 || s.tokens.input > 0 || s.tokens.output > 0 || s.usd > 0 — omits s.ms. Currently harmless because driver turns set ms:0 (coordination-driver.ts:175), but if ms were ever populated on observed spend, a non-empty spend with only ms>0 would not produce a spentBreakdown. Add || s.ms > 0 for robustness.
🟡 LOW isNonEmptySpend omits ms channel — spentBreakdown gating inconsistent with addSpend — src/runtime/supervise/supervisor.ts
isNonEmptySpend (line 452-453) checks iterations, tokens.input, tokens.output, and usd, but NOT ms. A metered Spend with only ms>0 would be added to spentTotal via addSpend (line 442-449, which sums ms) but spentBreakdown would be omitted because isNonEmptySpend returns false. Not reachable via coordination-driver (hardcodes ms:0 at line 175), but the helper's contract is m
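As a sketch of the suggested fix, the predicate below checks every channel that the total sums, including `ms`. The `Spend` shape follows the fields named in these findings; the body is an illustration written for this thread, not the code in supervisor.ts.

```typescript
// Spend shape as described in the findings (iterations, tokens, usd, ms).
interface Spend {
  iterations: number;
  tokens: { input: number; output: number };
  usd: number;
  ms: number;
}

// True when any channel carries spend. Including ms keeps this gate
// consistent with a total that also sums ms.
function isNonEmptySpend(s: Spend): boolean {
  return (
    s.iterations > 0 ||
    s.tokens.input > 0 ||
    s.tokens.output > 0 ||
    s.usd > 0 ||
    s.ms > 0
  );
}

const msOnly: Spend = { iterations: 0, tokens: { input: 0, output: 0 }, usd: 0, ms: 42 };
const empty: Spend = { iterations: 0, tokens: { input: 0, output: 0 }, usd: 0, ms: 0 };
console.log(isNonEmptySpend(msOnly), isNonEmptySpend(empty));
```

With the `ms` term present, a spend that only carries wall-clock time still counts as non-empty, so a breakdown gated on this predicate would not be dropped while the total includes it.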
🟡 LOW Observability test only validates the first agent.turn event — tests/loops/driver-inference-metering.test.ts
Line 218-231: The test asserts `turnEvents.length === 3` but only inspects `turnEvents[0]`. The second and third events' payloads (turn indices 1 and 2, different toolCalls/spend) are assumed correct. The event structure is the same codepath for all events, so risk is low, but a typo in the detail spread (e.g. `turn` increment) would pass this test. Consider adding assertions on turnEvents[1].turn (should be 1) and turnEvents[2].toolCalls (should be []).
🟡 LOW meteredChat helper repeats last turn silently — could mask off-by-one in a future test — tests/loops/driver-inference-metering.test.ts
Lines 58-67: meteredChat returns `turns[Math.min(i, turns.length-1)] ?? {}` when i exceeds the array, silently repeating the last scripted turn instead of throwing or returning an empty stop-turn. In test 1 (line 78-90) this is fine: the last turn has no toolCalls so the driver stops. But if a future test adds turns where the last one has toolCalls, the driver would loop forever repeating it (bounded only by maxTurns or the pool). A defensive `{ toolCalls: [] }` sentinel or an explicit throw on…
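One way to picture the defensive sentinel is a scripted chat stub that answers with an explicit stop turn once the script is exhausted. The `ScriptedTurn` shape and the `scriptedChat` name are hypothetical; the real `meteredChat` helper and `DriverTurn` type are not reproduced in this thread.

```typescript
// Hypothetical turn shape for illustration only.
interface ScriptedTurn {
  content?: string;
  toolCalls: { name: string }[];
}

// Plays `turns` in order, then returns a STOP turn (no tool calls)
// instead of silently repeating the last scripted entry.
function scriptedChat(turns: readonly ScriptedTurn[]) {
  let i = 0;
  return {
    next(): ScriptedTurn {
      const turn = i < turns.length ? turns[i] : undefined;
      i += 1;
      return turn ?? { toolCalls: [] };
    },
  };
}

const chat = scriptedChat([{ toolCalls: [{ name: "list_questions" }] }]);
const first = chat.next(); // scripted turn with one tool call
const pastScript = chat.next(); // past the script: explicit stop turn
console.log(first.toolCalls.length, pastScript.toolCalls.length);
```

Throwing past the end of the script would be the stricter alternative; the sentinel keeps a test that over-runs its script terminating rather than failing.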
🟡 LOW Zero-costUsd forwarding path not covered — tests/loops/router-driver-chat.test.ts
The router-channel check uses `typeof r.costUsd === 'number'`, which forwards `costUsd: 0` (the router reports a real turn that cost $0). The test only covers non-zero (`0.013`) and undefined; the zero-case is a valid production scenario (unpriced model, free tier). Add a test with `costUsd: 0` to confirm the forwarding guard doesn't silently drop it.
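The difference between a `typeof` guard and a truthiness check is the whole point of this finding, so a small sketch may help. `RouterReply`, `forwardTurn`, and the `usage` field shape are assumptions made for illustration; only the `typeof r.costUsd === 'number'` guard comes from the finding.

```typescript
// Hypothetical shapes; the real router reply and DriverTurn types are not shown here.
interface Usage {
  input: number;
  output: number;
}
interface RouterReply {
  content: string;
  usage?: Usage;
  costUsd?: number;
}
interface DriverTurn {
  content: string;
  usage?: Usage;
  costUsd?: number;
}

// Forwards usage and cost only when the router actually reported them.
// typeof keeps a real $0 turn; a truthiness check (`if (r.costUsd)`) would drop it.
function forwardTurn(r: RouterReply): DriverTurn {
  const turn: DriverTurn = { content: r.content };
  if (r.usage) turn.usage = r.usage;
  if (typeof r.costUsd === "number") turn.costUsd = r.costUsd;
  return turn;
}

const freeTier = forwardTurn({ content: "ok", usage: { input: 10, output: 5 }, costUsd: 0 });
const unpricedOffline = forwardTurn({ content: "ok" });
console.log(freeTier.costUsd, "costUsd" in unpricedOffline);
```

A zero-cost turn keeps `costUsd: 0`, while a reply with no cost at all produces a turn without the field, which is the distinction the suggested test would pin down.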
tangletools · 2026-06-16T20:55:04Z · trace
Address the PR reviewer's findings on the inference-metering PR:
- MEDIUM: poolStarved ignored the usd channel. Now that meter() debits usd via observe(), a usd-capped pool with a large token ceiling (e.g. maxUsd:1, maxTokens:10M) could let the driver overspend usd up to the 2000-turn tripwire. poolStarved now breaks on usd exhaustion too; BudgetReadout gains `usdCapped` so the guard distinguishes a real usdLeft<=0 from an uncapped pool. New test: maxTurns=0 halts on the usd ceiling with the token ceiling untouched.
- isNonEmptySpend now checks the ms channel (consistent with addSpend) so the spentBreakdown gate matches the total on every channel.
- Tests: assert all three agent.turn events (not just the first); meteredChat returns a STOP turn past the script instead of silently repeating the last; cover the real costUsd:0 forwarding path.
Full suite 1004 pass; lint/typecheck/build green.
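A rough sketch of a guard that halts on either channel follows. The field names (`tokensLeft`, `reservedTokens`, `usdLeft`, `usdCapped`) and the `perWorker` budget shape come from the review comments in this thread; how the real `poolStarved` combines them is an assumption here.

```typescript
// Readout fields as named in the review comments; other fields are omitted.
interface BudgetReadout {
  tokensLeft: number;
  reservedTokens: number;
  usdLeft: number;
  usdCapped: boolean;
}
interface Budget {
  maxIterations: number;
  maxTokens: number;
}

// Assumed combination: keep waiting while a child is in flight, otherwise
// stop when a worker is unaffordable in tokens or a capped usd pool is spent.
function poolStarved(b: BudgetReadout, perWorker: Budget): boolean {
  if (b.reservedTokens > 0) return false; // a child is still running
  const tokenStarved = b.tokensLeft < perWorker.maxTokens;
  const usdStarved = b.usdCapped && b.usdLeft <= 0; // uncapped pools never usd-starve
  return tokenStarved || usdStarved;
}

const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 };
const usdExhausted: BudgetReadout = { tokensLeft: 9_999_000, reservedTokens: 0, usdLeft: -0.02, usdCapped: true };
const uncapped: BudgetReadout = { tokensLeft: 9_999_000, reservedTokens: 0, usdLeft: 0, usdCapped: false };
const childInFlight: BudgetReadout = { tokensLeft: 200, reservedTokens: 1000, usdLeft: 0.5, usdCapped: true };
console.log(poolStarved(usdExhausted, perWorker), poolStarved(uncapped, perWorker), poolStarved(childInFlight, perWorker));
```

The `usdCapped` flag is what lets the guard tell a spent usd ceiling apart from a pool that never had one, where `usdLeft <= 0` carries no meaning.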
tangletools
left a comment
✅ Auto-approved PR — b9adc546
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-16T21:43:32Z
tangletools
left a comment
⚪ Value Audit — audit-incomplete
| Verdict | audit-incomplete |
| Concerns | 1 (1 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 90.0s (2 bridge agents) |
| Total | 90.0s |
💰 Value — error
value agent produced no parseable value-audit JSON.
- Model: kimi-code/kimi-for-coding
- Bridge attempts: 3
- Bridge error: opencode/deepseek/deepseek-v4-pro: Bridge returned 503: {"error":{"message":"cli-bridge admission timed out after 30000ms","type":"admission_rejected","reason":"queue_timeout","admission":{"active":24,"queued":3,"maxActive":24,"maxQueue":32}}}; opencode/zai-coding-plan/glm-5.1: Bridge returned 503: {"error":{"message":"cli-bridge admission timed out after 30000ms","type":"admission_rejected","reas
🎯 Usefulness — error
usefulness agent produced no parseable value-audit JSON.
- Model: kimi-code/kimi-for-coding
- Bridge attempts: 3
- Bridge error: opencode/deepseek/deepseek-v4-pro: Bridge returned 503: {"error":{"message":"cli-bridge admission timed out after 30000ms","type":"admission_rejected","reason":"queue_timeout","admission":{"active":24,"queued":2,"maxActive":24,"maxQueue":32}}}; opencode/zai-coding-plan/glm-5.1: Bridge returned 503: {"error":{"message":"cli-bridge admission timed out after 30000ms","type":"admission_rejected","reas
🔎 Heuristic Signals
🟡 Cruft: magic number added tests/loops/driver-inference-metering.test.ts
+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }
✅ No Blockers
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 80 | 80 | 80 |
| Confidence | 70 | 70 | 70 |
| Correctness | 80 | 80 | 80 |
| Security | 80 | 80 | 80 |
| Testing | 80 | 80 | 80 |
| Architecture | 80 | 80 | 80 |
Full multi-shot audit completed 2/2 planned shots over 10 changed files. Global verifier still owns final merge decision.
🟡 LOW extraRootSpend is opt-in — existing trajectoryReport callers silently exclude driver inference if extended to driver arms — src/runtime/personify/trajectory.ts
Lines 107-119: trajectoryReport folds extraRootSpend onto the root node only when the caller explicitly passes it. The bench gate (bench/src/gate.ts:380) and test helpers call trajectoryReport without it — currently correct because those arms are fanout/combinator shapes that never meter. But if the gate is later extended to coordination-driver arms, report.total would silently diverge from SupervisedResult.spentTotal (driver inference excluded), and equalKOnCost would compare on child-work-only cost — a silent integrity regression exactly analogous to the hole this PR closes. The doc on extraRootSpend (wave-types.ts:544-551) says 'Omit for a fanout
🟡 LOW observe function debits freeIterations but driver never uses this channel — src/runtime/supervise/budget.ts
At line 248, `freeIterations -= spend.iterations`. The sole caller in coordination-driver.ts:178 explicitly stamps `iterations: 0` with a rationale comment that the iteration channel budgets child rounds, not driver turns. The iteration debit in `observe` is therefore dead code for the current path. If a future driver calls `scope.meter` with `iterations > 0`, it would consume the iteration budget alongside child rounds, which may or may not be intended. Consider either removing the iteration debit from `observe` (if driver inference should NEVER consume iterations) or documenting the contract more explicitly (e.g., a comment that iteration debit is intenti…
🟡 LOW poolStarved usd check is asymmetric with token check — driver burns inference usd between perWorker.maxUsd and zero — src/runtime/supervise/coordination-driver.ts
Lines 105-111: poolStarved checks `tokenStarved = b.tokensLeft < perWorker.maxTokens` (can't afford a worker) but `usdStarved = b.usdCapped && b.usdLeft <= 0` (completely exhausted). When perWorker has a maxUsd ceiling and the pool has some usd remaining but less than perWorker.maxUsd, the driver cannot spawn (reserve fails on the usd admission check at budget.ts:171) yet poolStarved returns false, so the driver keeps looping, each turn costing inference usd, until freeUsd drops to <=0. For tokens this is caught early (tokenStarved fires when tokensLeft < perWorker.maxTokens); for usd it is not. Impact: wasted driver inference spend in the…
🟡 LOW Inconsistent usd-cap guard between poolExhausted and poolStarved — src/runtime/supervise/supervisor.ts
At line 403, `poolExhausted` checks `opts.budget.maxUsd !== undefined && r.usdLeft <= 0` to decide usd exhaustion. In coordination-driver.ts:109, `poolStarved` checks `b.usdCapped && b.usdLeft <= 0`. Both guard the same condition, whether the pool was constructed with a usd ceiling, but through different paths (`opts.budget.maxUsd` vs `b.usdCapped`). They are functionally identical because `usdCapped` is derived from `root.maxUsd !== undefined` and the pool is constructed from `opts.budget`. However, the readout already carries `usdCapped`; the guard in poolExhausted should use `r.usdCapped` for consistency and to avoid coupling to the construction-tim…
🟡 LOW No spentTotal/spentBreakdown on no-winner results — driver inference cost invisible on failure paths — src/runtime/supervise/supervisor.ts
Lines 193-210: The no-winner paths (both 'out === undefined' and 'act rejected') return only {kind, reason, tree, downCount} — no spentTotal or spentBreakdown. A driver that ran 50 turns of inference ($5) then produced no winner reports zero cost in the result. The SupervisedResult type (types.ts:445-449) never carried spend on no-winner, so this is pre-existing — but the PR makes it more consequential now that driver inference is a real, potentially large cost line. The cost IS observable via agent.turn hooks (if wired), but the pool is created locally and not exposed post-run. For billing/spend-observability parity, consider adding an optional spe
🟡 LOW No coordination-driver-level test for usage-present + costUsd-absent combination — tests/loops/driver-inference-metering.test.ts
The coordination driver meters when `res.usage || res.costUsd !== undefined` (line 172). The existing tests always include both. A turn with usage but no costUsd would set usd:0 in the Spend (via `res.costUsd ?? 0`), which is correct behavior but untested at the driver integration level. The router-driver-chat.test.ts:131 test covers the DriverChat-level absence, but not the downstream metering path. Low severity: the ?? 0 fallback is trivially correct.
🟡 LOW No test for poolStarved reservedTokens > 0 early-return path — tests/loops/driver-inference-metering.test.ts
coordination-driver.ts:163 has `if (b.reservedTokens > 0) return false`: when a child is in flight, the driver continues to await it rather than falsely declaring poolStarved. None of the test scenarios have a spawned child in flight when poolStarved is checked (the never-stopping tests only call `list_questions`, which doesn't spawn). A test with a slow/streaming worker whose tokens are reserved but not yet settled would cover this guard. Low risk: the guard is a two-liner with a clear boolean check, but it gates a critical correctness property (don't finalize early when a child is running).
🟡 LOW No test for spentBreakdown omission when driver inference is empty (isNonEmptySpend=false path) — tests/loops/driver-inference-metering.test.ts
supervisor.ts:188 conditionally omits spentBreakdown via isNonEmptySpend(driverInference). Every test in this file exercises a driver with real usage, so spentBreakdown is always defined. No test verifies the opposite: a blind/fanout arm (no driver inference) should produce a result with spentTotal but NO spentBreakdown field. This is a coverage gap for the isNonEmptySpend gate, not a correctness bug — the implementation is correct. Fix: add a test with a meteredChat that returns turns without usage/costUsd and assert spentBreakdown is undefined.
🟡 LOW USD overshoot by one turn is accepted behavior — documented but implicit — tests/loops/driver-inference-metering.test.ts
Test at line 250-293 (usd-bounded) demonstrates the driver spends 0.12 USD against a 0.10 cap before halting. The in-loop guard (`poolStarved`) checks before each turn, so the driver can overshoot by one turn's worth of inference. The coordination-driver.ts metering comment states 'the spend already happened, so accounting records reality; the in-loop guard prevents MORE', so this is explicit design. The pool doesn't reject overspend retroactively (that would break accounting). If a single turn's cost could be extremely large (e.g. a very long prompt), the overshoot could be significant. Consider documenting the max-overshoot bound (max poss…
🟡 LOW usdCapped field added to BudgetReadout is a minor type-breaking change — tests/loops/driver-inference-metering.test.ts
budget.ts:45 adds `usdCapped: boolean` to `BudgetReadout`. Any external code that destructures or exhaustively checks `BudgetReadout` will fail to compile. The module is `@experimental`, mitigating impact, but consumers using `scope.budget` will see a new required field. Tests don't exercise this externally; they only exercise `readout()` internally. No consumers in the repo appear affected (the field is only read inside `poolStarved` in coordination-driver.ts:164-166). Verdict: intentional, additive to experimental API; note for release notes.
tangletools · 2026-06-16T21:54:59Z · trace
… event = the twin of observe)
Driver inference lived only in the pool (observe) while child work lived in BOTH the pool (reconcile) and the journal (settled). So two cost ledgers could disagree: the cross-run equal-k gate reads the journal and silently under-counted any coordination-driver arm (the manual extraRootSpend bridge was the symptom). Close it at the root: give the driver's inference its missing journal twin.
- New SpawnEvent kind `metered`: a driver's own inference spend, the journal TWIN of the pool's `observe` exactly as `settled` is the twin of `reconcile`. Scope.meter journals it (and is now async/awaited, cost-critical). A driver re-homes its nested subtree's inference up to its parent as one metered event (ExecutorResult.metered → finalizeSettlement), mirroring how settled spend rolls child work up, so summing ANY sub-tree root yields its true driver-inference cost.
- ONE ledger = the journal: spentFromJournal sums settled (childWork) + metered (driverInference); trajectoryReport folds metered onto node ownSpend. So spentTotal == trajectoryReport.total == the equal-k gate's number by construction, at any depth. Removed the pool→supervisor bridge (observedTotal, now dead) and trajectoryReport's extraRootSpend (now automatic).
- Journal integrity preserved: metered is exempt from the cursor-uniqueness guard, skipped by replay (not a settlement), and folded in a separate additive pass by materialize/trajectory (order-independent). observe stays as the live pool debit for the in-loop guard.
Adversarially reviewed SOUND on both journal/replay integrity AND conservation: each token counted exactly once at root AND across nested depth (traced root→mid→sub→worker), nested inference reaches the runId tree once via the re-homed copy, no replay/seq corruption. New nested-driver test proves re-homing (spentTotal 280/160 across 2 levels). Full suite 1005 pass; lint/typecheck/build green.
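The single-ledger idea can be sketched as one additive pass over journal events. The event shapes and the two-field `Spend` below are simplified assumptions for illustration; only the event kinds `settled` and `metered` and the childWork/driverInference split come from the commit message.

```typescript
// Simplified spend (the real Spend carries more channels).
interface Spend {
  tokens: number;
  usd: number;
}

// Hypothetical event shapes; only the kinds `settled` and `metered` are from the PR.
type JournalEvent =
  | { kind: "spawned"; nodeId: string }
  | { kind: "settled"; nodeId: string; spend: Spend }
  | { kind: "metered"; nodeId: string; spend: Spend };

const add = (a: Spend, b: Spend): Spend => ({ tokens: a.tokens + b.tokens, usd: a.usd + b.usd });

// Sums settled events as child work and metered events as driver inference.
// Pure addition, so the result does not depend on event order.
function spentFromJournal(events: readonly JournalEvent[]) {
  let childWork: Spend = { tokens: 0, usd: 0 };
  let driverInference: Spend = { tokens: 0, usd: 0 };
  for (const e of events) {
    if (e.kind === "settled") childWork = add(childWork, e.spend);
    else if (e.kind === "metered") driverInference = add(driverInference, e.spend);
  }
  return { childWork, driverInference, total: add(childWork, driverInference) };
}

const journal: JournalEvent[] = [
  { kind: "spawned", nodeId: "w1" },
  { kind: "metered", nodeId: "root", spend: { tokens: 60, usd: 0.25 } },
  { kind: "settled", nodeId: "w1", spend: { tokens: 100, usd: 0.5 } },
  { kind: "metered", nodeId: "root", spend: { tokens: 20, usd: 0.25 } },
];
const ledger = spentFromJournal(journal);
console.log(ledger);
```

Because both totals are derived from the same event list, any report built on the journal sees the same number, which is the property the commit describes as agreement "by construction".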
tangletools
left a comment
✅ Auto-approved PR — 54bbddea
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-16T23:18:35Z
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 4 (1 low, 3 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 159.3s (2 bridge agents) |
| Total | 159.3s |
💰 Value — sound-with-nits
Meters the driver's own LLM inference tokens against the conserved budget pool — closing a real equal-k integrity hole — with a clean observe/meter/metered twin design that follows the codebase's existing reserve/reconcile/settled pattern.
- What it does: The driver's own chat-LLM tokens (the largest token consumer in an agentic loop) were invisible to the conserved budget pool. This PR adds: (1) `BudgetPool.observe(spend)`: a direct `free→committed` debit bypassing reserve/reconcile, since inference isn't a pre-sized child; (2) `Scope.meter(spend, detail)`: debits the pool, journals a `metered` SpawnEvent (the durable twin), and emits an agent.…
- Goals it achieves: (1) Equal-k conservation: coordination-driver arms were under-counted because their inference tokens were dropped on the floor; now they count. (2) Make `maxTurns=0` genuinely pool-bounded: a thinking driver now drains tokens/usd → `poolStarved` halts it, instead of spinning to the 2000-turn anti-runaway tripwire. (3) Single cost ledger: the journal is the one source of truth for all spend, so t…
- Assessment: This is a sound, well-architected change that closes a real integrity hole. The design respects the codebase's core invariant (`total ≡ free + reserved + committed`) by using a direct `free→committed` debit, the only honest option when the spend already happened (you can't `reserve` before an LLM call because you don't know the token count ahead of time). The `metered` journal event mirrors the e…
- Better / existing approach: No materially better architecture exists. The driver's inference cannot go through `reserve`/`reconcile` (LLM token count isn't known until after the call completes), and tracking it outside the pool would defeat equal-k conservation. The `observe`/`meter`/`metered` design is the right seam. I searched for existing spend utilities: the codebase has `addTokenUsage` in `util.ts:135` for the token su…
- Model: opencode/zai-coding-plan/glm-5.1
- Bridge attempts: 2
- Bridge warning: opencode/deepseek/deepseek-v4-pro: Bridge returned 503: {"error":{"message":"cli-bridge admission timed out after 30000ms","type":"admission_rejected","reason":"queue_timeout","admission":{"active":24,"queued":0,"maxActive":24,"maxQueue":32}}}
🎯 Usefulness — sound-with-nits
Closes a real accounting hole — driver inference was invisible to the conserved pool — with a well-integrated observe/meter twin that mirrors the existing reserve/reconcile pattern end-to-end.
- Integration: Fully wired and reachable. The chain: routerDriverChat forwards usage/costUsd (router-driver-chat.ts:29-30) → coordinationDriverAgent calls scope.meter (coordination-driver.ts:183) → Scope.meter calls pool.observe + journals a metered event + emits agent.turn trace (scope.ts:372-403). Every journal consumer is updated: spentFromJournal separates settled/metered (supervisor.ts:432-437), trajectoryR
- Fit with existing patterns: Fits the codebase's grain precisely. observe() is explicitly framed as 'the twin of the pool debit as settled is the twin of reconcile' — the same dual live+durable pattern already used for reserve/reconcile. Scope.meter sits naturally alongside spawn/next/send. The metered SpawnEvent variant extends the union without disturbing cursor semantics. The iterations:0 decision (not charging driver turn
- Real-world viability: Holds up under realistic use. observe() never throws (correct — the spend already happened, so accounting records reality). free going negative on overspend is the intended exhaustion signal; poolStarved reads tokensLeft < perWorker.maxTokens which trips on negative correctly. The await on scope.meter ensures the journal event lands before the join-barrier roll-up. Nested re-homing avoids double-c
- Model: opencode/zai-coding-plan/glm-5.1
- Bridge attempts: 2
- Bridge warning: opencode/deepseek/deepseek-v4-pro: Bridge returned 503: {"error":{"message":"cli-bridge admission timed out after 30000ms","type":"admission_rejected","reason":"queue_timeout","admission":{"active":24,"queued":1,"maxActive":24,"maxQueue":32}}}
🔎 Heuristic Signals
🟡 Cruft: magic number added tests/loops/driver-inference-metering.test.ts
+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }
💰 Value Audit
🟡 Per-channel Spend addition is copy-pasted in 6+ locations; a shared util would prevent drift [duplication]
The exact same `{ iterations: a+b, tokens: {input: a+b, output: a+b}, usd: a+b, ms: a+b }` pattern appears in: supervisor.ts:452 `addSpend`, spawn-journal.ts:449 `addJournalSpend`, trajectory.ts:272 `addNodeSpend`, trajectory.ts:291 `addSpend` (in-place), bench/src/gate.ts:290 `addSpend`, plus the array variants in driver-executor.ts:249 `sumSpend` and driver-executor.ts:264 `sumMetered`. This PR introduced 2 of these copies (`addJournalSpend`, `addNodeSpend`). Additionally, `isNonZeroSpend` (dr…
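A shared utility of the kind this finding suggests might look like the sketch below. The `Spend` fields match the pattern quoted above; `zeroSpend` is a hypothetical helper name, and nothing here is taken from the repository's actual helpers.

```typescript
// Spend shape as quoted in the finding.
interface Spend {
  iterations: number;
  tokens: { input: number; output: number };
  usd: number;
  ms: number;
}

// Fresh zero value on every call, so callers never share a mutable object.
const zeroSpend = (): Spend => ({ iterations: 0, tokens: { input: 0, output: 0 }, usd: 0, ms: 0 });

// The single per-channel addition every call site could import.
function addSpend(a: Spend, b: Spend): Spend {
  return {
    iterations: a.iterations + b.iterations,
    tokens: { input: a.tokens.input + b.tokens.input, output: a.tokens.output + b.tokens.output },
    usd: a.usd + b.usd,
    ms: a.ms + b.ms,
  };
}

// Array variant built on the same addition, so the two cannot drift apart.
function sumSpend(spends: readonly Spend[]): Spend {
  return spends.reduce((acc, s) => addSpend(acc, s), zeroSpend());
}

const summed = sumSpend([
  { iterations: 1, tokens: { input: 100, output: 20 }, usd: 0.25, ms: 300 },
  { iterations: 0, tokens: { input: 50, output: 10 }, usd: 0.5, ms: 0 },
]);
console.log(summed);
```

Adding a channel to `Spend` would then require touching one function, and the compiler would flag the missing field in `addSpend` and `zeroSpend` rather than leaving a stale copy behind.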
🎯 Usefulness Audit
🟡 Benchmark has a stale local routerDriverChat copy that won't meter inference [integration]
bench/src/atom-humaneval.mts:62 defines its OWN routerDriverChat that does NOT forward usage/costUsd (lines 74-81 return only content+toolCalls). This is the one real-router consumer outside the library export. Until it switches to the shared import from router-driver-chat.ts, the benchmark's driver arms won't actually meter inference — the exact hole this PR closes. Pre-existing code, not introduced here, but worth a one-line swap.
🟡 PR body claims extraRootSpend but code uses a cleaner re-homing approach instead [problem-fit]
The PR body states 'trajectoryReport gains extraRootSpend' but grep finds no extraRootSpend anywhere in the codebase. The actual implementation folds metered events directly onto each node's ownSpend (trajectory.ts:99-103), so trajectoryReport.total includes driver inference with no caller plumbing. This is arguably better than the claimed extraRootSpend parameter — but the PR body is stale relative to the shipped design. Cosmetic only; no code action needed.
✅ No Blockers

| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 60 | 74 | 60 |
| Confidence | 75 | 75 | 75 |
| Correctness | 60 | 74 | 60 |
| Security | 60 | 74 | 60 |
| Testing | 60 | 74 | 60 |
| Architecture | 60 | 74 | 60 |
Full multi-shot audit completed 3/3 planned shots over 12 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM materializeTreeView metered pass has no direct test coverage — src/durable/spawn-journal.ts
Lines 420-424 add a third pass in materializeTreeView that iterates events, filters for kind==='metered', and calls addJournalSpend to accumulate driver inference onto node snapshots. This is a public API (exported via src/runtime/index.ts:35). No test exercises this path with metered events — the only materializeTreeView call in tests (tests/loops/supervise.test.ts:840) uses a tree without metered events. The trajectoryReport analog is tested in driver-inference-metering.test.ts:390-429, but that is a separate code path (trajectory.ts implements its own node reconstruction). Risk: the accumulation logic (correct pass ordering, requireNode interaction, add
🟠 MEDIUM Sub-driver metered inference orphaned on crash — pool/journal cost mismatch — src/runtime/supervise/driver-executor.ts
When a nested sub-driver meters tokens via `scope.meter()` over multiple turns, then crashes on a later turn (e.g. `chat.next()` throws a network error), the metered events are durably written to the nested tree's journal (scope.ts:381), but the crash propagates through `driver.act()` → `execute()` (driver-executor.ts:149, no try/catch) → `runChild` catch (scope.ts:604), which returns a `down` settlement WITHOUT computing the `metered` sum (line 609: `return downRecord(...)`, no metered field). The `down` path in `finalizeSettlement` (scope.ts:437-447) never re-homes metered events to the parent tree. Result: `pool.observe()` already debited the…
🟡 LOW JSDoc for materializeTreeView doesn't mention metered events — src/durable/spawn-journal.ts
Lines 377-381: The JSDoc says 'Folds spawned/settled/cancelled into a per-node snapshot' but the function now also folds metered events into snapshots (the third pass at lines 420-424 accumulates spend). The comment at lines 385-388 similarly only mentions spawned/settled/cancelled. This is a documentation gap, not a logic bug — the code works correctly.
🟡 LOW No direct unit test for materializeTreeView with metered events — src/durable/spawn-journal.ts
materializeTreeView's new metered fold pass (lines 418-424) is the exact mirror of trajectoryReport's fold (trajectory.ts:99-103), but supervise.test.ts:838-844 tests materializeTreeView only on non-metered journals. The driver-inference-metering test (line 414) journals a metered event but calls trajectoryReport, not materializeTreeView. A regression in materializeTreeView's metered handling (e.g., wrong accumulation order, spend double-counting) would not be caught by existing unit tests. Add a test that journals spawned + se
🟡 LOW metered event ordering in replaySpawnTree's seq sort is implicit — src/durable/spawn-journal.ts
Line 310 sorts ALL events (including metered) by seq, then line 318 skips metered. Since meterSeq overlaps cursorSeq, a metered event could sort between two settled events. This is currently harmless (metered is skipped), but the correctness of the sort relies on the skip being applied. If a future change reads from 'ordered' without the skip, the interleaving would silently reorder settlements relative to driver-inference timing. The skip at [line 318](https://github.com/tangle-network/agent-runtime/blob/54bbddea177e6e6715e55d08ce8
🟡 LOW requireNode error message doesn't mention metered events — src/durable/spawn-journal.ts
Line 461: `throw new Error('spawn journal corrupted: settle/cancel for node ...')`. The `requireNode` function at line 458 is now called for metered events (line 422) in addition to settled/cancelled (line 409, 414). If a metered event references a non-existent node, the error…
🟡 LOW No-op meter write for turns with costUsd:0 but no usage — src/runtime/supervise/coordination-driver.ts
Line 172: The guard `if (res.usage || res.costUsd !== undefined)` triggers metering when costUsd is 0 but usage is absent. This produces a zero-spend `observe()` (harmless) and a zero-spend `metered` journal event (minor journal noise). The downstream `isNonZeroSpend`/`isNonEmptySpend` gates suppress it from outputs, so no functional impact. Could tighten the guard to `if (res.usage || (res.costUsd !== undefined && res.costUsd > 0))` but the current behavior is defensible (recording a zero-cost priced turn is honest accounting).
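The difference between the two guards can be sketched as a pair of predicates. `TurnResult` is a hypothetical stand-in carrying only the two fields the guard reads, and the function names are illustrative; the conditions themselves are the ones quoted in the finding.

```typescript
// Hypothetical minimal turn shape: only the fields the guard inspects.
interface TurnResult {
  usage?: { input: number; output: number }
  costUsd?: number
}

// Current guard (as quoted): meters whenever either field is present,
// including a priced turn whose cost happens to be exactly 0.
function shouldMeter(res: TurnResult): boolean {
  return Boolean(res.usage) || res.costUsd !== undefined
}

// Tightened guard (as proposed): a cost-only turn must be strictly positive.
function shouldMeterStrict(res: TurnResult): boolean {
  return Boolean(res.usage) || (res.costUsd !== undefined && res.costUsd > 0)
}

// The only case where the two disagree: cost reported as 0, usage absent.
const zeroCostTurn: TurnResult = { costUsd: 0 }
const guardsDisagree = shouldMeter(zeroCostTurn) !== shouldMeterStrict(zeroCostTurn)
```

A turn with neither field is skipped by both guards, and a turn carrying `usage` is metered by both, so the choice only affects whether a zero-cost priced turn leaves a journal record.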
🟡 LOW Stale tripwire comment contradicts the metering this PR adds — src/runtime/supervise/coordination-driver.ts
Lines 93-96: The comment on `runawayTripwireTurns` states 'the driver's own inference tokens are not yet metered against the conserved pool, so they alone cannot drain it.' But this PR adds exactly that metering via `scope.meter()` → `pool.observe()`. With usage-reporting chat seams, driver inference NOW drains the pool. The tripwire's remaining purpose is narrower: it catches a driver whose chat seam reports NO usage (so `observe` is never called). The comment should be updated to reflect the post-metering reality, otherwise a future reader will misunderstand why the tripwire exists. Fix: update the comment to say driver inference IS now met…
🟡 LOW meterSeq shares ordinal 0 with spawnOrdinal on root's first metered turn — src/runtime/supervise/scope.ts
Both `meterSeq` and `spawnOrdinal` start at 0 (lines 189-191). The root node's first `metered` event and the root's `spawned` event both get seq=0. The comment correctly states that `metered` events use a separate seq namespace outside cursor uniqueness, so no collision occurs in the journal guard. However, if any downstream code ever sorts events by seq without `kind` discrimination, the co-numbering could produce ambiguous ordering. Not a functional bug today but a latent design fragility.
🟡 LOW notifyRuntimeHookEvent called after pool+journal commit in meter() — hook failure is silent — src/runtime/supervise/scope.ts
`meter()` at line 376 calls `args.pool.observe(spend)` and `await args.journal.appendEvent(...)` BEFORE `notifyRuntimeHookEvent(...)` at line 390. If the hook throws (e.g. a downstream consumer's event handler crashes), the pool and journal are already mutated but the hook error propagates out of `meter()`, which is `await`ed by the coordination driver loop (coordination-driver.ts:183). This would abort the driver's loop on a trace consumer failure, which may be undesirable: the driver should be resilient to observability failu…
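One way to get the resilience this finding asks for is to isolate the hook call. The sketch below is an assumption about how that could look, not the repo's code: `Hook`, `notifySafely`, and the simplified `meter` are hypothetical names, and the `committed` counter merely stands in for the pool debit plus journal append.

```typescript
type TraceEvent = { kind: string }
type Hook = (event: TraceEvent) => void | Promise<void>

// Hypothetical helper: run an observability hook so that neither a synchronous
// throw nor an async rejection can escape into the caller.
function notifySafely(hook: Hook, event: TraceEvent, onError: (err: unknown) => void): void {
  try {
    const result = hook(event)
    // An async hook's rejection is routed to onError instead of propagating.
    if (result instanceof Promise) result.catch(onError)
  } catch (err) {
    onError(err)
  }
}

// Stand-ins for the durable side effects and for an error sink.
let committed = 0
const hookErrors: unknown[] = []

// Simplified meter: commit first, then notify; a failing hook is recorded, not rethrown.
function meter(tokens: number, hook: Hook): void {
  committed += tokens // stands in for pool.observe(...) + journal append
  notifySafely(hook, { kind: 'agent.turn' }, (err) => hookErrors.push(err))
}
```

Whether swallowing hook errors is the right policy is a design choice: it keeps the driver loop alive on a trace-consumer crash, at the price of needing some other channel (here `hookErrors`) to surface the failure.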
🟡 LOW Nested metering test omits usd breakdown assertions — tests/loops/driver-inference-metering.test.ts
Test 're-homes a NESTED sub-driver inference up the tree' (line 133) asserts only the tokens channel of driverInference and spentTotal (lines 205-207). The mid and root driver turns carry no costUsd, so driverInference.usd is implicitly 0 — but not asserting it means a regression that accidentally drops usd metering in the nested re-home path (driver-executor.ts sumMetered) would not be caught by this test. The flat test ([line 125](https://github.com/tangle-network/agent-runtime/blob/54bbddea1
🟡 LOW No test for mixed metered/unmetered driver turns — tests/loops/driver-inference-metering.test.ts
coordination-driver.ts line 172 gates metering on `if (res.usage || res.costUsd !== undefined)`. A turn with NEITHER field is skipped (no meter, no journal event, no agent.turn event). All meteredChat turns in the tests carry usage on every turn. A turn carrying costUsd but no usage (tokens defaulting to 0/0) is also untested. The router-driver-chat test at line 131 does verify the router level (usage absent → not forwarded), but the driver-level gate that consumes it is not exercised with a mixed s…
🟡 LOW Test comment imprecision in maxTurns=0 usd-bounded test — tests/loops/driver-inference-metering.test.ts
Line 357: comment says '~2-3 turns of $0.04 fit' but the test asserts exactly 3 turns (maxUsd=0.1, costUsd=0.04×3=0.12, guard breaks at top of turn 4 when usdLeft=-0.02≤0). The ~ qualifier is misleading — it's deterministically 3 turns. The assertion itself is correct; only the comment is fuzzy. No behavior impact.
🟡 LOW USD-bounded test does not assert the exact usdLeft readout — tests/loops/driver-inference-metering.test.ts
Test 'maxTurns=0 is bounded by usd too' (line 328) asserts only `n === 3` and `result.kind === 'no-winner'` (lines 370-371). The comment traces usdLeft: 0.1 → 0.06 → 0.02 → -0.02, but the test never reads `scope.budget.usdLeft` to prove the negative readout (unlike the budget-pool observe test at line 444 which does assert the negative token value).
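The usdLeft trace in this finding can be reproduced with a few lines of arithmetic. This is a standalone sketch of a check-at-top-of-turn guard, assuming the guard halts when `usdLeft <= 0`; `runUntilStarved` is a hypothetical name and does not touch the real scope or pool.

```typescript
// Simulates a driver that spends a fixed cost per turn and is only stopped
// by a guard evaluated at the top of each turn.
function runUntilStarved(maxUsd: number, costPerTurn: number): { turns: number; usdLeft: number } {
  if (costPerTurn <= 0) throw new RangeError('costPerTurn must be positive or the loop never ends')
  let usdLeft = maxUsd
  let turns = 0
  while (usdLeft > 0) {
    turns += 1 // the turn runs because budget was still positive at the top
    usdLeft -= costPerTurn // debit after the turn; may overshoot below zero
  }
  return { turns, usdLeft }
}

// maxUsd = 0.1, cost = 0.04 per turn: 0.1 → 0.06 → 0.02 → -0.02, so 3 turns run.
const usdTrace = runUntilStarved(0.1, 0.04)
```

Because the values are binary floating point, the final readout is approximately -0.02 rather than exactly -0.02, so an assertion on it is safer written as `usdLeft < 0` or with a small tolerance.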
tangletools · 2026-06-16T23:30:45Z · trace
…path + close review nits
Address the PR review of the single-ledger unification:
- MEDIUM (real drift): a NESTED sub-driver that crashed mid-run kept its metered inference in the pool (observe) but never re-homed it to the parent journal: the down/abort settle path didn't carry metered. So spentTotal/trajectory silently under-counted a crashed sub-driver, the exact pool↔journal drift the unification was meant to kill. Fix: `Executor.metered()` (replacing `ExecutorResult.metered`) caches the nested-tree inference on BOTH the success AND crash paths; the scope re-homes it on every settle exit (done, aborted, crash). New test proves a crashed sub-driver's partial inference (40/20) lands in the journal.
- MEDIUM (test gap): add a direct `materializeTreeView` test for the metered fold (the resume-path twin of trajectoryReport), incl. the seq-collision case.
- Bench: `bench/src/atom-humaneval.mts` had a local `routerDriverChat` copy that dropped usage/costUsd, so its driver arms never metered. Swap to the shared export; delete the dead local copy + helpers. Now the one real-router consumer actually meters inference.
- Nits: assert usd flows through the nested re-home; fix the stale tripwire comment (driver inference IS metered now; the tripwire only catches a no-usage chat seam); JSDoc.
Full suite 1007 pass; lint/typecheck/build green.
tangletools
left a comment
✅ Auto-approved PR — c73059a6
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-16T23:47:49Z
tangletools
left a comment
🟢 Value Audit — sound
| | |
|---|---|
| Verdict | sound |
| Concerns | 1 (1 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 155.0s (2 bridge agents) |
| Total | 155.0s |
💰 Value — sound
Meters the coordination driver's own LLM inference into the conserved budget pool — the largest per-run token consumer that was previously invisible to equal-k — with dual live/journal accounting and per-turn trace observability. Sound design, follows existing patterns, well tested.
- What it does: Adds driver inference metering to the conserved budget pool via two parallel ledgers: `BudgetPool.observe()` (live free→committed debit for the in-loop guard) and `Scope.meter()` (journaled `metered` event + `agent.turn` trace event). The `DriverTurn` type gains `usage`/`costUsd`; `routerDriverChat` forwards the router's real usage. Each driver turn debits the shared pool, so equal-k counts the dr…
- Goals it achieves: 1) Make equal-k compute matching fair by counting the driver's own LLM tokens (previously invisible; the largest per-run consumer). 2) Make `maxTurns=0` genuinely pool-bounded: a thinking driver now drains the pool → `poolStarved` halts it (proven by the never-stopping-driver test). 3) Unify cost accounting on one journal ledger: `settled` events carry child work, `metered` events carry driver in…
- Assessment: The change is coherent and follows the codebase's grain precisely. It mirrors the existing `reserve`/`reconcile` + `settled` dual-ledger pattern with a matching `observe` + `metered` pair. The invariant (total ≡ free + reserved + committed) is preserved by construction: `observe` is a direct `free → committed` debit. The `metered` event is NOT a settlement, so replay/cursor ordering is untouche…
- Better / existing approach: none; this is the right approach. Searched for existing spend-observation mechanisms, `agent.turn` events, `metered` event kinds, `spentBreakdown`, `usdCapped`, and prior work on these paths. No existing mechanism covered this gap: `reserve`/`reconcile` is per-child (driver isn't a child of itself), `UsageEvent` streams are for spawned children only, and the pool had no direct-debit operation. Th…
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🎯 Usefulness — sound
Wires driver inference into the conserved pool and journal ledger correctly, making equal-k honest and maxTurns=0 genuinely budget-bounded; follows the existing reconcile/journal twin pattern and handles crash/depth/no-usage edge cases.
- Integration: Fully reachable and consumed. routerDriverChat (src/runtime/supervise/router-driver-chat.ts:29-30) forwards the router's real usage/costUsd → coordinationDriverAgent calls scope.meter() (src/runtime/supervise/coordination-driver.ts:184) → pool.observe() debits live (src/runtime/supervise/budget.ts:236) + journal.appendEvent() persists durably (src/runtime/supervise/scope.ts:389). The supervisor su
- Fit with existing patterns: Extends the existing design's twin pattern: metered ↔ observe mirrors settled ↔ reconcile. BudgetPool.observe moves free → committed directly (preserving total ≡ free + reserved + committed), same channels and invariants as reserve/reconcile. Scope.meter is a first-class Scope verb alongside spawn/next/send. The metered event type in SpawnEvent (src/runtime/supervise/types.ts:385-398) is a natural
- Real-world viability: Edge cases are handled: (1) overspend: free goes negative on observe — the honest exhaustion signal poolStarved reads (src/runtime/supervise/coordination-driver.ts:109-111), in-loop guard halts before more spend; (2) crash/re-home: a crashed sub-driver's partial inference is read from its nested tree on the catch path (driver-executor.ts:182) and re-homed as a metered event on the down settlement
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: magic number added tests/loops/driver-inference-metering.test.ts
+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }
…de shape
Harness-agnostic: a per-harness part-decoder adapter registry (toolPartDecoders) replaces
the one mega-decoder. Each harness owns its real wire shape; the flow downstream is identical;
add a harness = add a decoder + one registry entry.
- decodeOpencodePart — VALIDATED ON A LIVE BOX: a real opencode session emitted
{type:'tool', tool:'read'|'bash'|'write', callID, state:{status,input}} — the decoder extracted
3/3 real tool calls (the initial guess got 0/29; caught + fixed against reality). Terminal-state
only (pending/running skipped) + callId de-dup (opencode streams each call 3x + a raw copy).
- decodeAnthropicPart (claude-code tool_use), decodeOpenAiPart (codex/router/kimi function).
- decodeToolPart(part, harness?) selects the adapter or tries all; sources thread through.
- bench/src/decoder-live.mts: the live-box validation harness (no mock).
trace 16 + watch 5 + analyze 2 tests; typecheck/build/lint clean. (Pre-existing main failure in
driver-inference-metering #319 is unrelated — fails on origin/main, untouched here.)
…ent) — main was red (#322) #318 renamed the coordination wait verb await_next → await_event; #319's metering test (merged around the same time) still scripted await_next. Since the tool no longer exists, the scripted driver's 'collect the worker' turn returned {error:'unknown tool: await_next'} and the worker was never drained → result 'no-winner' → the 2 winner-expecting tests failed on main. Updated all 6 refs to await_event (the driver already fails loud on the unknown tool — only the fixed script couldn't recover, unlike a real LLM). Full suite 1030 pass (was 2 failing on main); no stale await_next remains anywhere.
…race analysis (#321) * refactor(supervise): substrate-agnostic TraceSource (sandbox-first) — replace router-only seam The detector/analyzer were built router-only (onToolStep/ToolStep) — premature; production is sandbox/fleet. Corrected to one interface over agent-eval's ToolSpan: - TraceSource (trace-source.ts): a worker's tool calls as ToolSpans, from an OWNED loop (createPushTraceSource — router/cli-bridge dispatch) OR a SANDBOX box (sandboxSessionTraceSource(box, sessionId) → box.messages() session parts → decodeToolPart, defensive across OpenAI + harness shapes). The SDK exposes tool calls via the session (SessionMessage.parts / streamPrompt), NOT exportTrace (sandbox telemetry) — corrected. - watchTrace (online) + analyzeTrace (settle) now consume a TraceSource, not a router seam. - DELETED the router-only createDetectorMonitor/ToolStep/createTrajectoryRecorder/RecordedToolStep. Common currency = ToolSpan; same agent-eval detectors + batch analyzers over any substrate. trace-source 11 + watchTrace 5 + analyzeTrace 2 tests incl. the sandbox box path (mock box → session parts → loop detected); full suite 1023 pass; typecheck/build/lint clean. Live-box validation of the exact harness part-shape pending (decoder is defensive). * feat(supervise): per-harness decoder registry + LIVE-validated opencode shape Harness-agnostic: a per-harness part-decoder adapter registry (toolPartDecoders) replaces the one mega-decoder. Each harness owns its real wire shape; the flow downstream is identical; add a harness = add a decoder + one registry entry. - decodeOpencodePart — VALIDATED ON A LIVE BOX: a real opencode session emitted {type:'tool', tool:'read'|'bash'|'write', callID, state:{status,input}} — the decoder extracted 3/3 real tool calls (the initial guess got 0/29; caught + fixed against reality). Terminal-state only (pending/running skipped) + callId de-dup (opencode streams each call 3x + a raw copy). 
- decodeAnthropicPart (claude-code tool_use), decodeOpenAiPart (codex/router/kimi function). - decodeToolPart(part, harness?) selects the adapter or tries all; sources thread through. - bench/src/decoder-live.mts: the live-box validation harness (no mock). trace 16 + watch 5 + analyze 2 tests; typecheck/build/lint clean. (Pre-existing main failure in driver-inference-metering #319 is unrelated — fails on origin/main, untouched here.) * fix(supervise): address #321 audit + honest live-validation of harness adapters Audit (Needs Work) fixes: - decodeOpenAiPart guard matched any object with a .function field ({type:'text',function}) → match on type ('function'|'tool_call') only, name still required. - subscriber isolation: a throwing onSignal/subscriber can no longer crash the producer (both push + parts source fan-outs now try/catch each subscriber). - restored the dropped evidence:{action} assertion in the watchTrace stuck-loop test. - decoder-live.mts: ALWAYS deletes the box (try/finally), exits non-zero on a decode miss (CI-gateable), parameterized by HARNESS, dumps part-type vocabulary + harness error events. LIVE validation across harnesses (REAL boxes, no mock): - opencode: PROVEN (3/3 real tool calls — already in this PR). - claude-code + codex: the harness PROCESS crashes 'exit 1' with the test model (deepseek-v4-flash via the openai-compat router) before any tool call — zero tool parts to validate against. This is a harness↔model protocol block, NOT a decoder issue. Their adapters stay (public formats) but are docstring-flagged NOT-live-validated; owed the same proof opencode got with a compatible model. - Coverage honesty: repeated-action works for ALL harnesses (name+args); error-streak is opencode- validated only (claude-code/codex errors live in separate result blocks not yet decoded). trace 16 + watch 5 + analyze 2 tests; typecheck/build/lint clean. 
* feat(trace): ground per-harness tool decoders in cli-bridge backend shapes Confirm each harness adapter against cli-bridge's canonical parsers (the real readers of each harness's native output): - decodeAnthropicPart handles kimi's tool_use_id/tool fallbacks (kimi.ts) - decodeOpenAiPart cites kimi.ts top-level function form - codex emits no structured tool calls (codex.ts) — per-tool detection unavailable for codex from any path; documented as a harness property
Why
The driver's chat-LLM tokens, the largest single token consumer in an agentic loop, were invisible to the conserved pool. `routerDriverChat` discarded the router's `usage`; `driver-executor` summed only spawned children's spend. So equal-k under-counted any coordination-driver arm, and `maxTurns=0` had no real inference bound (only a turn tripwire). This meters the driver's inference end-to-end, with a clean spend breakdown and live observability.

What
- `BudgetPool.observe(spend)` / `observedTotal()`: a direct `free → committed` debit (no reserve/reconcile ticket) for the drivers' own inference. Preserves `total ≡ free + reserved + committed`; `free` may go negative on overspend (the honest exhaustion signal the in-loop guard reads).
- `Scope.meter(spend, detail)`: observes the spend and emits an `agent.turn` trace event (turn index, tool calls, per-turn tokens/cost), the live A++ view. All scopes share one pool, so root and nested driver inference land in `observedTotal`.
- `DriverTurn` gains `usage`/`costUsd`; `routerDriverChat` forwards the router's real usage; the driver meters each turn (only when usage is present; a scripted/offline turn meters nothing, so equal-k stays exact in tests). This debit makes `maxTurns=0` genuinely pool-bounded: a thinking driver drains the pool → `poolStarved` halts it (proven by a never-stopping-driver test).
- `SupervisedResult` winner gains `spentBreakdown { driverInference, childWork }` and `spentTotal` now includes inference. `driverInference + childWork === spentTotal`.
- Driver turns are not charged to the conserved iteration channel (`turnSpend.iterations = 0`): `maxIterations` budgets child rounds, not driver turns; counting them would conflate the two and skew an equal-k iteration count. Turn count stays observable via the `agent.turn` events.
- `trajectoryReport` gains `extraRootSpend`: a coordination-driver equal-k arm passes `result.spentBreakdown.driverInference` so `report.total` (→ `equalKOnCost`) matches `spentTotal`. The journal ledger and the pool ledger agree (closes a latent cross-run equal-k divergence the review flagged).

Verification
Adversarially reviewed against the conserved-pool invariant (the foundation of the whole instrument): SOUND. `total ≡ free + reserved + committed` is preserved, and each token is counted exactly once at root AND across nested depth (proven by arithmetic and executed probes, including the R→sub-driver→worker case). New tests: spentTotal+breakdown; `maxTurns=0` bounded by inference; per-turn `agent.turn` events; pool `observe`/`observedTotal`; equal-k ledger reconciliation via `extraRootSpend`. Full suite 1002 pass; lint/typecheck/build green. Built in an isolated git worktree.

Residual (documented, not silent)
A real router turn that returns no usage (or an unpriced model) leaves that turn uncounted: graceful (no crash/double-count), acceptable for a token-primary instrument. The nested sub-driver-inside-an-arm case still needs `extraRootSpend` per level for the cross-run gate; the common case (the arm is the driver) is fully covered.
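To make the conserved-pool invariant concrete, here is a minimal single-channel sketch. It is an illustration under stated assumptions, not the repo's `BudgetPool`: the class name `TokenPool`, the method signatures, and the reduction to one numeric channel are all hypothetical; only the accounting rule (`observe` moves spend directly from free to committed, and free may go negative) comes from the description above.

```typescript
// Single-channel toy pool; the real pool tracks several channels (tokens, usd, ...).
class TokenPool {
  readonly total: number
  free: number
  reserved = 0
  committed = 0
  observed = 0

  constructor(total: number) {
    this.total = total
    this.free = total
  }

  // Child work: hold budget up front...
  reserve(amount: number): void {
    this.free -= amount
    this.reserved += amount
  }

  // ...then settle the hold against what was actually spent.
  reconcile(reservedAmount: number, actual: number): void {
    this.reserved -= reservedAmount
    this.committed += actual
    this.free += reservedAmount - actual
  }

  // Driver inference: direct free → committed debit, no ticket.
  // free is allowed to go negative; that is the exhaustion signal.
  observe(amount: number): void {
    this.free -= amount
    this.committed += amount
    this.observed += amount
  }

  // total ≡ free + reserved + committed must hold after every operation.
  invariantHolds(): boolean {
    return this.free + this.reserved + this.committed === this.total
  }
}

const pool = new TokenPool(100)
pool.reserve(30) // free 70, reserved 30
pool.reconcile(30, 20) // free 80, reserved 0, committed 20
pool.observe(90) // free -10, committed 110, observed 90
```

After the overspending `observe`, `free` is -10 yet the invariant still holds (-10 + 0 + 110 = 100), which is why a guard can treat `free <= 0` as a trustworthy exhaustion readout.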