feat(supervise): meter driver inference on ONE journal ledger + A++ spend observability #319

Merged
drewstone merged 4 commits into main from feat/meter-driver-inference on Jun 16, 2026
Conversation

@drewstone

Contributor

Why

The driver's chat-LLM tokens — the largest single token consumer in an agentic loop — were invisible to the conserved pool. routerDriverChat discarded the router's usage; driver-executor summed only spawned children's spend. So equal-k under-counted any coordination-driver arm, and maxTurns=0 had no real inference bound (only a turn tripwire). This meters the driver's inference end-to-end, with a clean spend breakdown and live observability.

What

  • BudgetPool.observe(spend) / observedTotal() — a direct free → committed debit (no reserve/reconcile ticket) for the drivers' own inference. Preserves total ≡ free + reserved + committed; free may go negative on overspend (the honest exhaustion signal the in-loop guard reads).
  • Scope.meter(spend, detail) — observes the spend and emits an agent.turn trace event (turn index, tool calls, per-turn tokens/cost) — the live A++ view. All scopes share one pool, so root and nested driver inference land in observedTotal.
  • DriverTurn gains usage/costUsd; routerDriverChat forwards the router's real usage; the driver meters each turn (only when usage is present — a scripted/offline turn meters nothing, so equal-k stays exact in tests). This debit makes maxTurns=0 genuinely pool-bounded: a thinking driver drains the pool → poolStarved halts it (proven by a never-stopping-driver test).
  • SupervisedResult winner gains spentBreakdown { driverInference, childWork } and spentTotal now includes inference. driverInference + childWork === spentTotal.
  • Driver turns are NOT charged to the iteration channel (turnSpend.iterations = 0) — maxIterations budgets child rounds, not driver turns; counting them would conflate the two and skew an equal-k iteration count. Turn count stays observable via the agent.turn events.
  • trajectoryReport gains extraRootSpend — a coordination-driver equal-k arm passes result.spentBreakdown.driverInference so report.total (→ equalKOnCost) matches spentTotal. The journal ledger and the pool ledger agree (closes a latent cross-run equal-k divergence the review flagged).
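
The conservation argument for observe can be illustrated with a minimal sketch. It assumes a single token channel and a hypothetical `createSketchPool` factory; the real BudgetPool has more channels (iterations, usd, ms) and a ticketed reserve/reconcile API that is not reproduced here.

```typescript
// Hypothetical single-channel sketch, not the repository's BudgetPool.
interface SketchReadout {
  free: number
  reserved: number
  committed: number
}

function createSketchPool(total: number) {
  let free = total
  let reserved = 0
  let committed = 0
  let observed = 0
  return {
    // Child work: reserve up front and fail closed if it does not fit.
    reserve(tokens: number): boolean {
      if (tokens > free) return false
      free -= tokens
      reserved += tokens
      return true
    },
    // Child work: settle a reservation at its actual cost and refund the rest.
    reconcile(reservedTokens: number, actual: number): void {
      reserved -= reservedTokens
      committed += actual
      free += reservedTokens - actual
    },
    // Driver inference: the spend already happened, so debit free -> committed
    // directly. free is allowed to go negative on overspend.
    observe(tokens: number): void {
      free -= tokens
      committed += tokens
      observed += tokens
    },
    observedTotal: (): number => observed,
    readout: (): SketchReadout => ({ free, reserved, committed }),
  }
}

const sketchPool = createSketchPool(1000)
sketchPool.reserve(400) // free 600, reserved 400
sketchPool.reconcile(400, 250) // free 750, reserved 0, committed 250
sketchPool.observe(900) // free -150, committed 1150
const sketchReadout = sketchPool.readout()
// free + reserved + committed is still 1000 even though free is negative.
```

Because observe moves the same amount out of free and into committed, the sum is unchanged by construction, and a negative free value is what an in-loop guard would read as exhaustion.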

Verification

Adversarially reviewed against the conserved-pool invariant (the foundation of the whole instrument): SOUND. total ≡ free + reserved + committed is preserved, and each token is counted exactly once at root AND across nested depth (proven by arithmetic and executed probes, including the R→sub-driver→worker case). New tests: spentTotal+breakdown; maxTurns=0 bounded by inference; per-turn agent.turn events; pool observe/observedTotal; equal-k ledger reconciliation via extraRootSpend. Full suite 1002 pass; lint/typecheck/build green. Built in an isolated git worktree.

Residual (documented, not silent)

A real router turn that returns no usage (or an unpriced model) leaves that turn uncounted — graceful (no crash/double-count), acceptable for a token-primary instrument. The nested sub-driver-inside-an-arm case still needs extraRootSpend per level for the cross-run gate; the common case (the arm is the driver) is fully covered.

…ed pool + A++ spend observability

The driver's chat-LLM tokens — the largest single consumer in an agentic loop — were
invisible: routerDriverChat discarded the router's usage, and driver-executor summed only
spawned children's spend. So equal-k under-counted any driver arm and maxTurns=0 had no real
inference bound. This meters them, end to end, with a clean spend breakdown.

- BudgetPool.observe(spend) / observedTotal(): a direct free→committed debit (no reserve/
  reconcile ticket) for the drivers' own inference. Preserves total ≡ free + reserved +
  committed; free may go negative on overspend (honest exhaustion the in-loop guard reads).
- Scope.meter(spend, detail): observes the spend AND emits an agent.turn trace event (turn
  index, tool calls, per-turn tokens/cost) — the live A++ view. All scopes share ONE pool, so
  root and nested driver inference both land in observedTotal.
- DriverTurn gains usage/costUsd; routerDriverChat forwards the router's real usage; the driver
  meters each turn (when usage is present — a scripted/offline turn meters nothing, so equal-k
  stays exact in tests). This debit makes maxTurns=0 genuinely pool-bounded: a thinking driver
  drains the pool → poolStarved halts it (proven by a never-stopping-driver test).
- SupervisedResult.winner gains spentBreakdown { driverInference, childWork } and spentTotal now
  includes inference (spentTotalFromJournal + observedTotal). driverInference + childWork ===
  spentTotal.
- Driver turns are NOT charged to the conserved iteration channel (turnSpend.iterations = 0) —
  maxIterations budgets child rounds, not driver turns; counting them would conflate the two and
  skew an equal-k iteration count. Turn count stays observable via the agent.turn events.
- trajectoryReport gains extraRootSpend: a coordination-driver equal-k arm passes
  result.spentBreakdown.driverInference so report.total (→ equalKOnCost) matches spentTotal —
  the journal ledger and the pool ledger agree (closes a latent cross-run equal-k divergence).

Adversarially reviewed: conservation SOUND and each token counted exactly once at root AND across
nested depth (proven by arithmetic + executed probes). Full suite 1002 pass; lint/typecheck/build
green. Built in an isolated worktree.
tangletools previously approved these changes Jun 16, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 47e995c5

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-16T20:42:08Z

@tangletools tangletools left a comment

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 2 (1 low, 1 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 241.6s (2 bridge agents)
Total 241.6s

💰 Value — sound-with-nits

Adds metering of the driver's own LLM inference against the conserved budget pool — previously the largest token consumer was invisible to equal-k and the in-loop guard. Well-designed, in the codebase's grain, no better approach exists.

  • What it does: Meters the coordination driver's own LLM inference tokens against the conserved budget pool. Previously, routerDriverChat discarded the router's usage/costUsd, and only spawned children's spend was tracked via reserve/reconcile — so the largest single token consumer in an agent loop was invisible to equal-k and the in-loop budget guard. This change adds: (1) BudgetPool.observe(spend) — a d
  • Goals it achieves: 1. Make the driver's own inference (the largest token consumer) visible to the conserved pool — so equal-k counts driver tokens and the in-loop guard (poolStarved) can halt a thinking driver when the pool runs dry. 2. Make maxTurns=0 genuinely pool-bounded (previously only a runawayTripwireTurns=2000 safety net; now tokens drain the pool → poolStarved halts). 3. Provide a clean A++ spend o
  • Assessment: The design is coherent and respects the codebase's grain at every layer: observe() mirrors the existing free/committed/reserved channel structure in createBudgetPool without breaking the invariant; meter() on Scope extends the same object that provides spawn()/next() — the natural extension point for "emit a trace event and debit the pool"; the event taxonomy (agent.turn) is an exist
  • Better / existing approach: none — this is the right approach. Two alternative paths were considered and rejected by the codebase's existing architecture: (a) Modeling driver turns as spawned pseudo-children with reserve/reconcile: would require a separate executor lifecycle per turn, fighting the driver-is-the-agent grain and adding enormous complexity for a per-turn metering need. (b) Journaling driver inference as events
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1

🎯 Usefulness — sound

Closes a real accounting hole: driver LLM inference tokens were invisible to the conserved pool, the single-largest costed consumer in an agentic loop; the parallel observe channel fits naturally alongside reserve/reconcile and correctly debits the pool at all nesting levels.

  • Integration: Scope.meter() is called from coordinationDriverAgent.act() (coordination-driver.ts:177), which is the general-purpose driver consumers wire with routerDriverChat. routerDriverChat (router-driver-chat.ts:27-30) now forwards the router's real usage/costUsd. The supervisor assembles spentTotal by summing journal-settled child-work + pool.observedTotal() (supervisor.ts:177-190). `traje
  • Fit with existing patterns: Extends the existing two-channel pool design (reserve/reconcile for spawned children) with a parallel observe channel for non-reserved spend. The invariant total ≡ free + reserved + committed is preserved by construction (budget.ts:239-255: observe moves free→committed directly; free may go negative on overspend, which poolStarved reads as exhaustion). Scope.meter (scope.ts:363-382)
  • Real-world viability: Thread safety: JavaScript's event-loop model means pool.observe() (budget.ts:239) never races with pool.reserve() — meter happens during the driver's synchronous turn processing, and poolStarved reads freeTokens after the debit. Missing usage (mock/scripted turn): the if (res.usage || res.costUsd !== undefined) guard (coordination-driver.ts:166) skips metering exactly when there is nothi
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: magic number added tests/loops/driver-inference-metering.test.ts

+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }

💰 Value Audit

🟡 extraRootSpend is manual caller-plumbing, not automatic reconciliation [maintenance]

trajectoryReport requires the caller to explicitly pass result.spentBreakdown?.driverInference as extraRootSpend for the trajectory ledger to agree with spentTotal. If the caller forgets, equalKOnCost would under-count the driver arm's tokens. This is documented in wave-types.ts:544-551, and the test at tests/loops/driver-inference-metering.test.ts:234-283 shows correct usage. No better automatic approach exists without either journal bloat (pseudo-events for non-child spend) or bi


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260616T205212Z

@tangletools

✅ No Blockers — 47e995c5

Readiness 79/100 · Confidence 70/100 · 6 findings (1 medium, 5 low)

| | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 86 | 79 | 79 |
| Confidence | 70 | 70 | 70 |
| Correctness | 86 | 79 | 79 |
| Security | 86 | 79 | 79 |
| Testing | 86 | 79 | 79 |
| Architecture | 86 | 79 | 79 |

Full multi-shot audit completed 2/2 planned shots over 10 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM poolStarved in-loop guard ignores usd channel — driver can overspend usd without halting — src/runtime/supervise/coordination-driver.ts

poolStarved (line 102-105) checks only b.tokensLeft < perWorker.maxTokens && b.reservedTokens <= 0. Before this PR, driver inference was never metered, so usd was only drained by child spawns (which fail-closed via reserve). Now that meter() debits usd via pool.observe() when usdCapped, the driver's own inference can drain freeUsd negative without poolStarved catching it. Impact: with maxTurns=0, a usd-capped pool (e.g., maxUsd: 1.0, maxTokens: 10_000_000), and per-turn cost of ~$0.01 / 200 tokens, the driver runs until either tokens drain (unlikely with a huge token ceiling) or the 2000-turn runaway tripwire fires — potentially overspend

🟡 LOW isNonEmptySpend does not check ms channel — src/runtime/supervise/supervisor.ts

Line 452: isNonEmptySpend returns s.iterations > 0 || s.tokens.input > 0 || s.tokens.output > 0 || s.usd > 0 — omits s.ms. Currently harmless because driver turns set ms:0 (coordination-driver.ts:175), but if ms were ever populated on observed spend, a non-empty spend with only ms>0 would not produce a spentBreakdown. Add || s.ms > 0 for robustness.
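
A sketch of the suggested fix follows. The `SketchSpend` shape is inferred from the fields this finding names (iterations, tokens.input, tokens.output, usd, ms) and is an assumption; the repository's actual Spend type and addSpend helper may differ.

```typescript
// Assumed shape, reconstructed from the fields named in the finding.
interface SketchSpend {
  iterations: number
  tokens: { input: number; output: number }
  usd: number
  ms: number
}

// Sums every channel, including ms.
function addSpendSketch(a: SketchSpend, b: SketchSpend): SketchSpend {
  return {
    iterations: a.iterations + b.iterations,
    tokens: {
      input: a.tokens.input + b.tokens.input,
      output: a.tokens.output + b.tokens.output,
    },
    usd: a.usd + b.usd,
    ms: a.ms + b.ms,
  }
}

// Checks the same channels that addSpendSketch sums, so a spend that changes
// the total can never be reported as empty.
function isNonEmptySpendSketch(s: SketchSpend): boolean {
  return (
    s.iterations > 0 ||
    s.tokens.input > 0 ||
    s.tokens.output > 0 ||
    s.usd > 0 ||
    s.ms > 0
  )
}

const zeroSpend: SketchSpend = { iterations: 0, tokens: { input: 0, output: 0 }, usd: 0, ms: 0 }
const msOnlySpend: SketchSpend = { ...zeroSpend, ms: 1200 }
const msOnlyTotal = addSpendSketch(zeroSpend, msOnlySpend)
```

With the ms check in place, `isNonEmptySpendSketch(msOnlySpend)` is true, which matches the fact that `msOnlyTotal.ms` is non-zero.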

🟡 LOW isNonEmptySpend omits ms channel — spentBreakdown gating inconsistent with addSpend — src/runtime/supervise/supervisor.ts

isNonEmptySpend (line 452-453) checks iterations, tokens.input, tokens.output, and usd, but NOT ms. A metered Spend with only ms>0 would be added to spentTotal via addSpend (line 442-449, which sums ms) but spentBreakdown would be omitted because isNonEmptySpend returns false. Not reachable via coordination-driver (hardcodes ms:0 at line 175), but the helper's contract is m

🟡 LOW Observability test only validates the first agent.turn event — tests/loops/driver-inference-metering.test.ts

Line 218-231: The test asserts turnEvents.length === 3 but only inspects turnEvents[0]. The second and third events' payloads (turn indices 1 and 2, different toolCalls/spend) are assumed correct. The event structure is the same codepath for all events, so risk is low, but a typo in the detail spread (e.g. turn increment) would pass this test. Consider adding assertions on turnEvents[1].turn (should be 1) and turnEvents[2].toolCalls (should be []).

🟡 LOW meteredChat helper repeats last turn silently — could mask off-by-one in a future test — tests/loops/driver-inference-metering.test.ts

Lines 58-67: meteredChat returns turns[Math.min(i, turns.length-1)] ?? {} when i exceeds the array, silently repeating the last scripted turn instead of throwing or returning an empty stop-turn. In test 1 (line 78-90) this is fine — the last turn has no toolCalls so the driver stops. But if a future test adds turns where the last one has toolCalls, the driver would loop forever repeating it (bounded only by maxTurns or the pool). A defensive { toolCalls: [] } sentinel or an explicit throw on
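
One way to express the defensive sentinel this finding describes is sketched below. The `ScriptedTurn` shape and the `scriptedChatSketch` name are hypothetical stand-ins for the test helper, not the code in the test file.

```typescript
// Hypothetical turn shape for a scripted test driver.
interface ScriptedTurn {
  toolCalls: string[]
  usage?: { input: number; output: number }
}

// Returns the scripted turns in order, then an explicit stop turn
// (no tool calls, no usage) instead of repeating the last entry.
function scriptedChatSketch(turns: ScriptedTurn[]): () => ScriptedTurn {
  let i = 0
  return () => {
    const turn = i < turns.length ? turns[i] : undefined
    i += 1
    return turn ?? { toolCalls: [] }
  }
}

const nextTurn = scriptedChatSketch([
  { toolCalls: ["list_questions"], usage: { input: 100, output: 20 } },
])
const firstTurn = nextTurn() // the scripted turn
const pastScript = nextTurn() // stop turn: toolCalls is empty
```

A script whose last entry still has tool calls then ends after one extra turn rather than looping until maxTurns or the pool stops it.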

🟡 LOW Zero-costUsd forwarding path not covered — tests/loops/router-driver-chat.test.ts

The router-channel check uses typeof r.costUsd === 'number', which forwards costUsd: 0 (the router reports a real turn that cost $0). The test only covers non-zero (0.013) and undefined — the zero-case is a valid production scenario (unpriced model, free tier). Add a test with costUsd: 0 to confirm the forwarding guard doesn't silently drop it.
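
The difference between a `typeof` guard and a truthiness guard for a zero cost can be shown in a small sketch. The `RouterReplySketch` and `TurnCostSketch` types and both function names are hypothetical; only the `typeof r.costUsd === 'number'` check comes from the finding.

```typescript
// Hypothetical shapes for a router reply and the cost fields a driver turn carries.
interface RouterReplySketch {
  usage?: { input: number; output: number }
  costUsd?: number
}
interface TurnCostSketch {
  usage?: { input: number; output: number }
  costUsd?: number
}

// typeof guard: forwards a real $0 turn and omits the field only when it is absent.
function forwardCost(r: RouterReplySketch): TurnCostSketch {
  const out: TurnCostSketch = {}
  if (r.usage) out.usage = r.usage
  if (typeof r.costUsd === "number") out.costUsd = r.costUsd
  return out
}

// Truthiness guard, shown for contrast: 0 is falsy, so a $0 turn is dropped.
function forwardCostTruthy(r: RouterReplySketch): TurnCostSketch {
  const out: TurnCostSketch = {}
  if (r.usage) out.usage = r.usage
  if (r.costUsd) out.costUsd = r.costUsd
  return out
}

const freeTierReply: RouterReplySketch = { usage: { input: 50, output: 10 }, costUsd: 0 }
const forwarded = forwardCost(freeTierReply) // costUsd is 0
const dropped = forwardCostTruthy(freeTierReply) // costUsd is undefined
```

A test for the zero case would assert the first behavior, so that a later refactor to a truthiness check fails loudly.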


tangletools · 2026-06-16T20:55:04Z · trace

Address the PR reviewer's findings on the inference-metering PR:

- MEDIUM: poolStarved ignored the usd channel. Now that meter() debits usd via observe(), a
  usd-capped pool with a large token ceiling (e.g. maxUsd:1, maxTokens:10M) could let the driver
  overspend usd up to the 2000-turn tripwire. poolStarved now breaks on usd exhaustion too;
  BudgetReadout gains `usdCapped` so the guard distinguishes a real usdLeft<=0 from an uncapped
  pool. New test: maxTurns=0 halts on the usd ceiling with the token ceiling untouched.
- isNonEmptySpend now checks the ms channel (consistent with addSpend) so the spentBreakdown gate
  matches the total on every channel.
- Tests: assert all three agent.turn events (not just the first); meteredChat returns a STOP turn
  past the script instead of silently repeating the last; cover the real costUsd:0 forwarding path.
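
The guard change in the first item can be sketched as follows. The field names (tokensLeft, reservedTokens, usdLeft, usdCapped, perWorker.maxTokens) come from the review findings; the exact function shape is an assumption, not the code in coordination-driver.ts.

```typescript
// Readout fields named in the review; other BudgetReadout fields are omitted.
interface ReadoutSketch {
  tokensLeft: number
  reservedTokens: number
  usdLeft: number
  usdCapped: boolean
}

function poolStarvedSketch(b: ReadoutSketch, perWorker: { maxTokens: number }): boolean {
  // A child is still in flight: keep waiting rather than declaring starvation.
  if (b.reservedTokens > 0) return false
  // Tokens: starved once the pool cannot afford another worker.
  const tokenStarved = b.tokensLeft < perWorker.maxTokens
  // Usd: only meaningful when the pool was built with a usd ceiling.
  const usdStarved = b.usdCapped && b.usdLeft <= 0
  return tokenStarved || usdStarved
}

const worker = { maxTokens: 1000 }
// Usd-capped pool with a huge token ceiling: halts on usd alone.
const usdDrained = poolStarvedSketch(
  { tokensLeft: 9_000_000, reservedTokens: 0, usdLeft: -0.02, usdCapped: true },
  worker,
)
// Uncapped pool: usdLeft <= 0 carries no meaning and must not halt the driver.
const uncapped = poolStarvedSketch(
  { tokensLeft: 9_000_000, reservedTokens: 0, usdLeft: 0, usdCapped: false },
  worker,
)
```

Here `usdDrained` is true and `uncapped` is false, which is the distinction `usdCapped` exists to make.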

Full suite 1004 pass; lint/typecheck/build green.
tangletools previously approved these changes Jun 16, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — b9adc546

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-16T21:43:32Z

@tangletools tangletools left a comment

⚪ Value Audit — audit-incomplete

Verdict audit-incomplete
Concerns 1 (1 low)
Heuristic 0.0s
Duplication 0.0s
Interrogation 90.0s (2 bridge agents)
Total 90.0s

💰 Value — error

value agent produced no parseable value-audit JSON.

  • Model: kimi-code/kimi-for-coding
  • Bridge attempts: 3
  • Bridge error: opencode/deepseek/deepseek-v4-pro: Bridge returned 503: {"error":{"message":"cli-bridge admission timed out after 30000ms","type":"admission_rejected","reason":"queue_timeout","admission":{"active":24,"queued":3,"maxActive":24,"maxQueue":32}}}; opencode/zai-coding-plan/glm-5.1: Bridge returned 503: {"error":{"message":"cli-bridge admission timed out after 30000ms","type":"admission_rejected","reas

🎯 Usefulness — error

usefulness agent produced no parseable value-audit JSON.

  • Model: kimi-code/kimi-for-coding
  • Bridge attempts: 3
  • Bridge error: opencode/deepseek/deepseek-v4-pro: Bridge returned 503: {"error":{"message":"cli-bridge admission timed out after 30000ms","type":"admission_rejected","reason":"queue_timeout","admission":{"active":24,"queued":2,"maxActive":24,"maxQueue":32}}}; opencode/zai-coding-plan/glm-5.1: Bridge returned 503: {"error":{"message":"cli-bridge admission timed out after 30000ms","type":"admission_rejected","reas

🔎 Heuristic Signals

🟡 Cruft: magic number added tests/loops/driver-inference-metering.test.ts

+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }


value-audit · 20260616T214650Z

@tangletools

✅ No Blockers — b9adc546

Readiness 80/100 · Confidence 70/100 · 10 findings (10 low)

| | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 80 | 80 | 80 |
| Confidence | 70 | 70 | 70 |
| Correctness | 80 | 80 | 80 |
| Security | 80 | 80 | 80 |
| Testing | 80 | 80 | 80 |
| Architecture | 80 | 80 | 80 |

Full multi-shot audit completed 2/2 planned shots over 10 changed files. Global verifier still owns final merge decision.

🟡 LOW extraRootSpend is opt-in — existing trajectoryReport callers silently exclude driver inference if extended to driver arms — src/runtime/personify/trajectory.ts

Lines 107-119: trajectoryReport folds extraRootSpend onto the root node only when the caller explicitly passes it. The bench gate (bench/src/gate.ts:380) and test helpers call trajectoryReport without it — currently correct because those arms are fanout/combinator shapes that never meter. But if the gate is later extended to coordination-driver arms, report.total would silently diverge from SupervisedResult.spentTotal (driver inference excluded), and equalKOnCost would compare on child-work-only cost — a silent integrity regression exactly analogous to the hole this PR closes. The doc on extraRootSpend (wave-types.ts:544-551) says 'Omit for a fanout

🟡 LOW observe function debits freeIterations but driver never uses this channel — src/runtime/supervise/budget.ts

At line 248, freeIterations -= spend.iterations. The sole caller in coordination-driver.ts:178 explicitly stamps iterations: 0 with a rationale comment that the iteration channel budgets child rounds, not driver turns. The iteration debit in observe is therefore dead code for the current path. If a future driver calls scope.meter with iterations > 0, it would consume the iteration budget alongside child rounds — which may or may not be intended. Consider either removing the iteration debit from observe (if driver inference should NEVER consume iterations) or documenting the contract more explicitly (e.g., a comment that iteration debit is intenti

🟡 LOW poolStarved usd check is asymmetric with token check — driver burns inference usd between perWorker.maxUsd and zero — src/runtime/supervise/coordination-driver.ts

Lines 105-111: poolStarved checks tokenStarved = b.tokensLeft < perWorker.maxTokens (can't afford a worker) but usdStarved = b.usdCapped && b.usdLeft <= 0 (completely exhausted). When perWorker has a maxUsd ceiling and the pool has some usd remaining but less than perWorker.maxUsd, the driver cannot spawn (reserve fails on the usd admission check at budget.ts:171) yet poolStarved returns false — so the driver keeps looping, each turn costing inference usd, until freeUsd drops to <=0. For tokens this is caught early (tokenStarved fires when tokensLeft < perWorker.maxTokens); for usd it is not. Impact: wasted driver inference spend in the

🟡 LOW Inconsistent usd-cap guard between poolExhausted and poolStarved — src/runtime/supervise/supervisor.ts

At line 403, poolExhausted checks opts.budget.maxUsd !== undefined && r.usdLeft <= 0 to decide usd exhaustion. In coordination-driver.ts:109, poolStarved checks b.usdCapped && b.usdLeft <= 0. Both guard the same condition — whether the pool was constructed with a usd ceiling — but through different paths (opts.budget.maxUsd vs b.usdCapped). They are functionally identical because usdCapped is derived from root.maxUsd !== undefined and the pool is constructed from opts.budget. However, the readout already carries usdCapped; the guard in poolExhausted should use r.usdCapped for consistency and to avoid coupling to the construction-tim

🟡 LOW No spentTotal/spentBreakdown on no-winner results — driver inference cost invisible on failure paths — src/runtime/supervise/supervisor.ts

Lines 193-210: The no-winner paths (both 'out === undefined' and 'act rejected') return only {kind, reason, tree, downCount} — no spentTotal or spentBreakdown. A driver that ran 50 turns of inference ($5) then produced no winner reports zero cost in the result. The SupervisedResult type (types.ts:445-449) never carried spend on no-winner, so this is pre-existing — but the PR makes it more consequential now that driver inference is a real, potentially large cost line. The cost IS observable via agent.turn hooks (if wired), but the pool is created locally and not exposed post-run. For billing/spend-observability parity, consider adding an optional spe

🟡 LOW No coordination-driver-level test for usage-present + costUsd-absent combination — tests/loops/driver-inference-metering.test.ts

The coordination driver meters when res.usage || res.costUsd !== undefined (line 172). The existing tests always include both. A turn with usage but no costUsd would set usd:0 in the Spend (via res.costUsd ?? 0), which is correct behavior but untested at the driver integration level. The router-driver-chat.test.ts:131 test covers the DriverChat-level absence, but not the downstream metering path. Low severity — the ?? 0 fallback is trivially correct.

🟡 LOW No test for poolStarved reservedTokens > 0 early-return path — tests/loops/driver-inference-metering.test.ts

coordination-driver.ts:163 has if (b.reservedTokens > 0) return false — when a child is in flight, the driver continues to await it rather than falsely declaring poolStarved. None of the test scenarios have a spawned child in flight when poolStarved is checked (the never-stopping tests only call list_questions which doesn't spawn). A test with a slow/streaming worker whose tokens are reserved but not yet settled would cover this guard. Low risk — the guard is a two-liner with a clear boolean check, but it gates a critical correctness property (don't finalize early when a child is running).

🟡 LOW No test for spentBreakdown omission when driver inference is empty (isNonEmptySpend=false path) — tests/loops/driver-inference-metering.test.ts

supervisor.ts:188 conditionally omits spentBreakdown via isNonEmptySpend(driverInference). Every test in this file exercises a driver with real usage, so spentBreakdown is always defined. No test verifies the opposite: a blind/fanout arm (no driver inference) should produce a result with spentTotal but NO spentBreakdown field. This is a coverage gap for the isNonEmptySpend gate, not a correctness bug — the implementation is correct. Fix: add a test with a meteredChat that returns turns without usage/costUsd and assert spentBreakdown is undefined.

🟡 LOW USD overshoot by one turn is accepted behavior — documented but implicit — tests/loops/driver-inference-metering.test.ts

Test at line 250-293 (usd-bounded) demonstrates the driver spends 0.12 USD against a 0.10 cap before halting. The in-loop guard (poolStarved) checks before each turn, so the driver can overshoot by one turn's worth of inference. The coordination-driver.ts metering comment states 'the spend already happened, so accounting records reality; the in-loop guard prevents MORE' — this is explicit design. The pool doesn't reject overspend retroactively (that would break accounting). If a single turn's cost could be extremely large (e.g. a very long prompt), the overshoot could be significant. Consider documenting the max-overshoot bound (max poss

🟡 LOW usdCapped field added to BudgetReadout is a minor type-breaking change — tests/loops/driver-inference-metering.test.ts

budget.ts:45 adds usdCapped: boolean to BudgetReadout. Any external code that destructures or exhaustively checks BudgetReadout will fail to compile. The module is @experimental, mitigating impact, but consumers using scope.budget will see a new required field. Tests don't exercise this externally — they only exercise readout() internally. No consumers in the repo appear affected (the field is only read inside poolStarved in coordination-driver.ts:164-166). Verdict: intentional, additive to experimental API; note for release notes.


tangletools · 2026-06-16T21:54:59Z · trace

… event = the twin of observe)

Driver inference lived only in the pool (observe) while child work lived in BOTH the pool
(reconcile) and the journal (settled). So two cost ledgers could disagree: the cross-run
equal-k gate reads the journal and silently under-counted any coordination-driver arm (the
manual extraRootSpend bridge was the symptom). Close it at the root: give the driver's
inference its missing journal twin.

- New SpawnEvent kind `metered` — a driver's own inference spend, the journal TWIN of the pool's
  `observe` exactly as `settled` is the twin of `reconcile`. Scope.meter journals it (and is now
  async/awaited, cost-critical). A driver re-homes its nested subtree's inference up to its parent
  as one metered event (ExecutorResult.metered → finalizeSettlement), mirroring how settled spend
  rolls child work up — so summing ANY sub-tree root yields its true driver-inference cost.
- ONE ledger = the journal: spentFromJournal sums settled (childWork) + metered (driverInference);
  trajectoryReport folds metered onto node ownSpend. So spentTotal == trajectoryReport.total == the
  equal-k gate's number by construction, at any depth. Removed the pool→supervisor bridge
  (observedTotal, now dead) and trajectoryReport's extraRootSpend (now automatic).
- Journal integrity preserved: metered is exempt from the cursor-uniqueness guard, skipped by
  replay (not a settlement), and folded in a separate additive pass by materialize/trajectory
  (order-independent). observe stays as the live pool debit for the in-loop guard.
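
The single-ledger idea can be illustrated with a minimal sketch, assuming a token-only journal and a simplified event union. `JournalEventSketch` and `spentFromJournalSketch` are hypothetical names; the real SpawnEvent union and spentFromJournal carry more fields and channels.

```typescript
// Simplified journal events: settled = child work, metered = driver inference.
type JournalEventSketch =
  | { kind: "spawned" }
  | { kind: "settled"; tokens: number }
  | { kind: "metered"; tokens: number }

interface BreakdownSketch {
  childWork: number
  driverInference: number
  total: number
}

// Sums both spend kinds from the journal alone. Each pass is a plain
// addition, so the result does not depend on event order.
function spentFromJournalSketch(events: JournalEventSketch[]): BreakdownSketch {
  let childWork = 0
  let driverInference = 0
  for (const e of events) {
    if (e.kind === "settled") childWork += e.tokens
    else if (e.kind === "metered") driverInference += e.tokens
  }
  return { childWork, driverInference, total: childWork + driverInference }
}

const journal: JournalEventSketch[] = [
  { kind: "metered", tokens: 30 },
  { kind: "spawned" },
  { kind: "settled", tokens: 100 },
  { kind: "metered", tokens: 20 },
  { kind: "settled", tokens: 50 },
]
const breakdown = spentFromJournalSketch(journal)
// childWork 150, driverInference 50, total 200
const reversedBreakdown = spentFromJournalSketch([...journal].reverse())
```

Since every consumer derives its number from the same events, driverInference + childWork equals the total by construction rather than by a caller remembering to pass an extra argument.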

Adversarially reviewed SOUND on both journal/replay integrity AND conservation: each token counted
exactly once at root AND across nested depth (traced root→mid→sub→worker), nested inference reaches
the runId tree once via the re-homed copy, no replay/seq corruption. New nested-driver test proves
re-homing (spentTotal 280/160 across 2 levels). Full suite 1005 pass; lint/typecheck/build green.
tangletools previously approved these changes Jun 16, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 54bbddea

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-16T23:18:35Z

@tangletools tangletools left a comment

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 4 (1 low, 3 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 159.3s (2 bridge agents)
Total 159.3s

💰 Value — sound-with-nits

Meters the driver's own LLM inference tokens against the conserved budget pool — closing a real equal-k integrity hole — with a clean observe/meter/metered twin design that follows the codebase's existing reserve/reconcile/settled pattern.

  • What it does: The driver's own chat-LLM tokens (the largest token consumer in an agentic loop) were invisible to the conserved budget pool. This PR adds: (1) BudgetPool.observe(spend) — a direct free→committed debit bypassing reserve/reconcile, since inference isn't a pre-sized child; (2) Scope.meter(spend, detail) — debits the pool, journals a metered SpawnEvent (the durable twin), and emits an `agent.
  • Goals it achieves: (1) Equal-k conservation: coordination-driver arms were under-counted because their inference tokens were dropped on the floor — now they count. (2) Make maxTurns=0 genuinely pool-bounded: a thinking driver now drains tokens/usd → poolStarved halts it, instead of spinning to the 2000-turn anti-runaway tripwire. (3) Single cost ledger: the journal is the one source of truth for all spend, so `t
  • Assessment: This is a sound, well-architected change that closes a real integrity hole. The design respects the codebase's core invariant (total ≡ free + reserved + committed) by using a direct free→committed debit — the only honest option when the spend already happened (you can't reserve before an LLM call because you don't know the token count ahead of time). The metered journal event mirrors the e
  • Better / existing approach: No materially better architecture exists. The driver's inference cannot go through reserve/reconcile (LLM token count isn't known until after the call completes), and tracking it outside the pool would defeat equal-k conservation. The observe/meter/metered design is the right seam. I searched for existing spend utilities: the codebase has addTokenUsage in util.ts:135 for the token su
  • Model: opencode/zai-coding-plan/glm-5.1
  • Bridge attempts: 2
  • Bridge warning: opencode/deepseek/deepseek-v4-pro: Bridge returned 503: {"error":{"message":"cli-bridge admission timed out after 30000ms","type":"admission_rejected","reason":"queue_timeout","admission":{"active":24,"queued":0,"maxActive":24,"maxQueue":32}}}
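
The direct free → committed debit discussed in the bullets above can be sketched roughly as follows. This is a minimal, token-only illustration and not the PR's implementation: `MiniPool` and its fields are hypothetical, and only the names `observe`, `observedTotal`, and the invariant come from the PR text.

```typescript
// Hypothetical single-channel pool (tokens only); the real BudgetPool has more channels.
class MiniPool {
  private free: number;
  private reserved = 0;
  private committed = 0;
  private observed = 0;

  constructor(private readonly total: number) {
    this.free = total;
  }

  // Direct free -> committed debit: no reserve/reconcile ticket, and it never throws,
  // because the spend has already happened by the time it is recorded.
  observe(tokens: number): void {
    this.free -= tokens; // may go negative on overspend
    this.committed += tokens;
    this.observed += tokens;
  }

  observedTotal(): number {
    return this.observed;
  }

  tokensLeft(): number {
    return this.free;
  }

  // total === free + reserved + committed must hold after every operation.
  invariantHolds(): boolean {
    return this.total === this.free + this.reserved + this.committed;
  }
}

const pool = new MiniPool(1000);
pool.observe(700);
pool.observe(400); // overspend: free drops below zero, the invariant still holds
```

Under these assumptions, a negative `tokensLeft()` is the exhaustion signal an in-loop guard would read, while the conservation invariant is untouched.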

🎯 Usefulness — sound-with-nits

Closes a real accounting hole — driver inference was invisible to the conserved pool — with a well-integrated observe/meter twin that mirrors the existing reserve/reconcile pattern end-to-end.

  • Integration: Fully wired and reachable. The chain: routerDriverChat forwards usage/costUsd (router-driver-chat.ts:29-30) → coordinationDriverAgent calls scope.meter (coordination-driver.ts:183) → Scope.meter calls pool.observe + journals a metered event + emits agent.turn trace (scope.ts:372-403). Every journal consumer is updated: spentFromJournal separates settled/metered (supervisor.ts:432-437), trajectoryR
  • Fit with existing patterns: Fits the codebase's grain precisely. observe() is explicitly framed as 'the twin of the pool debit as settled is the twin of reconcile' — the same dual live+durable pattern already used for reserve/reconcile. Scope.meter sits naturally alongside spawn/next/send. The metered SpawnEvent variant extends the union without disturbing cursor semantics. The iterations:0 decision (not charging driver turn
  • Real-world viability: Holds up under realistic use. observe() never throws (correct — the spend already happened, so accounting records reality). free going negative on overspend is the intended exhaustion signal; poolStarved reads tokensLeft < perWorker.maxTokens which trips on negative correctly. The await on scope.meter ensures the journal event lands before the join-barrier roll-up. Nested re-homing avoids double-c
  • Model: opencode/zai-coding-plan/glm-5.1
  • Bridge attempts: 2
  • Bridge warning: opencode/deepseek/deepseek-v4-pro: Bridge returned 503: {"error":{"message":"cli-bridge admission timed out after 30000ms","type":"admission_rejected","reason":"queue_timeout","admission":{"active":24,"queued":1,"maxActive":24,"maxQueue":32}}}

🔎 Heuristic Signals

🟡 Cruft: magic number added tests/loops/driver-inference-metering.test.ts

+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }

💰 Value Audit

🟡 Per-channel Spend addition is copy-pasted in 6+ locations; a shared util would prevent drift [duplication]

The exact same { iterations: a+b, tokens: {input: a+b, output: a+b}, usd: a+b, ms: a+b } pattern appears in: supervisor.ts:452 addSpend, spawn-journal.ts:449 addJournalSpend, trajectory.ts:272 addNodeSpend, trajectory.ts:291 addSpend (in-place), bench/src/gate.ts:290 addSpend, plus the array variants in driver-executor.ts:249 sumSpend and driver-executor.ts:264 sumMetered. This PR introduced 2 of these copies (addJournalSpend, addNodeSpend). Additionally, isNonZeroSpend (dr
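
A shared helper of the kind this finding suggests might look like the sketch below. The `Spend` shape follows the pattern quoted in the finding; `ZERO_SPEND` and the exact signatures are assumptions, not the repository's code.

```typescript
// Spend shape as quoted in the finding; helper names and signatures are hypothetical.
interface Spend {
  iterations: number;
  tokens: { input: number; output: number };
  usd: number;
  ms: number;
}

const ZERO_SPEND: Spend = { iterations: 0, tokens: { input: 0, output: 0 }, usd: 0, ms: 0 };

// Pure per-channel addition: returns a new object and mutates neither argument.
function addSpend(a: Spend, b: Spend): Spend {
  return {
    iterations: a.iterations + b.iterations,
    tokens: {
      input: a.tokens.input + b.tokens.input,
      output: a.tokens.output + b.tokens.output,
    },
    usd: a.usd + b.usd,
    ms: a.ms + b.ms,
  };
}

// Array variant (the role sumSpend/sumMetered play), expressed through the same helper.
function sumSpend(spends: readonly Spend[]): Spend {
  return spends.reduce((acc, s) => addSpend(acc, s), ZERO_SPEND);
}
```

With one pure helper, the in-place and array variants become thin wrappers, so a new channel would only need to be added in one place.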

🎯 Usefulness Audit

🟡 Benchmark has a stale local routerDriverChat copy that won't meter inference [integration]

bench/src/atom-humaneval.mts:62 defines its OWN routerDriverChat that does NOT forward usage/costUsd (lines 74-81 return only content+toolCalls). This is the one real-router consumer outside the library export. Until it switches to the shared import from router-driver-chat.ts, the benchmark's driver arms won't actually meter inference — the exact hole this PR closes. Pre-existing code, not introduced here, but worth a one-line swap.

🟡 PR body claims extraRootSpend but code uses a cleaner re-homing approach instead [problem-fit]

The PR body states 'trajectoryReport gains extraRootSpend' but grep finds no extraRootSpend anywhere in the codebase. The actual implementation folds metered events directly onto each node's ownSpend (trajectory.ts:99-103), so trajectoryReport.total includes driver inference with no caller plumbing. This is arguably better than the claimed extraRootSpend parameter — but the PR body is stale relative to the shipped design. Cosmetic only; no code action needed.


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260616T232305Z

@tangletools

✅ No Blockers — 54bbddea

Readiness 60/100 · Confidence 75/100 · 14 findings (2 medium, 12 low)

| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 60 | 74 | 60 |
| Confidence | 75 | 75 | 75 |
| Correctness | 60 | 74 | 60 |
| Security | 60 | 74 | 60 |
| Testing | 60 | 74 | 60 |
| Architecture | 60 | 74 | 60 |

Full multi-shot audit completed 3/3 planned shots over 12 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM materializeTreeView metered pass has no direct test coverage — src/durable/spawn-journal.ts

Lines 420-424 add a third pass in materializeTreeView that iterates events, filters for kind==='metered', and calls addJournalSpend to accumulate driver inference onto node snapshots. This is a public API (exported via src/runtime/index.ts:35). No test exercises this path with metered events — the only materializeTreeView call in tests (tests/loops/supervise.test.ts:840) uses a tree without metered events. The trajectoryReport analog is tested in driver-inference-metering.test.ts:390-429, but that is a separate code path (trajectory.ts implements its own node reconstruction). Risk: the accumulation logic (correct pass ordering, requireNode interaction, add

🟠 MEDIUM Sub-driver metered inference orphaned on crash — pool/journal cost mismatch — src/runtime/supervise/driver-executor.ts

When a nested sub-driver meters tokens via scope.meter() over multiple turns, then crashes on a later turn (e.g. chat.next() throws a network error), the metered events are durably written to the nested tree's journal (scope.ts:381), but the crash propagates through driver.act() → execute() (driver-executor.ts:149, no try/catch) → runChild catch (scope.ts:604), which returns a down settlement WITHOUT computing the metered sum (line 609: return downRecord(...), with no metered field). The down path in finalizeSettlement (scope.ts:437-447) never re-homes metered events to the parent tree. Result: pool.observe() already debited the

🟡 LOW JSDoc for materializeTreeView doesn't mention metered events — src/durable/spawn-journal.ts

Lines 377-381: The JSDoc says 'Folds spawned/settled/cancelled into a per-node snapshot' but the function now also folds metered events into snapshots (the third pass at lines 420-424 accumulates spend). The comment at lines 385-388 similarly only mentions spawned/settled/cancelled. This is a documentation gap, not a logic bug — the code works correctly.

🟡 LOW No direct unit test for materializeTreeView with metered events — src/durable/spawn-journal.ts

materializeTreeView's new metered fold pass (lines 418-424) is the exact mirror of trajectoryReport's fold (trajectory.ts:99-103), but supervise.test.ts:838-844 tests materializeTreeView only on non-metered journals. The driver-inference-metering test (line 414) journals a metered event but calls trajectoryReport, not materializeTreeView. A regression in materializeTreeView's metered handling (e.g., wrong accumulation order, spend double-counting) would not be caught by existing unit tests. Add a test that journals spawned + se
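
The kind of direct test this finding asks for could start from a sketch like the one below. The event shapes and `foldTokens` are simplified stand-ins (token totals only), not the real `SpawnEvent` union or `materializeTreeView`; the three-pass structure and the unknown-node check mirror what the findings describe.

```typescript
// Simplified, hypothetical event shapes; the real SpawnEvent union carries more fields.
type MiniEvent =
  | { kind: "spawned"; nodeId: string; seq: number }
  | { kind: "settled"; nodeId: string; seq: number; tokens: number }
  | { kind: "metered"; nodeId: string; seq: number; tokens: number };

// Fold events into per-node token totals: spawned creates the node,
// settled adds child work, and a final pass adds metered driver inference.
function foldTokens(events: readonly MiniEvent[]): Map<string, number> {
  const nodes = new Map<string, number>();
  const requireNode = (id: string): number => {
    const current = nodes.get(id);
    if (current === undefined) {
      throw new Error(`journal corrupted: event for unknown node ${id}`);
    }
    return current;
  };
  for (const e of events) {
    if (e.kind === "spawned") nodes.set(e.nodeId, 0);
  }
  for (const e of events) {
    if (e.kind === "settled") nodes.set(e.nodeId, requireNode(e.nodeId) + e.tokens);
  }
  for (const e of events) {
    if (e.kind === "metered") nodes.set(e.nodeId, requireNode(e.nodeId) + e.tokens);
  }
  return nodes;
}

const view = foldTokens([
  { kind: "spawned", nodeId: "root", seq: 0 },
  { kind: "metered", nodeId: "root", seq: 0, tokens: 60 }, // shares seq 0 with spawned
  { kind: "spawned", nodeId: "child", seq: 1 },
  { kind: "settled", nodeId: "child", seq: 2, tokens: 100 },
]);
```

A real test would journal the same mix of events and assert the per-node spend returned by `materializeTreeView`, including the seq-sharing case shown here.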

🟡 LOW metered event ordering in replaySpawnTree's seq sort is implicit — src/durable/spawn-journal.ts

Line 310 sorts ALL events (including metered) by seq, then line 318 skips metered. Since meterSeq overlaps cursorSeq, a metered event could sort between two settled events. This is currently harmless (metered is skipped), but the correctness of the sort relies on the skip being applied. If a future change reads from 'ordered' without the skip, the interleaving would silently reorder settlements relative to driver-inference timing. The skip at [line 318](https://github.com/tangle-network/agent-runtime/blob/54bbddea177e6e6715e55d08ce8

🟡 LOW requireNode error message doesn't mention metered events — src/durable/spawn-journal.ts

Line 461: throw new Error('spawn journal corrupted: settle/cancel for node ...'). The requireNode function at line 458 is now called for metered events (line 422) in addition to settled/cancelled (line 409, 414). If a metered event references a non-existent node, the error

🟡 LOW No-op meter write for turns with costUsd:0 but no usage — src/runtime/supervise/coordination-driver.ts

Line 172: The guard if (res.usage || res.costUsd !== undefined) triggers metering when costUsd is 0 but usage is absent. This produces a zero-spend observe() (harmless) and a zero-spend metered journal event (minor journal noise). The downstream isNonZeroSpend/isNonEmptySpend gates suppress it from outputs, so no functional impact. Could tighten the guard to if (res.usage || (res.costUsd !== undefined && res.costUsd > 0)) but the current behavior is defensible (recording a zero-cost priced turn is honest accounting).
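
The two guard variants discussed here differ only on a turn that reports `costUsd: 0` with no usage. A small sketch makes the difference concrete; `TurnUsage` and the inner `usage` field names are assumptions, and only the two boolean conditions come from the finding.

```typescript
// Hypothetical slice of a driver turn: only the two fields the guard reads.
interface TurnUsage {
  usage?: { inputTokens: number; outputTokens: number };
  costUsd?: number;
}

// Guard as quoted in the finding: meters whenever either field is present,
// which includes a turn carrying costUsd: 0 and no usage.
function shouldMeter(res: TurnUsage): boolean {
  return Boolean(res.usage) || res.costUsd !== undefined;
}

// Tightened variant suggested in the finding: a zero cost without usage is skipped.
function shouldMeterStrict(res: TurnUsage): boolean {
  return Boolean(res.usage) || (res.costUsd !== undefined && res.costUsd > 0);
}
```

Both variants skip a turn with neither field and meter a turn with usage; they disagree only on the zero-cost, no-usage case.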

🟡 LOW Stale tripwire comment contradicts the metering this PR adds — src/runtime/supervise/coordination-driver.ts

Lines 93-96: The comment on runawayTripwireTurns states 'the driver's own inference tokens are not yet metered against the conserved pool, so they alone cannot drain it.' But this PR adds exactly that metering via scope.meter() → pool.observe(). With usage-reporting chat seams, driver inference NOW drains the pool. The tripwire's remaining purpose is narrower: it catches a driver whose chat seam reports NO usage (so observe is never called). The comment should be updated to reflect the post-metering reality, otherwise a future reader will misunderstand why the tripwire exists. Fix: update the comment to say driver inference IS now met

🟡 LOW meterSeq shares ordinal 0 with spawnOrdinal on root's first metered turn — src/runtime/supervise/scope.ts

Both meterSeq and spawnOrdinal start at 0 (lines 189-191). The root node's first metered event and the root's spawned event both get seq=0. The comment correctly states that metered events use a separate seq namespace outside cursor uniqueness, so no collision occurs in the journal guard. However, if any downstream code ever sorts events by seq without kind discrimination, the co-numbering could produce ambiguous ordering. Not a functional bug today but a latent design fragility.
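
One way to remove the ambiguity described here is a comparator that breaks seq ties by kind. The sketch below is an assumption-laden illustration: `SeqEvent`, `compareEvents`, and the tie-break rule (cursor events before metered ones) are hypothetical and not part of the PR.

```typescript
// Hypothetical minimal event: only the fields the ordering depends on.
interface SeqEvent {
  kind: "spawned" | "settled" | "cancelled" | "metered";
  seq: number;
}

// Order by seq; on a tie, place cursor events before metered ones,
// so co-numbered events always come out in the same order.
function compareEvents(a: SeqEvent, b: SeqEvent): number {
  if (a.seq !== b.seq) return a.seq - b.seq;
  const rank = (e: SeqEvent): number => (e.kind === "metered" ? 1 : 0);
  return rank(a) - rank(b);
}

const events: SeqEvent[] = [
  { kind: "metered", seq: 0 },
  { kind: "settled", seq: 1 },
  { kind: "spawned", seq: 0 },
];
const ordered = [...events].sort(compareEvents);
```

Here the root's `spawned` and first `metered` event both carry seq 0, and the comparator places `spawned` first regardless of input order.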

🟡 LOW notifyRuntimeHookEvent called after pool+journal commit in meter() — hook failure is silent — src/runtime/supervise/scope.ts

meter() at line 376 calls args.pool.observe(spend) and await args.journal.appendEvent(...) BEFORE notifyRuntimeHookEvent(...) at line 390. If the hook throws (e.g. a downstream consumer's event handler crashes), the pool and journal are already mutated but the hook error propagates out of meter(), which is awaited by the coordination driver loop (coordination-driver.ts:183). This would abort the driver's loop on a trace consumer failure, which may be undesirable — the driver should be resilient to observability failu

🟡 LOW Nested metering test omits usd breakdown assertions — tests/loops/driver-inference-metering.test.ts

Test 're-homes a NESTED sub-driver inference up the tree' (line 133) asserts only the tokens channel of driverInference and spentTotal (lines 205-207). The mid and root driver turns carry no costUsd, so driverInference.usd is implicitly 0 — but not asserting it means a regression that accidentally drops usd metering in the nested re-home path (driver-executor.ts sumMetered) would not be caught by this test. The flat test ([line 125](https://github.com/tangle-network/agent-runtime/blob/54bbddea1

🟡 LOW No test for mixed metered/unmetered driver turns — tests/loops/driver-inference-metering.test.ts

coordination-driver.ts line 172 gates metering on if (res.usage || res.costUsd !== undefined). A turn with NEITHER field is skipped (no meter, no journal event, no agent.turn event). All meteredChat turns in the tests carry usage on every turn. A turn carrying costUsd but no usage (tokens defaulting to 0/0) is also untested. The router-driver-chat test at line 131 does verify the router level (usage absent → not forwarded), but the driver-level gate that consumes it is not exercised with a mixed s

🟡 LOW Test comment imprecision in maxTurns=0 usd-bounded test — tests/loops/driver-inference-metering.test.ts

Line 357: comment says '~2-3 turns of $0.04 fit' but the test asserts exactly 3 turns (maxUsd=0.1, costUsd=0.04×3=0.12, guard breaks at top of turn 4 when usdLeft=-0.02≤0). The ~ qualifier is misleading — it's deterministically 3 turns. The assertion itself is correct; only the comment is fuzzy. No behavior impact.

🟡 LOW USD-bounded test does not assert the exact usdLeft readout — tests/loops/driver-inference-metering.test.ts

Test 'maxTurns=0 is bounded by usd too' (line 328) asserts only n === 3 and result.kind === 'no-winner' (lines 370-371). The comment traces usdLeft: 0.1 → 0.06 → 0.02 → -0.02, but the test never reads scope.budget.usdLeft to prove the negative readout (unlike the budget-pool observe test at line 444 which does assert the negative token value).
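
The usd trace in this finding (0.1 → 0.06 → 0.02 → -0.02, three turns) can be reproduced with a small stand-alone loop. `runUntilUsdExhausted` is a hypothetical model of the in-loop guard, not the driver's code; the only assumptions taken from the thread are the `usdLeft ≤ 0` halt at the top of a turn and the 2000-turn tripwire.

```typescript
// Hypothetical in-loop guard: halt at the top of a turn once no usd is left.
function runUntilUsdExhausted(maxUsd: number, costPerTurn: number, tripwire = 2000) {
  let usdLeft = maxUsd;
  const readouts: number[] = [];
  let turns = 0;
  while (turns < tripwire && usdLeft > 0) {
    usdLeft -= costPerTurn; // the spend already happened, so the balance may go negative
    readouts.push(usdLeft);
    turns += 1;
  }
  return { turns, usdLeft, readouts };
}

const run = runUntilUsdExhausted(0.1, 0.04);
// run.turns is 3; run.usdLeft is approximately -0.02 (floating point, so compare with a tolerance)
```

A test asserting the readout would therefore compare `usdLeft` against -0.02 with a tolerance rather than strict equality.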


tangletools · 2026-06-16T23:30:45Z · trace

…path + close review nits

Address the PR review of the single-ledger unification:

- MEDIUM (real drift): a NESTED sub-driver that crashed mid-run kept its metered inference in the
  pool (observe) but never re-homed it to the parent journal — the down/abort settle path didn't
  carry metered. So spentTotal/trajectory silently under-counted a crashed sub-driver: the exact
  pool↔journal drift the unification was meant to kill. Fix: `Executor.metered()` (replacing
  `ExecutorResult.metered`) caches the nested-tree inference on BOTH the success AND crash paths;
  the scope re-homes it on every settle exit (done, aborted, crash). New test proves a crashed
  sub-driver's partial inference (40/20) lands in the journal.
- MEDIUM (test gap): add a direct `materializeTreeView` test for the metered fold (the resume-path
  twin of trajectoryReport), incl. the seq-collision case.
- Bench: `bench/src/atom-humaneval.mts` had a local `routerDriverChat` copy that dropped
  usage/costUsd — so its driver arms never metered. Swap to the shared export; delete the dead
  local copy + helpers. Now the one real-router consumer actually meters inference.
- Nits: assert usd flows through the nested re-home; fix the stale tripwire comment (driver
  inference IS metered now — the tripwire only catches a no-usage chat seam); JSDoc.

Full suite 1007 pass; lint/typecheck/build green.
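
The crash-path fix described in this commit message can be illustrated with a small sketch. `MiniExecutor` and `settle` are hypothetical simplifications (synchronous, token totals only); the idea taken from the message is that `metered()` stays readable on both the success and the crash path, and the settle step re-homes it on every exit.

```typescript
// Hypothetical executor: accumulates inference per turn and exposes it via metered(),
// so the caller can read the partial sum on both the success and the crash path.
class MiniExecutor {
  private tokens = 0;

  constructor(private readonly turns: ReadonlyArray<() => number>) {}

  execute(): void {
    for (const turn of this.turns) {
      this.tokens += turn(); // a throwing turn leaves earlier turns' tokens cached
    }
  }

  metered(): number {
    return this.tokens;
  }
}

// Settle on every exit: the finally block re-homes whatever was metered, crash or not.
function settle(executor: MiniExecutor, parentJournal: number[]): "done" | "down" {
  try {
    executor.execute();
    return "done";
  } catch {
    return "down";
  } finally {
    parentJournal.push(executor.metered());
  }
}

const journal: number[] = [];
const crashed = new MiniExecutor([
  () => 60, // one metered turn (e.g. 40 input + 20 output)
  () => {
    throw new Error("network error");
  },
]);
const outcome = settle(crashed, journal);
```

In this sketch the crashed run settles as `down` while its 60 tokens of partial inference still reach the parent journal, which is the drift the fix is meant to close.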

@tangletools tangletools left a comment

✅ Auto-approved PR — c73059a6

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-16T23:47:49Z

@drewstone drewstone changed the title from "feat(supervise): meter the driver's own inference against the conserved pool + A++ spend observability" to "feat(supervise): meter driver inference on ONE journal ledger + A++ spend observability" on Jun 16, 2026

@tangletools tangletools left a comment

🟢 Value Audit — sound

Verdict sound
Concerns 1 (1 low)
Heuristic 0.0s
Duplication 0.0s
Interrogation 155.0s (2 bridge agents)
Total 155.0s

💰 Value — sound

Meters the coordination driver's own LLM inference into the conserved budget pool — the largest per-run token consumer that was previously invisible to equal-k — with dual live/journal accounting and per-turn trace observability. Sound design, follows existing patterns, well tested.

  • What it does: Adds driver inference metering to the conserved budget pool via two parallel ledgers: BudgetPool.observe() (live free→committed debit for the in-loop guard) and Scope.meter() (journaled metered event + agent.turn trace event). The DriverTurn type gains usage/costUsd; routerDriverChat forwards the router's real usage. Each driver turn debits the shared pool, so equal-k counts the dr
  • Goals it achieves: 1) Make equal-k compute matching fair by counting the driver's own LLM tokens (previously invisible — the largest per-run consumer). 2) Make maxTurns=0 genuinely pool-bounded: a thinking driver now drains the pool → poolStarved halts it (proven by the never-stopping-driver test). 3) Unify cost accounting on one journal ledger: settled events carry child work, metered events carry driver in
  • Assessment: The change is coherent and follows the codebase's grain precisely. It mirrors the existing reserve/reconcile + settled dual-ledger pattern with a matching observe + metered pair. The invariant (total ≡ free + reserved + committed) is preserved by construction — observe is a direct free → committed debit. The metered event is NOT a settlement, so replay/cursor ordering is untouche
  • Better / existing approach: none — this is the right approach. Searched for existing spend-observation mechanisms, agent.turn events, metered event kinds, spentBreakdown, usdCapped, and prior work on these paths. No existing mechanism covered this gap: reserve/reconcile is per-child (driver isn't a child of itself), UsageEvent streams are for spawned children only, and the pool had no direct-debit operation. Th
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1

🎯 Usefulness — sound

Wires driver inference into the conserved pool and journal ledger correctly, making equal-k honest and maxTurns=0 genuinely budget-bounded; follows the existing reconcile/journal twin pattern and handles crash/depth/no-usage edge cases.

  • Integration: Fully reachable and consumed. routerDriverChat (src/runtime/supervise/router-driver-chat.ts:29-30) forwards the router's real usage/costUsd → coordinationDriverAgent calls scope.meter() (src/runtime/supervise/coordination-driver.ts:184) → pool.observe() debits live (src/runtime/supervise/budget.ts:236) + journal.appendEvent() persists durably (src/runtime/supervise/scope.ts:389). The supervisor su
  • Fit with existing patterns: Extends the existing design's twin pattern: metered ↔ observe mirrors settled ↔ reconcile. BudgetPool.observe moves free → committed directly (preserving total ≡ free + reserved + committed), same channels and invariants as reserve/reconcile. Scope.meter is a first-class Scope verb alongside spawn/next/send. The metered event type in SpawnEvent (src/runtime/supervise/types.ts:385-398) is a natural
  • Real-world viability: Edge cases are handled: (1) overspend: free goes negative on observe — the honest exhaustion signal poolStarved reads (src/runtime/supervise/coordination-driver.ts:109-111), in-loop guard halts before more spend; (2) crash/re-home: a crashed sub-driver's partial inference is read from its nested tree on the catch path (driver-executor.ts:182) and re-homed as a metered event on the down settlement
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: magic number added tests/loops/driver-inference-metering.test.ts

+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }

value-audit · 20260616T235216Z

@drewstone drewstone merged commit a9c94b7 into main Jun 16, 2026
1 check passed
drewstone added a commit that referenced this pull request Jun 17, 2026
…de shape

Harness-agnostic: a per-harness part-decoder adapter registry (toolPartDecoders) replaces
the one mega-decoder. Each harness owns its real wire shape; the flow downstream is identical;
add a harness = add a decoder + one registry entry.

- decodeOpencodePart — VALIDATED ON A LIVE BOX: a real opencode session emitted
  {type:'tool', tool:'read'|'bash'|'write', callID, state:{status,input}} — the decoder extracted
  3/3 real tool calls (the initial guess got 0/29; caught + fixed against reality). Terminal-state
  only (pending/running skipped) + callId de-dup (opencode streams each call 3x + a raw copy).
- decodeAnthropicPart (claude-code tool_use), decodeOpenAiPart (codex/router/kimi function).
- decodeToolPart(part, harness?) selects the adapter or tries all; sources thread  through.
- bench/src/decoder-live.mts: the live-box validation harness (no mock).

trace 16 + watch 5 + analyze 2 tests; typecheck/build/lint clean. (Pre-existing main failure in
driver-inference-metering #319 is unrelated — fails on origin/main, untouched here.)
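
The registry pattern in this commit message can be sketched as below. The opencode part shape and the names `toolPartDecoders`, `decodeOpencodePart`, and `decodeToolPart` come from the message; `DecodedCall`, `decodeAll`, and treating every status other than pending/running as terminal are assumptions made for illustration.

```typescript
// Hypothetical normalized output; the real code maps parts to agent-eval's ToolSpan.
interface DecodedCall {
  callId: string;
  name: string;
  input: unknown;
}
type PartDecoder = (part: unknown) => DecodedCall | undefined;

const isRecord = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null;

// opencode shape from the commit message: {type:'tool', tool, callID, state:{status,input}}.
// Assumption: any status other than pending/running counts as terminal.
const decodeOpencodePart: PartDecoder = (part) => {
  if (!isRecord(part) || part["type"] !== "tool") return undefined;
  const tool = part["tool"];
  const callID = part["callID"];
  const state = part["state"];
  if (typeof tool !== "string" || typeof callID !== "string" || !isRecord(state)) return undefined;
  if (state["status"] === "pending" || state["status"] === "running") return undefined;
  return { callId: callID, name: tool, input: state["input"] };
};

// Registry: adding a harness means adding a decoder plus one entry here.
const toolPartDecoders: Record<string, PartDecoder | undefined> = {
  opencode: decodeOpencodePart,
};

// Select the adapter by harness name, or try every registered decoder.
function decodeToolPart(part: unknown, harness?: string): DecodedCall | undefined {
  if (harness !== undefined) return toolPartDecoders[harness]?.(part);
  for (const decode of Object.values(toolPartDecoders)) {
    const call = decode?.(part);
    if (call) return call;
  }
  return undefined;
}

// De-duplicate by callId, since a harness may stream the same call several times.
function decodeAll(parts: readonly unknown[], harness?: string): DecodedCall[] {
  const seen = new Map<string, DecodedCall>();
  for (const part of parts) {
    const call = decodeToolPart(part, harness);
    if (call && !seen.has(call.callId)) seen.set(call.callId, call);
  }
  return Array.from(seen.values());
}
```

Keeping each decoder a pure function from an unknown part to an optional call is what lets the downstream flow stay identical across harnesses.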
drewstone added a commit that referenced this pull request Jun 17, 2026
…ent) — main was red (#322)

#318 renamed the coordination wait verb await_next → await_event; #319's metering test
(merged around the same time) still scripted await_next. Since the tool no longer exists,
the scripted driver's 'collect the worker' turn returned {error:'unknown tool: await_next'}
and the worker was never drained → result 'no-winner' → the 2 winner-expecting tests failed
on main. Updated all 6 refs to await_event (the driver already fails loud on the unknown
tool — only the fixed script couldn't recover, unlike a real LLM).

Full suite 1030 pass (was 2 failing on main); no stale await_next remains anywhere.
drewstone added a commit that referenced this pull request Jun 17, 2026
…race analysis (#321)

* refactor(supervise): substrate-agnostic TraceSource (sandbox-first) — replace router-only seam

The detector/analyzer were built router-only (onToolStep/ToolStep) — premature; production is
sandbox/fleet. Corrected to one interface over agent-eval's ToolSpan:

- TraceSource (trace-source.ts): a worker's tool calls as ToolSpans, from an OWNED loop
  (createPushTraceSource — router/cli-bridge dispatch) OR a SANDBOX box
  (sandboxSessionTraceSource(box, sessionId) → box.messages() session parts → decodeToolPart,
  defensive across OpenAI + harness shapes). The SDK exposes tool calls via the session
  (SessionMessage.parts / streamPrompt), NOT exportTrace (sandbox telemetry) — corrected.
- watchTrace (online) + analyzeTrace (settle) now consume a TraceSource, not a router seam.
- DELETED the router-only createDetectorMonitor/ToolStep/createTrajectoryRecorder/RecordedToolStep.

Common currency = ToolSpan; same agent-eval detectors + batch analyzers over any substrate.
trace-source 11 + watchTrace 5 + analyzeTrace 2 tests incl. the sandbox box path (mock box →
session parts → loop detected); full suite 1023 pass; typecheck/build/lint clean.

Live-box validation of the exact harness part-shape pending (decoder is defensive).

* feat(supervise): per-harness decoder registry + LIVE-validated opencode shape

Harness-agnostic: a per-harness part-decoder adapter registry (toolPartDecoders) replaces
the one mega-decoder. Each harness owns its real wire shape; the flow downstream is identical;
add a harness = add a decoder + one registry entry.

- decodeOpencodePart — VALIDATED ON A LIVE BOX: a real opencode session emitted
  {type:'tool', tool:'read'|'bash'|'write', callID, state:{status,input}} — the decoder extracted
  3/3 real tool calls (the initial guess got 0/29; caught + fixed against reality). Terminal-state
  only (pending/running skipped) + callId de-dup (opencode streams each call 3x + a raw copy).
- decodeAnthropicPart (claude-code tool_use), decodeOpenAiPart (codex/router/kimi function).
- decodeToolPart(part, harness?) selects the adapter or tries all; sources thread  through.
- bench/src/decoder-live.mts: the live-box validation harness (no mock).

trace 16 + watch 5 + analyze 2 tests; typecheck/build/lint clean. (Pre-existing main failure in
driver-inference-metering #319 is unrelated — fails on origin/main, untouched here.)

* fix(supervise): address #321 audit + honest live-validation of harness adapters

Audit (Needs Work) fixes:
- decodeOpenAiPart guard matched any object with a .function field ({type:'text',function}) →
  match on type ('function'|'tool_call') only, name still required.
- subscriber isolation: a throwing onSignal/subscriber can no longer crash the producer
  (both push + parts source fan-outs now try/catch each subscriber).
- restored the dropped evidence:{action} assertion in the watchTrace stuck-loop test.
- decoder-live.mts: ALWAYS deletes the box (try/finally), exits non-zero on a decode miss
  (CI-gateable), parameterized by HARNESS, dumps part-type vocabulary + harness error events.
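
The subscriber-isolation fix in the list above amounts to wrapping each callback in its own try/catch. A minimal sketch, with `fanOut` and the failure counter as hypothetical names:

```typescript
type Subscriber<T> = (event: T) => void;

// Fan out one event to every subscriber; a throwing subscriber is isolated
// so it cannot crash the producer or starve the subscribers after it.
function fanOut<T>(subscribers: readonly Subscriber<T>[], event: T): number {
  let failures = 0;
  for (const subscriber of subscribers) {
    try {
      subscriber(event);
    } catch {
      failures += 1; // swallow and count; a real source might log the error here
    }
  }
  return failures;
}
```

Returning the failure count keeps the producer informed without letting an observability consumer change its control flow.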

LIVE validation across harnesses (REAL boxes, no mock):
- opencode: PROVEN (3/3 real tool calls — already in this PR).
- claude-code + codex: the harness PROCESS crashes 'exit 1' with the test model (deepseek-v4-flash
  via the openai-compat router) before any tool call — zero tool parts to validate against. This is a
  harness↔model protocol block, NOT a decoder issue. Their adapters stay (public formats) but are
  docstring-flagged NOT-live-validated; owed the same proof opencode got with a compatible model.
- Coverage honesty: repeated-action works for ALL harnesses (name+args); error-streak is opencode-
  validated only (claude-code/codex errors live in separate result blocks not yet decoded).

trace 16 + watch 5 + analyze 2 tests; typecheck/build/lint clean.

* feat(trace): ground per-harness tool decoders in cli-bridge backend shapes

Confirm each harness adapter against cli-bridge's canonical parsers (the
real readers of each harness's native output):
- decodeAnthropicPart handles kimi's tool_use_id/tool fallbacks (kimi.ts)
- decodeOpenAiPart cites kimi.ts top-level function form
- codex emits no structured tool calls (codex.ts) — per-tool detection
  unavailable for codex from any path; documented as a harness property