refactor: usability overhaul — brain-from-profile, surface shrink, docs that can't lie, supervise() one-call#347
…, full doc/module/example inventory + completion criteria
…tify/refuse), steer-in-run, milestone-oracle gap, 8 skills to vendor
…enerator (GEPA/skillOpt/autoresearch) + surface param — not 'one engine'
…becomes a thin adapter (keystone 1/4)
…opChat seam (keystone) Delete DriverChat + routerDriverChat; the coordination-driver brain is now the canonical ToolLoopChat and its loop runs through runToolLoop (routerBrain = 4 lines, was 60). The equal-k driver-inference metering is preserved exactly. Three tool-loop copies collapse to one.
…done, 1b (brain-from-profile/harness-as-data, sandbox supervisor) next
…umbing from the public barrel
…es from the public barrel
…profiles; drop unused duplicates)
…ngExperiment, refineGepa label)
… 4 architecture docs→1, merge PLAIN→README, archive 5 niche notes)
… in curated docs must resolve Scans canonical-api/concepts/architecture for backticked symbols outside code fences; reddens on any call-shaped or PascalCase symbol that resolves to no src/bench/substrate export or concept-whitelist entry. Walks every substrate dist/**/*.d.ts (not just index barrels). Closes the gap that let gepaDriver/refineGepa live in the docs unchecked.
…s (WS1b) A supervisor is now an AgentProfile: harness null -> the in-process router tool-loop (coordinationDriverAgent; routerBrain becomes an internal detail), a coding-CLI harness (claude-code/opencode/codex) -> a sandboxed harness driving the coordination verbs via serveCoordinationMcp. Both arms share makeWorkerAgent + the keep-best-delivered oracle. Closes the critique's A2 (driver brain was router-only). Proven offline both arms.
… profile.harness)
supervise(profile, task, { backend|makeWorkerAgent, budget }) defaults blobs/perWorker/
journal/executors/maxDepth so 'just invoke the supervisor' is a one-liner. workerFromBackend
derives the worker seam from a backend config + an optional completion oracle (settled⟺delivered).
The raw seams (supervisorAgent + createSupervisor().run) stay for power use.
…design (round vs turn, prompt-policy retry, real-time trace self-correction)
… — profile + goal, scaffolding defaulted)
… agent, incl. a sub-supervisor) The coordination verb always took a worker OR a driver profile and resolves a sub-supervisor via the role marker — the name lied. Renamed across the tool def, the LLM-facing descriptions, the scripted-brain tests, the examples, and the hand docs. WS4 (naming taxonomy).
…er_agent (consistent verb family) The coordination verbs operate on any spawned agent (a leaf worker OR a sub-supervisor), so the family is now spawn_agent / observe_agent / steer_agent. WS4 (naming taxonomy).
…ategy, supervisorSkill→supervisorInstructions (WS4) They are strategy combinators and a prompt-instruction builder, not 'drivers'/'skills' — reserving 'Driver' for the agent-orchestration layer (coordinationDriverAgent/driverChild).
…nResumeDriver (WS4)
…r selfImprove; generator defaulted from surface)
…ridge runners onto supervise() run-router.ts duplicated examples/supervise/supervise.ts (router brain + router-tools backend). loop.ts's runSupervisorLoop/makeWorkerAgent duplicated supervise()/workerFromBackend. The sandbox + bridge runners now call supervise() with only their load-bearing per-backend seam; the shared demo task + scripted brain move to shared.ts.
…() (single-sourced worker seam) Replaces the bespoke makeWorker (executor construction + per-worker file plumbing) with workerFromBackend(backend, deliverable); the deployable check now reads the worker's real output for ANSWER=42 (completion oracle, not a self-report). Keeps the cli-bridge harness supervisor arm that drives spawn_agent natively over the coordination MCP.
…rvise() example test
tangletools
left a comment
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 3 (3 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 80.5s (2 bridge agents) |
| Total | 80.5s |
💰 Value — sound
Substantial usability/DX overhaul: unifies the supervisor brain onto one ToolLoopChat seam (deleting the DriverChat zoo), shrinks the public surface 13→6 subpaths, adds build-time doc freshness gates, and provides one-call supervise()/improve() facades — all in the grain of the existing recursive-su
- What it does: Unifies the supervisor brain seam: the old DriverChat zoo (routerDriverChat, scriptedSupervisorChat as a separate type hierarchy) is replaced by a single ToolLoopChat contract — routerBrain resolves it from the router, and supervisorAgent resolves the brain from profile.harness (null→router, harness name→sandboxed MCP). Deletes src/runtime/supervise/router-driver-chat.ts (65 lines of translation c
- Goals it achieves: 1) Make supervisors brain-resolvable from their profile (backend-as-data), same as workers via createExecutor — no hand-built driver brain. 2) Reduce the public API surface so a competent engineer can derive the call path from inspection, not by reverse-engineering 998 exports. 3) Make docs self-verifying — every backticked symbol must resolve to a real export, so docs can never claim a non-existe
- Assessment: This is a coherent, well-executed usability overhaul that reinforces rather than fights the codebase's grain. The brain-unification (WS1) is the keystone architectural improvement — it makes the supervisor follow the same backend-as-data resolution rule as every other agent (profile.harness null→router, set→sandbox), eliminating the parallel DriverChat type hierarchy. The surface shrink is surgica
- Better / existing approach: none — this is the right approach. Searched for existing alternatives: (1) the old DriverChat type hierarchy and routerDriverChat adapter in router-driver-chat.ts were the exact zoo being eliminated — the new ToolLoopChat seam IS the simpler design; (2) improve() properly delegates to agent-eval's selfImprove and is a thin facade, not a reinvention; (3) supervise() properly delegates to supervisor
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🎯 Usefulness — sound
The supervise() one-call + brain-from-profile unification + surface shrink + verifiable docs is a coherent usability overhaul that canonicalizes existing patterns without competing or dead-ending.
- Integration: Wires correctly. `supervise()` (src/runtime/supervise/supervise.ts:80) composes existing primitives (supervisorAgent→coordinationDriverAgent, createSupervisor, createInMemoryRunContext). All 5 examples (supervise.ts, run-sandbox.ts, run-bridge.ts, run-supervisor-mcp.ts, atom-humaneval.mts) import from `@tangle-network/agent-runtime/loops`, the verified subpath. `routerBrain` replaces the deleted …
- Fit with existing patterns: Aligns with the codebase's grain. The backend-as-data pattern: `createExecutor({backend})` resolves a worker by data; `supervisorAgent(profile, deps)` resolves a brain from `profile.harness` the same way. `routerBrain` is a 4-line thin wrapper over the existing `routerChatWithTools`, emitting the existing `ToolLoopChat` seam. `supervise()` is a convenience, not a replacement: the raw supervisor…
- Real-world viability: Proven offline at multiple layers: both brain arms tested without creds (supervisor-agent.test.ts:68-131 covers harness=null router arm and harness='opencode' sandbox arm via real HTTP MCP), budget conservation gates (poolStarved, deadline, abort) all test-covered (coordination-driver.test.ts:258-352), completion oracle (`gateOnDeliverable`) ensures settled≠delivered without a real check, the exam…
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/supervise/supervise.ts
+console.log(result.kind === 'winner' ? '✓ delivered' : `✗ no winner (${result.kind})`)
🟡 Cruft: commented out code scripts/check-docs-freshness.mjs
+// CLASS 1 version / substrate-peer pins != package.json
🟡 Cruft: magic number added tests/loops/supervisor-agent.test.ts
+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
…oop, fix improve() default model - runToolLoop name collided with the public streaming runToolLoop; the internal brain-loop seam is now runBrainLoop (one grep = one concept). - improve()'s zero-config default reflection model was the dead anthropic/claude-sonnet-4.6 → deepseek-v4-flash (router-served).
The two flagship verbs were invisible in every gated doc, so a reader was routed back onto the verbose legacy path the PR replaced. README now leads with the 3 entry points (chat turn / supervise / improve); canonical-api §2 makes supervise() the 'just run a supervisor' START-HERE row and routes self-improvement to improve().
tangletools
left a comment
✅ Auto-approved PR — bf075234
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-20T21:19:52Z
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 6 (3 low, 3 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 205.3s (2 bridge agents) |
| Total | 205.3s |
💰 Value — sound-with-nits
Large, coherent usability overhaul that unifies the supervisor brain seam onto the canonical ToolLoopChat, adds convenient one-call facades (supervise/improve), shrinks the public surface 12→5 subpaths, renames confusing symbols, and adds a self-verifying docs freshness gate — all in the grain of ex
- What it does: Consolidates the supervisor's brain seam: deletes the parallel DriverChat interface + routerDriverChat adapter (src/runtime/supervise/router-driver-chat.ts:1-65), and routes the coordination driver through the canonical ToolLoopChat seam instead. Adds supervise(profile, task, opts) as a one-call convenience composing supervisorAgent() + createSupervisor().run() with sensible defaults (src/runtime/
- Goals it achieves: Make a competent engineer derive the supervisor call path in seconds (supervise(profile, task, { backend, budget }) instead of hand-wiring coordinationDriverAgent + createInMemoryRunContext + createSupervisor + blobs + perWorker + journal + executors + maxDepth). Eliminate the confusing parallel type hierarchy (DriverChat vs ToolLoopChat — identical purpose, different shapes). Shrink the public AP
- Assessment: The change is well-designed and built in the grain of the existing codebase. The brain unification (DriverChat→ToolLoopChat) is a genuine simplification: the old code had a parallel interface with its own message format and a dedicated adapter (routerDriverChat) that did hand-rolled OpenAI message translation — now the 3-line routerBrain function satisfies the same ToolLoopChat seam every other to
- Better / existing approach: none — this is the right approach. The ToolLoopChat unification was the correct call (checked: no other DriverChat-like parallel type exists; routerBrain is now the canonical chain routerChatWithTools→routerBrain→ToolLoopChat). The supervise() convenience correctly composes existing primitives rather than reinventing them. The naming changes fix real confusion. The subpath reduction removes genuin
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🎯 Usefulness — sound-with-nits
supervise() delivers a real DX win (profile+goal → one call), brain-from-profile coheres with the executor pattern, and the docs freshness gate hardens the surface — one stale bench reference and a partially-wired code surface in improve() are the only nits.
- Integration: All new surfaces are exported through the 6-subpath public API and reachable. supervise() is called by 3 examples (examples/supervise/supervise.ts:31, examples/supervisor-loop/run-sandbox.ts:58, run-bridge.ts:75) and 2 test files (tests/loops/supervise-convenience.test.ts:51, tests/supervisor-loop-example.test.ts:65). supervisorAgent() is called internally by supervise() (src/runtime/supervise/sup
- Fit with existing patterns: The pattern matches the codebase's grain. createSupervisor().run() and coordinationDriverAgent remain as raw seams for power users (the bench and personify layer use them directly); supervise() wraps the common case with sensible defaults (blobs, perWorker=budget/4, journal, executors, maxDepth=8). The brain-from-profile pattern in supervisorAgent (harness:null → router brain, harness:'opencode' →
- Real-world viability: Core paths are tested: router arm (tests/loops/supervisor-agent.test.ts:68), sandbox arm (same file:83), offline scripted-brain integration (tests/supervisor-loop-example.test.ts:63), fail-loud for missing deps (supervise-convenience.test.ts:71, supervisor-agent.test.ts:104,115). The docs freshness gate (scripts/check-docs-freshness.mjs) smoke-proven to detect stale symbols. Two real-world gaps: (
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/supervise/supervise.ts
+console.log(result.kind === 'winner' ? '✓ delivered' : `✗ no winner (${result.kind})`)
🟡 Cruft: commented out code scripts/check-docs-freshness.mjs
+// CLASS 1 version / substrate-peer pins != package.json
🟡 Cruft: magic number added tests/loops/supervisor-agent.test.ts
+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }
💰 Value Audit
🟡 Bench system prompts reference old tool name spawn_worker after rename to spawn_agent [maintenance]
6 files still reference spawn_worker in LLM system prompts: bench/src/atom-humaneval.mts:96, bench/src/mcp-mount-probe.mts:89-90, bench/src/atom-mcp-e2e.mts:178, bench/src/profiles.ts:23/98, skills/supervise/SKILL.md:13/24, skills/loop-writer/SKILL.md:113. The MCP tools were renamed from spawn_worker to spawn_agent (src/mcp/tools/coordination.ts:336) and the LLM would fail trying to call a non-existent tool. Also bench/src/profiles.ts:22/24/96/98 references observe_worker/steer_worker (now obser
🎯 Usefulness Audit
🟡 bench/atom-humaneval.mts system prompt references stale tool name spawn_worker [robustness]
bench/src/atom-humaneval.mts:96 tells the LLM 'Tools: spawn_worker ... await_event ...' but the coordination tools were renamed to spawn_agent (src/mcp/tools/coordination.ts:336). The import rename (routerDriverChat→routerBrain, line 31-34) was done but the system prompt string wasn't updated. The bench would fail at runtime when the LLM calls a tool name that doesn't exist. Fix: update the system prompt string to reference spawn_agent.
🟡 improve() code surface is half-wired (empty baseline, winner not applied) [ergonomics]
At src/improvement/improve.ts:128-133, baselineSurfaceFor('code') returns '' (no worktree ref), and at line 159-160 applyWinnerToProfile('code') returns the profile unchanged. The caller must fish the actual code winner from raw.winner.surface. The surface exists in the type system (ImproveSurface includes 'code') and a generator can be injected, but the facade provides no load-bearing integration — a caller using surface:'code' with a generator gets a profile-back-is-input result and must read
… 2775 LOC + tests) A third orchestration substrate (a workflow-as-a-script DSL runner with its own checkpoints/budget/ delegates) that does NOT use the supervisor and is NOT self-improving — redundant with the Scope/Supervisor + supervise() path (the architecture's 'two substrates, do not invent a third'). Zero in-repo or fleet consumers; its ./workflow subpath was already dropped in WS3.
…or can vary budget per worker)
…d model-subset restriction)
…ey (runStrategyEvolution + promotionGate)
❌ Needs Work
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 0 | 37 | 0 |
| Confidence | 95 | 95 | 95 |
| Correctness | 0 | 37 | 0 |
| Security | 0 | 37 | 0 |
| Testing | 0 | 37 | 0 |
| Architecture | 0 | 37 | 0 |
Full multi-shot audit completed 8/8 planned shots over 80 changed files. Global verifier still owns final merge decision.
Blocking
🔴 HIGH contentAddress dropped from runtime barrel export — src/runtime/index.ts
The old barrel (edc1d54) exported contentAddress at line 30. The new barrel (bf07523) only exports InMemoryResultBlobStore and InMemorySpawnJournal from ../durable/spawn-journal. Two bench files import contentAddress from the runtime barrel: bench/src/atom-humaneval.mts:23 and bench/src/atom-mcp-e2e.mts:24. These will fail to compile. Fix: add contentAddress back to the re-export on line 29: `export { contentAddress, InMemoryResultBlobStore, InMemorySpawnJournal } from '../durable/spawn-journal'`.
Other
🟠 MEDIUM Unguarded JSON.parse in applyWinnerToProfile can crash after a successful ship verdict — src/improvement/improve.ts
Lines 152, 154, 156, 158 call `JSON.parse(winner)` unconditionally for skills/tools/mcp/hooks. tools/mcp/hooks REQUIRE a caller-supplied driver (no default), and skillOptDriver mutates a JSON-stringified blob. If any such driver produces a non-JSON string (a realistic LLM failure mode: the driver parses model output and a malformed reflection slips through), `applyWinnerToProfile` throws `SyntaxError` AFTER `selfImprove` returned gateDecision==='ship'. The throw discards the entire improvement result: the caller never receives `out.raw`, `out.lift`, or `out.gateDecision`, losing the provenance record and the winner surface. The facade's documented contract ('ret…
🟠 MEDIUM Unprotected JSON.parse in applyWinnerToProfile can throw raw SyntaxError — src/improvement/improve.ts
Lines 151-158: `applyWinnerToProfile` calls `JSON.parse(winner)` for `skills`/`tools`/`mcp`/`hooks` surfaces without try/catch. If the `ImprovementDriver` (including the default `skillOptDriver`) produces a winner surface string that is not valid JSON, a known LLM failure mode, this throws a raw `SyntaxError` after `selfImprove` already shipped the result. This violates the error taxonomy in `src/errors.ts:6-20`, which requires all consumer-facing errors to extend `AgentEvalError`. The risk is partially mitigated because the agent function must have used the surface successfully during evaluation for the gate to ship, but if the agent is lenient (accepts an…
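Both JSON.parse findings point at the same remedy: parse the winner defensively and report a failed writeback next to the result instead of throwing. The sketch below is a minimal illustration under assumptions, not the code in src/improvement/improve.ts. `safeParseWinner`, `applyToolsWinner`, `Profile`, `Writeback`, and `writebackError` are hypothetical names, and a real fix would presumably wrap the failure in the repo's own `AgentEvalError` taxonomy rather than a plain string.

```typescript
// Hypothetical result of parsing a winner surface string.
type ParsedWinner =
  | { ok: true; value: unknown }
  | { ok: false; error: string }

// Parse without throwing, so a malformed winner cannot discard the rest of the result.
function safeParseWinner(winner: string): ParsedWinner {
  try {
    return { ok: true, value: JSON.parse(winner) }
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err)
    return { ok: false, error: `winner surface is not valid JSON: ${message}` }
  }
}

// Hypothetical, simplified profile and writeback shapes (illustrative field names only).
interface Profile { tools?: unknown }
interface Writeback { profile: Profile; writebackError?: string }

// On a parse failure, return an unchanged copy plus the error; never mutate the input.
function applyToolsWinner(profile: Profile, winner: string): Writeback {
  const parsed = safeParseWinner(winner)
  if (!parsed.ok) return { profile: { ...profile }, writebackError: parsed.error }
  return { profile: { ...profile, tools: parsed.value } }
}
```

With this shape a caller could still read the gate decision and lift after a bad winner, and decide separately what to do about the failed writeback.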
🟠 MEDIUM MCP tool wire-format rename breaks external references — src/mcp/tools/coordination.ts
Lines 336,360,377: Tool names changed from spawn_worker→spawn_agent, observe_worker→observe_agent, steer_worker→steer_agent. These are wire-format identifiers that MCP clients use to call tools. Tests updated (tests/loops/coordination.test.ts:65-119 use new names, 54 tests pass). However bench/src/profiles.ts:22-98, bench/src/atom-humaneval.mts:96, bench/src/atom-mcp-e2e.mts:178, and bench/src/mcp-mount-probe.mts:89-119 still reference old tool name strings in LLM prompts — these bench files will silently fail when mounting the coordination MCP after this rename lands. They may be updated in other shots of this PR (80 files total). Verify all bench references
🟠 MEDIUM finalizeBestDelivered has zero tests — src/runtime/supervise/coordination-driver.ts
finalizeBestDelivered (line 198) is a new public export used by both coordination-driver.ts and supervisor-agent.ts for the completion-oracle result selection. It filters settled children by status==='done' AND valid===true, then argmax on score. Zero direct tests. Covered only indirectly through coordinationDriverAgent and supervisorAgent integration tests. Missing cases: no delivered children (returns undefined), tie-breaking on equal scores, undefined score handling, missing outRef, and the valid===true gate (a worker that settled 'done' but wasn't valid should NOT be selected). Add unit tests for these scenarios.
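The selection rule described in this finding (keep only children that settled `done` and are valid, then take the highest score) can be sketched as below, mainly to make the missing test cases concrete. This is a stand-in, not the function in coordination-driver.ts: the `SettledChild` shape and the name `pickBestDelivered` are hypothetical, and treating an undefined score as lowest and keeping the earliest child on ties are assumptions the real implementation may not share.

```typescript
// Hypothetical shape; the real settled-child type in the repo may differ.
interface SettledChild {
  id: string
  status: 'done' | 'failed' | 'aborted'
  valid?: boolean
  score?: number
  outRef?: string
}

// Assumed rule: delivered = status 'done' AND valid === true; best = highest score.
// Undefined scores rank lowest and ties keep the earliest child (both are assumptions).
function pickBestDelivered(children: readonly SettledChild[]): SettledChild | undefined {
  let best: SettledChild | undefined
  for (const child of children) {
    if (child.status !== 'done' || child.valid !== true) continue
    const score = child.score ?? Number.NEGATIVE_INFINITY
    const bestScore = best?.score ?? Number.NEGATIVE_INFINITY
    if (best === undefined || score > bestScore) best = child
  }
  return best
}
```

Each branch here maps to one of the cases the finding lists: an empty or all-rejected input returns `undefined`, a `done` but invalid child is never selected, and the tie and undefined-score behaviour becomes an explicit, testable choice.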
🟠 MEDIUM runBrainLoop has zero direct tests — src/runtime/tool-loop.ts
runBrainLoop (114 lines) is the new shared tool-loop skeleton extracted from router-client.ts and coordination-driver.ts. It is the canonical agentic tool-loop — every brain drives through it. Yet it has zero dedicated tests. The existing src/tool-loop.test.ts tests runToolLoop/streamToolLoop (the OLD functions, not the new runBrainLoop). Indirect coverage comes from tests/loops/router-brain.test.ts (which tests routerBrain, the thin adapter over runBrainLoop) and tests/loops/coordination-driver.test.ts (which exercises runBrainLoop through the driver agent). Missing coverage: stopBefore/beforeTurn/onUsage hook semantics, malformed-arg degradation, message format correctness, edge cases like maxTurns=0 with hooks, and the final text fallback behavior. Add tests covering: (1) hook invocatio
🟡 LOW Cross-doc ref to removed pages verified clean but module removal may surprise consumers — docs/api/README.md
Six module pages removed from the module listing (analyst-loop, audit, improvement, platform, topology, workflow) without deprecation notice. The typedoc.json entryPoint array was reduced from 12 to 6 entries. Source modules still exist on disk and are importable — this is purely a documentation visibility change. No stale cross-references to removed pages found. Consumers navigating from old bookmarks to these pages will get 404s; consider a redirect or deprecation note.
🟡 LOW Public API surface reduction not surfaced as a finding in docs — docs/api/runtime.md
runtime.md drops documentation for ~20 previously-public symbols (RootHandle, RootSignal, SpawnEvent, Runtime, ExecutorFactory, Restart, NodeStatus, NodeId, ToolPartDecoder, contentAddress, replaySpawnTree, materializeTreeView, createSandboxForSpec, depthDriver, breadthDriver, driverChild, isDriverSpec, withDriverExecutor, routerDriverChat, driverRuntime, driverExecutorFactory, createRootHandle, toToolSpan, touchedPathsFromPatch, countDiffLines, isNonEmptyPatch, touchesSecretPath, runCoderChecks, createPartsTraceSource, decodeOpencodePart, decodeAnthropicPart, decodeOpenAiPart, toolPartDecoders, supervisorSkill). Verified these are no longer exported from src/runtime/index.ts at HEAD (base exported them at [lines 30](https://github.com/tangle-network/agent-runtime/blob/bf0752345cea05968500
🟡 LOW Undocumented SpawnJournal interface referenced as plain text in runtime.md — docs/api/runtime.md
InMemorySpawnJournal (line 85) states 'Implements SpawnJournal' but SpawnJournal is no longer documented on the runtime page (removed from `src/runtime/index.ts` type re-exports at line 371-388). The reference renders as plain text without a link: not broken, but the interface's contract is invisible to readers. Consider either re-exporting SpawnJournal from the runtime entry point or documenting its contract in InMemorySpawnJournal's own docs.
🟡 LOW No trailing newline at EOF — docs/canonical-api.md
The file ends without a final newline (line 78 is the last line, matching the prior version's 'No newline at end of file'). Trivial; most renderers are unaffected, but some linters/formatters flag it. Not blocking.
🟡 LOW runPersonaConversation location claim inaccurate — docs/canonical-api.md
Line 38: doc claims runPersonaConversation is at 'root . (also /loops)' but the function is only exported from src/index.ts (root '.'), NOT from src/runtime/index.ts (./loops). Verified: grep -c 'runPersonaConversation' src/runtime/index.ts returns 0. Fix: remove '(also /loops)' or export it from the runtime barrel.
🟡 LOW supervise() decision-table row implies backend is required; it is optional — docs/canonical-api.md
Line 35 shows `supervise(profile, task, { backend, budget })` as the canonical call, but src/runtime/supervise/supervise.ts:87-93 throws only when NEITHER opts.backend NOR opts.makeWorkerAgent is provided; backend is the happy-path default, not strictly required. Acceptable framing for a decision table ('scaffolding defaulted'), and the precise signature now lives in generated docs/api/. Informational only; no action required unless you want to note 'backend | makeWorkerAgent'.
🟡 LOW Force-cast of SandboxClient breaks type safety at the example seam — examples/supervisor-loop/run-sandbox.ts
Line 42: `new SandboxClient(...) as unknown as RuntimeSandboxClient`. If the sandbox SDK's `SandboxClient` shape diverges from the runtime's internal port, this compiles silently but fails at runtime when `createExecutor` calls the client. Same pre-existing pattern; a proper type guard or import from the runtime's public SandboxClient would catch mismatches at build time.
🟡 LOW run-supervisor-mcp.ts workerFromBackend+MCP path has no offline test — examples/supervisor-loop/run-supervisor-mcp.ts
run-supervisor-mcp.ts:130 wires workerFromBackend(backend, {check: demoCheck,...}) into serveCoordinationMcp's makeWorkerAgent. Unlike the supervise()+scripted-brain path (covered by tests/supervisor-loop-example.test.ts), this real-MCP path (supervisor harness calling spawn_agent over HTTP) is only exercisable against a live cli-bridge, so the workerFromBackend+gateOnDeliverable composition for the 'bridge' backend is unverified by CI. The substrate pieces are individually tested (coordination-mcp.test.ts, completion-gate tests), and the example throws clear errors on missing env, so this is a coverage gap not a defect. No fix required for merge; flagging so the gap is visible.
🟡 LOW demoCheck object-with-content branch only stringifies content one level deep — examples/supervisor-loop/shared.ts
shared.ts:18-23: the check does String(content).includes(expectedAnswer) when `out` is an object with a `content` field, else JSON.stringify(out). If a backend ever settles `out = { content: { text: 'ANSWER=42' } }` (content itself an object), String({text:...}) yields '[object Object]' and the marker is missed, returning false even though the answer is nested inside. The fall-through JSON branch never runs because the `content` key short-circuits. Impact is nil for the current demo (bridge executor yields content as a string; the test asserts exactly `{content: expectedAnswer}`), and a throwing/missing check is fail-closed via gateOnDeliverable. Note only: if the sandbox event-stream shape ever nests the marker under a non-string content field, this oracle would false-negative. Fix if it
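A check that avoids the `[object Object]` false negative could look like the sketch below. It is not the code in examples/supervisor-loop/shared.ts; `containsMarker` is a hypothetical name, and the approach (search string content directly, otherwise search the JSON form of the whole output) is one assumed way to close the gap.

```typescript
// Hypothetical oracle: true when the marker appears anywhere in the worker output.
// Strings are searched directly; anything else is searched in its JSON form, so a
// marker nested under a non-string `content` field is still found.
function containsMarker(out: unknown, marker: string): boolean {
  if (typeof out === 'string') return out.includes(marker)
  if (out !== null && typeof out === 'object' && 'content' in out) {
    const content = (out as { content: unknown }).content
    if (typeof content === 'string') return content.includes(marker)
  }
  // JSON.stringify yields undefined for undefined or function inputs, hence the fallback.
  return (JSON.stringify(out) ?? '').includes(marker)
}
```

Two caveats with this sketch: JSON.stringify throws on circular structures, and a marker containing quotes or backslashes would be escaped in the JSON form, so a production oracle would need to handle both.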
🟡 LOW scriptedSupervisorChat ignores tools parameter from ToolLoopChat signature — examples/supervisor-loop/shared.ts
Line 70: The `ToolLoopChat` type expects `(messages, tools) => Promise<...>` but `scriptedSupervisorChat` returns `(messages) => Promise.resolve(...)`, dropping the `tools` parameter. TypeScript allows this but it means the scripted brain can never inspect tool schemas. For a fixed plan this is fine, but consumers porting this pattern to a real brain may miss that the contract includes tool awareness. Non-blocking.
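The assignability point can be seen in isolation with simplified stand-in types. Everything below is hypothetical (the real `ToolLoopChat`, message, and reply types are not reproduced), and `unknownToolCalls` is an illustrative guard that a tool-aware brain might run, not something the PR contains.

```typescript
// Hypothetical, simplified stand-ins for the seam described in the finding.
interface ChatMessage { role: string; content: string }
interface ToolSchema { name: string }
interface ToolCall { name: string; args: unknown }
interface ChatReply { content: string; toolCalls: ToolCall[] }
type ToolLoopChat = (messages: ChatMessage[], tools: ToolSchema[]) => Promise<ChatReply>

// Compiles even though `tools` is never declared: TypeScript lets a function
// with fewer parameters satisfy a type whose callers pass more.
const fixedPlan: ToolLoopChat = (_messages) =>
  Promise.resolve({ content: 'done', toolCalls: [] })

// Hypothetical guard for a tool-aware brain: returns the scripted tool names
// that are missing from the schemas the loop actually offered.
function unknownToolCalls(reply: ChatReply, tools: ToolSchema[]): string[] {
  const offered = new Set(tools.map((tool) => tool.name))
  return reply.toolCalls.map((call) => call.name).filter((name) => !offered.has(name))
}
```

A scripted brain that accepted `tools` and ran a guard like this would surface a renamed tool (for example a plan still calling `spawn_worker`) as an explicit error rather than a tool-not-found at call time.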
🟡 LOW Default reflection model triggers real LLM spend with no caller opt-in — src/improvement/improve.ts
Line 88 `defaultReflectionModel = 'deepseek-v4-flash'` + line 102 `llm?.model ?? defaultReflectionModel` means a caller running `improve(profile, [], { surface: 'prompt', scenarios, judge, agent })` with no `llm` and `OPENAI_API_KEY` in env silently triggers real router calls on a model they didn't choose. The substrate's `expectUsage: 'assert'` default only catches stub cells, not unwanted spend. For an @experimental API this is borderline acceptable, but the module docstring should state that omitting `opts.llm` for 'prompt'/'skills' s
🟡 LOW No runtime shape validation of JSON.parse winner against the target profile field — src/improvement/improve.ts
Even when `JSON.parse(winner)` succeeds (lines 152-158), the parsed value is assigned directly to `profile.tools`/`profile.mcp`/`profile.hooks`/`profile.resources.skills` with no structural validation against the AgentProfile contracts (e.g. `Record<string, AgentProfileMcpServer>`, `Record<string, AgentProfileHookCommand[]>`, `AgentProfileResourceRef[]`). A driver producing syntactically-valid-but-semantically-wrong JSON writes a corrupt value into the returned profile. Low severity because the default drivers only cover prompt (string) and skills (round-trips through JSON.stringify/parse of the same array); the risk is confined to caller-supplied driv
🟡 LOW Test coverage limited to prompt surface; 4 of 5 surfaces and the JSON.parse failure path untested — src/improvement/improve.ts
tests/improve.test.ts has 3 tests: gate:none baseline-only (prompt), ship-verbatim writeback (prompt), ConfigError for tools-without-generator. Not covered: applyWinnerToProfile for skills/tools/mcp/hooks, the JSON.parse throw path on a malformed winner, the CodeSurface-winner branch (typeof !== 'string'), the defaultReflectionModel fallback when opts.llm is unset, and llmClientOptions() field projection. The surface matrix the facade exposes is larger than what is verified. Add at minimum: (a) a test that a malformed winner string does not discard raw/lift/gateDecision, (b) one skills-surface writeback round-trip.
🟡 LOW applyWinnerToProfile returns input reference for CodeSurface, not a copy — src/improvement/improve.ts
Lines 136-137 docstring: 'Returns a shallow copy; never mutates the input profile.' Line 147: `if (typeof winner !== 'string') return profile` returns the raw input reference for `CodeSurface` winners, not a copy. Line 160: `case 'code': return profile` has the same issue. While documented in the inline comment on [lines 143-146](https://github.com/tangle-network/agent-runtime/blob/bf0752345cea05968500b158ecf9cab
🟡 LOW Import path narrowed from barrel to direct module — src/mcp/detached-turn.ts
Line 39: Import changed from 'import { createSandboxForSpec } from '../runtime'' to 'import { createSandboxForSpec } from '../runtime/run-loop''. This matches the removal of createSandboxForSpec from the barrel re-export in src/runtime/index.ts (confirmed via git diff: -export { createSandboxForSpec, defaultSelectWinner, runLoop } → +export { defaultSelectWinner, runLoop }). Coordinated, correct — no other src/ consumers imported createSandboxForSpec from the barrel (verified via rg across src/). The narrowing is good practice for tree-shaking.
🟡 LOW No direct test coverage for the renamed resume driver symbol — src/mcp/detached-turn.ts
grep across test/ finds zero references to either createDetachedTurnResumeDriver or createDriveTurnResumeDriver. The function is exercised indirectly through bin.ts wiring, but the rename would not be caught by a direct unit test if a regression were introduced. This is pre-existing (base also had no direct tests), not introduced by this diff. Low risk since the change is identifier-only and typecheck confirms consistency.
🟡 LOW MCP tool names renamed without backward-compat alias — breaking for external clients referencing old names — src/mcp/tools/coordination.ts
Tool names changed from spawn_worker/observe_worker/steer_worker to spawn_agent/observe_agent/steer_agent (lines 336, 360, 377). Any external MCP client or persisted prompt string that calls the old tool names will get a tool-not-found error. This is an intentional vocabulary migration (the entire PR renames worker→agent across 80 files), and src/runtime/supervise/* already uses *_agent consistently, so it is architecturally coherent. Since these tools are @experimental and served by an in-process MCP server (not a public stable API), the break is acceptable. No action required for this PR, but worth noting if any downstream consumer persists tool-call transcr
🟡 LOW Public API feature removal: durable resume and loop journal removed without deprecation — src/runtime/index.ts
The PR removes from the public surface: createFileRunContext, FileLoopJournal, InMemoryLoopJournal, LoopJournal, LoopJournalEntry, FileSpawnJournal, FileResultBlobStore, materializeTreeView, replaySpawnTree, contentAddress, FileLoopJournal, driver-executor exports (driverChild, driverExecutorFactory, driverRuntime, isDriverSpec, withDriverExecutor), patch-checks exports (runCoderChecks, touchesSecretPath, countDiffLines, isNonEmptyPatch, touchedPathsFromPatch), trace-source exports (createPartsTraceSource, decodeAnthropicPart, etc.), and RunLoopOptions.journal. The internal implementations still exist (used by supervisor.ts, scope.ts, strategy.ts, persona.ts, etc.) — only the re-exports were removed. This is a major public API contraction. Consumers importing any of these from the package
🟡 LOW supervise() default runId collision risk — src/runtime/supervise/supervise.ts
supervise() defaults runId to 'supervise' (line 107). Two concurrent calls using defaults will both beginTree('supervise') on the InMemorySpawnJournal. The journal's uniqueness guard would reject the second beginTree. While concurrent in-memory calls are unusual, the failure mode is opaque. Consider generating a unique default runId (e.g., `supervise-${randomUUID()}`) or documenting that callers must provide a unique runId for concurrent runs.
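A minimal sketch of the unique-default idea, assuming a hypothetical `RunIdOptions` shape and helper name `resolveRunId` (neither is taken from `supervise.ts`); only `randomUUID` from `node:crypto` is a real API here:

```typescript
import { randomUUID } from 'node:crypto'

// Hypothetical options shape: only `runId` matters for this sketch.
interface RunIdOptions {
  runId?: string
}

// Use the caller's runId when given; otherwise mint a unique one so two
// concurrent default calls never share a journal tree id.
function resolveRunId(opts: RunIdOptions = {}): string {
  return opts.runId ?? `supervise-${randomUUID()}`
}

console.log(resolveRunId() !== resolveRunId()) // true: default ids are distinct
console.log(resolveRunId({ runId: 'nightly-eval' })) // nightly-eval
```

An explicit `runId` still wins, so callers who rely on a stable id for replay keep that behavior.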
🟡 LOW supervise() hardcodes in-memory stores, no crash durability — src/runtime/supervise/supervise.ts
supervise() always calls createInMemoryRunContext({ withDriver: true }) on line 81. The old path had createFileRunContext(dir) for file-backed journal/blob durability. The supervise() convenience function cannot survive a process crash. While this is a deliberate simplification (the raw createSupervisor().run() path still accepts any SpawnJournal/ResultBlobStore), callers who relied on createFileRunContext for durable supervised runs now have no convenience function. Document this limitation, or add an optional `dir` parameter to SuperviseOptions that routes to file-backed stores.
🟡 LOW workerFromBackend constructs executor with throwaway signal and empty seams — src/runtime/supervise/supervise.ts
Lines 26-29: `const ctx: ExecutorContext = { signal: new AbortController().signal, seams: {} }`. The AbortController is discarded, so the construction-time signal can never fire. This is safe for current executors because runChild (scope.ts:585) passes the real `childAbort.signal` to `executor.execute(task, signal)` at run time, and createExecutor adds the backend-specific seam internally. But cliExecutor (runtime.ts:692-695) wires abort-detection at CONSTRUCTION time via `ctx.signal.addEventListener('abort', ...)`, so that listener will never fire through this path. For a cli-backend workerFromBackend, the execute-time signal still drives the kill via s
🟡 LOW workerFromBackend pre-constructs executor with stale context — src/runtime/supervise/supervise.ts
workerFromBackend (line 25) pre-constructs the executor via createExecutor(backend)(spec, ctx) with ctx = { signal: new AbortController().signal, seams: {} }. This executor is cached in executorSpec and later returned as-is by the registry's BYO factory path (ignoring the scope's proper abort signal and seams). Currently benign because executor.execute(task, signal) receives the proper scope signal at execution time and the built-in sandbox executor doesn't depend on construction-time seams. However, this is fragile — if a future executor implementation reads its abort signal or seams at construction time, it will see a never-aborted signal and empty seams
🟡 LOW Naming collision: src/runtime/tool-loop.ts vs existing src/tool-loop.ts — src/runtime/tool-loop.ts
The new `src/runtime/tool-loop.ts` exports `runBrainLoop`/`ToolLoopChat` (the supervisor brain seam). The pre-existing `src/tool-loop.ts` exports `runToolLoop`/`streamToolLoop` (the interactive chat tool-dispatch loop). Two files named `tool-loop.ts` in adjacent directories with overlapping vocabulary but different purposes are a navigation/maintainability trap. Not a bug. Consider renaming to `brain-loop.ts` or `agent-tool-loop.ts` to disambiguate.
🟡 LOW No dedicated tests for new runBrainLoop, supervise, supervisorAgent, or routerBrain — src/runtime/tool-loop.ts
All three new public API entrypoints (runBrainLoop in tool-loop.ts, supervise/workerFromBackend in supervise.ts, supervisorAgent in supervisor-agent.ts) and the new routerBrain export have no test files. Existing tests (supervise.test.ts: 35 tests) exercise the underlying createSupervisor/createScope directly, not the new convenience APIs. The runBrainLoop hooks contract (stopBefore/beforeTurn/onUsage ordering), the supervise() default-perWorker math, and the supervisorAgent sandbox-arm MCP lifecycle (serveCoordinationMcp → driveHarness → finalizeBestDelivered → mcp.close) are all untested at the integration level.
🟡 LOW runBrainLoop reports turns:maxTurns when stopBefore hook breaks the loop early — src/runtime/tool-loop.ts
Lines 93-94: the post-loop return is `return { ..., turns: maxTurns, ... }`. When `opts.hooks?.stopBefore?.(turn)` breaks at turn N (where N < maxTurns), the returned `turns` field says `maxTurns`, claiming more inference turns ran than actually did. The coordination-driver.ts consumer discards the return value (only calls finalize), and routerToolLoop passes no hooks, so no current consumer is affected. But runBrainLoop is an exported public API; any future consumer using hooks + reading `turns` gets an inaccurate equal-compute count. Fix: track actual turns executed (`let turnsTaken = 0; turnsTaken += 1` after each chat call) and return `turns: turnsTaken` po
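The turn-counting fix can be sketched as follows. This is a simplified stand-in, not the real `runBrainLoop`: the `Chat`, `LoopHooks`, and `LoopResult` shapes and the name `runLoopSketch` are assumptions made for illustration.

```typescript
// Hypothetical, simplified shapes: the real runBrainLoop signature is not shown in this review.
type Chat = (messages: string[]) => Promise<string>

interface LoopHooks {
  // Return true to halt before the given turn's inference call.
  stopBefore?: (turn: number) => boolean
}

interface LoopResult {
  final: string
  turns: number
}

async function runLoopSketch(chat: Chat, maxTurns: number, hooks: LoopHooks = {}): Promise<LoopResult> {
  const messages: string[] = []
  let lastText = ''
  let turnsTaken = 0 // counts only inference calls that actually ran
  for (let turn = 0; turn < maxTurns; turn += 1) {
    if (hooks.stopBefore?.(turn)) break
    lastText = await chat(messages)
    turnsTaken += 1
    messages.push(lastText)
  }
  // Report what ran, not the ceiling.
  return { final: lastText, turns: turnsTaken }
}

const echo: Chat = async (m) => `reply ${m.length}`
runLoopSketch(echo, 10, { stopBefore: (t) => t === 2 }).then((r) => console.log(r.turns)) // 2
```

With a hook that stops before turn 0, this version reports `turns: 0` and an empty `final`, which matches the compute actually spent.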
🟡 LOW Weakened assertion on tool result feedback in coordination-driver.test.ts — tests/loops/coordination-driver.test.ts
Line 139: Old assertion `toolMsgs.some((m) => m.name === 'await_event' && m.content.includes('done'))` changed to `toolMsgs.some((m) => String(m.content).includes('done'))`, which drops the tool-name check. The `toolMsgs.length >= 2` guard on line 138 still proves both tool results exist, but the name check gave extra defense against a mis-plumbed tool role. Low risk since the length guard + journal tree assertions ([lines 142-144](https://github.com/tangle-network/agent-runtime/blob/bf0752345cea05968500b158ecf9cab1
🟡 LOW scriptedBrain captures live message references — 'by turn N' feed-back proof is now testing final state, not per-turn snapshot — tests/loops/scripted-brain.ts
scriptedBrain (scripted-brain.ts:27) does `seen?.push(messages)` with NO spread/copy. The old per-file scriptedChat helpers did `seen.push([...input.messages])`. Because runBrainLoop (tool-loop.ts:63) reuses ONE growing messages array across all turns (passed by reference to chat at line 72), every seen[N] entry is the SAME array object. I proved this empirically: a 3-turn loop yields seen[0].length === seen[1].length === seen[2].length === 5 (final state) and seen[0] === seen[1] === seen[2] === true (same reference). Impact on coordination-driver.test.ts:136-139: the comment claims 'by turn 2 (the 3rd chat call), the conversation ... contains tool messages' bu
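The aliasing effect is easy to reproduce in isolation. The snippet below is a standalone illustration with made-up message strings, not the test helper itself:

```typescript
// Stand-in for the one growing conversation array a loop reuses across turns.
const messages: string[] = ['system', 'user']

const seenLive: string[][] = [] // pushes the live reference
const seenSnapshot: string[][] = [] // pushes a per-turn copy

for (let turn = 0; turn < 3; turn += 1) {
  seenLive.push(messages)
  seenSnapshot.push([...messages])
  messages.push(`tool-result-${turn}`) // conversation grows after each call
}

console.log(seenLive.map((m) => m.length)) // [ 5, 5, 5 ]: every entry shows the final state
console.log(seenSnapshot.map((m) => m.length)) // [ 2, 3, 4 ]: what each turn actually saw
console.log(seenLive[0] === seenLive[2]) // true: same array object
```

Only the snapshot variant can support a "by turn N" assertion; the live-reference variant can only assert on the final conversation.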
tangletools · 2026-06-20T21:56:52Z · trace
tangletools
left a comment
❌ 1 Blocking Finding — bf075234
Full multi-shot audit completed 8/8 planned shots over 80 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-20T21:56:52Z · immutable trace
…r-sim product evals, one-call)
…ove() (the intelligence loop)
tangletools
left a comment
✅ Auto-approved PR — 504f37e7
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-20T22:05:13Z
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 5 (1 medium-concern, 3 low, 1 weak-concern) |
| Heuristic | 0.1s |
| Duplication | 0.0s |
| Interrogation | 229.3s (2 bridge agents) |
| Total | 229.4s |
💰 Value — sound
Usability/DX overhaul that adds one-call facades (supervise/improve/evalPersona), unifies the supervisor brain seam (DriverChat→ToolLoopChat), deletes dead workflow+journal code (~3500 LOC), and gates docs against symbol drift — coherent, in-grain, no duplication.
- What it does: Adds one-call facades (supervise(), improve(), evalPersona()) that default raw seams a caller previously hand-wired; unifies the supervisor brain seam by replacing the DriverChat type zoo (a parallel type system used only by the driver) with the canonical ToolLoopChat used everywhere else, eliminating routerDriverChat (65 LOC → routerBrain at 4 LOC in src/runtime/router-client.ts:250); deletes the
- Goals it achieves: 1. A competent engineer can derive the call path in seconds — supervise(profile, task, {backend, budget}) instead of hand-wiring 5+ components across blobs/journal/executors/scopes. 2. The supervisor brain is resolved from profile data (profile.harness), not hand-built — the same backend-as-data resolution rule as workers (src/runtime/supervise/supervisor-agent.ts:62-122). 3. Dead/duplicated surfa
- Assessment: Good change on every dimension. The facades are thin (supervise() is ~120 LOC composing existing supervisorAgent + createInMemoryRunContext + createSupervisor().run(); improve() is a facade over agent-eval's selfImprove; evalPersona() wraps runPersonaConversation) — they default seams without hiding the raw ones. The DriverChat→ToolLoopChat unification (src/runtime/supervise/coordination-driver.ts
- Better / existing approach: none — this is the right approach. The facades correctly compose existing infrastructure rather than reinventing it (supervise() calls supervisorAgent+createSupervisor+createInMemoryRunContext; improve() delegates to agent-eval's selfImprove; evalPersona() delegates to runPersonaConversation). The brain unification replaces a bespoke seam with the canonical one that runBrainLoop already uses — it'
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🎯 Usefulness — sound-with-nits
Coherent usability overhaul: unifies supervisor brain resolution, collapses 13→6 subpaths, adds one-call facades (supervise/improve/evalPersona) that layer cleanly over existing seams, and gates docs freshness — all reachable, no pattern competition, net −9.8k LOC.
- Integration: All new capabilities are reachable through the established barrel/subpath structure. `supervise()` exported from `./loops` subpath, called by 3 production examples + 2 test suites (`tests/loops/supervise-convenience.test.ts`, `tests/supervisor-loop-example.test.ts`). `improve()` exported from root barrel, called by 2 examples + 1 test (`tests/improve.test.ts`). `evalPersona()` exported from root b
- Fit with existing patterns: Every new capability follows the codebase's established facade-over-substrate pattern. `supervise()` facades over `supervisorAgent()` + `createSupervisor().run()`, mirroring how the runtime already defaults blobs/perWorker/journal/executors. `improve()` facades over agent-eval's `selfImprove()` (`src/improvement/improve.ts:204`), picking default drivers by surface exactly as `supervise()` defaults
- Real-world viability: Core paths hold up. `supervise()` enforces model-policy before compute spend (`src/runtime/supervise/supervise.ts:88-94`), defaults per-worker budget to pool/4 (`src/runtime/supervise/supervise.ts:78-83`), and requires `backend` or `makeWorkerAgent` (fail-loud, no silent stub). `improve()` throws `ConfigError` for surfaces with no default driver (`src/improvement/improve.ts:196-199`, designed bou
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/improve/improve.ts
- console.log(`shipped: ${out.shipped} lift: ${out.lift.toFixed(3)} gate: ${out.gateDecision}`)
🟡 Cruft: commented out code scripts/check-docs-freshness.mjs
+// CLASS 1 version / substrate-peer pins != package.json
🟡 Cruft: magic number added tests/loops/supervisor-agent.test.ts
+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }
🎯 Usefulness Audit
🟠 Bench prompt prose references stale MCP tool names (spawn_worker → spawn_agent) [integration]
The MCP server in `src/mcp/tools/coordination.ts:363,403,420` exposes `spawn_agent`/`observe_agent`/`steer_agent`, but bench prompt prose still tells supervisor LLMs to call `spawn_worker`/`observe_worker`/`steer_worker` (`bench/src/profiles.ts:22-24,96,98`, `bench/src/atom-humaneval.mts:96`, `bench/src/mcp-mount-probe.mts:89-90`, `bench/src/atom-mcp-e2e.mts:6,178`). A supervisor bench run would fail because the LLM's tool calls wouldn't match the MCP server's tool registry. Tests use the correc
🟡 evalPersona uses as never casts bridging two AgentProfile types from different packages [robustness]
`src/conversation/eval-persona.ts:86-89` casts `worker`, `persona`, `backendFor`, and `systemPromptOf` `as never` to pass them to `runPersonaConversation`, which imports `AgentProfile` from `@tangle-network/agent-eval` while `evalPersona` uses `AgentProfile` from `@tangle-network/agent-interface`. The comment acknowledges this is a type-boundary cast. If the two packages' `AgentProfile` types diverge in structure, the facade silently passes mismatched data with no compiler error. Consider aligni
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
…prove() JSON.parse, test flagged fns
- HIGH: contentAddress was dropped from the runtime barrel by WS3 → bench/atom-humaneval + atom-mcp-e2e fail to compile (a content-addressing helper bench legitimately uses). Re-exported from the barrel.
- MEDIUM: applyWinnerToProfile's JSON.parse threw a raw SyntaxError after a ship verdict on a malformed winner → parseWinnerJson guards it with a typed ConfigError + a test.
- MEDIUM: finalizeBestDelivered + runBrainLoop had no direct tests → added focused unit tests (the blob store's content-address invariant is exercised).
- LOW: supervise() decision-table/README rows implied backend is required (it's optional) → { budget, backend? }.
tangletools
left a comment
✅ Auto-approved PR — b424ee2c
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-20T22:20:16Z
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 4 (1 medium-concern, 3 low) |
| Heuristic | 0.1s |
| Duplication | 0.0s |
| Interrogation | 297.2s (2 bridge agents) |
| Total | 297.3s |
💰 Value — sound
Usability/DX overhaul: unifies supervisor brain resolution to profile-driven (mirroring createExecutor), consolidates duplicated tool-loop code into shared runBrainLoop, shrinks surface 355→277 exports / 13→6 subpaths, adds supervise()/improve() one-call facades, adds CLASS-6 docs-symbol freshness ga
- What it does: This PR does seven things: (1) unifies the supervisor brain to resolve from `profile.harness`: `null` → in-process router tool-loop, a CLI harness → sandboxed MCP driver (`supervisorAgent.ts` is new, replacing hand-built `routerDriverChat`). (2) consolidates three copies of the while-for-turn tool-loop skeleton into one shared `runBrainLoop` in `src/runtime/tool-loop.ts:53`; both `routerToolLoop
- Goals it achieves: Make the engine usable by a competent engineer in seconds rather than hours. Achieves this by: (a) eliminating the 'brain is router-only' gap: now any harness can be the brain (sandbox supervisor path proven). (b) making the call path derivable from docs: the canonical-api decision table's START HERE row is `supervise(profile, task, { budget })`, and every backticked symbol it names is mechanica
- Assessment: Good change, well-executed. The brain-unification (resolved from `profile.harness`, mirroring `createExecutor({backend})`) is symmetric with the existing worker-resolution pattern and eliminates the 'driver brain is router-only' gap the PR targets. The tool-loop consolidation (`runBrainLoop`) removes genuine code duplication: before this PR, `routerToolLoop` and `coordinationDriverAgent` each had
- Better / existing approach: none; this is the right approach. Searched for pre-existing patterns that do what `supervise()` or `improve()` or `runBrainLoop` do: `supervise()` is a new facade wrapping `createSupervisor().run()` (which was the raw primitive, not a convenience caller), `improve()` is a new facade over agent-eval's `selfImprove` (which is the same loop but without the generator-defaulting/profile-apply convenie
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🎯 Usefulness — sound-with-nits
Well-executed usability overhaul: every new convenience facade (supervise, improve, evalPersona) follows the same one-call-with-defaults pattern, wraps existing lower-level engines without competing, and has real callers in tests + examples; the surface shrink and naming cleanup are complete and con
- Integration: All new exports are reachable. supervise() has 8 test callers + 3 examples + is called internally by supervisorAgent() (src/runtime/supervise/supervise.ts:110). improve() has 6 test callers + 2 runnable examples. evalPersona() has 3 test callers + 2 example sites. routerBrain has 4 tests + 2 examples + bench callers + internal use in supervisorAgent(). assertModelAllowed is called by both supervis
- Fit with existing patterns: The three convenience facades follow a cohesive pattern mirroring each other: one call with sensible defaults, power-user seams underneath accessible for full control. They don't compete with existing patterns — supervise() wraps createSupervisor/supervisorAgent(), improve() wraps agent-eval's selfImprove, evalPersona() wraps runPersonaConversation. The naming changes (depth/breadthDriver→Strategy
- Real-world viability: Config validation in supervise() and improve() fails before compute is spent — allowedModels guard, missing-backend check, missing-generator check all throw typed errors (ConfigError/ValidationError) at the top of the function. The JSON.parse fix in improve() (src/improvement/improve.ts:144-154) is properly try/catch-wrapped and tested (tests/improve.test.ts:138-159). The main robustness gap is th
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/improve/improve.ts
- console.log(`shipped: ${out.shipped} lift: ${out.lift.toFixed(3)} gate: ${out.gateDecision}`)
🟡 Cruft: commented out code scripts/check-docs-freshness.mjs
+// CLASS 1 version / substrate-peer pins != package.json
🟡 Cruft: magic number added tests/loops/supervisor-agent.test.ts
+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }
🎯 Usefulness Audit
🟠 evalPersona throws raw Error instead of ConfigError for missing credentials [ergonomics]
src/conversation/eval-persona.ts:70 throws `new Error('evalPersona: provide opts.{apiKey,baseUrl,model}...')` for missing backend credentials. Per the error taxonomy contract at src/errors.ts:11-12, consumer-facing API errors should be typed (ConfigError or ValidationError, both AgentEvalError subclasses). A caller catching `ConfigError` to handle config failures programmatically would miss this one. The same pattern exists in runPersonaConversation (src/conversation/run-persona.ts:148,158), mak
✅ No Blockers —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 16 | 41 | 16 |
| Confidence | 95 | 95 | 95 |
| Correctness | 16 | 41 | 16 |
| Security | 16 | 41 | 16 |
| Testing | 16 | 41 | 16 |
| Architecture | 16 | 41 | 16 |
Full multi-shot audit completed 8/8 planned shots over 93 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Four 'as never' casts bridge different AgentProfile types across packages — src/conversation/eval-persona.ts
Lines 86-89 cast `worker`, `persona`, `backendFor`, and `systemPromptOf` `as never` to bypass the type mismatch between `@tangle-network/agent-interface`'s AgentProfile (used by evalPersona) and `@tangle-network/agent-eval`'s AgentProfile (expected by runPersonaConversation). The runtime behavior is correct (runPersonaConversation never inspects profiles, only passes them through callbacks), but if either AgentProfile type changes incompatibly, TypeScript will not catch it. Consider a thin adapter type at this boundary instead of 'as never' casts.
🟠 MEDIUM EvalPersona type not exported from barrel index — src/conversation/index.ts
eval-persona.ts:56 exports `EvalPersona` (the discriminated union `{kind:'scripted',turns:string[]} | {kind:'profile',profile:AgentProfile}`). conversation/index.ts:32 only re-exports `EvalPersonaOptions` and `evalPersona`, omitting `EvalPersona`. Root src/index.ts:76 likewise only exports `EvalPersonaOptions`. Consumers calling `evalPersona(worker, persona, opts)` cannot type `persona` through the public barrel; they must use a path import. Fix: add `EvalPersona` to both barrel exports.
🟠 MEDIUM mergeBudget accepts negative budget values — can corrupt scope accounting — src/mcp/tools/coordination.ts
mergeBudget validates that budget fields are finite numbers (typeof v !== 'number' || !Number.isFinite(v)) but does NOT reject negative values. The scope's budget accounting in src/runtime/supervise/scope.ts:721-727 uses these values directly in <= comparisons (tokensOk = totalTokens <= budget.maxTokens) and ratio calculations (budget.maxTokens / totalTokens). A negative maxTokens produces a negative ratio, and results in all spawns being rejected since iterations > 0 but maxIterations < 0 → itersOk = false. While this is fail-closed (doesn't bypass the ceiling), it's a silent misbehavior rather than a clear error. The description says 'only set the ceilings this sub-task needs raised' — passing negative values violates the semantic contract but goes undetected. Recommend adding a >= 0 or
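One way to make the negative case fail loudly, sketched under assumptions: the `Budget` shape below mirrors only the two fields named in this finding, and `mergeBudgetSketch` is a hypothetical helper, not the function in `coordination.ts`.

```typescript
// Hypothetical budget shape, modeled on the two fields the finding names.
interface Budget {
  maxIterations: number
  maxTokens: number
}

// Merge caller overrides onto a base budget, rejecting any value that is
// not a finite, non-negative number instead of letting it reach accounting.
function mergeBudgetSketch(base: Budget, overrides: Partial<Budget>): Budget {
  const merged: Budget = { ...base }
  for (const key of Object.keys(overrides) as (keyof Budget)[]) {
    const v = overrides[key]
    if (v === undefined) continue
    if (typeof v !== 'number' || !Number.isFinite(v) || v < 0) {
      throw new RangeError(`budget.${key} must be a finite number >= 0, got ${String(v)}`)
    }
    merged[key] = v
  }
  return merged
}

console.log(mergeBudgetSketch({ maxIterations: 4, maxTokens: 1000 }, { maxTokens: 5000 }))
// { maxIterations: 4, maxTokens: 5000 }
```

Throwing at merge time turns the silent "every spawn is rejected" behavior into an error that names the offending field.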
🟠 MEDIUM runBrainLoop returns wrong turns count when stopBefore hook halts the loop early — src/runtime/tool-loop.ts
Lines 69–113: When `stopBefore` returns true, the `for` loop executes `break`, then falls through to `return { final: lastText, turns: maxTurns, ... }` on line 113. If `stopBefore` breaks on turn 1 (before any inference call), `turns` is reported as `maxTurns` (e.g., 2000 for coordination-driver maxTurns=0) instead of 0 (zero turns consumed). `final` is empty string. The only caller that sets `stopBefore` hooks (coordinationDriverAgent) ignores `runBrainLoop`'s return value, so this is currently latent. But the exported `runBrainLoop` i
🟡 LOW agent.md references now-undocumented analyst-loop types as unlinked plain text — docs/api/agent.md
Lines ~1512-1592: createSurfaceImprovementAdapter(), createSurfaceKnowledgeAdapter(), and measureOutcome() document return/param types ImprovementAdapter, KnowledgeAdapter, RunAnalystLoopResult as plain text (no markdown link). In the base these linked to analyst-loop.md; after the analyst-loop entry point was removed from typedoc.json (out-of-shot), TypeDoc downgraded them to unlinked text. The types still exist and are exported at src/analyst-loop/types.ts:25,51,119, so they remain importable public surface. Impact: a reader of agent.md cannot navigate to these type definitions, and the functions' contracts are partially opaque. This is a usability regression inherent to the entry-point-removal decision, not a mechanical regen bug. Fix (pick one): (a) re-add src/analyst-loop/index.ts as
🟡 LOW agent.md sandboxOverrides docstring still mentions removed createSandboxForSpec — docs/api/agent.md
`agent.md` line 1197 (in `CreateSandboxActOptions.sandboxOverrides`): docstring says 'forwarded to `createSandboxForSpec`' but `createSandboxForSpec` is no longer in the public API docs (it was removed from `docs/api/runtime.md` and is not re-exported from `src/runtime/index.ts`). The source comment in `agent/sandbox-act.ts` may need updating. Users reading docs will see a reference to an undocumented function. Low impact, since the docstring is source-derived.
🟡 LOW Pre-existing broken anchor: #budget-9 resolves nowhere in GFM — docs/api/runtime.md
`docs/api/mcp.md:4165` and `docs/api/index.md:3175` reference `runtime.md#budget-9`, but runtime.md has exactly one `### Budget` heading (line 8107), which generates anchor `#budget` in standard GFM. The `-9` suffix is a TypeDoc global-reflection-counter artifact, not a GFM dedup. This predates the PR but cross-refs from mcp.md/index.md won't resolve in Markdown renderers. Fix: suppress TypeDoc's anchor suffixing for single-occurrence headings, or add an explicit `<a id="budget-9">` tag.
🟡 LOW Pre-existing broken anchor: #sandboxclient-1 resolves nowhere in GFM — docs/api/runtime.md
9 cross-references across `agent.md:1130`, `mcp.md:680,712,881,1650,1905,1955`, `profiles.md:1543`, `index.md:3085` use `runtime.md#sandboxclient-1`. Single `### SandboxClient` heading at line 9748 → GFM anchor is `#sandboxclient`. The `-1` suffix prevents resolution. Same TypeDoc systemic issue as #budget-9.
🟡 LOW Pre-existing broken anchor: #scope-1 resolves nowhere in GFM — docs/api/runtime.md
`mcp.md:4147` references `runtime.md#scope-1`. Single `### Scope` heading at line 8176 → GFM anchor is `#scope`. Same TypeDoc systemic issue.
🟡 LOW Pre-existing wrong-target anchor: #kind-3 resolves to LoopPlanDescription.kind, not LoopSandboxPlacement.kind — docs/api/runtime.md
`mcp.md:2380` uses `runtime.md#kind-3` in context `LoopSandboxPlacement.kind`. But `kind-3` resolves to `LoopPlanDescription.kind` at line 9575 (the 4th `kind` heading), while `LoopSandboxPlacement.kind` at line 9924 should be `#kind-4`. Off-by-one from a `###### kind` subheading at line 2861 that TypeDoc counted. Pre-existing semantic mis-target.
🟡 LOW Case-sensitive haltOn predicate may never fire — examples/product-eval/product-eval.ts
Line 78: `haltOn: (ctx) => ctx.lastTurn.text.includes('RESOLVED')` is case-sensitive. The adversary system prompt (line 65) says 'Say the literal word RESOLVED', matching intent. However, LLMs frequently vary casing (RESOLVED / Resolved / resolved). If the model emits lower- or mixed-case, the halt predicate never matches and the run always exhausts maxTurns: 8. The maxTurns backstop prevents infinite loops, so this is not a correctness bug, just a fragility that users copy-pasting the example may run into. Mitigati
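A case-insensitive variant of the predicate might look like the sketch below. The `HaltContext` interface is an assumption covering only the field the predicate reads, and the whole-word match is a design choice of this sketch rather than something the example specifies:

```typescript
// Hypothetical context shape: only the field the predicate reads.
interface HaltContext {
  lastTurn: { text: string }
}

// Match the sentinel as a whole word, ignoring case, so "Resolved." and
// "resolved" halt the run but "unresolved" does not.
const haltOnResolved = (ctx: HaltContext): boolean => /\bresolved\b/i.test(ctx.lastTurn.text)

console.log(haltOnResolved({ lastTurn: { text: 'Great, that is Resolved.' } })) // true
console.log(haltOnResolved({ lastTurn: { text: 'This is still unresolved' } })) // false
```

The word boundary matters: a plain lower-cased `includes('resolved')` would also fire on "unresolved", which is the opposite of the intended signal.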
🟡 LOW Example is not covered by an automated runtime smoke; only typecheck guards it — examples/product-eval/product-eval.ts
package.json typecheck:examples runs tsc --noEmit on examples (verified clean for this file after build), and biome lints it, but no test executes product-eval.ts. A behavioral regression in evalPersona/runPersonaDispatch/runProfileMatrix that preserves types would silently break the example. Low severity because the example is documentation-by-code, not a shipped code path; the substrate functions it calls ARE unit-tested (eval-persona.test.ts, run-persona.test.ts). Optional nit: add a vitest test that imports the three cell functions with a fake backendFor and asserts they resolve, mirroring the eval-persona.test.ts offline pattern the README already points to.
🟡 LOW scoredCell hardcodes commitSha:'example' and a single scenario, so matrix output is illustrative only — examples/product-eval/product-eval.ts
Line 103: commitSha: 'example' and one scenario/profile. This is appropriate for an example (runProfileMatrix requires commitSha as a non-optional paper-grade field), and the default integrity:'assert' posture will pass because a real createOpenAICompatibleBackend is wired. Not a defect — noting only that anyone copy-pasting this as a real eval template must replace the sha and expand the corpus. README could add a one-line note that commitSha should be a real git SHA for reproducible records.
🟡 LOW maxTurns set in run-bridge.ts but omitted in run-sandbox.ts (sibling runners diverge) — examples/supervisor-loop/run-sandbox.ts
run-bridge.ts:90 passes maxTurns: 12 to supervise(); run-sandbox.ts:58-74 does not, so it falls back to the driver default of 16 (src/runtime/supervise/coordination-driver.ts:106). Not a correctness bug — both are bounded — but the two sibling runners are supposed to be the IDENTICAL supervisor differing only in the worker seam, and this is a silent behavioral asymmetry. Fix: add maxTurns: 12 to run-sandbox.ts's supervise() call for parity, or document why sandbox intentionally differs.
🟡 LOW sandboxClient cast through unknown is TypeScript-unsafe — examples/supervisor-loop/run-sandbox.ts
Line 42: `new SandboxClient({ apiKey, baseUrl }) as unknown as RuntimeSandboxClient`. If `@tangle-network/sandbox` diverges its `create()` method signature from the runtime's `SandboxClient` interface, this cast silently passes at compile time but fails at runtime when the executor calls `sandboxClient.create(...)`. Pre-existing pattern (not introduced here), but the sandbox runner is billed as a runnable example, and a runtime failure in the example erodes trust. Consider a runtime typeguard or documenting the exact sandbox SDK version required.
🟡 LOW Bridge error body embedded into thrown Error message (potential secret echo) — examples/supervisor-loop/run-supervisor-mcp.ts
run-supervisor-mcp.ts:106 throws `supervisor bridge ${res.status}: ${(await res.text()).slice(0, 300)}`. If the cli-bridge ever echoed the authorization bearer or request headers in a non-OK response body, up to 300 chars would land in the error message and any log/console that prints it. Low likelihood (the bridge is local/trusted), standard pattern, but for a path that explicitly passes `Bearer ${bridgeBearer}` (line 97) this is worth a 1-line sanitization if the bridge ever fronts non-local transports. Not blocking.
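The sanitization could be as small as the sketch below. `sanitizeErrorBody` is a hypothetical helper, and the `Bearer` token pattern is an illustrative assumption about what an echoed credential might look like, not the bridge's actual response format:

```typescript
// Redact anything that looks like a bearer credential, then truncate.
// The pattern is an illustrative assumption, not the bridge's actual format.
function sanitizeErrorBody(body: string, secret?: string, limit = 300): string {
  let out = body.replace(/Bearer\s+[A-Za-z0-9._~+\/=-]+/gi, 'Bearer [redacted]')
  // Also scrub the known secret verbatim, in case it appears without the scheme.
  if (secret) out = out.split(secret).join('[redacted]')
  return out.slice(0, limit)
}

console.log(sanitizeErrorBody('401: got Authorization: Bearer abc.def-123', 'abc.def-123'))
// 401: got Authorization: Bearer [redacted]
```

Redacting before truncating avoids leaving a partial token at the cut point.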
🟡 LOW workerFromBackend degrades worker-name uniqueness in traces — examples/supervisor-loop/run-supervisor-mcp.ts
Line 130: `workerFromBackend(backend, { check: demoCheck, ... })` names non-explicitly-named workers as `'worker'` (the static default in supervise.ts:32). The old `makeWorker` (deleted loop.ts) used a counter (worker-${counter.n++}) so each spawn-1/spawn-2 trace entry was distinguishable. When the supervisor omits a `name` in its `spawn_agent` profile, all workers trace as `worker`, making event-bus logs harder to read. No correctness impact (scope.spawn already uses unique handle IDs). Fix: accept a per-worker counter or label generator in the workerFromBackend seam, or document that the supervisor must supply names for readable traces.
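The label-generator option could be as small as the sketch below. `makeWorkerNamer` is a hypothetical helper, not part of the `workerFromBackend` seam; it keeps an explicit name when the supervisor supplies one and otherwise falls back to a counter, as the deleted `makeWorker` did.

```typescript
// Returns a namer that prefers the caller's explicit name and otherwise
// hands out `${prefix}-0`, `${prefix}-1`, ... in call order.
function makeWorkerNamer(prefix = "worker"): (explicit?: string) => string {
  let n = 0;
  return (explicit) =>
    explicit !== undefined && explicit.trim() !== ""
      ? explicit
      : `${prefix}-${n++}`;
}
```

Each `supervise()` call would create its own namer so counters do not leak between runs.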
🟡 LOW demoCheck JSON.stringify fallback is O(n) over the full worker output — examples/supervisor-loop/shared.ts
shared.ts:22 — for non-{content} shapes (the sandbox serialized event-stream case), demoCheck runs JSON.stringify(out ?? '') on every worker output to substring-search 'ANSWER=42'. Fine for the short example outputs; on a real box's multi-MB event stream this would stringify the whole stream per worker per gate. Acceptable for an example, but the comment at shared.ts:16-17 advertises this branch as the box path, so the cost should be noted. No fix required for examples; flag for any fleet reuse.
🟡 LOW No dedicated unit test for createSandboxAct adapter (pre-existing) — src/agent/sandbox-act.ts
No sandbox-act.test.ts exists in src/agent/. The adapter's event-mapping, output-promise settlement, and abort-signal plumbing are only exercised indirectly through integration run-paths. This is pre-existing — the diff does not change behavior or reduce coverage — but worth flagging because the file owns the eval/prod parity contract (per its header docblock) and a regression in settle/fail wiring (lines 79-88) or the raw-events buffer passed to output.parse (line 105) would silently corrupt scorecard grading. Not a blocker for this imp
🟡 LOW No profile-driven persona test in evalPersona.test.ts — src/conversation/eval-persona.test.ts
All 3 tests use `{ kind: 'scripted', turns: [...] }`. The `{ kind: 'profile', profile: AgentProfile }` path (which triggers the LLM user-sim persona and requires maxTurns) is exercised in run-persona.test.ts but not through the evalPersona facade. The `as never` cast on `persona as PersonaDriver` for a profile-kind persona is therefore never integration-tested from the facade level.
🟡 LOW No test for profile-kind persona through evalPersona facade — src/conversation/eval-persona.test.ts
All 3 tests use scripted personas. The profile-kind path through evalPersona (which exercises the default backendFor for BOTH worker and persona, plus the systemPromptOf default applied to the persona side) is only tested at the lower runPersonaConversation level (run-persona.test.ts:5 'runs a profile-driven persona end-to-end'). Since evalPersona adds its own defaulting layer and casts, a direct integration test for `{ kind: 'profile', profile }` through the facade would close the gap. Coverage is acceptable given the lower-level tests, but not complete.
🟡 LOW Doc claim about maxTurns:0 is inaccurate — it throws, not zero-turns — src/conversation/eval-persona.ts
eval-persona.ts:43 doc says 'maxTurns: 0 is zero turns, not run-until-done'. But define-conversation.ts:50 throws ValidationError for maxTurns < 1, so maxTurns:0 actually aborts at conversation-definition time, not 'zero turns'. The fail-loud behavior is better than documented, but the doc text is technically wrong. Fix: change to 'maxTurns: 0 is rejected (must be >= 1), not run-until-done'.
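The fail-loud behavior described here can be illustrated with a small stand-in. This is a sketch, not the code at define-conversation.ts:50: `assertMaxTurns` and this local `ValidationError` are hypothetical, and the integer check is an added assumption (the finding only states that values below 1 throw).

```typescript
// Local stand-in for the library's ValidationError.
class ValidationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ValidationError";
  }
}

// Reject maxTurns < 1 at definition time, so 0 never means "zero turns".
function assertMaxTurns(maxTurns: number): number {
  if (!Number.isInteger(maxTurns) || maxTurns < 1) {
    throw new ValidationError(`maxTurns must be an integer >= 1, got ${maxTurns}`);
  }
  return maxTurns;
}
```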
🟡 LOW Dual AgentProfile type clash requires as never escape hatches — src/conversation/eval-persona.ts
eval-persona.ts:86-89 uses `as never` on worker, backendFor, and systemPromptOf because run-persona.ts imports AgentProfile from @tangle-network/agent-eval (role/environment/domain shape) while eval-persona.ts imports from @tangle-network/agent-interface (prompt.systemPrompt shape). The comment at lines 82-85 correctly argues runtime safety: the runner treats the profile as an opaque token, only the callbacks inspect it, and the callbacks are typed for the agent-interface shape. This is sound but fragile: a future change to runPersonaConversation that inspects the profile directly would silently break. Consider unifying on one AgentProfile type or addin
🟡 LOW EvalPersona type not exported from public barrel — src/conversation/index.ts
index.ts:32 exports `type EvalPersonaOptions, evalPersona` but NOT `EvalPersona` (the persona union type). The low-level `PersonaDriver` IS exported (index.ts:58). Callers using `evalPersona` from the public barrel cannot type their persona argument without a deep import from './eval-persona'. Fix: add `type EvalPersona` to the export on line 32.
🟡 LOW mergeBudget does not reject negative budget ceilings — src/mcp/tools/coordination.ts
At coordination.ts:177-183, the `field` helper validates `Number.isFinite(v)` but not `v >= 0`. A caller passing `budget: { maxUsd: -1 }` would pass validation and produce a Budget with a negative ceiling. Downstream budget-pool logic (supervise/budget.ts:142) initializes `freeUsd = root.maxUsd ?? 0` and decrements, so a negative per-worker ceiling would cause the pool to deny admission (fail-closed), not an overspend. Impact is therefore low (no security or cost bypass), but a negative budget is nonsensical and the fail-loud philosophy of this function ('a malformed budget must never silently fall back') would be better served by rejecting negative values. This matches existing behavior (the perWorker default is also unvalidated for sign), so it is consistent rather than a regression. Fix: add `||
🟡 LOW mergeBudget silently ignores unknown budget fields — src/mcp/tools/coordination.ts
At coordination.ts:184-187, `field()` reads exactly 4 hardcoded keys (maxIterations/maxTokens/maxUsd/deadlineMs). A caller passing `{ maxIteration: 5 }` (typo, missing 's') or any unknown key would have it silently ignored, with no validation error. The fail-loud contract stated in the comment (lines 170-172) is about malformed known fields, not unknown ones. Low impact: the conserved pool still fences, and the merge correctly applies only known ceilings. But a typo'd ceiling would silently fall back to the default, which is the exact 'nobody chose' scenario the comment warns against for other cases. Fix: optionally detect and reject keys not in the known set.
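Both mergeBudget findings (negative ceilings and unknown keys) could be closed by a validation pass along the lines of this sketch. `parseBudget` and the local `Budget` type are hypothetical; only the four key names come from the finding, and the real function merges against per-worker defaults, which is omitted here.

```typescript
// The four ceilings the finding says field() reads.
const BUDGET_KEYS = ["maxIterations", "maxTokens", "maxUsd", "deadlineMs"] as const;
type BudgetKey = (typeof BUDGET_KEYS)[number];
type Budget = Partial<Record<BudgetKey, number>>;

// Fail loud on typo'd keys and on non-finite or negative ceilings.
function parseBudget(input: Record<string, unknown>): Budget {
  const known = new Set<string>(BUDGET_KEYS);
  const unknownKeys = Object.keys(input).filter((k) => !known.has(k));
  if (unknownKeys.length > 0) {
    throw new Error(`unknown budget field(s): ${unknownKeys.join(", ")}`);
  }
  const out: Budget = {};
  for (const key of BUDGET_KEYS) {
    const v = input[key];
    if (v === undefined) continue; // unset ceilings are left to the caller's defaults
    if (typeof v !== "number" || !Number.isFinite(v) || v < 0) {
      throw new Error(`budget.${key} must be a finite, non-negative number`);
    }
    out[key] = v;
  }
  return out;
}
```

With this in place `{ maxIteration: 5 }` and `{ maxUsd: -1 }` both throw instead of silently falling back to a default.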
🟡 LOW Multiple public export removals (major API surface reduction) — src/runtime/index.ts
Removed exports: `FileResultBlobStore`, `FileSpawnJournal`, `materializeTreeView`, `replaySpawnTree`, `FileLoopJournal`, `InMemoryLoopJournal`, `LoopJournal`, `createFileRunContext`, `routerDriverChat`, all driver-executor exports (`driverChild`, `driverExecutorFactory`, `driverRuntime`, `isDriverSpec`, `withDriverExecutor`), all patch-checks exports (`countDiffLines`, `isNonEmptyPatch`, `runCoderChecks`, `touchedPathsFromPatch`, `touchesSecretPath`, `CoderCheckConstraints`, `CoderCheckInput`), all trace-source decode/decoder types (`decodeAnthropicPart`, `decodeOpenAiPart`, `decodeOpencodePart`, `SessionMessageLike`, `ToolPartDecoder`, `ToolStepInput`, `toolPartDecoders`, `toToolSpan`, `createPartsTraceSource`), all scope seam exports (`NestedScopeSeam`, `nestedScopeSeamKey`), all runtime
🟡 LOW workerFromBackend builds executor with a dead-end abort signal the scope can't cascade through — src/runtime/supervise/supervise.ts
Line 36: `workerFromBackend` constructs the executor with `ctx.signal = new AbortController().signal`. When `scope.spawn` later resolves this BYO executor through the registry, the `ExecutorRegistry.resolve` factory (runtime.ts:1193-1195) returns the pre-built executor verbatim; it ignores the scope's `childAbort.signal` in the new context. The executor's internal controller was linked at factory time to the dead-end signal. However, `executor.execute(task, signal)` receives `childAbort.signal` as its `signal` parameter (scope.ts:585), and the built-in executors merge this with their internal controller, so execute-time abort propagation works correctly.
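The execute-time merge that makes this safe can be pictured with the sketch below. `linkAbort` is a hypothetical helper, not the runtime's code: it forwards an abort from the signal passed to `execute()` into the executor's own controller, which is why the dead constructor signal does not block cancellation.

```typescript
// Forward an abort from `parent` (the scope's child signal) into `child`
// (the executor's internal controller). Handles the already-aborted case.
function linkAbort(parent: AbortSignal, child: AbortController): void {
  if (parent.aborted) {
    child.abort();
    return;
  }
  parent.addEventListener("abort", () => child.abort(), { once: true });
}
```

An executor that instead captured only its constructor-time signal would never see the scope's cancellation, which is the failure mode the finding rules out for the built-in executors.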
🟡 LOW supervise() defaults maxDepth to 8 while createSupervisor uses 4 — undocumented divergence — src/runtime/supervise/supervise.ts
Line 122: `maxDepth: opts.maxDepth ?? 8`. The supervisor's own `defaultMaxDepth = 4` (supervisor.ts:68) is documented as 'paired with the conserved pool so a runaway recursion hits budget-exhaustion first and depth-exceeded second (R3)'. The supervise() one-call API doubles this to 8 without a comment explaining why. This is likely intentional (the convenience API allows deeper decomposition trees) but a reader would expect the defaults to match. Impact: low; the conserved pool is still the primary bound and maxDepth is a tripwire. But a user calling supervise() gets a different safety ceiling than one calling createSupervisor().run() directly. Fix: add a o
🟡 LOW workerFromBackend creates a detached AbortController signal that is never abortable — src/runtime/supervise/supervise.ts
Line 28: `const ctx: ExecutorContext = { signal: new AbortController().signal, seams: {} }`. This signal is created, passed to the executor factory, and then discarded; nobody holds a reference to the controller, so it can never be aborted. Verified harmless: scope.ts:585 (runChild) passes the REAL `childAbort.signal` to `executor.execute(task, signal)`, which is the signal that actually controls cancellation. The constructor signal is only used if an executor captures it internally before `execute()` is called. No built-in executor does this. Impact: none in practice, but the dead signal is misleading: a reader might think the worker is abortable thro
🟡 LOW runBrainLoop accesses r.toolCalls.length without null-guard (regression from old defensive code) — src/runtime/tool-loop.ts
At line 80 (`if (r.toolCalls.length === 0)`) and line 86 (`r.toolCalls.map(...)`), runBrainLoop accesses `toolCalls` directly. The old inline code in coordination-driver used `const calls = res.toolCalls ?? []` and routerToolLoop trusted routerChatWithTools (which always returns an array). The ToolLoopChat type contract requires `toolCalls: RouterToolCall[]`, so well-typed implementations are safe. But a runtime type violation (e.g., a brain returning `{ content: 'done' }` without toolCalls) would crash with TypeError instead of treating it a
🟡 LOW Dead 'seen' variable in maxTurns=0 inference-bounded test (driver-inference-metering.test.ts:294-335) — tests/loops/driver-inference-metering.test.ts
Line 294 declares `const seen: Array<...> = []`, line 300 pushes to it, and line 333 asserts `expect(seen.length).toBe(3)`. But since seen captures the messages array by reference (same issue as above), seen.length is just the brain-call count, equivalent to the existing `expect(n).toBe(3)` on [line 335](https://github.com/tangle-network/agent-runtime/b
🟡 LOW scriptedBrain captures messages array by reference, not copy — weakens per-turn 'seen' assertions — tests/loops/scripted-brain.ts
Line 22: `seen?.push(messages)` pushes a REFERENCE to the same array that runBrainLoop mutates across all turns (tool-loop.ts:63 `const messages: Msg[] = [...]`, then pushes at lines 84 and 109). The old scriptedChat used `seen.push([...input.messages])` (shallow copy per turn). Result: all seen[i] entries alias the same final-state array. Tests that check seen[i] content (coordination-driver.test.ts:136-139 `const turn2Convo = seen[2]!`) still pass because the scripts stop at the checked turn, so the array's final state equals its
🟡 LOW scriptedBrain ignores ToolLoopChat's tools parameter — tests/loops/scripted-brain.ts
ToolLoopChat type is `(messages, tools) => Promise<...>` but scriptedBrain returns `async (messages) => ...` (line 19). This is valid TypeScript (callbacks may accept fewer params), but means the scripted brain can't verify tool specs match. For a test helper this is intentional: the scripted turns don't need tool validation. No fix required unless you want to add a `tools` assertion in debug builds.
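The by-reference capture flagged for scriptedBrain in the previous finding is easy to reproduce in isolation. The sketch below uses hypothetical stand-ins (`makeRecorder`, `runTurns`) rather than the real test helper or runBrainLoop; it only shows why a per-turn shallow copy is needed for `seen[i]` to reflect turn `i`.

```typescript
type Msg = { role: string; content: string };

// Records each turn's view of the conversation.
// copy=true snapshots the array; copy=false stores the live reference.
function makeRecorder(copy: boolean) {
  const seen: Msg[][] = [];
  const record = (messages: Msg[]): void => {
    seen.push(copy ? [...messages] : messages);
  };
  return { seen, record };
}

// Stand-in for a tool loop: one shared array, mutated after every turn.
function runTurns(record: (m: Msg[]) => void, turns: number): void {
  const messages: Msg[] = [{ role: "user", content: "start" }];
  for (let i = 0; i < turns; i++) {
    record(messages);
    messages.push({ role: "assistant", content: `turn ${i}` });
  }
}
```

After three turns, every entry recorded by reference has the final length, while the copied entries keep the length they had when recorded.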
🟡 LOW SANDBOX arm supervisor test has no explicit fetch cleanup — tests/loops/supervisor-agent.test.ts
Lines 91-110: the SANDBOX harness test drives real HTTP→MCP via global `fetch` without `vi.stubGlobal` setup/teardown. This is actually the correct integration pattern (exercising real plumbing, not a mock), but no `afterEach`/`afterAll` closes the MCP server explicitly; `close()` is called inside the agent's `act`. If the server port leaks across tests, a subsequent test bind could fail. Current structure looks safe since each test creates its own scope+server, but adding an explicit close in cleanup would harden it.
tangletools · 2026-06-20T22:52:26Z · trace
Addressed in Fixed (real):
Intentional (by design, not a fix):
The remaining LOW findings are nitpicks or already stale (the |
…face-shrink (#352) The src/platform/ clients (PlatformAuthClient cross-site SSO, PlatformHubClient /v1/hub integrations) were still present but un-exported after the subpath collapse; 5 product agents import them. Re-add the export + tsup entry. 0.70.1.
What & why
The engine here is real (recursive supervisor, conserved-budget Scope, completion oracle, the eval/improvement substrate) — but it had become hard to use correctly: ~998 public exports, a canonical doc that named symbols which didn't exist, and a supervisor "brain" that only worked on the router. A competent integrator couldn't derive the call path from the docs.
This PR is a usability/DX overhaul. The bar: a competent engineer derives the call path in seconds, and capabilities stay intact.
What changed (7 workstreams)
`DriverChat` zoo. `supervise(profile, task, opts)` / `supervisorAgent(profile, deps)` resolve the brain from `profile.harness` exactly like `createExecutor({ backend })` resolves a worker: `null` → in-process router tool-loop, `claude-code`/`opencode`/`codex` → a sandboxed harness driving the coordination verbs via `serveCoordinationMcp`. Sandbox-supervisor proven offline. Closes the critique's A2 ("driver brain is router-only").
`spawn_worker` → `spawn_agent` (+ `observe_agent`/`steer_agent`; the verbs operate on any spawned agent, incl. a sub-supervisor); `depth/breadthDriver` → `depth/breadthStrategy` (they're strategies, not drivers); `supervisorSkill` → `supervisorInstructions`; `createDriveTurnResumeDriver` → `createDetachedTurnResumeDriver`.
`canonical-api.md` 984→76 lines; docs 26→17 (+5 archived; 4 architecture docs→1). New CLASS-6 prose-symbol gate (`scripts/check-docs-freshness.mjs`): every backticked symbol in the curated docs must resolve to a real export, or the build is RED; smoke-proven (injecting a fake `refineGepa()` reddens it).
`supervise()` with a single-sourced worker seam; a new `examples/supervise/` one-call example + an offline ($0, no-creds) example test.
`improve()` `selfImprove` that defaults the generator from the `surface` (prompt→GEPA, skills→skillOpt) and fails loud on surfaces with no default, and makes the previously-dead `src/improvement/` barrel reachable.
`runLoop` kept (published primitive), `runAgentic`/`runPersonified` kept (distinct), `AgentRunSpec`→`SandboxIterationSpec` deferred (public, 28-file blast radius → needs a major bump).
Also fixed a pre-existing red gate: `verify:package` still asserted the removed `./workflow` subpath.
Proof
All green from a clean tree: `lint` (306 files) · `tsc` 0 · `typecheck:examples` 0 · `test` 1062 pass / 1 skip · `build` 0 · `docs:check` 0 · `verify:package` 0, and merges cleanly into `main`. The keystone (the `DriverChat`→`ToolLoopChat` unification) preserves the equal-k driver-inference metering byte-for-byte.
Not in this PR (tracked, not forgotten)
`docs/simplification-plan.md §7.5` tables the multi-round supervisor/driver/worker design (retry = the driver's prompt-policy, real-time trace self-correction, completion = a real-state check) and the learned compute/model allocator (separate active research).
`workflow` vm-timeout under load, a `task-queue` temp-dir rename race) pass on isolation; candidates for a separate hardening ticket.
The living tracker (every workstream, the scratch list, the decisions, the inventory) is `docs/simplification-plan.md`.
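The CLASS-6 prose-symbol gate listed under "What changed" can be approximated by the sketch below. It is an illustrative re-implementation under stated assumptions, not the contents of `scripts/check-docs-freshness.mjs`: `symbolName` and `unresolvedSymbols` are hypothetical names, and the real gate resolves symbols against the actual export surface (including substrate `.d.ts` files) rather than a set passed in by the caller.

```typescript
// A line that opens or closes a fenced code block.
const FENCE = /^\s*`{3}/;

// Only call-shaped (`foo()`) or PascalCase tokens are treated as symbols.
function symbolName(token: string): string | null {
  const call = /^([A-Za-z_$][\w$]*)\s*\(/.exec(token);
  if (call && call[1] !== undefined) return call[1];
  const pascal = /^[A-Z][A-Za-z0-9]*$/.test(token) && /[a-z]/.test(token);
  return pascal ? token : null;
}

// Return every backticked symbol outside code fences that resolves to
// neither a known export nor a whitelisted concept name.
function unresolvedSymbols(
  markdown: string,
  knownExports: ReadonlySet<string>,
  whitelist: ReadonlySet<string> = new Set<string>(),
): string[] {
  const missing: string[] = [];
  let inFence = false;
  for (const line of markdown.split("\n")) {
    if (FENCE.test(line)) {
      inFence = !inFence;
      continue;
    }
    if (inFence) continue;
    const inline = /`([^`]+)`/g;
    let m: RegExpExecArray | null;
    while ((m = inline.exec(line)) !== null) {
      const name = symbolName(m[1] ?? "");
      if (name !== null && !knownExports.has(name) && !whitelist.has(name)) {
        missing.push(name);
      }
    }
  }
  return missing;
}
```

A build step would fail whenever `unresolvedSymbols(...)` returns a non-empty list, which matches the described smoke test: a doc mentioning a fake `refineGepa()` with no matching export turns the gate red.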