refactor: usability overhaul — brain-from-profile, surface shrink, docs that can't lie, supervise() one-call by drewstone · Pull Request #347 · tangle-network/agent-runtime

drewstone · 2026-06-20T21:01:00Z

What & why

The engine here is real (recursive supervisor, conserved-budget Scope, completion oracle, the eval/improvement substrate) — but it had become hard to use correctly: ~998 public exports, a canonical doc that named symbols which didn't exist, and a supervisor "brain" that only worked on the router. A competent integrator couldn't derive the call path from the docs.

This PR is a usability/DX overhaul. The bar: a competent engineer derives the call path in seconds, and capabilities stay intact.

What changed (7 workstreams)

| WS | Change |
| --- | --- |
| WS1: brain unified | Deleted the `DriverChat` zoo. `supervise(profile, task, opts)` / `supervisorAgent(profile, deps)` resolve the brain from `profile.harness` exactly like `createExecutor({ backend })` resolves a worker: `null` → in-process router tool-loop, `claude-code`/`opencode`/`codex` → a sandboxed harness driving the coordination verbs via `serveCoordinationMcp`. Sandbox-supervisor proven offline. Closes the critique's A2 ("driver brain is router-only"). |
| WS3: surface shrink | Runtime barrel 355→277; package subpaths 13→6; the leaked recursion/seam/journal plumbing internalized. |
| WS4: naming taxonomy | `spawn_worker`→`spawn_agent` (+ `observe_agent`/`steer_agent`; the verbs operate on any spawned agent, incl. a sub-supervisor); `depth/breadthDriver`→`depth/breadthStrategy` (they're strategies, not drivers); `supervisorSkill`→`supervisorInstructions`; `createDriveTurnResumeDriver`→`createDetachedTurnResumeDriver`. |
| WS5: docs that can't lie | `canonical-api.md` 984→76 lines; docs 26→17 (+5 archived; 4 architecture docs→1). New CLASS-6 prose-symbol gate (`scripts/check-docs-freshness.mjs`): every backticked symbol in the curated docs must resolve to a real export, or the build is RED. Smoke-proven (injecting a fake `refineGepa()` reddens it). |
| WS6: examples | Supervisor-loop runners 5→4, all on `supervise()` with a single-sourced worker seam; a new `examples/supervise/` one-call example + an offline ($0, no-creds) example test. |
| WS7: `improve()` | The one pluggable RSI verb: a thin facade over agent-eval `selfImprove` that defaults the generator from the `surface` (prompt→GEPA, skills→skillOpt) and fails loud on surfaces with no default. It also makes the previously-dead `src/improvement/` barrel reachable. |
| WS2: substrate | Decisions recorded (closure, not loose ends): `runLoop` kept (published primitive), `runAgentic`/`runPersonified` kept (distinct), `AgentRunSpec`→`SandboxIterationSpec` deferred (public, 28-file blast radius → needs a major bump). |

Also fixed a pre-existing red gate: `verify:package` still asserted the removed `./workflow` subpath.

Proof

All green from a clean tree: lint (306 files) · tsc 0 · typecheck:examples 0 · test 1062 pass / 1 skip · build 0 · docs:check 0 · verify:package 0 — and merges cleanly into main. The keystone (the DriverChat→ToolLoopChat unification) preserves the equal-k driver-inference metering byte-for-byte.

Not in this PR (tracked, not forgotten)

- Run the benchmarks LIVE on the branch before merge: this is typecheck-revalidated, not bench-run-revalidated. That's the one open "capabilities survived" proof.
- `docs/simplification-plan.md` §7.5 tables the multi-round supervisor/driver/worker design (retry = the driver's prompt-policy, real-time trace self-correction, completion = a real-state check) and the learned compute/model allocator (separate active research).
- Two pre-existing flaky tests (a workflow vm-timeout under load, a task-queue temp-dir rename race) pass in isolation; they are candidates for a separate hardening ticket.

The living tracker (every workstream, the scratch list, the decisions, the inventory) is docs/simplification-plan.md.

…urnal the kernel loop (#346)" This reverts commit edc1d54.

…, full doc/module/example inventory + completion criteria

…tify/refuse), steer-in-run, milestone-oracle gap, 8 skills to vendor

…enerator (GEPA/skillOpt/autoresearch) + surface param — not 'one engine'

…becomes a thin adapter (keystone 1/4)

…opChat seam (keystone) Delete DriverChat + routerDriverChat; the coordination-driver brain is now the canonical ToolLoopChat and its loop runs through runToolLoop (routerBrain = 4 lines, was 60). The equal-k driver-inference metering is preserved exactly. Three tool-loop copies collapse to one.

…done, 1b (brain-from-profile/harness-as-data, sandbox supervisor) next

…umbing from the public barrel

…es from the public barrel

…profiles; drop unused duplicates)

…ngExperiment, refineGepa label)

… 4 architecture docs→1, merge PLAIN→README, archive 5 niche notes)

… in curated docs must resolve Scans canonical-api/concepts/architecture for backticked symbols outside code fences; reddens on any call-shaped or PascalCase symbol that resolves to no src/bench/substrate export or concept-whitelist entry. Walks every substrate dist/**/*.d.ts (not just index barrels). Closes the gap that let gepaDriver/refineGepa live in the docs unchecked.
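The gate described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of `scripts/check-docs-freshness.mjs`: the function name `findUnresolvedSymbols`, the two regular expressions, and the way the export set is passed in are all hypothetical. It scans backticked spans outside code fences and reports any call-shaped or PascalCase symbol that is neither a known export nor whitelisted.

```typescript
// Hypothetical sketch of a prose-symbol gate; names and regexes are assumptions.
const FENCE = "`".repeat(3);
// Call-shaped: an identifier immediately followed by "(" (e.g. refineGepa()).
const CALL_SHAPED = /^([A-Za-z_$][\w$]*)\(/;
// PascalCase: leading capital followed by at least one lowercase letter.
const PASCAL_CASE = /^[A-Z][a-z]+[A-Za-z0-9]*$/;

function findUnresolvedSymbols(
  markdown: string,
  knownExports: ReadonlySet<string>,
  whitelist: ReadonlySet<string> = new Set<string>(),
): string[] {
  const unresolved: string[] = [];
  let inFence = false;
  for (const line of markdown.split("\n")) {
    // A fence delimiter toggles the state; fenced code is never scanned.
    if (line.trim().indexOf(FENCE) === 0) {
      inFence = !inFence;
      continue;
    }
    if (inFence) continue;
    const inlineCode = /`([^`]+)`/g;
    let match: RegExpExecArray | null;
    while ((match = inlineCode.exec(line)) !== null) {
      const span = match[1] ?? "";
      const callName = CALL_SHAPED.exec(span)?.[1];
      const name = callName ?? (PASCAL_CASE.test(span) ? span : undefined);
      if (name === undefined) continue; // plain prose in backticks is ignored
      if (!knownExports.has(name) && !whitelist.has(name)) unresolved.push(name);
    }
  }
  return unresolved;
}
```

A build script could treat a non-empty result as a red gate. Collecting the export set from `dist/**/*.d.ts`, as the commit describes, is left out of the sketch.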

…s (WS1b) A supervisor is now an AgentProfile: harness null -> the in-process router tool-loop (coordinationDriverAgent; routerBrain becomes an internal detail), a coding-CLI harness (claude-code/opencode/codex) -> a sandboxed harness driving the coordination verbs via serveCoordinationMcp. Both arms share makeWorkerAgent + the keep-best-delivered oracle. Closes the critique's A2 (driver brain was router-only). Proven offline both arms.
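The harness-as-data rule above amounts to a single dispatch on `profile.harness`. The sketch below illustrates that rule only: `ProfileSketch`, `BrainSketch`, `BrainDepsSketch`, `resolveBrainSketch`, and the `coordinationMcpUrl` field are hypothetical stand-ins, and the real `AgentProfile` and `supervisorAgent(profile, deps)` shapes are richer than shown here.

```typescript
// Hypothetical, simplified shapes; not the library's actual types.
type Harness = "claude-code" | "opencode" | "codex";

interface ProfileSketch {
  name: string;
  harness: Harness | null;
}

type BrainSketch =
  | { kind: "router-tool-loop" }
  | { kind: "sandboxed-harness"; harness: Harness; mcpUrl: string };

interface BrainDepsSketch {
  // Assumption: the sandbox arm needs an endpoint serving the coordination verbs.
  coordinationMcpUrl?: string;
}

function resolveBrainSketch(profile: ProfileSketch, deps: BrainDepsSketch): BrainSketch {
  // harness null: the in-process router tool-loop.
  if (profile.harness === null) return { kind: "router-tool-loop" };
  // A coding-CLI harness: fail loud when the dependency it needs is missing.
  if (deps.coordinationMcpUrl === undefined) {
    throw new Error('harness "' + profile.harness + '" needs a coordination MCP endpoint');
  }
  return {
    kind: "sandboxed-harness",
    harness: profile.harness,
    mcpUrl: deps.coordinationMcpUrl,
  };
}
```

The point of the design is that a caller picks the brain by editing data (the profile), not by constructing a different driver object.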

… profile.harness)

supervise(profile, task, { backend|makeWorkerAgent, budget }) defaults blobs/perWorker/ journal/executors/maxDepth so 'just invoke the supervisor' is a one-liner. workerFromBackend derives the worker seam from a backend config + an optional completion oracle (settled⟺delivered). The raw seams (supervisorAgent + createSupervisor().run) stay for power use.
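The defaulting described here can be sketched as a small options-normalizing step. This is an assumption-laden illustration, not the code in `supervise.ts`: `SuperviseOptsSketch` and `withSuperviseDefaults` are hypothetical names, the `Budget` shape is borrowed from the test snippet quoted in the audit, and the specific defaults (`perWorker = budget/4`, `maxDepth = 8`) come from the reviewer's notes later in this thread. The rounding rule is a guess.

```typescript
// Hypothetical sketch of one-call defaulting; shapes and rounding are assumptions.
interface BudgetSketch {
  maxIterations: number;
  maxTokens: number;
}

interface SuperviseOptsSketch {
  backend?: string; // stand-in for a backend config
  makeWorkerAgent?: () => unknown; // stand-in for a caller-supplied worker seam
  budget: BudgetSketch;
  perWorker?: BudgetSketch;
  maxDepth?: number;
}

type ResolvedOptsSketch = SuperviseOptsSketch & {
  perWorker: BudgetSketch;
  maxDepth: number;
};

function withSuperviseDefaults(opts: SuperviseOptsSketch): ResolvedOptsSketch {
  // Fail loud only when NEITHER worker source is given.
  if (opts.backend === undefined && opts.makeWorkerAgent === undefined) {
    throw new Error("supervise: pass either backend or makeWorkerAgent");
  }
  // Assumption: a quarter of each budget dimension, floored, never below 1.
  const quarter = (n: number): number => Math.max(1, Math.floor(n / 4));
  return {
    ...opts,
    perWorker: opts.perWorker ?? {
      maxIterations: quarter(opts.budget.maxIterations),
      maxTokens: quarter(opts.budget.maxTokens),
    },
    maxDepth: opts.maxDepth ?? 8,
  };
}
```

Explicit values always win over defaults, which is what keeps the raw seams usable for power users while the common case stays a one-liner.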

…design (round vs turn, prompt-policy retry, real-time trace self-correction)

… — profile + goal, scaffolding defaulted)

… agent, incl. a sub-supervisor) The coordination verb always took a worker OR a driver profile and resolves a sub-supervisor via the role marker — the name lied. Renamed across the tool def, the LLM-facing descriptions, the scripted-brain tests, the examples, and the hand docs. WS4 (naming taxonomy).

…er_agent (consistent verb family) The coordination verbs operate on any spawned agent (a leaf worker OR a sub-supervisor), so the family is now spawn_agent / observe_agent / steer_agent. WS4 (naming taxonomy).

…ategy, supervisorSkill→supervisorInstructions (WS4) They are strategy combinators and a prompt-instruction builder, not 'drivers'/'skills' — reserving 'Driver' for the agent-orchestration layer (coordinationDriverAgent/driverChild).

…nResumeDriver (WS4)

…r selfImprove; generator defaulted from surface)

… no creds)

…ridge runners onto supervise() run-router.ts duplicated examples/supervise/supervise.ts (router brain + router-tools backend). loop.ts's runSupervisorLoop/makeWorkerAgent duplicated supervise()/workerFromBackend. The sandbox + bridge runners now call supervise() with only their load-bearing per-backend seam; the shared demo task + scripted brain move to shared.ts.

…() (single-sourced worker seam) Replaces the bespoke makeWorker (executor construction + per-worker file plumbing) with workerFromBackend(backend, deliverable); the deployable check now reads the worker's real output for ANSWER=42 (completion oracle, not a self-report). Keeps the cli-bridge harness supervisor arm that drives spawn_agent natively over the coordination MCP.

…ove() exports

…rvise() example test

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	3 (3 low)
Heuristic	0.0s
Duplication	0.0s
Interrogation	80.5s (2 bridge agents)
Total	80.5s

💰 Value — sound

Substantial usability/DX overhaul: unifies the supervisor brain onto one ToolLoopChat seam (deleting the DriverChat zoo), shrinks the public surface 13→6 subpaths, adds build-time doc freshness gates, and provides one-call supervise()/improve() facades — all in the grain of the existing recursive-su

What it does: Unifies the supervisor brain seam: the old DriverChat zoo (routerDriverChat, scriptedSupervisorChat as a separate type hierarchy) is replaced by a single ToolLoopChat contract — routerBrain resolves it from the router, and supervisorAgent resolves the brain from profile.harness (null→router, harness name→sandboxed MCP). Deletes src/runtime/supervise/router-driver-chat.ts (65 lines of translation c
Goals it achieves: 1) Make supervisors brain-resolvable from their profile (backend-as-data), same as workers via createExecutor — no hand-built driver brain. 2) Reduce the public API surface so a competent engineer can derive the call path from inspection, not by reverse-engineering 998 exports. 3) Make docs self-verifying — every backticked symbol must resolve to a real export, so docs can never claim a non-existe
Assessment: This is a coherent, well-executed usability overhaul that reinforces rather than fights the codebase's grain. The brain-unification (WS1) is the keystone architectural improvement — it makes the supervisor follow the same backend-as-data resolution rule as every other agent (profile.harness null→router, set→sandbox), eliminating the parallel DriverChat type hierarchy. The surface shrink is surgica
Better / existing approach: none — this is the right approach. Searched for existing alternatives: (1) the old DriverChat type hierarchy and routerDriverChat adapter in router-driver-chat.ts were the exact zoo being eliminated — the new ToolLoopChat seam IS the simpler design; (2) improve() properly delegates to agent-eval's selfImprove and is a thin facade, not a reinvention; (3) supervise() properly delegates to supervisor
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🎯 Usefulness — sound

The supervise() one-call + brain-from-profile unification + surface shrink + verifiable docs is a coherent usability overhaul that canonicalizes existing patterns without competing or dead-ending.

Integration: Wires correctly. supervise() (src/runtime/supervise/supervise.ts:80) composes existing primitives (supervisorAgent→coordinationDriverAgent, createSupervisor, createInMemoryRunContext). All 5 examples (supervise.ts, run-sandbox.ts, run-bridge.ts, run-supervisor-mcp.ts, atom-humaneval.mts) import from @tangle-network/agent-runtime/loops — the verified subpath. routerBrain replaces the deleted
Fit with existing patterns: Aligns with the codebase's grain. The backend-as-data pattern — createExecutor({backend}) resolves a worker by data; supervisorAgent(profile, deps) resolves a brain from profile.harness the same way. routerBrain is a 4-line thin wrapper over the existing routerChatWithTools, emitting the existing ToolLoopChat seam. supervise() is a convenience, not a replacement — the raw `supervisor
Real-world viability: Proven offline at multiple layers: both brain arms tested without creds (supervisor-agent.test.ts:68-131 covers harness=null router arm and harness='opencode' sandbox arm via real HTTP MCP), budget conservation gates (poolStarved, deadline, abort) all test-covered (coordination-driver.test.ts:258-352), completion oracle (gateOnDeliverable) ensures settled≠delivered without a real check, the exam
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/supervise/supervise.ts

+console.log(result.kind === 'winner' ? '✓ delivered' : `✗ no winner (${result.kind})`)

🟡 Cruft: commented out code scripts/check-docs-freshness.mjs

+// CLASS 1 version / substrate-peer pins != package.json

🟡 Cruft: magic number added tests/loops/supervisor-agent.test.ts

+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260620T210444Z}

…oop, fix improve() default model - runToolLoop name collided with the public streaming runToolLoop; the internal brain-loop seam is now runBrainLoop (one grep = one concept). - improve()'s zero-config default reflection model was the dead anthropic/claude-sonnet-4.6 → deepseek-v4-flash (router-served).

The two flagship verbs were invisible in every gated doc, so a reader was routed back onto the verbose legacy path the PR replaced. README now leads with the 3 entry points (chat turn / supervise / improve); canonical-api §2 makes supervise() the 'just run a supervisor' START-HERE row and routes self-improvement to improve().

tangletools

✅ Auto-approved PR — `bf075234`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-20T21:19:52Z}

tangletools

🟡 Value Audit — sound-with-nits


Verdict	sound-with-nits
Concerns	6 (3 low, 3 weak-concern)
Heuristic	0.0s
Duplication	0.0s
Interrogation	205.3s (2 bridge agents)
Total	205.3s

💰 Value — sound-with-nits

Large, coherent usability overhaul that unifies the supervisor brain seam onto the canonical ToolLoopChat, adds convenient one-call facades (supervise/improve), shrinks the public surface 13→6 subpaths, renames confusing symbols, and adds a self-verifying docs freshness gate, all in the grain of ex

What it does: Consolidates the supervisor's brain seam: deletes the parallel DriverChat interface + routerDriverChat adapter (src/runtime/supervise/router-driver-chat.ts:1-65), and routes the coordination driver through the canonical ToolLoopChat seam instead. Adds supervise(profile, task, opts) as a one-call convenience composing supervisorAgent() + createSupervisor().run() with sensible defaults (src/runtime/
Goals it achieves: Make a competent engineer derive the supervisor call path in seconds (supervise(profile, task, { backend, budget }) instead of hand-wiring coordinationDriverAgent + createInMemoryRunContext + createSupervisor + blobs + perWorker + journal + executors + maxDepth). Eliminate the confusing parallel type hierarchy (DriverChat vs ToolLoopChat — identical purpose, different shapes). Shrink the public AP
Assessment: The change is well-designed and built in the grain of the existing codebase. The brain unification (DriverChat→ToolLoopChat) is a genuine simplification: the old code had a parallel interface with its own message format and a dedicated adapter (routerDriverChat) that did hand-rolled OpenAI message translation — now the 3-line routerBrain function satisfies the same ToolLoopChat seam every other to
Better / existing approach: none — this is the right approach. The ToolLoopChat unification was the correct call (checked: no other DriverChat-like parallel type exists; routerBrain is now the canonical chain routerChatWithTools→routerBrain→ToolLoopChat). The supervise() convenience correctly composes existing primitives rather than reinventing them. The naming changes fix real confusion. The subpath reduction removes genuin
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🎯 Usefulness — sound-with-nits

supervise() delivers a real DX win (profile+goal → one call), brain-from-profile coheres with the executor pattern, and the docs freshness gate hardens the surface — one stale bench reference and a partially-wired code surface in improve() are the only nits.

Integration: All new surfaces are exported through the 6-subpath public API and reachable. supervise() is called by 3 examples (examples/supervise/supervise.ts:31, examples/supervisor-loop/run-sandbox.ts:58, run-bridge.ts:75) and 2 test files (tests/loops/supervise-convenience.test.ts:51, tests/supervisor-loop-example.test.ts:65). supervisorAgent() is called internally by supervise() (src/runtime/supervise/sup
Fit with existing patterns: The pattern matches the codebase's grain. createSupervisor().run() and coordinationDriverAgent remain as raw seams for power users (the bench and personify layer use them directly); supervise() wraps the common case with sensible defaults (blobs, perWorker=budget/4, journal, executors, maxDepth=8). The brain-from-profile pattern in supervisorAgent (harness:null → router brain, harness:'opencode' →
Real-world viability: Core paths are tested: router arm (tests/loops/supervisor-agent.test.ts:68), sandbox arm (same file:83), offline scripted-brain integration (tests/supervisor-loop-example.test.ts:63), fail-loud for missing deps (supervise-convenience.test.ts:71, supervisor-agent.test.ts:104,115). The docs freshness gate (scripts/check-docs-freshness.mjs) smoke-proven to detect stale symbols. Two real-world gaps: (
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/supervise/supervise.ts

+console.log(result.kind === 'winner' ? '✓ delivered' : `✗ no winner (${result.kind})`)

🟡 Cruft: commented out code scripts/check-docs-freshness.mjs

+// CLASS 1 version / substrate-peer pins != package.json

🟡 Cruft: magic number added tests/loops/supervisor-agent.test.ts

+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }

💰 Value Audit

🟡 Bench system prompts reference old tool name spawn_worker after rename to spawn_agent [maintenance]

6 files still reference spawn_worker in LLM system prompts: bench/src/atom-humaneval.mts:96, bench/src/mcp-mount-probe.mts:89-90, bench/src/atom-mcp-e2e.mts:178, bench/src/profiles.ts:23/98, skills/supervise/SKILL.md:13/24, skills/loop-writer/SKILL.md:113. The MCP tools were renamed from spawn_worker to spawn_agent (src/mcp/tools/coordination.ts:336) and the LLM would fail trying to call a non-existent tool. Also bench/src/profiles.ts:22/24/96/98 references observe_worker/steer_worker (now obser

🎯 Usefulness Audit

🟡 bench/atom-humaneval.mts system prompt references stale tool name spawn_worker [robustness]

bench/src/atom-humaneval.mts:96 tells the LLM 'Tools: spawn_worker ... await_event ...' but the coordination tools were renamed to spawn_agent (src/mcp/tools/coordination.ts:336). The import rename (routerDriverChat→routerBrain, line 31-34) was done but the system prompt string wasn't updated. The bench would fail at runtime when the LLM calls a tool name that doesn't exist. Fix: update the system prompt string to reference spawn_agent.

🟡 improve() code surface is half-wired (empty baseline, winner not applied) [ergonomics]

At src/improvement/improve.ts:128-133, baselineSurfaceFor('code') returns '' (no worktree ref), and at line 159-160 applyWinnerToProfile('code') returns the profile unchanged. The caller must fish the actual code winner from raw.winner.surface. The surface exists in the type system (ImproveSurface includes 'code') and a generator can be injected, but the facade provides no load-bearing integration — a caller using surface:'code' with a generator gets a profile-back-is-input result and must read

_{value-audit · 20260620T212506Z}

… 2775 LOC + tests) A third orchestration substrate (a workflow-as-a-script DSL runner with its own checkpoints/budget/ delegates) that does NOT use the supervisor and is NOT self-improving — redundant with the Scope/Supervisor + supervise() path (the architecture's 'two substrates, do not invent a third'). Zero in-repo or fleet consumers; its ./workflow subpath was already dropped in WS3.

…or can vary budget per worker)

…d model-subset restriction)

…ey (runStrategyEvolution + promotionGate)

tangletools · 2026-06-20T21:56:55Z

❌ Needs Work — `bf075234`

Readiness 0/100 · Confidence 95/100 · 33 findings (1 high, 5 medium, 27 low)

| | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 0 | 37 | 0 |
| Confidence | 95 | 95 | 95 |
| Correctness | 0 | 37 | 0 |
| Security | 0 | 37 | 0 |
| Testing | 0 | 37 | 0 |
| Architecture | 0 | 37 | 0 |

Full multi-shot audit completed 8/8 planned shots over 80 changed files. Global verifier still owns final merge decision.

Blocking

🔴 HIGH contentAddress dropped from runtime barrel export — src/runtime/index.ts

The old barrel (edc1d54) exported contentAddress at line 30. The new barrel (bf07523) only exports InMemoryResultBlobStore and InMemorySpawnJournal from ../durable/spawn-journal. Two bench files import contentAddress from the runtime barrel: bench/src/atom-humaneval.mts:23 and bench/src/atom-mcp-e2e.mts:24. These will fail to compile. Fix: add contentAddress back to the re-export on line 29: export { contentAddress, InMemoryResultBlobStore, InMemorySpawnJournal } from '../durable/spawn-journal'.

Other

🟠 MEDIUM Unguarded JSON.parse in applyWinnerToProfile can crash after a successful ship verdict — src/improvement/improve.ts

Lines 152, 154, 156, 158 call JSON.parse(winner) unconditionally for skills/tools/mcp/hooks. tools/mcp/hooks REQUIRE a caller-supplied driver (no default), and skillOptDriver mutates a JSON-stringified blob. If any such driver produces a non-JSON string (a realistic LLM failure mode — the driver parses model output and a malformed reflection slips through), applyWinnerToProfile throws SyntaxError AFTER selfImprove returned gateDecision==='ship'. The throw discards the entire improvement result: the caller never receives out.raw, out.lift, or out.gateDecision, losing the provenance record and the winner surface. The facade's documented contract ('ret

🟠 MEDIUM Unprotected JSON.parse in applyWinnerToProfile can throw raw SyntaxError — src/improvement/improve.ts

Lines 151-158: applyWinnerToProfile calls JSON.parse(winner) for skills/tools/mcp/hooks surfaces without try/catch. If the ImprovementDriver (including the default skillOptDriver) produces a winner surface string that is not valid JSON — a known LLM failure mode — this throws a raw SyntaxError after selfImprove already shipped the result. This violates the error taxonomy in src/errors.ts:6-20 which requires all consumer-facing errors to extend AgentEvalError. The risk is partially mitigated because the agent function must have used the surface successfully during evaluation for the gate to ship, but if the agent is lenient (accepts an
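One way to address both `JSON.parse` findings is to parse defensively and surface the failure alongside the result instead of throwing. The sketch below is hypothetical: `safeParseWinner` and `applyWinnerSketch` are illustrative names, the profile is modelled as a plain record, and writing the parsed value under the surface key is a simplification of the real field mapping. A real fix would presumably wrap the failure in an error extending `AgentEvalError`, which is not reproduced here.

```typescript
// Hypothetical sketch; the real applyWinnerToProfile maps surfaces to
// specific profile fields and uses the repo's own error taxonomy.
type ParseOutcome =
  | { ok: true; value: unknown }
  | { ok: false; error: string };

function safeParseWinner(surface: string, winner: string): ParseOutcome {
  try {
    return { ok: true, value: JSON.parse(winner) };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return {
      ok: false,
      error: "winner for surface '" + surface + "' is not valid JSON: " + detail,
    };
  }
}

function applyWinnerSketch(
  profile: Record<string, unknown>,
  surface: string,
  winner: string,
): { profile: Record<string, unknown>; applyError?: string } {
  const parsed = safeParseWinner(surface, winner);
  // On a malformed winner the input profile comes back unchanged, and the
  // caller can still return raw/lift/gateDecision next to applyError.
  if (!parsed.ok) return { profile, applyError: parsed.error };
  return { profile: { ...profile, [surface]: parsed.value } };
}
```

Returning the error as data keeps the provenance record intact after a `ship` verdict; whether to return or rethrow a typed error is a design choice for the facade's contract.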

🟠 MEDIUM MCP tool wire-format rename breaks external references — src/mcp/tools/coordination.ts

Lines 336,360,377: Tool names changed from spawn_worker→spawn_agent, observe_worker→observe_agent, steer_worker→steer_agent. These are wire-format identifiers that MCP clients use to call tools. Tests updated (tests/loops/coordination.test.ts:65-119 use new names, 54 tests pass). However bench/src/profiles.ts:22-98, bench/src/atom-humaneval.mts:96, bench/src/atom-mcp-e2e.mts:178, and bench/src/mcp-mount-probe.mts:89-119 still reference old tool name strings in LLM prompts — these bench files will silently fail when mounting the coordination MCP after this rename lands. They may be updated in other shots of this PR (80 files total). Verify all bench references

🟠 MEDIUM finalizeBestDelivered has zero tests — src/runtime/supervise/coordination-driver.ts

finalizeBestDelivered (line 198) is a new public export used by both coordination-driver.ts and supervisor-agent.ts for the completion-oracle result selection. It filters settled children by status==='done' AND valid===true, then argmax on score. Zero direct tests. Covered only indirectly through coordinationDriverAgent and supervisorAgent integration tests. Missing cases: no delivered children (returns undefined), tie-breaking on equal scores, undefined score handling, missing outRef, and the valid===true gate (a worker that settled 'done' but wasn't valid should NOT be selected). Add unit tests for these scenarios.
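The selection rule this finding describes (filter to `status === 'done'` and `valid === true`, then argmax on score) can be pinned down with a small sketch, which also makes the missing test cases concrete. `SettledChildSketch` and `pickBestDelivered` are hypothetical names, and the handling of ties and undefined scores below is an assumption, since the source does not state the actual behavior.

```typescript
// Hypothetical sketch of the keep-best-delivered rule; not the real export.
interface SettledChildSketch {
  id: string;
  status: "done" | "failed" | "aborted";
  valid?: boolean;
  score?: number;
  outRef?: string;
}

function pickBestDelivered(
  children: readonly SettledChildSketch[],
): SettledChildSketch | undefined {
  let best: SettledChildSketch | undefined;
  for (const child of children) {
    // Settled is not delivered: both gates must pass.
    if (child.status !== "done" || child.valid !== true) continue;
    // Assumptions: a missing score ranks below any numeric score,
    // and ties keep the earliest child.
    const score = child.score ?? Number.NEGATIVE_INFINITY;
    const bestScore = best?.score ?? Number.NEGATIVE_INFINITY;
    if (best === undefined || score > bestScore) best = child;
  }
  return best;
}
```

Unit tests for the real function would need to confirm or correct each of these assumptions, which is the gap the finding flags.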

🟠 MEDIUM runBrainLoop has zero direct tests — src/runtime/tool-loop.ts

runBrainLoop (114 lines) is the new shared tool-loop skeleton extracted from router-client.ts and coordination-driver.ts. It is the canonical agentic tool-loop — every brain drives through it. Yet it has zero dedicated tests. The existing src/tool-loop.test.ts tests runToolLoop/streamToolLoop (the OLD functions, not the new runBrainLoop). Indirect coverage comes from tests/loops/router-brain.test.ts (which tests routerBrain, the thin adapter over runBrainLoop) and tests/loops/coordination-driver.test.ts (which exercises runBrainLoop through the driver agent). Missing coverage: stopBefore/beforeTurn/onUsage hook semantics, malformed-arg degradation, message format correctness, edge cases like maxTurns=0 with hooks, and the final text fallback behavior. Add tests covering: (1) hook invocatio

🟡 LOW Cross-doc ref to removed pages verified clean but module removal may surprise consumers — docs/api/README.md

Six module pages removed from the module listing (analyst-loop, audit, improvement, platform, topology, workflow) without deprecation notice. The typedoc.json entryPoint array was reduced from 12 to 6 entries. Source modules still exist on disk and are importable — this is purely a documentation visibility change. No stale cross-references to removed pages found. Consumers navigating from old bookmarks to these pages will get 404s; consider a redirect or deprecation note.

🟡 LOW Public API surface reduction not surfaced as a finding in docs — docs/api/runtime.md

runtime.md drops documentation for ~20 previously-public symbols (RootHandle, RootSignal, SpawnEvent, Runtime, ExecutorFactory, Restart, NodeStatus, NodeId, ToolPartDecoder, contentAddress, replaySpawnTree, materializeTreeView, createSandboxForSpec, depthDriver, breadthDriver, driverChild, isDriverSpec, withDriverExecutor, routerDriverChat, driverRuntime, driverExecutorFactory, createRootHandle, toToolSpan, touchedPathsFromPatch, countDiffLines, isNonEmptyPatch, touchesSecretPath, runCoderChecks, createPartsTraceSource, decodeOpencodePart, decodeAnthropicPart, decodeOpenAiPart, toolPartDecoders, supervisorSkill). Verified these are no longer exported from src/runtime/index.ts at HEAD (base exported them at [lines 30](https://github.com/tangle-network/agent-runtime/blob/bf0752345cea05968500

🟡 LOW Undocumented SpawnJournal interface referenced as plain text in runtime.md — docs/api/runtime.md

InMemorySpawnJournal (line 85) states 'Implements SpawnJournal' but SpawnJournal is no longer documented on the runtime page (removed from src/runtime/index.ts type re-exports at line 371-388). The reference renders as plain text without a link — not broken, but the interface's contract is invisible to readers. Consider either re-exporting SpawnJournal from the runtime entry point or documenting its contract in InMemorySpawnJournal's own docs.

🟡 LOW No trailing newline at EOF — docs/canonical-api.md

The file ends without a final newline (line 78 is the last line, matching the prior version's 'No newline at end of file'). Trivial; most renderers are unaffected, but some linters/formatters flag it. Not blocking.

🟡 LOW runPersonaConversation location claim inaccurate — docs/canonical-api.md

Line 38: doc claims runPersonaConversation is at 'root . (also /loops)' but the function is only exported from src/index.ts (root '.'), NOT from src/runtime/index.ts (./loops). Verified: grep -c 'runPersonaConversation' src/runtime/index.ts returns 0. Fix: remove '(also /loops)' or export it from the runtime barrel.

🟡 LOW supervise() decision-table row implies backend is required; it is optional — docs/canonical-api.md

Line 35 shows supervise(profile, task, { backend, budget }) as the canonical call, but src/runtime/supervise/supervise.ts:87-93 throws only when NEITHER opts.backend NOR opts.makeWorkerAgent is provided — backend is the happy-path default, not strictly required. Acceptable framing for a decision table ('scaffolding defaulted'), and the precise signature now lives in generated docs/api/. Informational only; no action required unless you want to note 'backend | makeWorkerAgent'.

🟡 LOW Force-cast of SandboxClient breaks type safety at the example seam — examples/supervisor-loop/run-sandbox.ts

line 42: new SandboxClient(...) as unknown as RuntimeSandboxClient — if the sandbox SDK's SandboxClient shape diverges from the runtime's internal port, this compiles silently but fails at runtime when createExecutor calls the client. Same pre-existing pattern; a proper type guard or import from the runtime's public SandboxClient would catch mismatches at build time.

🟡 LOW run-supervisor-mcp.ts workerFromBackend+MCP path has no offline test — examples/supervisor-loop/run-supervisor-mcp.ts

run-supervisor-mcp.ts:130 wires workerFromBackend(backend, {check: demoCheck,...}) into serveCoordinationMcp's makeWorkerAgent. Unlike the supervise()+scripted-brain path (covered by tests/supervisor-loop-example.test.ts), this real-MCP path (supervisor harness calling spawn_agent over HTTP) is only exercisable against a live cli-bridge, so the workerFromBackend+gateOnDeliverable composition for the 'bridge' backend is unverified by CI. The substrate pieces are individually tested (coordination-mcp.test.ts, completion-gate tests), and the example throws clear errors on missing env, so this is a coverage gap not a defect. No fix required for merge; flagging so the gap is visible.

🟡 LOW demoCheck object-with-content branch only stringifies content one level deep — examples/supervisor-loop/shared.ts

shared.ts:18-23 — the check does String(content).includes(expectedAnswer) when out is an object with a content field, else JSON.stringify(out). If a backend ever settles out = { content: { text: 'ANSWER=42' } } (content itself an object), String({text:...}) yields '[object Object]' and the marker is missed, returning false even though the answer is nested inside. The fall-through JSON branch never runs because the content key short-circuits. Impact is nil for the current demo (bridge executor yields content as a string; the test asserts exactly {content: expectedAnswer}), and a throwing/missing check is fail-closed via gateOnDeliverable. Note only: if the sandbox event-stream shape ever nests the marker under a non-string content field, this oracle would false-negative. Fix if it
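If the nested-content case ever matters, one possible shape for the check is to stringify non-string `content` as JSON rather than through `String(...)`. This is a sketch under assumptions, not the code in `shared.ts`: `containsAnswerSketch` is a hypothetical name and the accepted `out` shapes are guesses based on the finding above.

```typescript
// Hypothetical variant of the demo check that looks inside nested content.
function containsAnswerSketch(out: unknown, expectedAnswer: string): boolean {
  if (typeof out === "string") return out.includes(expectedAnswer);
  if (out !== null && typeof out === "object" && "content" in out) {
    const content = (out as { content: unknown }).content;
    // JSON.stringify keeps nested fields visible; String() would yield
    // "[object Object]" and miss the marker.
    const text = typeof content === "string" ? content : String(JSON.stringify(content));
    return text.includes(expectedAnswer);
  }
  // Fall-through for objects without a content field.
  return String(JSON.stringify(out)).includes(expectedAnswer);
}
```

A substring match over serialized JSON is deliberately loose; a stricter oracle would look for the marker at a known path instead.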

🟡 LOW scriptedSupervisorChat ignores tools parameter from ToolLoopChat signature — examples/supervisor-loop/shared.ts

line 70: The ToolLoopChat type expects (messages, tools) => Promise<...> but scriptedSupervisorChat returns (messages) => Promise.resolve(...), dropping the tools parameter. TypeScript allows this but it means the scripted brain can never inspect tool schemas. For a fixed plan this is fine, but consumers porting this pattern to a real brain may miss that the contract includes tool awareness. Non-blocking.

🟡 LOW Default reflection model triggers real LLM spend with no caller opt-in — src/improvement/improve.ts

Line 88 defaultReflectionModel = 'deepseek-v4-flash' + line 102 llm?.model ?? defaultReflectionModel means a caller running improve(profile, [], { surface: 'prompt', scenarios, judge, agent }) with no llm and OPENAI_API_KEY in env silently triggers real router calls on a model they didn't choose. The substrate's expectUsage: 'assert' default only catches stub cells, not unwanted spend. For an @experimental API this is borderline acceptable, but the module docstring should state that omitting opts.llm for 'prompt'/'skills' s

🟡 LOW No runtime shape validation of JSON.parse winner against the target profile field — src/improvement/improve.ts

Even when JSON.parse(winner) succeeds (lines 152-158), the parsed value is assigned directly to profile.tools / profile.mcp / profile.hooks / profile.resources.skills with no structural validation against the AgentProfile contracts (e.g. Record<string, AgentProfileMcpServer>, Record<string, AgentProfileHookCommand[]>, AgentProfileResourceRef[]). A driver producing syntactically-valid-but-semantically-wrong JSON writes a corrupt value into the returned profile. Low severity because the default drivers only cover prompt (string) and skills (round-trips through JSON.stringify/parse of the same array); the risk is confined to caller-supplied driv
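One possible shape for such a guard, as a sketch only. `isPlainObject` and `parseRecordWinner` are hypothetical helpers, and the check is deliberately generic (an object whose values are objects) because the real AgentProfile contracts are not reproduced here.

```typescript
// Hypothetical guard: true for non-null, non-array objects.
function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Hypothetical parser for a record-shaped surface (e.g. a map of named entries).
// JSON.parse may still throw SyntaxError on malformed input; this only adds
// a structural check on top of a successful parse.
function parseRecordWinner(winner: string): Record<string, Record<string, unknown>> {
  const parsed: unknown = JSON.parse(winner);
  if (!isPlainObject(parsed) || !Object.values(parsed).every((v) => isPlainObject(v))) {
    throw new TypeError("winner must be a JSON object whose values are objects");
  }
  return parsed as Record<string, Record<string, unknown>>;
}

console.log(Object.keys(parseRecordWinner('{"search":{"command":"rg"}}')));
```

A per-surface validator would replace the generic value check with the field-level rules of the real contract.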

🟡 LOW Test coverage limited to prompt surface; 4 of 5 surfaces and the JSON.parse failure path untested — src/improvement/improve.ts

tests/improve.test.ts has 3 tests: gate:none baseline-only (prompt), ship-verbatim writeback (prompt), ConfigError for tools-without-generator. Not covered: applyWinnerToProfile for skills/tools/mcp/hooks, the JSON.parse throw path on a malformed winner, the CodeSurface-winner branch (typeof !== 'string'), the defaultReflectionModel fallback when opts.llm is unset, and llmClientOptions() field projection. The surface matrix the facade exposes is larger than what is verified. Add at minimum: (a) a test that a malformed winner string does not discard raw/lift/gateDecision, (b) one skills-surface writeback round-trip.

🟡 LOW applyWinnerToProfile returns input reference for CodeSurface, not a copy — src/improvement/improve.ts

Lines 136-137 docstring: 'Returns a shallow copy; never mutates the input profile.' Line 147: if (typeof winner !== 'string') return profile returns the raw input reference for CodeSurface winners, not a copy. Line 160: case 'code': return profile same issue. While documented in the inline comment on [lines 143-146](https://github.com/tangle-network/agent-runtime/blob/bf0752345cea05968500b158ecf9cab

🟡 LOW Import path narrowed from barrel to direct module — src/mcp/detached-turn.ts

Line 39: Import changed from 'import { createSandboxForSpec } from '../runtime'' to 'import { createSandboxForSpec } from '../runtime/run-loop''. This matches the removal of createSandboxForSpec from the barrel re-export in src/runtime/index.ts (confirmed via git diff: -export { createSandboxForSpec, defaultSelectWinner, runLoop } → +export { defaultSelectWinner, runLoop }). Coordinated, correct — no other src/ consumers imported createSandboxForSpec from the barrel (verified via rg across src/). The narrowing is good practice for tree-shaking.

🟡 LOW No direct test coverage for the renamed resume driver symbol — src/mcp/detached-turn.ts

grep across test/ finds zero references to either createDetachedTurnResumeDriver or createDriveTurnResumeDriver. The function is exercised indirectly through bin.ts wiring, but the rename would not be caught by a direct unit test if a regression were introduced. This is pre-existing (base also had no direct tests), not introduced by this diff. Low risk since the change is identifier-only and typecheck confirms consistency.

🟡 LOW MCP tool names renamed without backward-compat alias — breaking for external clients referencing old names — src/mcp/tools/coordination.ts

Tool names changed from spawn_worker/observe_worker/steer_worker to spawn_agent/observe_agent/steer_agent (lines 336, 360, 377). Any external MCP client or persisted prompt string that calls the old tool names will get a tool-not-found error. This is an intentional vocabulary migration (the entire PR renames worker→agent across 80 files), and src/runtime/supervise/* already uses *_agent consistently, so it is architecturally coherent. Since these tools are @experimental and served by an in-process MCP server (not a public stable API), the break is acceptable. No action required for this PR, but worth noting if any downstream consumer persists tool-call transcr

🟡 LOW Public API feature removal: durable resume and loop journal removed without deprecation — src/runtime/index.ts

The PR removes from the public surface: createFileRunContext, FileLoopJournal, InMemoryLoopJournal, LoopJournal, LoopJournalEntry, FileSpawnJournal, FileResultBlobStore, materializeTreeView, replaySpawnTree, contentAddress, driver-executor exports (driverChild, driverExecutorFactory, driverRuntime, isDriverSpec, withDriverExecutor), patch-checks exports (runCoderChecks, touchesSecretPath, countDiffLines, isNonEmptyPatch, touchedPathsFromPatch), trace-source exports (createPartsTraceSource, decodeAnthropicPart, etc.), and RunLoopOptions.journal. The internal implementations still exist (used by supervisor.ts, scope.ts, strategy.ts, persona.ts, etc.); only the re-exports were removed. This is a major public API contraction. Consumers importing any of these from the package

🟡 LOW supervise() default runId collision risk — src/runtime/supervise/supervise.ts

supervise() defaults runId to 'supervise' (line 107). Two concurrent calls using defaults will both beginTree('supervise') on the InMemorySpawnJournal. The journal's uniqueness guard would reject the second beginTree. While concurrent in-memory calls are unusual, the failure mode is opaque. Consider generating a unique default runId (e.g., supervise-${randomUUID()}) or documenting that callers must provide a unique runId for concurrent runs.
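The first option could look roughly like this. `resolveRunId` is a hypothetical helper, not a function in the repo; the only real API used is `randomUUID` from `node:crypto`.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical helper: keep a caller-supplied runId, otherwise mint a unique one.
function resolveRunId(runId?: string): string {
  return runId ?? `supervise-${randomUUID()}`;
}

const firstDefault = resolveRunId();
const secondDefault = resolveRunId();
console.log(firstDefault !== secondDefault); // defaulted runs no longer share a tree id
console.log(resolveRunId("nightly-eval"));   // explicit ids pass through unchanged
```

With a unique default, two concurrent defaulted calls would each begin their own tree instead of tripping the journal's uniqueness guard.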

🟡 LOW supervise() hardcodes in-memory stores, no crash durability — src/runtime/supervise/supervise.ts

supervise() always calls createInMemoryRunContext({ withDriver: true }) on line 81. The old path had createFileRunContext(dir) for file-backed journal/blob durability. The supervise() convenience function cannot survive a process crash. While this is a deliberate simplification (the raw createSupervisor().run() path still accepts any SpawnJournal/ResultBlobStore), callers who relied on createFileRunContext for durable supervised runs now have no convenience function. Document this limitation, or add an optional dir parameter to SuperviseOptions that routes to file-backed stores.

🟡 LOW workerFromBackend constructs executor with throwaway signal and empty seams — src/runtime/supervise/supervise.ts

Lines 26-29: const ctx: ExecutorContext = { signal: new AbortController().signal, seams: {} }. The AbortController is discarded, so the construction-time signal can never fire. This is safe for current executors because runChild (scope.ts:585) passes the real childAbort.signal to executor.execute(task, signal) at run time, and createExecutor adds the backend-specific seam internally. But cliExecutor (runtime.ts:692-695) wires abort-detection at CONSTRUCTION time via ctx.signal.addEventListener('abort', ...) — that listener will never fire through this path. For a cli-backend workerFromBackend, the execute-time signal still drives the kill via s

🟡 LOW workerFromBackend pre-constructs executor with stale context — src/runtime/supervise/supervise.ts

workerFromBackend (line 25) pre-constructs the executor via createExecutor(backend)(spec, ctx) with ctx = { signal: new AbortController().signal, seams: {} }. This executor is cached in executorSpec and later returned as-is by the registry's BYO factory path (ignoring the scope's proper abort signal and seams). Currently benign because executor.execute(task, signal) receives the proper scope signal at execution time and the built-in sandbox executor doesn't depend on construction-time seams. However, this is fragile — if a future executor implementation reads its abort signal or seams at construction time, it will see a never-aborted signal and empty seams
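The difference between the two signals can be shown in isolation. This sketch uses only the standard `AbortController` API and does not touch the repo's executor types; the variable names are illustrative.

```typescript
// Construction-time signal from a controller nobody keeps: nothing can ever abort it.
const constructionSignal = new AbortController().signal;
let constructionListenerFired = false;
constructionSignal.addEventListener("abort", () => {
  constructionListenerFired = true;
});

// Execute-time signal: the caller still holds the controller, so abort() reaches listeners.
const runController = new AbortController();
let runListenerFired = false;
runController.signal.addEventListener("abort", () => {
  runListenerFired = true;
});
runController.abort();

console.log(constructionListenerFired, runListenerFired); // false true
```

Any executor that registers its abort handling only on the construction-time signal would behave like the first listener on this path.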

🟡 LOW Naming collision: src/runtime/tool-loop.ts vs existing src/tool-loop.ts — src/runtime/tool-loop.ts

The new src/runtime/tool-loop.ts exports runBrainLoop/ToolLoopChat (the supervisor brain seam). The pre-existing src/tool-loop.ts exports runToolLoop/streamToolLoop (the interactive chat tool-dispatch loop). Two files named tool-loop.ts in adjacent directories with overlapping vocabulary but different purposes is a navigation/maintainability trap. Not a bug. Consider renaming to brain-loop.ts or agent-tool-loop.ts to disambiguate.

🟡 LOW No dedicated tests for new runBrainLoop, supervise, supervisorAgent, or routerBrain — src/runtime/tool-loop.ts

All three new public API entrypoints (runBrainLoop in tool-loop.ts, supervise/workerFromBackend in supervise.ts, supervisorAgent in supervisor-agent.ts) and the new routerBrain export have no test files. Existing tests (supervise.test.ts: 35 tests) exercise the underlying createSupervisor/createScope directly, not the new convenience APIs. The runBrainLoop hooks contract (stopBefore/beforeTurn/onUsage ordering), the supervise() default-perWorker math, and the supervisorAgent sandbox-arm MCP lifecycle (serveCoordinationMcp → driveHarness → finalizeBestDelivered → mcp.close) are all untested at the integration level.

🟡 LOW runBrainLoop reports turns:maxTurns when stopBefore hook breaks the loop early — src/runtime/tool-loop.ts

Lines 93-94: the post-loop return is return { ..., turns: maxTurns, ... }. When opts.hooks?.stopBefore?.(turn) breaks at turn N (where N < maxTurns), the returned turns field says maxTurns — claiming more inference turns ran than actually did. The coordination-driver.ts consumer discards the return value (only calls finalize), and routerToolLoop passes no hooks, so no current consumer is affected. But runBrainLoop is an exported public API; any future consumer using hooks + reading turns gets an inaccurate equal-compute count. Fix: track actual turns executed (let turnsTaken = 0; turnsTaken += 1 after each chat call) and return turns: turnsTaken po
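The suggested fix, sketched against a simplified loop. This is a synchronous stand-in with hypothetical names (`runLoopSketch`, `ChatFn`, `LoopResultSketch`); the real runBrainLoop is async and carries more state, so treat this only as an illustration of counting turns actually taken.

```typescript
// Hypothetical, reduced types for illustration.
type ChatFn = (messages: string[]) => string;

interface LoopResultSketch {
  final: string;
  turns: number; // inference turns that actually ran
}

function runLoopSketch(
  chat: ChatFn,
  maxTurns: number,
  stopBefore?: (turn: number) => boolean,
): LoopResultSketch {
  const messages: string[] = [];
  let lastText = "";
  let turnsTaken = 0;
  for (let turn = 0; turn < maxTurns; turn += 1) {
    if (stopBefore?.(turn)) break; // halt before spending this turn
    lastText = chat(messages);
    messages.push(lastText);
    turnsTaken += 1; // count only after a chat call has completed
  }
  return { final: lastText, turns: turnsTaken };
}

const stoppedEarly = runLoopSketch((m) => `reply ${m.length}`, 10, (turn) => turn === 2);
console.log(stoppedEarly.turns); // 2, not maxTurns (10)
```

If the hook halts before the first turn, the same counter reports 0, which matches the zero inference calls made.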

🟡 LOW Weakened assertion on tool result feedback in coordination-driver.test.ts — tests/loops/coordination-driver.test.ts

Line 139: Old assertion toolMsgs.some((m) => m.name === 'await_event' && m.content.includes('done')) changed to toolMsgs.some((m) => String(m.content).includes('done')) — drops the tool-name check. The toolMsgs.length >= 2 guard on line 138 still proves both tool results exist, but the name check gave extra defense against a mis-plumbed tool role. Low risk since the length guard + journal tree assertions ([lines 142-144](https://github.com/tangle-network/agent-runtime/blob/bf0752345cea05968500b158ecf9cab1

🟡 LOW scriptedBrain captures live message references — 'by turn N' feed-back proof is now testing final state, not per-turn snapshot — tests/loops/scripted-brain.ts

scriptedBrain (scripted-brain.ts:27) does seen?.push(messages) with NO spread/copy. The old per-file scriptedChat helpers did seen.push([...input.messages]). Because runBrainLoop (tool-loop.ts:63) reuses ONE growing messages array across all turns (passed by reference to chat at line 72), every seen[N] entry is the SAME array object. I proved this empirically: in a 3-turn loop, seen[0].length, seen[1].length, and seen[2].length all equal 5 (the final state), and seen[0], seen[1], and seen[2] are all the same reference. Impact on coordination-driver.test.ts:136-139: the comment claims 'by turn 2 (the 3rd chat call), the conversation ... contains tool messages' bu
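The aliasing effect is easy to reproduce outside the test suite. The sketch below is standalone and uses hypothetical names (`MessageSketch`, `seenLive`, `seenSnapshot`); it only contrasts pushing the live array with pushing a per-turn copy.

```typescript
type MessageSketch = { role: string; content: string };

const messages: MessageSketch[] = [];
const seenLive: MessageSketch[][] = [];     // push(messages): stores the same array every turn
const seenSnapshot: MessageSketch[][] = []; // push([...messages]): stores a per-turn copy

for (let turn = 0; turn < 3; turn += 1) {
  seenLive.push(messages);
  seenSnapshot.push([...messages]);
  messages.push({ role: "tool", content: `result ${turn}` });
}

console.log(seenLive.map((m) => m.length));     // [3, 3, 3]
console.log(seenSnapshot.map((m) => m.length)); // [0, 1, 2]
console.log(seenLive[0] === seenLive[2]);       // true
```

Only the snapshot variant can support a "by turn N" assertion, because each entry reflects the conversation as it was when that turn started.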

_{tangletools · 2026-06-20T21:56:52Z · trace}

tangletools

❌ 1 Blocking Finding — `bf075234`

Full multi-shot audit completed 8/8 planned shots over 80 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-20T21:56:52Z · immutable trace}

…r-sim product evals, one-call)

…ove() (the intelligence loop)

…sona facade

tangletools

✅ Auto-approved PR — `504f37e7`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-20T22:05:13Z}

tangletools

🟡 Value Audit — sound-with-nits


Verdict	sound-with-nits
Concerns	5 (1 medium-concern, 3 low, 1 weak-concern)
Heuristic	0.1s
Duplication	0.0s
Interrogation	229.3s (2 bridge agents)
Total	229.4s

💰 Value — sound

Usability/DX overhaul that adds one-call facades (supervise/improve/evalPersona), unifies the supervisor brain seam (DriverChat→ToolLoopChat), deletes dead workflow+journal code (~3500 LOC), and gates docs against symbol drift — coherent, in-grain, no duplication.

What it does: Adds one-call facades (supervise(), improve(), evalPersona()) that default raw seams a caller previously hand-wired; unifies the supervisor brain seam by replacing the DriverChat type zoo (a parallel type system used only by the driver) with the canonical ToolLoopChat used everywhere else, eliminating routerDriverChat (65 LOC → routerBrain at 4 LOC in src/runtime/router-client.ts:250); deletes the
Goals it achieves: 1. A competent engineer can derive the call path in seconds — supervise(profile, task, {backend, budget}) instead of hand-wiring 5+ components across blobs/journal/executors/scopes. 2. The supervisor brain is resolved from profile data (profile.harness), not hand-built — the same backend-as-data resolution rule as workers (src/runtime/supervise/supervisor-agent.ts:62-122). 3. Dead/duplicated surfa
Assessment: Good change on every dimension. The facades are thin (supervise() is ~120 LOC composing existing supervisorAgent + createInMemoryRunContext + createSupervisor().run(); improve() is a facade over agent-eval's selfImprove; evalPersona() wraps runPersonaConversation) — they default seams without hiding the raw ones. The DriverChat→ToolLoopChat unification (src/runtime/supervise/coordination-driver.ts
Better / existing approach: none — this is the right approach. The facades correctly compose existing infrastructure rather than reinventing it (supervise() calls supervisorAgent+createSupervisor+createInMemoryRunContext; improve() delegates to agent-eval's selfImprove; evalPersona() delegates to runPersonaConversation). The brain unification replaces a bespoke seam with the canonical one that runBrainLoop already uses — it'
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🎯 Usefulness — sound-with-nits

Coherent usability overhaul: unifies supervisor brain resolution, collapses 13→6 subpaths, adds one-call facades (supervise/improve/evalPersona) that layer cleanly over existing seams, and gates docs freshness — all reachable, no pattern competition, net −9.8k LOC.

Integration: All new capabilities are reachable through the established barrel/subpath structure. supervise() exported from ./loops subpath, called by 3 production examples + 2 test suites (tests/loops/supervise-convenience.test.ts, tests/supervisor-loop-example.test.ts). improve() exported from root barrel, called by 2 examples + 1 test (tests/improve.test.ts). evalPersona() exported from root b
Fit with existing patterns: Every new capability follows the codebase's established facade-over-substrate pattern. supervise() facades over supervisorAgent() + createSupervisor().run(), mirroring how the runtime already defaults blobs/perWorker/journal/executors. improve() facades over agent-eval's selfImprove() (src/improvement/improve.ts:204), picking default drivers by surface exactly as supervise() defaults
Real-world viability: Core paths hold up. supervise() enforces model-policy before compute spend (src/runtime/supervise/supervise.ts:88-94), defaults per-worker budget to pool/4 (src/runtime/supervise/supervise.ts:78-83), and requires backend or makeWorkerAgent (fail-loud, no silent stub). improve() throws ConfigError for surfaces with no default driver (src/improvement/improve.ts:196-199 — designed bou
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/improve/improve.ts

console.log(`shipped: ${out.shipped} lift: ${out.lift.toFixed(3)} gate: ${out.gateDecision}`)

🟡 Cruft: commented out code scripts/check-docs-freshness.mjs

+// CLASS 1 version / substrate-peer pins != package.json

🟡 Cruft: magic number added tests/loops/supervisor-agent.test.ts

+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }

🎯 Usefulness Audit

🟠 Bench prompt prose references stale MCP tool names (spawn_worker → spawn_agent) [integration]

The MCP server in src/mcp/tools/coordination.ts:363,403,420 exposes spawn_agent/observe_agent/steer_agent, but bench prompt prose still tells supervisor LLMs to call spawn_worker/observe_worker/steer_worker (bench/src/profiles.ts:22-24,96,98, bench/src/atom-humaneval.mts:96, bench/src/mcp-mount-probe.mts:89-90, bench/src/atom-mcp-e2e.mts:6,178). A supervisor bench run would fail because the LLM's tool calls wouldn't match the MCP server's tool registry. Tests use the correc

🟡 evalPersona uses as never casts bridging two AgentProfile types from different packages [robustness]

src/conversation/eval-persona.ts:86-89 casts worker, persona, backendFor, and systemPromptOf as never to pass them to runPersonaConversation, which imports AgentProfile from @tangle-network/agent-eval while evalPersona uses AgentProfile from @tangle-network/agent-interface. The comment acknowledges this is a type-boundary cast. If the two packages' AgentProfile types diverge in structure, the facade silently passes mismatched data with no compiler error. Consider aligni

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260620T221053Z}

…prove() JSON.parse, test flagged fns - HIGH: contentAddress was dropped from the runtime barrel by WS3 → bench/atom-humaneval + atom-mcp-e2e fail to compile (a content-addressing helper bench legitimately uses). Re-exported from the barrel. - MEDIUM: applyWinnerToProfile's JSON.parse threw a raw SyntaxError after a ship verdict on a malformed winner → parseWinnerJson guards it with a typed ConfigError + a test. - MEDIUM: finalizeBestDelivered + runBrainLoop had no direct tests → added focused unit tests (the blob store's content-address invariant is exercised). - LOW: supervise() decision-table/README rows implied backend is required (it's optional) → { budget, backend? }.

tangletools

✅ Auto-approved PR — `b424ee2c`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-20T22:20:16Z}

tangletools

🟡 Value Audit — sound-with-nits


Verdict	sound-with-nits
Concerns	4 (1 medium-concern, 3 low)
Heuristic	0.1s
Duplication	0.0s
Interrogation	297.2s (2 bridge agents)
Total	297.3s

💰 Value — sound

Usability/DX overhaul: unifies supervisor brain resolution to profile-driven (mirroring createExecutor), consolidates duplicated tool-loop code into shared runBrainLoop, shrinks surface 355→277 exports / 13→6 subpaths, adds supervise()/improve() one-call facades, adds CLASS-6 docs-symbol freshness ga

What it does: This PR does seven things: (1) unifies the supervisor brain to resolve from profile.harness — null → in-process router tool-loop, a CLI harness → sandboxed MCP driver (supervisorAgent.ts is new, replacing hand-built routerDriverChat). (2) consolidates three copies of the while-for-turn tool-loop skeleton into one shared runBrainLoop in src/runtime/tool-loop.ts:53 — both `routerToolLoop
Goals it achieves: Make the engine usable by a competent engineer in seconds rather than hours. Achieves this by: (a) eliminating the 'brain is router-only' gap — now any harness can be the brain (sandbox supervisor path proven). (b) making the call path derivable from docs — the canonical-api decision table's START HERE row is supervise(profile, task, { budget }), and every backticked symbol it names is mechanica
Assessment: Good change, well-executed. The brain-unification (resolved from profile.harness, mirroring createExecutor({backend})) is symmetric with the existing worker-resolution pattern and eliminates the 'driver brain is router-only' gap the PR targets. The tool-loop consolidation (runBrainLoop) removes genuine code duplication — before this PR, routerToolLoop and coordinationDriverAgent each had
Better / existing approach: none — this is the right approach. Searched for pre-existing patterns that do what supervise() or improve() or runBrainLoop do: supervise() is a new facade wrapping createSupervisor().run() (which was the raw primitive, not a convenience caller), improve() is a new facade over agent-eval's selfImprove (which is the same loop but without the generator-defaulting/profile-apply convenie
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🎯 Usefulness — sound-with-nits

Well-executed usability overhaul: every new convenience facade (supervise, improve, evalPersona) follows the same one-call-with-defaults pattern, wraps existing lower-level engines without competing, and has real callers in tests + examples; the surface shrink and naming cleanup are complete and con

Integration: All new exports are reachable. supervise() has 8 test callers + 3 examples + is called internally by supervisorAgent() (src/runtime/supervise/supervise.ts:110). improve() has 6 test callers + 2 runnable examples. evalPersona() has 3 test callers + 2 example sites. routerBrain has 4 tests + 2 examples + bench callers + internal use in supervisorAgent(). assertModelAllowed is called by both supervis
Fit with existing patterns: The three convenience facades follow a cohesive pattern mirroring each other: one call with sensible defaults, power-user seams underneath accessible for full control. They don't compete with existing patterns — supervise() wraps createSupervisor/supervisorAgent(), improve() wraps agent-eval's selfImprove, evalPersona() wraps runPersonaConversation. The naming changes (depth/breadthDriver→Strategy
Real-world viability: Config validation in supervise() and improve() fails before compute is spent — allowedModels guard, missing-backend check, missing-generator check all throw typed errors (ConfigError/ValidationError) at the top of the function. The JSON.parse fix in improve() (src/improvement/improve.ts:144-154) is properly try/catch-wrapped and tested (tests/improve.test.ts:138-159). The main robustness gap is th
Model: opencode/deepseek/deepseek-v4-pro
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/improve/improve.ts

console.log(`shipped: ${out.shipped} lift: ${out.lift.toFixed(3)} gate: ${out.gateDecision}`)

🟡 Cruft: commented out code scripts/check-docs-freshness.mjs

+// CLASS 1 version / substrate-peer pins != package.json

🟡 Cruft: magic number added tests/loops/supervisor-agent.test.ts

+const perWorker: Budget = { maxIterations: 4, maxTokens: 1000 }

🎯 Usefulness Audit

🟠 evalPersona throws raw Error instead of ConfigError for missing credentials [ergonomics]

src/conversation/eval-persona.ts:70 throws new Error('evalPersona: provide opts.{apiKey,baseUrl,model}...') for missing backend credentials. Per the error taxonomy contract at src/errors.ts:11-12, consumer-facing API errors should be typed (ConfigError or ValidationError, both AgentEvalError subclasses). A caller catching ConfigError to handle config failures programmatically would miss this one. The same pattern exists in runPersonaConversation (src/conversation/run-persona.ts:148,158), mak
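A sketch of the typed-error pattern the finding asks for. `ConfigErrorSketch`, `CredentialOptsSketch`, and `requireCredentials` are hypothetical stand-ins; the repo's own ConfigError and its base class are not reproduced here.

```typescript
// Stand-in for a typed configuration error (not the repo's class).
class ConfigErrorSketch extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ConfigErrorSketch";
    // Keeps instanceof working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

interface CredentialOptsSketch {
  apiKey?: string;
  baseUrl?: string;
  model?: string;
}

// Hypothetical guard: fail with a typed error before any compute is spent.
function requireCredentials(opts: CredentialOptsSketch): Required<CredentialOptsSketch> {
  const { apiKey, baseUrl, model } = opts;
  if (!apiKey || !baseUrl || !model) {
    throw new ConfigErrorSketch("provide opts.{apiKey,baseUrl,model}");
  }
  return { apiKey, baseUrl, model };
}

try {
  requireCredentials({ model: "some-model" });
} catch (err) {
  console.log(err instanceof ConfigErrorSketch); // callers can branch on the error type
}
```

A caller that catches the typed class can then treat missing credentials like any other configuration failure.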

_{value-audit · 20260620T222703Z}

tangletools · 2026-06-20T22:52:29Z

✅ No Blockers — `b424ee2c`

Readiness 16/100 · Confidence 95/100 · 35 findings (4 medium, 31 low)

	deepseek	glm	aggregate
Readiness	16	41	16
Confidence	95	95	95
Correctness	16	41	16
Security	16	41	16
Testing	16	41	16
Architecture	16	41	16

Full multi-shot audit completed 8/8 planned shots over 93 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Four 'as never' casts bridge different AgentProfile types across packages — src/conversation/eval-persona.ts

Lines 86-89 cast worker, persona, backendFor, and systemPromptOf as never to bypass the type mismatch between @tangle-network/agent-interface's AgentProfile (used by evalPersona) and @tangle-network/agent-eval's AgentProfile (expected by runPersonaConversation). The runtime behavior is correct — runPersonaConversation never inspects profiles, only passes them through callbacks — but if either AgentProfile type changes incompatibly, TypeScript will not catch it. Consider a thin adapter type at this boundary instead of 'as never' casts.

🟠 MEDIUM EvalPersona type not exported from barrel index — src/conversation/index.ts

eval-persona.ts:56 exports EvalPersona (the discriminated union {kind:'scripted',turns:string[]} | {kind:'profile',profile:AgentProfile}). conversation/index.ts:32 only re-exports EvalPersonaOptions and evalPersona, omitting EvalPersona. Root src/index.ts:76 likewise only exports EvalPersonaOptions. Consumers calling evalPersona(worker, persona, opts) cannot type persona through the public barrel — they must use a path import. Fix: add EvalPersona to both barrel exports.

🟠 MEDIUM mergeBudget accepts negative budget values — can corrupt scope accounting — src/mcp/tools/coordination.ts

mergeBudget validates that budget fields are finite numbers (typeof v !== 'number' || !Number.isFinite(v)) but does NOT reject negative values. The scope's budget accounting in src/runtime/supervise/scope.ts:721-727 uses these values directly in <= comparisons (tokensOk = totalTokens <= budget.maxTokens) and ratio calculations (budget.maxTokens / totalTokens). A negative maxTokens produces a negative ratio, and results in all spawns being rejected since iterations > 0 but maxIterations < 0 → itersOk = false. While this is fail-closed (doesn't bypass the ceiling), it's a silent misbehavior rather than a clear error. The description says 'only set the ceilings this sub-task needs raised' — passing negative values violates the semantic contract but goes undetected. Recommend adding a >= 0 or
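A sketch of the stricter validation, assuming a budget with the two ceilings named in this finding. `BudgetSketch` and `assertValidBudget` are hypothetical; the real Budget type and mergeBudget may carry more fields.

```typescript
// Hypothetical, reduced budget shape.
interface BudgetSketch {
  maxIterations?: number;
  maxTokens?: number;
}

// Reject non-finite and negative ceilings with a clear error instead of
// letting them flow into scope accounting.
function assertValidBudget(budget: BudgetSketch): void {
  for (const [field, value] of Object.entries(budget)) {
    if (value === undefined) continue;
    if (typeof value !== "number" || !Number.isFinite(value)) {
      throw new RangeError(`budget.${field} must be a finite number`);
    }
    if (value < 0) {
      throw new RangeError(`budget.${field} must be >= 0, got ${value}`);
    }
  }
}

assertValidBudget({ maxTokens: 1000 }); // passes silently
try {
  assertValidBudget({ maxTokens: -1 });
} catch (err) {
  console.log((err as Error).message);
}
```

Failing at the boundary turns the silent reject-every-spawn behavior into an immediate, attributable error.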

🟠 MEDIUM runBrainLoop returns wrong turns count when stopBefore hook halts the loop early — src/runtime/tool-loop.ts

Lines 69–113: When stopBefore returns true, the for loop executes break, then falls through to return { final: lastText, turns: maxTurns, ... } on line 113. If stopBefore breaks on turn 1 (before any inference call), turns is reported as maxTurns (e.g., 2000 for coordination-driver maxTurns=0) instead of 0 (zero turns consumed). final is empty string. The only caller that sets stopBefore hooks (coordinationDriverAgent) ignores runBrainLoop's return value, so this is currently latent. But the exported runBrainLoop i

🟡 LOW agent.md references now-undocumented analyst-loop types as unlinked plain text — docs/api/agent.md

Lines ~1512-1592: createSurfaceImprovementAdapter(), createSurfaceKnowledgeAdapter(), and measureOutcome() document return/param types ImprovementAdapter, KnowledgeAdapter, RunAnalystLoopResult as plain text (no markdown link). In the base these linked to analyst-loop.md; after the analyst-loop entry point was removed from typedoc.json (out-of-shot), TypeDoc downgraded them to unlinked text. The types still exist and are exported at src/analyst-loop/types.ts:25,51,119, so they remain importable public surface. Impact: a reader of agent.md cannot navigate to these type definitions, and the functions' contracts are partially opaque. This is a usability regression inherent to the entry-point-removal decision, not a mechanical regen bug. Fix (pick one): (a) re-add src/analyst-loop/index.ts as

🟡 LOW agent.md sandboxOverrides docstring still mentions removed createSandboxForSpec — docs/api/agent.md

agent.md line 1197 (in CreateSandboxActOptions.sandboxOverrides): docstring says 'forwarded to createSandboxForSpec' but createSandboxForSpec is no longer in the public API docs (it was removed from docs/api/runtime.md and is not re-exported from src/runtime/index.ts). The source comment agent/sandbox-act.ts may need updating. Users reading docs will see a reference to an undocumented function. Low impact — the docstring is source-derived.

🟡 LOW Pre-existing broken anchor: #budget-9 resolves nowhere in GFM — docs/api/runtime.md

docs/api/mcp.md:4165 and docs/api/index.md:3175 reference runtime.md#budget-9, but runtime.md has exactly one ### Budget heading (line 8107), which generates anchor #budget in standard GFM. The -9 suffix is a TypeDoc global-reflection-counter artifact, not a GFM dedup. This predates the PR but cross-refs from mcp.md/index.md won't resolve in Markdown renderers. Fix: suppress TypeDoc's anchor suffixing for single-occurrence headings, or add an explicit <a id="budget-9"> tag.

🟡 LOW Pre-existing broken anchor: #sandboxclient-1 resolves nowhere in GFM — docs/api/runtime.md

9 cross-references across agent.md:1130, mcp.md:680,712,881,1650,1905,1955, profiles.md:1543, index.md:3085 use runtime.md#sandboxclient-1. Single ### SandboxClient heading at line 9748 → GFM anchor is #sandboxclient. The -1 suffix prevents resolution. Same TypeDoc systemic issue as #budget-9.

🟡 LOW Pre-existing broken anchor: #scope-1 resolves nowhere in GFM — docs/api/runtime.md

mcp.md:4147 references runtime.md#scope-1. Single ### Scope heading at line 8176 → GFM anchor is #scope. Same TypeDoc systemic issue.

🟡 LOW Pre-existing wrong-target anchor: #kind-3 resolves to LoopPlanDescription.kind, not LoopSandboxPlacement.kind — docs/api/runtime.md

mcp.md:2380 uses runtime.md#kind-3 in context LoopSandboxPlacement.kind. But kind-3 resolves to LoopPlanDescription.kind at line 9575 (the 4th kind heading), while LoopSandboxPlacement.kind at line 9924 should be #kind-4. Off-by-one from a ###### kind subheading at line 2861 that TypeDoc counted. Pre-existing semantic mis-target.

🟡 LOW Case-sensitive haltOn predicate may never fire — examples/product-eval/product-eval.ts

Line 78: haltOn: (ctx) => ctx.lastTurn.text.includes('RESOLVED') is case-sensitive. The adversary system prompt (line 65) says 'Say the literal word RESOLVED' — matching intent. However, LLMs frequently vary casing (RESOLVED / Resolved / resolved). If the model emits lower- or mixed-case, the halt predicate never matches and the run always exhausts maxTurns: 8. The maxTurns backstop prevents infinite loops, so this is not a correctness bug, just a fragility that users copy-pasting the example may run into. Mitigati
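One way to make the predicate tolerant of casing, as a sketch. `HaltContextSketch` is a hypothetical reduction of the context to the single field the predicate reads.

```typescript
// Hypothetical context shape, reduced to what the predicate needs.
interface HaltContextSketch {
  lastTurn: { text: string };
}

// Case-insensitive, whole-word match for the halt marker.
const haltOn = (ctx: HaltContextSketch): boolean => /\bresolved\b/i.test(ctx.lastTurn.text);

console.log(haltOn({ lastTurn: { text: "Issue Resolved, thanks." } }));  // true
console.log(haltOn({ lastTurn: { text: "This is still unresolved." } })); // false
```

The word boundaries keep "unresolved" from matching, which a plain lower-cased substring check would not.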

🟡 LOW Example is not covered by an automated runtime smoke; only typecheck guards it — examples/product-eval/product-eval.ts

package.json typecheck:examples runs tsc --noEmit on examples (verified clean for this file after build), and biome lints it, but no test executes product-eval.ts. A behavioral regression in evalPersona/runPersonaDispatch/runProfileMatrix that preserves types would silently break the example. Low severity because the example is documentation-by-code, not a shipped code path; the substrate functions it calls ARE unit-tested (eval-persona.test.ts, run-persona.test.ts). Optional nit: add a vitest test that imports the three cell functions with a fake backendFor and asserts they resolve, mirroring the eval-persona.test.ts offline pattern the README already points to.

🟡 LOW scoredCell hardcodes commitSha:'example' and a single scenario, so matrix output is illustrative only — examples/product-eval/product-eval.ts

Line 103: commitSha: 'example' and one scenario/profile. This is appropriate for an example (runProfileMatrix requires commitSha as a non-optional paper-grade field), and the default integrity:'assert' posture will pass because a real createOpenAICompatibleBackend is wired. Not a defect — noting only that anyone copy-pasting this as a real eval template must replace the sha and expand the corpus. README could add a one-line note that commitSha should be a real git SHA for reproducible records.

🟡 LOW maxTurns set in run-bridge.ts but omitted in run-sandbox.ts (sibling runners diverge) — examples/supervisor-loop/run-sandbox.ts

run-bridge.ts:90 passes maxTurns: 12 to supervise(); run-sandbox.ts:58-74 does not, so it falls back to the driver default of 16 (src/runtime/supervise/coordination-driver.ts:106). Not a correctness bug — both are bounded — but the two sibling runners are supposed to be the IDENTICAL supervisor differing only in the worker seam, and this is a silent behavioral asymmetry. Fix: add maxTurns: 12 to run-sandbox.ts's supervise() call for parity, or document why sandbox intentionally differs.

🟡 LOW sandboxClient cast through unknown is TypeScript-unsafe — examples/supervisor-loop/run-sandbox.ts

Line 42: new SandboxClient({ apiKey, baseUrl }) as unknown as RuntimeSandboxClient. If @tangle-network/sandbox diverges its create() method signature from the runtime's SandboxClient interface, this cast silently passes at compile time but fails at runtime when the executor calls sandboxClient.create(...). Pre-existing pattern (not introduced here), but the sandbox runner is billed as a runnable example — a runtime failure in the example erodes trust. Consider a runtime typeguard or documenting the exact sandbox SDK version required.
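A runtime type guard could replace the double cast so that SDK drift fails at construction time instead of at the first `create()` call. The sketch below is an assumption-heavy illustration: `RuntimeSandboxClientLike`, `isRuntimeSandboxClient`, and `asRuntimeSandboxClient` are hypothetical names, and the guard checks only that a `create` method exists, not its signature.

```typescript
// Hypothetical minimal view of the runtime's client interface: only create() is checked.
interface RuntimeSandboxClientLike {
  create: (...args: unknown[]) => unknown;
}

function isRuntimeSandboxClient(value: unknown): value is RuntimeSandboxClientLike {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { create?: unknown }).create === "function"
  );
}

// Narrow or fail loudly, instead of `as unknown as RuntimeSandboxClient`.
function asRuntimeSandboxClient(value: unknown): RuntimeSandboxClientLike {
  if (!isRuntimeSandboxClient(value)) {
    throw new TypeError(
      "sandbox client is missing create(); check the installed sandbox SDK version",
    );
  }
  return value;
}
```

A structural check like this cannot prove the argument types still line up, so pinning the sandbox SDK version in the example's README would still be worthwhile.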

🟡 LOW Bridge error body embedded into thrown Error message (potential secret echo) — examples/supervisor-loop/run-supervisor-mcp.ts

run-supervisor-mcp.ts:106 throws supervisor bridge ${res.status}: ${(await res.text()).slice(0, 300)}. If the cli-bridge ever echoed the authorization bearer or request headers in a non-OK response body, up to 300 chars would land in the error message and any log/console that prints it. Low likelihood (the bridge is local/trusted), standard pattern, but for a path that explicitly passes Bearer ${bridgeBearer} (line 97) this is worth a 1-line sanitization if the bridge ever fronts non-local transports. Not blocking.
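The one-line sanitization could look like the sketch below. `redactForError` is a hypothetical helper, not an existing function in the example; it masks any known secret values plus anything that looks like a bearer token, and only then truncates, so a secret straddling the 300-character cut cannot leak in part.

```typescript
// Hypothetical helper: mask secrets before a response body is embedded in an Error message.
function redactForError(body: string, secrets: string[], maxLen = 300): string {
  let out = body;
  for (const secret of secrets) {
    // Skip empty strings: splitting on "" would insert the mask between every character.
    if (secret.length > 0) out = out.split(secret).join("[redacted]");
  }
  // Also mask anything shaped like an Authorization bearer credential.
  out = out.replace(/Bearer\s+[A-Za-z0-9._~+\/-]+=*/gi, "Bearer [redacted]");
  // Truncate last, after redaction.
  return out.slice(0, maxLen);
}
```

The throw site would then pass the response text and the bearer value through the helper before building the message.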

🟡 LOW workerFromBackend degrades worker-name uniqueness in traces — examples/supervisor-loop/run-supervisor-mcp.ts

Line 130: workerFromBackend(backend, { check: demoCheck, ... }) names non-explicitly-named workers as 'worker' (the static default in supervise.ts:32). The old makeWorker (deleted loop.ts) used a counter (worker-${counter.n++}) so each spawn-1/spawn-2 trace entry was distinguishable. When the supervisor omits a name in its spawn_agent profile, all workers trace as worker — making event-bus logs harder to read. No correctness impact (scope.spawn already uses unique handle IDs). Fix: accept a per-worker counter or label generator in the workerFromBackend seam, or document that the supervisor must supply names for readable traces.
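A label generator along these lines would restore distinguishable trace names when the supervisor omits one. `makeWorkerNamer` is a hypothetical helper; how it would be threaded into the `workerFromBackend` seam is left open here.

```typescript
// Hypothetical label generator: keep an explicit name, otherwise hand out numbered defaults.
function makeWorkerNamer(prefix = "worker"): (explicit?: string) => string {
  let next = 0;
  return (explicit) => {
    const name = explicit?.trim();
    return name ? name : `${prefix}-${next++}`;
  };
}

const nameWorker = makeWorkerNamer();
const names = [nameWorker(), nameWorker("planner"), nameWorker()];
// names is ["worker-0", "planner", "worker-1"]
```

The counter lives in the closure, so each supervisor run that creates its own namer starts again at `worker-0`.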

🟡 LOW demoCheck JSON.stringify fallback is O(n) over the full worker output — examples/supervisor-loop/shared.ts

shared.ts:22 — for non-{content} shapes (the sandbox serialized event-stream case), demoCheck runs JSON.stringify(out ?? '') on every worker output to substring-search 'ANSWER=42'. Fine for the short example outputs; on a real box's multi-MB event stream this would stringify the whole stream per worker per gate. Acceptable for an example, but the comment at shared.ts:16-17 advertises this branch as the box path, so the cost should be noted. No fix required for examples; flag for any fleet reuse.

🟡 LOW No dedicated unit test for createSandboxAct adapter (pre-existing) — src/agent/sandbox-act.ts

No sandbox-act.test.ts exists in src/agent/. The adapter's event-mapping, output-promise settlement, and abort-signal plumbing are only exercised indirectly through integration run-paths. This is pre-existing — the diff does not change behavior or reduce coverage — but worth flagging because the file owns the eval/prod parity contract (per its header docblock) and a regression in settle/fail wiring (lines 79-88) or the raw-events buffer passed to output.parse (line 105) would silently corrupt scorecard grading. Not a blocker for this imp

🟡 LOW No profile-driven persona test in eval-persona.test.ts — src/conversation/eval-persona.test.ts

All 3 tests use { kind: 'scripted', turns: [...] }. The { kind: 'profile', profile: AgentProfile } path (which triggers the LLM user-sim persona and requires maxTurns) is exercised in run-persona.test.ts but not through the evalPersona facade. The as never cast on persona as PersonaDriver for a profile-kind persona is therefore never integration-tested from the facade level.

🟡 LOW No test for profile-kind persona through evalPersona facade — src/conversation/eval-persona.test.ts

All 3 tests use scripted personas. The profile-kind path through evalPersona (which exercises the default backendFor for BOTH worker and persona, plus the systemPromptOf default applied to the persona side) is only tested at the lower runPersonaConversation level (run-persona.test.ts:5 'runs a profile-driven persona end-to-end'). Since evalPersona adds its own defaulting layer and casts, a direct integration test for { kind: 'profile', profile } through the facade would close the gap. Coverage is acceptable given the lower-level tests, but not complete.

🟡 LOW Doc claim about maxTurns:0 is inaccurate — it throws, not zero-turns — src/conversation/eval-persona.ts

eval-persona.ts:43 doc says 'maxTurns: 0 is zero turns, not run-until-done'. But define-conversation.ts:50 throws ValidationError for maxTurns < 1, so maxTurns:0 actually aborts at conversation-definition time, not 'zero turns'. The fail-loud behavior is better than documented, but the doc text is technically wrong. Fix: change to 'maxTurns: 0 is rejected (must be >= 1), not run-until-done'.

🟡 LOW Dual AgentProfile type clash requires as never escape hatches — src/conversation/eval-persona.ts

eval-persona.ts:86-89 uses as never on worker, backendFor, and systemPromptOf because run-persona.ts imports AgentProfile from @tangle-network/agent-eval (role/environment/domain shape) while eval-persona.ts imports from @tangle-network/agent-interface (prompt.systemPrompt shape). The comment at lines 82-85 correctly argues runtime safety: the runner treats the profile as an opaque token, only the callbacks inspect it, and the callbacks are typed for the agent-interface shape. This is sound but fragile — a future change to runPersonaConversation that inspects the profile directly would silently break. Consider unifying on one AgentProfile type or addin

🟡 LOW EvalPersona type not exported from public barrel — src/conversation/index.ts

index.ts:32 exports type EvalPersonaOptions, evalPersona but NOT EvalPersona (the persona union type). The low-level PersonaDriver IS exported (index.ts:58). Callers using evalPersona from the public barrel cannot type their persona argument without a deep import from './eval-persona'. Fix: add type EvalPersona to the export on line 32.

🟡 LOW mergeBudget does not reject negative budget ceilings — src/mcp/tools/coordination.ts

At coordination.ts:177-183, the field helper validates Number.isFinite(v) but not v >= 0. A caller passing budget: { maxUsd: -1 } would pass validation and produce a Budget with a negative ceiling. Downstream budget-pool logic (supervise/budget.ts:142) initializes freeUsd = root.maxUsd ?? 0 and decrements, so a negative per-worker ceiling would cause the pool to deny admission (fail-closed), not an overspend. Impact is therefore low — no security or cost bypass — but a negative budget is nonsensical and the fail-loud philosophy of this function ('a malformed budget must never silently fall back') would be better served by rejecting sign. This matches existing behavior (the perWorker default is also unvalidated for sign), so it is consistent rather than a regression. Fix: add `||

🟡 LOW mergeBudget silently ignores unknown budget fields — src/mcp/tools/coordination.ts

At coordination.ts:184-187, field() reads exactly 4 hardcoded keys (maxIterations/maxTokens/maxUsd/deadlineMs). A caller passing { maxIteration: 5 } (typo, missing 's') or any unknown key would have it silently ignored — no validation error. The fail-loud contract stated in the comment (lines 170-172) is about malformed known fields, not unknown ones. Low impact: the conserved pool still fences, and the merge correctly applies only known ceilings. But a typo'd ceiling would silently fall back to the default — the exact 'nobody chose' scenario the comment warns against for other cases. Fix: optionally detect and reject keys not in the known set.
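Both `mergeBudget` findings (negative ceilings and silently ignored keys) could be closed by one strict validation pass. The sketch below is not the code in `coordination.ts`: `validateBudget` and its types are hypothetical, and only the four field names come from the finding.

```typescript
// The four ceilings named in the finding; everything else here is a hypothetical sketch.
const BUDGET_KEYS = ["maxIterations", "maxTokens", "maxUsd", "deadlineMs"] as const;
type BudgetKey = (typeof BUDGET_KEYS)[number];
type BudgetInput = Partial<Record<BudgetKey, number>>;

function validateBudget(raw: Record<string, unknown>): BudgetInput {
  // Reject typos such as `maxIteration` instead of silently dropping them.
  const known: readonly string[] = BUDGET_KEYS;
  const unknownKeys = Object.keys(raw).filter((k) => known.indexOf(k) === -1);
  if (unknownKeys.length > 0) {
    throw new Error(`unknown budget field(s): ${unknownKeys.join(", ")}`);
  }
  const out: BudgetInput = {};
  for (const key of BUDGET_KEYS) {
    const v = raw[key];
    if (v === undefined) continue;
    // Finite AND non-negative: a negative ceiling is never meaningful.
    if (typeof v !== "number" || !isFinite(v) || v < 0) {
      throw new Error(`budget.${key} must be a finite, non-negative number`);
    }
    out[key] = v;
  }
  return out;
}
```

With this shape, `{ maxUsd: -1 }` and `{ maxIteration: 5 }` both fail loudly, which matches the stated contract that a malformed budget must never silently fall back.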

🟡 LOW Multiple public export removals (major API surface reduction) — src/runtime/index.ts

Removed exports: FileResultBlobStore, FileSpawnJournal, materializeTreeView, replaySpawnTree, FileLoopJournal, InMemoryLoopJournal, LoopJournal, createFileRunContext, routerDriverChat, all driver-executor exports (driverChild, driverExecutorFactory, driverRuntime, isDriverSpec, withDriverExecutor), all patch-checks exports (countDiffLines, isNonEmptyPatch, runCoderChecks, touchedPathsFromPatch, touchesSecretPath, CoderCheckConstraints, CoderCheckInput), all trace-source decode/decoder types (decodeAnthropicPart, decodeOpenAiPart, decodeOpencodePart, SessionMessageLike, ToolPartDecoder, ToolStepInput, toolPartDecoders, toToolSpan, createPartsTraceSource), all scope seam exports (NestedScopeSeam, nestedScopeSeamKey), all runtime

🟡 LOW workerFromBackend builds executor with a dead-end abort signal the scope can't cascade through — src/runtime/supervise/supervise.ts

Line 36: workerFromBackend constructs the executor with ctx.signal = new AbortController().signal. When scope.spawn later resolves this BYO executor through the registry, the ExecutorRegistry.resolve factory (runtime.ts:1193-1195) returns the pre-built executor verbatim — it ignores the scope's childAbort.signal in the new context. The executor's internal controller was linked at factory time to the dead-end signal. However, executor.execute(task, signal) receives childAbort.signal as its signal parameter (scope.ts:585), and the built-in executors merge this with their internal controller, so execute-time abort propagation works correctly.

🟡 LOW supervise() defaults maxDepth to 8 while createSupervisor uses 4 — undocumented divergence — src/runtime/supervise/supervise.ts

Line 122: maxDepth: opts.maxDepth ?? 8. The supervisor's own defaultMaxDepth = 4 (supervisor.ts:68) is documented as 'paired with the conserved pool so a runaway recursion hits budget-exhaustion first and depth-exceeded second (R3)'. The supervise() one-call API doubles this to 8 without a comment explaining why. This is likely intentional (the convenience API allows deeper decomposition trees) but a reader would expect the defaults to match. Impact: low — the conserved pool is still the primary bound; maxDepth is a tripwire. But a user calling supervise() gets a different safety ceiling than one calling createSupervisor().run() directly. Fix: add a o

🟡 LOW workerFromBackend creates a detached AbortController signal that is never abortable — src/runtime/supervise/supervise.ts

Line 28: const ctx: ExecutorContext = { signal: new AbortController().signal, seams: {} }. This signal is created, passed to the executor factory, and then discarded — nobody holds a reference to the controller, so it can never be aborted. Verified harmless: scope.ts:585 (runChild) passes the REAL childAbort.signal to executor.execute(task, signal), which is the signal that actually controls cancellation. The constructor signal is only used if an executor captures it internally before execute() is called. No built-in executor does this. Impact: none in practice, but the dead signal is misleading — a reader might think the worker is abortable thro

🟡 LOW runBrainLoop accesses r.toolCalls.length without null-guard (regression from old defensive code) — src/runtime/tool-loop.ts

At line 80 (if (r.toolCalls.length === 0)) and line 86 (r.toolCalls.map(...)), runBrainLoop accesses toolCalls directly. The old inline code in coordination-driver used const calls = res.toolCalls ?? [] and routerToolLoop trusted routerChatWithTools (which always returns an array). The ToolLoopChat type contract requires toolCalls: RouterToolCall[], so well-typed implementations are safe. But a runtime type violation (e.g., a brain returning { content: 'done' } without toolCalls) would crash with TypeError instead of treating it a
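The defensive behavior of the old inline code could be restored by normalizing the reply once at the loop boundary. This is a sketch under assumed shapes: `ToolCallLike`, `BrainReplyLike`, and `toolCallsOf` are hypothetical, and the real `RouterToolCall` type carries more fields.

```typescript
// Hypothetical minimal shapes; the real RouterToolCall carries more fields.
interface ToolCallLike {
  name: string;
}
interface BrainReplyLike {
  content?: string;
  toolCalls?: ToolCallLike[] | null;
}

// Normalize once so the loop body can rely on an array,
// even when a brain violates the type contract at runtime.
function toolCallsOf(reply: BrainReplyLike): ToolCallLike[] {
  return Array.isArray(reply.toolCalls) ? reply.toolCalls : [];
}
```

The loop would then test `toolCallsOf(r).length === 0` and map over the same normalized array, so a reply without `toolCalls` reads as "no tool calls" rather than throwing a `TypeError`.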

🟡 LOW Dead 'seen' variable in maxTurns=0 inference-bounded test (driver-inference-metering.test.ts:294-335) — tests/loops/driver-inference-metering.test.ts

Line 294 declares const seen: Array<...> = [], line 300 pushes to it, and line 333 asserts expect(seen.length).toBe(3). But since seen captures the messages array by reference (the aliasing issue reported for tests/loops/scripted-brain.ts), seen.length is just the brain-call count, equivalent to the existing expect(n).toBe(3) on [line 335](https://github.com/tangle-network/agent-runtime/b

🟡 LOW scriptedBrain captures messages array by reference, not copy — weakens per-turn 'seen' assertions — tests/loops/scripted-brain.ts

Line 22: seen?.push(messages) pushes a REFERENCE to the same array that runBrainLoop mutates across all turns (tool-loop.ts:63 const messages: Msg[] = [...], then pushes at lines 84 and 109). The old scriptedChat used seen.push([...input.messages]) (shallow copy per turn). Result: all seen[i] entries alias the same final-state array. Tests that check seen[i] content (coordination-driver.test.ts:136-139 const turn2Convo = seen[2]!) still pass because the scripts stop at the checked turn, so the array's final state equals its
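The aliasing is easy to reproduce in isolation. The sketch below uses a hypothetical `Msg` shape and a three-turn loop standing in for `runBrainLoop`; it records the same growing array once by reference and once by shallow copy.

```typescript
type Msg = { role: string; content: string };

const messages: Msg[] = [{ role: "user", content: "task" }];
const seenByRef: Msg[][] = [];
const seenByCopy: Msg[][] = [];

for (let turn = 0; turn < 3; turn++) {
  seenByRef.push(messages); // alias: every entry is the same live array
  seenByCopy.push([...messages]); // snapshot of the conversation at this turn
  messages.push({ role: "assistant", content: `turn ${turn}` });
}

const refLengths = seenByRef.map((m) => m.length); // [4, 4, 4]
const copyLengths = seenByCopy.map((m) => m.length); // [1, 2, 3]
```

Only the copied entries preserve what the brain actually saw on each turn, which is why per-turn assertions on `seen[i]` need the `[...messages]` form.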

🟡 LOW scriptedBrain ignores ToolLoopChat's tools parameter — tests/loops/scripted-brain.ts

ToolLoopChat type is (messages, tools) => Promise<...> but scriptedBrain returns async (messages) => ... (line 19). This is valid TypeScript (callbacks may accept fewer params), but means the scripted brain can't verify tool specs match. For a test helper this is intentional — the scripted turns don't need tool validation. No fix required unless you want to add a tools assertion in debug builds.

🟡 LOW SANDBOX arm supervisor test has no explicit fetch cleanup — tests/loops/supervisor-agent.test.ts

Lines 91-110: the SANDBOX harness test drives real HTTP→MCP via global fetch without vi.stubGlobal setup/teardown. This is actually the correct integration pattern (exercising real plumbing, not a mock), but no afterEach/afterAll closes the MCP server explicitly — close() is called inside the agent's act. If the server port leaks across tests, a subsequent test bind could fail. Current structure looks safe since each test creates its own scope+server, but adding an explicit close in cleanup would harden it.

_{tangletools · 2026-06-20T22:52:26Z · trace}

drewstone · 2026-06-20T23:49:38Z

Addressed in b424ee2c, triaged against current HEAD:

Fixed (real):

🔴 HIGH contentAddress dropped from the runtime barrel → re-exported. Confirmed the break via npx tsc in bench/ (atom-humaneval.mts + atom-mcp-e2e.mts now compile). Good catch.
🟠 applyWinnerToProfile's raw JSON.parse could throw a SyntaxError after a ship verdict → parseWinnerJson typed ConfigError guard + a failure-path test.
🟠 finalizeBestDelivered + runBrainLoop had no direct tests → focused unit tests added (the finalizeBestDelivered one exercises the blob store's content-address invariant).
🟡 supervise() rows implied backend was required → { budget, backend? }.

Intentional (by design, not a fix):

spawn_worker→spawn_agent is the WS4 taxonomy rename (documented in the PR body). Greenfield package — no back-compat alias, per repo house rules.

The remaining LOW findings are nitpicks or already stale (the src/runtime/tool-loop.ts naming collision was renamed to runBrainLoop earlier in the branch). All gates green, +4 tests (1052 total), clean merge into main.

…face-shrink (#352) The src/platform/ clients (PlatformAuthClient cross-site SSO, PlatformHubClient /v1/hub integrations) were still present but un-exported after the subpath collapse; 5 product agents import them. Re-add the export + tsup entry. 0.70.1.

drewstone added 30 commits June 20, 2026 08:30

Revert "feat(runtime): durable run loop — wire supervisor resume + journal the kernel loop (#346)" This reverts commit edc1d54.

cc4e633

docs(simplification): master tracker — converged design, scratch list, full doc/module/example inventory + completion criteria

bd37331

docs(simplification): red-team corrections — 4 verbs (run/improve/certify/refuse), steer-in-run, milestone-oracle gap, 8 skills to vendor

1be8ac8

docs(simplification): improve is ONE verb with a PLUGGABLE CandidateGenerator (GEPA/skillOpt/autoresearch) + surface param — not 'one engine'

a2ef652

refactor(runtime): extract the canonical runToolLoop; routerToolLoop becomes a thin adapter (keystone 1/4)

0f40628

docs(simplification): keystone WS1 is two phases — 1a (seam unified) done, 1b (brain-from-profile/harness-as-data, sandbox supervisor) next

7c6c3d4

refactor(runtime): internalize leaked recursion/seam/journal/trace plumbing from the public barrel

20a6cd5

refactor(runtime): internalize durable spawn-journal + spawn-tree types from the public barrel

b4090a1

refactor(api): collapse public export subpaths 13→6 (fold audit into profiles; drop unused duplicates)

e6ff2a2

docs: fix 3 stale/fabricated symbol references (DriverChat, runSteeringExperiment, refineGepa label)

4876c21

docs: consolidate 26→19 + archive (shrink canonical-api 984→76, merge 4 architecture docs→1, merge PLAIN→README, archive 5 niche notes)

cd9f7e6

chore(profiles): sort barrel exports after the audit fold (biome)

164678d

docs(simplification): mark WS1a/WS3/WS5 shipped

cf58583

docs(simplification): mark WS1b shipped (supervisorAgent — brain from profile.harness)

f5eaee3

docs(simplification): table the supervisor/driver/worker multi-round design (round vs turn, prompt-policy retry, real-time trace self-correction)

5c2b38e

docs(examples): canonical supervise() one-call example (the DX payoff — profile + goal, scaffolding defaulted)

25340d5

refactor(mcp): rename createDriveTurnResumeDriver → createDetachedTurnResumeDriver (WS4)

7ed8833

feat(improvement): improve() — the one pluggable RSI verb (facade over selfImprove; generator defaulted from surface)

bbe77b6

test(improvement): offline improve() facade test (scripted generator, no creds)

743bdb8

style(improvement,mcp): biome import ordering after WS4 rename + improve() exports

4ab7dc9

docs(examples): point READMEs at the pruned set + add an offline supervise() example test

2a157f1

tangletools reviewed Jun 20, 2026

drewstone added 2 commits June 20, 2026 15:17

drewstone dismissed tangletools’s stale review via bf07523 June 20, 2026 21:19

tangletools approved these changes Jun 20, 2026

tangletools reviewed Jun 20, 2026

drewstone added 4 commits June 20, 2026 15:41

feat(mcp): spawn_agent accepts an optional per-spawn budget (supervisor can vary budget per worker)

8b1a805

feat(runtime): allowedModels guard on supervise()/improve() (fail-loud model-subset restriction)

0fc83cf

docs(examples): strategy-evolution — the policy-search research journey (runStrategyEvolution + promotionGate)

15977c9

tangletools requested changes Jun 20, 2026

drewstone added 5 commits June 20, 2026 15:57

feat(conversation): evalPersona() facade + examples/product-eval (user-sim product evals, one-call)

5fe2952

docs(examples): improve() — the RSI verb, offline scripted example

a4c28bd

docs(examples): intelligence-recommend — connect traces→findings→improve() (the intelligence loop)

39777ca

docs(examples): list the 4 new examples in the index README

545016b

docs(api): regenerate API reference for allowedModels guard + evalPersona facade

504f37e

tangletools previously approved these changes Jun 20, 2026

tangletools reviewed Jun 20, 2026

drewstone dismissed tangletools’s stale review via b424ee2 June 20, 2026 22:20

tangletools approved these changes Jun 20, 2026

tangletools reviewed Jun 20, 2026

drewstone merged commit 301f632 into main Jun 20, 2026
1 check passed

This was referenced Jun 21, 2026

chore(release): 0.70.0 #349

Merged

feat(platform): restore ./platform export (0.70.1) #352

Merged

drewstone mentioned this pull request Jun 23, 2026

chore(cleanup): delete dead eval-persona facade + orphaned topology module #367

Merged

drewstone merged 44 commits into main from feat/usability-overhaul
