feat(loops): add observe and substrate loop proofs by drewstone · Pull Request #194 · tangle-network/agent-runtime

drewstone · 2026-06-08T09:39:08Z

Summary

- Add observe() as the trace-derived third-person watcher that turns worker behavior into findings, reports, and optional corpus facts.
- Add the narrow gitWorkspace/Shell port for clone/commit/push durable workspace loops.
- Add bench/src/observe-steer-workspace-loop.mts, the local substrate proof for: Supervisor/Scope -> coordination MCP tools -> git workspace -> observe() finding -> steer_worker/Scope.send -> corrective worker -> fresh-clone integration pass.
- Keep loop authoring substrate-first: Scope, Supervisor, runLoop, validators, journals, MCP coordination, git workspace, and observe.
- Remove the experimental defineLoop facade/protocol, its exports, docs, example, and tests.
- Tighten MCP coordination back to substrate names: Question, QuestionDecision, QuestionPolicy, CoordinationEvent; no facade-era LoopQuestion types.
- Split process rules out of agent bootloaders: AGENTS.md/CLAUDE.md now point to canonical docs/BUILDING.md and docs/ANTI_PATTERNS.md.
- Update loop-writer, the facade postmortem, and the docs index with the proof command and the "do not relocate protocol and call it simplification" guardrail.

Design Notes

The post-audit API stance is deliberate: loops are not a new runtime grammar. They are ordinary agent code over the existing substrate. observe() is the new load-bearing primitive; gitWorkspace is the durable workspace seam; MCP coordination is the sandbox binding for the same Scope verbs.

The decisive local join is now proven by:

pnpm exec tsx bench/src/observe-steer-workspace-loop.mts

That proof uses a mock ChatClient transport for the observer model call and local BYO worker executors so it is reproducible without cloud credentials. Honest remaining proof: run the same shape with openSandboxRun workers and a remote branch that a sandbox can clone and push.

Docs placement is now explicit: AGENTS.md and CLAUDE.md are bootloaders; durable build rules live in docs/BUILDING.md; named failure modes live in docs/ANTI_PATTERNS.md; evidence and postmortems stay under docs/research/* / memory.

Validation

pnpm lint
pnpm typecheck
pnpm test -- --runInBand (67 files / 678 tests)
pnpm exec vitest run tests/loops/workspace.test.ts tests/loops/coordination.test.ts --reporter=dot
pnpm exec vitest run tests/loops/workspace.test.ts --reporter=dot
pnpm exec tsx bench/src/observe-steer-workspace-loop.mts (INTEGRATION OK)
pnpm build
pnpm verify:package
python3 /home/drew/.codex/skills/.system/skill-creator/scripts/quick_validate.py skills/loop-writer
git diff --check
git fetch origin main && git merge-tree --write-tree origin/main HEAD

Notes

This PR remains draft. The next proof should be the cloud variant: openSandboxRun worker plus remote git branch, without adding a new loop facade first.

…loop

The connective tissue that turns a one-way driver→worker pipe into a feedback loop. A worker can't see itself; observe() reads its TRACE and produces:

- findings + an operator report (what to fix, split agent vs operator), fed back DOWN as a steer and OUT to the operator
- durable corpus facts the NEXT run reads back (continuous self-improvement)

Findings are trace-derived, never judge-derived (derived_from_judge:false): the selector≠judge firewall. Harness-agnostic: reads a trace + output, so it watches opencode/codex/hermes/BYO identically. Built on agent-eval's ChatClient + AnalystFinding; persists to the existing Corpus.

bench/src/fleet.mts: the whole vision end to end, runnable from a laptop. A thin local driver fans out N workers to CLOUD sandboxes, observes each trace, reports what to fix, and banks learnings; run twice and the second run injects the first's learnings into the workers. Proven live (opencode × 2 cloud workers): the observer caught a real inefficiency (unbatched bash calls) and banked it.
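As a rough illustration of what "trace-derived, never judge-derived" can look like, here is a minimal sketch of a rule-based observer that flags unbatched bash calls. It is not the repository's observe(), which calls a model through ChatClient; `TraceStep`, `Finding`, and `observeTrace` are hypothetical names, and only the `derived_from_judge: false` field comes from the text above.

```typescript
// Hypothetical shapes for illustration; the real trace and finding types
// in agent-runtime / agent-eval may differ.
type TraceStep = { tool: string; input: string };

type Finding = {
  audience: "agent" | "operator";
  message: string;
  derived_from_judge: false; // structural firewall: no score is ever an input
};

// Derive findings from the trace alone: flag runs of consecutive bash calls
// that could have been batched into a single invocation.
function observeTrace(trace: TraceStep[], minRun = 3): Finding[] {
  const findings: Finding[] = [];
  let run = 0;
  // Iterate one past the end so a trailing run is also flushed.
  for (let i = 0; i <= trace.length; i++) {
    if (trace[i]?.tool === "bash") {
      run++;
      continue;
    }
    if (run >= minRun) {
      findings.push({
        audience: "agent",
        message: `${run} consecutive bash calls; consider batching them`,
        derived_from_judge: false,
      });
    }
    run = 0;
  }
  return findings;
}

const demoTrace: TraceStep[] = [
  { tool: "bash", input: "ls" },
  { tool: "bash", input: "cat a.py" },
  { tool: "bash", input: "cat b.py" },
  { tool: "edit", input: "a.py" },
  { tool: "bash", input: "pytest" },
];
const demoFindings = observeTrace(demoTrace);
```

The function signature takes no score or verdict, so the firewall holds by construction rather than by convention.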

The red-team flagged two FATAL design flaws: (1) parallel cloud workers share no filesystem, so accumulating loops (migration) integrate to nothing; (2) resume restores decisions, not the mutated workspace. This proves the fix, a git-backed durable workspace, on 3 dependency-ordered modules:

- durable workspace = a bare git repo (models a remote branch); each worker is a FRESH clone (a fresh box's empty FS), torn down after commit+push.
- PROVEN (a): worker b's fresh clone finds a.py ON DISK (git carried it, not a string of names); c finds b.py; the integration test imports a<-b<-c, links.
- PROVEN (b): KILL after b, RESUME → a+b skipped (durable git has them), only c re-runs against a clone that already contains a+b's committed code, links.

Verdict shift: migration moves from 'cut from the pitch' to 'buildable on a ctx.workspace handle'. The seam is git; the durable layer survives box teardown by construction. Next: a cloud variant uses a GitHub branch as the workspace.
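The kill-and-resume behavior described above can be modeled without git at all. The sketch below is an in-memory stand-in, not the proof script: a Map plays the bare repo, `clone()` plays the fresh box, and `DurableRepo`, `ModuleSpec`, and `runMigration` are hypothetical names chosen for illustration.

```typescript
// In-memory model of the pattern; a Map stands in for the bare git repo.
// All names here are hypothetical and no real git is invoked.
type Files = Map<string, string>;

class DurableRepo {
  private committed: Files = new Map();
  // A fresh clone is a copy: the worker starts with only committed state.
  clone(): Files {
    return new Map(this.committed);
  }
  // Commit + push: merge the clone's files back into the durable store.
  push(box: Files): void {
    for (const [path, body] of box) this.committed.set(path, body);
  }
  has(path: string): boolean {
    return this.committed.has(path);
  }
}

type ModuleSpec = { id: string; dependsOn?: string };

// Run modules in dependency order. Each worker gets a fresh clone that is
// discarded after push; modules already in the repo are skipped (resume).
function runMigration(repo: DurableRepo, specs: ModuleSpec[], killAfter?: string): string[] {
  const ran: string[] = [];
  for (const m of specs) {
    const path = `${m.id}.py`;
    if (repo.has(path)) continue; // durable state says: already done
    const box = repo.clone();
    if (m.dependsOn && !box.has(`${m.dependsOn}.py`)) {
      throw new Error(`${m.id}: dependency ${m.dependsOn}.py not on disk`);
    }
    box.set(path, m.dependsOn ? `import ${m.dependsOn}` : "");
    repo.push(box);
    ran.push(m.id);
    if (m.id === killAfter) break; // simulate the box being killed
  }
  return ran;
}

const specs: ModuleSpec[] = [{ id: "a" }, { id: "b", dependsOn: "a" }, { id: "c", dependsOn: "b" }];
const durable = new DurableRepo();
const firstRun = runMigration(durable, specs, "b"); // killed after b
const resumedRun = runMigration(durable, specs); // only c re-runs
```

The resume decision reads the durable store, not a journal of past decisions, which is the distinction flaw (2) turns on.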

tangletools

Approved after local release-gate verification: typecheck, tests, build, lint, package export verification, and merge-tree against origin/main. Runtime hook surface is additive; delegate harness/model support is covered by tests.

…naming + onboarding fixes

The pieces existed (Supervisor + observe + the depth/breadth strategies) but weren't wrapped as a usable suite, and the vocabulary was opaque.

runBenchmark is the packaged front door: runBenchmark({ environment, tasks, worker, strategies: ['sample','refine'], budget }) → runs each strategy, scores by the environment's own deployable check, returns the per-strategy means + the paired-bootstrap lift of refine over sample. printBenchmarkReport gives the verdict. Resilient to transient per-task infra (skip, don't crash).

Naming, made legible (public API; maps to internal depth/breadth, with zero churn to the running internals): a task domain is an `Environment` (the AgenticSurface seam under the RL/gym-standard name); the strategies are `sample` (best-of-N / resample) and `refine` (attempt → critic reads trace → steer → repeat), named by what they DO, not the search tree's shape. Juniors call runBenchmark; seniors customize the hooks (worker.analystInstruction = the critic, Environment.score = the check) or drop to runAgentic for new strategies.

Onboarding: deleted the orphaned empty examples/define-loop/ (defineLoop removed #194); fixed the dead examples/model-resolution link in docs/concepts.md.
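For readers unfamiliar with the "paired-bootstrap lift" mentioned above, here is a minimal sketch of one common way to compute it: resample the per-task score differences with replacement and read percentile bounds. This is an assumption about the general technique, not the code inside runBenchmark; `pairedBootstrapLift` and `mulberry32` are illustrative names.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

type Lift = { mean: number; ciLow: number; ciHigh: number };

// Paired bootstrap: resample per-task differences (refine - sample) with
// replacement and take percentile bounds of the resampled means.
function pairedBootstrapLift(sample: number[], refine: number[], iters = 2000, seed = 1): Lift {
  if (sample.length !== refine.length || sample.length === 0) {
    throw new Error("need equal-length, non-empty score arrays");
  }
  const diffs = refine.map((r, i) => r - (sample[i] ?? 0));
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < iters; b++) {
    let sum = 0;
    for (let k = 0; k < diffs.length; k++) {
      sum += diffs[Math.floor(rand() * diffs.length)] ?? 0;
    }
    means.push(sum / diffs.length);
  }
  means.sort((x, y) => x - y);
  return {
    mean: diffs.reduce((s, x) => s + x, 0) / diffs.length,
    ciLow: means[Math.floor(0.025 * iters)] ?? NaN,
    ciHigh: means[Math.floor(0.975 * iters)] ?? NaN,
  };
}

// Per-task pass/fail for the two strategies on the same six tasks.
const demoLift = pairedBootstrapLift([0, 0, 1, 0, 1, 0], [1, 0, 1, 1, 1, 0]);
```

Pairing matters because both strategies run on the same tasks; resampling differences removes per-task difficulty from the variance.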

* feat(bench): GEPA over the analyst/steerer prompt on the canonical stack

The analyst IS the steerer (observe()'s findings → recommended_action → the depth steer), so optimizing the analyst prompt optimizes the loop. This evolves it with agent-eval's REAL GEPA primitives (buildReflectionPrompt + parseReflectionResponse + paretoFrontier), with no hand-rolled optimizer; there is no turnkey runPromptEvolution in agent-eval 0.83, only the primitives, so the population loop is thin orchestration over them.

- observe(): + analystInstruction? override (the analyst prompt is now the GEPA knob); defaultAnalystInstruction exported. Firewall stays structural (input has no score).
- agentic.ts: AgenticOptions.analystInstruction threads into the depth steerer.
- eops-gepa.mts: FITNESS = depth-vs-breadth lift on the canonical Supervisor+observe gate; breadth computed ONCE per task (shared baseline, correct + halves cost); failing per-task lifts = the reflection gradient. Seeds = observe()'s PROVEN default (the +16.4pp instruction) FIRST, then the designer-panel population.

Smoke (N=2, 1 gen) validated the full loop: score → paretoFrontier select → reflect → mutate → re-score → pick. Bounded real run (N=6, 2 gens) in flight.

* fix(bench): GEPA harness survives gym/router infra blips (skip failed tasks)

The first real run died when the (long-lived) gym container wedged: breadth baselines returned 0% then runAgentic threw 'every rollout went down', killing the whole GEPA run. runAgentic is fail-loud; the GEPA loop now catches per-task: a task whose rollouts fail is SKIPPED (not fatal), both in the breadth precompute and the depth fitness. Fails loud only if <2 tasks survive (genuine infra-down). Pair with a fresh gym container + WIDTH<=2.
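The skip-but-fail-loud rule in the fix above can be sketched in a few lines. This is a simplified, synchronous illustration (real rollouts would be async); `scoreSurvivors` and its parameters are hypothetical names, and only the "fewer than 2 survivors is fatal" threshold comes from the commit text.

```typescript
type TaskScore = { taskId: string; score: number };

// Score each task, skipping tasks whose rollouts throw, and fail loudly
// only when too few tasks survive to say anything meaningful.
function scoreSurvivors(
  taskIds: string[],
  scoreTask: (taskId: string) => number,
  minSurvivors = 2,
): { scores: TaskScore[]; skipped: string[] } {
  const scores: TaskScore[] = [];
  const skipped: string[] = [];
  for (const taskId of taskIds) {
    try {
      scores.push({ taskId, score: scoreTask(taskId) });
    } catch {
      skipped.push(taskId); // transient infra failure: skip, do not crash
    }
  }
  if (scores.length < minSurvivors) {
    throw new Error(`only ${scores.length} task(s) survived; treating as infra-down`);
  }
  return { scores, skipped };
}

// Hypothetical scorer: task t2 hits an infra blip.
const flakyScore = (taskId: string): number => {
  if (taskId === "t2") throw new Error("every rollout went down");
  return 1;
};
const survivors = scoreSurvivors(["t1", "t2", "t3"], flakyScore);
```

Returning the skipped ids alongside the scores keeps the skips visible in the report instead of silently shrinking the sample.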
* refactor(bench): delete eops-gate.mts, the throwaway flat-loop prototype (−433 LOC)

It was a dead-end (nothing imports it): a hand-rolled flat loop that BYPASSED the canonical Supervisor + a second copy of the gym client (6 functions duplicating gym-agent.ts's 5). Fully superseded by the canonical stack: agentic.ts (domain-blind depth/breadth/Supervisor/observe, 428 LOC, written ONCE) + the AgenticSurface seam (agentic-eops.ts, 73 LOC = the entire per-domain slot-in). The +16.4pp result and the GEPA harness run on the canonical path; this prototype only de-risked the plumbing (gym standup, router-tools worker, depth-best scoring) and is now dead weight.

* feat(bench): package the optimization suite (runBenchmark) + clarify naming + onboarding fixes

The pieces existed (Supervisor + observe + the depth/breadth strategies) but weren't wrapped as a usable suite, and the vocabulary was opaque.

runBenchmark is the packaged front door: runBenchmark({ environment, tasks, worker, strategies: ['sample','refine'], budget }) → runs each strategy, scores by the environment's own deployable check, returns the per-strategy means + the paired-bootstrap lift of refine over sample. printBenchmarkReport gives the verdict. Resilient to transient per-task infra (skip, don't crash).

Naming, made legible (public API; maps to internal depth/breadth, with zero churn to the running internals): a task domain is an `Environment` (the AgenticSurface seam under the RL/gym-standard name); the strategies are `sample` (best-of-N / resample) and `refine` (attempt → critic reads trace → steer → repeat), named by what they DO, not the search tree's shape. Juniors call runBenchmark; seniors customize the hooks (worker.analystInstruction = the critic, Environment.score = the check) or drop to runAgentic for new strategies.

Onboarding: deleted the orphaned empty examples/define-loop/ (defineLoop removed #194); fixed the dead examples/model-resolution link in docs/concepts.md.
* feat(bench): make Strategy a first-class, OPEN abstraction (author your own)

The question: when we collapse to "refine", can a dev create their OWN strategy? Before: no. runAgentic took mode:'depth'|'breadth', a CLOSED enum. The capability existed (a strategy is an Agent) but the door wasn't cut.

Now: `Strategy` is an exported interface: `{ name, driver(surface, task, opts, budget) => Agent }`. A strategy builds the driver Agent the Supervisor runs; author your own by returning an Agent whose act() spawns shots/analysts via scope.spawn/next/send. `refine` and `sample` ship as instances AND the reference driver implementations (depthDriver/breadthDriver) are exported to copy. runAgentic accepts a `strategy` (mode kept for back-compat); runBenchmark takes `Strategy[]`, so pass the built-ins or your own.

What's under the words:

- sample = K independent attempts, keep the best-verifying (best-of-N / resample)
- refine = attempt → observe() reads the trace → steer the next → repeat (iterate)

A multi-agent "team" is just a Strategy whose driver spawns several different agents: same recursive Agent atom, coordinated over the Scope.

* feat(bench): defineStrategy + composable steps, author a loop in ~15 lines (skillifiable)

The original goal: loops compact enough to skillify, so agents author them. A 70-line Supervisor driver isn't that. This adds the composable LEGO:

defineStrategy(name, async ({ shot, critique, surface, budget }) => { ...compose... })

A strategy body gets two steps, shot() (one worker attempt over an artifact) and critique() (the firewalled analyst reads the trace → a steer), with ZERO Supervisor/Scope/spawn/leaf/drainOne ceremony (all of it lives inside defineStrategy now). That is the unit an agent or a skill can emit.
Proof: adaptiveRefine, a NEW strategy (refine, but ABANDON-and-restart when a steered shot fails to improve = branch-when-stuck, the widen/MCTS idea the depth-stuck failure motivated), authored entirely from the steps, scored keep-best. ~22 lines of pure strategy logic, no plumbing.

Behavior-preserving: the proven refine/sample drivers (depthDriver/breadthDriver) are UNTOUCHED, so the +16.4pp result + GEPA stay valid. The steps replicate their exact spawn/drain pattern, so a step-authored strategy behaves identically. Typecheck-verified; adaptiveRefine live-smoke pending the gym (GEPA has it).

* docs(bench): strategy-demo example, the optimization suite in 3 layers (gym-free, runnable)

The missing onboarding piece: a runnable demo of the whole suite on a toy "counter" Environment (needs only a router key; no dataset, no sandbox). Shows all three layers:

1. runBenchmark(env, …): default strategies compared, free.
2. strategies: [sample, refine, adaptiveRefine]: pick, named by behavior.
3. defineStrategy('doubleCheck', body): author your own in ~10 lines from shot()+critique(), zero Supervisor ceremony. The skillifiable unit.

Verified: runs end-to-end through the canonical Supervisor; all 4 strategies execute and score via the Environment's own check. README documents the model + the customization hooks.

* chore(examples): clearer names; drop the confusing `with-` prefix; clarify intent

Disciplined subset of the examples-naming audit (NOT the proposed 01-08 numbering / .deprecated quarantine; that's churn for throwaway examples and the README already orders them):

- with-knowledge-readiness → knowledge-gating (`with-` read as an optional toggle)
- with-intelligence-export → intelligence-export (same)
- agent-into-reviewer → pipe-into-reviewer (signals the 2-runtime piping)

KEPT runtime-run (it teaches startRuntimeRun; the name matches the product API) and agents-of-all-shapes (memorable + has a test).
git mv preserves history; README + docs/concepts + all internal self-references updated; zero stragglers.
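To make the shot()/critique() composition from the defineStrategy commit above concrete, here is a hedged sketch of an adaptiveRefine-style body. It does not use the real defineStrategy; `Shot`, `Steps`, and the synchronous step signatures are assumptions made so the example runs standalone, and the abandon-and-restart rule follows the prose description only.

```typescript
// Hypothetical step shapes; the real defineStrategy signature may differ.
type Shot = { artifact: string; score: number; trace: string[] };
type Steps = {
  shot: (steer?: string) => Shot; // one worker attempt
  critique: (trace: string[]) => string; // analyst reads the trace only, returns a steer
  budget: number; // maximum number of shots
};

// Refine while steering helps; abandon and restart unsteered when a steered
// shot fails to improve on the current line; always keep the best seen.
function adaptiveRefine({ shot, critique, budget }: Steps): Shot {
  let current = shot();
  let best = current;
  for (let used = 1; used < budget; used++) {
    const next = shot(critique(current.trace));
    if (next.score > best.score) best = next;
    if (next.score > current.score) {
      current = next; // improved: keep refining this line
    } else if (used + 1 < budget) {
      current = shot(); // stuck: branch with a fresh, unsteered attempt
      used++;
      if (current.score > best.score) best = current;
    }
  }
  return best;
}

// Scripted worker: the first line stalls at 0.2, the restart recovers.
const scriptedScores = [0.2, 0.2, 0.5, 0.9];
const steersSeen: (string | undefined)[] = [];
const demoBest = adaptiveRefine({
  shot: (steer) => {
    steersSeen.push(steer);
    const n = steersSeen.length;
    return { artifact: `v${n}`, score: scriptedScores[n - 1] ?? 0, trace: [`attempt ${n}`] };
  },
  critique: (trace) => `fix after ${trace.join(", ")}`,
  budget: 4,
});
```

Note that critique() receives only the trace, mirroring the firewall; the score is used by the strategy for keep-best selection but is never shown to the critic.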

…eal agent-eval primitives) (#205)

* feat(bench): GEPA frozen holdout: confirm the winner generalizes vs the baseline

Adds a HOLDOUT=N option: after optimizing on the search tasks, score the winning analyst instruction AND the seeded baseline (observe default) on a DISJOINT slice (offset = search-set size). Holdout breadth computed once; winner+baseline depth scored against it. Reports whether GEPA GENERALIZED (winner > baseline on held-out tasks), the frozen confirmation the discipline requires (guards against overfitting the search set). loadItsmTasks gains an offset param.
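The disjoint-slice rule for the frozen holdout (offset = search-set size) is simple enough to sketch directly. `splitSearchAndHoldout` and `generalized` are hypothetical helper names, not functions from the bench code; loadItsmTasks and its offset parameter are not modeled here.

```typescript
// The first `searchN` tasks are the search set; the holdout starts at
// offset = searchN, so the two slices can never overlap.
function splitSearchAndHoldout<T>(
  tasks: T[],
  searchN: number,
  holdoutN: number,
): { search: T[]; holdout: T[] } {
  if (searchN + holdoutN > tasks.length) {
    throw new Error("not enough tasks for a disjoint holdout");
  }
  return {
    search: tasks.slice(0, searchN),
    holdout: tasks.slice(searchN, searchN + holdoutN),
  };
}

// Generalization check: did the winner beat the baseline on held-out tasks?
function generalized(winnerScores: number[], baselineScores: number[]): boolean {
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  return mean(winnerScores) > mean(baselineScores);
}

const taskSplit = splitSearchAndHoldout(["t0", "t1", "t2", "t3", "t4", "t5"], 4, 2);
```

Refusing to run when the slices would overlap is a deliberate choice in this sketch: a holdout that silently shares tasks with the search set would defeat the purpose of the check.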

…dation (#304)

* feat(intelligence): capability-delivery manifest (composeCertifiedProfile + resolver + ladder)

Add the unified, future-proof delivery structure: one certified unit of agent power = { interface, binding }. Interfaces are CLOSED (tool / mcp / context / retrieval / hook / subagent); bindings are OPEN (inline / file / http / sandbox-code / mcp-stdio / mcp-remote / process-on-infra / rag-index / memory-store / wasm / a2a). A single resolver lowers any binding into one uniform ResolvedSurface consumed identically by the host seam (RouterToolsSeam tools + executeToolCall) and the sandbox seam (AgentProfile).

- src/intelligence/capability.ts: the manifest types + CapabilityNotAdmittedError + manifestFromProfile (lowers today's CertifiedProfile wire into capabilities[] with best-effort binding inference, so the spine delivers value before the plane changes).
- src/intelligence/resolver.ts: composeCertifiedProfile. The spine resolves inline/file (byte-identical to composeCertifiedPrompt, the regression lock), mcp-stdio/mcp-remote (strict union widens to the SDK's flat AgentProfileMcpServer, an always-valid lowering), and http tools (the host seam). Ladder rungs that need infra (sandbox-code, process-on-infra) are injected ResolveCtx providers; rag-index/memory-store/wasm/a2a throw CapabilityNotAdmittedError (memory gated on the E3 admission bar). Fail-closed: null manifest -> base surface, per-capability failure -> drop (diagnostic via onDrop), post-resolve drift drops any tool/mcp whose live names diverge.
- src/mcp/delegation-profile.ts: composeProductionAgentProfile now also merges tools box-flags, hooks, subagents, and injects ResolvedSurface.mcpConnections into AgentProfile.mcp (the sandbox-seam mapping).
- exports + export gate + the two spec corrections (mcp lowers via always-valid widening; tools lower two ways since AgentProfile.tools is box flags).
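The closed-interface / open-binding split and the fail-closed resolver described above can be sketched with a discriminated union. The types below are deliberately reduced and hypothetical (three bindings instead of eleven, a two-field surface); they are not the shapes in src/intelligence/capability.ts, and `composeSurface` is not composeCertifiedProfile.

```typescript
// Hypothetical, simplified shapes; the real manifest types will differ.
type CapInterface = "tool" | "mcp" | "context" | "retrieval" | "hook" | "subagent"; // closed set
type Binding =
  | { kind: "inline"; text: string }
  | { kind: "http"; url: string }
  | { kind: "rag-index"; index: string }; // open set: more kinds can be added

type Capability = { id: string; interface: CapInterface; binding: Binding };
type ResolvedSurface = { context: string[]; tools: { id: string; url: string }[] };

class NotAdmittedError extends Error {}

// Lower one capability into the uniform surface, or throw if not admitted.
function resolveOne(cap: Capability, surface: ResolvedSurface): void {
  const binding = cap.binding;
  switch (binding.kind) {
    case "inline":
      surface.context.push(binding.text);
      return;
    case "http":
      surface.tools.push({ id: cap.id, url: binding.url });
      return;
    default:
      throw new NotAdmittedError(`${cap.id}: binding ${binding.kind} not admitted`);
  }
}

// Fail-closed: a null manifest yields the base surface; a capability that
// fails to resolve is dropped and reported through onDrop.
function composeSurface(
  manifest: Capability[] | null,
  onDrop: (id: string, reason: string) => void,
): ResolvedSurface {
  const surface: ResolvedSurface = { context: [], tools: [] };
  for (const cap of manifest ?? []) {
    try {
      resolveOne(cap, surface);
    } catch (err) {
      onDrop(cap.id, err instanceof Error ? err.message : String(err));
    }
  }
  return surface;
}

const droppedIds: string[] = [];
const demoSurface = composeSurface(
  [
    { id: "style", interface: "context", binding: { kind: "inline", text: "Be terse." } },
    { id: "search", interface: "tool", binding: { kind: "http", url: "https://example.invalid/search" } },
    { id: "kb", interface: "retrieval", binding: { kind: "rag-index", index: "docs" } },
  ],
  (id) => droppedIds.push(id),
);
```

Dropping a single failed capability rather than aborting keeps one bad binding from taking down the whole profile, while onDrop keeps the loss visible.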
* docs(rsi): correct depth>breadth to the POWER-16 tie at n=48 (not the n=16 +16.4pp)

The +16.4pp CI[+5.3,+29.8] n=16 depth-steered-continuation result did not replicate when powered: depth-breadth = +4.7pp CI[-1.9,+11.4] at n=48 (a tie; +4.1pp at n=72). architecture.md and roadmap-rsi.md advertised it as a cleared keystone; they now carry the retraction and point at .evolve/current.json.

* chore(clean): remove dead mock loop + orphan re-exports/interface (432 LOC)

- delete bench/src/observe-steer-workspace-loop.mts (the #194 mock anti-pattern; 0 inbound refs)
- drop orphan pass-through re-exports CaptureIntegrityError/ReplayError/VerificationError (src/errors.ts)
- drop orphan interface AgentTaskRunSummary (src/types.ts)
- fix doc-rot in loop-facade-postmortem.md; gitignore stray test_repo/
- deletion-ledger.md tracks deletions + the deferred migrations (driver.ts 12 callers, AgentProfile superset)

Gates verified by hand: typecheck 0, lint 0, 924 tests pass / 0 fail. Load-bearing fail-loud fences left intact (NOT dead code).

* docs(research): atom-compression plan, harness-compat matrix, long-horizon map

* chore(deps): bump @types/node 25.9.3 + playwright 1.61.0 (dev)

Safe minor/patch dev bumps, gates verified (typecheck 0, lint 0, 924 tests pass). Deferred (need own careful pass): biome 2.5 (13 new lint warnings), typescript 6 + vitest 4 (majors), agent-eval 0.92 (substrate; sync with the AgentProfile superset work).

* docs(research): RSI atom masterplan + build tracker (single source of truth)

* docs(research): collapse N driver prompts → one cached generator (software 3.0)

Replace the per-role hand-coded prompt builders with generateDriverSystemPrompt(spec): a (fused) router call generates the driver prompt from {role,goal,target,harness,stance}, cached for semantic reuse via PromptRegistry + hashContent key (file/JSON or DB). The hand-authored worker-driver prompt becomes the generator's seed + its tests the invariants.
Single optimizable surface; depends on a tangle-router fusion primitive (separate issue).

* docs(research): active push, RUN/DELETE/IMPROVE worklist (delete createDriver, run commit0, deep-clean, dedup)

* docs(research): createDriver delete BLOCKED (paradigm diff, evidenced); commit0 RAN; the delete fork

* refactor(runtime): full nuke of the createDriver/string-prompt measurement+eval paradigm

DELETE the wrong abstraction (createDriver = a code TopologyPlanner driving runLoop over string-prompt→string-answer calls, judged by adapter.judge) and the entire old bench experiment + eval-gen apparatus built on it. The agent-driver (AgentProfile driving AgentProfile via coordination tools) replaces it; the runLoop KERNEL and the Scope/Supervisor are untouched.

Deleted (15): src/runtime/driver.ts; bench experiment.ts(+test)/steering-experiment(+test)/improve-prompt/research-loop/finsearch-loop/rsi/generate-eval/run-benchmarks/run.ts/skills-sandbox/profile-coord-sandbox; tests/loops/dynamic.test.ts.

Survivors (search-bench/cloud-loop/fleet/commit0-gate) re-homed onto a new pure helper bench/src/sandbox-run.ts (answerOutput/sandboxAgentRun/WorkerBackendType/AnalystFn/llmAnalyst, with no experiment shell). runLoop kernel tests kept via a scriptedDriver stub in refine-driver.ts.

Gates (hand-verified): build 0, typecheck 0 (root+bench; also fixed a pre-existing bench BackendType red), lint 0, 905 tests pass. Zero dangling code refs.

ACCEPTED casualties of the full nuke (rebuild on the agent-driver/Supervisor path when wanted): the generate-eval data engine, the AgentProfile-coordinate optimizer (profile-coord), and run.ts's non-experiment subcommands (preflight/verify-judge/solve-one/ui-review).
* docs(research): full nuke DONE (-3492 LOC); doc/skill-rot follow-up tracked

* docs(cleanup): retarget all docs+skills off the nuked createDriver/runExperiment to the agent-driver/Supervisor reality

* feat(supervise): recursive driver-executor, agents driving agents driving agents

A spawned child can now BE a driver. driverExecutorFactory mounts a NESTED Scope over the SAME conserved budget pool + shared journal (scope.ts's new NestedScopeSeam), one depth deeper, and runs the wrapped driver's act there. A child resolves to a LEAF (worker) OR, for a role:'driver' spec via withDriverExecutor, this executor, recursively. So a driver spawns a driver spawns a worker on one budget-conserving tree. The persona/strategy spawn fences now route a driver child to the recursive executor (compose) instead of throwing; act still fails loud only if a child is run directly.

Reuses the atom; builds NO new budget/journal/selection logic. Budget conserved across depth (reserve-on-spawn fails closed at any depth), spend bubbles to root, journal records each nested tree, maxDepth enforced across recursion.

Proven OFFLINE (no creds; scripted drivers+workers) in tests/loops/driver-recursion.test.ts: depth-2 chain root->mid->inner->worker (node id rec:s0:s0:s0, which a non-recursive build cannot produce), fail-closed budget conservation across depth, spend roll-up (spentTotal = the worker's exact spend), nested-journal sub-trees, depth-ceiling across recursion. Gates hand-verified: build 0, typecheck 0, lint 0, 911 tests pass.

* refactor(bench)+docs: reclaim runKeystoneGate -> runGate; strip 4 docs to latest-only

Rename the opaque 'keystone' jargon: runKeystoneGate->runGate (+ RunGateOptions/GateArmResult/GateReport), bench/src/keystone-gate.ts->gate.ts (+ -cli, +test), all import paths and CLI banners.
Strip the last historical createDriver/runExperiment 'was removed/nuked' breadcrumbs from architecture.md, architecture-interpretations.md, learning-flywheel.md, roadmap-rsi.md; upgrading agents now see only the current agent-driver/Supervisor reality (history lives in git). Gates green; 905->911 with the keystone test.

* docs(research): keystone recursion ✅ (9d188e1); createDriver retire ✅ via nuke; #2b brain next

* feat(supervise): coordinationDriverAgent, the cheap/offline driver (LLM tool-loop over the coordination verbs)

The CHEAP, in-process, no-creds variant of the recursive driver: act() mounts createCoordinationTools over its scope and runs an LLM tool-loop (injected chat seam) so the driver REASONS spawn/steer/await/stop; composes with 2a recursion (a driver agent spawns a driver agent, via makeWorkerAgent -> driverChild). NOT the primary driver: the CAPABLE driver is a sandbox agent with the coordination verbs as an MCP. This one is the offline-testable + cheap-orchestration path. Prompt is INJECTED (decoupled from agent-eval).

Proven OFFLINE (no creds, scripted mock chat) in tests/loops/coordination-driver.test.ts: the tool-loop drives real Scope.spawn via the coordination verbs + folds results back; a driver AGENT spawns a driver AGENT (separate nested journal tree). typecheck 0, lint 0.

* docs(research): dual-purpose resolution (one substrate serves product + proof); #2b cheap driver done, #2c capable sandbox driver + #3 completion-oracle next

* feat(supervise): completion-oracle, settled ⟺ delivered (Foreman 0/18)

The honest settle: a node counts as delivered only when a deployable check passes, never on self-report.

- completion-gate.ts: gateOnDeliverable wraps any Executor so its settlement valid reflects a DeliverableSpec check (both execute shapes; fail-closed).
- coordination-driver finalize: returns the best DELIVERED child; undefined when none delivered. A driver cannot self-declare done via prose.
- driver-executor: derive the driver child's verdict from its direct settled events, so delivery composes UP the recursion (a sub-driver is valid only when it itself selected a delivered child).
- supervisor: a winner MUST carry a real Out; a successful act that produced nothing is a no-winner, never a winner wrapping undefined.

8 offline tests: leaf gate (both execute shapes, fail-closed), ran-but-didn't-deliver yields no winner, the gate dominates score, delivery propagates up the recursion.

* docs(research): completion-oracle #3 ✅ (bd58761): settled ⟺ delivered, composes up the recursion

* feat(bench): atom-humaneval, agents-driving-agents on a live deployable-checked domain

A coordinationDriverAgent (real router brain) drives gated workers on HumanEval: each worker is settled valid ONLY when the local Docker test suite passes (completion-oracle, not self-report), against a blind best-of-K baseline. Proven live: the driver spawns, the worker solves, the checker gates, the supervisor returns a winner only on real delivery. Also exports gateOnDeliverable/DeliverableSpec from the runtime barrel (the #3 primitive was added to supervise/ but not surfaced on the package).

* feat(topology): animated visual replay of a recursive agent run

Fold the one runtime-hooks stream into a timestamped ReplayEvent[] (createReplayRecorder) and render a self-contained, scrubbable HTML player (renderReplayHtml): the recursive agent tree animated over wall-clock, each node colored by the completion-oracle (delivered (valid) green, ran-but-not-delivered amber, failed red), with live token/cost counters. Synthesizes the unspawned root driver so the whole recursion renders. No server/build/deps. Wired into atom-humaneval (every driver run emits a replay.html). 4 offline recorder tests; proven on a live HumanEval run (driver -> worker -> delivered).
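The "settled ⟺ delivered" rule from the completion-oracle commits above can be illustrated with a small wrapper. This is a sketch under assumed shapes, not the exported gateOnDeliverable: `Settlement`, `ExecutorFn`, `gateOnCheck`, and `pickDelivered` are hypothetical, and the executor is synchronous for brevity.

```typescript
// Hypothetical shapes; the real Executor and DeliverableSpec types may differ.
type Settlement<Out> = { out?: Out; valid: boolean };
type ExecutorFn<Out> = (task: string) => Settlement<Out>;
type DeliverableCheck<Out> = (out: Out) => boolean;

// Wrap an executor so `valid` reflects the deployable check, never the
// executor's own self-report. Any failure resolves to not delivered.
function gateOnCheck<Out>(exec: ExecutorFn<Out>, check: DeliverableCheck<Out>): ExecutorFn<Out> {
  return (task) => {
    try {
      const settled = exec(task);
      if (settled.out === undefined) return { valid: false };
      return { out: settled.out, valid: check(settled.out) };
    } catch {
      return { valid: false }; // fail-closed
    }
  };
}

// Pick a winner only among delivered children; undefined when none delivered.
function pickDelivered<Out>(children: Settlement<Out>[]): Out | undefined {
  return children.find((c) => c.valid && c.out !== undefined)?.out;
}

// A worker that claims success; the check decides whether it delivered.
const boastfulWorker: ExecutorFn<string> = (task) => ({ out: `draft for ${task}`, valid: true });
const passesSuite: DeliverableCheck<string> = (out) => out.includes("tests pass");

const gatedWorker = gateOnCheck(boastfulWorker, passesSuite);
const gatedResult = gatedWorker("fizzbuzz");
const demoWinner = pickDelivered([gatedResult]);
```

The wrapper discards the inner executor's `valid` entirely, so a self-reported success has no path to becoming a winner.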
* fix(bench): atom-humaneval blind arm survives transient router errors (a 502 is a failed attempt, not a crash; matches the driver arm's down-typing)

* feat(supervise): the supervisor AUTHORS worker profiles from a skill (the intelligence, not the plumbing)

The supervisor's job is to DESIGN the agents it spawns: read the task, decompose it, and author a tailored profile (instructions + model) per worker. supervisorSkill is the how-to it reads (its own system prompt), THE optimizable self-improvement surface; authoredWorker builds a worker from an authored profile; asAuthoredProfile catches empty/placeholder profiles (a skill violation). Proven offline (no creds, no plumbing): a skill-guided supervisor authors DISTINCT, tailored worker recipes per sub-task and they flow to the workers. 3 tests.

* feat(supervise): coordination MCP over a live Scope - the real keystone for in-box driving

serveCoordinationMcp fronts a live Scope with an HTTP JSON-RPC MCP server: an in-box coding harness (opencode via cli-bridge) mounts mcp.mcpServers.coordination and calls spawn_worker as a native tool, landing on Scope.spawn; a real box driving real boxes, not emulated function-tools. Real test: HTTP tools/call spawn_worker -> Scope.spawn -> worker settles -> winner (no mock of the MCP path). Plus the standard supervise SKILL.md.

* feat(bench): prove a coding harness drives the Scope via the coordination MCP (live)

opencode (glm-5-turbo via cli-bridge) mounts mcp.mcpServers.coordination (type:http → opencode remote) and calls spawn_worker itself → real Scope.spawn → worker settles, and reads back the await_next result. The in-box driving path is REAL: a coding agent drives recursion as a native tool, not emulated. (Bridge wants mcp type:'http', not 'remote'.)
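To make the `tools/call spawn_worker -> Scope.spawn` path above concrete, here is a minimal in-memory sketch of the JSON-RPC routing, with no HTTP layer. The `tools/call` method and its `name`/`arguments` params follow the MCP convention; everything else is assumed for illustration: `ScopeLike`, `SpawnArgs`, `spawnWorkerCall`, and `handleCallSketch` are hypothetical, and the real `serveCoordinationMcp` server and the real `spawn_worker` argument schema may look different.

```typescript
type JsonRpcRequest = { jsonrpc: "2.0"; id: number; method: string; params?: unknown };
type JsonRpcResponse =
  | { jsonrpc: "2.0"; id: number; result: { content: { type: "text"; text: string }[] } }
  | { jsonrpc: "2.0"; id: number; error: { code: number; message: string } };

// Hypothetical argument and scope shapes; the real Scope.spawn signature is not shown in the source.
interface SpawnArgs { task: string }
interface ScopeLike { spawn(args: SpawnArgs): string } // returns a worker id in this sketch

// Client side: what a harness would send to call spawn_worker as a tool.
function spawnWorkerCall(id: number, args: SpawnArgs): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "spawn_worker", arguments: args },
  };
}

// Server side: route the tool call onto the live scope.
function handleCallSketch(scope: ScopeLike, req: JsonRpcRequest): JsonRpcResponse {
  if (req.method !== "tools/call") {
    // -32601 is the standard JSON-RPC "method not found" code.
    return { jsonrpc: "2.0", id: req.id, error: { code: -32601, message: "method not found" } };
  }
  const params = req.params as { name?: string; arguments?: SpawnArgs } | undefined;
  const args = params?.name === "spawn_worker" ? params.arguments : undefined;
  if (args === undefined) {
    // -32602 is the standard JSON-RPC "invalid params" code.
    return { jsonrpc: "2.0", id: req.id, error: { code: -32602, message: "unknown tool or missing arguments" } };
  }
  const workerId = scope.spawn(args);
  return { jsonrpc: "2.0", id: req.id, result: { content: [{ type: "text", text: workerId }] } };
}

// Usage with a fake scope that records what was spawned.
const spawned: SpawnArgs[] = [];
const fakeScope: ScopeLike = {
  spawn(args) {
    spawned.push(args);
    return `worker-${spawned.length}`;
  },
};
const response = handleCallSketch(fakeScope, spawnWorkerCall(1, { task: "fix the failing test" }));
console.log(JSON.stringify(response));
```

In the real setup the request would arrive over HTTP from the harness's MCP client; the sketch only shows that the tool call is a thin binding onto the same Scope verb.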
* feat(bench): WHOLE real e2e - opencode supervisor drives opencode workers via the coordination MCP, real test gates delivery

Live, no mock: the opencode supervisor (glm-5-turbo via cli-bridge) mounts the coordination MCP, authors worker profiles, calls spawn_worker -> real Scope.spawn -> real opencode workers code in a cwd -> python3 test gates valid -> supervisor settles on the delivered worker -> winner. The completion-oracle (deployable check, not LLM judge) decided delivery over the supervisor's confusion that it couldn't see the workers' isolated cwds (→ shared Workspace next). Proof artifact for the in-box-driving path; the law-compliant productionization is a substrate backend (tmux/bridge/sandbox) that runs authored profiles, not this harness-specific script.

* docs(canonical-api): the AgentProfile law - author the profile, the substrate materializes it

§1.5 + decision-table rows + CLAUDE.md §0 pointer. The thing we keep forgetting: an agent IS its full AgentProfile (prompt+skills+tools/mcp+subagents+hooks+permissions+model), not a prompt; change behavior by AUTHORING the profile and letting the sandbox substrate materialize it into harness shapes. Never write a verify-loop or harness-specific config (self-verification is a hook/process; opencode is only the cli-bridge test target; a missing lever is a substrate gap).

* docs(research): consolidate docs/research 28→14 - retire shipped/subsumed design docs

Retired 14 design-research docs whose content is now shipped code, in .evolve/current.json, or self-declared subsumed/retracted (the recursion atom shipped; the optimization-space layer evidence landed; verdicts reached). Refreshed the research index, recorded the retirement + rationale in deletion-ledger.md (Pass 2), and fixed every inbound link (top index, the harvest-corpus.ts comment → current.json, optimization-space's suite links). Kept the SSOT masterplan, the canonical-referenced maps (optimization-space/leapfrog), the two gated belief specs, the postmortem guardrail, the build-lists, and the agent-lab tombstones. No broken links into the 14 remain from any canonical doc or src/.

* docs(canonical-api): substrate pin 0.89 → 0.92 (matches the merged package.json)
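The AgentProfile law in the commits above (an agent is its full profile, and empty or placeholder profiles are a skill violation) could be sketched roughly as follows. This is an illustration under assumptions, not the runtime's schema: `AgentProfileSketch`, its field names, `profileProblems`, and the placeholder pattern are hypothetical, and the real `AgentProfile` type and `asAuthoredProfile` check may differ. Only the field list (prompt, skills, tools/mcp, subagents, hooks, permissions, model) and the `mcp.mcpServers.coordination` / `type: "http"` detail come from the text; the URL is made up.

```typescript
// Hypothetical profile shape mirroring the field list in the commit message.
interface AgentProfileSketch {
  prompt: string;
  model: string;
  skills: string[];
  tools: string[];
  mcp: { mcpServers: Record<string, { type: "http"; url: string }> };
  subagents: string[];
  hooks: string[];
  permissions: string[];
}

// Strings that look like an unfilled template rather than authored instructions (assumed pattern).
const PLACEHOLDER = /^(todo|tbd|placeholder|\.{3}|<[^>]*>)$/i;

// Return a list of problems; an empty list means the profile looks authored.
function profileProblems(profile: AgentProfileSketch): string[] {
  const problems: string[] = [];
  const prompt = profile.prompt.trim();
  if (prompt.length === 0) problems.push("prompt is empty");
  else if (PLACEHOLDER.test(prompt)) problems.push("prompt is a placeholder");
  if (profile.model.trim().length === 0) problems.push("model is empty");
  return problems;
}

// Usage: a supervisor-authored worker profile that mounts the coordination MCP.
const workerProfile: AgentProfileSketch = {
  prompt: "Implement the parser for sub-task 2 and run its unit tests before settling.",
  model: "glm-5-turbo",
  skills: [],
  tools: ["shell"],
  mcp: { mcpServers: { coordination: { type: "http", url: "http://localhost:0/mcp" } } },
  subagents: [],
  hooks: [],
  permissions: [],
};
console.log(profileProblems(workerProfile)); // []
console.log(profileProblems({ ...workerProfile, prompt: "TODO" })); // [ 'prompt is a placeholder' ]
```

Returning a list of problems rather than throwing is a choice made for the sketch; it lets a supervisor report every violation in one pass.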

drewstone added 5 commits June 7, 2026 16:55

feat(loops): add defineLoop authoring surface

41baa71

refactor(loops): collapse defineLoop to thin facade

aef9ec7

refactor(loops): remove defineLoop facade

ab2823d

drewstone changed the title ~~feat(loops): add defineLoop authoring surface~~ feat(loops): add observe and substrate loop proofs Jun 8, 2026

drewstone added 6 commits June 8, 2026 04:13

refactor(loops): tighten substrate coordination surface

7a2d64c

feat(loops): prove observe steer workspace join

7ef57a5

docs(process): split building rules from agent bootloaders

4560384

feat(runtime): expose sandbox box hooks for delegates

def4073

chore(release): 0.47.0

e5463bd

merge origin/main into feat/observe-closed-loop

37e373f

drewstone marked this pull request as ready for review June 8, 2026 12:42

tangletools approved these changes Jun 8, 2026

drewstone merged commit 9c371b8 into main Jun 8, 2026
1 check passed

drewstone mentioned this pull request Jun 8, 2026

feat(bench): live observe→steer join (real worker + real observer) #195

Merged

drewstone merged 11 commits into main from feat/observe-closed-loop


2 participants
