feat(loops): add observe and substrate loop proofs#194
Merged
…loop
The connective tissue that turns a one-way driver→worker pipe into a feedback loop. A worker can't see itself; observe() reads its TRACE and produces:
- findings + an operator report (what to fix, split agent vs operator), fed back DOWN as a steer and OUT to the operator
- durable corpus facts the NEXT run reads back (continuous self-improvement)
Findings are trace-derived, never judge-derived (derived_from_judge:false): the selector≠judge firewall. Harness-agnostic: reads a trace + output, so it watches opencode/codex/hermes/BYO identically. Built on agent-eval's ChatClient + AnalystFinding; persists to the existing Corpus.
bench/src/fleet.mts: the whole vision end to end, runnable from a laptop. A thin local driver fans out N workers to CLOUD sandboxes, observes each trace, reports what to fix, banks learnings; run twice and the second run injects the first's learnings into the workers. Proven live (opencode × 2 cloud workers): the observer caught a real inefficiency (unbatched bash calls) and banked it.
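A minimal sketch of what a trace-derived finding could look like. The real observe() uses an LLM analyst via agent-eval's ChatClient; the rule-based detector here is a swapped-in stand-in, and `TraceEvent`, `Finding`, and `findUnbatchedBash` are hypothetical names, not the package's API. The point it illustrates is the firewall: the function's only input is the trace, so no score can leak into a finding.

```typescript
// Hypothetical shapes for illustration only; the real observe() types may differ.
interface TraceEvent {
  tool: string; // e.g. "bash", "edit"
  input: string; // the raw tool input
}

interface Finding {
  audience: "agent" | "operator"; // who should act on the finding
  summary: string;
  recommended_action: string;
  derived_from_judge: false; // literal type: a finding cannot claim judge provenance
}

// Rule-based stand-in for the LLM analyst. It flags runs of consecutive bash
// calls that could have been batched. Its only input is the trace: no score.
function findUnbatchedBash(trace: TraceEvent[], minRun = 3): Finding[] {
  const findings: Finding[] = [];
  let run = 0;
  // Iterate one past the end so a run that reaches the last event is flushed.
  for (let i = 0; i <= trace.length; i++) {
    const isBash = i < trace.length && trace[i]?.tool === "bash";
    if (isBash) {
      run += 1;
      continue;
    }
    if (run >= minRun) {
      findings.push({
        audience: "agent",
        summary: `${run} consecutive bash calls were issued one at a time`,
        recommended_action: "Batch independent shell commands into a single call.",
        derived_from_judge: false,
      });
    }
    run = 0;
  }
  return findings;
}
```

Typing `derived_from_judge` as the literal `false` makes the provenance claim structural rather than a convention the analyst has to remember.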
The red-team flagged two FATAL design flaws: (1) parallel cloud workers share no filesystem, so accumulating loops (migration) integrate to nothing; (2) resume restores decisions, not the mutated workspace. This proves the fix, a git-backed durable workspace, on 3 dependency-ordered modules:
- durable workspace = a bare git repo (models a remote branch); each worker is a FRESH clone (a fresh box's empty FS), torn down after commit+push.
- PROVEN (a): worker b's fresh clone finds a.py ON DISK (git carried it, not a string of names); c finds b.py; the integration test imports a<-b<-c, links.
- PROVEN (b): KILL after b, RESUME → a+b skipped (durable git has them), only c re-runs against a clone that already contains a+b's committed code, links.
Verdict shift: migration moves from 'cut from the pitch' to 'buildable on a ctx.workspace handle'. The seam is git; the durable layer survives box teardown by construction. Next: a cloud variant uses a GitHub branch as the workspace.
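The kill-and-resume behavior described above can be modeled without git at all. This is an in-memory sketch under stated assumptions: a `Map` stands in for the bare repo, copying it stands in for a fresh clone, and `ModuleSpec`, `freshClone`, and `runPipeline` are hypothetical names rather than the proof's actual code.

```typescript
// In-memory model of the proof; a Map stands in for the bare git repo.
type Files = Map<string, string>;

interface ModuleSpec {
  file: string; // e.g. "b.py"
  dependsOn?: string; // file that must already be on disk in the clone
}

// A fresh clone is a copy: the worker never shares a filesystem with the store.
function freshClone(durable: Files): Files {
  return new Map(durable);
}

// Runs modules in dependency order; returns the files built in THIS invocation.
function runPipeline(durable: Files, modules: ModuleSpec[], killAfter?: string): string[] {
  const built: string[] = [];
  for (const mod of modules) {
    if (durable.has(mod.file)) continue; // resume: the durable store already has it
    const clone = freshClone(durable);
    if (mod.dependsOn !== undefined && !clone.has(mod.dependsOn)) {
      throw new Error(`${mod.file}: dependency ${mod.dependsOn} missing from clone`);
    }
    clone.set(mod.file, `# generated ${mod.file}`); // the worker's edit in its clone
    durable.set(mod.file, clone.get(mod.file) ?? ""); // models commit + push
    built.push(mod.file);
    if (mod.file === killAfter) break; // simulate the run dying after this push
    // The clone goes out of scope here, which models box teardown.
  }
  return built;
}
```

Because the skip decision reads the durable store rather than a remembered list of names, a resumed run sees the committed files themselves, which is the property the second PROVEN bullet is about.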
tangletools
approved these changes
Jun 8, 2026
tangletools
left a comment
Contributor
Approved after local release-gate verification: typecheck, tests, build, lint, package export verification, and merge-tree against origin/main. Runtime hook surface is additive; delegate harness/model support is covered by tests.
drewstone
added a commit
that referenced
this pull request
Jun 9, 2026
…naming + onboarding fixes
The pieces existed (Supervisor + observe + the depth/breadth strategies) but weren't
wrapped as a usable suite, and the vocabulary was opaque. runBenchmark is the packaged
front door:
runBenchmark({ environment, tasks, worker, strategies: ['sample','refine'], budget })
→ runs each strategy, scores by the environment's own deployable check, returns the
per-strategy means + the paired-bootstrap lift of refine over sample. printBenchmarkReport
gives the verdict. Resilient to transient per-task infra (skip, don't crash).
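For readers unfamiliar with the paired-bootstrap lift mentioned above, here is a minimal sketch of the general technique, assuming per-task scores for the two strategies are available as parallel arrays. `pairedBootstrapLift` and its return shape are hypothetical and are not claimed to match what runBenchmark computes internally.

```typescript
// Small seeded PRNG so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface Lift {
  meanLift: number; // mean of (refine - sample) over tasks
  lo: number; // 2.5th percentile of the bootstrap means
  hi: number; // 97.5th percentile of the bootstrap means
}

// Paired: both arrays are indexed by the same task, so we resample task-level
// differences, not the two arms independently.
function pairedBootstrapLift(sample: number[], refine: number[], iterations = 2000, seed = 1): Lift {
  if (sample.length !== refine.length || sample.length === 0) {
    throw new Error("need equal-length, non-empty per-task scores");
  }
  const diffs = sample.map((s, i) => (refine[i] ?? 0) - s);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < iterations; b++) {
    let total = 0;
    for (let j = 0; j < diffs.length; j++) {
      total += diffs[Math.floor(rand() * diffs.length)] ?? 0;
    }
    means.push(total / diffs.length);
  }
  means.sort((x, y) => x - y);
  const at = (q: number): number =>
    means[Math.min(means.length - 1, Math.floor(q * means.length))] ?? 0;
  const meanLift = diffs.reduce((sum, d) => sum + d, 0) / diffs.length;
  return { meanLift, lo: at(0.025), hi: at(0.975) };
}
```

A lift whose interval straddles zero reads as a tie, which is why reporting the interval alongside the mean matters for small task counts.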
Naming, made legible (public API; maps to internal depth/breadth — zero churn to the
running internals): a task domain is an `Environment` (the AgenticSurface seam under the
RL/gym-standard name); the strategies are `sample` (best-of-N / resample) and `refine`
(attempt → critic reads trace → steer → repeat), named by what they DO, not the search
tree's shape. Juniors call runBenchmark; seniors customize the hooks (worker.analystInstruction
= the critic, Environment.score = the check) or drop to runAgentic for new strategies.
Onboarding: deleted the orphaned empty examples/define-loop/ (defineLoop removed #194);
fixed the dead examples/model-resolution link in docs/concepts.md.
drewstone
added a commit
that referenced
this pull request
Jun 9, 2026
* feat(bench): GEPA over the analyst/steerer prompt on the canonical stack
The analyst IS the steerer (observe()'s findings → recommended_action → the depth
steer), so optimizing the analyst prompt optimizes the loop. This evolves it with
agent-eval's REAL GEPA primitives (buildReflectionPrompt + parseReflectionResponse
+ paretoFrontier) — no hand-rolled optimizer; there is no turnkey runPromptEvolution
in agent-eval 0.83, only the primitives, so the population loop is thin orchestration
over them.
- observe(): + analystInstruction? override (the analyst prompt is now the GEPA knob);
defaultAnalystInstruction exported. Firewall stays structural (input has no score).
- agentic.ts: AgenticOptions.analystInstruction threads into the depth steerer.
- eops-gepa.mts: FITNESS = depth-vs-breadth lift on the canonical Supervisor+observe
gate; breadth computed ONCE per task (shared baseline, correct + halves cost);
failing per-task lifts = the reflection gradient. Seeds = observe()'s PROVEN default
(the +16.4pp instruction) FIRST, then the designer-panel population.
Smoke (N=2, 1 gen) validated the full loop: score → paretoFrontier select → reflect
→ mutate → re-score → pick. Bounded real run (N=6, 2 gens) in flight.
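The select step in the loop above relies on a Pareto frontier over per-task scores. The sketch below shows the general idea only; `Candidate`, `dominates`, and `paretoFront` are hypothetical names, and nothing here is claimed to match the signature of agent-eval's paretoFrontier.

```typescript
interface Candidate {
  label: string; // e.g. an analyst instruction id
  perTask: number[]; // one score per task, same task order for every candidate
}

// a dominates b when a is at least as good on every task and strictly better on one.
function dominates(a: number[], b: number[]): boolean {
  let strictlyBetter = false;
  for (let i = 0; i < a.length; i++) {
    const ai = a[i] ?? 0;
    const bi = b[i] ?? 0;
    if (ai < bi) return false;
    if (ai > bi) strictlyBetter = true;
  }
  return strictlyBetter;
}

// Keep every candidate that no other candidate dominates.
function paretoFront(candidates: Candidate[]): Candidate[] {
  return candidates.filter(
    (c) => !candidates.some((other) => other !== c && dominates(other.perTask, c.perTask)),
  );
}
```

Selecting from the front rather than by mean score keeps prompts that win on different tasks alive for the reflect and mutate steps.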
* fix(bench): GEPA harness survives gym/router infra blips (skip failed tasks)
The first real run died when the (long-lived) gym container wedged: breadth
baselines returned 0% then runAgentic threw 'every rollout went down', killing the
whole GEPA run. runAgentic is fail-loud; the GEPA loop now catches per-task: a task
whose rollouts fail is SKIPPED (not fatal), both in the breadth precompute and the
depth fitness. Fails loud only if <2 tasks survive (genuine infra-down). Pair with a
fresh gym container + WIDTH<=2.
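The skip-but-fail-loud policy described in this fix can be sketched as below. This is a synchronous simplification (real rollouts would be async), and `scoreSurvivors` and `SurvivorReport` are hypothetical names, not the harness's code.

```typescript
interface SurvivorReport<T> {
  scores: Array<{ task: T; score: number }>;
  skipped: Array<{ task: T; reason: string }>;
}

// Catch failures per task so one wedged rollout does not kill the run, but
// still fail loud when too few tasks survive to say anything meaningful.
function scoreSurvivors<T>(tasks: T[], rollout: (task: T) => number, minSurvivors = 2): SurvivorReport<T> {
  const report: SurvivorReport<T> = { scores: [], skipped: [] };
  for (const task of tasks) {
    try {
      report.scores.push({ task, score: rollout(task) });
    } catch (err) {
      report.skipped.push({ task, reason: err instanceof Error ? err.message : String(err) });
    }
  }
  if (report.scores.length < minSurvivors) {
    throw new Error(
      `only ${report.scores.length} of ${tasks.length} tasks survived; treating as infra-down`,
    );
  }
  return report;
}
```

Recording the skip reason keeps a transient failure visible in the report instead of silently shrinking the task set.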
* refactor(bench): delete eops-gate.mts — the throwaway flat-loop prototype (−433 LOC)
It was a dead-end (nothing imports it): a hand-rolled flat loop that BYPASSED the
canonical Supervisor + a second copy of the gym client (6 functions duplicating
gym-agent.ts's 5). Fully superseded by the canonical stack — agentic.ts (domain-blind
depth/breadth/Supervisor/observe, 428 LOC, written ONCE) + the AgenticSurface seam
(agentic-eops.ts, 73 LOC = the entire per-domain slot-in). The +16.4pp result and the
GEPA harness run on the canonical path; this prototype only de-risked the plumbing
(gym standup, router-tools worker, depth-best scoring) and is now dead weight.
* feat(bench): package the optimization suite (runBenchmark) + clarify naming + onboarding fixes
* feat(bench): make Strategy a first-class, OPEN abstraction (author your own)
The question: when we collapse to "refine", can a dev create their OWN strategy?
Before: no — runAgentic took mode:'depth'|'breadth', a CLOSED enum. The capability
existed (a strategy is an Agent) but the door wasn't cut.
Now: `Strategy` is an exported interface — `{ name, driver(surface, task, opts, budget)
=> Agent }`. A strategy builds the driver Agent the Supervisor runs; author your own by
returning an Agent whose act() spawns shots/analysts via scope.spawn/next/send. `refine`
and `sample` ship as instances AND the reference driver implementations (depthDriver/
breadthDriver) are exported to copy. runAgentic accepts a `strategy` (mode kept for
back-compat); runBenchmark takes `Strategy[]` — pass the built-ins or your own.
What's under the words:
sample = K independent attempts, keep the best-verifying (best-of-N / resample)
refine = attempt → observe() reads the trace → steer the next → repeat (iterate)
A multi-agent "team" is just a Strategy whose driver spawns several different agents —
same recursive Agent atom, coordinated over the Scope.
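To make the two built-in behaviors concrete, here is a deliberately simplified sketch that drops the Supervisor, Scope, and Agent machinery entirely. `SimpleStrategy`, `WorkerFn`, `ScoreFn`, and `CriticFn` are hypothetical stand-ins and not the exported `Strategy` interface; keeping the best-scoring attempt in the refine sketch is also an assumption.

```typescript
interface Attempt {
  output: string;
  trace: string[];
}
type WorkerFn = (task: string, steer?: string) => Attempt;
type ScoreFn = (output: string) => number; // the environment's own check
type CriticFn = (trace: string[]) => string; // reads the trace only, returns a steer

interface SimpleStrategy {
  name: string;
  run(task: string, worker: WorkerFn, score: ScoreFn, critic: CriticFn, budget: number): Attempt;
}

// sample: independent attempts, keep the best-scoring one.
const sampleSketch: SimpleStrategy = {
  name: "sample",
  run(task, worker, score, _critic, budget) {
    let best = worker(task);
    for (let i = 1; i < budget; i++) {
      const next = worker(task);
      if (score(next.output) > score(best.output)) best = next;
    }
    return best;
  },
};

// refine: each attempt is steered by a critique of the previous attempt's trace.
const refineSketch: SimpleStrategy = {
  name: "refine",
  run(task, worker, score, critic, budget) {
    let current = worker(task);
    let best = current;
    for (let i = 1; i < budget; i++) {
      current = worker(task, critic(current.trace));
      if (score(current.output) > score(best.output)) best = current;
    }
    return best;
  },
};
```

Both objects satisfy the same interface, which is the property that lets a benchmark accept a list of strategies, built-in or user-authored.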
* feat(bench): defineStrategy + composable steps — author a loop in ~15 lines (skillifiable)
The original goal: loops compact enough to skillify, so agents author them. A 70-line
Supervisor driver isn't that. This adds the composable LEGO:
defineStrategy(name, async ({ shot, critique, surface, budget }) => { ...compose... })
A strategy body gets two steps — shot() (one worker attempt over an artifact) and
critique() (the firewalled analyst reads the trace → a steer) — with ZERO Supervisor/
Scope/spawn/leaf/drainOne ceremony (all of it lives inside defineStrategy now). That is
the unit an agent or a skill can emit.
Proof: adaptiveRefine — a NEW strategy (refine, but ABANDON-and-restart when a steered
shot fails to improve = branch-when-stuck, the widen/MCTS idea the depth-stuck failure
motivated), authored entirely from the steps, scored keep-best. ~22 lines of pure
strategy logic, no plumbing.
Behavior-preserving: the proven refine/sample drivers (depthDriver/breadthDriver) are
UNTOUCHED — the +16.4pp result + GEPA stay valid. The steps replicate their exact
spawn/drain pattern, so a step-authored strategy behaves identically. Typecheck-verified;
adaptiveRefine live-smoke pending the gym (GEPA has it).
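The abandon-and-restart idea behind adaptiveRefine can be sketched from two steps alone. This is an illustrative reconstruction under assumptions, not the shipped strategy: `Steps`, `ShotResult`, and `adaptiveRefineSketch` are hypothetical, the steps are synchronous here, and the budget is counted in shots.

```typescript
interface ShotResult {
  score: number;
  trace: string[];
}

interface Steps {
  shot(steer?: string): ShotResult; // one worker attempt, optionally steered
  critique(trace: string[]): string; // analyst reads the trace, returns a steer
}

// Refine while steering helps; when a steered shot fails to improve on the
// attempt it was steered from, abandon that line and restart unsteered.
function adaptiveRefineSketch(steps: Steps, budget: number): ShotResult {
  let current = steps.shot();
  let best = current;
  let used = 1;
  while (used < budget) {
    const steered = steps.shot(steps.critique(current.trace));
    used += 1;
    if (steered.score > best.score) best = steered;
    if (steered.score > current.score) {
      current = steered; // improvement: keep refining this line
    } else if (used < budget) {
      current = steps.shot(); // stuck: branch with a fresh attempt
      used += 1;
      if (current.score > best.score) best = current;
    }
  }
  return best; // keep-best across every shot taken
}
```

The body contains only strategy logic; everything about spawning and draining would live behind `shot` and `critique`.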
* docs(bench): strategy-demo example — the optimization suite in 3 layers (gym-free, runnable)
The missing onboarding piece: a runnable demo of the whole suite on a toy "counter"
Environment (needs only a router key — no dataset, no sandbox). Shows all three layers:
1. runBenchmark(env, …) — default strategies compared, free.
2. strategies: [sample, refine, adaptiveRefine] — pick, named by behavior.
3. defineStrategy('doubleCheck', body) — author your own in ~10 lines from shot()+critique(),
zero Supervisor ceremony. The skillifiable unit.
Verified: runs end-to-end through the canonical Supervisor; all 4 strategies execute and
score via the Environment's own check. README documents the model + the customization hooks.
* chore(examples): clearer names — drop the confusing `with-` prefix; clarify intent
Disciplined subset of the examples-naming audit (NOT the proposed 01-08 numbering /
.deprecated quarantine — that's churn for throwaway examples and the README already
orders them):
with-knowledge-readiness → knowledge-gating (`with-` read as an optional toggle)
with-intelligence-export → intelligence-export (same)
agent-into-reviewer → pipe-into-reviewer (signals the 2-runtime piping)
KEPT runtime-run (it teaches startRuntimeRun — the name matches the product API) and
agents-of-all-shapes (memorable + has a test). git mv preserves history; README +
docs/concepts + all internal self-references updated; zero stragglers.
drewstone
added a commit
that referenced
this pull request
Jun 9, 2026
…eal agent-eval primitives) (#205)
* feat(bench): GEPA frozen holdout - confirm the winner generalizes vs the baseline
Adds a HOLDOUT=N option: after optimizing on the search tasks, score the winning analyst instruction AND the seeded baseline (observe default) on a DISJOINT slice (offset = search-set size). Holdout breadth computed once; winner+baseline depth scored against it. Reports whether GEPA GENERALIZED (winner > baseline on held-out tasks): the frozen confirmation the discipline requires (guards against overfitting the search set). loadItsmTasks gains an offset param.
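The disjoint-slice rule for the holdout (offset equal to the search-set size) can be sketched as follows. `splitSearchAndHoldout` and `generalized` are hypothetical helpers for illustration, and comparing mean scores is an assumption about how "winner > baseline" is decided.

```typescript
// The holdout starts where the search set ends, so the two slices never overlap.
function splitSearchAndHoldout<T>(
  tasks: T[],
  searchSize: number,
  holdoutSize: number,
): { search: T[]; holdout: T[] } {
  if (searchSize + holdoutSize > tasks.length) {
    throw new Error("not enough tasks for a disjoint holdout");
  }
  return {
    search: tasks.slice(0, searchSize),
    holdout: tasks.slice(searchSize, searchSize + holdoutSize),
  };
}

// Winner and baseline are both scored on the same held-out tasks.
function generalized(winnerScores: number[], baselineScores: number[]): boolean {
  const avg = (xs: number[]): number => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  return avg(winnerScores) > avg(baselineScores);
}
```

Refusing to run when the task list is too short avoids silently reusing search tasks as holdout tasks.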
drewstone
added a commit
that referenced
this pull request
Jun 16, 2026
…dation (#304)
* feat(intelligence): capability-delivery manifest - composeCertifiedProfile + resolver + ladder
Add the unified, future-proof delivery structure: one certified unit of agent power = { interface, binding }. Interfaces are CLOSED (tool / mcp / context / retrieval / hook / subagent); bindings are OPEN (inline / file / http / sandbox-code / mcp-stdio / mcp-remote / process-on-infra / rag-index / memory-store / wasm / a2a). A single resolver lowers any binding into one uniform ResolvedSurface consumed identically by the host seam (RouterToolsSeam tools + executeToolCall) and the sandbox seam (AgentProfile).
- src/intelligence/capability.ts: the manifest types + CapabilityNotAdmittedError + manifestFromProfile (lowers today's CertifiedProfile wire into capabilities[] with best-effort binding inference, so the spine delivers value before the plane changes).
- src/intelligence/resolver.ts: composeCertifiedProfile. The spine resolves inline/file (byte-identical to composeCertifiedPrompt, the regression lock), mcp-stdio/mcp-remote (strict union widens to the SDK's flat AgentProfileMcpServer, an always-valid lowering), and http tools (the host seam). Ladder rungs that need infra (sandbox-code, process-on-infra) are injected ResolveCtx providers; rag-index/memory-store/wasm/a2a throw CapabilityNotAdmittedError (memory gated on the E3 admission bar). Fail-closed: null manifest -> base surface, per-capability failure -> drop (diagnostic via onDrop), post-resolve drift drops any tool/mcp whose live names diverge.
- src/mcp/delegation-profile.ts: composeProductionAgentProfile now also merges tools box-flags, hooks, subagents, and injects ResolvedSurface.mcpConnections into AgentProfile.mcp (the sandbox-seam mapping).
- exports + export gate + the two spec corrections (mcp lowers via always-valid widening; tools lower two ways since AgentProfile.tools is box flags).
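A rough sketch of the closed-interface / open-binding split and the fail-closed resolve described above. Every name here (`CapabilitySpec`, `NotAdmittedError`, `resolveSketch`, the string-based surface) is hypothetical and far simpler than composeCertifiedProfile; in particular, treating a not-admitted binding as a per-capability drop is an assumption of this sketch.

```typescript
// Closed set: adding an interface is a type change.
type CapabilityInterface = "tool" | "mcp" | "context" | "retrieval" | "hook" | "subagent";

interface CapabilitySpec {
  name: string;
  interface: CapabilityInterface;
  binding: { kind: string; [key: string]: unknown }; // open set: any kind string
}

class NotAdmittedError extends Error {}

type Lowering = (cap: CapabilitySpec) => string;

// Bindings this sketch knows how to lower into the uniform surface.
const lowerings: Record<string, Lowering | undefined> = {
  inline: (cap) => `${cap.interface}:${cap.name}`,
  file: (cap) => `${cap.interface}:${cap.name}@file`,
};

// Bindings that are recognized but not yet admitted.
const notAdmitted = new Set(["rag-index", "memory-store", "wasm", "a2a"]);

function resolveSketch(
  manifest: CapabilitySpec[] | null,
  onDrop: (capName: string, reason: string) => void,
): string[] {
  const surface: string[] = []; // the base surface is empty here
  if (manifest === null) return surface; // fail closed: no manifest, base surface only
  for (const cap of manifest) {
    try {
      if (notAdmitted.has(cap.binding.kind)) {
        throw new NotAdmittedError(`binding ${cap.binding.kind} is not admitted`);
      }
      const lower = lowerings[cap.binding.kind];
      if (lower === undefined) throw new Error(`no lowering for ${cap.binding.kind}`);
      surface.push(lower(cap));
    } catch (err) {
      // One bad capability is dropped with a diagnostic; the rest still resolve.
      onDrop(cap.name, err instanceof Error ? err.message : String(err));
    }
  }
  return surface;
}
```

Keeping the interface union closed while the binding kind stays a plain string mirrors the stated design: consumers can exhaustively switch on the interface, while new delivery mechanisms only need a new lowering.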
* docs(rsi): correct depth>breadth to the POWER-16 tie at n=48 (not the n=16 +16.4pp)
The +16.4pp CI[+5.3,+29.8] n=16 depth-steered-continuation result did not replicate when powered: depth-breadth = +4.7pp CI[-1.9,+11.4] at n=48 (a tie; +4.1pp at n=72). architecture.md and roadmap-rsi.md advertised it as a cleared keystone; they now carry the retraction and point at .evolve/current.json.
* chore(clean): remove dead mock loop + orphan re-exports/interface (432 LOC)
- delete bench/src/observe-steer-workspace-loop.mts (the #194 mock anti-pattern; 0 inbound refs)
- drop orphan pass-through re-exports CaptureIntegrityError/ReplayError/VerificationError (src/errors.ts)
- drop orphan interface AgentTaskRunSummary (src/types.ts)
- fix doc-rot in loop-facade-postmortem.md; gitignore stray test_repo/
- deletion-ledger.md tracks deletions + the deferred migrations (driver.ts 12 callers, AgentProfile superset)
Gates verified by hand: typecheck 0, lint 0, 924 tests pass / 0 fail. Load-bearing fail-loud fences left intact (NOT dead code).
* docs(research): atom-compression plan, harness-compat matrix, long-horizon map
* chore(deps): bump @types/node 25.9.3 + playwright 1.61.0 (dev)
Safe minor/patch dev bumps, gates verified (typecheck 0, lint 0, 924 tests pass). Deferred (need own careful pass): biome 2.5 (13 new lint warnings), typescript 6 + vitest 4 (majors), agent-eval 0.92 (substrate; sync with the AgentProfile superset work).
* docs(research): RSI atom masterplan + build tracker (single source of truth)
* docs(research): collapse N driver prompts → one cached generator (software 3.0)
Replace the per-role hand-coded prompt builders with generateDriverSystemPrompt(spec): a (fused) router call generates the driver prompt from {role,goal,target,harness,stance}, cached for semantic reuse via PromptRegistry + hashContent key (file/JSON or DB). The hand-authored worker-driver prompt becomes the generator's seed + its tests the invariants. Single optimizable surface; depends on a tangle-router fusion primitive (separate issue).
* docs(research): active push - RUN/DELETE/IMPROVE worklist (delete createDriver, run commit0, deep-clean, dedup)
* docs(research): createDriver delete BLOCKED (paradigm diff, evidenced); commit0 RAN; the delete fork
* refactor(runtime): full nuke of the createDriver/string-prompt measurement+eval paradigm
DELETE the wrong abstraction (createDriver = a code TopologyPlanner driving runLoop over string-prompt→string-answer calls, judged by adapter.judge) and the entire old bench experiment + eval-gen apparatus built on it. The agent-driver (AgentProfile driving AgentProfile via coordination tools) replaces it; the runLoop KERNEL and the Scope/Supervisor are untouched.
Deleted (15): src/runtime/driver.ts; bench experiment.ts(+test)/steering-experiment(+test)/improve-prompt/research-loop/finsearch-loop/rsi/generate-eval/run-benchmarks/run.ts/skills-sandbox/profile-coord-sandbox; tests/loops/dynamic.test.ts.
Survivors (search-bench/cloud-loop/fleet/commit0-gate) re-homed onto a new pure helper bench/src/sandbox-run.ts (answerOutput/sandboxAgentRun/WorkerBackendType/AnalystFn/llmAnalyst, no experiment shell). runLoop kernel tests kept via a scriptedDriver stub in refine-driver.ts.
Gates (hand-verified): build 0, typecheck 0 (root+bench; also fixed a pre-existing bench BackendType red), lint 0, 905 tests pass. Zero dangling code refs.
ACCEPTED casualties of the full nuke (rebuild on the agent-driver/Supervisor path when wanted): the generate-eval data engine, the AgentProfile-coordinate optimizer (profile-coord), and run.ts's non-experiment subcommands (preflight/verify-judge/solve-one/ui-review).
* docs(research): full nuke DONE (-3492 LOC); doc/skill-rot follow-up tracked
* docs(cleanup): retarget all docs+skills off the nuked createDriver/runExperiment to the agent-driver/Supervisor reality
* feat(supervise): recursive driver-executor - agents driving agents driving agents
A spawned child can now BE a driver. driverExecutorFactory mounts a NESTED Scope over the SAME conserved budget pool + shared journal (scope.ts's new NestedScopeSeam), one depth deeper, and runs the wrapped driver's act there. A child resolves to a LEAF (worker) OR, for a role:'driver' spec via withDriverExecutor, this executor, recursively. So a driver spawns a driver spawns a worker on one budget-conserving tree. The persona/strategy spawn fences now route a driver child to the recursive executor (compose) instead of throwing; act still fails loud only if a child is run directly.
Reuses the atom: builds NO new budget/journal/selection logic. Budget conserved across depth (reserve-on-spawn fails closed at any depth), spend bubbles to root, journal records each nested tree, maxDepth enforced across recursion.
Proven OFFLINE (no creds; scripted drivers+workers) in tests/loops/driver-recursion.test.ts: depth-2 chain root->mid->inner->worker (node id rec:s0:s0:s0, which a non-recursive build cannot produce), fail-closed budget conservation across depth, spend roll-up (spentTotal = the worker's exact spend), nested-journal sub-trees, depth-ceiling across recursion. Gates hand-verified: build 0, typecheck 0, lint 0, 911 tests pass.
* refactor(bench)+docs: reclaim runKeystoneGate -> runGate; strip 4 docs to latest-only
Rename the opaque 'keystone' jargon: runKeystoneGate->runGate (+ RunGateOptions/GateArmResult/GateReport), bench/src/keystone-gate.ts->gate.ts (+ -cli, +test), all import paths and CLI banners. Strip the last historical createDriver/runExperiment 'was removed/nuked' breadcrumbs from architecture.md, architecture-interpretations.md, learning-flywheel.md, roadmap-rsi.md; upgrading agents now see only the current agent-driver/Supervisor reality (history lives in git). Gates green; 905->911 with the keystone test.
* docs(research): keystone recursion ✅ (9d188e1); createDriver retire ✅ via nuke; #2b brain next
* feat(supervise): coordinationDriverAgent - the cheap/offline driver (LLM tool-loop over the coordination verbs)
The CHEAP, in-process, no-creds variant of the recursive driver: act() mounts createCoordinationTools over its scope and runs an LLM tool-loop (injected chat seam) so the driver REASONS spawn/steer/await/stop; composes with 2a recursion (a driver agent spawns a driver agent, via makeWorkerAgent -> driverChild). NOT the primary driver: the CAPABLE driver is a sandbox agent with the coordination verbs as an MCP. This one is the offline-testable + cheap-orchestration path. Prompt is INJECTED (decoupled from agent-eval).
Proven OFFLINE (no creds, scripted mock chat) tests/loops/coordination-driver.test.ts: the tool-loop drives real Scope.spawn via the coordination verbs + folds results back; a driver AGENT spawns a driver AGENT (separate nested journal tree). typecheck 0, lint 0.
* docs(research): dual-purpose resolution (one substrate serves product + proof); #2b cheap driver done, #2c capable sandbox driver + #3 completion-oracle next
* feat(supervise): completion-oracle - settled ⟺ delivered (Foreman 0/18)
The honest settle: a node counts as delivered only when a deployable check passes, never on self-report.
- completion-gate.ts: gateOnDeliverable wraps any Executor so its settlement valid reflects a DeliverableSpec check (both execute shapes; fail-closed).
- coordination-driver finalize: returns the best DELIVERED child; undefined when none delivered. A driver cannot self-declare done via prose.
- driver-executor: derive the driver child's verdict from its direct settled events, so delivery composes UP the recursion (a sub-driver is valid only when it itself selected a delivered child).
- supervisor: a winner MUST carry a real Out; a successful act that produced nothing is a no-winner, never a winner wrapping undefined.
8 offline tests: leaf gate (both execute shapes, fail-closed), ran-but-didn't-deliver yields no winner, the gate dominates score, delivery propagates up the recursion.
* docs(research): completion-oracle #3 ✅ (bd58761) - settled ⟺ delivered, composes up the recursion
* feat(bench): atom-humaneval - agents-driving-agents on a live deployable-checked domain
A coordinationDriverAgent (real router brain) drives gated workers on HumanEval: each worker is settled valid ONLY when the local Docker test suite passes (completion-oracle, not self-report), against a blind best-of-K baseline. Proven live: the driver spawns, the worker solves, the checker gates, the supervisor returns a winner only on real delivery. Also exports gateOnDeliverable/DeliverableSpec from the runtime barrel (the #3 primitive was added to supervise/ but not surfaced on the package).
* feat(topology): animated visual replay of a recursive agent run
Fold the one runtime-hooks stream into a timestamped ReplayEvent[] (createReplayRecorder) and render a self-contained, scrubbable HTML player (renderReplayHtml): the recursive agent tree animated over wall-clock, each node colored by the completion-oracle (delivered (valid) green, ran-but-not-delivered amber, failed red), with live token/cost counters. Synthesizes the unspawned root driver so the whole recursion renders. No server/build/deps. Wired into atom-humaneval (every driver run emits a replay.html). 4 offline recorder tests; proven on a live HumanEval run (driver -> worker -> delivered).
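The recursion and completion-oracle commits above share three invariants: one budget pool for the whole tree, a depth ceiling, and validity that comes only from a deliverable check. The sketch below models those invariants on a plain tree; `BudgetPool`, `TreeNode`, and `settleNode` are hypothetical, and "a driver is valid when any child delivered" is a simplification of selecting a delivered child.

```typescript
interface BudgetPool {
  remaining: number;
  spentTotal: number; // all spend lands on this one root-level counter
}

interface TreeNode {
  id: string;
  children?: TreeNode[]; // present: a driver that spawns these nodes
  cost?: number; // leaf only: budget the worker needs
  delivered?: boolean; // leaf only: outcome of the deployable check
}

// Reserve-on-spawn: refuse when the shared pool cannot cover the request.
function reserve(pool: BudgetPool, amount: number): boolean {
  if (amount > pool.remaining) return false;
  pool.remaining -= amount;
  return true;
}

// Returns true only when a deliverable check passed somewhere beneath this node.
function settleNode(node: TreeNode, pool: BudgetPool, depth: number, maxDepth: number): boolean {
  if (depth > maxDepth) return false; // depth ceiling holds across recursion
  if (node.children !== undefined) {
    let anyDelivered = false;
    for (const child of node.children) {
      if (settleNode(child, pool, depth + 1, maxDepth)) anyDelivered = true;
    }
    return anyDelivered; // a driver cannot declare itself done
  }
  const cost = node.cost ?? 0;
  if (!reserve(pool, cost)) return false; // fail closed at any depth
  pool.spentTotal += cost;
  return node.delivered === true; // ran-but-not-delivered is not valid
}
```

Passing the same pool object down every level is what makes the budget conserved by construction rather than by bookkeeping between scopes.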
* fix(bench): atom-humaneval blind arm survives transient router errors (a 502 is a failed attempt, not a crash; matches the driver arm's down-typing)
* feat(supervise): the supervisor AUTHORS worker profiles from a skill (the intelligence, not the plumbing)
The supervisor's job is to DESIGN the agents it spawns: read the task, decompose it, and author a tailored profile (instructions + model) per worker. supervisorSkill is the how-to it reads (its own system prompt), THE optimizable self-improvement surface; authoredWorker builds a worker from an authored profile; asAuthoredProfile catches empty/placeholder profiles (a skill violation). Proven offline (no creds, no plumbing): a skill-guided supervisor authors DISTINCT, tailored worker recipes per sub-task and they flow to the workers. 3 tests.
* feat(supervise): coordination MCP over a live Scope - the real keystone for in-box driving
serveCoordinationMcp fronts a live Scope with an HTTP JSON-RPC MCP server: an in-box coding harness (opencode via cli-bridge) mounts mcp.mcpServers.coordination and calls spawn_worker as a native tool, landing on Scope.spawn. A real box driving real boxes, not emulated function-tools. Real test: HTTP tools/call spawn_worker -> Scope.spawn -> worker settles -> winner (no mock of the MCP path). Plus the standard supervise SKILL.md.
* feat(bench): prove a coding harness drives the Scope via the coordination MCP (live)
opencode (glm-5-turbo via cli-bridge) mounts mcp.mcpServers.coordination (type:http → opencode remote) and calls spawn_worker itself → real Scope.spawn → worker settles, and reads back the await_next result. The in-box driving path is REAL: a coding agent drives recursion as a native tool, not emulated. (Bridge wants mcp type:'http', not 'remote'.)
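For orientation, this is roughly what a `tools/call` exchange for `spawn_worker` looks like at the JSON-RPC level. The envelope follows the MCP `tools/call` shape; the `spawn_worker` argument fields, the in-process dispatcher, and the handler are hypothetical stand-ins for serveCoordinationMcp and Scope.spawn, whose real signatures are not shown in this PR.

```typescript
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

interface ToolCallResponse {
  jsonrpc: "2.0";
  id: number;
  result?: { content: Array<{ type: "text"; text: string }>; isError: boolean };
  error?: { code: number; message: string };
}

type ToolHandler = (args: Record<string, unknown>) => string;

function buildToolCall(id: number, tool: string, args: Record<string, unknown>): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name: tool, arguments: args } };
}

// In-process stand-in for the HTTP server: route the call to a registered handler.
function dispatchToolCall(req: ToolCallRequest, handlers: Map<string, ToolHandler>): ToolCallResponse {
  const handler = handlers.get(req.params.name);
  if (handler === undefined) {
    // Unknown tool is a protocol-level error (invalid params).
    return { jsonrpc: "2.0", id: req.id, error: { code: -32602, message: `Unknown tool: ${req.params.name}` } };
  }
  try {
    const text = handler(req.params.arguments);
    return { jsonrpc: "2.0", id: req.id, result: { content: [{ type: "text", text }], isError: false } };
  } catch (err) {
    // A tool that ran and failed reports the failure inside the result.
    const text = err instanceof Error ? err.message : String(err);
    return { jsonrpc: "2.0", id: req.id, result: { content: [{ type: "text", text }], isError: true } };
  }
}
```

Because the driver only ever sees this envelope, any harness that can speak MCP over HTTP can call the coordination verbs without knowing what sits behind them.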
* feat(bench): WHOLE real e2e: opencode supervisor drives opencode workers via the coordination MCP, real test gates delivery

  Live, no mock: the opencode supervisor (glm-5-turbo via cli-bridge) mounts the coordination MCP, authors worker profiles, calls spawn_worker -> real Scope.spawn -> real opencode workers code in a cwd -> python3 test gates valid -> supervisor settles on the delivered worker -> winner. The completion-oracle (deployable check, not LLM judge) decided delivery over the supervisor's confusion that it couldn't see the workers' isolated cwds (→ shared Workspace next). Proof artifact for the in-box-driving path; the law-compliant productionization is a substrate backend (tmux/bridge/sandbox) that runs authored profiles, not this harness-specific script.

* docs(canonical-api): the AgentProfile law: author the profile, the substrate materializes it

  §1.5 + decision-table rows + CLAUDE.md §0 pointer. The thing we keep forgetting: an agent IS its full AgentProfile (prompt+skills+tools/mcp+subagents+hooks+permissions+model), not a prompt; change behavior by AUTHORING the profile and letting the sandbox substrate materialize it into harness shapes. Never write a verify-loop or harness-specific config (self-verification is a hook/process; opencode is only the cli-bridge test target; a missing lever is a substrate gap).

* docs(research): consolidate docs/research 28→14, retire shipped/subsumed design docs

  Retired 14 design-research docs whose content is now shipped code, in .evolve/current.json, or self-declared subsumed/retracted (the recursion atom shipped; the optimization-space layer evidence landed; verdicts reached). Refreshed the research index, recorded the retirement + rationale in deletion-ledger.md (Pass 2), and fixed every inbound link (top index, the harvest-corpus.ts comment → current.json, optimization-space's suite links). Kept the SSOT masterplan, the canonical-referenced maps (optimization-space/leapfrog), the two gated belief specs, the postmortem guardrail, the build-lists, and the agent-lab tombstones. No broken links into the 14 remain from any canonical doc or src/.

* docs(canonical-api): substrate pin 0.89 → 0.92 (matches the merged package.json)
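The coordination MCP commits above describe an in-box harness calling spawn_worker through an HTTP JSON-RPC `tools/call`, with the supervisor authoring a profile per worker. The following is a minimal sketch of what such a request body might look like. It assumes the standard MCP `tools/call` envelope; the argument fields, the `AuthoredProfile` type, and the helpers `isAuthored` and `buildSpawnWorkerCall` are hypothetical stand-ins, not the schema or the asAuthoredProfile rules from this PR.

```typescript
// Hypothetical authored profile: instructions + model, as the commit text describes.
interface AuthoredProfile {
  instructions: string;
  model: string;
}

// JSON-RPC 2.0 envelope for an MCP `tools/call` request.
interface ToolsCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Reject empty or placeholder profiles before they reach a worker.
// These placeholder patterns are illustrative guesses only.
function isAuthored(profile: AuthoredProfile): boolean {
  const text = profile.instructions.trim();
  if (text.length === 0 || profile.model.trim().length === 0) return false;
  return !/^(todo|tbd|placeholder|\.\.\.)$/i.test(text);
}

// Build the body an in-box harness could POST to the coordination server.
function buildSpawnWorkerCall(id: number, profile: AuthoredProfile): ToolsCallRequest {
  if (!isAuthored(profile)) {
    throw new Error("profile is empty or a placeholder");
  }
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "spawn_worker",
      arguments: { instructions: profile.instructions, model: profile.model },
    },
  };
}

const request = buildSpawnWorkerCall(1, {
  instructions: "Implement module a.py and make its unit test pass.",
  model: "glm-5-turbo",
});
console.log(JSON.stringify(request));
```

Keeping the profile check in front of the call mirrors the idea that an empty profile is a skill violation to be caught before a worker is spawned, rather than a failure discovered after the worker runs.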
Summary

* `observe()` as the trace-derived third-person watcher that turns worker behavior into findings, reports, and optional corpus facts
* `gitWorkspace`/`Shell` port for clone/commit/push durable workspace loops
* `bench/src/observe-steer-workspace-loop.mts`, the local substrate proof for: Supervisor/Scope -> coordination MCP tools -> git workspace -> `observe()` finding -> `steer_worker`/`Scope.send` -> corrective worker -> fresh-clone integration pass
* Scope, Supervisor, `runLoop`, validators, journals, MCP coordination, git workspace, and `observe`
* `defineLoop` facade/protocol, its exports, docs, example, and tests
* `Question`, `QuestionDecision`, `QuestionPolicy`, `CoordinationEvent`; no facade-era `LoopQuestion` types
* `AGENTS.md`/`CLAUDE.md` now point to canonical `docs/BUILDING.md` and `docs/ANTI_PATTERNS.md`
* `loop-writer`, the facade postmortem, and the docs index with the proof command and the "do not relocate protocol and call it simplification" guardrail

Design Notes
The post-audit API stance is deliberate: loops are not a new runtime grammar. They are ordinary agent code over the existing substrate. `observe()` is the new load-bearing primitive; `gitWorkspace` is the durable workspace seam; MCP coordination is the sandbox binding for the same `Scope` verbs.

The decisive local join is now proven by:

`pnpm exec tsx bench/src/observe-steer-workspace-loop.mts`

That proof uses a mock `ChatClient` transport for the observer model call and local BYO worker executors so it is reproducible without cloud credentials. Honest remaining proof: run the same shape with `openSandboxRun` workers and a remote branch that a sandbox can clone and push.

Docs placement is now explicit: `AGENTS.md` and `CLAUDE.md` are bootloaders; durable build rules live in `docs/BUILDING.md`; named failure modes live in `docs/ANTI_PATTERNS.md`; evidence and postmortems stay under `docs/research/*` / memory.

Validation
* `pnpm lint`
* `pnpm typecheck`
* `pnpm test -- --runInBand` (67 files / 678 tests)
* `pnpm exec vitest run tests/loops/workspace.test.ts tests/loops/coordination.test.ts --reporter=dot`
* `pnpm exec vitest run tests/loops/workspace.test.ts --reporter=dot`
* `pnpm exec tsx bench/src/observe-steer-workspace-loop.mts` (INTEGRATION OK)
* `pnpm build`
* `pnpm verify:package`
* `python3 /home/drew/.codex/skills/.system/skill-creator/scripts/quick_validate.py skills/loop-writer`
* `git diff --check`
* `git fetch origin main && git merge-tree --write-tree origin/main HEAD`

Notes

This PR remains draft. The next proof should be the cloud variant: `openSandboxRun` worker plus remote git branch, without adding a new loop facade first.
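The clone/commit/push shape that a cloud variant would repeat can be sketched as one worker turn over a shell port. This is a hedged illustration, not the PR's `gitWorkspace` implementation: the `Shell` interface signature, `workerTurn`, `RecordingShell`, and the paths are all assumed for the example. It also assumes the branch already exists on the remote and that the worker changes at least one file, since `git commit` fails on an empty change set.

```typescript
// Hypothetical Shell port: runs one command in an optional working directory.
// The PR names a Shell port but does not show its signature; this one is assumed.
interface Shell {
  run(cmd: string, args: string[], cwd?: string): void;
}

// One worker turn against a durable git workspace:
// fresh clone -> do the work -> commit -> push, so the next fresh clone sees the result.
function workerTurn(
  shell: Shell,
  remote: string,   // bare repo path or remote URL acting as the durable workspace
  branch: string,
  cloneDir: string, // a fresh directory per worker, discarded after the push
  message: string,
  work: (dir: string) => void,
): void {
  shell.run("git", ["clone", "--branch", branch, remote, cloneDir]);
  work(cloneDir);
  shell.run("git", ["add", "-A"], cloneDir);
  shell.run("git", ["commit", "-m", message], cloneDir);
  shell.run("git", ["push", "origin", branch], cloneDir);
}

// Recording fake: checks the command sequence without git or credentials.
class RecordingShell implements Shell {
  readonly calls: string[] = [];
  run(cmd: string, args: string[]): void {
    this.calls.push([cmd, ...args].join(" "));
  }
}

const shell = new RecordingShell();
let workedIn = "";
workerTurn(shell, "/tmp/workspace.git", "main", "/tmp/worker-a", "add a.py", (dir) => {
  workedIn = dir; // a real worker would write files into the clone here
});
console.log(shell.calls.join("\n"));
```

Injecting the shell keeps the loop itself identical whether the commands run locally or inside a sandbox; a real adapter could wrap Node's `execFileSync` from `node:child_process`, while the recording fake keeps the sequence testable offline.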