feat(loops): add observe and substrate loop proofs #194

Merged
drewstone merged 11 commits into main from feat/observe-closed-loop
Jun 8, 2026
@drewstone drewstone commented Jun 8, 2026

Contributor

Summary

  • add observe() as the trace-derived third-person watcher that turns worker behavior into findings, reports, and optional corpus facts
  • add the narrow gitWorkspace/Shell port for clone/commit/push durable workspace loops
  • add bench/src/observe-steer-workspace-loop.mts, the local substrate proof for: Supervisor/Scope -> coordination MCP tools -> git workspace -> observe() finding -> steer_worker/Scope.send -> corrective worker -> fresh-clone integration pass
  • keep loop authoring substrate-first: Scope, Supervisor, runLoop, validators, journals, MCP coordination, git workspace, and observe
  • remove the experimental defineLoop facade/protocol, its exports, docs, example, and tests
  • tighten MCP coordination back to substrate names: Question, QuestionDecision, QuestionPolicy, CoordinationEvent; no facade-era LoopQuestion types
  • split process rules out of agent bootloaders: AGENTS.md/CLAUDE.md now point to canonical docs/BUILDING.md and docs/ANTI_PATTERNS.md
  • update loop-writer, the facade postmortem, and the docs index with the proof command and the "do not relocate protocol and call it simplification" guardrail

Design Notes

The post-audit API stance is deliberate: loops are not a new runtime grammar. They are ordinary agent code over the existing substrate. observe() is the new load-bearing primitive; gitWorkspace is the durable workspace seam; MCP coordination is the sandbox binding for the same Scope verbs.

The decisive local join is now proven by:

pnpm exec tsx bench/src/observe-steer-workspace-loop.mts

That proof uses a mock ChatClient transport for the observer model call and local BYO worker executors, so it is reproducible without cloud credentials. Still unproven: the same shape with openSandboxRun workers and a remote branch that a sandbox can clone and push.

Docs placement is now explicit: AGENTS.md and CLAUDE.md are bootloaders; durable build rules live in docs/BUILDING.md; named failure modes live in docs/ANTI_PATTERNS.md; evidence and postmortems stay under docs/research/* / memory.

Validation

  • pnpm lint
  • pnpm typecheck
  • pnpm test -- --runInBand (67 files / 678 tests)
  • pnpm exec vitest run tests/loops/workspace.test.ts tests/loops/coordination.test.ts --reporter=dot
  • pnpm exec vitest run tests/loops/workspace.test.ts --reporter=dot
  • pnpm exec tsx bench/src/observe-steer-workspace-loop.mts (INTEGRATION OK)
  • pnpm build
  • pnpm verify:package
  • python3 /home/drew/.codex/skills/.system/skill-creator/scripts/quick_validate.py skills/loop-writer
  • git diff --check
  • git fetch origin main && git merge-tree --write-tree origin/main HEAD

Notes

This PR remains a draft. The next proof should be the cloud variant: openSandboxRun worker plus remote git branch, without adding a new loop facade first.

drewstone added 5 commits June 7, 2026 16:55
…loop

The connective tissue that turns a one-way driver→worker pipe into a feedback
loop. A worker can't see itself; observe() reads its TRACE and produces:
- findings + an operator report (what to fix — split agent vs operator), fed
  back DOWN as a steer and OUT to the operator
- durable corpus facts the NEXT run reads back (continuous self-improvement)

Findings are trace-derived, never judge-derived (derived_from_judge:false) —
the selector≠judge firewall. Harness-agnostic: reads a trace + output, so it
watches opencode/codex/hermes/BYO identically. Built on agent-eval's ChatClient
+ AnalystFinding; persists to the existing Corpus.
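
A minimal sketch of the trace-derived rule described above, not the library's observe(). The TraceEvent and Finding shapes, the observeSketch name, and the "runs of bash calls" rule are assumptions made for illustration; only the derived_from_judge: false marker comes from the text. The function takes a trace and nothing else, so a judge score has no way in.

```typescript
// Hypothetical shapes for illustration; the real trace and finding types may differ.
type TraceEvent = { tool: string; input: string };

interface Finding {
  id: string;
  audience: "agent" | "operator"; // who should act on the finding
  summary: string;
  derived_from_judge: false; // literal type: a judge-derived finding cannot be constructed
}

// Reads only the trace. There is no score parameter, so a judge verdict cannot leak in.
function observeSketch(trace: TraceEvent[], maxRun = 2): Finding[] {
  const findings: Finding[] = [];
  let run = 0;
  trace.forEach((event, i) => {
    run = event.tool === "bash" ? run + 1 : 0;
    const runEnds = trace[i + 1]?.tool !== "bash";
    if (runEnds && run > maxRun) {
      findings.push({
        id: `unbatched-bash@${i}`,
        audience: "agent",
        summary: `${run} consecutive bash calls could be batched into one`,
        derived_from_judge: false,
      });
    }
  });
  return findings;
}

const demoTrace: TraceEvent[] = [
  { tool: "bash", input: "ls" },
  { tool: "bash", input: "cat a.py" },
  { tool: "bash", input: "cat b.py" },
  { tool: "edit", input: "a.py" },
];
const demoFindings = observeSketch(demoTrace);
```

Because the finding is a plain value, the same object can be sent down as a steer, out as an operator report, or persisted as a corpus fact.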

bench/src/fleet.mts: the whole vision end to end, runnable from a laptop —
a thin local driver fans out N workers to CLOUD sandboxes, observes each
trace, reports what to fix, banks learnings; run twice and the second run
injects the first's learnings into the workers. Proven live (opencode × 2
cloud workers): the observer caught a real inefficiency (unbatched bash calls)
and banked it.

The red-team flagged two FATAL design flaws: (1) parallel cloud workers share
no filesystem, so accumulating loops (migration) integrate to nothing; (2)
resume restores decisions, not the mutated workspace. This proves the fix —
a git-backed durable workspace — on 3 dependency-ordered modules:

- durable workspace = a bare git repo (models a remote branch); each worker is
  a FRESH clone (a fresh box's empty FS), torn down after commit+push.
- PROVEN (a): worker b's fresh clone finds a.py ON DISK (git carried it, not a
  string of names); c finds b.py; the integration test imports a<-b<-c, links.
- PROVEN (b): KILL after b, RESUME → a+b skipped (durable git has them), only c
  re-runs against a clone that already contains a+b's committed code, links.
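
The fresh-clone and resume behavior above can be sketched over an injected shell port. This is a hedged illustration: the Shell type, ModuleJob, committedFiles, and runJobs are hypothetical names, not the PR's gitWorkspace API, and the sketch is synchronous for brevity. The git subcommands used (clone, ls-files, add, commit, push) are standard.

```typescript
// Hypothetical Shell port: runs a command in a directory and returns its stdout.
type Shell = (cmd: string, args: string[], cwd: string) => string;

interface ModuleJob {
  file: string; // e.g. "a.py"
  write: (cloneDir: string) => void; // the worker's edit inside its own clone
}

// Files already committed to the durable branch, as seen from a fresh clone.
function committedFiles(sh: Shell, cloneDir: string): Set<string> {
  const out = sh("git", ["ls-files"], cloneDir);
  return new Set(out.split("\n").filter((line) => line.length > 0));
}

// Each job gets a fresh clone. Jobs whose file is already durable are skipped,
// which is what lets a killed run resume and re-run only the missing modules.
function runJobs(sh: Shell, remote: string, scratch: string, jobs: ModuleJob[]): string[] {
  const ran: string[] = [];
  jobs.forEach((job, i) => {
    const cloneDir = `${scratch}/clone-${i}`;
    sh("git", ["clone", remote, cloneDir], scratch);
    if (!committedFiles(sh, cloneDir).has(job.file)) {
      job.write(cloneDir);
      sh("git", ["add", job.file], cloneDir);
      sh("git", ["commit", "-m", `add ${job.file}`], cloneDir);
      sh("git", ["push", "origin", "HEAD"], cloneDir);
      ran.push(job.file);
    }
    sh("rm", ["-rf", cloneDir], scratch); // tear the box down; only git survives
  });
  return ran;
}
```

A real Shell could wrap execFileSync from node:child_process; a recording fake is enough to test the skip-on-resume logic without git installed.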

Verdict shift: migration moves from 'cut from the pitch' to 'buildable on a
ctx.workspace handle'. The seam is git; the durable layer survives box teardown
by construction. Next: a cloud variant uses a GitHub branch as the workspace.
@drewstone drewstone changed the title from "feat(loops): add defineLoop authoring surface" to "feat(loops): add observe and substrate loop proofs" Jun 8, 2026
@drewstone drewstone marked this pull request as ready for review June 8, 2026 12:42

@tangletools tangletools left a comment

Contributor

Approved after local release-gate verification: typecheck, tests, build, lint, package export verification, and merge-tree against origin/main. Runtime hook surface is additive; delegate harness/model support is covered by tests.

@drewstone drewstone merged commit 9c371b8 into main Jun 8, 2026
1 check passed
drewstone added a commit that referenced this pull request Jun 9, 2026
…naming + onboarding fixes

The pieces existed (Supervisor + observe + the depth/breadth strategies) but weren't
wrapped as a usable suite, and the vocabulary was opaque. runBenchmark is the packaged
front door:

  runBenchmark({ environment, tasks, worker, strategies: ['sample','refine'], budget })
    → runs each strategy, scores by the environment's own deployable check, returns the
      per-strategy means + the paired-bootstrap lift of refine over sample. printBenchmarkReport
      gives the verdict. Resilient to transient per-task infra (skip, don't crash).
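
For readers unfamiliar with the "paired-bootstrap lift" mentioned above, here is a minimal sketch of one common way to compute it. It is an assumption about the general technique, not runBenchmark's actual implementation; pairedBootstrapLift, the Lift shape, the 95% percentile interval, and the seeded PRNG are all illustrative choices.

```typescript
// Small seeded PRNG (mulberry32 style) so the resampling is reproducible.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface Lift { mean: number; low: number; high: number }

// Paired bootstrap: resample task indices with replacement, keeping each task's
// (refine, sample) scores together, then take percentiles of the mean difference.
function pairedBootstrapLift(refine: number[], sample: number[], resamples = 2000, seed = 1): Lift {
  if (refine.length !== sample.length || refine.length === 0) {
    throw new Error("need equal-length, non-empty per-task scores");
  }
  const diffs = refine.map((r, i) => r - sample[i]);
  const rand = seededRandom(seed);
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    for (let k = 0; k < diffs.length; k++) {
      sum += diffs[Math.floor(rand() * diffs.length)];
    }
    means.push(sum / diffs.length);
  }
  means.sort((x, y) => x - y);
  const at = (q: number): number => means[Math.min(means.length - 1, Math.floor(q * means.length))];
  const mean = diffs.reduce((s, d) => s + d, 0) / diffs.length;
  return { mean, low: at(0.025), high: at(0.975) };
}
```

An interval that contains 0 does not support the claim that refine beats sample; it should be read as a tie at that sample size.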

Naming, made legible (public API; maps to internal depth/breadth — zero churn to the
running internals): a task domain is an `Environment` (the AgenticSurface seam under the
RL/gym-standard name); the strategies are `sample` (best-of-N / resample) and `refine`
(attempt → critic reads trace → steer → repeat), named by what they DO, not the search
tree's shape. Juniors call runBenchmark; seniors customize the hooks (worker.analystInstruction
= the critic, Environment.score = the check) or drop to runAgentic for new strategies.

Onboarding: deleted the orphaned empty examples/define-loop/ (defineLoop removed #194);
fixed the dead examples/model-resolution link in docs/concepts.md.
drewstone added a commit that referenced this pull request Jun 9, 2026
* feat(bench): GEPA over the analyst/steerer prompt on the canonical stack

The analyst IS the steerer (observe()'s findings → recommended_action → the depth
steer), so optimizing the analyst prompt optimizes the loop. This evolves it with
agent-eval's REAL GEPA primitives (buildReflectionPrompt + parseReflectionResponse
+ paretoFrontier) — no hand-rolled optimizer; there is no turnkey runPromptEvolution
in agent-eval 0.83, only the primitives, so the population loop is thin orchestration
over them.

- observe(): + analystInstruction? override (the analyst prompt is now the GEPA knob);
  defaultAnalystInstruction exported. Firewall stays structural (input has no score).
- agentic.ts: AgenticOptions.analystInstruction threads into the depth steerer.
- eops-gepa.mts: FITNESS = depth-vs-breadth lift on the canonical Supervisor+observe
  gate; breadth computed ONCE per task (shared baseline, correct + halves cost);
  failing per-task lifts = the reflection gradient. Seeds = observe()'s PROVEN default
  (the +16.4pp instruction) FIRST, then the designer-panel population.

Smoke (N=2, 1 gen) validated the full loop: score → paretoFrontier select → reflect
→ mutate → re-score → pick. Bounded real run (N=6, 2 gens) in flight.

* fix(bench): GEPA harness survives gym/router infra blips (skip failed tasks)

The first real run died when the (long-lived) gym container wedged: breadth
baselines returned 0% then runAgentic threw 'every rollout went down', killing the
whole GEPA run. runAgentic is fail-loud; the GEPA loop now catches per-task: a task
whose rollouts fail is SKIPPED (not fatal), both in the breadth precompute and the
depth fitness. Fails loud only if <2 tasks survive (genuine infra-down). Pair with a
fresh gym container + WIDTH<=2.
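
The skip-then-fail-loud policy above can be sketched as follows. This is an illustration under assumptions: scoreSurvivors and TaskScore are hypothetical names, the scorer is synchronous for brevity, and the threshold of 2 is taken from the commit message.

```typescript
interface TaskScore { taskId: string; score: number }

// Scores each task, skipping any task whose rollouts throw. Fails loud only when
// fewer than `minSurvivors` tasks produce a score (treated as genuine infra-down).
function scoreSurvivors(
  taskIds: string[],
  scoreTask: (taskId: string) => number,
  minSurvivors = 2,
): { scored: TaskScore[]; skipped: string[] } {
  const scored: TaskScore[] = [];
  const skipped: string[] = [];
  for (const taskId of taskIds) {
    try {
      scored.push({ taskId, score: scoreTask(taskId) });
    } catch {
      skipped.push(taskId); // transient infra blip: skip the task, keep the run alive
    }
  }
  if (scored.length < minSurvivors) {
    throw new Error(`only ${scored.length} of ${taskIds.length} tasks survived`);
  }
  return { scored, skipped };
}
```

Returning the skipped ids alongside the scores keeps the skip visible in the report instead of silently shrinking the sample.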

* refactor(bench): delete eops-gate.mts — the throwaway flat-loop prototype (−433 LOC)

It was a dead-end (nothing imports it): a hand-rolled flat loop that BYPASSED the
canonical Supervisor + a second copy of the gym client (6 functions duplicating
gym-agent.ts's 5). Fully superseded by the canonical stack — agentic.ts (domain-blind
depth/breadth/Supervisor/observe, 428 LOC, written ONCE) + the AgenticSurface seam
(agentic-eops.ts, 73 LOC = the entire per-domain slot-in). The +16.4pp result and the
GEPA harness run on the canonical path; this prototype only de-risked the plumbing
(gym standup, router-tools worker, depth-best scoring) and is now dead weight.

* feat(bench): package the optimization suite (runBenchmark) + clarify naming + onboarding fixes

* feat(bench): make Strategy a first-class, OPEN abstraction (author your own)

The question: when we collapse to "refine", can a dev create their OWN strategy?
Before: no — runAgentic took mode:'depth'|'breadth', a CLOSED enum. The capability
existed (a strategy is an Agent) but the door wasn't cut.

Now: `Strategy` is an exported interface — `{ name, driver(surface, task, opts, budget)
=> Agent }`. A strategy builds the driver Agent the Supervisor runs; author your own by
returning an Agent whose act() spawns shots/analysts via scope.spawn/next/send. `refine`
and `sample` ship as instances AND the reference driver implementations (depthDriver/
breadthDriver) are exported to copy. runAgentic accepts a `strategy` (mode kept for
back-compat); runBenchmark takes `Strategy[]` — pass the built-ins or your own.

What's under the words:
  sample = K independent attempts, keep the best-verifying (best-of-N / resample)
  refine = attempt → observe() reads the trace → steer the next → repeat (iterate)
A multi-agent "team" is just a Strategy whose driver spawns several different agents —
same recursive Agent atom, coordinated over the Scope.
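
To make the difference between the two built-ins concrete, here is a deliberately simplified sketch. It is not the exported Strategy interface: the real one returns a driver Agent that the Supervisor runs, while ToyStrategy below returns the best attempt directly. ToyEnvironment, Attempt, sampleToy, and refineToy are hypothetical names.

```typescript
// Simplified stand-ins: the real Strategy builds a driver Agent; this one runs inline.
interface Attempt { output: string; trace: string[]; score: number }
interface ToyEnvironment {
  attempt: (steer?: string) => Attempt; // one worker shot, optionally steered
  critique: (trace: string[]) => string; // reads the trace only, never the score
}
interface ToyStrategy { name: string; run: (env: ToyEnvironment, budget: number) => Attempt }

const best = (a: Attempt, b: Attempt): Attempt => (b.score > a.score ? b : a);

// sample: K independent attempts, keep the best-verifying one.
const sampleToy: ToyStrategy = {
  name: "sample",
  run: (env, budget) => {
    let winner = env.attempt();
    for (let k = 1; k < budget; k++) winner = best(winner, env.attempt());
    return winner;
  },
};

// refine: attempt, critique the trace, steer the next attempt, repeat.
const refineToy: ToyStrategy = {
  name: "refine",
  run: (env, budget) => {
    let last = env.attempt();
    let winner = last;
    for (let k = 1; k < budget; k++) {
      last = env.attempt(env.critique(last.trace));
      winner = best(winner, last);
    }
    return winner;
  },
};
```

Both spend the same number of attempts; the only difference is whether attempt k sees a steer derived from attempt k-1's trace.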

* feat(bench): defineStrategy + composable steps — author a loop in ~15 lines (skillifiable)

The original goal: loops compact enough to skillify, so agents author them. A 70-line
Supervisor driver isn't that. This adds the composable LEGO:

  defineStrategy(name, async ({ shot, critique, surface, budget }) => { ...compose... })

A strategy body gets two steps — shot() (one worker attempt over an artifact) and
critique() (the firewalled analyst reads the trace → a steer) — with ZERO Supervisor/
Scope/spawn/leaf/drainOne ceremony (all of it lives inside defineStrategy now). That is
the unit an agent or a skill can emit.

Proof: adaptiveRefine — a NEW strategy (refine, but ABANDON-and-restart when a steered
shot fails to improve = branch-when-stuck, the widen/MCTS idea the depth-stuck failure
motivated), authored entirely from the steps, scored keep-best. ~22 lines of pure
strategy logic, no plumbing.
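
A rough sketch of the abandon-and-restart idea, assuming simplified step signatures. defineStrategySketch, Shot, Steps, and the synchronous shot()/critique() are stand-ins invented here; the actual defineStrategy body is async and receives surface as well, and the shipped adaptiveRefine may differ in detail.

```typescript
// Hypothetical, simplified step types; the real steps run through the Supervisor.
interface Shot { score: number; trace: string[] }
interface Steps {
  shot: (steer?: string) => Shot;
  critique: (trace: string[]) => string;
  budget: number; // maximum number of shots
}
type StrategyBody = (steps: Steps) => Shot;

// Stand-in for defineStrategy: pairs a name with a body; plumbing is assumed to live elsewhere.
function defineStrategySketch(name: string, body: StrategyBody): { name: string; run: StrategyBody } {
  return { name, run: body };
}

const keepBest = (a: Shot, b: Shot): Shot => (b.score > a.score ? b : a);

// Refine, but abandon the current line and restart unsteered when a steered shot
// fails to improve on the shot it was steered from. Scored keep-best.
const adaptiveRefineSketch = defineStrategySketch("adaptiveRefine", ({ shot, critique, budget }) => {
  let current = shot();
  let bestShot = current;
  let used = 1;
  while (used < budget) {
    const steered = shot(critique(current.trace));
    used++;
    bestShot = keepBest(bestShot, steered);
    if (steered.score > current.score) {
      current = steered; // the steer helped: keep refining this line
      continue;
    }
    if (used >= budget) break;
    current = shot(); // stuck: abandon this line and restart unsteered
    used++;
    bestShot = keepBest(bestShot, current);
  }
  return bestShot;
});
```

The body only calls shot() and critique(); nothing in it knows how attempts are spawned or drained.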

Behavior-preserving: the proven refine/sample drivers (depthDriver/breadthDriver) are
UNTOUCHED — the +16.4pp result + GEPA stay valid. The steps replicate their exact
spawn/drain pattern, so a step-authored strategy behaves identically. Typecheck-verified;
adaptiveRefine live-smoke pending the gym (GEPA has it).

* docs(bench): strategy-demo example — the optimization suite in 3 layers (gym-free, runnable)

The missing onboarding piece: a runnable demo of the whole suite on a toy "counter"
Environment (needs only a router key — no dataset, no sandbox). Shows all three layers:
  1. runBenchmark(env, …) — default strategies compared, free.
  2. strategies: [sample, refine, adaptiveRefine] — pick, named by behavior.
  3. defineStrategy('doubleCheck', body) — author your own in ~10 lines from shot()+critique(),
     zero Supervisor ceremony. The skillifiable unit.
Verified: runs end-to-end through the canonical Supervisor; all 4 strategies execute and
score via the Environment's own check. README documents the model + the customization hooks.

* chore(examples): clearer names — drop the confusing `with-` prefix; clarify intent

Disciplined subset of the examples-naming audit (NOT the proposed 01-08 numbering /
.deprecated quarantine — that's churn for throwaway examples and the README already
orders them):
  with-knowledge-readiness → knowledge-gating   (`with-` read as an optional toggle)
  with-intelligence-export → intelligence-export (same)
  agent-into-reviewer      → pipe-into-reviewer  (signals the 2-runtime piping)
KEPT runtime-run (it teaches startRuntimeRun — the name matches the product API) and
agents-of-all-shapes (memorable + has a test). git mv preserves history; README +
docs/concepts + all internal self-references updated; zero stragglers.
drewstone added a commit that referenced this pull request Jun 9, 2026
…eal agent-eval primitives) (#205)

* feat(bench): GEPA frozen holdout — confirm the winner generalizes vs the baseline

Adds a HOLDOUT=N option: after optimizing on the search tasks, score the winning
analyst instruction AND the seeded baseline (observe default) on a DISJOINT slice
(offset = search-set size). Holdout breadth computed once; winner+baseline depth
scored against it. Reports whether GEPA GENERALIZED (winner > baseline on held-out
tasks) — the frozen confirmation the discipline requires (guards against overfitting
the search set). loadItsmTasks gains an offset param.
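
The disjoint-slice rule can be sketched as below. splitSearchAndHoldout and generalized are hypothetical helpers written for this note; only the "offset = search-set size" rule and the "winner > baseline on held-out tasks" criterion come from the commit message.

```typescript
// Search tasks are the first `searchSize` items; the holdout starts at
// offset = searchSize, so the two slices cannot overlap.
function splitSearchAndHoldout<T>(
  tasks: T[],
  searchSize: number,
  holdoutSize: number,
): { search: T[]; holdout: T[] } {
  if (searchSize + holdoutSize > tasks.length) {
    throw new Error("not enough tasks for a disjoint holdout");
  }
  return {
    search: tasks.slice(0, searchSize),
    holdout: tasks.slice(searchSize, searchSize + holdoutSize),
  };
}

// "Generalized" here means the winner beats the seeded baseline on held-out tasks only.
function generalized(winnerHoldout: number[], baselineHoldout: number[]): boolean {
  const avg = (xs: number[]): number => xs.reduce((s, x) => s + x, 0) / xs.length;
  return avg(winnerHoldout) > avg(baselineHoldout);
}
```

A plain mean comparison is the simplest reading of "winner > baseline"; a confidence interval on the held-out difference would be a stricter check.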
drewstone added a commit that referenced this pull request Jun 16, 2026
…dation (#304)

* feat(intelligence): capability-delivery manifest — composeCertifiedProfile + resolver + ladder

Add the unified, future-proof delivery structure: one certified unit of agent
power = { interface, binding }. Interfaces are CLOSED (tool / mcp / context /
retrieval / hook / subagent); bindings are OPEN (inline / file / http /
sandbox-code / mcp-stdio / mcp-remote / process-on-infra / rag-index /
memory-store / wasm / a2a). A single resolver lowers any binding into one uniform
ResolvedSurface consumed identically by the host seam (RouterToolsSeam tools +
executeToolCall) and the sandbox seam (AgentProfile).

- src/intelligence/capability.ts: the manifest types + CapabilityNotAdmittedError
  + manifestFromProfile (lowers today's CertifiedProfile wire into capabilities[]
  with best-effort binding inference, so the spine delivers value before the
  plane changes).
- src/intelligence/resolver.ts: composeCertifiedProfile — the spine resolves
  inline/file (byte-identical to composeCertifiedPrompt, the regression lock),
  mcp-stdio/mcp-remote (strict union widens to the SDK's flat
  AgentProfileMcpServer — an always-valid lowering), and http tools (the host
  seam). Ladder rungs that need infra (sandbox-code, process-on-infra) are
  injected ResolveCtx providers; rag-index/memory-store/wasm/a2a throw
  CapabilityNotAdmittedError (memory gated on the E3 admission bar). Fail-closed:
  null manifest -> base surface, per-capability failure -> drop (diagnostic via
  onDrop), post-resolve drift drops any tool/mcp whose live names diverge.
- src/mcp/delegation-profile.ts: composeProductionAgentProfile now also merges
  tools box-flags, hooks, subagents, and injects ResolvedSurface.mcpConnections
  into AgentProfile.mcp (the sandbox-seam mapping).
- exports + export gate + the two spec corrections (mcp lowers via always-valid
  widening; tools lower two ways since AgentProfile.tools is box flags).
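
A compact sketch of the { interface, binding } shape and the fail-closed resolution described above. The interface and binding names are copied from the commit message; everything else (the Capability shape, the ADMITTED set, resolveOne, resolveManifest, and a surface made of plain strings) is an assumption for illustration and much simpler than composeCertifiedProfile.

```typescript
// Closed set of interfaces, open set of bindings (names taken from the commit message).
type CapabilityInterface = "tool" | "mcp" | "context" | "retrieval" | "hook" | "subagent";
type BindingKind =
  | "inline" | "file" | "http" | "sandbox-code" | "mcp-stdio" | "mcp-remote"
  | "process-on-infra" | "rag-index" | "memory-store" | "wasm" | "a2a";

interface Capability {
  name: string;
  interface: CapabilityInterface;
  binding: { kind: BindingKind };
}

class CapabilityNotAdmittedError extends Error {}

// Hypothetical admitted set for this sketch; injected infra providers are left out.
const ADMITTED: ReadonlySet<BindingKind> = new Set<BindingKind>([
  "inline", "file", "http", "mcp-stdio", "mcp-remote",
]);

function resolveOne(cap: Capability): string {
  if (!ADMITTED.has(cap.binding.kind)) {
    throw new CapabilityNotAdmittedError(`${cap.name}: ${cap.binding.kind} is not admitted`);
  }
  return `${cap.interface}:${cap.name}`;
}

// Fail-closed: a null manifest yields the base (empty) surface, and a capability that
// fails to resolve is dropped and reported via onDrop instead of failing the profile.
function resolveManifest(
  manifest: Capability[] | null,
  onDrop: (name: string, err: Error) => void,
): string[] {
  const surface: string[] = [];
  for (const cap of manifest ?? []) {
    try {
      surface.push(resolveOne(cap));
    } catch (err) {
      onDrop(cap.name, err as Error);
    }
  }
  return surface;
}
```

Keeping the interface union closed means every consumer can switch exhaustively over it, while a new binding only needs a new resolver branch.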

* docs(rsi): correct depth>breadth to the POWER-16 tie at n=48 (not the n=16 +16.4pp)

The +16.4pp CI[+5.3,+29.8] n=16 depth-steered-continuation result did not
replicate when powered: depth-breadth = +4.7pp CI[-1.9,+11.4] at n=48 (a tie;
+4.1pp at n=72). architecture.md and roadmap-rsi.md advertised it as a cleared
keystone; they now carry the retraction and point at .evolve/current.json.

* chore(clean): remove dead mock loop + orphan re-exports/interface (432 LOC)

- delete bench/src/observe-steer-workspace-loop.mts (the #194 mock anti-pattern; 0 inbound refs)
- drop orphan pass-through re-exports CaptureIntegrityError/ReplayError/VerificationError (src/errors.ts)
- drop orphan interface AgentTaskRunSummary (src/types.ts)
- fix doc-rot in loop-facade-postmortem.md; gitignore stray test_repo/
- deletion-ledger.md tracks deletions + the deferred migrations (driver.ts 12 callers, AgentProfile superset)

Gates verified by hand: typecheck 0, lint 0, 924 tests pass / 0 fail. Load-bearing fail-loud fences left intact (NOT dead code).

* docs(research): atom-compression plan, harness-compat matrix, long-horizon map

* chore(deps): bump @types/node 25.9.3 + playwright 1.61.0 (dev)

Safe minor/patch dev bumps, gates verified (typecheck 0, lint 0, 924 tests pass).
Deferred (need own careful pass): biome 2.5 (13 new lint warnings), typescript 6 + vitest 4 (majors), agent-eval 0.92 (substrate — sync with the AgentProfile superset work).

* docs(research): RSI atom masterplan + build tracker (single source of truth)

* docs(research): collapse N driver prompts → one cached generator (software 3.0)

Replace the per-role hand-coded prompt builders with generateDriverSystemPrompt(spec):
a (fused) router call generates the driver prompt from {role,goal,target,harness,stance},
cached for semantic reuse via PromptRegistry + hashContent key (file/JSON or DB). The
hand-authored worker-driver prompt becomes the generator's seed + its tests the invariants.
Single optimizable surface; depends on a tangle-router fusion primitive (separate issue).
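
The caching idea can be sketched as a generator wrapped in a content-keyed map. This is an assumption-laden illustration: DriverSpec mirrors the five fields named above, but specKey (a 32-bit FNV-1a hash) is only a stand-in for the hashContent key, and cachedPromptGenerator is a hypothetical name with an in-memory Map in place of PromptRegistry.

```typescript
interface DriverSpec { role: string; goal: string; target: string; harness: string; stance: string }

// FNV-1a over a canonical JSON form of the spec. A 32-bit hash can collide;
// a real content key would use a cryptographic hash.
function specKey(spec: DriverSpec): string {
  const canonical = JSON.stringify([spec.role, spec.goal, spec.target, spec.harness, spec.stance]);
  let h = 0x811c9dc5;
  for (let i = 0; i < canonical.length; i++) {
    h ^= canonical.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Generate once per distinct spec; later calls with an equal spec reuse the cached prompt.
function cachedPromptGenerator(generate: (spec: DriverSpec) => string): (spec: DriverSpec) => string {
  const cache = new Map<string, string>();
  return (spec) => {
    const key = specKey(spec);
    let prompt = cache.get(key);
    if (prompt === undefined) {
      prompt = generate(spec);
      cache.set(key, prompt);
    }
    return prompt;
  };
}
```

Serializing the fields in a fixed order makes the key independent of property order in the caller's object literal.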

* docs(research): active push — RUN/DELETE/IMPROVE worklist (delete createDriver, run commit0, deep-clean, dedup)

* docs(research): createDriver delete BLOCKED (paradigm diff, evidenced); commit0 RAN; the delete fork

* refactor(runtime): full nuke of the createDriver/string-prompt measurement+eval paradigm

DELETE the wrong abstraction (createDriver = a code TopologyPlanner driving runLoop over
string-prompt→string-answer calls, judged by adapter.judge) and the entire old bench
experiment + eval-gen apparatus built on it. The agent-driver (AgentProfile driving
AgentProfile via coordination tools) replaces it; the runLoop KERNEL and the Scope/
Supervisor are untouched.

Deleted (15): src/runtime/driver.ts; bench experiment.ts(+test)/steering-experiment(+test)/
improve-prompt/research-loop/finsearch-loop/rsi/generate-eval/run-benchmarks/run.ts/
skills-sandbox/profile-coord-sandbox; tests/loops/dynamic.test.ts.
Survivors (search-bench/cloud-loop/fleet/commit0-gate) re-homed onto a new pure helper
bench/src/sandbox-run.ts (answerOutput/sandboxAgentRun/WorkerBackendType/AnalystFn/llmAnalyst —
no experiment shell). runLoop kernel tests kept via a scriptedDriver stub in refine-driver.ts.

Gates (hand-verified): build 0, typecheck 0 (root+bench; also fixed a pre-existing bench
BackendType red), lint 0, 905 tests pass. Zero dangling code refs.

ACCEPTED casualties of the full nuke (rebuild on the agent-driver/Supervisor path when wanted):
the generate-eval data engine, the AgentProfile-coordinate optimizer (profile-coord), and
run.ts's non-experiment subcommands (preflight/verify-judge/solve-one/ui-review).

* docs(research): full nuke DONE (-3492 LOC); doc/skill-rot follow-up tracked

* docs(cleanup): retarget all docs+skills off the nuked createDriver/runExperiment to the agent-driver/Supervisor reality

* feat(supervise): recursive driver-executor — agents driving agents driving agents

A spawned child can now BE a driver. driverExecutorFactory mounts a NESTED Scope over
the SAME conserved budget pool + shared journal (scope.ts's new NestedScopeSeam), one
depth deeper, and runs the wrapped driver's act there. A child resolves to a LEAF
(worker) OR — for a role:'driver' spec, via withDriverExecutor — this executor,
recursively. So a driver spawns a driver spawns a worker on one budget-conserving tree.
The persona/strategy spawn fences now route a driver child to the recursive executor
(compose) instead of throwing; act still fails loud only if a child is run directly.

Reuses the atom — builds NO new budget/journal/selection logic. Budget conserved across
depth (reserve-on-spawn fails closed at any depth), spend bubbles to root, journal records
each nested tree, maxDepth enforced across recursion.
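
A small model of the recursion, the shared budget pool, and the depth ceiling, under stated assumptions: BudgetPool, NodeSpec, and runNode are hypothetical and far simpler than driverExecutorFactory and NestedScopeSeam. The `:s0` id suffix follows the node id quoted in this commit message.

```typescript
// One shared pool for the whole tree: reserve-on-spawn fails closed at any depth.
class BudgetPool {
  private reserved = 0;
  private readonly total: number;
  spent = 0;
  constructor(total: number) {
    this.total = total;
  }
  reserve(amount: number): void {
    if (this.reserved + amount > this.total) throw new Error("budget exhausted: spawn refused");
    this.reserved += amount;
  }
  settle(actual: number): void {
    this.spent += actual;
  }
}

// Hypothetical node spec: a worker is a leaf, a driver spawns children one depth deeper.
type NodeSpec =
  | { role: "worker"; cost: number }
  | { role: "driver"; children: NodeSpec[] };

function runNode(
  spec: NodeSpec,
  pool: BudgetPool,
  id: string,
  depth: number,
  maxDepth: number,
  journal: string[],
): void {
  if (depth > maxDepth) throw new Error(`maxDepth ${maxDepth} exceeded at ${id}`);
  journal.push(id);
  if (spec.role === "worker") {
    pool.reserve(spec.cost);
    pool.settle(spec.cost); // spend shows up at the root because the pool is shared
    return;
  }
  spec.children.forEach((child, i) => runNode(child, pool, `${id}:s${i}`, depth + 1, maxDepth, journal));
}
```

Passing the same pool and journal objects down the recursion is what conserves budget across depth; a per-driver pool would let each level overspend independently.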

Proven OFFLINE (no creds; scripted drivers+workers) in tests/loops/driver-recursion.test.ts:
depth-2 chain root->mid->inner->worker (node id rec:s0:s0:s0 — a non-recursive build cannot
produce it), fail-closed budget conservation across depth, spend roll-up (spentTotal = the
worker's exact spend), nested-journal sub-trees, depth-ceiling across recursion.
Gates hand-verified: build 0, typecheck 0, lint 0, 911 tests pass.

* refactor(bench)+docs: reclaim runKeystoneGate -> runGate; strip 4 docs to latest-only

Rename the opaque 'keystone' jargon: runKeystoneGate->runGate (+ RunGateOptions/GateArmResult/
GateReport), bench/src/keystone-gate.ts->gate.ts (+ -cli, +test), all import paths and CLI
banners. Strip the last historical createDriver/runExperiment 'was removed/nuked' breadcrumbs
from architecture.md, architecture-interpretations.md, learning-flywheel.md, roadmap-rsi.md —
upgrading agents now see only the current agent-driver/Supervisor reality (history lives in git).
Gates green; 905->911 with the keystone test.

* docs(research): keystone recursion ✅ (9d188e1); createDriver retire ✅ via nuke; #2b brain next

* feat(supervise): coordinationDriverAgent — the cheap/offline driver (LLM tool-loop over the coordination verbs)

The CHEAP, in-process, no-creds variant of the recursive driver: act() mounts
createCoordinationTools over its scope and runs an LLM tool-loop (injected chat seam) so the
driver REASONS spawn/steer/await/stop; composes with 2a recursion (a driver agent spawns a
driver agent, via makeWorkerAgent -> driverChild). NOT the primary driver — the CAPABLE
driver is a sandbox agent with the coordination verbs as an MCP. This one is the
offline-testable + cheap-orchestration path. Prompt is INJECTED (decoupled from agent-eval).

Proven OFFLINE (no creds, scripted mock chat) tests/loops/coordination-driver.test.ts:
the tool-loop drives real Scope.spawn via the coordination verbs + folds results back; a
driver AGENT spawns a driver AGENT (separate nested journal tree). typecheck 0, lint 0.

* docs(research): dual-purpose resolution (one substrate serves product + proof); #2b cheap driver done, #2c capable sandbox driver + #3 completion-oracle next

* feat(supervise): completion-oracle — settled ⟺ delivered (Foreman 0/18)

The honest settle: a node counts as delivered only when a deployable check
passes, never on self-report.

- completion-gate.ts: gateOnDeliverable wraps any Executor so its settlement
  valid reflects a DeliverableSpec check (both execute shapes; fail-closed).
- coordination-driver finalize: returns the best DELIVERED child; undefined
  when none delivered — a driver cannot self-declare done via prose.
- driver-executor: derive the driver child's verdict from its direct settled
  events, so delivery composes UP the recursion (a sub-driver is valid only
  when it itself selected a delivered child).
- supervisor: a winner MUST carry a real Out; a successful act that produced
  nothing is a no-winner, never a winner wrapping undefined.

8 offline tests: leaf gate (both execute shapes, fail-closed),
ran-but-didn't-deliver yields no winner, the gate dominates score, delivery
propagates up the recursion.
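
The gating idea above can be sketched as follows. This is a simplified illustration under stated assumptions, not the real gateOnDeliverable: the `Settlement`, `Executor`, and `DeliverableCheck` shapes are hypothetical, and everything is synchronous for brevity.

```typescript
// Simplified, hypothetical shapes; the real Executor/DeliverableSpec types may differ.
type Settlement<Out> = { out?: Out; valid: boolean; score: number };
type Executor<In, Out> = (input: In) => Settlement<Out>;
type DeliverableCheck<Out> = (out: Out) => boolean;

// Wrap an executor so `valid` comes from the check, never from self-report.
function gateSketch<In, Out>(inner: Executor<In, Out>, check: DeliverableCheck<Out>): Executor<In, Out> {
  return (input) => {
    const settled = inner(input);
    if (settled.out === undefined) return { valid: false, score: settled.score };
    let delivered = false;
    try {
      delivered = check(settled.out);
    } catch {
      delivered = false; // fail-closed: a crashing check is not a pass
    }
    return { ...settled, valid: delivered };
  };
}

// Only delivered settlements are eligible; score ranks among those.
function pickWinner<Out>(settlements: Settlement<Out>[]): Out | undefined {
  const delivered = settlements.filter((s) => s.valid && s.out !== undefined);
  delivered.sort((a, b) => b.score - a.score);
  return delivered[0]?.out;
}

// A high-scoring worker that claims success loses to a low-scoring one that delivers.
const boastful: Executor<string, string> = () => ({ out: "return 41", valid: true, score: 0.99 });
const modest: Executor<string, string> = () => ({ out: "return 42", valid: false, score: 0.1 });
const passesTests: DeliverableCheck<string> = (code) => code.includes("42");
const gated = [boastful, modest].map((exec) => gateSketch(exec, passesTests)("task"));
const winner = pickWinner(gated);
```

In this sketch the inner executor's own `valid` is overwritten rather than combined with the check, which is what lets the gate dominate both self-report and score.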

* docs(research): completion-oracle #3 ✅ (bd58761) — settled ⟺ delivered, composes up the recursion

* feat(bench): atom-humaneval — agents-driving-agents on a live deployable-checked domain

A coordinationDriverAgent (real router brain) drives gated workers on HumanEval:
each worker is settled valid ONLY when the local Docker test suite passes
(completion-oracle, not self-report), against a blind best-of-K baseline. Proven
live: the driver spawns, the worker solves, the checker gates, the supervisor
returns a winner only on real delivery.

Also exports gateOnDeliverable/DeliverableSpec from the runtime barrel (the #3
primitive was added to supervise/ but not surfaced on the package).

* feat(topology): animated visual replay of a recursive agent run

Fold the one runtime-hooks stream into a timestamped ReplayEvent[] (createReplayRecorder)
and render a self-contained, scrubbable HTML player (renderReplayHtml) — the recursive
agent tree animated over wall-clock, each node colored by the completion-oracle: delivered
(valid) green, ran-but-not-delivered amber, failed red, with live token/cost counters.
Synthesizes the unspawned root driver so the whole recursion renders. No server/build/deps.

Wired into atom-humaneval (every driver run emits a replay.html). 4 offline recorder tests;
proven on a live HumanEval run (driver -> worker -> delivered).
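
The fold from a hook stream into timestamped replay events can be sketched as below. This is an assumption-laden illustration, not createReplayRecorder itself: the `HookEvent` fields, the injected clock, and `nodeColor` are hypothetical names chosen for the example.

```typescript
// Hypothetical event shapes; the real hook stream and ReplayEvent fields may differ.
type HookEvent =
  | { kind: "spawn"; nodeId: string; parentId?: string }
  | { kind: "settle"; nodeId: string; ok: boolean; valid: boolean };
type ReplayEvent = HookEvent & { atMs: number };

// Stamp each hook event with wall-clock time relative to recorder creation.
function createRecorderSketch(now: () => number = Date.now) {
  const startedAt = now();
  const log: ReplayEvent[] = [];
  return {
    record(hook: HookEvent): void {
      log.push({ ...hook, atMs: now() - startedAt });
    },
    snapshot(): ReplayEvent[] {
      return [...log]; // copy so callers cannot mutate the recording
    },
  };
}

// Color rule from the commit: delivered green, ran-but-not-delivered amber, failed red.
function nodeColor(settle: { ok: boolean; valid: boolean }): "green" | "amber" | "red" {
  if (!settle.ok) return "red";
  return settle.valid ? "green" : "amber";
}

// Drive the recorder with a fake clock so the timeline is deterministic.
let fakeNow = 1000;
const recorder = createRecorderSketch(() => fakeNow);
recorder.record({ kind: "spawn", nodeId: "driver" });
fakeNow = 1250;
recorder.record({ kind: "spawn", nodeId: "worker", parentId: "driver" });
fakeNow = 1900;
recorder.record({ kind: "settle", nodeId: "worker", ok: true, valid: true });
const timeline = recorder.snapshot();
```

Injecting the clock is a testing convenience assumed here; it makes offline recorder tests independent of real elapsed time.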

* fix(bench): atom-humaneval blind arm survives transient router errors (a 502 is a failed attempt, not a crash — matches the driver arm's down-typing)
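
One way to picture that fix is a best-of-K loop that records a thrown error as a failed attempt instead of letting it escape. The sketch below is illustrative only: `blindBestOfK`, `Attempt`, and the synchronous `solve` callback are hypothetical stand-ins for the bench's real arm and router call.

```typescript
// Hypothetical shapes; `solve` stands in for one routed model call.
type Attempt = { ok: true; answer: string } | { ok: false; error: string };

function blindBestOfK(k: number, solve: (index: number) => string): Attempt[] {
  const attempts: Attempt[] = [];
  for (let index = 0; index < k; index++) {
    try {
      attempts.push({ ok: true, answer: solve(index) });
    } catch (err) {
      // A transient router failure is recorded as a failed attempt, not rethrown.
      attempts.push({ ok: false, error: err instanceof Error ? err.message : String(err) });
    }
  }
  return attempts;
}

// The second call simulates a 502 from the router; the arm still finishes all K attempts.
const flaky = (index: number): string => {
  if (index === 1) throw new Error("502 Bad Gateway");
  return `candidate-${index}`;
};
const arm = blindBestOfK(3, flaky);
```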

* feat(supervise): the supervisor AUTHORS worker profiles from a skill (the intelligence, not the plumbing)

The supervisor's job is to DESIGN the agents it spawns — read the task, decompose it,
and author a tailored profile (instructions + model) per worker. supervisorSkill is the
how-to it reads (its own system prompt) — THE optimizable self-improvement surface;
authoredWorker builds a worker from an authored profile; asAuthoredProfile catches
empty/placeholder profiles (a skill violation).

Proven offline (no creds, no plumbing): a skill-guided supervisor authors DISTINCT,
tailored worker recipes per sub-task and they flow to the workers. 3 tests.
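
The profile check described above might look something like this. It is a sketch under assumptions, not the real asAuthoredProfile: the two-field profile shape, the placeholder pattern, and the error messages are invented for illustration.

```typescript
// Hypothetical profile shape and placeholder list, for illustration only.
type AuthoredProfile = { instructions: string; model: string };
const PLACEHOLDER = /^(todo|tbd|placeholder|n\/a|\.\.\.)$/i;

// Accept a profile only when the supervisor actually authored something.
function checkAuthoredProfile(raw: Partial<AuthoredProfile>): AuthoredProfile {
  const instructions = (raw.instructions ?? "").trim();
  const model = (raw.model ?? "").trim();
  if (instructions === "" || PLACEHOLDER.test(instructions)) {
    throw new Error("skill violation: instructions are empty or a placeholder");
  }
  if (model === "") throw new Error("skill violation: no model chosen");
  return { instructions, model };
}

// Two sub-tasks get two distinct, tailored profiles (model names are made up).
const parserWorker = checkAuthoredProfile({
  instructions: "Fix the tokenizer off-by-one, then run the parser tests.",
  model: "example-small-model",
});
const docsWorker = checkAuthoredProfile({
  instructions: "Update the README usage section to match the new CLI flags.",
  model: "example-large-model",
});
```

Rejecting placeholders at construction time means a lazy supervisor fails loudly before any worker is spawned, rather than producing generic workers silently.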

* feat(supervise): coordination MCP over a live Scope — the real keystone for in-box driving

serveCoordinationMcp fronts a live Scope with an HTTP JSON-RPC MCP server: an in-box
coding harness (opencode via cli-bridge) mounts mcp.mcpServers.coordination and calls
spawn_worker as a native tool, landing on Scope.spawn — a real box driving real boxes,
not emulated function-tools. Real test: HTTP tools/call spawn_worker -> Scope.spawn ->
worker settles -> winner (no mock of the MCP path). Plus the standard supervise SKILL.md.
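
The request path above (HTTP tools/call spawn_worker -> Scope.spawn) can be illustrated with the message shapes alone. The envelope follows JSON-RPC 2.0 and the MCP `tools/call` method; the `task` argument, the `FakeScope` interface, and `handleToolsCall` are assumptions for this sketch and not the serveCoordinationMcp API.

```typescript
// JSON-RPC 2.0 envelope carrying an MCP `tools/call`.
// The `task` argument and FakeScope are hypothetical, for illustration only.
type ToolsCallRequest = {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
};
type ToolsCallResponse = {
  jsonrpc: "2.0";
  id: number;
  result: { content: { type: "text"; text: string }[]; isError?: boolean };
};
interface FakeScope {
  spawn(task: string): string; // returns a worker id
}

// Route a spawn_worker call onto the scope and wrap the answer as a tool result.
function handleToolsCall(req: ToolsCallRequest, scope: FakeScope): ToolsCallResponse {
  const reply = (text: string, isError = false): ToolsCallResponse => ({
    jsonrpc: "2.0",
    id: req.id,
    result: { content: [{ type: "text", text }], isError },
  });
  if (req.params.name !== "spawn_worker") {
    // Simplification: a real server would likely return a JSON-RPC error here.
    return reply(`unknown tool: ${req.params.name}`, true);
  }
  const workerId = scope.spawn(String(req.params.arguments["task"]));
  return reply(JSON.stringify({ workerId }));
}

const spawnedTasks: string[] = [];
const scope: FakeScope = {
  spawn: (task) => { spawnedTasks.push(task); return `w${spawnedTasks.length}`; },
};
const request: ToolsCallRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: { name: "spawn_worker", arguments: { task: "fix the parser" } },
};
const response = handleToolsCall(request, scope);
```

The point of the shape is that the harness sees an ordinary MCP tool while the handler lands on the same spawn verb an in-process driver would call.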

* feat(bench): prove a coding harness drives the Scope via the coordination MCP (live)

opencode (glm-5-turbo via cli-bridge) mounts mcp.mcpServers.coordination (type:http →
opencode remote) and calls spawn_worker itself → real Scope.spawn → worker settles, and
reads back the await_next result. The in-box driving path is REAL — a coding agent drives
recursion as a native tool, not emulated. (Bridge wants mcp type:'http', not 'remote'.)

* feat(bench): WHOLE real e2e — opencode supervisor drives opencode workers via the coordination MCP, real test gates delivery

Live, no mock: the opencode supervisor (glm-5-turbo via cli-bridge) mounts the coordination
MCP, authors worker profiles, calls spawn_worker -> real Scope.spawn -> real opencode workers
code in a cwd -> python3 test gates valid -> supervisor settles on the delivered worker ->
winner. The completion-oracle (deployable check, not LLM judge) decided delivery even though the
supervisor was confused because it could not see the workers' isolated cwds (→ shared Workspace next).

Proof artifact for the in-box-driving path; the law-compliant productionization is a substrate
backend (tmux/bridge/sandbox) that runs authored profiles — not this harness-specific script.

* docs(canonical-api): the AgentProfile law — author the profile, the substrate materializes it

§1.5 + decision-table rows + CLAUDE.md §0 pointer. The thing we keep forgetting: an agent IS
its full AgentProfile (prompt+skills+tools/mcp+subagents+hooks+permissions+model), not a prompt;
change behavior by AUTHORING the profile and letting the sandbox substrate materialize it into
harness shapes — never write a verify-loop or harness-specific config (self-verification is a
hook/process; opencode is only the cli-bridge test target; a missing lever is a substrate gap).
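
A rough way to picture the law is a profile value that is authored and then materialized by the substrate. The field names below follow the list in the commit message, but their types, the `withHook` helper, and `materializeForHarness` are guesses made for illustration and are not the real AgentProfile or substrate API.

```typescript
// Field names follow the commit's list; their types are assumptions for this sketch.
type AgentProfileSketch = {
  prompt: string;
  skills: string[];
  mcp: Record<string, { type: "http"; url: string }>;
  subagents: string[];
  hooks: string[];
  permissions: string[];
  model: string;
};

// Behavior changes by authoring a new profile, not by editing harness config.
function withHook(profile: AgentProfileSketch, hook: string): AgentProfileSketch {
  return { ...profile, hooks: [...profile.hooks, hook] };
}

// Hypothetical materializer: the substrate, not the author, owns the harness shape.
function materializeForHarness(profile: AgentProfileSketch): string {
  return JSON.stringify({
    model: profile.model,
    system: profile.prompt,
    mcp: { mcpServers: profile.mcp },
    hooks: profile.hooks,
  });
}

const baseProfile: AgentProfileSketch = {
  prompt: "Implement the requested function and keep the tests passing.",
  skills: [],
  mcp: { coordination: { type: "http", url: "http://127.0.0.1:0/mcp" } }, // placeholder URL
  subagents: [],
  hooks: [],
  permissions: ["read", "write"],
  model: "example-model",
};
// Self-verification is expressed as a hook on the profile, not as a hand-written verify loop.
const verifying = withHook(baseProfile, "run-tests-before-settle");
const harnessConfig = JSON.parse(materializeForHarness(verifying));
```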

* docs(research): consolidate docs/research 28→14 — retire shipped/subsumed design docs

Retired 14 design-research docs whose content is now shipped code, in .evolve/current.json, or
self-declared subsumed/retracted (the recursion atom shipped; the optimization-space layer
evidence landed; verdicts reached). Refreshed the research index, recorded the retirement +
rationale in deletion-ledger.md (Pass 2), and fixed every inbound link (top index, the
harvest-corpus.ts comment → current.json, optimization-space's suite links). Kept the SSOT
masterplan, the canonical-referenced maps (optimization-space/leapfrog), the two gated belief
specs, the postmortem guardrail, the build-lists, and the agent-lab tombstones. No broken links
into the 14 remain from any canonical doc or src/.

* docs(canonical-api): substrate pin 0.89 → 0.92 (matches the merged package.json)