tangle-network · drewstone · Jun 6, 2026 · Jun 6, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -23,7 +23,7 @@ agent-knowledge ─┐
 agent-runtime ───┘   (this repo — wraps the substrate)
 ```
 
-**Rule: agent-runtime depends on agent-eval. agent-eval MUST NOT import from agent-runtime.** No upward imports, no `peerDependencies` in agent-eval pointing here, no `import type { X } from '@tangle-network/agent-runtime'` inside agent-eval. A spotted upward import is a bug — file an issue and move the type into agent-eval. agent-eval is declared a **required `peerDependency`** (pinned `^0.76.0`), not a hard dependency — keep it in sync with the `optimizePrompt`/`heldoutSignificance`/`loopDispatch` APIs the code uses.
+**Rule: agent-runtime depends on agent-eval. agent-eval MUST NOT import from agent-runtime.** No upward imports, no `peerDependencies` in agent-eval pointing here, no `import type { X } from '@tangle-network/agent-runtime'` inside agent-eval. A spotted upward import is a bug — file an issue and move the type into agent-eval. agent-eval is declared a **required `peerDependency`** (floor `>=0.83.0` — `selfImprove` exposes `analyzeGeneration` from 0.83), not a hard dependency — keep it in sync with the `selfImprove`/`heldoutSignificance`/`loopDispatch` APIs the code uses.
 
 Substrate primitives CONSUMED from agent-eval: `DefaultVerdict`, `RunRecord`, `AgentEvalError` + taxonomy, `AnalystFinding`/`AnalystRunResult`/`FindingsDiff`, `TraceAnalystKindSpec`, `KnowledgeReadinessReport`, and the campaign types (`DispatchContext`/`ProfileDispatchFn`/`Scenario`, type-only).
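
The no-upward-imports rule above can be checked mechanically. The following is a minimal sketch, not tooling that ships in either repo: `findUpwardImports` and `FORBIDDEN_PACKAGE` are hypothetical names, and the check is a plain regex scan over source text rather than a real module-graph analysis.

```ts
// Hypothetical lint helper (an assumption, not part of agent-eval or agent-runtime):
// flags any import of the runtime package found in a source string.
const FORBIDDEN_PACKAGE = '@tangle-network/agent-runtime'

// Returns every forbidden specifier found, including subpaths such as
// '@tangle-network/agent-runtime/loops'. Covers `import ... from`,
// `import type ... from`, side-effect imports, and `require(...)`.
function findUpwardImports(source: string): string[] {
  // Escape regex metacharacters in the package name before embedding it.
  const escaped = FORBIDDEN_PACKAGE.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  const pattern = new RegExp(
    `(?:from\\s+|import\\s+|require\\(\\s*)['"](${escaped}(?:/[^'"]*)?)['"]`,
    'g',
  )
  const found: string[] = []
  let match: RegExpExecArray | null
  while ((match = pattern.exec(source)) !== null) {
    found.push(match[1])
  }
  return found
}
```

Run over each file in agent-eval, a non-empty result is the "spotted upward import" the rule describes. A regex scan can also match text inside comments or strings, so treat hits as candidates to review.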
 

diff --git a/README.md b/README.md
@@ -58,7 +58,7 @@ That is the common case. Everything below is for when one chat turn is not enoug
 | Delegate a disciplined loop by mode (code, research, ...) | `runDelegatedLoop` or `agent-runtime-loop` | root |
 | Build code reliably (reviewed, gated) | `createDefaultCoderDelegate` | `/mcp` |
 | Grow a knowledge base with only grounded facts | `createKbGate` | `/mcp` |
-| Improve a prompt safely (identity-gated) | `optimizePrompt` | `/improvement` |
+| Improve a prompt safely (identity-gated) | `selfImprove` | `@tangle-network/agent-eval/contract` |
 | Ship loop traces to a GenAI viewer | `buildLoopOtelSpans` plus `createOtelExporter` | root |
 | Expose delegation as MCP tools to a sandbox agent | `createMcpServer` or `agent-runtime-mcp` | `/mcp` |
 | Mutate surfaces from trace findings | `runAnalystLoop` | `/analyst-loop` |
@@ -90,21 +90,25 @@ Shipped drivers (`/loops/drivers`): `createRefineDriver` (single task, iterate u
 
 The same machinery, run at the optimization timescale.
 
-`optimizePrompt` (`/improvement`) optimizes any text prompt over agent-eval's `runImprovementLoop`, identity-gated by construction. It runs evals, proposes candidates (default `gepaDriver`), and a held-out gate compares candidate against baseline. `result.prompt` is the baseline unless the gate decided `ship`, so registering a prompt for optimization can never regress it.
+The one entry point is agent-eval's **`selfImprove`** (`@tangle-network/agent-eval/contract`). It runs a closed loop over any text/config surface, identity-gated by construction: it evaluates, proposes candidates (default `gepaDriver`), and a held-out gate ships a winner only if it beats the baseline. `result.winner.surface` is the baseline unless `result.gateDecision === 'ship'`, so registering a surface for optimization can never regress it.
 
 ```ts
-import { optimizePrompt } from '@tangle-network/agent-runtime/improvement'
-
-const { prompt, improved, delta } = await optimizePrompt({
-  baselinePrompt: CURRENT_SYSTEM_PROMPT,
-  runWithPrompt: (candidate, scenario, ctx) => runYourThing(candidate, scenario),
-  scenarios, holdoutScenarios, judges, runDir,
-  reflection: { llm, model: 'claude-sonnet-4-6' },
+import { selfImprove } from '@tangle-network/agent-eval/contract'
+
+const result = await selfImprove({
+  baselineSurface: CURRENT_SYSTEM_PROMPT,
+  agent: (surface, scenario, ctx) => runYourThing(surface, scenario),
+  scenarios,
+  judge,
+  budget: { holdoutScenarios, generations: 3 },
+  llm: { baseUrl, apiKey, model: 'claude-sonnet-4-6' },
 })
-// assign `prompt` unconditionally; it is the safe one
+// result.winner.surface is the safe one — the baseline unless gateDecision === 'ship'
 ```
 
-`runAnalystLoop` (`/analyst-loop`) mines real run traces into findings; `createAnalystDriverHook` feeds those findings to a dynamic-driver planner via `PlannerContext.analyses`, with a firewall (`assertTraceDerivedFindings`) that rejects any finding derived from a judge verdict. `reportOptimizationRun` (`/improvement`) ships an optimization run's proposal and verdict to Tangle Intelligence over the eval-run wire.
+agent-runtime contributes the runtime-specific piece: the **CODE-surface `improvementDriver`** (`/improvement`) — a git-worktree mutator you pass to `selfImprove` as `driver` to optimize code instead of a string.
+
+`runAnalystLoop` (`/analyst-loop`) mines real run traces into findings; `createAnalystDriverHook` feeds those findings to a dynamic-driver planner via `PlannerContext.analyses`, with a firewall (`assertTraceDerivedFindings`) that rejects any finding derived from a judge verdict. Production intake — turning real run traces into the corpus `selfImprove` optimizes against — is agent-eval's `analyzeRuns` / `partitionRunsByAuthoringModel` (`/contract`).
 
 ## Delegated loops
 
@@ -170,17 +174,18 @@ One entrypoint, `runExperiment(adapter, { sandboxClient, agentRun, arms, ... })`
 | Driver | none, required by `runLoop` | `createRefineDriver`, `createFanoutVoteDriver`, `createDynamicDriver` |
 | Winner selection (coder delegate) | `highest-score` | `winnerSelection` option |
 | KB gate min passage | 12 chars | `createKbGate({ minPassageChars })` |
-| `optimizePrompt` gate | `heldOutGate` | `defaultProductionGate` for red-team hardening |
+| `selfImprove` gate | held-out gate (default) | pass `gate: defaultProductionGate` for red-team hardening |
 | OTEL export | off | set `OTEL_EXPORTER_OTLP_ENDPOINT` |
 | Loop-runner mode failure | recorded as `{ ok: false }` | `runDelegatedLoop` never crashes on a thrown engine |
 
 ## Composition with the stack
 
 ```
 agent-runtime   handleChatTurn, runLoop + drivers, runProgram, runDelegatedLoop, createMcpServer,
-                optimizePrompt, createKbGate, buildLoopOtelSpans, defineAgent
+                improvementDriver, createKbGate, buildLoopOtelSpans, defineAgent
 
-agent-eval      runEvalCampaign, runImprovementLoop (gepaDriver), heldOutGate, runAgentMatrix.
+agent-eval      selfImprove (the optimization entry point), runEvalCampaign,
+                runImprovementLoop (gepaDriver), heldOutGate, runAgentMatrix, analyzeRuns.
                 Consumes runtime traces, scores, gates promotion. agent-runtime depends on it,
                 never the reverse.
 
@@ -200,15 +205,15 @@ sandbox         AgentProfile, Sandbox.create, streamPrompt, exportTraceBundle. T
 | `.../loops` | the `runLoop` kernel, the `refine` / `fanout-vote` / `dynamic` drivers, `runProgram`, `loopDispatch` |
 | `.../profiles` | `coderProfile`, `researcherProfile` presets |
 | `.../mcp` | `createMcpServer`, `createDefaultCoderDelegate`, `createKbGate`, the `agent-runtime-mcp` bin |
-| `.../improvement` | `optimizePrompt` (text), `improvementDriver` (code/worktree), `reportOptimizationRun` |
+| `.../improvement` | `improvementDriver` (code/worktree `CandidateGenerator`), `agenticGenerator`, `reflectiveGenerator` — the code-surface driver you pass to agent-eval's `selfImprove` |
 | `.../analyst-loop` | `runAnalystLoop`, the analyst registry driver |
 | `.../platform` | cross-site SSO and the integrations hub |
 
 Bins: `agent-runtime-mcp` (delegation MCP server), `agent-runtime-loop` (schedulable delegated loop-runner).
 
 ## Adoption skill
 
-This package ships a self-contained adoption skill at [`skills/agent-runtime-adoption/SKILL.md`](./skills/agent-runtime-adoption/SKILL.md): driven loops, topology drivers, the `loopDispatch` campaign bridge, MCP delegation, and identity-gated `optimizePrompt`. It needs only this package plus `@tangle-network/agent-eval`. For the full self-improving pipeline (trace sink, analyst loop, scorecard, production loop, CI), see the `agent-eval-adoption` and `agent-stack-adoption` skills.
+This package ships a self-contained adoption skill at [`skills/agent-runtime-adoption/SKILL.md`](./skills/agent-runtime-adoption/SKILL.md): driven loops, topology drivers, the `loopDispatch` campaign bridge, MCP delegation, and the code-surface `improvementDriver` for agent-eval's `selfImprove`. It needs only this package plus `@tangle-network/agent-eval`. For the full self-improving pipeline (trace sink, analyst loop, scorecard, production loop, CI), see the `agent-eval-adoption` and `agent-stack-adoption` skills.
 
 ## Stability, tests, docs
 

diff --git a/bench/HARNESS.md b/bench/HARNESS.md
@@ -3,7 +3,7 @@
 If you're an agent picking this up: read this page, then run `pnpm help` + `pnpm gate` —
 do NOT re-derive the harness from source. This map is SHORT on purpose; if it disagrees
 with the code, the code wins — fix this page in the same turn (the anti-rediscovery law).
-Verified against source 2026-06-03 · agent-eval pinned `^0.76.0` (the optimizePrompt /
+Verified against source 2026-06-06 · agent-eval peer range `>=0.83.0 <1.0.0` (the selfImprove /
 heldoutSignificance API is version-coupled).
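
Because the API is version-coupled, the exact range matters. package.json in this change declares `>=0.83.0 <1.0.0`, which is wider than a caret range on the same floor: under npm semver rules `^0.83.0` means `>=0.83.0 <0.84.0`. A minimal sketch of the difference, assuming plain `x.y.z` versions with no prerelease tags (all function names here are hypothetical):

```ts
type Version = [number, number, number]

// Parse a plain x.y.z version; prerelease and build tags are out of scope for this sketch.
function parseVersion(v: string): Version {
  const parts = v.split('.').map(Number)
  if (parts.length !== 3 || parts.some((n) => !Number.isInteger(n) || n < 0)) {
    throw new Error(`not a plain x.y.z version: ${v}`)
  }
  return [parts[0], parts[1], parts[2]]
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: Version, b: Version): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i]
  }
  return 0
}

// The range declared in package.json: >=0.83.0 <1.0.0
function inPeerRange(v: string): boolean {
  const x = parseVersion(v)
  return compareVersions(x, [0, 83, 0]) >= 0 && compareVersions(x, [1, 0, 0]) < 0
}

// Caret semantics for a 0.y.z floor with y > 0: ^0.83.0 means >=0.83.0 <0.84.0
function inCaretRange(v: string): boolean {
  const x = parseVersion(v)
  return compareVersions(x, [0, 83, 0]) >= 0 && compareVersions(x, [0, 84, 0]) < 0
}
```

For example, `0.90.2` satisfies the declared peer range but not the caret range, so the two notations are not interchangeable when describing the constraint.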
 
 ## What this harness answers
@@ -71,7 +71,7 @@ run.ts:  help · preflight · verify-judge · solve-one · solve-one-local · so
 standalone tools (NOT in run.ts — the gate lives here):
   corpus-replay.mts  --selector: selector@k vs random@k vs oracle@k over a corpus (THE offline gate)
   corpus-report.mts  paired-bootstrap CI + Benjamini-Hochberg over corpora
-  improve-prompt.ts     GEPA-optimize a directive vs a held-out gate + paired CI (optimizePrompt)
+  improve-prompt.ts     GEPA-optimize a directive vs a held-out gate + paired CI (selfImprove)
   finsearch-loop.ts  the real runLoop+createDynamicDriver closed loop on FinSearchComp
   terminal-compare.ts  Terminal-Bench compare (own main, not in run.ts)
 unit tests (the only fully-green, cred-free runnable surface besides offline replay):

diff --git a/bench/src/improve-prompt.ts b/bench/src/improve-prompt.ts
@@ -589,7 +589,7 @@ async function main() {
   console.log(`  ► held-out delta:            ${(result.lift * 100).toFixed(1)} pp`)
   console.log(`  gate decision: ${result.gateDecision} (improved=${improved})`)
 
-  // 0.76 heldoutSignificance: a bootstrap CI on the PAIRED winner−baseline held-out
+  // heldoutSignificance: a bootstrap CI on the PAIRED winner−baseline held-out
   // delta — turns a bare "+X pp" (a few-instance swing at thin n) into a CI + a
   // significance verdict, so we know whether to trust/promote or just scale n.
   try {
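
The comment above describes a bootstrap CI on the paired winner minus baseline held-out delta. The sketch below illustrates that general technique (a percentile bootstrap over per-scenario deltas). It is not agent-eval's `heldoutSignificance` implementation, whose signature and internals are not shown in this diff; `pairedBootstrapCi`, `PairedCi`, and the seeded `mulberry32` generator are assumptions made for illustration.

```ts
// Small seeded PRNG (mulberry32) so resampling is reproducible in this sketch.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface PairedCi {
  meanDelta: number
  low: number // 2.5th percentile of resampled mean deltas
  high: number // 97.5th percentile of resampled mean deltas
  significant: boolean // true when the 95% interval excludes zero
}

// baseline[i] and winner[i] must be scores for the SAME held-out scenario i.
function pairedBootstrapCi(
  baseline: number[],
  winner: number[],
  resamples = 2000,
  seed = 1,
): PairedCi {
  if (baseline.length !== winner.length || baseline.length === 0) {
    throw new Error('paired scores must be non-empty and the same length')
  }
  const deltas = winner.map((w, i) => w - baseline[i])
  const n = deltas.length
  const meanDelta = deltas.reduce((s, d) => s + d, 0) / n
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let r = 0; r < resamples; r++) {
    let sum = 0
    // Resample scenario deltas with replacement, keeping each pair intact.
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)]
    means.push(sum / n)
  }
  means.sort((x, y) => x - y)
  const low = means[Math.floor(0.025 * (resamples - 1))]
  const high = means[Math.ceil(0.975 * (resamples - 1))]
  return { meanDelta, low, high, significant: low > 0 || high < 0 }
}
```

Resampling the deltas rather than the two score lists independently is what makes the interval paired: scenario difficulty cancels out, which matters at thin n.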

diff --git a/package.json b/package.json
@@ -124,7 +124,7 @@
   "license": "MIT",
   "packageManager": "pnpm@10.28.0",
   "peerDependencies": {
-    "@tangle-network/agent-eval": ">=0.76.0 <1.0.0",
+    "@tangle-network/agent-eval": ">=0.83.0 <1.0.0",
     "@tangle-network/agent-knowledge": ">=1.3.0 <2.0.0",
     "@tangle-network/sandbox": ">=0.1.2 <0.5.0",
     "playwright": "^1.40.0"

diff --git a/skills/agent-runtime-adoption/SKILL.md b/skills/agent-runtime-adoption/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: agent-runtime-adoption
-description: Adopt @tangle-network/agent-runtime in a product — the driven-loop kernel (runLoop), topology drivers (refine / fanout-vote / dynamic agent-authored), the loopDispatch campaign bridge, MCP delegation, and identity-gated prompt-surface optimization (optimizePrompt). Self-contained; needs only the published package + @tangle-network/agent-eval. Use when wiring runLoop, choosing a topology driver, optimizing a system/planner prompt, or exposing delegation tools.
+description: Adopt @tangle-network/agent-runtime in a product — the driven-loop kernel (runLoop), topology drivers (refine / fanout-vote / dynamic agent-authored), the loopDispatch campaign bridge, MCP delegation, and the code-surface improvementDriver for agent-eval's selfImprove (the optimization entry point). Self-contained; needs only the published package + @tangle-network/agent-eval. Use when wiring runLoop, choosing a topology driver, optimizing a system/planner prompt or code surface, or exposing delegation tools.
 ---
 
 # agent-runtime adoption — driven loops, topology drivers, prompt optimization
@@ -106,43 +106,46 @@ const dispatch = loopCampaignDispatch({
 
 `loopDispatch` is the `runProfileMatrix` variant (profile is an axis).
 
-## Identity-gated prompt optimization — `optimizePrompt`
+## Identity-gated optimization — agent-eval's `selfImprove`
 
-`@tangle-network/agent-runtime/improvement`. The text-surface entry point onto
-agent-eval's `runImprovementLoop` — sibling to `improvementDriver` (the
-code/worktree path). Optimizes any prompt surface (system / planner / judge
-rubric) and is **identity-gated by construction**: it runs evals, proposes
-candidates (default driver `gepaDriver`), and the held-out gate compares
-candidate vs baseline. `result.prompt` is the **baseline unless the gate decided
-`'ship'`** — so registering a prompt for optimization can never regress it; it
-only improves when held-out data earns it.
+The optimization entry point is **`selfImprove`** (`@tangle-network/agent-eval/contract`),
+which lives in agent-eval, not agent-runtime. agent-runtime contributes the code-surface
+`improvementDriver` (`/improvement`, the git-worktree path), which you pass to `selfImprove`
+as `driver` to optimize CODE instead of a string. `selfImprove` optimizes any text/config surface (system /
+planner / judge rubric) and is **identity-gated by construction**: it runs evals,
+proposes candidates (default driver `gepaDriver`), and a held-out gate ships a winner
+only if it beats the baseline. `result.winner.surface` is the **baseline unless
+`result.gateDecision === 'ship'`** — so registering a surface for optimization can
+never regress it; it only improves when held-out data earns it.
 
 ```ts
-import { optimizePrompt } from '@tangle-network/agent-runtime/improvement'
-const { prompt, improved, decision, delta } = await optimizePrompt({
-  baselinePrompt: CURRENT_SYSTEM_PROMPT,
-  runWithPrompt: (prompt, scenario, ctx) => runYourThing(prompt, scenario),  // sandbox / runLoop / direct call
-  scenarios, holdoutScenarios, judges, runDir,
-  reflection: { llm, model: REFLECTION_MODEL },   // builds the default gepaDriver
-  // gate? — defaults to heldOutGate; pass defaultProductionGate for red-team hardening
+import { selfImprove } from '@tangle-network/agent-eval/contract'
+const result = await selfImprove({
+  baselineSurface: CURRENT_SYSTEM_PROMPT,
+  agent: (surface, scenario, ctx) => runYourThing(surface, scenario),  // sandbox / runLoop / direct call
+  scenarios,
+  judge,
+  budget: { holdoutScenarios, generations: 3, populationSize: 2 },
+  llm: { baseUrl, apiKey, model: REFLECTION_MODEL },   // drives the default gepaDriver
+  // driver? — pass agent-runtime's improvementDriver to optimize CODE (worktree) instead of a string
+  // gate?   — defaults to a held-out gate; pass defaultProductionGate for red-team hardening
 })
-// use `prompt` unconditionally: it's the baseline until a candidate genuinely wins
+// use result.winner.surface unconditionally: it's the baseline until a candidate genuinely wins
 ```
 
-### optimizePrompt gotchas — read before wiring
+### selfImprove gotchas — read before wiring
 
 - **`gepaDriver` mutates TEXT only**, and its only structural guard is `##` H2
   headings (`preserveSections`) + `maxSentenceEdits`. Make load-bearing sections
   of your prompt real `##` headings, and treat the output schema as fixed code —
   GEPA optimizes the prose, never the envelope/contract.
 - **Scenarios must be domain-real.** Derive them from the surface's own traces /
   ground truth, not from unrelated corpora. Cross-domain examples are noise.
-- **Extend, don't fork.** If the product already wires `runImprovementLoop`
-  (e.g. for a main-agent prompt), add the new surface as another target in that
-  harness rather than bolting on a second optimizer.
-- `runWithPrompt` is the only domain seam — the optimizer never assumes how a
-  prompt runs. Report cost via `ctx.cost` inside it so the integrity guard sees
-  real activity.
+- **Extend, don't fork.** If the product already wires `selfImprove` /
+  `runImprovementLoop` (e.g. for a main-agent prompt), add the new surface as
+  another target in that harness rather than bolting on a second optimizer.
+- `agent` is the only domain seam — the optimizer never assumes how a surface
+  runs. Report cost via `ctx.cost` inside it so the integrity guard sees real activity.
 - A live run needs a real backend (`TANGLE_API_KEY` / router, or local
   cli-bridge) and real spend; it is not free.
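
Since the `##` headings are the only structural guard on a text surface, it can help to assert after a run that the winning prompt still carries every load-bearing section. This is a minimal sketch of such a post-hoc check, not how `gepaDriver` or `preserveSections` is implemented; `h2Headings` and `missingSections` are hypothetical helpers, and the scan does not skip headings that appear inside fenced code.

```ts
// Collect the text of every level-2 markdown heading, in document order.
// '### Sub' is not matched because '##' must be followed by whitespace.
function h2Headings(prompt: string): string[] {
  return prompt
    .split('\n')
    .filter((line) => /^##\s+\S/.test(line))
    .map((line) => line.replace(/^##\s+/, '').trim())
}

// Returns the baseline headings that the candidate dropped, renamed, or demoted.
function missingSections(baseline: string, candidate: string): string[] {
  const kept = new Set(h2Headings(candidate))
  return h2Headings(baseline).filter((heading) => !kept.has(heading))
}
```

An empty result means every baseline `##` section survived; anything else names the sections to inspect before adopting the surface.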
 
@@ -161,7 +164,7 @@ Mount it on a production `AgentProfile.mcp`; do not re-implement delegation.
       `loops/types.ts:Driver` only when none fit — never fork the kernel.
 - [ ] `runLoop` is bridged to campaigns via `loopDispatch` / `loopCampaignDispatch`
       (usage + trace auto-forwarded), not a hand-rolled ExecCtx.
-- [ ] Every optimizable prompt is registered through `optimizePrompt` (or the
+- [ ] Every optimizable prompt is registered through `selfImprove` (or the
       product's existing `runImprovementLoop`), identity-gated on a held-out set.
 - [ ] Boundaries fail loud: no `null` sandbox client, no silent adapter return,
       no unguarded planner envelope.
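
The "identity-gated on a held-out set" guarantee in the checklist comes down to one selection rule: return the baseline unless a candidate strictly beats it on held-out data. The sketch below illustrates that rule under stated assumptions. It is not `selfImprove`'s implementation: the field names `winner`, `surface`, `gateDecision`, and `lift` mirror what this change shows on the result, while `identityGate`, `Candidate`, `heldOutScore`, `minLift`, and the `'hold'` value are hypothetical.

```ts
type GateDecision = 'ship' | 'hold' // 'hold' is an assumed name for the non-ship outcome

interface Candidate<S> {
  surface: S
  heldOutScore: number // mean judge score on the held-out scenarios
}

interface GateResult<S> {
  winner: Candidate<S>
  gateDecision: GateDecision
  lift: number // winner minus baseline on held-out; 0 when the baseline is kept
}

// Identity gate: the baseline is the default answer, so the result can never be
// worse than the baseline on the held-out set.
function identityGate<S>(
  baseline: Candidate<S>,
  candidates: Candidate<S>[],
  minLift = 0,
): GateResult<S> {
  let best = baseline
  for (const candidate of candidates) {
    if (candidate.heldOutScore > best.heldOutScore) best = candidate
  }
  const lift = best.heldOutScore - baseline.heldOutScore
  if (best !== baseline && lift > minLift) {
    return { winner: best, gateDecision: 'ship', lift }
  }
  return { winner: baseline, gateDecision: 'hold', lift: 0 }
}
```

This is why the winning surface can be assigned unconditionally: when no candidate clears the gate, the value handed back is the baseline itself.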