diff --git a/CLAUDE.md b/CLAUDE.md index a1449e11..2b42fe89 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -23,7 +23,7 @@ agent-knowledge ─┐ agent-runtime ───┘ (this repo — wraps the substrate) ``` -**Rule: agent-runtime depends on agent-eval. agent-eval MUST NOT import from agent-runtime.** No upward imports, no `peerDependencies` in agent-eval pointing here, no `import type { X } from '@tangle-network/agent-runtime'` inside agent-eval. A spotted upward import is a bug — file an issue and move the type into agent-eval. agent-eval is declared a **required `peerDependency`** (pinned `^0.76.0`), not a hard dependency — keep it in sync with the `optimizePrompt`/`heldoutSignificance`/`loopDispatch` APIs the code uses. +**Rule: agent-runtime depends on agent-eval. agent-eval MUST NOT import from agent-runtime.** No upward imports, no `peerDependencies` in agent-eval pointing here, no `import type { X } from '@tangle-network/agent-runtime'` inside agent-eval. A spotted upward import is a bug — file an issue and move the type into agent-eval. agent-eval is declared a **required `peerDependency`** (floor `>=0.83.0` — `selfImprove` exposes `analyzeGeneration` from 0.83), not a hard dependency — keep it in sync with the `selfImprove`/`heldoutSignificance`/`loopDispatch` APIs the code uses. Substrate primitives CONSUMED from agent-eval: `DefaultVerdict`, `RunRecord`, `AgentEvalError` + taxonomy, `AnalystFinding`/`AnalystRunResult`/`FindingsDiff`, `TraceAnalystKindSpec`, `KnowledgeReadinessReport`, and the campaign types (`DispatchContext`/`ProfileDispatchFn`/`Scenario`, type-only). diff --git a/README.md b/README.md index eb1c5fd7..b5e51c67 100644 --- a/README.md +++ b/README.md @@ -58,7 +58,7 @@ That is the common case. Everything below is for when one chat turn is not enoug | Delegate a disciplined loop by mode (code, research, ...) | `runDelegatedLoop` or `agent-runtime-loop` | root | | Build code reliably (reviewed, gated) | `createDefaultCoderDelegate` | `/mcp` | | Grow a knowledge base with only grounded facts | `createKbGate` | `/mcp` | -| Improve a prompt safely (identity-gated) | `optimizePrompt` | `/improvement` | +| Improve a prompt safely (identity-gated) | `selfImprove` | `@tangle-network/agent-eval/contract` | | Ship loop traces to a GenAI viewer | `buildLoopOtelSpans` plus `createOtelExporter` | root | | Expose delegation as MCP tools to a sandbox agent | `createMcpServer` or `agent-runtime-mcp` | `/mcp` | | Mutate surfaces from trace findings | `runAnalystLoop` | `/analyst-loop` | @@ -90,21 +90,25 @@ Shipped drivers (`/loops/drivers`): `createRefineDriver` (single task, iterate u The same machinery, run at the optimization timescale. -`optimizePrompt` (`/improvement`) optimizes any text prompt over agent-eval's `runImprovementLoop`, identity-gated by construction. It runs evals, proposes candidates (default `gepaDriver`), and a held-out gate compares candidate against baseline. `result.prompt` is the baseline unless the gate decided `ship`, so registering a prompt for optimization can never regress it. +The one entry point is agent-eval's **`selfImprove`** (`@tangle-network/agent-eval/contract`). It runs a closed loop over any text/config surface, identity-gated by construction: it evaluates, proposes candidates (default `gepaDriver`), and a held-out gate ships a winner only if it beats the baseline. `result.winner.surface` is the baseline unless `result.gateDecision === 'ship'`, so registering a surface for optimization can never regress it. ```ts -import { optimizePrompt } from '@tangle-network/agent-runtime/improvement' - -const { prompt, improved, delta } = await optimizePrompt({ - baselinePrompt: CURRENT_SYSTEM_PROMPT, - runWithPrompt: (candidate, scenario, ctx) => runYourThing(candidate, scenario), - scenarios, holdoutScenarios, judges, runDir, - reflection: { llm, model: 'claude-sonnet-4-6' }, +import { selfImprove } from '@tangle-network/agent-eval/contract' + +const result = await selfImprove({ + baselineSurface: CURRENT_SYSTEM_PROMPT, + agent: (surface, scenario, ctx) => runYourThing(surface, scenario), + scenarios, + judge, + budget: { holdoutScenarios, generations: 3 }, + llm: { baseUrl, apiKey, model: 'claude-sonnet-4-6' }, }) -// assign `prompt` unconditionally; it is the safe one +// result.winner.surface is the safe one — the baseline unless gateDecision === 'ship' ``` -`runAnalystLoop` (`/analyst-loop`) mines real run traces into findings; `createAnalystDriverHook` feeds those findings to a dynamic-driver planner via `PlannerContext.analyses`, with a firewall (`assertTraceDerivedFindings`) that rejects any finding derived from a judge verdict. `reportOptimizationRun` (`/improvement`) ships an optimization run's proposal and verdict to Tangle Intelligence over the eval-run wire. +agent-runtime contributes the runtime-specific piece: the **CODE-surface `improvementDriver`** (`/improvement`) — a git-worktree mutator you pass to `selfImprove` as `driver` to optimize code instead of a string. + +`runAnalystLoop` (`/analyst-loop`) mines real run traces into findings; `createAnalystDriverHook` feeds those findings to a dynamic-driver planner via `PlannerContext.analyses`, with a firewall (`assertTraceDerivedFindings`) that rejects any finding derived from a judge verdict. Production intake — turning real run traces into the corpus `selfImprove` optimizes against — is agent-eval's `analyzeRuns` / `partitionRunsByAuthoringModel` (`/contract`). ## Delegated loops @@ -170,7 +174,7 @@ One entrypoint, `runExperiment(adapter, { sandboxClient, agentRun, arms, ... })` | Driver | none, required by `runLoop` | `createRefineDriver`, `createFanoutVoteDriver`, `createDynamicDriver` | | Winner selection (coder delegate) | `highest-score` | `winnerSelection` option | | KB gate min passage | 12 chars | `createKbGate({ minPassageChars })` | -| `optimizePrompt` gate | `heldOutGate` | `defaultProductionGate` for red-team hardening | +| `selfImprove` gate | held-out gate (default) | pass `gate: defaultProductionGate` for red-team hardening | | OTEL export | off | set `OTEL_EXPORTER_OTLP_ENDPOINT` | | Loop-runner mode failure | recorded as `{ ok: false }` | `runDelegatedLoop` never crashes on a thrown engine | @@ -178,9 +182,10 @@ One entrypoint, `runExperiment(adapter, { sandboxClient, agentRun, arms, ... })` ``` agent-runtime handleChatTurn, runLoop + drivers, runProgram, runDelegatedLoop, createMcpServer, - optimizePrompt, createKbGate, buildLoopOtelSpans, defineAgent + improvementDriver, createKbGate, buildLoopOtelSpans, defineAgent -agent-eval runEvalCampaign, runImprovementLoop (gepaDriver), heldOutGate, runAgentMatrix. +agent-eval selfImprove (the optimization entry point), runEvalCampaign, + runImprovementLoop (gepaDriver), heldOutGate, runAgentMatrix, analyzeRuns. Consumes runtime traces, scores, gates promotion. agent-runtime depends on it, never the reverse. @@ -200,7 +205,7 @@ sandbox AgentProfile, Sandbox.create, streamPrompt, exportTraceBundle. T | `.../loops` | the `runLoop` kernel, the `refine` / `fanout-vote` / `dynamic` drivers, `runProgram`, `loopDispatch` | | `.../profiles` | `coderProfile`, `researcherProfile` presets | | `.../mcp` | `createMcpServer`, `createDefaultCoderDelegate`, `createKbGate`, the `agent-runtime-mcp` bin | -| `.../improvement` | `optimizePrompt` (text), `improvementDriver` (code/worktree), `reportOptimizationRun` | +| `.../improvement` | `improvementDriver` (code/worktree `CandidateGenerator`), `agenticGenerator`, `reflectiveGenerator` — the code-surface driver you pass to agent-eval's `selfImprove` | | `.../analyst-loop` | `runAnalystLoop`, the analyst registry driver | | `.../platform` | cross-site SSO and the integrations hub | @@ -208,7 +213,7 @@ Bins: `agent-runtime-mcp` (delegation MCP server), `agent-runtime-loop` (schedul ## Adoption skill -This package ships a self-contained adoption skill at [`skills/agent-runtime-adoption/SKILL.md`](./skills/agent-runtime-adoption/SKILL.md): driven loops, topology drivers, the `loopDispatch` campaign bridge, MCP delegation, and identity-gated `optimizePrompt`. It needs only this package plus `@tangle-network/agent-eval`. For the full self-improving pipeline (trace sink, analyst loop, scorecard, production loop, CI), see the `agent-eval-adoption` and `agent-stack-adoption` skills. +This package ships a self-contained adoption skill at [`skills/agent-runtime-adoption/SKILL.md`](./skills/agent-runtime-adoption/SKILL.md): driven loops, topology drivers, the `loopDispatch` campaign bridge, MCP delegation, and the code-surface `improvementDriver` for agent-eval's `selfImprove`. It needs only this package plus `@tangle-network/agent-eval`. For the full self-improving pipeline (trace sink, analyst loop, scorecard, production loop, CI), see the `agent-eval-adoption` and `agent-stack-adoption` skills. ## Stability, tests, docs diff --git a/bench/HARNESS.md b/bench/HARNESS.md index 6c02ff80..07d5bd2a 100644 --- a/bench/HARNESS.md +++ b/bench/HARNESS.md @@ -3,7 +3,7 @@ If you're an agent picking this up: read this page, then run `pnpm help` + `pnpm gate` — do NOT re-derive the harness from source. This map is SHORT on purpose; if it disagrees with the code, the code wins — fix this page in the same turn (the anti-rediscovery law). -Verified against source 2026-06-03 · agent-eval pinned `^0.76.0` (the optimizePrompt / +Verified against source 2026-06-06 · agent-eval pinned `^0.83.0` (the selfImprove / heldoutSignificance API is version-coupled). ## What this harness answers @@ -71,7 +71,7 @@ run.ts: help · preflight · verify-judge · solve-one · solve-one-local · so standalone tools (NOT in run.ts — the gate lives here): corpus-replay.mts --selector: selector@k vs random@k vs oracle@k over a corpus (THE offline gate) corpus-report.mts paired-bootstrap CI + Benjamini-Hochberg over corpora - improve-prompt.ts GEPA-optimize a directive vs a held-out gate + paired CI (optimizePrompt) + improve-prompt.ts GEPA-optimize a directive vs a held-out gate + paired CI (selfImprove) finsearch-loop.ts the real runLoop+createDynamicDriver closed loop on FinSearchComp terminal-compare.ts Terminal-Bench compare (own main, not in run.ts) unit tests (the only fully-green, cred-free runnable surface besides offline replay): diff --git a/bench/src/improve-prompt.ts b/bench/src/improve-prompt.ts index 6ca2e5e5..24a29cbe 100644 --- a/bench/src/improve-prompt.ts +++ b/bench/src/improve-prompt.ts @@ -589,7 +589,7 @@ async function main() { console.log(` ► held-out delta: ${(result.lift * 100).toFixed(1)} pp`) console.log(` gate decision: ${result.gateDecision} (improved=${improved})`) - // 0.76 heldoutSignificance: a bootstrap CI on the PAIRED winner−baseline held-out + // heldoutSignificance: a bootstrap CI on the PAIRED winner−baseline held-out // delta — turns a bare "+X pp" (a few-instance swing at thin n) into a CI + a // significance verdict, so we know whether to trust/promote or just scale n. try { diff --git a/package.json b/package.json index af226198..fcbe4ec6 100644 --- a/package.json +++ b/package.json @@ -124,7 +124,7 @@ "license": "MIT", "packageManager": "pnpm@10.28.0", "peerDependencies": { - "@tangle-network/agent-eval": ">=0.76.0 <1.0.0", + "@tangle-network/agent-eval": ">=0.83.0 <1.0.0", "@tangle-network/agent-knowledge": ">=1.3.0 <2.0.0", "@tangle-network/sandbox": ">=0.1.2 <0.5.0", "playwright": "^1.40.0" diff --git a/skills/agent-runtime-adoption/SKILL.md b/skills/agent-runtime-adoption/SKILL.md index 8730c858..42ae8ec2 100644 --- a/skills/agent-runtime-adoption/SKILL.md +++ b/skills/agent-runtime-adoption/SKILL.md @@ -1,6 +1,6 @@ --- name: agent-runtime-adoption -description: Adopt @tangle-network/agent-runtime in a product — the driven-loop kernel (runLoop), topology drivers (refine / fanout-vote / dynamic agent-authored), the loopDispatch campaign bridge, MCP delegation, and identity-gated prompt-surface optimization (optimizePrompt). Self-contained; needs only the published package + @tangle-network/agent-eval. Use when wiring runLoop, choosing a topology driver, optimizing a system/planner prompt, or exposing delegation tools. +description: Adopt @tangle-network/agent-runtime in a product — the driven-loop kernel (runLoop), topology drivers (refine / fanout-vote / dynamic agent-authored), the loopDispatch campaign bridge, MCP delegation, and the code-surface improvementDriver for agent-eval's selfImprove (the optimization entry point). Self-contained; needs only the published package + @tangle-network/agent-eval. Use when wiring runLoop, choosing a topology driver, optimizing a system/planner prompt or code surface, or exposing delegation tools. --- # agent-runtime adoption — driven loops, topology drivers, prompt optimization @@ -106,30 +106,34 @@ const dispatch = loopCampaignDispatch({ `loopDispatch` is the `runProfileMatrix` variant (profile is an axis). -## Identity-gated prompt optimization — `optimizePrompt` +## Identity-gated optimization — agent-eval's `selfImprove` -`@tangle-network/agent-runtime/improvement`. The text-surface entry point onto -agent-eval's `runImprovementLoop` — sibling to `improvementDriver` (the -code/worktree path). Optimizes any prompt surface (system / planner / judge -rubric) and is **identity-gated by construction**: it runs evals, proposes -candidates (default driver `gepaDriver`), and the held-out gate compares -candidate vs baseline. `result.prompt` is the **baseline unless the gate decided -`'ship'`** — so registering a prompt for optimization can never regress it; it -only improves when held-out data earns it. +The optimization entry point is **`selfImprove`** (`@tangle-network/agent-eval/contract`), +NOT agent-runtime — agent-runtime contributes the code-surface `improvementDriver` +(`/improvement`, the git-worktree path) you pass to it as `driver` to optimize CODE +instead of a string. `selfImprove` optimizes any text/config surface (system / +planner / judge rubric) and is **identity-gated by construction**: it runs evals, +proposes candidates (default driver `gepaDriver`), and a held-out gate ships a winner +only if it beats the baseline. `result.winner.surface` is the **baseline unless +`result.gateDecision === 'ship'`** — so registering a surface for optimization can +never regress it; it only improves when held-out data earns it. ```ts -import { optimizePrompt } from '@tangle-network/agent-runtime/improvement' -const { prompt, improved, decision, delta } = await optimizePrompt({ - baselinePrompt: CURRENT_SYSTEM_PROMPT, - runWithPrompt: (prompt, scenario, ctx) => runYourThing(prompt, scenario), // sandbox / runLoop / direct call - scenarios, holdoutScenarios, judges, runDir, - reflection: { llm, model: REFLECTION_MODEL }, // builds the default gepaDriver - // gate? — defaults to heldOutGate; pass defaultProductionGate for red-team hardening +import { selfImprove } from '@tangle-network/agent-eval/contract' +const result = await selfImprove({ + baselineSurface: CURRENT_SYSTEM_PROMPT, + agent: (surface, scenario, ctx) => runYourThing(surface, scenario), // sandbox / runLoop / direct call + scenarios, + judge, + budget: { holdoutScenarios, generations: 3, populationSize: 2 }, + llm: { baseUrl, apiKey, model: REFLECTION_MODEL }, // drives the default gepaDriver + // driver? — pass agent-runtime's improvementDriver to optimize CODE (worktree) instead of a string + // gate? — defaults to a held-out gate; pass defaultProductionGate for red-team hardening }) -// use `prompt` unconditionally: it's the baseline until a candidate genuinely wins +// use result.winner.surface unconditionally: it's the baseline until a candidate genuinely wins ``` -### optimizePrompt gotchas — read before wiring +### selfImprove gotchas — read before wiring - **`gepaDriver` mutates TEXT only**, and its only structural guard is `##` H2 headings (`preserveSections`) + `maxSentenceEdits`. Make load-bearing sections @@ -137,12 +141,11 @@ const { prompt, improved, decision, delta } = await optimizePrompt({ GEPA optimizes the prose, never the envelope/contract. - **Scenarios must be domain-real.** Derive them from the surface's own traces / ground truth, not from unrelated corpora. Cross-domain examples are noise. -- **Extend, don't fork.** If the product already wires `runImprovementLoop` - (e.g. for a main-agent prompt), add the new surface as another target in that - harness rather than bolting on a second optimizer. -- `runWithPrompt` is the only domain seam — the optimizer never assumes how a - prompt runs. Report cost via `ctx.cost` inside it so the integrity guard sees - real activity. +- **Extend, don't fork.** If the product already wires `selfImprove` / + `runImprovementLoop` (e.g. for a main-agent prompt), add the new surface as + another target in that harness rather than bolting on a second optimizer. +- `agent` is the only domain seam — the optimizer never assumes how a surface + runs. Report cost via `ctx.cost` inside it so the integrity guard sees real activity. - A live run needs a real backend (`TANGLE_API_KEY` / router, or local cli-bridge) and real spend; it is not free. @@ -161,7 +164,7 @@ Mount it on a production `AgentProfile.mcp`; do not re-implement delegation. `loops/types.ts:Driver` only when none fit — never fork the kernel. - [ ] `runLoop` is bridged to campaigns via `loopDispatch` / `loopCampaignDispatch` (usage + trace auto-forwarded), not a hand-rolled ExecCtx. -- [ ] Every optimizable prompt is registered through `optimizePrompt` (or the +- [ ] Every optimizable prompt is registered through `selfImprove` (or the product's existing `runImprovementLoop`), identity-gated on a held-out set. - [ ] Boundaries fail loud: no `null` sandbox client, no silent adapter return, no unguarded planner envelope.