Merged
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -23,7 +23,7 @@ agent-knowledge ─┐
agent-runtime ───┘ (this repo — wraps the substrate)
```

**Rule: agent-runtime depends on agent-eval. agent-eval MUST NOT import from agent-runtime.** No upward imports, no `peerDependencies` in agent-eval pointing here, no `import type { X } from '@tangle-network/agent-runtime'` inside agent-eval. A spotted upward import is a bug — file an issue and move the type into agent-eval. agent-eval is declared a **required `peerDependency`** (pinned `^0.76.0`), not a hard dependency — keep it in sync with the `optimizePrompt`/`heldoutSignificance`/`loopDispatch` APIs the code uses.
**Rule: agent-runtime depends on agent-eval. agent-eval MUST NOT import from agent-runtime.** No upward imports, no `peerDependencies` in agent-eval pointing here, no `import type { X } from '@tangle-network/agent-runtime'` inside agent-eval. A spotted upward import is a bug — file an issue and move the type into agent-eval. agent-eval is declared a **required `peerDependency`** (floor `>=0.83.0` — `selfImprove` exposes `analyzeGeneration` from 0.83), not a hard dependency — keep it in sync with the `selfImprove`/`heldoutSignificance`/`loopDispatch` APIs the code uses.

Substrate primitives CONSUMED from agent-eval: `DefaultVerdict`, `RunRecord`, `AgentEvalError` + taxonomy, `AnalystFinding`/`AnalystRunResult`/`FindingsDiff`, `TraceAnalystKindSpec`, `KnowledgeReadinessReport`, and the campaign types (`DispatchContext`/`ProfileDispatchFn`/`Scenario`, type-only).

37 changes: 21 additions & 16 deletions README.md
@@ -58,7 +58,7 @@ That is the common case. Everything below is for when one chat turn is not enoug
| Delegate a disciplined loop by mode (code, research, ...) | `runDelegatedLoop` or `agent-runtime-loop` | root |
| Build code reliably (reviewed, gated) | `createDefaultCoderDelegate` | `/mcp` |
| Grow a knowledge base with only grounded facts | `createKbGate` | `/mcp` |
| Improve a prompt safely (identity-gated) | `optimizePrompt` | `/improvement` |
| Improve a prompt safely (identity-gated) | `selfImprove` | `@tangle-network/agent-eval/contract` |
| Ship loop traces to a GenAI viewer | `buildLoopOtelSpans` plus `createOtelExporter` | root |
| Expose delegation as MCP tools to a sandbox agent | `createMcpServer` or `agent-runtime-mcp` | `/mcp` |
| Mutate surfaces from trace findings | `runAnalystLoop` | `/analyst-loop` |
@@ -90,21 +90,25 @@ Shipped drivers (`/loops/drivers`): `createRefineDriver` (single task, iterate u

The same machinery, run at the optimization timescale.

`optimizePrompt` (`/improvement`) optimizes any text prompt over agent-eval's `runImprovementLoop`, identity-gated by construction. It runs evals, proposes candidates (default `gepaDriver`), and a held-out gate compares candidate against baseline. `result.prompt` is the baseline unless the gate decided `ship`, so registering a prompt for optimization can never regress it.
The one entry point is agent-eval's **`selfImprove`** (`@tangle-network/agent-eval/contract`). It runs a closed loop over any text/config surface, identity-gated by construction: it evaluates, proposes candidates (default `gepaDriver`), and a held-out gate ships a winner only if it beats the baseline. `result.winner.surface` is the baseline unless `result.gateDecision === 'ship'`, so registering a surface for optimization can never regress it.

```ts
import { optimizePrompt } from '@tangle-network/agent-runtime/improvement'

const { prompt, improved, delta } = await optimizePrompt({
baselinePrompt: CURRENT_SYSTEM_PROMPT,
runWithPrompt: (candidate, scenario, ctx) => runYourThing(candidate, scenario),
scenarios, holdoutScenarios, judges, runDir,
reflection: { llm, model: 'claude-sonnet-4-6' },
import { selfImprove } from '@tangle-network/agent-eval/contract'

const result = await selfImprove({
baselineSurface: CURRENT_SYSTEM_PROMPT,
agent: (surface, scenario, ctx) => runYourThing(surface, scenario),
scenarios,
judge,
budget: { holdoutScenarios, generations: 3 },
llm: { baseUrl, apiKey, model: 'claude-sonnet-4-6' },
})
// assign `prompt` unconditionally; it is the safe one
// result.winner.surface is the safe one — the baseline unless gateDecision === 'ship'
```
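
To make the identity-gating rule concrete, here is a minimal sketch of the selection step. It is not agent-eval's implementation: the `Scored` and `GateResult` shapes, the `'hold'` decision value, and the `minLift` threshold are assumptions for illustration. Only `winner.surface`, `gateDecision === 'ship'`, and `lift` echo names that appear in this change.

```typescript
// Hypothetical shapes for illustration; not agent-eval's actual types.
type GateDecision = 'ship' | 'hold' // 'hold' is an assumed name for "do not ship"

interface Scored<S> {
  surface: S
  heldOutScore: number
}

interface GateResult<S> {
  winner: Scored<S>
  gateDecision: GateDecision
  lift: number
}

// Ship the candidate only if it beats the baseline on held-out data by more
// than `minLift`; otherwise return the baseline unchanged.
function identityGate<S>(
  baseline: Scored<S>,
  candidate: Scored<S>,
  minLift = 0,
): GateResult<S> {
  const lift = candidate.heldOutScore - baseline.heldOutScore
  const gateDecision: GateDecision = lift > minLift ? 'ship' : 'hold'
  return {
    winner: gateDecision === 'ship' ? candidate : baseline,
    gateDecision,
    lift,
  }
}

const baseline = { surface: 'You are a helpful agent.', heldOutScore: 0.62 }
const worse = identityGate(baseline, { surface: 'Be terse.', heldOutScore: 0.55 })
const better = identityGate(baseline, { surface: 'Plan, then act.', heldOutScore: 0.71 })
```

Because the losing branch returns the baseline object itself, a caller can assign `winner.surface` unconditionally, which is the property the paragraph above relies on.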

`runAnalystLoop` (`/analyst-loop`) mines real run traces into findings; `createAnalystDriverHook` feeds those findings to a dynamic-driver planner via `PlannerContext.analyses`, with a firewall (`assertTraceDerivedFindings`) that rejects any finding derived from a judge verdict. `reportOptimizationRun` (`/improvement`) ships an optimization run's proposal and verdict to Tangle Intelligence over the eval-run wire.
agent-runtime contributes the runtime-specific piece: the **CODE-surface `improvementDriver`** (`/improvement`) — a git-worktree mutator you pass to `selfImprove` as `driver` to optimize code instead of a string.

`runAnalystLoop` (`/analyst-loop`) mines real run traces into findings; `createAnalystDriverHook` feeds those findings to a dynamic-driver planner via `PlannerContext.analyses`, with a firewall (`assertTraceDerivedFindings`) that rejects any finding derived from a judge verdict. Production intake — turning real run traces into the corpus `selfImprove` optimizes against — is agent-eval's `analyzeRuns` / `partitionRunsByAuthoringModel` (`/contract`).

## Delegated loops

@@ -170,17 +174,18 @@ One entrypoint, `runExperiment(adapter, { sandboxClient, agentRun, arms, ... })`
| Driver | none, required by `runLoop` | `createRefineDriver`, `createFanoutVoteDriver`, `createDynamicDriver` |
| Winner selection (coder delegate) | `highest-score` | `winnerSelection` option |
| KB gate min passage | 12 chars | `createKbGate({ minPassageChars })` |
| `optimizePrompt` gate | `heldOutGate` | `defaultProductionGate` for red-team hardening |
| `selfImprove` gate | held-out gate (default) | pass `gate: defaultProductionGate` for red-team hardening |
| OTEL export | off | set `OTEL_EXPORTER_OTLP_ENDPOINT` |
| Loop-runner mode failure | recorded as `{ ok: false }` | `runDelegatedLoop` never crashes on a thrown engine |

## Composition with the stack

```
agent-runtime handleChatTurn, runLoop + drivers, runProgram, runDelegatedLoop, createMcpServer,
optimizePrompt, createKbGate, buildLoopOtelSpans, defineAgent
improvementDriver, createKbGate, buildLoopOtelSpans, defineAgent

agent-eval runEvalCampaign, runImprovementLoop (gepaDriver), heldOutGate, runAgentMatrix.
agent-eval selfImprove (the optimization entry point), runEvalCampaign,
runImprovementLoop (gepaDriver), heldOutGate, runAgentMatrix, analyzeRuns.
Consumes runtime traces, scores, gates promotion. agent-runtime depends on it,
never the reverse.

@@ -200,15 +205,15 @@ sandbox AgentProfile, Sandbox.create, streamPrompt, exportTraceBundle. T
| `.../loops` | the `runLoop` kernel, the `refine` / `fanout-vote` / `dynamic` drivers, `runProgram`, `loopDispatch` |
| `.../profiles` | `coderProfile`, `researcherProfile` presets |
| `.../mcp` | `createMcpServer`, `createDefaultCoderDelegate`, `createKbGate`, the `agent-runtime-mcp` bin |
| `.../improvement` | `optimizePrompt` (text), `improvementDriver` (code/worktree), `reportOptimizationRun` |
| `.../improvement` | `improvementDriver` (code/worktree `CandidateGenerator`), `agenticGenerator`, `reflectiveGenerator` — the code-surface driver you pass to agent-eval's `selfImprove` |
| `.../analyst-loop` | `runAnalystLoop`, the analyst registry driver |
| `.../platform` | cross-site SSO and the integrations hub |

Bins: `agent-runtime-mcp` (delegation MCP server), `agent-runtime-loop` (schedulable delegated loop-runner).

## Adoption skill

This package ships a self-contained adoption skill at [`skills/agent-runtime-adoption/SKILL.md`](./skills/agent-runtime-adoption/SKILL.md): driven loops, topology drivers, the `loopDispatch` campaign bridge, MCP delegation, and identity-gated `optimizePrompt`. It needs only this package plus `@tangle-network/agent-eval`. For the full self-improving pipeline (trace sink, analyst loop, scorecard, production loop, CI), see the `agent-eval-adoption` and `agent-stack-adoption` skills.
This package ships a self-contained adoption skill at [`skills/agent-runtime-adoption/SKILL.md`](./skills/agent-runtime-adoption/SKILL.md): driven loops, topology drivers, the `loopDispatch` campaign bridge, MCP delegation, and the code-surface `improvementDriver` for agent-eval's `selfImprove`. It needs only this package plus `@tangle-network/agent-eval`. For the full self-improving pipeline (trace sink, analyst loop, scorecard, production loop, CI), see the `agent-eval-adoption` and `agent-stack-adoption` skills.

## Stability, tests, docs

4 changes: 2 additions & 2 deletions bench/HARNESS.md
@@ -3,7 +3,7 @@
If you're an agent picking this up: read this page, then run `pnpm help` + `pnpm gate` —
do NOT re-derive the harness from source. This map is SHORT on purpose; if it disagrees
with the code, the code wins — fix this page in the same turn (the anti-rediscovery law).
Verified against source 2026-06-03 · agent-eval pinned `^0.76.0` (the optimizePrompt /
Verified against source 2026-06-06 · agent-eval floor `>=0.83.0` (the selfImprove /
heldoutSignificance API is version-coupled).

## What this harness answers
@@ -71,7 +71,7 @@ run.ts: help · preflight · verify-judge · solve-one · solve-one-local · so
standalone tools (NOT in run.ts — the gate lives here):
corpus-replay.mts --selector: selector@k vs random@k vs oracle@k over a corpus (THE offline gate)
corpus-report.mts paired-bootstrap CI + Benjamini-Hochberg over corpora
improve-prompt.ts GEPA-optimize a directive vs a held-out gate + paired CI (optimizePrompt)
improve-prompt.ts GEPA-optimize a directive vs a held-out gate + paired CI (selfImprove)
finsearch-loop.ts the real runLoop+createDynamicDriver closed loop on FinSearchComp
terminal-compare.ts Terminal-Bench compare (own main, not in run.ts)
unit tests (the only fully-green, cred-free runnable surface besides offline replay):
2 changes: 1 addition & 1 deletion bench/src/improve-prompt.ts
@@ -589,7 +589,7 @@ async function main() {
console.log(` ► held-out delta: ${(result.lift * 100).toFixed(1)} pp`)
console.log(` gate decision: ${result.gateDecision} (improved=${improved})`)

// 0.76 heldoutSignificance: a bootstrap CI on the PAIRED winner−baseline held-out
// heldoutSignificance: a bootstrap CI on the PAIRED winner−baseline held-out
// delta — turns a bare "+X pp" (a few-instance swing at thin n) into a CI + a
// significance verdict, so we know whether to trust/promote or just scale n.
try {
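
For readers unfamiliar with the technique the comment describes, a paired bootstrap CI on the winner minus baseline held-out delta can be sketched as below. This is a generic percentile bootstrap and not `heldoutSignificance` itself: `pairedBootstrapCi`, its parameters, the `PairedCi` shape, and the seeded `mulberry32` generator are all assumptions chosen for illustration.

```typescript
// Small seeded PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface PairedCi {
  meanDelta: number
  low: number
  high: number
  significant: boolean
}

// Percentile bootstrap over per-scenario deltas (winner[i] - baseline[i]).
// Pairing matters: both scores for index i come from the same held-out scenario.
function pairedBootstrapCi(
  winner: number[],
  baseline: number[],
  resamples = 2000,
  alpha = 0.05,
  seed = 1,
): PairedCi {
  if (winner.length !== baseline.length || winner.length === 0) {
    throw new Error('need equal-length, non-empty paired scores')
  }
  const deltas = winner.map((w, i) => w - (baseline[i] ?? 0))
  const n = deltas.length
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let r = 0; r < resamples; r++) {
    let sum = 0
    // Resample n deltas with replacement and record their mean.
    for (let k = 0; k < n; k++) sum += deltas[Math.floor(rand() * n)] ?? 0
    means.push(sum / n)
  }
  means.sort((x, y) => x - y)
  const low = means[Math.floor((alpha / 2) * resamples)] ?? NaN
  const high = means[Math.ceil((1 - alpha / 2) * resamples) - 1] ?? NaN
  const meanDelta = deltas.reduce((s, d) => s + d, 0) / n
  // "Significant" here means the interval excludes zero.
  return { meanDelta, low, high, significant: low > 0 || high < 0 }
}
```

With only a handful of held-out scenarios the interval is typically wide, which is the "thin n" caveat in the comment: a bare point delta can look large while the CI still straddles zero.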
2 changes: 1 addition & 1 deletion package.json
@@ -124,7 +124,7 @@
"license": "MIT",
"packageManager": "pnpm@10.28.0",
"peerDependencies": {
"@tangle-network/agent-eval": ">=0.76.0 <1.0.0",
"@tangle-network/agent-eval": ">=0.83.0 <1.0.0",
"@tangle-network/agent-knowledge": ">=1.3.0 <2.0.0",
"@tangle-network/sandbox": ">=0.1.2 <0.5.0",
"playwright": "^1.40.0"
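
The new peer range is a floor with a major-version ceiling, which differs from a caret pin on a 0.x version. The sketch below spells out the two ranges by hand for plain `x.y.z` versions. It is a simplified illustration, not the `semver` package; the helper names are hypothetical and prerelease tags are deliberately unsupported.

```typescript
type Version = readonly [number, number, number]

// Parse a plain "x.y.z" string; prerelease/build suffixes are rejected.
function parseVersion(v: string): Version {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(v)
  if (!m) throw new Error(`not a plain x.y.z version: ${v}`)
  return [Number(m[1]), Number(m[2]), Number(m[3])]
}

// Negative if a < b, zero if equal, positive if a > b.
function compare(a: Version, b: Version): number {
  const [aMaj, aMin, aPat] = a
  const [bMaj, bMin, bPat] = b
  return aMaj - bMaj || aMin - bMin || aPat - bPat
}

// ">=0.83.0 <1.0.0": the peerDependency range declared in package.json.
function satisfiesFloorRange(v: string): boolean {
  const p = parseVersion(v)
  return compare(p, [0, 83, 0]) >= 0 && compare(p, [1, 0, 0]) < 0
}

// "^0.83.0": for a 0.x version the caret only allows patch bumps,
// i.e. it is equivalent to ">=0.83.0 <0.84.0".
function satisfiesCaret083(v: string): boolean {
  const p = parseVersion(v)
  return compare(p, [0, 83, 0]) >= 0 && compare(p, [0, 84, 0]) < 0
}
```

Under these definitions `0.90.2` satisfies the declared floor range but would not satisfy `^0.83.0`, so the two notations are not interchangeable when describing this dependency.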
55 changes: 29 additions & 26 deletions skills/agent-runtime-adoption/SKILL.md
@@ -1,6 +1,6 @@
---
name: agent-runtime-adoption
description: Adopt @tangle-network/agent-runtime in a product — the driven-loop kernel (runLoop), topology drivers (refine / fanout-vote / dynamic agent-authored), the loopDispatch campaign bridge, MCP delegation, and identity-gated prompt-surface optimization (optimizePrompt). Self-contained; needs only the published package + @tangle-network/agent-eval. Use when wiring runLoop, choosing a topology driver, optimizing a system/planner prompt, or exposing delegation tools.
description: Adopt @tangle-network/agent-runtime in a product — the driven-loop kernel (runLoop), topology drivers (refine / fanout-vote / dynamic agent-authored), the loopDispatch campaign bridge, MCP delegation, and the code-surface improvementDriver for agent-eval's selfImprove (the optimization entry point). Self-contained; needs only the published package + @tangle-network/agent-eval. Use when wiring runLoop, choosing a topology driver, optimizing a system/planner prompt or code surface, or exposing delegation tools.
---

# agent-runtime adoption — driven loops, topology drivers, prompt optimization
@@ -106,43 +106,46 @@ const dispatch = loopCampaignDispatch({

`loopDispatch` is the `runProfileMatrix` variant (profile is an axis).

## Identity-gated prompt optimization — `optimizePrompt`
## Identity-gated optimization — agent-eval's `selfImprove`

`@tangle-network/agent-runtime/improvement`. The text-surface entry point onto
agent-eval's `runImprovementLoop` — sibling to `improvementDriver` (the
code/worktree path). Optimizes any prompt surface (system / planner / judge
rubric) and is **identity-gated by construction**: it runs evals, proposes
candidates (default driver `gepaDriver`), and the held-out gate compares
candidate vs baseline. `result.prompt` is the **baseline unless the gate decided
`'ship'`** — so registering a prompt for optimization can never regress it; it
only improves when held-out data earns it.
The optimization entry point is **`selfImprove`** (`@tangle-network/agent-eval/contract`),
NOT agent-runtime — agent-runtime contributes the code-surface `improvementDriver`
(`/improvement`, the git-worktree path) you pass to it as `driver` to optimize CODE
instead of a string. `selfImprove` optimizes any text/config surface (system /
planner / judge rubric) and is **identity-gated by construction**: it runs evals,
proposes candidates (default driver `gepaDriver`), and a held-out gate ships a winner
only if it beats the baseline. `result.winner.surface` is the **baseline unless
`result.gateDecision === 'ship'`** — so registering a surface for optimization can
never regress it; it only improves when held-out data earns it.

```ts
import { optimizePrompt } from '@tangle-network/agent-runtime/improvement'
const { prompt, improved, decision, delta } = await optimizePrompt({
baselinePrompt: CURRENT_SYSTEM_PROMPT,
runWithPrompt: (prompt, scenario, ctx) => runYourThing(prompt, scenario), // sandbox / runLoop / direct call
scenarios, holdoutScenarios, judges, runDir,
reflection: { llm, model: REFLECTION_MODEL }, // builds the default gepaDriver
// gate? — defaults to heldOutGate; pass defaultProductionGate for red-team hardening
import { selfImprove } from '@tangle-network/agent-eval/contract'
const result = await selfImprove({
baselineSurface: CURRENT_SYSTEM_PROMPT,
agent: (surface, scenario, ctx) => runYourThing(surface, scenario), // sandbox / runLoop / direct call
scenarios,
judge,
budget: { holdoutScenarios, generations: 3, populationSize: 2 },
llm: { baseUrl, apiKey, model: REFLECTION_MODEL }, // drives the default gepaDriver
// driver? — pass agent-runtime's improvementDriver to optimize CODE (worktree) instead of a string
// gate? — defaults to a held-out gate; pass defaultProductionGate for red-team hardening
})
// use `prompt` unconditionally: it's the baseline until a candidate genuinely wins
// use result.winner.surface unconditionally: it's the baseline until a candidate genuinely wins
```

### optimizePrompt gotchas — read before wiring
### selfImprove gotchas — read before wiring

- **`gepaDriver` mutates TEXT only**, and its only structural guard is `##` H2
headings (`preserveSections`) + `maxSentenceEdits`. Make load-bearing sections
of your prompt real `##` headings, and treat the output schema as fixed code —
GEPA optimizes the prose, never the envelope/contract.
- **Scenarios must be domain-real.** Derive them from the surface's own traces /
ground truth, not from unrelated corpora. Cross-domain examples are noise.
- **Extend, don't fork.** If the product already wires `runImprovementLoop`
(e.g. for a main-agent prompt), add the new surface as another target in that
harness rather than bolting on a second optimizer.
- `runWithPrompt` is the only domain seam — the optimizer never assumes how a
prompt runs. Report cost via `ctx.cost` inside it so the integrity guard sees
real activity.
- **Extend, don't fork.** If the product already wires `selfImprove` /
`runImprovementLoop` (e.g. for a main-agent prompt), add the new surface as
another target in that harness rather than bolting on a second optimizer.
- `agent` is the only domain seam — the optimizer never assumes how a surface
runs. Report cost via `ctx.cost` inside it so the integrity guard sees real activity.
- A live run needs a real backend (`TANGLE_API_KEY` / router, or local
cli-bridge) and real spend; it is not free.
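
The first gotcha (load-bearing sections must be real `##` headings) can be checked before a run with a small pre-flight helper. The sketch below is a hypothetical illustration, not part of `gepaDriver` or its `preserveSections` logic: `h2Headings` and `missingSections` are names invented here, and the check is line-based, so a `## ` line inside a fenced code block would also be counted.

```typescript
// Collect the text of every `## ` (H2) heading, in document order.
// `###` and deeper headings do not match because the third `#` is not whitespace.
function h2Headings(markdown: string): string[] {
  return markdown
    .split('\n')
    .filter((line) => /^##\s+\S/.test(line))
    .map((line) => line.replace(/^##\s+/, '').trim())
}

// Report baseline H2 sections that a candidate dropped, renamed, or demoted.
function missingSections(baseline: string, candidate: string): string[] {
  const kept = new Set(h2Headings(candidate))
  return h2Headings(baseline).filter((heading) => !kept.has(heading))
}

const baselinePrompt = [
  '## Role',
  'You triage support tickets.',
  '',
  '## Output format',
  'Return JSON.',
].join('\n')

// Prose changed, both H2 sections intact.
const safeCandidate = baselinePrompt.replace('triage support tickets', 'triage tickets carefully')

// "Output format" was demoted to H3, so it no longer counts as an H2 section.
const brokenCandidate = baselinePrompt.replace('## Output format', '### Output format')
```

Running `missingSections(baselinePrompt, candidate)` on each proposed surface and rejecting any non-empty result is one way to keep the prompt's structural contract outside what the optimizer can change.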

@@ -161,7 +164,7 @@ Mount it on a production `AgentProfile.mcp`; do not re-implement delegation.
`loops/types.ts:Driver` only when none fit — never fork the kernel.
- [ ] `runLoop` is bridged to campaigns via `loopDispatch` / `loopCampaignDispatch`
(usage + trace auto-forwarded), not a hand-rolled ExecCtx.
- [ ] Every optimizable prompt is registered through `optimizePrompt` (or the
- [ ] Every optimizable prompt is registered through `selfImprove` (or the
product's existing `runImprovementLoop`), identity-gated on a held-out set.
- [ ] Boundaries fail loud: no `null` sandbox client, no silent adapter return,
no unguarded planner envelope.