chore(deps): bump @tangle-network/agent-eval from ^0.20.0 to ^0.23.0 by drewstone · Pull Request #3 · tangle-network/agent-runtime

drewstone · 2026-05-08T22:58:03Z

Summary

Mechanical version bump absorbing three minors of new agent-eval surface. The control-loop primitives this runtime consumes (runAgentControlLoop, scoreKnowledgeReadiness, blockingKnowledgeEval, userQuestionsForKnowledgeGaps, acquisitionPlansForKnowledgeGaps) are unchanged across 0.21 → 0.22 → 0.23, so the dependency bump itself is functionally a no-op; the only behavior change is the scenarioId backfill described under Code change.

What's new in agent-eval 0.21 → 0.23 (not consumed here)

- 0.21: capture-integrity (canary leak / golden-precision tracking, scenario hashing)
- 0.22: campaign artifact (RL campaign result format, evidence metadata, integration manifest gates)
- 0.23: RL bridge (hashJson / canonicalize exposed for arbitrary content signing; RunRecord.scenarioId made optional to support pre-trace records)

Code change

One small enrichment around the new optional RunRecord.scenarioId field: runAgentTask already knows the canonical scenarioId it passes into runAgentControlLoop (options.scenarioId ?? task.id). When adapter.projectRunRecords returns records without a scenarioId, the runtime now backfills it with that same canonical value before returning. Adapters that already set scenarioId are untouched. This means callers reading result.runRecords always see a populated scenarioId without each adapter having to thread it through.

The hoist of options.scenarioId ?? task.id into a local is purely so the same value is used at the control-loop call site and the post-projection backfill.
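The backfill described above can be sketched as follows. This is a minimal illustration, not the code from the diff: the `RunRecord` shape is simplified, and `backfillScenarioId`, `runId`, and the sample values are hypothetical names introduced only for the example.

```typescript
// Hypothetical, simplified record shape; the real RunRecord in agent-eval
// carries more fields. Only the optional scenarioId matters here.
interface RunRecord {
  runId: string;
  scenarioId?: string;
}

// Fill in scenarioId only where the adapter left it out; records that
// already carry one keep their own value.
function backfillScenarioId<T extends RunRecord>(
  records: readonly T[],
  scenarioId: string,
): Array<T & { scenarioId: string }> {
  return records.map((record) => ({
    ...record,
    scenarioId: record.scenarioId ?? scenarioId,
  }));
}

// Hoist the canonical value once so the control-loop call site and the
// backfill cannot drift apart.
const task = { id: "task-1" };
const options: { scenarioId?: string } = {};
const canonicalScenarioId = options.scenarioId ?? task.id;

// Stand-in for what adapter.projectRunRecords might return.
const projected: RunRecord[] = [
  { runId: "a" },
  { runId: "b", scenarioId: "adapter-set" },
];
const runRecords = backfillScenarioId(projected, canonicalScenarioId);
```

In this sketch the return type narrows `scenarioId` to a required string, so callers would not need to re-check it; whether the runtime's own types narrow the same way is not stated in the description.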

Test plan

- pnpm install: resolves to @tangle-network/agent-eval@0.23.0
- pnpm typecheck: clean
- pnpm test: 16/16 pass (no count change)
- pnpm build: clean (dist/index.js 34.20 KB → 34.31 KB, dist/index.d.ts unchanged)
- Reviewer: confirm the scenarioId backfill behavior is wanted (alternative: leave RunRecords entirely as the adapter returned them)

Notes

The ^0.20.0 floor in the task description was actually ^0.20.12 in the lockfile; the bump goes from ^0.20.12 → ^0.23.0.
No RunRecord is constructed inside this repo — construction is delegated to user-supplied adapters via the projectRunRecords hook — so no other call sites needed touching.
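For context on why the range itself has to move: under npm-style semver, a caret range on a 0.x version only admits changes to the right of the left-most non-zero component, so ^0.20.12 can never resolve to 0.23.0. The sketch below reimplements that rule for plain x.y.z versions only (no prerelease tags); `satisfiesCaret` and its helpers are hypothetical names written for this illustration, not part of any package.

```typescript
type Triple = [number, number, number];

// Parse a plain "x.y.z" version; prerelease/build metadata is out of scope.
function parseVersion(version: string): Triple {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) throw new Error(`unsupported version: ${version}`);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: Triple, b: Triple): number {
  return a[0] - b[0] || a[1] - b[1] || a[2] - b[2];
}

// Exclusive upper bound of a caret range: bump the left-most non-zero part.
function caretUpperBound([major, minor, patch]: Triple): Triple {
  if (major > 0) return [major + 1, 0, 0];
  if (minor > 0) return [0, minor + 1, 0];
  return [0, 0, patch + 1];
}

// True when `version` falls inside the caret range anchored at `base`.
function satisfiesCaret(version: string, base: string): boolean {
  const candidate = parseVersion(version);
  const floor = parseVersion(base);
  return (
    compareVersions(candidate, floor) >= 0 &&
    compareVersions(candidate, caretUpperBound(floor)) < 0
  );
}

const oldRangeAdmitsNew = satisfiesCaret("0.23.0", "0.20.12"); // false
const newRangeAdmitsNew = satisfiesCaret("0.23.0", "0.23.0"); // true
```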

🤖 Generated with Claude Code

Mechanical version bump absorbing 0.21/0.22/0.23 surface (capture-integrity, campaign artifact, RL bridge). The control-loop primitives we consume (runAgentControlLoop, scoreKnowledgeReadiness, blockingKnowledgeEval, userQuestionsForKnowledgeGaps, acquisitionPlansForKnowledgeGaps) are unchanged, so this is a no-op functionally.

Also: 0.23 made RunRecord.scenarioId optional. Backfill the canonical scenarioId (options.scenarioId ?? task.id, the same value passed into the control loop) onto adapter-projected records that omit it, so consumers of runtime.runRecords always see a populated scenarioId without each adapter having to thread it through.

Verification: pnpm typecheck, pnpm test (16 passed), pnpm build all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ree (#6)

The agent-eval ^0.23.0 pin already landed in #3, but downstream lockfiles still resolve agent-runtime@0.5.4 → a transitive agent-eval@0.20.12 entry. Cutting 0.5.5 invalidates that lockfile entry so every consumer collapses to a single agent-eval 0.23.x copy after their next pnpm install.

agent-runtime does not directly depend on agent-knowledge, so no agent-knowledge pin change is required here; consumers pick up ^1.2.0 via their own package.json after this release lands.

…r runLoop (backend-blind) (#150)

* feat(loops): opt-in session continuation + checkpoint-fork lineage (backend-blind)

Two @experimental, default-OFF seams on runLoop so a loop can CONTINUE a sandbox session across iterations (same box + sessionId, no prompt-text replay) and FORK fanout branches from a parent checkpoint (shared context prefix). Both sit behind a capability probe, so the kernel asks 'can I fork?' (client.criuStatus) and never names Docker/Firecracker, degrading to fresh boxes when CRIU is absent.

- sandbox-capabilities.ts: memoized, fail-closed criuStatus probe -> {canFork}.
- sandbox-lineage.ts: createSandboxLineage owns box+session handles with start/continue/fork/teardown; reuses the kernel's acquireSandbox / buildBackendOptions / deleteBoxSafe; fail-loud if the probe says canFork but the box has no fork().
- run-loop.ts: RunLoopOptions.lineage (sessionContinuity / forkFanout); refine continues, fanout forks-once, else fresh-through-lineage. Default OFF is byte-identical to today, so random@k stays N independent fresh boxes (the compute-control invariant). Rejects lineage + onWorkerBox (both own boxes).
- 7 new unit tests (continuation reuses session; fork when canFork; fresh fallback; default-off invariant).

Full suite 621 pass, typecheck clean.

* fix(loops): address PR #150 review - bound forks, prune lineage, fail-loud session continuity

Resolve all six findings from the review (none blocked landing; #1 gated enabling, #3/#4 wanted documenting). Lineage remains default-OFF and byte-identical to the fresh-box path when both flags are unset.

- #1 sessionContinuity silent no-op: `continue` now asserts the session is still known to the sandbox via `box.session(id).status()` before streaming. A `null` (platform never honored the client-minted id, or it was reaped) raises a ValidationError, which executeIteration now propagates as a hard structural failure instead of degrading to a soft empty iteration, so a non-honoring platform errors loudly rather than running contextless turns.
- #2 unbounded fork creation: `fork` provisions child boxes through `mapWithConcurrency` bounded by the loop's `maxConcurrency`, not a single `Promise.all` over all N branches.
- #3 fork ignores per-branch specs: documented on `fork` and `LoopLineageOptions.forkFanout` that a real CRIU fork inherits the parent image/profile (per-branch specs apply only on the degraded fresh path).
- #4 lineage holds every box to loop end: kernel prunes boxes no future round can descend from after each round, gated on a kernel-inferred (monotonic) branch point; skipped when the driver authors its own `parentIndex`. The unprunable case is documented as the box ceiling.
- #5 abort during fork: documented the SDK's signal-less fork; abort is now checked per branch (between bounded waves) + an abort-under-lineage test.
- #6 export order: alphabetized the loops barrel.

Adds `mapWithConcurrency` util and six lineage tests (session-liveness pass/fail, bounded-fork peak, mid-loop prune, no-prune-under-authored-parent, abort-under-lineage). 627 tests pass, typecheck + biome clean.
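The bounded-fork fix (#2 above) replaces a single `Promise.all` over all N branches with a concurrency-limited map. The real `mapWithConcurrency` signature is not shown in the commit message, so the following is only a sketch under assumptions: it uses a worker-pool strategy, whereas the mention of "bounded waves" suggests the actual util may batch instead. `forkBranch` and the branch list are hypothetical stand-ins for provisioning child boxes.

```typescript
// Run `fn` over `items` with at most `limit` calls in flight at once,
// preserving input order in the returned array.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0;
  // Each worker repeatedly claims the next unclaimed index until none remain.
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index] as T, index);
    }
  };
  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage: "fork" eight hypothetical branches, never more than three at once.
let inFlight = 0;
let peak = 0;
const forkBranch = async (branch: number): Promise<string> => {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await new Promise((resolve) => setTimeout(resolve, 5)); // simulated provisioning
  inFlight--;
  return `child-${branch}`;
};
const forked = mapWithConcurrency([0, 1, 2, 3, 4, 5, 6, 7], 3, forkBranch);
```

Tracking the peak number of in-flight calls, as the usage does, is one way a "bounded-fork peak" test could assert the limit. This sketch does not check an abort signal between claims; the commit describes doing so per branch.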

…dation (#304)

* feat(intelligence): capability-delivery manifest - composeCertifiedProfile + resolver + ladder

Add the unified, future-proof delivery structure: one certified unit of agent power = { interface, binding }. Interfaces are CLOSED (tool / mcp / context / retrieval / hook / subagent); bindings are OPEN (inline / file / http / sandbox-code / mcp-stdio / mcp-remote / process-on-infra / rag-index / memory-store / wasm / a2a). A single resolver lowers any binding into one uniform ResolvedSurface consumed identically by the host seam (RouterToolsSeam tools + executeToolCall) and the sandbox seam (AgentProfile).

- src/intelligence/capability.ts: the manifest types + CapabilityNotAdmittedError + manifestFromProfile (lowers today's CertifiedProfile wire into capabilities[] with best-effort binding inference, so the spine delivers value before the plane changes).
- src/intelligence/resolver.ts: composeCertifiedProfile. The spine resolves inline/file (byte-identical to composeCertifiedPrompt, the regression lock), mcp-stdio/mcp-remote (strict union widens to the SDK's flat AgentProfileMcpServer, an always-valid lowering), and http tools (the host seam). Ladder rungs that need infra (sandbox-code, process-on-infra) are injected ResolveCtx providers; rag-index/memory-store/wasm/a2a throw CapabilityNotAdmittedError (memory gated on the E3 admission bar). Fail-closed: null manifest -> base surface, per-capability failure -> drop (diagnostic via onDrop), post-resolve drift drops any tool/mcp whose live names diverge.
- src/mcp/delegation-profile.ts: composeProductionAgentProfile now also merges tools box-flags, hooks, subagents, and injects ResolvedSurface.mcpConnections into AgentProfile.mcp (the sandbox-seam mapping).
- exports + export gate + the two spec corrections (mcp lowers via always-valid widening; tools lower two ways since AgentProfile.tools is box flags).

* docs(rsi): correct depth>breadth to the POWER-16 tie at n=48 (not the n=16 +16.4pp)

The +16.4pp CI[+5.3,+29.8] n=16 depth-steered-continuation result did not replicate when powered: depth-breadth = +4.7pp CI[-1.9,+11.4] at n=48 (a tie; +4.1pp at n=72). architecture.md and roadmap-rsi.md advertised it as a cleared keystone; they now carry the retraction and point at .evolve/current.json.

* chore(clean): remove dead mock loop + orphan re-exports/interface (432 LOC)

- delete bench/src/observe-steer-workspace-loop.mts (the #194 mock anti-pattern; 0 inbound refs)
- drop orphan pass-through re-exports CaptureIntegrityError/ReplayError/VerificationError (src/errors.ts)
- drop orphan interface AgentTaskRunSummary (src/types.ts)
- fix doc-rot in loop-facade-postmortem.md; gitignore stray test_repo/
- deletion-ledger.md tracks deletions + the deferred migrations (driver.ts 12 callers, AgentProfile superset)

Gates verified by hand: typecheck 0, lint 0, 924 tests pass / 0 fail. Load-bearing fail-loud fences left intact (NOT dead code).

* docs(research): atom-compression plan, harness-compat matrix, long-horizon map

* chore(deps): bump @types/node 25.9.3 + playwright 1.61.0 (dev)

Safe minor/patch dev bumps, gates verified (typecheck 0, lint 0, 924 tests pass). Deferred (need own careful pass): biome 2.5 (13 new lint warnings), typescript 6 + vitest 4 (majors), agent-eval 0.92 (substrate; sync with the AgentProfile superset work).

* docs(research): RSI atom masterplan + build tracker (single source of truth)

* docs(research): collapse N driver prompts → one cached generator (software 3.0)

Replace the per-role hand-coded prompt builders with generateDriverSystemPrompt(spec): a (fused) router call generates the driver prompt from {role,goal,target,harness,stance}, cached for semantic reuse via PromptRegistry + hashContent key (file/JSON or DB). The hand-authored worker-driver prompt becomes the generator's seed + its tests the invariants. Single optimizable surface; depends on a tangle-router fusion primitive (separate issue).

* docs(research): active push - RUN/DELETE/IMPROVE worklist (delete createDriver, run commit0, deep-clean, dedup)

* docs(research): createDriver delete BLOCKED (paradigm diff, evidenced); commit0 RAN; the delete fork

* refactor(runtime): full nuke of the createDriver/string-prompt measurement+eval paradigm

DELETE the wrong abstraction (createDriver = a code TopologyPlanner driving runLoop over string-prompt→string-answer calls, judged by adapter.judge) and the entire old bench experiment + eval-gen apparatus built on it. The agent-driver (AgentProfile driving AgentProfile via coordination tools) replaces it; the runLoop KERNEL and the Scope/Supervisor are untouched.

Deleted (15): src/runtime/driver.ts; bench experiment.ts(+test)/steering-experiment(+test)/improve-prompt/research-loop/finsearch-loop/rsi/generate-eval/run-benchmarks/run.ts/skills-sandbox/profile-coord-sandbox; tests/loops/dynamic.test.ts.

Survivors (search-bench/cloud-loop/fleet/commit0-gate) re-homed onto a new pure helper bench/src/sandbox-run.ts (answerOutput/sandboxAgentRun/WorkerBackendType/AnalystFn/llmAnalyst, no experiment shell). runLoop kernel tests kept via a scriptedDriver stub in refine-driver.ts.

Gates (hand-verified): build 0, typecheck 0 (root+bench; also fixed a pre-existing bench BackendType red), lint 0, 905 tests pass. Zero dangling code refs.

ACCEPTED casualties of the full nuke (rebuild on the agent-driver/Supervisor path when wanted): the generate-eval data engine, the AgentProfile-coordinate optimizer (profile-coord), and run.ts's non-experiment subcommands (preflight/verify-judge/solve-one/ui-review).

* docs(research): full nuke DONE (-3492 LOC); doc/skill-rot follow-up tracked

* docs(cleanup): retarget all docs+skills off the nuked createDriver/runExperiment to the agent-driver/Supervisor reality

* feat(supervise): recursive driver-executor - agents driving agents driving agents

A spawned child can now BE a driver. driverExecutorFactory mounts a NESTED Scope over the SAME conserved budget pool + shared journal (scope.ts's new NestedScopeSeam), one depth deeper, and runs the wrapped driver's act there. A child resolves to a LEAF (worker) OR (for a role:'driver' spec, via withDriverExecutor) this executor, recursively. So a driver spawns a driver spawns a worker on one budget-conserving tree. The persona/strategy spawn fences now route a driver child to the recursive executor (compose) instead of throwing; act still fails loud only if a child is run directly.

Reuses the atom; builds NO new budget/journal/selection logic. Budget conserved across depth (reserve-on-spawn fails closed at any depth), spend bubbles to root, journal records each nested tree, maxDepth enforced across recursion.

Proven OFFLINE (no creds; scripted drivers+workers) in tests/loops/driver-recursion.test.ts: depth-2 chain root->mid->inner->worker (node id rec:s0:s0:s0, which a non-recursive build cannot produce), fail-closed budget conservation across depth, spend roll-up (spentTotal = the worker's exact spend), nested-journal sub-trees, depth-ceiling across recursion.

Gates hand-verified: build 0, typecheck 0, lint 0, 911 tests pass.

* refactor(bench)+docs: reclaim runKeystoneGate -> runGate; strip 4 docs to latest-only

Rename the opaque 'keystone' jargon: runKeystoneGate->runGate (+ RunGateOptions/GateArmResult/GateReport), bench/src/keystone-gate.ts->gate.ts (+ -cli, +test), all import paths and CLI banners. Strip the last historical createDriver/runExperiment 'was removed/nuked' breadcrumbs from architecture.md, architecture-interpretations.md, learning-flywheel.md, roadmap-rsi.md; upgrading agents now see only the current agent-driver/Supervisor reality (history lives in git). Gates green; 905->911 with the keystone test.

* docs(research): keystone recursion ✅ (9d188e1); createDriver retire ✅ via nuke; #2b brain next

* feat(supervise): coordinationDriverAgent - the cheap/offline driver (LLM tool-loop over the coordination verbs)

The CHEAP, in-process, no-creds variant of the recursive driver: act() mounts createCoordinationTools over its scope and runs an LLM tool-loop (injected chat seam) so the driver REASONS spawn/steer/await/stop; composes with 2a recursion (a driver agent spawns a driver agent, via makeWorkerAgent -> driverChild). NOT the primary driver: the CAPABLE driver is a sandbox agent with the coordination verbs as an MCP. This one is the offline-testable + cheap-orchestration path. Prompt is INJECTED (decoupled from agent-eval).

Proven OFFLINE (no creds, scripted mock chat) tests/loops/coordination-driver.test.ts: the tool-loop drives real Scope.spawn via the coordination verbs + folds results back; a driver AGENT spawns a driver AGENT (separate nested journal tree). typecheck 0, lint 0.

* docs(research): dual-purpose resolution (one substrate serves product + proof); #2b cheap driver done, #2c capable sandbox driver + #3 completion-oracle next

* feat(supervise): completion-oracle - settled ⟺ delivered (Foreman 0/18)

The honest settle: a node counts as delivered only when a deployable check passes, never on self-report.

- completion-gate.ts: gateOnDeliverable wraps any Executor so its settlement valid reflects a DeliverableSpec check (both execute shapes; fail-closed).
- coordination-driver finalize: returns the best DELIVERED child; undefined when none delivered: a driver cannot self-declare done via prose.
- driver-executor: derive the driver child's verdict from its direct settled events, so delivery composes UP the recursion (a sub-driver is valid only when it itself selected a delivered child).
- supervisor: a winner MUST carry a real Out; a successful act that produced nothing is a no-winner, never a winner wrapping undefined.

8 offline tests: leaf gate (both execute shapes, fail-closed), ran-but-didn't-deliver yields no winner, the gate dominates score, delivery propagates up the recursion.

* docs(research): completion-oracle #3 ✅ (bd58761) - settled ⟺ delivered, composes up the recursion

* feat(bench): atom-humaneval - agents-driving-agents on a live deployable-checked domain

A coordinationDriverAgent (real router brain) drives gated workers on HumanEval: each worker is settled valid ONLY when the local Docker test suite passes (completion-oracle, not self-report), against a blind best-of-K baseline. Proven live: the driver spawns, the worker solves, the checker gates, the supervisor returns a winner only on real delivery. Also exports gateOnDeliverable/DeliverableSpec from the runtime barrel (the #3 primitive was added to supervise/ but not surfaced on the package).

* feat(topology): animated visual replay of a recursive agent run

Fold the one runtime-hooks stream into a timestamped ReplayEvent[] (createReplayRecorder) and render a self-contained, scrubbable HTML player (renderReplayHtml): the recursive agent tree animated over wall-clock, each node colored by the completion-oracle: delivered (valid) green, ran-but-not-delivered amber, failed red, with live token/cost counters. Synthesizes the unspawned root driver so the whole recursion renders. No server/build/deps. Wired into atom-humaneval (every driver run emits a replay.html). 4 offline recorder tests; proven on a live HumanEval run (driver -> worker -> delivered).

* fix(bench): atom-humaneval blind arm survives transient router errors (a 502 is a failed attempt, not a crash; matches the driver arm's down-typing)

* feat(supervise): the supervisor AUTHORS worker profiles from a skill (the intelligence, not the plumbing)

The supervisor's job is to DESIGN the agents it spawns: read the task, decompose it, and author a tailored profile (instructions + model) per worker. supervisorSkill is the how-to it reads (its own system prompt), THE optimizable self-improvement surface; authoredWorker builds a worker from an authored profile; asAuthoredProfile catches empty/placeholder profiles (a skill violation). Proven offline (no creds, no plumbing): a skill-guided supervisor authors DISTINCT, tailored worker recipes per sub-task and they flow to the workers. 3 tests.

* feat(supervise): coordination MCP over a live Scope - the real keystone for in-box driving

serveCoordinationMcp fronts a live Scope with an HTTP JSON-RPC MCP server: an in-box coding harness (opencode via cli-bridge) mounts mcp.mcpServers.coordination and calls spawn_worker as a native tool, landing on Scope.spawn: a real box driving real boxes, not emulated function-tools. Real test: HTTP tools/call spawn_worker -> Scope.spawn -> worker settles -> winner (no mock of the MCP path). Plus the standard supervise SKILL.md.

* feat(bench): prove a coding harness drives the Scope via the coordination MCP (live)

opencode (glm-5-turbo via cli-bridge) mounts mcp.mcpServers.coordination (type:http → opencode remote) and calls spawn_worker itself → real Scope.spawn → worker settles, and reads back the await_next result. The in-box driving path is REAL: a coding agent drives recursion as a native tool, not emulated. (Bridge wants mcp type:'http', not 'remote'.)

* feat(bench): WHOLE real e2e - opencode supervisor drives opencode workers via the coordination MCP, real test gates delivery

Live, no mock: the opencode supervisor (glm-5-turbo via cli-bridge) mounts the coordination MCP, authors worker profiles, calls spawn_worker -> real Scope.spawn -> real opencode workers code in a cwd -> python3 test gates valid -> supervisor settles on the delivered worker -> winner. The completion-oracle (deployable check, not LLM judge) decided delivery over the supervisor's confusion that it couldn't see the workers' isolated cwds (→ shared Workspace next). Proof artifact for the in-box-driving path; the law-compliant productionization is a substrate backend (tmux/bridge/sandbox) that runs authored profiles, not this harness-specific script.

* docs(canonical-api): the AgentProfile law - author the profile, the substrate materializes it

§1.5 + decision-table rows + CLAUDE.md §0 pointer. The thing we keep forgetting: an agent IS its full AgentProfile (prompt+skills+tools/mcp+subagents+hooks+permissions+model), not a prompt; change behavior by AUTHORING the profile and letting the sandbox substrate materialize it into harness shapes; never write a verify-loop or harness-specific config (self-verification is a hook/process; opencode is only the cli-bridge test target; a missing lever is a substrate gap).

* docs(research): consolidate docs/research 28→14 - retire shipped/subsumed design docs

Retired 14 design-research docs whose content is now shipped code, in .evolve/current.json, or self-declared subsumed/retracted (the recursion atom shipped; the optimization-space layer evidence landed; verdicts reached). Refreshed the research index, recorded the retirement + rationale in deletion-ledger.md (Pass 2), and fixed every inbound link (top index, the harvest-corpus.ts comment → current.json, optimization-space's suite links). Kept the SSOT masterplan, the canonical-referenced maps (optimization-space/leapfrog), the two gated belief specs, the postmortem guardrail, the build-lists, and the agent-lab tombstones. No broken links into the 14 remain from any canonical doc or src/.

* docs(canonical-api): substrate pin 0.89 → 0.92 (matches the merged package.json)

drewstone merged commit 13b08bb into main May 8, 2026

drewstone mentioned this pull request May 10, 2026

chore: agent-runtime 0.5.5 — unify agent-eval / agent-knowledge dep tree #6

Merged

4 tasks

drewstone mentioned this pull request Jun 4, 2026

feat(loops): opt-in session-continuation + checkpoint-fork lineage for runLoop (backend-blind) #150

Merged

drewstone merged 1 commit into main from chore/bump-agent-eval-0.23.0