chore(deps): bump @tangle-network/agent-eval from ^0.20.0 to ^0.23.0 #3

Merged
drewstone merged 1 commit into main from chore/bump-agent-eval-0.23.0 on May 8, 2026
Conversation

@drewstone

Contributor

Summary

Mechanical version bump absorbing three minors of new agent-eval surface. The control-loop primitives this runtime consumes (runAgentControlLoop, scoreKnowledgeReadiness, blockingKnowledgeEval, userQuestionsForKnowledgeGaps, acquisitionPlansForKnowledgeGaps) are unchanged across 0.21 → 0.22 → 0.23, so the bump is a no-op functionally.

What's new in agent-eval 0.21 → 0.23 (not consumed here)

  • 0.21 capture-integrity (canary leak / golden-precision tracking, scenario hashing)
  • 0.22 campaign artifact (RL campaign result format, evidence metadata, integration manifest gates)
  • 0.23 RL bridge (hashJson / canonicalize exposed for arbitrary content signing; RunRecord.scenarioId made optional to support pre-trace records)

Code change

One small enrichment around the new optional RunRecord.scenarioId field: runAgentTask already knows the canonical scenarioId it passes into runAgentControlLoop (options.scenarioId ?? task.id). When adapter.projectRunRecords returns records without a scenarioId, the runtime now backfills it with that same canonical value before returning. Adapters that already set scenarioId are untouched. This means callers reading result.runRecords always see a populated scenarioId without each adapter having to thread it through.

The hoist of options.scenarioId ?? task.id into a local is purely so the same value is used at the control-loop call site and the post-projection backfill.
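
A minimal sketch of the backfill described above, assuming a trimmed-down `RunRecord` shape. The `runId` field and the `backfillScenarioId` helper name are illustrative and not taken from this repo; only `scenarioId`, `options.scenarioId` and `task.id` come from the description.

```typescript
// Hypothetical, trimmed-down shape: the real RunRecord in agent-eval carries more fields.
interface RunRecord {
  runId: string; // illustrative field, not confirmed by this PR
  scenarioId?: string; // optional as of agent-eval 0.23
}

type PopulatedRunRecord = RunRecord & { scenarioId: string };

// Fill in scenarioId only where the adapter left it undefined.
// Records that already carry a scenarioId are copied through unchanged.
function backfillScenarioId(
  records: readonly RunRecord[],
  canonicalScenarioId: string,
): PopulatedRunRecord[] {
  return records.map((record) => ({
    ...record,
    scenarioId: record.scenarioId ?? canonicalScenarioId,
  }));
}

// The hoisted local: one value shared by the control-loop call and the backfill.
const options: { scenarioId?: string } = {};
const task = { id: "task-1" };
const scenarioId = options.scenarioId ?? task.id;

const projected: RunRecord[] = [
  { runId: "a" }, // adapter omitted scenarioId
  { runId: "b", scenarioId: "adapter-set" }, // adapter set its own
];
const runRecords = backfillScenarioId(projected, scenarioId);
```

The sketch returns new objects rather than mutating the adapter's records; whether the runtime copies or mutates is not stated in the description.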

Test plan

  • pnpm install — resolves to @tangle-network/agent-eval@0.23.0
  • pnpm typecheck — clean
  • pnpm test — 16/16 pass (no count change)
  • pnpm build — clean (dist/index.js 34.20 KB → 34.31 KB, dist/index.d.ts unchanged)
  • Reviewer: confirm the scenarioId backfill behavior is wanted (alternative: leave RunRecords entirely as the adapter returned them)

Notes

  • The ^0.20.0 floor in the task description was actually ^0.20.12 in the lockfile; the bump goes from ^0.20.12 → ^0.23.0.
  • No RunRecord is constructed inside this repo — construction is delegated to user-supplied adapters via the projectRunRecords hook — so no other call sites needed touching.

🤖 Generated with Claude Code

Mechanical version bump absorbing 0.21/0.22/0.23 surface (capture-integrity,
campaign artifact, RL bridge). The control-loop primitives we consume
(runAgentControlLoop, scoreKnowledgeReadiness, blockingKnowledgeEval,
userQuestionsForKnowledgeGaps, acquisitionPlansForKnowledgeGaps) are
unchanged, so this is a no-op functionally.

Also: 0.23 made RunRecord.scenarioId optional. Backfill the canonical
scenarioId (options.scenarioId ?? task.id, the same value passed into the
control loop) onto adapter-projected records that omit it, so consumers of
runtime.runRecords always see a populated scenarioId without each adapter
having to thread it through.

Verification: pnpm typecheck, pnpm test (16 passed), pnpm build all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
drewstone merged commit 13b08bb into main on May 8, 2026
drewstone added a commit that referenced this pull request May 10, 2026
…ree (#6)

The agent-eval ^0.23.0 pin already landed in #3, but downstream lockfiles
still resolve agent-runtime@0.5.4 → a transitive agent-eval@0.20.12 entry.
Cutting 0.5.5 invalidates that lockfile entry so every consumer collapses
to a single agent-eval 0.23.x copy after their next pnpm install.

agent-runtime does not directly depend on agent-knowledge, so no
agent-knowledge pin change is required here — consumers pick up
^1.2.0 via their own package.json after this release lands.
tangletools pushed a commit that referenced this pull request Jun 4, 2026
…-loud session continuity

Resolve all six findings from the review (none blocked landing; #1 gated
enabling, #3/#4 wanted documenting). Lineage remains default-OFF and
byte-identical to the fresh-box path when both flags are unset.

- #1 sessionContinuity silent no-op: `continue` now asserts the session is
  still known to the sandbox via `box.session(id).status()` before streaming.
  A `null` (platform never honored the client-minted id, or it was reaped)
  raises a ValidationError, which executeIteration now propagates as a hard
  structural failure instead of degrading to a soft empty iteration — so a
  non-honoring platform errors loudly rather than running contextless turns.
- #2 unbounded fork creation: `fork` provisions child boxes through
  `mapWithConcurrency` bounded by the loop's `maxConcurrency`, not a single
  `Promise.all` over all N branches.
- #3 fork ignores per-branch specs: documented on `fork` and
  `LoopLineageOptions.forkFanout` that a real CRIU fork inherits the parent
  image/profile (per-branch specs apply only on the degraded fresh path).
- #4 lineage holds every box to loop end: kernel prunes boxes no future round
  can descend from after each round, gated on a kernel-inferred (monotonic)
  branch point — skipped when the driver authors its own `parentIndex`. The
  unprunable case is documented as the box ceiling.
- #5 abort during fork: documented the SDK's signal-less fork; abort is now
  checked per branch (between bounded waves) + an abort-under-lineage test.
- #6 export order: alphabetized the loops barrel.

Adds `mapWithConcurrency` util and six lineage tests (session-liveness pass/
fail, bounded-fork peak, mid-loop prune, no-prune-under-authored-parent,
abort-under-lineage). 627 tests pass, typecheck + biome clean.
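
For readers unfamiliar with the pattern behind finding #2, here is a rough sketch of a bounded-concurrency map. It assumes a worker-pool strategy and a `(items, limit, fn)` signature; the actual `mapWithConcurrency` util in the repo may use a different signature or run in waves, as the abort note above suggests.

```typescript
// Sketch only: signature and worker-pool strategy are assumptions, not the repo's util.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each worker claims items one at a time until none are left,
  // so at most `limit` calls to fn are in flight at once.
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index] as T, index);
    }
  };

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as `items`
}
```

Compared with a single `Promise.all` over all N branches, the peak number of in-flight provisions here is `min(limit, N)`. A rejection from `fn` rejects the whole call, but workers already running are not cancelled.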
drewstone added a commit that referenced this pull request Jun 4, 2026
…r runLoop (backend-blind) (#150)

* feat(loops): opt-in session continuation + checkpoint-fork lineage (backend-blind)

Two @experimental, default-OFF seams on runLoop so a loop can CONTINUE a sandbox
session across iterations (same box + sessionId, no prompt-text replay) and FORK
fanout branches from a parent checkpoint (shared context prefix) — both behind a
capability probe so the kernel asks 'can I fork?' (client.criuStatus) and never
names Docker/Firecracker, degrading to fresh boxes when CRIU is absent.

- sandbox-capabilities.ts: memoized, fail-closed criuStatus probe -> {canFork}.
- sandbox-lineage.ts: createSandboxLineage owns box+session handles with
  start/continue/fork/teardown; reuses the kernel's acquireSandbox /
  buildBackendOptions / deleteBoxSafe; fail-loud if the probe says canFork but
  the box has no fork().
- run-loop.ts: RunLoopOptions.lineage (sessionContinuity / forkFanout); refine
  continues, fanout forks-once, else fresh-through-lineage. Default OFF is
  byte-identical to today, so random@k stays N independent fresh boxes (the
  compute-control invariant). Rejects lineage + onWorkerBox (both own boxes).
- 7 new unit tests (continuation reuses session; fork when canFork; fresh
  fallback; default-off invariant). Full suite 621 pass, typecheck clean.
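
A sketch of what a memoized, fail-closed probe could look like. The `criuStatus()` return shape (`{ available }`) and the `createCapabilityProbe` name are assumptions for illustration; only `criuStatus` and `canFork` appear in the commit message.

```typescript
// Hypothetical client shape: the real criuStatus() result is not shown in this PR.
interface CriuProbeClient {
  criuStatus(): Promise<{ available: boolean }>;
}

interface SandboxCapabilities {
  canFork: boolean;
}

function createCapabilityProbe(
  client: CriuProbeClient,
): () => Promise<SandboxCapabilities> {
  let cached: Promise<SandboxCapabilities> | undefined;

  return () => {
    if (cached === undefined) {
      // Promise.resolve().then(...) also captures a synchronous throw from criuStatus().
      cached = Promise.resolve()
        .then(() => client.criuStatus())
        .then(
          (status) => ({ canFork: status.available === true }),
          // Fail closed: any probe error means "cannot fork", so the caller
          // degrades to fresh boxes instead of attempting a fork.
          () => ({ canFork: false }),
        );
    }
    return cached;
  };
}
```

This version also memoizes the failure result, so a transient probe error pins `canFork: false` for the lifetime of the probe. Whether the real probe retries is not stated here.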

* fix(loops): address PR #150 review — bound forks, prune lineage, fail-loud session continuity

Resolve all six findings from the review (none blocked landing; #1 gated
enabling, #3/#4 wanted documenting). Lineage remains default-OFF and
byte-identical to the fresh-box path when both flags are unset.

- #1 sessionContinuity silent no-op: `continue` now asserts the session is
  still known to the sandbox via `box.session(id).status()` before streaming.
  A `null` (platform never honored the client-minted id, or it was reaped)
  raises a ValidationError, which executeIteration now propagates as a hard
  structural failure instead of degrading to a soft empty iteration — so a
  non-honoring platform errors loudly rather than running contextless turns.
- #2 unbounded fork creation: `fork` provisions child boxes through
  `mapWithConcurrency` bounded by the loop's `maxConcurrency`, not a single
  `Promise.all` over all N branches.
- #3 fork ignores per-branch specs: documented on `fork` and
  `LoopLineageOptions.forkFanout` that a real CRIU fork inherits the parent
  image/profile (per-branch specs apply only on the degraded fresh path).
- #4 lineage holds every box to loop end: kernel prunes boxes no future round
  can descend from after each round, gated on a kernel-inferred (monotonic)
  branch point — skipped when the driver authors its own `parentIndex`. The
  unprunable case is documented as the box ceiling.
- #5 abort during fork: documented the SDK's signal-less fork; abort is now
  checked per branch (between bounded waves) + an abort-under-lineage test.
- #6 export order: alphabetized the loops barrel.

Adds `mapWithConcurrency` util and six lineage tests (session-liveness pass/
fail, bounded-fork peak, mid-loop prune, no-prune-under-authored-parent,
abort-under-lineage). 627 tests pass, typecheck + biome clean.
drewstone added a commit that referenced this pull request Jun 16, 2026
…dation (#304)

* feat(intelligence): capability-delivery manifest — composeCertifiedProfile + resolver + ladder

Add the unified, future-proof delivery structure: one certified unit of agent
power = { interface, binding }. Interfaces are CLOSED (tool / mcp / context /
retrieval / hook / subagent); bindings are OPEN (inline / file / http /
sandbox-code / mcp-stdio / mcp-remote / process-on-infra / rag-index /
memory-store / wasm / a2a). A single resolver lowers any binding into one uniform
ResolvedSurface consumed identically by the host seam (RouterToolsSeam tools +
executeToolCall) and the sandbox seam (AgentProfile).

- src/intelligence/capability.ts: the manifest types + CapabilityNotAdmittedError
  + manifestFromProfile (lowers today's CertifiedProfile wire into capabilities[]
  with best-effort binding inference, so the spine delivers value before the
  plane changes).
- src/intelligence/resolver.ts: composeCertifiedProfile — the spine resolves
  inline/file (byte-identical to composeCertifiedPrompt, the regression lock),
  mcp-stdio/mcp-remote (strict union widens to the SDK's flat
  AgentProfileMcpServer — an always-valid lowering), and http tools (the host
  seam). Ladder rungs that need infra (sandbox-code, process-on-infra) are
  injected ResolveCtx providers; rag-index/memory-store/wasm/a2a throw
  CapabilityNotAdmittedError (memory gated on the E3 admission bar). Fail-closed:
  null manifest -> base surface, per-capability failure -> drop (diagnostic via
  onDrop), post-resolve drift drops any tool/mcp whose live names diverge.
- src/mcp/delegation-profile.ts: composeProductionAgentProfile now also merges
  tools box-flags, hooks, subagents, and injects ResolvedSurface.mcpConnections
  into AgentProfile.mcp (the sandbox-seam mapping).
- exports + export gate + the two spec corrections (mcp lowers via always-valid
  widening; tools lower two ways since AgentProfile.tools is box flags).
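
To make the closed-interface / open-binding split and the fail-closed drop concrete, here is a heavily simplified sketch. The interface names, the binding kinds and `CapabilityNotAdmittedError` come from the message above; the `Capability` fields, the string-based surface and the `resolveManifest` helper are hypothetical and far smaller than the real `composeCertifiedProfile`.

```typescript
// Closed set: adding an interface is a type-level change.
type CapabilityInterface = "tool" | "mcp" | "context" | "retrieval" | "hook" | "subagent";

// Open set: binding kinds are plain strings so new ones can be added freely.
interface Capability {
  name: string;
  interface: CapabilityInterface;
  binding: { kind: string };
}

class CapabilityNotAdmittedError extends Error {
  constructor(public readonly capabilityName: string, kind: string) {
    super(`binding "${kind}" is not admitted for capability "${capabilityName}"`);
    this.name = "CapabilityNotAdmittedError";
  }
}

// Binding kinds the commit message says currently throw.
const NOT_ADMITTED = new Set(["rag-index", "memory-store", "wasm", "a2a"]);

// Stand-in for real lowering: returns a label instead of a ResolvedSurface entry.
function resolveOne(cap: Capability): string {
  if (NOT_ADMITTED.has(cap.binding.kind)) {
    throw new CapabilityNotAdmittedError(cap.name, cap.binding.kind);
  }
  return `${cap.interface}:${cap.name}`;
}

// Fail-closed: a null manifest yields the base (empty) surface, and a
// capability that fails to resolve is dropped and reported via onDrop.
function resolveManifest(
  manifest: readonly Capability[] | null,
  onDrop?: (cap: Capability, reason: unknown) => void,
): string[] {
  if (manifest === null) return [];
  const surface: string[] = [];
  for (const cap of manifest) {
    try {
      surface.push(resolveOne(cap));
    } catch (reason) {
      onDrop?.(cap, reason);
    }
  }
  return surface;
}
```

One failing capability never aborts the whole resolve; the caller only learns about it through the `onDrop` diagnostic.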

* docs(rsi): correct depth>breadth to the POWER-16 tie at n=48 (not the n=16 +16.4pp)

The +16.4pp CI[+5.3,+29.8] n=16 depth-steered-continuation result did not
replicate when powered: depth-breadth = +4.7pp CI[-1.9,+11.4] at n=48 (a tie;
+4.1pp at n=72). architecture.md and roadmap-rsi.md advertised it as a cleared
keystone; they now carry the retraction and point at .evolve/current.json.

* chore(clean): remove dead mock loop + orphan re-exports/interface (432 LOC)

- delete bench/src/observe-steer-workspace-loop.mts (the #194 mock anti-pattern; 0 inbound refs)
- drop orphan pass-through re-exports CaptureIntegrityError/ReplayError/VerificationError (src/errors.ts)
- drop orphan interface AgentTaskRunSummary (src/types.ts)
- fix doc-rot in loop-facade-postmortem.md; gitignore stray test_repo/
- deletion-ledger.md tracks deletions + the deferred migrations (driver.ts 12 callers, AgentProfile superset)

Gates verified by hand: typecheck 0, lint 0, 924 tests pass / 0 fail. Load-bearing fail-loud fences left intact (NOT dead code).

* docs(research): atom-compression plan, harness-compat matrix, long-horizon map

* chore(deps): bump @types/node 25.9.3 + playwright 1.61.0 (dev)

Safe minor/patch dev bumps, gates verified (typecheck 0, lint 0, 924 tests pass).
Deferred (need own careful pass): biome 2.5 (13 new lint warnings), typescript 6 + vitest 4 (majors), agent-eval 0.92 (substrate — sync with the AgentProfile superset work).

* docs(research): RSI atom masterplan + build tracker (single source of truth)

* docs(research): collapse N driver prompts → one cached generator (software 3.0)

Replace the per-role hand-coded prompt builders with generateDriverSystemPrompt(spec):
a (fused) router call generates the driver prompt from {role,goal,target,harness,stance},
cached for semantic reuse via PromptRegistry + hashContent key (file/JSON or DB). The
hand-authored worker-driver prompt becomes the generator's seed + its tests the invariants.
Single optimizable surface; depends on a tangle-router fusion primitive (separate issue).

* docs(research): active push — RUN/DELETE/IMPROVE worklist (delete createDriver, run commit0, deep-clean, dedup)

* docs(research): createDriver delete BLOCKED (paradigm diff, evidenced); commit0 RAN; the delete fork

* refactor(runtime): full nuke of the createDriver/string-prompt measurement+eval paradigm

DELETE the wrong abstraction (createDriver = a code TopologyPlanner driving runLoop over
string-prompt→string-answer calls, judged by adapter.judge) and the entire old bench
experiment + eval-gen apparatus built on it. The agent-driver (AgentProfile driving
AgentProfile via coordination tools) replaces it; the runLoop KERNEL and the Scope/
Supervisor are untouched.

Deleted (15): src/runtime/driver.ts; bench experiment.ts(+test)/steering-experiment(+test)/
improve-prompt/research-loop/finsearch-loop/rsi/generate-eval/run-benchmarks/run.ts/
skills-sandbox/profile-coord-sandbox; tests/loops/dynamic.test.ts.
Survivors (search-bench/cloud-loop/fleet/commit0-gate) re-homed onto a new pure helper
bench/src/sandbox-run.ts (answerOutput/sandboxAgentRun/WorkerBackendType/AnalystFn/llmAnalyst —
no experiment shell). runLoop kernel tests kept via a scriptedDriver stub in refine-driver.ts.

Gates (hand-verified): build 0, typecheck 0 (root+bench; also fixed a pre-existing bench
BackendType red), lint 0, 905 tests pass. Zero dangling code refs.

ACCEPTED casualties of the full nuke (rebuild on the agent-driver/Supervisor path when wanted):
the generate-eval data engine, the AgentProfile-coordinate optimizer (profile-coord), and
run.ts's non-experiment subcommands (preflight/verify-judge/solve-one/ui-review).

* docs(research): full nuke DONE (-3492 LOC); doc/skill-rot follow-up tracked

* docs(cleanup): retarget all docs+skills off the nuked createDriver/runExperiment to the agent-driver/Supervisor reality

* feat(supervise): recursive driver-executor — agents driving agents driving agents

A spawned child can now BE a driver. driverExecutorFactory mounts a NESTED Scope over
the SAME conserved budget pool + shared journal (scope.ts's new NestedScopeSeam), one
depth deeper, and runs the wrapped driver's act there. A child resolves to a LEAF
(worker) OR — for a role:'driver' spec, via withDriverExecutor — this executor,
recursively. So a driver spawns a driver spawns a worker on one budget-conserving tree.
The persona/strategy spawn fences now route a driver child to the recursive executor
(compose) instead of throwing; act still fails loud only if a child is run directly.

Reuses the atom — builds NO new budget/journal/selection logic. Budget conserved across
depth (reserve-on-spawn fails closed at any depth), spend bubbles to root, journal records
each nested tree, maxDepth enforced across recursion.
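
A toy model of the conservation property, assuming every spawn at any depth reserves against one shared pool object. `BudgetPool`, this `Scope` class and `spawnNested` are illustrative names only and do not mirror the real `Scope` or `NestedScopeSeam` API.

```typescript
// One pool shared by the whole tree: reservations at any depth draw from it.
class BudgetPool {
  private reserved = 0;
  constructor(readonly total: number) {}

  // Returns false instead of over-committing (fail closed).
  reserve(amount: number): boolean {
    if (amount <= 0 || this.reserved + amount > this.total) return false;
    this.reserved += amount;
    return true;
  }

  get available(): number {
    return this.total - this.reserved;
  }
}

class Scope {
  constructor(
    private readonly pool: BudgetPool,
    readonly depth: number,
    private readonly maxDepth: number,
  ) {}

  // A nested scope shares the SAME pool, one level deeper.
  spawnNested(reserve: number): Scope {
    if (this.depth + 1 > this.maxDepth) {
      throw new Error(`maxDepth ${this.maxDepth} exceeded`);
    }
    if (!this.pool.reserve(reserve)) {
      throw new Error("budget exhausted: reservation refused");
    }
    return new Scope(this.pool, this.depth + 1, this.maxDepth);
  }
}

const pool = new BudgetPool(10);
const root = new Scope(pool, 0, 2);
const mid = root.spawnNested(6); // depth 1
const inner = mid.spawnNested(3); // depth 2, pool now has 1 left
```

Because nested scopes hold a reference to the same pool rather than a copy, a driver two levels down cannot spend budget the root has not got, and the depth ceiling is checked against the absolute depth rather than per driver.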

Proven OFFLINE (no creds; scripted drivers+workers) in tests/loops/driver-recursion.test.ts:
depth-2 chain root->mid->inner->worker (node id rec:s0:s0:s0 — a non-recursive build cannot
produce it), fail-closed budget conservation across depth, spend roll-up (spentTotal = the
worker's exact spend), nested-journal sub-trees, depth-ceiling across recursion.
Gates hand-verified: build 0, typecheck 0, lint 0, 911 tests pass.

* refactor(bench)+docs: reclaim runKeystoneGate -> runGate; strip 4 docs to latest-only

Rename the opaque 'keystone' jargon: runKeystoneGate->runGate (+ RunGateOptions/GateArmResult/
GateReport), bench/src/keystone-gate.ts->gate.ts (+ -cli, +test), all import paths and CLI
banners. Strip the last historical createDriver/runExperiment 'was removed/nuked' breadcrumbs
from architecture.md, architecture-interpretations.md, learning-flywheel.md, roadmap-rsi.md —
upgrading agents now see only the current agent-driver/Supervisor reality (history lives in git).
Gates green; 905->911 with the keystone test.

* docs(research): keystone recursion ✅ (9d188e1); createDriver retire ✅ via nuke; #2b brain next

* feat(supervise): coordinationDriverAgent — the cheap/offline driver (LLM tool-loop over the coordination verbs)

The CHEAP, in-process, no-creds variant of the recursive driver: act() mounts
createCoordinationTools over its scope and runs an LLM tool-loop (injected chat seam) so the
driver REASONS spawn/steer/await/stop; composes with 2a recursion (a driver agent spawns a
driver agent, via makeWorkerAgent -> driverChild). NOT the primary driver — the CAPABLE
driver is a sandbox agent with the coordination verbs as an MCP. This one is the
offline-testable + cheap-orchestration path. Prompt is INJECTED (decoupled from agent-eval).

Proven OFFLINE (no creds, scripted mock chat) tests/loops/coordination-driver.test.ts:
the tool-loop drives real Scope.spawn via the coordination verbs + folds results back; a
driver AGENT spawns a driver AGENT (separate nested journal tree). typecheck 0, lint 0.

* docs(research): dual-purpose resolution (one substrate serves product + proof); #2b cheap driver done, #2c capable sandbox driver + #3 completion-oracle next

* feat(supervise): completion-oracle — settled ⟺ delivered (Foreman 0/18)

The honest settle: a node counts as delivered only when a deployable check
passes, never on self-report.

- completion-gate.ts: gateOnDeliverable wraps any Executor so its settlement
  valid reflects a DeliverableSpec check (both execute shapes; fail-closed).
- coordination-driver finalize: returns the best DELIVERED child; undefined
  when none delivered — a driver cannot self-declare done via prose.
- driver-executor: derive the driver child's verdict from its direct settled
  events, so delivery composes UP the recursion (a sub-driver is valid only
  when it itself selected a delivered child).
- supervisor: a winner MUST carry a real Out; a successful act that produced
  nothing is a no-winner, never a winner wrapping undefined.

8 offline tests: leaf gate (both execute shapes, fail-closed), ran-but-didn't-
deliver yields no winner, the gate dominates score, delivery propagates up the
recursion.
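
The gating idea can be sketched as a wrapper around an executor. The `Settlement`, `Executor` and `DeliverableSpec` shapes below are assumptions (the message mentions two execute shapes; only one is modelled here), so read this as an illustration of "valid only when the check passes", not the contents of completion-gate.ts.

```typescript
// Hypothetical shapes; the real types in supervise/ are not shown in this PR.
interface Settlement<Out> {
  valid: boolean;
  out?: Out;
}

type Executor<In, Out> = (input: In) => Promise<Settlement<Out>>;

interface DeliverableSpec<Out> {
  // A deployable check, e.g. "the test suite passes", never a self-report.
  check(out: Out): boolean | Promise<boolean>;
}

function gateOnDeliverable<In, Out>(
  executor: Executor<In, Out>,
  spec: DeliverableSpec<Out>,
): Executor<In, Out> {
  return async (input) => {
    const settlement = await executor(input);

    // Ran but produced nothing: never valid, whatever the executor claimed.
    if (settlement.out === undefined) return { ...settlement, valid: false };

    try {
      const delivered = await spec.check(settlement.out);
      // The executor's own `valid` is ignored; only the check decides.
      return { ...settlement, valid: delivered === true };
    } catch {
      // Fail closed: a crashing check counts as not delivered.
      return { ...settlement, valid: false };
    }
  };
}
```

The wrapped executor has the same type as the original, so a gated worker can be dropped in wherever an ungated one was used.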

* docs(research): completion-oracle #3 ✅ (bd58761) — settled ⟺ delivered, composes up the recursion

* feat(bench): atom-humaneval — agents-driving-agents on a live deployable-checked domain

A coordinationDriverAgent (real router brain) drives gated workers on HumanEval:
each worker is settled valid ONLY when the local Docker test suite passes
(completion-oracle, not self-report), against a blind best-of-K baseline. Proven
live: the driver spawns, the worker solves, the checker gates, the supervisor
returns a winner only on real delivery.

Also exports gateOnDeliverable/DeliverableSpec from the runtime barrel (the #3
primitive was added to supervise/ but not surfaced on the package).

* feat(topology): animated visual replay of a recursive agent run

Fold the one runtime-hooks stream into a timestamped ReplayEvent[] (createReplayRecorder)
and render a self-contained, scrubbable HTML player (renderReplayHtml) — the recursive
agent tree animated over wall-clock, each node colored by the completion-oracle: delivered
(valid) green, ran-but-not-delivered amber, failed red, with live token/cost counters.
Synthesizes the unspawned root driver so the whole recursion renders. No server/build/deps.

Wired into atom-humaneval (every driver run emits a replay.html). 4 offline recorder tests;
proven on a live HumanEval run (driver -> worker -> delivered).

* fix(bench): atom-humaneval blind arm survives transient router errors (a 502 is a failed attempt, not a crash — matches the driver arm's down-typing)

* feat(supervise): the supervisor AUTHORS worker profiles from a skill (the intelligence, not the plumbing)

The supervisor's job is to DESIGN the agents it spawns — read the task, decompose it,
and author a tailored profile (instructions + model) per worker. supervisorSkill is the
how-to it reads (its own system prompt) — THE optimizable self-improvement surface;
authoredWorker builds a worker from an authored profile; asAuthoredProfile catches
empty/placeholder profiles (a skill violation).

Proven offline (no creds, no plumbing): a skill-guided supervisor authors DISTINCT,
tailored worker recipes per sub-task and they flow to the workers. 3 tests.

* feat(supervise): coordination MCP over a live Scope — the real keystone for in-box driving

serveCoordinationMcp fronts a live Scope with an HTTP JSON-RPC MCP server: an in-box
coding harness (opencode via cli-bridge) mounts mcp.mcpServers.coordination and calls
spawn_worker as a native tool, landing on Scope.spawn — a real box driving real boxes,
not emulated function-tools. Real test: HTTP tools/call spawn_worker -> Scope.spawn ->
worker settles -> winner (no mock of the MCP path). Plus the standard supervise SKILL.md.
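
For orientation, this is roughly what an HTTP `tools/call` request for `spawn_worker` looks like on the wire. The JSON-RPC envelope and the `params.name` / `params.arguments` layout follow the MCP tool-call shape; the argument fields (`instructions`, `model`) are made up for illustration, since the real `spawn_worker` input schema is not shown here.

```typescript
// JSON-RPC 2.0 envelope for an MCP tool call.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Hypothetical arguments: the actual spawn_worker schema may differ.
const request = buildToolCall(1, "spawn_worker", {
  instructions: "Implement the function and make the tests pass.",
  model: "example-model",
});

// The body a client would POST to the coordination MCP endpoint.
const body = JSON.stringify(request);
```

A real MCP client normally performs the protocol's `initialize` handshake before issuing tool calls; that step is omitted from this sketch.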

* feat(bench): prove a coding harness drives the Scope via the coordination MCP (live)

opencode (glm-5-turbo via cli-bridge) mounts mcp.mcpServers.coordination (type:http →
opencode remote) and calls spawn_worker itself → real Scope.spawn → worker settles, and
reads back the await_next result. The in-box driving path is REAL — a coding agent drives
recursion as a native tool, not emulated. (Bridge wants mcp type:'http', not 'remote'.)

* feat(bench): WHOLE real e2e — opencode supervisor drives opencode workers via the coordination MCP, real test gates delivery

Live, no mock: the opencode supervisor (glm-5-turbo via cli-bridge) mounts the coordination
MCP, authors worker profiles, calls spawn_worker -> real Scope.spawn -> real opencode workers
code in a cwd -> python3 test gates valid -> supervisor settles on the delivered worker ->
winner. The completion-oracle (deployable check, not LLM judge) decided delivery over the
supervisor's confusion that it couldn't see the workers' isolated cwds (→ shared Workspace next).

Proof artifact for the in-box-driving path; the law-compliant productionization is a substrate
backend (tmux/bridge/sandbox) that runs authored profiles — not this harness-specific script.

* docs(canonical-api): the AgentProfile law — author the profile, the substrate materializes it

§1.5 + decision-table rows + CLAUDE.md §0 pointer. The thing we keep forgetting: an agent IS
its full AgentProfile (prompt+skills+tools/mcp+subagents+hooks+permissions+model), not a prompt;
change behavior by AUTHORING the profile and letting the sandbox substrate materialize it into
harness shapes — never write a verify-loop or harness-specific config (self-verification is a
hook/process; opencode is only the cli-bridge test target; a missing lever is a substrate gap).

* docs(research): consolidate docs/research 28→14 — retire shipped/subsumed design docs

Retired 14 design-research docs whose content is now shipped code, in .evolve/current.json, or
self-declared subsumed/retracted (the recursion atom shipped; the optimization-space layer
evidence landed; verdicts reached). Refreshed the research index, recorded the retirement +
rationale in deletion-ledger.md (Pass 2), and fixed every inbound link (top index, the
harvest-corpus.ts comment → current.json, optimization-space's suite links). Kept the SSOT
masterplan, the canonical-referenced maps (optimization-space/leapfrog), the two gated belief
specs, the postmortem guardrail, the build-lists, and the agent-lab tombstones. No broken links
into the 14 remain from any canonical doc or src/.

* docs(canonical-api): substrate pin 0.89 → 0.92 (matches the merged package.json)