chore(deps): bump @tangle-network/agent-eval from ^0.20.0 to ^0.23.0 #3
Merged
Mechanical version bump absorbing 0.21/0.22/0.23 surface (capture-integrity, campaign artifact, RL bridge). The control-loop primitives we consume (runAgentControlLoop, scoreKnowledgeReadiness, blockingKnowledgeEval, userQuestionsForKnowledgeGaps, acquisitionPlansForKnowledgeGaps) are unchanged, so this is a no-op functionally.

Also: 0.23 made RunRecord.scenarioId optional. Backfill the canonical scenarioId (options.scenarioId ?? task.id, the same value passed into the control loop) onto adapter-projected records that omit it, so consumers of runtime.runRecords always see a populated scenarioId without each adapter having to thread it through.

Verification: pnpm typecheck, pnpm test (16 passed), pnpm build all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
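The backfill described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the `RunRecord`, `Task`, and `Options` shapes are trimmed-down assumptions, `backfillScenarioId` and `canonicalScenarioId` are hypothetical helper names, and "omits it" is assumed to mean the field is `undefined`.

```typescript
// Hypothetical, reduced shapes. The real RunRecord in agent-eval carries more fields.
interface RunRecord {
  runId: string;
  scenarioId?: string; // optional as of agent-eval 0.23, per the PR description
}
interface Task {
  id: string;
}
interface Options {
  scenarioId?: string;
}

// The canonical value: the explicit option wins, otherwise fall back to the task id.
function canonicalScenarioId(options: Options, task: Task): string {
  return options.scenarioId ?? task.id;
}

// Fill scenarioId only where the adapter left it out.
// Records that already carry one are returned as-is, and the input is not mutated.
function backfillScenarioId(records: readonly RunRecord[], scenarioId: string): RunRecord[] {
  return records.map((record) =>
    record.scenarioId === undefined ? { ...record, scenarioId } : record,
  );
}

const scenarioId = canonicalScenarioId({}, { id: "task-42" });
const projected: RunRecord[] = [
  { runId: "a" }, // adapter omitted scenarioId
  { runId: "b", scenarioId: "adapter-set" }, // adapter set its own
];
const backfilled = backfillScenarioId(projected, scenarioId);
console.log(backfilled.map((r) => r.scenarioId));
```

Computing the canonical value once and reusing it mirrors the hoist-into-a-local described in the PR: the control-loop call site and the backfill cannot drift apart.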
drewstone added a commit that referenced this pull request on May 10, 2026
…ree (#6)

The agent-eval ^0.23.0 pin already landed in #3, but downstream lockfiles still resolve agent-runtime@0.5.4 → a transitive agent-eval@0.20.12 entry. Cutting 0.5.5 invalidates that lockfile entry so every consumer collapses to a single agent-eval 0.23.x copy after their next pnpm install.

agent-runtime does not directly depend on agent-knowledge, so no agent-knowledge pin change is required here; consumers pick up ^1.2.0 via their own package.json after this release lands.
tangletools pushed a commit that referenced this pull request on Jun 4, 2026
…-loud session continuity

Resolve all six findings from the review (none blocked landing; #1 gated enabling, #3/#4 wanted documenting). Lineage remains default-OFF and byte-identical to the fresh-box path when both flags are unset.

- #1 sessionContinuity silent no-op: `continue` now asserts the session is still known to the sandbox via `box.session(id).status()` before streaming. A `null` (platform never honored the client-minted id, or it was reaped) raises a ValidationError, which executeIteration now propagates as a hard structural failure instead of degrading to a soft empty iteration, so a non-honoring platform errors loudly rather than running contextless turns.
- #2 unbounded fork creation: `fork` provisions child boxes through `mapWithConcurrency` bounded by the loop's `maxConcurrency`, not a single `Promise.all` over all N branches.
- #3 fork ignores per-branch specs: documented on `fork` and `LoopLineageOptions.forkFanout` that a real CRIU fork inherits the parent image/profile (per-branch specs apply only on the degraded fresh path).
- #4 lineage holds every box to loop end: kernel prunes boxes no future round can descend from after each round, gated on a kernel-inferred (monotonic) branch point (skipped when the driver authors its own `parentIndex`). The unprunable case is documented as the box ceiling.
- #5 abort during fork: documented the SDK's signal-less fork; abort is now checked per branch (between bounded waves) + an abort-under-lineage test.
- #6 export order: alphabetized the loops barrel.

Adds `mapWithConcurrency` util and six lineage tests (session-liveness pass/fail, bounded-fork peak, mid-loop prune, no-prune-under-authored-parent, abort-under-lineage). 627 tests pass, typecheck + biome clean.
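The difference between a bounded map and a single `Promise.all` over all N branches (finding #2) can be illustrated with a small worker-pool sketch. The signature below is an assumption, not the util added in the commit, and it uses a pool of workers rather than the "bounded waves" the message mentions; the demo's peak counter is purely illustrative.

```typescript
// Hypothetical signature: the real mapWithConcurrency util may differ.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0;

  // Each worker claims the next unclaimed index until the input is exhausted.
  // Claiming is synchronous, so two workers never take the same index.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index] as T, index);
    }
  }

  // At most `limit` workers exist, so at most `limit` calls to fn are in flight.
  const workers = Array.from({ length: Math.min(limit, items.length) }, () => worker());
  await Promise.all(workers);
  return results;
}

// Demo: track how many calls overlap while mapping six items with a limit of two.
let inFlight = 0;
let peak = 0;
const demo = mapWithConcurrency([1, 2, 3, 4, 5, 6], 2, async (n) => {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await Promise.resolve(); // stand-in for provisioning a child box
  inFlight--;
  return n * 10;
});
demo.then((out) => console.log(out, "peak in flight:", peak));
```

Results keep input order regardless of completion order, which matters when branch index is meaningful to the caller.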
drewstone added a commit that referenced this pull request on Jun 4, 2026
…r runLoop (backend-blind) (#150)

* feat(loops): opt-in session continuation + checkpoint-fork lineage (backend-blind)

Two @experimental, default-OFF seams on runLoop so a loop can CONTINUE a sandbox session across iterations (same box + sessionId, no prompt-text replay) and FORK fanout branches from a parent checkpoint (shared context prefix). Both sit behind a capability probe so the kernel asks 'can I fork?' (client.criuStatus) and never names Docker/Firecracker, degrading to fresh boxes when CRIU is absent.

- sandbox-capabilities.ts: memoized, fail-closed criuStatus probe -> {canFork}.
- sandbox-lineage.ts: createSandboxLineage owns box+session handles with start/continue/fork/teardown; reuses the kernel's acquireSandbox / buildBackendOptions / deleteBoxSafe; fail-loud if the probe says canFork but the box has no fork().
- run-loop.ts: RunLoopOptions.lineage (sessionContinuity / forkFanout); refine continues, fanout forks-once, else fresh-through-lineage. Default OFF is byte-identical to today, so random@k stays N independent fresh boxes (the compute-control invariant). Rejects lineage + onWorkerBox (both own boxes).
- 7 new unit tests (continuation reuses session; fork when canFork; fresh fallback; default-off invariant). Full suite 621 pass, typecheck clean.

* fix(loops): address PR #150 review - bound forks, prune lineage, fail-loud session continuity
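A memoized, fail-closed capability probe of the kind attributed to sandbox-capabilities.ts might look like the sketch below. Only the `criuStatus` name and the `{canFork}` result come from the commit message; the client interface, the `available` field, and the `probeCapabilities` name are assumptions made for illustration.

```typescript
// Hypothetical client shape: the { available } field is an assumption.
interface SandboxClientSketch {
  criuStatus(): Promise<{ available: boolean }>;
}
interface SandboxCapabilities {
  canFork: boolean;
}

// One cached probe per client, keyed weakly so clients can be garbage collected.
const probeCache = new WeakMap<SandboxClientSketch, Promise<SandboxCapabilities>>();

function probeCapabilities(client: SandboxClientSketch): Promise<SandboxCapabilities> {
  let cached = probeCache.get(client);
  if (cached === undefined) {
    cached = Promise.resolve()
      .then(() => client.criuStatus())
      .then((status) => ({ canFork: status.available === true }))
      // Fail closed: any probe error means "cannot fork", never "maybe".
      .catch(() => ({ canFork: false }));
    probeCache.set(client, cached);
  }
  return cached;
}

// Demo clients: one whose probe always fails, one that reports CRIU as available.
let flakyCalls = 0;
const flaky: SandboxClientSketch = {
  criuStatus: () => {
    flakyCalls++;
    return Promise.reject(new Error("probe unreachable"));
  },
};
const healthy: SandboxClientSketch = {
  criuStatus: () => Promise.resolve({ available: true }),
};

probeCapabilities(flaky).then((caps) => console.log("flaky canFork:", caps.canFork));
probeCapabilities(healthy).then((caps) => console.log("healthy canFork:", caps.canFork));
```

One trade-off in this sketch: a transient probe failure is cached as `canFork: false` for the lifetime of the client. That is the conservative reading of "fail-closed"; whether the real probe retries is not stated in the commit.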
drewstone added a commit that referenced this pull request on Jun 16, 2026
…dation (#304)

* feat(intelligence): capability-delivery manifest - composeCertifiedProfile + resolver + ladder

Add the unified, future-proof delivery structure: one certified unit of agent power = { interface, binding }. Interfaces are CLOSED (tool / mcp / context / retrieval / hook / subagent); bindings are OPEN (inline / file / http / sandbox-code / mcp-stdio / mcp-remote / process-on-infra / rag-index / memory-store / wasm / a2a). A single resolver lowers any binding into one uniform ResolvedSurface consumed identically by the host seam (RouterToolsSeam tools + executeToolCall) and the sandbox seam (AgentProfile).

- src/intelligence/capability.ts: the manifest types + CapabilityNotAdmittedError + manifestFromProfile (lowers today's CertifiedProfile wire into capabilities[] with best-effort binding inference, so the spine delivers value before the plane changes).
- src/intelligence/resolver.ts: composeCertifiedProfile, where the spine resolves inline/file (byte-identical to composeCertifiedPrompt, the regression lock), mcp-stdio/mcp-remote (strict union widens to the SDK's flat AgentProfileMcpServer, an always-valid lowering), and http tools (the host seam). Ladder rungs that need infra (sandbox-code, process-on-infra) are injected ResolveCtx providers; rag-index/memory-store/wasm/a2a throw CapabilityNotAdmittedError (memory gated on the E3 admission bar). Fail-closed: null manifest -> base surface, per-capability failure -> drop (diagnostic via onDrop), post-resolve drift drops any tool/mcp whose live names diverge.
- src/mcp/delegation-profile.ts: composeProductionAgentProfile now also merges tools box-flags, hooks, subagents, and injects ResolvedSurface.mcpConnections into AgentProfile.mcp (the sandbox-seam mapping).
- exports + export gate + the two spec corrections (mcp lowers via always-valid widening; tools lower two ways since AgentProfile.tools is box flags).
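The closed-interface / open-binding split and the fail-closed resolution rules in the entry above can be illustrated with a heavily reduced sketch. Everything here is an assumption except the interface names, the binding names, and the `CapabilityNotAdmittedError` / `onDrop` vocabulary: the `Capability` and `SurfaceSketch` shapes, `composeSurfaceSketch`, and the choice to admit only `inline` and `http` are simplifications, and the class below is a local stand-in rather than the exported one.

```typescript
// Closed set of interfaces, as listed in the commit message.
type CapabilityInterface = "tool" | "mcp" | "context" | "retrieval" | "hook" | "subagent";

// Hypothetical capability shape. binding.kind is a plain string on purpose:
// the set of bindings is open, so unknown kinds are legal to declare.
interface Capability {
  name: string;
  interface: CapabilityInterface;
  binding: { kind: string; value?: string };
}

// Hypothetical, much-reduced stand-in for the resolved surface.
interface SurfaceSketch {
  prompt: string[];
  tools: string[];
}

// Local stand-in for the error type named in the commit.
class CapabilityNotAdmittedError extends Error {
  bindingKind: string;
  constructor(bindingKind: string) {
    super(`binding "${bindingKind}" is not admitted`);
    this.name = "CapabilityNotAdmittedError";
    this.bindingKind = bindingKind;
  }
}

// Lower one capability into the surface, or throw if its binding is not admitted.
function lower(cap: Capability, surface: SurfaceSketch): void {
  switch (cap.binding.kind) {
    case "inline":
      surface.prompt.push(cap.binding.value ?? "");
      return;
    case "http":
      surface.tools.push(cap.name);
      return;
    default:
      throw new CapabilityNotAdmittedError(cap.binding.kind);
  }
}

function composeSurfaceSketch(
  manifest: Capability[] | null,
  base: SurfaceSketch,
  onDrop: (name: string, reason: string) => void,
): SurfaceSketch {
  if (manifest === null) return base; // fail closed: no manifest means the base surface
  const surface: SurfaceSketch = { prompt: [...base.prompt], tools: [...base.tools] };
  for (const cap of manifest) {
    try {
      lower(cap, surface);
    } catch (error) {
      // A failing capability is dropped with a diagnostic. It never aborts the others.
      onDrop(cap.name, error instanceof Error ? error.message : String(error));
    }
  }
  return surface;
}

const dropped: string[] = [];
const base: SurfaceSketch = { prompt: ["You are a careful agent."], tools: [] };
const surface = composeSurfaceSketch(
  [
    { name: "style-guide", interface: "context", binding: { kind: "inline", value: "Prefer small diffs." } },
    { name: "search", interface: "tool", binding: { kind: "http", value: "https://example.invalid/search" } },
    { name: "scratchpad", interface: "tool", binding: { kind: "wasm" } },
  ],
  base,
  (name) => dropped.push(name),
);
console.log(surface, dropped);
```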
* docs(rsi): correct depth>breadth to the POWER-16 tie at n=48 (not the n=16 +16.4pp)

The +16.4pp CI[+5.3,+29.8] n=16 depth-steered-continuation result did not replicate when powered: depth-breadth = +4.7pp CI[-1.9,+11.4] at n=48 (a tie; +4.1pp at n=72). architecture.md and roadmap-rsi.md advertised it as a cleared keystone; they now carry the retraction and point at .evolve/current.json.

* chore(clean): remove dead mock loop + orphan re-exports/interface (432 LOC)

- delete bench/src/observe-steer-workspace-loop.mts (the #194 mock anti-pattern; 0 inbound refs)
- drop orphan pass-through re-exports CaptureIntegrityError/ReplayError/VerificationError (src/errors.ts)
- drop orphan interface AgentTaskRunSummary (src/types.ts)
- fix doc-rot in loop-facade-postmortem.md; gitignore stray test_repo/
- deletion-ledger.md tracks deletions + the deferred migrations (driver.ts 12 callers, AgentProfile superset)

Gates verified by hand: typecheck 0, lint 0, 924 tests pass / 0 fail. Load-bearing fail-loud fences left intact (NOT dead code).

* docs(research): atom-compression plan, harness-compat matrix, long-horizon map

* chore(deps): bump @types/node 25.9.3 + playwright 1.61.0 (dev)

Safe minor/patch dev bumps, gates verified (typecheck 0, lint 0, 924 tests pass). Deferred (need own careful pass): biome 2.5 (13 new lint warnings), typescript 6 + vitest 4 (majors), agent-eval 0.92 (substrate; sync with the AgentProfile superset work).

* docs(research): RSI atom masterplan + build tracker (single source of truth)

* docs(research): collapse N driver prompts → one cached generator (software 3.0)

Replace the per-role hand-coded prompt builders with generateDriverSystemPrompt(spec): a (fused) router call generates the driver prompt from {role,goal,target,harness,stance}, cached for semantic reuse via PromptRegistry + hashContent key (file/JSON or DB). The hand-authored worker-driver prompt becomes the generator's seed + its tests the invariants.
Single optimizable surface; depends on a tangle-router fusion primitive (separate issue).

* docs(research): active push - RUN/DELETE/IMPROVE worklist (delete createDriver, run commit0, deep-clean, dedup)

* docs(research): createDriver delete BLOCKED (paradigm diff, evidenced); commit0 RAN; the delete fork

* refactor(runtime): full nuke of the createDriver/string-prompt measurement+eval paradigm

DELETE the wrong abstraction (createDriver = a code TopologyPlanner driving runLoop over string-prompt→string-answer calls, judged by adapter.judge) and the entire old bench experiment + eval-gen apparatus built on it. The agent-driver (AgentProfile driving AgentProfile via coordination tools) replaces it; the runLoop KERNEL and the Scope/Supervisor are untouched.

Deleted (15): src/runtime/driver.ts; bench experiment.ts(+test)/steering-experiment(+test)/improve-prompt/research-loop/finsearch-loop/rsi/generate-eval/run-benchmarks/run.ts/skills-sandbox/profile-coord-sandbox; tests/loops/dynamic.test.ts.

Survivors (search-bench/cloud-loop/fleet/commit0-gate) re-homed onto a new pure helper bench/src/sandbox-run.ts (answerOutput/sandboxAgentRun/WorkerBackendType/AnalystFn/llmAnalyst, with no experiment shell). runLoop kernel tests kept via a scriptedDriver stub in refine-driver.ts.

Gates (hand-verified): build 0, typecheck 0 (root+bench; also fixed a pre-existing bench BackendType red), lint 0, 905 tests pass. Zero dangling code refs.

ACCEPTED casualties of the full nuke (rebuild on the agent-driver/Supervisor path when wanted): the generate-eval data engine, the AgentProfile-coordinate optimizer (profile-coord), and run.ts's non-experiment subcommands (preflight/verify-judge/solve-one/ui-review).
* docs(research): full nuke DONE (-3492 LOC); doc/skill-rot follow-up tracked

* docs(cleanup): retarget all docs+skills off the nuked createDriver/runExperiment to the agent-driver/Supervisor reality

* feat(supervise): recursive driver-executor - agents driving agents driving agents

A spawned child can now BE a driver. driverExecutorFactory mounts a NESTED Scope over the SAME conserved budget pool + shared journal (scope.ts's new NestedScopeSeam), one depth deeper, and runs the wrapped driver's act there. A child resolves to a LEAF (worker) OR, for a role:'driver' spec via withDriverExecutor, this executor, recursively. So a driver spawns a driver spawns a worker on one budget-conserving tree. The persona/strategy spawn fences now route a driver child to the recursive executor (compose) instead of throwing; act still fails loud only if a child is run directly. Reuses the atom: builds NO new budget/journal/selection logic. Budget conserved across depth (reserve-on-spawn fails closed at any depth), spend bubbles to root, journal records each nested tree, maxDepth enforced across recursion.

Proven OFFLINE (no creds; scripted drivers+workers) in tests/loops/driver-recursion.test.ts: depth-2 chain root->mid->inner->worker (node id rec:s0:s0:s0, which a non-recursive build cannot produce), fail-closed budget conservation across depth, spend roll-up (spentTotal = the worker's exact spend), nested-journal sub-trees, depth-ceiling across recursion. Gates hand-verified: build 0, typecheck 0, lint 0, 911 tests pass.

* refactor(bench)+docs: reclaim runKeystoneGate -> runGate; strip 4 docs to latest-only

Rename the opaque 'keystone' jargon: runKeystoneGate->runGate (+ RunGateOptions/GateArmResult/GateReport), bench/src/keystone-gate.ts->gate.ts (+ -cli, +test), all import paths and CLI banners.
Strip the last historical createDriver/runExperiment 'was removed/nuked' breadcrumbs from architecture.md, architecture-interpretations.md, learning-flywheel.md, roadmap-rsi.md; upgrading agents now see only the current agent-driver/Supervisor reality (history lives in git). Gates green; 905->911 with the keystone test.

* docs(research): keystone recursion ✅ (9d188e1); createDriver retire ✅ via nuke; #2b brain next

* feat(supervise): coordinationDriverAgent - the cheap/offline driver (LLM tool-loop over the coordination verbs)

The CHEAP, in-process, no-creds variant of the recursive driver: act() mounts createCoordinationTools over its scope and runs an LLM tool-loop (injected chat seam) so the driver REASONS spawn/steer/await/stop; composes with 2a recursion (a driver agent spawns a driver agent, via makeWorkerAgent -> driverChild). NOT the primary driver: the CAPABLE driver is a sandbox agent with the coordination verbs as an MCP. This one is the offline-testable + cheap-orchestration path. Prompt is INJECTED (decoupled from agent-eval).

Proven OFFLINE (no creds, scripted mock chat) tests/loops/coordination-driver.test.ts: the tool-loop drives real Scope.spawn via the coordination verbs + folds results back; a driver AGENT spawns a driver AGENT (separate nested journal tree). typecheck 0, lint 0.

* docs(research): dual-purpose resolution (one substrate serves product + proof); #2b cheap driver done, #2c capable sandbox driver + #3 completion-oracle next

* feat(supervise): completion-oracle - settled ⟺ delivered (Foreman 0/18)

The honest settle: a node counts as delivered only when a deployable check passes, never on self-report.

- completion-gate.ts: gateOnDeliverable wraps any Executor so its settlement valid reflects a DeliverableSpec check (both execute shapes; fail-closed).
- coordination-driver finalize: returns the best DELIVERED child; undefined when none delivered, so a driver cannot self-declare done via prose.
- driver-executor: derive the driver child's verdict from its direct settled events, so delivery composes UP the recursion (a sub-driver is valid only when it itself selected a delivered child).
- supervisor: a winner MUST carry a real Out; a successful act that produced nothing is a no-winner, never a winner wrapping undefined.

8 offline tests: leaf gate (both execute shapes, fail-closed), ran-but-didn't-deliver yields no winner, the gate dominates score, delivery propagates up the recursion.

* docs(research): completion-oracle #3 ✅ (bd58761) - settled ⟺ delivered, composes up the recursion

* feat(bench): atom-humaneval - agents-driving-agents on a live deployable-checked domain

A coordinationDriverAgent (real router brain) drives gated workers on HumanEval: each worker is settled valid ONLY when the local Docker test suite passes (completion-oracle, not self-report), against a blind best-of-K baseline. Proven live: the driver spawns, the worker solves, the checker gates, the supervisor returns a winner only on real delivery. Also exports gateOnDeliverable/DeliverableSpec from the runtime barrel (the #3 primitive was added to supervise/ but not surfaced on the package).

* feat(topology): animated visual replay of a recursive agent run

Fold the one runtime-hooks stream into a timestamped ReplayEvent[] (createReplayRecorder) and render a self-contained, scrubbable HTML player (renderReplayHtml): the recursive agent tree animated over wall-clock, each node colored by the completion-oracle: delivered (valid) green, ran-but-not-delivered amber, failed red, with live token/cost counters. Synthesizes the unspawned root driver so the whole recursion renders. No server/build/deps. Wired into atom-humaneval (every driver run emits a replay.html). 4 offline recorder tests; proven on a live HumanEval run (driver -> worker -> delivered).
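The completion-oracle rule from the entries above (a node is valid only when a deliverable check passes, and the gate dominates score) can be sketched in a few lines. This is an illustration under stated assumptions: the `Settlement`, `Executor`, and `DeliverableSpec` shapes are invented here, the executor is synchronous for brevity, and `gateOnDeliverableSketch` and `pickWinner` are hypothetical names rather than the runtime's exports.

```typescript
// Hypothetical shapes: the names echo the commit message, the fields are illustrative.
interface Settlement<Out> {
  valid: boolean; // before gating: the executor's own claim
  score: number;
  out?: Out;
}
type Executor<In, Out> = (input: In) => Settlement<Out>;
interface DeliverableSpec<Out> {
  check(out: Out): boolean; // e.g. "does the test suite pass on this output?"
}

// Wrap an executor so `valid` reflects the check, never the executor's self-report.
function gateOnDeliverableSketch<In, Out>(
  executor: Executor<In, Out>,
  spec: DeliverableSpec<Out>,
): Executor<In, Out> {
  return (input) => {
    const settled = executor(input);
    if (settled.out === undefined) return { valid: false, score: settled.score };
    let delivered = false;
    try {
      delivered = spec.check(settled.out);
    } catch {
      delivered = false; // fail closed: a crashing check is not a delivery
    }
    return { ...settled, valid: delivered };
  };
}

// Best DELIVERED child, or undefined when nothing was delivered.
function pickWinner<Out>(children: readonly Settlement<Out>[]): Settlement<Out> | undefined {
  let best: Settlement<Out> | undefined;
  for (const child of children) {
    if (!child.valid || child.out === undefined) continue;
    if (best === undefined || child.score > best.score) best = child;
  }
  return best;
}

// Demo: a high-scoring worker that did not deliver loses to a modest one that did.
const spec: DeliverableSpec<string> = { check: (code) => code.includes("return") };
const confident: Executor<string, string> = () => ({ valid: true, score: 0.9, out: "TODO" });
const modest: Executor<string, string> = () => ({ valid: true, score: 0.4, out: "return a + b" });

const settled = [confident, modest].map((run) => gateOnDeliverableSketch(run, spec)("add two numbers"));
const winner = pickWinner(settled);
console.log(winner?.out);
```

Filtering on `valid` before comparing scores is what makes the gate dominate: a score can only break ties among delivered children.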
* fix(bench): atom-humaneval blind arm survives transient router errors (a 502 is a failed attempt, not a crash; matches the driver arm's down-typing)

* feat(supervise): the supervisor AUTHORS worker profiles from a skill (the intelligence, not the plumbing)

The supervisor's job is to DESIGN the agents it spawns: read the task, decompose it, and author a tailored profile (instructions + model) per worker. supervisorSkill is the how-to it reads (its own system prompt), THE optimizable self-improvement surface; authoredWorker builds a worker from an authored profile; asAuthoredProfile catches empty/placeholder profiles (a skill violation). Proven offline (no creds, no plumbing): a skill-guided supervisor authors DISTINCT, tailored worker recipes per sub-task and they flow to the workers. 3 tests.

* feat(supervise): coordination MCP over a live Scope - the real keystone for in-box driving

serveCoordinationMcp fronts a live Scope with an HTTP JSON-RPC MCP server: an in-box coding harness (opencode via cli-bridge) mounts mcp.mcpServers.coordination and calls spawn_worker as a native tool, landing on Scope.spawn (a real box driving real boxes, not emulated function-tools). Real test: HTTP tools/call spawn_worker -> Scope.spawn -> worker settles -> winner (no mock of the MCP path). Plus the standard supervise SKILL.md.

* feat(bench): prove a coding harness drives the Scope via the coordination MCP (live)

opencode (glm-5-turbo via cli-bridge) mounts mcp.mcpServers.coordination (type:http → opencode remote) and calls spawn_worker itself → real Scope.spawn → worker settles, and reads back the await_next result. The in-box driving path is REAL: a coding agent drives recursion as a native tool, not emulated. (Bridge wants mcp type:'http', not 'remote'.)
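The `tools/call spawn_worker -> Scope.spawn` path described in the coordination MCP entries can be pictured as a JSON-RPC dispatch. The envelope below follows the MCP `tools/call` convention (protocol errors as JSON-RPC errors, tool failures as results with `isError`); the `ScopeLike` interface, the `instructions` argument, and the `handleRpc` name are assumptions, and the transport (HTTP) is omitted.

```typescript
// JSON-RPC 2.0 envelope as used by MCP's tools/call method.
interface RpcRequest {
  jsonrpc: "2.0";
  id: number | string;
  method: string;
  params?: { name: string; arguments?: Record<string, unknown> };
}
type RpcResponse =
  | {
      jsonrpc: "2.0";
      id: number | string;
      result: { content: { type: "text"; text: string }[]; isError?: boolean };
    }
  | { jsonrpc: "2.0"; id: number | string; error: { code: number; message: string } };

// Hypothetical stand-in for the live Scope. Only the spawn name comes from the commit.
interface ScopeLike {
  spawn(spec: { instructions: string }): string; // returns a node id
}

function handleRpc(request: RpcRequest, scope: ScopeLike): RpcResponse {
  const { id, params } = request;
  if (request.method !== "tools/call") {
    return { jsonrpc: "2.0", id, error: { code: -32601, message: "Method not found" } };
  }
  if (params === undefined || params.name !== "spawn_worker") {
    return { jsonrpc: "2.0", id, error: { code: -32602, message: "Unknown tool" } };
  }
  // The argument name "instructions" is an assumption made for this sketch.
  const instructions = params.arguments?.["instructions"];
  if (typeof instructions !== "string" || instructions.trim() === "") {
    return {
      jsonrpc: "2.0",
      id,
      result: {
        content: [{ type: "text", text: "instructions must be a non-empty string" }],
        isError: true,
      },
    };
  }
  const nodeId = scope.spawn({ instructions });
  return { jsonrpc: "2.0", id, result: { content: [{ type: "text", text: nodeId }] } };
}

// Demo scope that records what was spawned.
const spawned: string[] = [];
const scope: ScopeLike = {
  spawn: (spec) => {
    spawned.push(spec.instructions);
    return `node-${spawned.length}`;
  },
};

const ok = handleRpc(
  {
    jsonrpc: "2.0",
    id: 1,
    method: "tools/call",
    params: { name: "spawn_worker", arguments: { instructions: "solve task 7" } },
  },
  scope,
);
console.log(JSON.stringify(ok));
```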
* feat(bench): WHOLE real e2e - opencode supervisor drives opencode workers via the coordination MCP, real test gates delivery

Live, no mock: the opencode supervisor (glm-5-turbo via cli-bridge) mounts the coordination MCP, authors worker profiles, calls spawn_worker -> real Scope.spawn -> real opencode workers code in a cwd -> python3 test gates valid -> supervisor settles on the delivered worker -> winner. The completion-oracle (deployable check, not LLM judge) decided delivery over the supervisor's confusion that it couldn't see the workers' isolated cwds (→ shared Workspace next). Proof artifact for the in-box-driving path; the law-compliant productionization is a substrate backend (tmux/bridge/sandbox) that runs authored profiles, not this harness-specific script.

* docs(canonical-api): the AgentProfile law - author the profile, the substrate materializes it

§1.5 + decision-table rows + CLAUDE.md §0 pointer. The thing we keep forgetting: an agent IS its full AgentProfile (prompt+skills+tools/mcp+subagents+hooks+permissions+model), not a prompt; change behavior by AUTHORING the profile and letting the sandbox substrate materialize it into harness shapes. Never write a verify-loop or harness-specific config (self-verification is a hook/process; opencode is only the cli-bridge test target; a missing lever is a substrate gap).

* docs(research): consolidate docs/research 28→14 - retire shipped/subsumed design docs

Retired 14 design-research docs whose content is now shipped code, in .evolve/current.json, or self-declared subsumed/retracted (the recursion atom shipped; the optimization-space layer evidence landed; verdicts reached). Refreshed the research index, recorded the retirement + rationale in deletion-ledger.md (Pass 2), and fixed every inbound link (top index, the harvest-corpus.ts comment → current.json, optimization-space's suite links).
Kept the SSOT masterplan, the canonical-referenced maps (optimization-space/leapfrog), the two gated belief specs, the postmortem guardrail, the build-lists, and the agent-lab tombstones. No broken links into the 14 remain from any canonical doc or src/.

* docs(canonical-api): substrate pin 0.89 → 0.92 (matches the merged package.json)
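The budget claims made for the recursive driver-executor earlier in this commit message (one conserved pool shared across depths, reserve-on-spawn failing closed, spend bubbling to the root, maxDepth enforced across recursion) can be illustrated with a small model. The shapes and function names below are assumptions for illustration only; the runtime's Scope and budget pool are not shown in the source.

```typescript
// Hypothetical shapes: the real Scope and budget pool are richer than this.
interface BudgetPool {
  remaining: number;
  spent: number;
}
interface ScopeSketch {
  pool: BudgetPool; // the SAME object at every depth, so spend is visible at the root
  depth: number;
  maxDepth: number;
}

// Mount a nested scope one level deeper over the parent's pool.
function nestScope(parent: ScopeSketch): ScopeSketch {
  if (parent.depth >= parent.maxDepth) {
    throw new RangeError(`maxDepth ${parent.maxDepth} exceeded`);
  }
  return { pool: parent.pool, depth: parent.depth + 1, maxDepth: parent.maxDepth };
}

// Reserve-on-spawn: refuse (fail closed) rather than overdraw the shared pool.
function reserve(scope: ScopeSketch, amount: number): boolean {
  if (!(amount > 0) || amount > scope.pool.remaining) return false;
  scope.pool.remaining -= amount;
  return true;
}

// Settle a finished child: record what it used and refund the unused reservation.
function settle(scope: ScopeSketch, reserved: number, used: number): void {
  const charged = Math.min(Math.max(used, 0), reserved);
  scope.pool.spent += charged;
  scope.pool.remaining += reserved - charged;
}

// Demo: root -> mid -> inner, all drawing on one pool of 100 units.
const root: ScopeSketch = { pool: { remaining: 100, spent: 0 }, depth: 0, maxDepth: 2 };
const mid = nestScope(root);
const inner = nestScope(mid);

const innerReserved = reserve(inner, 60); // succeeds: 40 units remain
const midReserved = reserve(mid, 50); // refused: only 40 units remain
settle(inner, 60, 25); // the inner worker used 25 of its 60

console.log(innerReserved, midReserved, root.pool);
```

Because every scope holds a reference to the same pool, `remaining + spent` stays equal to the starting total once reservations are settled, whichever depth did the spending.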
Summary

Mechanical version bump absorbing three minors of new agent-eval surface. The control-loop primitives this runtime consumes (runAgentControlLoop, scoreKnowledgeReadiness, blockingKnowledgeEval, userQuestionsForKnowledgeGaps, acquisitionPlansForKnowledgeGaps) are unchanged across 0.21 → 0.22 → 0.23, so the bump is a no-op functionally.

What's new in agent-eval 0.21 → 0.23 (not consumed here)

… hashJson/canonicalize exposed for arbitrary content signing; RunRecord.scenarioId made optional to support pre-trace records)

Code change

One small enrichment around the new optional RunRecord.scenarioId field: runAgentTask already knows the canonical scenarioId it passes into runAgentControlLoop (options.scenarioId ?? task.id). When adapter.projectRunRecords returns records without a scenarioId, the runtime now backfills it with that same canonical value before returning. Adapters that already set scenarioId are untouched. This means callers reading result.runRecords always see a populated scenarioId without each adapter having to thread it through.

The hoist of options.scenarioId ?? task.id into a local is purely so the same value is used at the control-loop call site and the post-projection backfill.

Test plan

- pnpm install: resolves to @tangle-network/agent-eval@0.23.0
- pnpm typecheck: clean
- pnpm test: 16/16 pass (no count change)
- pnpm build: clean (dist/index.js 34.20 KB → 34.31 KB, dist/index.d.ts unchanged)
- scenarioId backfill behavior is wanted (alternative: leave RunRecords entirely as the adapter returned them)

Notes

- The ^0.20.0 floor in the task description was actually ^0.20.12 in the lockfile; the bump goes from ^0.20.12 → ^0.23.0.
- No RunRecord is constructed inside this repo; construction is delegated to user-supplied adapters via the projectRunRecords hook, so no other call sites needed touching.

🤖 Generated with Claude Code