fix: persist final runtime stream failures by drewstone · Pull Request #1 · tangle-network/agent-runtime

drewstone · 2026-05-04T14:53:29Z

Summary

- bump package to 0.5.4
- persist final failure events to RuntimeSessionStore
- preserve primary stream failure when backend cleanup also fails
- add regression tests for replayable failed sessions and cleanup errors

Why

Product sessions need reliable replay, resume, and debug data. Previously, a failed stream yielded its final event to the caller but did not persist that event to the session store, and a backend cleanup error could mask the original runtime failure.
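The two behaviors described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `RuntimeEvent`, `SessionStore`, `Backend`, `InMemorySessionStore`, and `runStream` are hypothetical names and shapes chosen for the example, and the real `RuntimeSessionStore` API may look different.

```typescript
// Hypothetical shapes for illustration only; the real agent-runtime types may differ.
type RuntimeEvent =
  | { type: "delta"; text: string }
  | { type: "final"; status: "failed"; error: string };

interface SessionStore {
  append(sessionId: string, event: RuntimeEvent): Promise<void>;
}

interface Backend {
  stream(): AsyncIterable<RuntimeEvent>;
  cleanup(): Promise<void>;
}

// Simple in-memory store so the sketch is self-contained.
class InMemorySessionStore implements SessionStore {
  readonly events = new Map<string, RuntimeEvent[]>();

  async append(sessionId: string, event: RuntimeEvent): Promise<void> {
    const list = this.events.get(sessionId) ?? [];
    list.push(event);
    this.events.set(sessionId, list);
  }
}

async function* runStream(
  sessionId: string,
  backend: Backend,
  store: SessionStore,
): AsyncGenerator<RuntimeEvent> {
  let primaryFailed = false;
  try {
    for await (const event of backend.stream()) {
      await store.append(sessionId, event);
      yield event;
    }
  } catch (err) {
    primaryFailed = true;
    const final: RuntimeEvent = {
      type: "final",
      status: "failed",
      error: err instanceof Error ? err.message : String(err),
    };
    // Persist the final failure before yielding it, so a later replay sees it.
    await store.append(sessionId, final);
    yield final;
  } finally {
    try {
      await backend.cleanup();
    } catch (cleanupErr) {
      // A cleanup error is surfaced only when there was no primary failure to mask.
      if (!primaryFailed) throw cleanupErr;
    }
  }
}
```

In this sketch the store is written before the event is yielded, so a consumer that stops iterating right after the final event still leaves a replayable record behind.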

Validation

- `pnpm typecheck`
- `pnpm test`
- `pnpm build`
- `git diff --check`


…r runLoop (backend-blind) (#150)

* feat(loops): opt-in session continuation + checkpoint-fork lineage (backend-blind)

  Two @experimental, default-OFF seams on runLoop so a loop can CONTINUE a sandbox session across iterations (same box + sessionId, no prompt-text replay) and FORK fanout branches from a parent checkpoint (shared context prefix) — both behind a capability probe so the kernel asks 'can I fork?' (client.criuStatus) and never names Docker/Firecracker, degrading to fresh boxes when CRIU is absent.

  - sandbox-capabilities.ts: memoized, fail-closed criuStatus probe -> {canFork}.
  - sandbox-lineage.ts: createSandboxLineage owns box+session handles with start/continue/fork/teardown; reuses the kernel's acquireSandbox / buildBackendOptions / deleteBoxSafe; fail-loud if the probe says canFork but the box has no fork().
  - run-loop.ts: RunLoopOptions.lineage (sessionContinuity / forkFanout); refine continues, fanout forks-once, else fresh-through-lineage. Default OFF is byte-identical to today, so random@k stays N independent fresh boxes (the compute-control invariant). Rejects lineage + onWorkerBox (both own boxes).
  - 7 new unit tests (continuation reuses session; fork when canFork; fresh fallback; default-off invariant). Full suite 621 pass, typecheck clean.

* fix(loops): address PR #150 review — bound forks, prune lineage, fail-loud session continuity

  Resolve all six findings from the review (none blocked landing; #1 gated enabling, #3/#4 wanted documenting). Lineage remains default-OFF and byte-identical to the fresh-box path when both flags are unset.

  - #1 sessionContinuity silent no-op: `continue` now asserts the session is still known to the sandbox via `box.session(id).status()` before streaming. A `null` (platform never honored the client-minted id, or it was reaped) raises a ValidationError, which executeIteration now propagates as a hard structural failure instead of degrading to a soft empty iteration — so a non-honoring platform errors loudly rather than running contextless turns.
  - #2 unbounded fork creation: `fork` provisions child boxes through `mapWithConcurrency` bounded by the loop's `maxConcurrency`, not a single `Promise.all` over all N branches.
  - #3 fork ignores per-branch specs: documented on `fork` and `LoopLineageOptions.forkFanout` that a real CRIU fork inherits the parent image/profile (per-branch specs apply only on the degraded fresh path).
  - #4 lineage holds every box to loop end: kernel prunes boxes no future round can descend from after each round, gated on a kernel-inferred (monotonic) branch point — skipped when the driver authors its own `parentIndex`. The unprunable case is documented as the box ceiling.
  - #5 abort during fork: documented the SDK's signal-less fork; abort is now checked per branch (between bounded waves) + an abort-under-lineage test.
  - #6 export order: alphabetized the loops barrel.

  Adds `mapWithConcurrency` util and six lineage tests (session-liveness pass/fail, bounded-fork peak, mid-loop prune, no-prune-under-authored-parent, abort-under-lineage). 627 tests pass, typecheck + biome clean.
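Finding #2 replaces a single `Promise.all` over all N branches with a bounded map. A minimal sketch of such a helper is shown here, assuming a worker-pool design. The signature is hypothetical and the repository's actual `mapWithConcurrency` may differ (the commit message mentions bounded waves and per-branch abort checks, which this sketch does not implement).

```typescript
// Hypothetical signature; the util added in the commit above may look different.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0;

  // Each worker pulls the next unclaimed index until the input is exhausted,
  // so at most `limit` calls to `fn` are in flight at any moment.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index] as T, index);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as `items`
}
```

Results are written by index, so the output order matches the input order regardless of which call finishes first. If any call rejects, the returned promise rejects with that error, as with `Promise.all`.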

…uard, prompts in artifact (#246) Grid run #1 silently degenerated: the thinking worker model burned the compressor's token cap on reasoning and emitted a ~1-word prompt — a different treatment (prompt REMOVAL) than the requested 50% ratio. The compressor is now a fixed non-thinking model (COMPRESSOR_MODEL, default deepseek-v4-flash), a degenerate output (<20% of target words) fails loud, and the artifact persists the actual cell prompts so the treatment is inspectable.
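The degenerate-output guard described above (fail loud when the compressed prompt has fewer than 20% of the target words) might look roughly like this. It is a sketch under assumptions: `countWords`, `assertNotDegenerate`, and `DegenerateCompressionError` are hypothetical names, and whitespace splitting is only one possible way to count words.

```typescript
// Hypothetical error type for illustration.
class DegenerateCompressionError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "DegenerateCompressionError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Assumption: words are whitespace-separated tokens.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Throws when the compressed prompt falls below `floor` (default 20%) of the
// target word count, where target = original word count * requested ratio.
function assertNotDegenerate(
  original: string,
  compressed: string,
  ratio: number,
  floor = 0.2,
): void {
  const targetWords = Math.round(countWords(original) * ratio);
  const actualWords = countWords(compressed);
  if (actualWords < targetWords * floor) {
    throw new DegenerateCompressionError(
      `compressed prompt has ${actualWords} words; expected at least ` +
        `${Math.ceil(targetWords * floor)} (target ${targetWords})`,
    );
  }
}
```

With a 100-word prompt and a 50% ratio, the target is 50 words, so anything under 10 words is rejected instead of being silently treated as a valid compression.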


…cs that can't lie, supervise() one-call (#347)

* Revert "feat(runtime): durable run loop — wire supervisor resume + journal the kernel loop (#346)"
  This reverts commit edc1d54.
* docs(simplification): master tracker — converged design, scratch list, full doc/module/example inventory + completion criteria
* docs(simplification): red-team corrections — 4 verbs (run/improve/certify/refuse), steer-in-run, milestone-oracle gap, 8 skills to vendor
* docs(simplification): improve is ONE verb with a PLUGGABLE CandidateGenerator (GEPA/skillOpt/autoresearch) + surface param — not 'one engine'
* refactor(runtime): extract the canonical runToolLoop; routerToolLoop becomes a thin adapter (keystone 1/4)
* refactor(runtime): unify the supervisor brain on the canonical ToolLoopChat seam (keystone)
  Delete DriverChat + routerDriverChat; the coordination-driver brain is now the canonical ToolLoopChat and its loop runs through runToolLoop (routerBrain = 4 lines, was 60). The equal-k driver-inference metering is preserved exactly. Three tool-loop copies collapse to one.
* docs(simplification): keystone WS1 is two phases — 1a (seam unified) done, 1b (brain-from-profile/harness-as-data, sandbox supervisor) next
* refactor(runtime): internalize leaked recursion/seam/journal/trace plumbing from the public barrel
* refactor(runtime): internalize durable spawn-journal + spawn-tree types from the public barrel
* refactor(api): collapse public export subpaths 13→6 (fold audit into profiles; drop unused duplicates)
* docs: fix 3 stale/fabricated symbol references (DriverChat, runSteeringExperiment, refineGepa label)
* docs: consolidate 26→19 + archive (shrink canonical-api 984→76, merge 4 architecture docs→1, merge PLAIN→README, archive 5 niche notes)
* feat(docs-gate): CLASS 6 prose-symbol check — every backticked symbol in curated docs must resolve
  Scans canonical-api/concepts/architecture for backticked symbols outside code fences; reddens on any call-shaped or PascalCase symbol that resolves to no src/bench/substrate export or concept-whitelist entry. Walks every substrate dist/**/*.d.ts (not just index barrels). Closes the gap that let gepaDriver/refineGepa live in the docs unchecked.
* chore(profiles): sort barrel exports after the audit fold (biome)
* docs(simplification): mark WS1a/WS3/WS5 shipped
* feat(runtime): supervisorAgent resolves the brain from profile.harness (WS1b)
  A supervisor is now an AgentProfile: harness null -> the in-process router tool-loop (coordinationDriverAgent; routerBrain becomes an internal detail), a coding-CLI harness (claude-code/opencode/codex) -> a sandboxed harness driving the coordination verbs via serveCoordinationMcp. Both arms share makeWorkerAgent + the keep-best-delivered oracle. Closes the critique's A2 (driver brain was router-only). Proven offline both arms.
* docs(simplification): mark WS1b shipped (supervisorAgent — brain from profile.harness)
* feat(runtime): supervise() one-call convenience + workerFromBackend
  supervise(profile, task, { backend|makeWorkerAgent, budget }) defaults blobs/perWorker/journal/executors/maxDepth so 'just invoke the supervisor' is a one-liner. workerFromBackend derives the worker seam from a backend config + an optional completion oracle (settled⟺delivered). The raw seams (supervisorAgent + createSupervisor().run) stay for power use.
* docs(simplification): table the supervisor/driver/worker multi-round design (round vs turn, prompt-policy retry, real-time trace self-correction)
* docs(examples): canonical supervise() one-call example (the DX payoff — profile + goal, scaffolding defaulted)
* refactor(mcp): rename spawn_worker → spawn_agent (the verb spawns ANY agent, incl. a sub-supervisor)
  The coordination verb always took a worker OR a driver profile and resolves a sub-supervisor via the role marker — the name lied. Renamed across the tool def, the LLM-facing descriptions, the scripted-brain tests, the examples, and the hand docs. WS4 (naming taxonomy).
* refactor(mcp): rename observe_worker/steer_worker → observe_agent/steer_agent (consistent verb family)
  The coordination verbs operate on any spawned agent (a leaf worker OR a sub-supervisor), so the family is now spawn_agent / observe_agent / steer_agent. WS4 (naming taxonomy).
* refactor(runtime): depthDriver/breadthDriver→depthStrategy/breadthStrategy, supervisorSkill→supervisorInstructions (WS4)
  They are strategy combinators and a prompt-instruction builder, not 'drivers'/'skills' — reserving 'Driver' for the agent-orchestration layer (coordinationDriverAgent/driverChild).
* refactor(mcp): rename createDriveTurnResumeDriver → createDetachedTurnResumeDriver (WS4)
* feat(improvement): improve() — the one pluggable RSI verb (facade over selfImprove; generator defaulted from surface)
* test(improvement): offline improve() facade test (scripted generator, no creds)
* refactor(examples): delete run-router.ts + loop.ts, rewrite sandbox/bridge runners onto supervise()
  run-router.ts duplicated examples/supervise/supervise.ts (router brain + router-tools backend). loop.ts's runSupervisorLoop/makeWorkerAgent duplicated supervise()/workerFromBackend. The sandbox + bridge runners now call supervise() with only their load-bearing per-backend seam; the shared demo task + scripted brain move to shared.ts.
* refactor(examples): rewrite run-supervisor-mcp onto workerFromBackend() (single-sourced worker seam)
  Replaces the bespoke makeWorker (executor construction + per-worker file plumbing) with workerFromBackend(backend, deliverable); the deployable check now reads the worker's real output for ANSWER=42 (completion oracle, not a self-report). Keeps the cli-bridge harness supervisor arm that drives spawn_agent natively over the coordination MCP.
* style(improvement,mcp): biome import ordering after WS4 rename + improve() exports
* docs(examples): point READMEs at the pruned set + add an offline supervise() example test
* docs(simplification): record WS4/WS2 closed decisions (AgentRunSpec deferred, improvementDriver/runLoop kept, improve surface boundary)
* fix(scripts): align verify-package-exports with the 6-subpath surface (drop stale ./workflow)
  The 13→6 export collapse (e6ff2a2) removed the ./workflow subpath but left verify-package-exports.mjs asserting it (requiredExports + a runtime import), so the gate failed on a subpath the package intentionally no longer exposes. Verify the real subpaths (., ./agent, ./intelligence, ./loops, ./profiles, ./mcp) instead.
* refactor: post-audit cleanups — rename internal runToolLoop→runBrainLoop, fix improve() default model
  - runToolLoop name collided with the public streaming runToolLoop; the internal brain-loop seam is now runBrainLoop (one grep = one concept).
  - improve()'s zero-config default reflection model was the dead anthropic/claude-sonnet-4.6 → deepseek-v4-flash (router-served).
* docs: front-door supervise()/improve() — the #1 audit fix
  The two flagship verbs were invisible in every gated doc, so a reader was routed back onto the verbose legacy path the PR replaced. README now leads with the 3 entry points (chat turn / supervise / improve); canonical-api §2 makes supervise() the 'just run a supervisor' START-HERE row and routes self-improvement to improve().
* refactor: delete the standalone workflow-script engine (src/workflow, 2775 LOC + tests)
  A third orchestration substrate (a workflow-as-a-script DSL runner with its own checkpoints/budget/delegates) that does NOT use the supervisor and is NOT self-improving — redundant with the Scope/Supervisor + supervise() path (the architecture's 'two substrates, do not invent a third'). Zero in-repo or fleet consumers; its ./workflow subpath was already dropped in WS3.
* feat(mcp): spawn_agent accepts an optional per-spawn budget (supervisor can vary budget per worker)
* feat(runtime): allowedModels guard on supervise()/improve() (fail-loud model-subset restriction)
* docs(examples): strategy-evolution — the policy-search research journey (runStrategyEvolution + promotionGate)
* feat(conversation): evalPersona() facade + examples/product-eval (user-sim product evals, one-call)
* docs(examples): improve() — the RSI verb, offline scripted example
* docs(examples): intelligence-recommend — connect traces→findings→improve() (the intelligence loop)
* docs(examples): list the 4 new examples in the index README
* docs(api): regenerate API reference for allowedModels guard + evalPersona facade
* fix: address PR #347 review — restore contentAddress export, guard improve() JSON.parse, test flagged fns
  - HIGH: contentAddress was dropped from the runtime barrel by WS3 → bench/atom-humaneval + atom-mcp-e2e fail to compile (a content-addressing helper bench legitimately uses). Re-exported from the barrel.
  - MEDIUM: applyWinnerToProfile's JSON.parse threw a raw SyntaxError after a ship verdict on a malformed winner → parseWinnerJson guards it with a typed ConfigError + a test.
  - MEDIUM: finalizeBestDelivered + runBrainLoop had no direct tests → added focused unit tests (the blob store's content-address invariant is exercised).
  - LOW: supervise() decision-table/README rows implied backend is required (it's optional) → { budget, backend? }.
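The review fix for `applyWinnerToProfile` replaces a raw `SyntaxError` from `JSON.parse` with a typed error. A minimal sketch of that guard pattern follows. The names `parseWinnerJson` and `ConfigError` come from the commit message, but the signatures and fields here are assumptions made for illustration, not the repository's actual definitions.

```typescript
// Assumed shape: the real ConfigError in agent-runtime may carry different fields.
class ConfigError extends Error {
  readonly causedBy?: unknown;

  constructor(message: string, causedBy?: unknown) {
    super(message);
    this.name = "ConfigError";
    this.causedBy = causedBy;
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Assumed signature: parse a winner payload and reject anything that is not a JSON object.
function parseWinnerJson(raw: string): Record<string, unknown> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    // Wrap the SyntaxError so callers can handle one typed failure mode.
    const detail = err instanceof Error ? err.message : String(err);
    throw new ConfigError(`winner is not valid JSON: ${detail}`, err);
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new ConfigError("winner JSON must be an object");
  }
  return parsed as Record<string, unknown>;
}
```

Callers can then branch on `instanceof ConfigError` rather than catching every `SyntaxError` that happens to surface from the same call path.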

fix: persist final runtime stream failures

2e718d8

drewstone merged commit b78b2bd into main May 4, 2026

drewstone deleted the fix/stream-final-session-persistence branch May 4, 2026 14:54

drewstone mentioned this pull request Jun 4, 2026

feat(loops): opt-in session-continuation + checkpoint-fork lineage for runLoop (backend-blind) #150

Merged

This was referenced Jun 13, 2026

feat(lifecycle): profile-artifact lifecycle as a runtime primitive — generate/eval/promote/maintain skills, tools, MCPs, hooks, subagents for free #267

Closed

docs: post-POWER-16 go-live plan + verifiably-smarter-and-cheaper thesis #278

Merged

tangletools mentioned this pull request Jun 14, 2026

feat(conversation): runPersonaConversation — the persona loop runner (kills hand-rolled eval dispatch) #282

Merged

drewstone mentioned this pull request Jun 24, 2026

docs(canonical-api): close the anti-reinvention gaps + de-reinvent the examples #370

Closed

tangletools mentioned this pull request Jun 24, 2026

feat(loops): leak-free steering drivers (naive, dumb) #372

Merged

drewstone mentioned this pull request Jun 24, 2026

docs(examples): three-persona cleanup — newcomer/senior/junior #376

Merged

drewstone merged 1 commit into main from fix/stream-final-session-persistence

drewstone commented May 4, 2026
