fix: persist final runtime stream failures #1

Merged
drewstone merged 1 commit into main from fix/stream-final-session-persistence on May 4, 2026

Conversation

@drewstone

Contributor

Summary

  • bump package to 0.5.4
  • persist final failure events to RuntimeSessionStore
  • preserve primary stream failure when backend cleanup also fails
  • add regression tests for replayable failed sessions and cleanup errors

Why

Product sessions need reliable replay, resume, and debug data. Previously, a failed stream yielded its final event to the caller but never persisted that event, and an error during backend cleanup could mask the original runtime failure.
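
A minimal sketch of the two behaviors described above, not the code in this PR. The `StreamEvent` shape, the `SessionStore` interface, `InMemoryStore`, and `streamWithPersistence` are hypothetical names for illustration; the real `RuntimeSessionStore` API is not shown on this page.

```typescript
// Hypothetical event and store shapes (assumptions, not the package's types).
type StreamEvent =
  | { type: "delta"; text: string }
  | { type: "final"; status: "ok" | "failed"; error?: string };

interface SessionStore {
  append(sessionId: string, event: StreamEvent): Promise<void>;
}

class InMemoryStore implements SessionStore {
  readonly events = new Map<string, StreamEvent[]>();
  async append(sessionId: string, event: StreamEvent): Promise<void> {
    const list = this.events.get(sessionId) ?? [];
    list.push(event);
    this.events.set(sessionId, list);
  }
}

async function* streamWithPersistence(
  sessionId: string,
  source: AsyncIterable<StreamEvent>,
  store: SessionStore,
  cleanup: () => Promise<void>,
): AsyncGenerator<StreamEvent> {
  let primary: unknown;
  try {
    for await (const event of source) {
      await store.append(sessionId, event);
      yield event;
    }
  } catch (err) {
    primary = err;
    const finalEvent: StreamEvent = {
      type: "final",
      status: "failed",
      error: String(err),
    };
    // Persist the failure before yielding it, so a replay sees the same ending.
    await store.append(sessionId, finalEvent);
    yield finalEvent;
  } finally {
    try {
      await cleanup();
    } catch (cleanupErr) {
      // Surface a cleanup error only when there was no primary failure;
      // otherwise the already-recorded primary failure stays the reported one.
      if (primary === undefined) throw cleanupErr;
    }
  }
}
```

In this sketch the failure reaches the caller as a yielded `final` event rather than a thrown error; whether the real runtime also rethrows is not stated in the summary.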

Validation

  • pnpm typecheck
  • pnpm test
  • pnpm build
  • git diff --check

@drewstone drewstone merged commit b78b2bd into main May 4, 2026
@drewstone drewstone deleted the fix/stream-final-session-persistence branch May 4, 2026 14:54
tangletools pushed a commit that referenced this pull request Jun 4, 2026
…-loud session continuity

Resolve all six findings from the review (none blocked landing; #1 gated
enabling, #3/#4 wanted documenting). Lineage remains default-OFF and
byte-identical to the fresh-box path when both flags are unset.

- #1 sessionContinuity silent no-op: `continue` now asserts the session is
  still known to the sandbox via `box.session(id).status()` before streaming.
  A `null` (platform never honored the client-minted id, or it was reaped)
  raises a ValidationError, which executeIteration now propagates as a hard
  structural failure instead of degrading to a soft empty iteration — so a
  non-honoring platform errors loudly rather than running contextless turns.
- #2 unbounded fork creation: `fork` provisions child boxes through
  `mapWithConcurrency` bounded by the loop's `maxConcurrency`, not a single
  `Promise.all` over all N branches.
- #3 fork ignores per-branch specs: documented on `fork` and
  `LoopLineageOptions.forkFanout` that a real CRIU fork inherits the parent
  image/profile (per-branch specs apply only on the degraded fresh path).
- #4 lineage holds every box to loop end: kernel prunes boxes no future round
  can descend from after each round, gated on a kernel-inferred (monotonic)
  branch point — skipped when the driver authors its own `parentIndex`. The
  unprunable case is documented as the box ceiling.
- #5 abort during fork: documented the SDK's signal-less fork; abort is now
  checked per branch (between bounded waves) + an abort-under-lineage test.
- #6 export order: alphabetized the loops barrel.

Adds `mapWithConcurrency` util and six lineage tests (session-liveness pass/
fail, bounded-fork peak, mid-loop prune, no-prune-under-authored-parent,
abort-under-lineage). 627 tests pass, typecheck + biome clean.
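
For illustration, a bounded map of the kind named in finding #2 could look like the sketch below. The signature is an assumption (the commit names `mapWithConcurrency` but does not show it), and this version uses a worker pool rather than the "bounded waves" the commit mentions.

```typescript
// Sketch only: assumed signature, not the util added by the commit.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0;

  // Each worker pulls the next unclaimed index until the input is exhausted,
  // so at most `limit` calls to `fn` are in flight at any time.
  async function worker(): Promise<void> {
    while (true) {
      const index = next++;
      if (index >= items.length) return;
      results[index] = await fn(items[index] as T, index);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

Results keep input order regardless of completion order. A rejection from `fn` rejects the whole call, while tasks already in flight run to completion.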
drewstone added a commit that referenced this pull request Jun 4, 2026
…r runLoop (backend-blind) (#150)

* feat(loops): opt-in session continuation + checkpoint-fork lineage (backend-blind)

Two @experimental, default-OFF seams on runLoop so a loop can CONTINUE a sandbox
session across iterations (same box + sessionId, no prompt-text replay) and FORK
fanout branches from a parent checkpoint (shared context prefix) — both behind a
capability probe so the kernel asks 'can I fork?' (client.criuStatus) and never
names Docker/Firecracker, degrading to fresh boxes when CRIU is absent.

- sandbox-capabilities.ts: memoized, fail-closed criuStatus probe -> {canFork}.
- sandbox-lineage.ts: createSandboxLineage owns box+session handles with
  start/continue/fork/teardown; reuses the kernel's acquireSandbox /
  buildBackendOptions / deleteBoxSafe; fail-loud if the probe says canFork but
  the box has no fork().
- run-loop.ts: RunLoopOptions.lineage (sessionContinuity / forkFanout); refine
  continues, fanout forks-once, else fresh-through-lineage. Default OFF is
  byte-identical to today, so random@k stays N independent fresh boxes (the
  compute-control invariant). Rejects lineage + onWorkerBox (both own boxes).
- 7 new unit tests (continuation reuses session; fork when canFork; fresh
  fallback; default-off invariant). Full suite 621 pass, typecheck clean.
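
A memoized, fail-closed probe of the kind described for sandbox-capabilities.ts might be sketched as follows. The return shape of `criuStatus()` (`{ available: boolean }`) and the names `ProbeClient` and `probeCapabilities` are assumptions; only `criuStatus` and `canFork` appear in the commit message.

```typescript
// Assumed client shape; the real SDK's criuStatus() result is not shown here.
interface ProbeClient {
  criuStatus(): Promise<{ available: boolean }>;
}

interface SandboxCapabilities {
  canFork: boolean;
}

// One probe per client object; the cached promise also dedupes concurrent callers.
const probeCache = new WeakMap<object, Promise<SandboxCapabilities>>();

function probeCapabilities(client: ProbeClient): Promise<SandboxCapabilities> {
  let cached = probeCache.get(client);
  if (!cached) {
    cached = Promise.resolve()
      .then(() => client.criuStatus())
      .then((status): SandboxCapabilities => ({
        canFork: status.available === true,
      }))
      // Fail closed: any probe error means "cannot fork", never "maybe".
      .catch((): SandboxCapabilities => ({ canFork: false }));
    probeCache.set(client, cached);
  }
  return cached;
}
```

Because a failed probe is cached as `canFork: false`, a transient error keeps that client on the fresh-box path for its lifetime; that trade-off is a choice of this sketch.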

* fix(loops): address PR #150 review — bound forks, prune lineage, fail-loud session continuity

Resolve all six findings from the review (none blocked landing; #1 gated
enabling, #3/#4 wanted documenting). Lineage remains default-OFF and
byte-identical to the fresh-box path when both flags are unset.

- #1 sessionContinuity silent no-op: `continue` now asserts the session is
  still known to the sandbox via `box.session(id).status()` before streaming.
  A `null` (platform never honored the client-minted id, or it was reaped)
  raises a ValidationError, which executeIteration now propagates as a hard
  structural failure instead of degrading to a soft empty iteration — so a
  non-honoring platform errors loudly rather than running contextless turns.
- #2 unbounded fork creation: `fork` provisions child boxes through
  `mapWithConcurrency` bounded by the loop's `maxConcurrency`, not a single
  `Promise.all` over all N branches.
- #3 fork ignores per-branch specs: documented on `fork` and
  `LoopLineageOptions.forkFanout` that a real CRIU fork inherits the parent
  image/profile (per-branch specs apply only on the degraded fresh path).
- #4 lineage holds every box to loop end: kernel prunes boxes no future round
  can descend from after each round, gated on a kernel-inferred (monotonic)
  branch point — skipped when the driver authors its own `parentIndex`. The
  unprunable case is documented as the box ceiling.
- #5 abort during fork: documented the SDK's signal-less fork; abort is now
  checked per branch (between bounded waves) + an abort-under-lineage test.
- #6 export order: alphabetized the loops barrel.

Adds `mapWithConcurrency` util and six lineage tests (session-liveness pass/
fail, bounded-fork peak, mid-loop prune, no-prune-under-authored-parent,
abort-under-lineage). 627 tests pass, typecheck + biome clean.
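
The session-liveness assertion from finding #1 can be sketched like this. The `Box` and `SessionHandle` interfaces, the async `status()` return type, and the local `ValidationError` class are assumptions made for the example; the commit only states that `box.session(id).status()` is called and that `null` raises a ValidationError.

```typescript
// Local stand-in; the package's own ValidationError is not shown on this page.
class ValidationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ValidationError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Assumed handle shapes, for illustration only.
interface SessionHandle {
  status(): Promise<{ state: string } | null>;
}

interface Box {
  session(id: string): SessionHandle;
}

// Throws instead of letting the loop run a turn with no prior context.
async function assertSessionAlive(box: Box, sessionId: string): Promise<void> {
  const status = await box.session(sessionId).status();
  if (status === null) {
    throw new ValidationError(
      `session ${sessionId} is not known to the sandbox; refusing to continue without context`,
    );
  }
}
```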
drewstone added a commit that referenced this pull request Jun 10, 2026
…uard, prompts in artifact (#246)

Grid run #1 silently degenerated: the thinking worker model burned the
compressor's token cap on reasoning and emitted a ~1-word prompt — a
different treatment (prompt REMOVAL) than the requested 50% ratio. The
compressor is now a fixed non-thinking model (COMPRESSOR_MODEL, default
deepseek-v4-flash), a degenerate output (<20% of target words) fails loud,
and the artifact persists the actual cell prompts so the treatment is
inspectable.
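
A rough sketch of the degenerate-output guard, assuming whitespace-delimited word counts and a target computed as the original word count times the requested ratio. `countWords` and `assertNotDegenerate` are hypothetical names; only the 20% floor and the 50% ratio come from the commit message.

```typescript
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Throws when the compressed prompt is shorter than `floor` of the target
// length, since that is closer to prompt removal than to compression.
function assertNotDegenerate(
  original: string,
  compressed: string,
  ratio: number,
  floor = 0.2,
): void {
  const targetWords = Math.round(countWords(original) * ratio);
  const actualWords = countWords(compressed);
  if (actualWords < targetWords * floor) {
    throw new Error(
      `degenerate compression: ${actualWords} words, target ${targetWords}`,
    );
  }
}
```

With a 100-word prompt and a 0.5 ratio, the target is 50 words, so anything under 10 words is rejected.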
drewstone added a commit that referenced this pull request Jun 20, 2026
The two flagship verbs were invisible in every gated doc, so a reader was routed back onto the
verbose legacy path the PR replaced. README now leads with the 3 entry points (chat turn /
supervise / improve); canonical-api §2 makes supervise() the 'just run a supervisor' START-HERE
row and routes self-improvement to improve().
drewstone added a commit that referenced this pull request Jun 20, 2026
…cs that can't lie, supervise() one-call (#347)

* Revert "feat(runtime): durable run loop — wire supervisor resume + journal the kernel loop (#346)"

This reverts commit edc1d54.

* docs(simplification): master tracker — converged design, scratch list, full doc/module/example inventory + completion criteria

* docs(simplification): red-team corrections — 4 verbs (run/improve/certify/refuse), steer-in-run, milestone-oracle gap, 8 skills to vendor

* docs(simplification): improve is ONE verb with a PLUGGABLE CandidateGenerator (GEPA/skillOpt/autoresearch) + surface param — not 'one engine'

* refactor(runtime): extract the canonical runToolLoop; routerToolLoop becomes a thin adapter (keystone 1/4)

* refactor(runtime): unify the supervisor brain on the canonical ToolLoopChat seam (keystone)

Delete DriverChat + routerDriverChat; the coordination-driver brain is now the canonical
ToolLoopChat and its loop runs through runToolLoop (routerBrain = 4 lines, was 60). The
equal-k driver-inference metering is preserved exactly. Three tool-loop copies collapse to one.

* docs(simplification): keystone WS1 is two phases — 1a (seam unified) done, 1b (brain-from-profile/harness-as-data, sandbox supervisor) next

* refactor(runtime): internalize leaked recursion/seam/journal/trace plumbing from the public barrel

* refactor(runtime): internalize durable spawn-journal + spawn-tree types from the public barrel

* refactor(api): collapse public export subpaths 13→6 (fold audit into profiles; drop unused duplicates)

* docs: fix 3 stale/fabricated symbol references (DriverChat, runSteeringExperiment, refineGepa label)

* docs: consolidate 26→19 + archive (shrink canonical-api 984→76, merge 4 architecture docs→1, merge PLAIN→README, archive 5 niche notes)

* feat(docs-gate): CLASS 6 prose-symbol check — every backticked symbol in curated docs must resolve

Scans canonical-api/concepts/architecture for backticked symbols outside code fences;
reddens on any call-shaped or PascalCase symbol that resolves to no src/bench/substrate
export or concept-whitelist entry. Walks every substrate dist/**/*.d.ts (not just index
barrels). Closes the gap that let gepaDriver/refineGepa live in the docs unchecked.
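
The prose-symbol check could be approximated as below. This is a simplified sketch, not the gate itself: `findUnresolvedSymbols` is a hypothetical name, the `known` set stands in for the resolved exports plus the concept whitelist, only backtick fences are recognized, and PascalCase is taken to mean two or more capitalized segments.

```typescript
function findUnresolvedSymbols(
  markdown: string,
  known: ReadonlySet<string>,
): string[] {
  const fence = "`".repeat(3); // fenced code block marker
  const inline = /`([^`]+)`/g; // inline backticked spans
  const unresolved: string[] = [];
  let inFence = false;

  for (const line of markdown.split("\n")) {
    if (line.trim().startsWith(fence)) {
      inFence = !inFence;
      continue;
    }
    if (inFence) continue; // code blocks are out of scope for the prose check

    let match: RegExpExecArray | null;
    while ((match = inline.exec(line)) !== null) {
      const raw = match[1] ?? "";
      const callShaped = /^[A-Za-z_$][\w$.]*\(.*\)$/.test(raw);
      const pascalCase = /^[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+$/.test(raw);
      if (!callShaped && !pascalCase) continue;
      const name = raw.replace(/\(.*$/, ""); // drop the argument list
      if (!known.has(name) && !unresolved.includes(name)) {
        unresolved.push(name);
      }
    }
  }
  return unresolved;
}
```

A non-empty result would turn the gate red; all-caps words and plain lowercase identifiers are ignored because they are neither call-shaped nor PascalCase.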

* chore(profiles): sort barrel exports after the audit fold (biome)

* docs(simplification): mark WS1a/WS3/WS5 shipped

* feat(runtime): supervisorAgent resolves the brain from profile.harness (WS1b)

A supervisor is now an AgentProfile: harness null -> the in-process router tool-loop
(coordinationDriverAgent; routerBrain becomes an internal detail), a coding-CLI harness
(claude-code/opencode/codex) -> a sandboxed harness driving the coordination verbs via
serveCoordinationMcp. Both arms share makeWorkerAgent + the keep-best-delivered oracle.
Closes the critique's A2 (driver brain was router-only). Proven offline both arms.

* docs(simplification): mark WS1b shipped (supervisorAgent — brain from profile.harness)

* feat(runtime): supervise() one-call convenience + workerFromBackend

supervise(profile, task, { backend|makeWorkerAgent, budget }) defaults blobs/perWorker/
journal/executors/maxDepth so 'just invoke the supervisor' is a one-liner. workerFromBackend
derives the worker seam from a backend config + an optional completion oracle (settled⟺delivered).
The raw seams (supervisorAgent + createSupervisor().run) stay for power use.

* docs(simplification): table the supervisor/driver/worker multi-round design (round vs turn, prompt-policy retry, real-time trace self-correction)

* docs(examples): canonical supervise() one-call example (the DX payoff — profile + goal, scaffolding defaulted)

* refactor(mcp): rename spawn_worker → spawn_agent (the verb spawns ANY agent, incl. a sub-supervisor)

The coordination verb always took a worker OR a driver profile and resolves a sub-supervisor
via the role marker — the name lied. Renamed across the tool def, the LLM-facing descriptions,
the scripted-brain tests, the examples, and the hand docs. WS4 (naming taxonomy).

* refactor(mcp): rename observe_worker/steer_worker → observe_agent/steer_agent (consistent verb family)

The coordination verbs operate on any spawned agent (a leaf worker OR a sub-supervisor), so the
family is now spawn_agent / observe_agent / steer_agent. WS4 (naming taxonomy).

* refactor(runtime): depthDriver/breadthDriver→depthStrategy/breadthStrategy, supervisorSkill→supervisorInstructions (WS4)

They are strategy combinators and a prompt-instruction builder, not 'drivers'/'skills' — reserving
'Driver' for the agent-orchestration layer (coordinationDriverAgent/driverChild).

* refactor(mcp): rename createDriveTurnResumeDriver → createDetachedTurnResumeDriver (WS4)

* feat(improvement): improve() — the one pluggable RSI verb (facade over selfImprove; generator defaulted from surface)

* test(improvement): offline improve() facade test (scripted generator, no creds)

* refactor(examples): delete run-router.ts + loop.ts, rewrite sandbox/bridge runners onto supervise()

run-router.ts duplicated examples/supervise/supervise.ts (router brain + router-tools
backend). loop.ts's runSupervisorLoop/makeWorkerAgent duplicated supervise()/workerFromBackend.
The sandbox + bridge runners now call supervise() with only their load-bearing per-backend
seam; the shared demo task + scripted brain move to shared.ts.

* refactor(examples): rewrite run-supervisor-mcp onto workerFromBackend() (single-sourced worker seam)

Replaces the bespoke makeWorker (executor construction + per-worker file plumbing) with
workerFromBackend(backend, deliverable); the deployable check now reads the worker's real
output for ANSWER=42 (completion oracle, not a self-report). Keeps the cli-bridge harness
supervisor arm that drives spawn_agent natively over the coordination MCP.

* style(improvement,mcp): biome import ordering after WS4 rename + improve() exports

* docs(examples): point READMEs at the pruned set + add an offline supervise() example test

* docs(simplification): record WS4/WS2 closed decisions (AgentRunSpec deferred, improvementDriver/runLoop kept, improve surface boundary)

* fix(scripts): align verify-package-exports with the 6-subpath surface (drop stale ./workflow)

The 13→6 export collapse (e6ff2a2) removed the ./workflow subpath but left
verify-package-exports.mjs asserting it (requiredExports + a runtime import), so the gate
failed on a subpath the package intentionally no longer exposes. Verify the real subpaths
(., ./agent, ./intelligence, ./loops, ./profiles, ./mcp) instead.

* refactor: post-audit cleanups — rename internal runToolLoop→runBrainLoop, fix improve() default model

- runToolLoop name collided with the public streaming runToolLoop; the internal brain-loop seam is now runBrainLoop (one grep = one concept).
- improve()'s zero-config default reflection model was the dead anthropic/claude-sonnet-4.6 → deepseek-v4-flash (router-served).

* docs: front-door supervise()/improve() — the #1 audit fix

The two flagship verbs were invisible in every gated doc, so a reader was routed back onto the
verbose legacy path the PR replaced. README now leads with the 3 entry points (chat turn /
supervise / improve); canonical-api §2 makes supervise() the 'just run a supervisor' START-HERE
row and routes self-improvement to improve().

* refactor: delete the standalone workflow-script engine (src/workflow, 2775 LOC + tests)

A third orchestration substrate (a workflow-as-a-script DSL runner with its own checkpoints/budget/
delegates) that does NOT use the supervisor and is NOT self-improving — redundant with the
Scope/Supervisor + supervise() path (the architecture's 'two substrates, do not invent a third').
Zero in-repo or fleet consumers; its ./workflow subpath was already dropped in WS3.

* feat(mcp): spawn_agent accepts an optional per-spawn budget (supervisor can vary budget per worker)

* feat(runtime): allowedModels guard on supervise()/improve() (fail-loud model-subset restriction)

* docs(examples): strategy-evolution — the policy-search research journey (runStrategyEvolution + promotionGate)

* feat(conversation): evalPersona() facade + examples/product-eval (user-sim product evals, one-call)

* docs(examples): improve() — the RSI verb, offline scripted example

* docs(examples): intelligence-recommend — connect traces→findings→improve() (the intelligence loop)

* docs(examples): list the 4 new examples in the index README

* docs(api): regenerate API reference for allowedModels guard + evalPersona facade

* fix: address PR #347 review — restore contentAddress export, guard improve() JSON.parse, test flagged fns

- HIGH: contentAddress was dropped from the runtime barrel by WS3 → bench/atom-humaneval + atom-mcp-e2e fail to compile (a content-addressing helper bench legitimately uses). Re-exported from the barrel.
- MEDIUM: applyWinnerToProfile's JSON.parse threw a raw SyntaxError after a ship verdict on a malformed winner → parseWinnerJson guards it with a typed ConfigError + a test.
- MEDIUM: finalizeBestDelivered + runBrainLoop had no direct tests → added focused unit tests (the blob store's content-address invariant is exercised).
- LOW: supervise() decision-table/README rows implied backend is required (it's optional) → { budget, backend? }.
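
The JSON.parse guard from the MEDIUM finding might look roughly like this. The local `ConfigError` class and the function's return type are assumptions, and the extra "must be an object" check is an addition of this sketch rather than something the commit describes.

```typescript
// Local stand-in for the package's typed ConfigError.
class ConfigError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ConfigError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Converts a raw SyntaxError into a typed error the caller can branch on.
function parseWinnerJson(raw: string): Record<string, unknown> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new ConfigError(`winner is not valid JSON: ${reason}`);
  }
  // Assumption: a winner must be a JSON object, not an array or a primitive.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new ConfigError("winner JSON must be an object");
  }
  return parsed as Record<string, unknown>;
}
```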