fix: persist final runtime stream failures #1
Merged
tangletools pushed a commit that referenced this pull request Jun 4, 2026
…-loud session continuity

Resolve all six findings from the review (none blocked landing; #1 gated enabling, #3/#4 wanted documenting). Lineage remains default-OFF and byte-identical to the fresh-box path when both flags are unset.

- #1 sessionContinuity silent no-op: `continue` now asserts the session is still known to the sandbox via `box.session(id).status()` before streaming. A `null` (platform never honored the client-minted id, or it was reaped) raises a ValidationError, which executeIteration now propagates as a hard structural failure instead of degrading to a soft empty iteration, so a non-honoring platform errors loudly rather than running contextless turns.
- #2 unbounded fork creation: `fork` provisions child boxes through `mapWithConcurrency` bounded by the loop's `maxConcurrency`, not a single `Promise.all` over all N branches.
- #3 fork ignores per-branch specs: documented on `fork` and `LoopLineageOptions.forkFanout` that a real CRIU fork inherits the parent image/profile (per-branch specs apply only on the degraded fresh path).
- #4 lineage holds every box to loop end: kernel prunes boxes no future round can descend from after each round, gated on a kernel-inferred (monotonic) branch point; skipped when the driver authors its own `parentIndex`. The unprunable case is documented as the box ceiling.
- #5 abort during fork: documented the SDK's signal-less fork; abort is now checked per branch (between bounded waves) + an abort-under-lineage test.
- #6 export order: alphabetized the loops barrel.

Adds `mapWithConcurrency` util and six lineage tests (session-liveness pass/fail, bounded-fork peak, mid-loop prune, no-prune-under-authored-parent, abort-under-lineage). 627 tests pass, typecheck + biome clean.
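The bounded fork provisioning in finding #2 can be illustrated with a minimal sketch of a concurrency-limited map. The signature below is an assumption (the real `mapWithConcurrency` util is not shown on this page), and this version uses a simple worker pool rather than the "bounded waves" the commit mentions.

```typescript
// Hypothetical sketch: the signature of the real util is not shown in this PR.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each worker claims the next unclaimed index until the input is exhausted,
  // so at most `limit` calls to `fn` are in flight at any time.
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index] as T, index);
    }
  };

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results; // Same order as `items`, regardless of completion order.
}
```

Compared with a single `Promise.all` over all N branches, the peak number of simultaneously provisioning boxes is capped at `limit`, which is the property a bounded-fork peak test would check.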
drewstone added a commit that referenced this pull request Jun 4, 2026
…r runLoop (backend-blind) (#150)

* feat(loops): opt-in session continuation + checkpoint-fork lineage (backend-blind)

  Two @experimental, default-OFF seams on runLoop so a loop can CONTINUE a sandbox session across iterations (same box + sessionId, no prompt-text replay) and FORK fanout branches from a parent checkpoint (shared context prefix), both behind a capability probe so the kernel asks 'can I fork?' (client.criuStatus) and never names Docker/Firecracker, degrading to fresh boxes when CRIU is absent.

  - sandbox-capabilities.ts: memoized, fail-closed criuStatus probe -> {canFork}.
  - sandbox-lineage.ts: createSandboxLineage owns box+session handles with start/continue/fork/teardown; reuses the kernel's acquireSandbox / buildBackendOptions / deleteBoxSafe; fail-loud if the probe says canFork but the box has no fork().
  - run-loop.ts: RunLoopOptions.lineage (sessionContinuity / forkFanout); refine continues, fanout forks-once, else fresh-through-lineage. Default OFF is byte-identical to today, so random@k stays N independent fresh boxes (the compute-control invariant). Rejects lineage + onWorkerBox (both own boxes).
  - 7 new unit tests (continuation reuses session; fork when canFork; fresh fallback; default-off invariant). Full suite 621 pass, typecheck clean.

* fix(loops): address PR #150 review: bound forks, prune lineage, fail-loud session continuity

  Resolve all six findings from the review (none blocked landing; #1 gated enabling, #3/#4 wanted documenting). Lineage remains default-OFF and byte-identical to the fresh-box path when both flags are unset.

  - #1 sessionContinuity silent no-op: `continue` now asserts the session is still known to the sandbox via `box.session(id).status()` before streaming. A `null` (platform never honored the client-minted id, or it was reaped) raises a ValidationError, which executeIteration now propagates as a hard structural failure instead of degrading to a soft empty iteration, so a non-honoring platform errors loudly rather than running contextless turns.
  - #2 unbounded fork creation: `fork` provisions child boxes through `mapWithConcurrency` bounded by the loop's `maxConcurrency`, not a single `Promise.all` over all N branches.
  - #3 fork ignores per-branch specs: documented on `fork` and `LoopLineageOptions.forkFanout` that a real CRIU fork inherits the parent image/profile (per-branch specs apply only on the degraded fresh path).
  - #4 lineage holds every box to loop end: kernel prunes boxes no future round can descend from after each round, gated on a kernel-inferred (monotonic) branch point; skipped when the driver authors its own `parentIndex`. The unprunable case is documented as the box ceiling.
  - #5 abort during fork: documented the SDK's signal-less fork; abort is now checked per branch (between bounded waves) + an abort-under-lineage test.
  - #6 export order: alphabetized the loops barrel.

  Adds `mapWithConcurrency` util and six lineage tests (session-liveness pass/fail, bounded-fork peak, mid-loop prune, no-prune-under-authored-parent, abort-under-lineage). 627 tests pass, typecheck + biome clean.
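The "memoized, fail-closed criuStatus probe -> {canFork}" described for sandbox-capabilities.ts might look roughly like the sketch below. The `CriuClient` interface, the `{ available }` status shape, and `createCapabilityProbe` are all assumptions made for illustration; only the `criuStatus` name and the `{canFork}` result come from the commit message.

```typescript
// Hypothetical shapes: the real client and status types are not shown in this PR.
interface CriuClient {
  criuStatus(): Promise<{ available: boolean }>;
}

interface SandboxCapabilities {
  canFork: boolean;
}

function createCapabilityProbe(
  client: CriuClient,
): () => Promise<SandboxCapabilities> {
  let cached: Promise<SandboxCapabilities> | undefined;

  return () => {
    if (cached === undefined) {
      // Memoize the promise itself so concurrent callers share one probe.
      cached = client
        .criuStatus()
        .then((status) => ({ canFork: status.available === true }))
        // Fail closed: any probe error is treated as "cannot fork".
        .catch(() => ({ canFork: false }));
    }
    return cached;
  };
}
```

Failing closed means a flaky or missing probe degrades to fresh boxes instead of attempting a fork the platform may not support.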
This was referenced Jun 9, 2026
drewstone added a commit that referenced this pull request Jun 10, 2026
…uard, prompts in artifact (#246)

Grid run #1 silently degenerated: the thinking worker model burned the compressor's token cap on reasoning and emitted a ~1-word prompt, a different treatment (prompt REMOVAL) than the requested 50% ratio. The compressor is now a fixed non-thinking model (COMPRESSOR_MODEL, default deepseek-v4-flash), a degenerate output (<20% of target words) fails loud, and the artifact persists the actual cell prompts so the treatment is inspectable.
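A degenerate-output guard of the kind described above could be sketched as follows. The helper names and the whitespace-based word count are assumptions; only the 20%-of-target threshold is taken from the commit message.

```typescript
// Hypothetical helper: only the 20% threshold comes from the commit message.
const DEGENERATE_FRACTION = 0.2;

function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Throws when the compressed prompt is far shorter than the requested ratio
// implies, instead of silently passing a near-empty prompt downstream.
function assertNotDegenerate(
  original: string,
  compressed: string,
  ratio: number,
): void {
  const targetWords = Math.round(countWords(original) * ratio);
  const actualWords = countWords(compressed);
  if (actualWords < targetWords * DEGENERATE_FRACTION) {
    throw new Error(
      `degenerate compression: got ${actualWords} words, target was ${targetWords}`,
    );
  }
}
```

With a 100-word prompt and a 50% ratio, the target is 50 words, so any output under 10 words is rejected; a 1-word output like the one in the degenerate grid run would fail immediately.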
drewstone added a commit that referenced this pull request Jun 20, 2026
The two flagship verbs were invisible in every gated doc, so a reader was routed back onto the verbose legacy path the PR replaced. README now leads with the 3 entry points (chat turn / supervise / improve); canonical-api §2 makes supervise() the 'just run a supervisor' START-HERE row and routes self-improvement to improve().
drewstone added a commit that referenced this pull request Jun 20, 2026
…cs that can't lie, supervise() one-call (#347)

* Revert "feat(runtime): durable run loop - wire supervisor resume + journal the kernel loop (#346)"

  This reverts commit edc1d54.

* docs(simplification): master tracker - converged design, scratch list, full doc/module/example inventory + completion criteria
* docs(simplification): red-team corrections - 4 verbs (run/improve/certify/refuse), steer-in-run, milestone-oracle gap, 8 skills to vendor
* docs(simplification): improve is ONE verb with a PLUGGABLE CandidateGenerator (GEPA/skillOpt/autoresearch) + surface param, not 'one engine'
* refactor(runtime): extract the canonical runToolLoop; routerToolLoop becomes a thin adapter (keystone 1/4)
* refactor(runtime): unify the supervisor brain on the canonical ToolLoopChat seam (keystone)

  Delete DriverChat + routerDriverChat; the coordination-driver brain is now the canonical ToolLoopChat and its loop runs through runToolLoop (routerBrain = 4 lines, was 60). The equal-k driver-inference metering is preserved exactly. Three tool-loop copies collapse to one.
* docs(simplification): keystone WS1 is two phases - 1a (seam unified) done, 1b (brain-from-profile/harness-as-data, sandbox supervisor) next
* refactor(runtime): internalize leaked recursion/seam/journal/trace plumbing from the public barrel
* refactor(runtime): internalize durable spawn-journal + spawn-tree types from the public barrel
* refactor(api): collapse public export subpaths 13→6 (fold audit into profiles; drop unused duplicates)
* docs: fix 3 stale/fabricated symbol references (DriverChat, runSteeringExperiment, refineGepa label)
* docs: consolidate 26→19 + archive (shrink canonical-api 984→76, merge 4 architecture docs→1, merge PLAIN→README, archive 5 niche notes)
* feat(docs-gate): CLASS 6 prose-symbol check - every backticked symbol in curated docs must resolve

  Scans canonical-api/concepts/architecture for backticked symbols outside code fences; reddens on any call-shaped or PascalCase symbol that resolves to no src/bench/substrate export or concept-whitelist entry. Walks every substrate dist/**/*.d.ts (not just index barrels). Closes the gap that let gepaDriver/refineGepa live in the docs unchecked.

* chore(profiles): sort barrel exports after the audit fold (biome)
* docs(simplification): mark WS1a/WS3/WS5 shipped
* feat(runtime): supervisorAgent resolves the brain from profile.harness (WS1b)

  A supervisor is now an AgentProfile: harness null -> the in-process router tool-loop (coordinationDriverAgent; routerBrain becomes an internal detail), a coding-CLI harness (claude-code/opencode/codex) -> a sandboxed harness driving the coordination verbs via serveCoordinationMcp. Both arms share makeWorkerAgent + the keep-best-delivered oracle. Closes the critique's A2 (driver brain was router-only). Proven offline both arms.
* docs(simplification): mark WS1b shipped (supervisorAgent - brain from profile.harness)
* feat(runtime): supervise() one-call convenience + workerFromBackend

  supervise(profile, task, { backend|makeWorkerAgent, budget }) defaults blobs/perWorker/journal/executors/maxDepth so 'just invoke the supervisor' is a one-liner. workerFromBackend derives the worker seam from a backend config + an optional completion oracle (settled⟺delivered). The raw seams (supervisorAgent + createSupervisor().run) stay for power use.

* docs(simplification): table the supervisor/driver/worker multi-round design (round vs turn, prompt-policy retry, real-time trace self-correction)
* docs(examples): canonical supervise() one-call example (the DX payoff: profile + goal, scaffolding defaulted)
* refactor(mcp): rename spawn_worker → spawn_agent (the verb spawns ANY agent, incl. a sub-supervisor)

  The coordination verb always took a worker OR a driver profile and resolves a sub-supervisor via the role marker; the name lied. Renamed across the tool def, the LLM-facing descriptions, the scripted-brain tests, the examples, and the hand docs. WS4 (naming taxonomy).

* refactor(mcp): rename observe_worker/steer_worker → observe_agent/steer_agent (consistent verb family)

  The coordination verbs operate on any spawned agent (a leaf worker OR a sub-supervisor), so the family is now spawn_agent / observe_agent / steer_agent. WS4 (naming taxonomy).

* refactor(runtime): depthDriver/breadthDriver→depthStrategy/breadthStrategy, supervisorSkill→supervisorInstructions (WS4)

  They are strategy combinators and a prompt-instruction builder, not 'drivers'/'skills'; this reserves 'Driver' for the agent-orchestration layer (coordinationDriverAgent/driverChild).
* refactor(mcp): rename createDriveTurnResumeDriver → createDetachedTurnResumeDriver (WS4)
* feat(improvement): improve() - the one pluggable RSI verb (facade over selfImprove; generator defaulted from surface)
* test(improvement): offline improve() facade test (scripted generator, no creds)
* refactor(examples): delete run-router.ts + loop.ts, rewrite sandbox/bridge runners onto supervise()

  run-router.ts duplicated examples/supervise/supervise.ts (router brain + router-tools backend). loop.ts's runSupervisorLoop/makeWorkerAgent duplicated supervise()/workerFromBackend. The sandbox + bridge runners now call supervise() with only their load-bearing per-backend seam; the shared demo task + scripted brain move to shared.ts.

* refactor(examples): rewrite run-supervisor-mcp onto workerFromBackend() (single-sourced worker seam)

  Replaces the bespoke makeWorker (executor construction + per-worker file plumbing) with workerFromBackend(backend, deliverable); the deployable check now reads the worker's real output for ANSWER=42 (completion oracle, not a self-report). Keeps the cli-bridge harness supervisor arm that drives spawn_agent natively over the coordination MCP.

* style(improvement,mcp): biome import ordering after WS4 rename + improve() exports
* docs(examples): point READMEs at the pruned set + add an offline supervise() example test
* docs(simplification): record WS4/WS2 closed decisions (AgentRunSpec deferred, improvementDriver/runLoop kept, improve surface boundary)
* fix(scripts): align verify-package-exports with the 6-subpath surface (drop stale ./workflow)

  The 13→6 export collapse (e6ff2a2) removed the ./workflow subpath but left verify-package-exports.mjs asserting it (requiredExports + a runtime import), so the gate failed on a subpath the package intentionally no longer exposes. Verify the real subpaths (., ./agent, ./intelligence, ./loops, ./profiles, ./mcp) instead.
* refactor: post-audit cleanups - rename internal runToolLoop→runBrainLoop, fix improve() default model

  - runToolLoop name collided with the public streaming runToolLoop; the internal brain-loop seam is now runBrainLoop (one grep = one concept).
  - improve()'s zero-config default reflection model was the dead anthropic/claude-sonnet-4.6 → deepseek-v4-flash (router-served).

* docs: front-door supervise()/improve() - the #1 audit fix

  The two flagship verbs were invisible in every gated doc, so a reader was routed back onto the verbose legacy path the PR replaced. README now leads with the 3 entry points (chat turn / supervise / improve); canonical-api §2 makes supervise() the 'just run a supervisor' START-HERE row and routes self-improvement to improve().

* refactor: delete the standalone workflow-script engine (src/workflow, 2775 LOC + tests)

  A third orchestration substrate (a workflow-as-a-script DSL runner with its own checkpoints/budget/delegates) that does NOT use the supervisor and is NOT self-improving, so it is redundant with the Scope/Supervisor + supervise() path (the architecture's 'two substrates, do not invent a third'). Zero in-repo or fleet consumers; its ./workflow subpath was already dropped in WS3.
* feat(mcp): spawn_agent accepts an optional per-spawn budget (supervisor can vary budget per worker)
* feat(runtime): allowedModels guard on supervise()/improve() (fail-loud model-subset restriction)
* docs(examples): strategy-evolution - the policy-search research journey (runStrategyEvolution + promotionGate)
* feat(conversation): evalPersona() facade + examples/product-eval (user-sim product evals, one-call)
* docs(examples): improve() - the RSI verb, offline scripted example
* docs(examples): intelligence-recommend - connect traces→findings→improve() (the intelligence loop)
* docs(examples): list the 4 new examples in the index README
* docs(api): regenerate API reference for allowedModels guard + evalPersona facade
* fix: address PR #347 review - restore contentAddress export, guard improve() JSON.parse, test flagged fns

  - HIGH: contentAddress was dropped from the runtime barrel by WS3 → bench/atom-humaneval + atom-mcp-e2e fail to compile (a content-addressing helper bench legitimately uses). Re-exported from the barrel.
  - MEDIUM: applyWinnerToProfile's JSON.parse threw a raw SyntaxError after a ship verdict on a malformed winner → parseWinnerJson guards it with a typed ConfigError + a test.
  - MEDIUM: finalizeBestDelivered + runBrainLoop had no direct tests → added focused unit tests (the blob store's content-address invariant is exercised).
  - LOW: supervise() decision-table/README rows implied backend is required (it's optional) → { budget, backend? }.
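The MEDIUM review item about guarding `JSON.parse` can be illustrated with a small sketch. Both `ConfigError` and `parseWinnerJson` below are hypothetical stand-ins written for this example; the real implementations are not shown on this page, and the object-only check is an added assumption.

```typescript
// Hypothetical stand-ins: the real ConfigError and parseWinnerJson are not shown here.
class ConfigError extends Error {
  constructor(message: string, readonly reason?: unknown) {
    super(message);
    this.name = "ConfigError";
  }
}

function parseWinnerJson(raw: string): Record<string, unknown> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    // Wrap the raw SyntaxError so callers see one typed failure mode.
    throw new ConfigError("winner is not valid JSON", err);
  }
  // Assumption for this sketch: a winner must be a plain JSON object.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new ConfigError("winner JSON must be an object");
  }
  return parsed as Record<string, unknown>;
}
```

A caller can then branch on the typed error (for example by checking `err.name === "ConfigError"`) rather than catching an untyped `SyntaxError` after a ship verdict has already been issued.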
Summary
Why
Product sessions need reliable replay/resume/debug data. Failed streams previously yielded the final event to the caller but did not persist it, and backend cleanup could mask the original runtime failure.
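The behavior this PR is after can be sketched as a wrapper that persists the final event before yielding it and keeps a cleanup error from replacing the runtime failure. Everything named below (`StreamEvent`, `EventStore`, `streamWithPersistence`) is hypothetical and not taken from the actual change; early cancellation by the consumer is deliberately not handled here.

```typescript
// Hypothetical types: the PR's real event and store shapes are not shown.
type DeltaEvent = { type: "delta"; text: string };
type FinalEvent = { type: "final"; ok: boolean; error?: string };
type StreamEvent = DeltaEvent | FinalEvent;

interface EventStore {
  append(event: StreamEvent): Promise<void>;
}

async function* streamWithPersistence(
  source: AsyncIterable<DeltaEvent>,
  store: EventStore,
  cleanup: () => Promise<void>,
): AsyncGenerator<StreamEvent> {
  let finalEvent: FinalEvent = { type: "final", ok: true };
  try {
    for await (const event of source) {
      await store.append(event);
      yield event;
    }
  } catch (err) {
    // The runtime failure becomes the final event instead of being lost.
    const message = err instanceof Error ? err.message : String(err);
    finalEvent = { type: "final", ok: false, error: message };
  }

  try {
    await cleanup();
  } catch {
    // Swallowed on purpose so a cleanup error cannot mask the runtime
    // failure recorded above. A real implementation would likely log it.
  }

  // Persist first, then yield, so the stored log matches what the caller saw.
  await store.append(finalEvent);
  yield finalEvent;
}
```

With this ordering, a replay of the stored events ends with the same failed final event the caller received, even when backend cleanup also throws.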
Validation