feat(runtime): durable run loop — wire supervisor resume + journal the kernel loop #346

Merged
drewstone merged 1 commit into main from feat/run-loop-durability on Jun 20, 2026

@drewstone drewstone commented Jun 19, 2026

Durable run loop

Makes an agent run survive a worker crash or restart by wiring the run loop onto durable journals, reusing the runtime's existing resume primitives.

  • Supervisor resume: createSupervisor().run() loads an existing tree and rehydrates settled children (replaySpawnTree / materializeTreeView) instead of always starting fresh. These primitives were already built and tested; this wires them in.
  • Durable journal by default — the run context uses the file-backed SpawnJournal instead of the in-memory one, so a run's progress is durable without the caller opting in.
  • Kernel loop journal — each committed iteration is journaled; a restart resumes from the last committed offset rather than re-running from the start.
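
The kernel-loop behaviour described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `LoopRecord`, `SketchLoopJournal`, and `runLoopSketch` are hypothetical names, the journal is in-memory only, and steps are synchronous for brevity.

```typescript
// Hypothetical sketch: none of these names come from the PR's LoopJournal API.
type LoopRecord<T> =
  | { kind: "iteration"; index: number; output: T }
  | { kind: "end"; result: T[] };

class SketchLoopJournal<T> {
  private records = new Map<string, LoopRecord<T>[]>();

  // Return a copy so callers cannot mutate journaled state.
  load(runId: string): LoopRecord<T>[] {
    return [...(this.records.get(runId) ?? [])];
  }

  append(runId: string, record: LoopRecord<T>): void {
    const existing = this.records.get(runId) ?? [];
    this.records.set(runId, [...existing, record]);
  }
}

function runLoopSketch<T>(
  runId: string,
  journal: SketchLoopJournal<T>,
  maxIterations: number,
  step: (index: number) => T,
): T[] {
  const prior = journal.load(runId);

  // An ended run short-circuits to its recorded result.
  const ended = prior.find((r) => r.kind === "end");
  if (ended && ended.kind === "end") return ended.result;

  // Committed iterations are read back from the journal, not re-executed.
  const outputs: T[] = [];
  for (const r of prior) {
    if (r.kind === "iteration") outputs.push(r.output);
  }

  // Resume from the first uncommitted offset.
  for (let index = outputs.length; index < maxIterations; index++) {
    const output = step(index);
    // Commit this round before moving on to the next one.
    journal.append(runId, { kind: "iteration", index, output });
    outputs.push(output);
  }

  journal.append(runId, { kind: "end", result: outputs });
  return outputs;
}
```

If `step` throws partway through, calling `runLoopSketch` again with the same `runId` and journal only executes the iterations that were never committed.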

Verification

A run killed mid-flight and reloaded resumes from the last committed step and does not redo committed work.

Relationship to the shared package

The kernel journal and tool-call repair are the agent-runtime consumer of the shared durability primitives (@tangle-network/sdk-core/durability); once that package publishes, this re-points onto the shared contract so the runtime and the sandbox/cli-bridge paths share one recovery story.

…e kernel loop [WIP]

Build-stage durability work captured as a draft for review/resume. See PR comment for spec, checklist, completion criteria, ranked alternatives, decisions, and resume steps.
@drewstone drewstone marked this pull request as ready for review June 20, 2026 14:05

@tangletools tangletools left a comment

✅ Auto-approved PR — 3e0c0f49

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-20T14:05:37Z

@tangletools tangletools left a comment

🟠 Value Audit — better-approach-exists

| Verdict | better-approach-exists |
| --- | --- |
| Concerns | 2 (1 medium-concern, 1 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 417.1s (2 bridge agents) |
| Total | 417.1s |

💰 Value — better-approach-exists

Adds coherent crash-resume durability to runLoop and supervised runs, but LoopJournal duplicates the existing ConversationJournal adapter layer instead of reusing a shared substrate.

  • What it does: Wires durable crash-resume into two runtime paths. (1) runLoop gains a LoopJournal interface (InMemoryLoopJournal + FileLoopJournal) that loads prior iterations on start, appends each committed round before planning the next, and records end on finalization; resumed runs skip committed iterations and ended runs short-circuit to the recorded result. (2) Supervisor.run now loads the spawn journal tr
  • Goals it achieves: Make agent runs survive a worker or driver process crash/restart without redoing already-committed work; give real (non-test) supervised runs a durable-by-directory context; align the kernel loop with the existing ConversationJournal durability pattern and the supervisor with its own already-built replay primitives.
  • Assessment: Good change that achieves its stated goal coherently. The commit boundary is well-placed (after workers drain and outputs fold), file appends fsync, ended runs are idempotent, and the tests in tests/loops/run-loop-durability.test.ts prove no-redo resume for both kernel and supervisor paths. The Supervisor wiring correctly reuses existing tested primitives rather than inventing new replay logic.
  • Better / existing approach: LoopJournal's InMemoryLoopJournal and FileLoopJournal are near-identical structural clones of InMemoryConversationJournal and FileConversationJournal in src/conversation/journal.ts (compare journal.ts:58-194 with loop-journal.ts:58-193). Both use the same Map-with-defensive-copy, JSONL line-per-record, fsync-on-append, begin/append/end record shapes, isNoEntError helper, and startedAt-mismatch gua
  • Model: kimi-code/kimi-for-coding
  • Bridge attempts: 3
  • Bridge warning: opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.1: bridge stream ended without value-audit content

🎯 Usefulness — sound

Wires crash-durability into both the supervised run and kernel loop using the codebase's existing journal patterns — reload recovers committed work without redoing it.

  • Integration: Reachable through two paths. (1) Supervisor: createFileRunContext(dir) or createInMemoryRunContext({dir}) → spread into SupervisorOpts → supervisor.ts:102-116 loads prior tree, rehydrates via replaySpawnTree/materializeTreeView, passes resumeFrom to scope → scope exposes scope.resume (scope.ts:433-438, types.ts:332) consumable by any Agent.act. The examples/supervisor-loop/loop.ts:208 caller alrea
  • Fit with existing patterns: Mirrors the codebase's two existing journal patterns precisely: ConversationJournal (src/conversation/journal.ts — begin/append/load shape, in-memory+JSONL+SQL impls) and SpawnJournal (src/durable/spawn-journal.ts:132-250 — same shape, same fsync-per-append). The LoopJournal file adapter (loop-journal.ts:127-217) is a structural clone of FileSpawnJournal (same JSONL format, same isNoEntError guard
  • Real-world viability: 339-line test (tests/loops/run-loop-durability.test.ts) proves end-to-end: file journal round-trips across instances (L160-197), in-memory resume skips committed rounds and only executes un-committed ones (L81-124, creates=2 not 3), ended run short-circuits to recorded result with zero boxes (L126-158), supervisor file-context resume rehydrates children from disk and does not re-execute (L274-338,
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 1
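
The supervisor resume path the audit describes (load the prior tree, rehydrate settled children, do not re-execute them) can be pictured with the sketch below. It is an assumption-laden illustration: `SpawnEvent`, `replayEvents`, and `superviseSketch` are hypothetical names, and the real `replaySpawnTree` / `materializeTreeView` signatures are not shown in this thread.

```typescript
// Hypothetical event shapes; the PR's spawn-journal record format may differ.
type SpawnEvent =
  | { kind: "spawned"; childId: string; task: string }
  | { kind: "settled"; childId: string; result: string };

interface ChildView {
  task: string;
  result?: string; // present only once the child has settled
}

// Fold the journaled events into a view of the tree's children.
function replayEvents(events: SpawnEvent[]): Map<string, ChildView> {
  const tree = new Map<string, ChildView>();
  for (const e of events) {
    if (e.kind === "spawned") {
      tree.set(e.childId, { task: e.task });
    } else {
      const child = tree.get(e.childId);
      if (child) child.result = e.result;
    }
  }
  return tree;
}

function superviseSketch(
  tasks: { childId: string; task: string }[],
  journal: SpawnEvent[],
  execute: (task: string) => string,
): Map<string, string> {
  const tree = replayEvents(journal);
  const results = new Map<string, string>();
  for (const { childId, task } of tasks) {
    const prior = tree.get(childId);
    if (prior !== undefined && prior.result !== undefined) {
      // Settled before the restart: reuse the recorded result.
      results.set(childId, prior.result);
      continue;
    }
    if (prior === undefined) journal.push({ kind: "spawned", childId, task });
    const result = execute(task);
    journal.push({ kind: "settled", childId, result });
    results.set(childId, result);
  }
  return results;
}
```

A child that was spawned but never settled is executed again here; only children with a recorded `settled` event are skipped.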

🔎 Heuristic Signals

🟡 Cruft: magic number added tests/loops/run-loop-durability.test.ts

  •        budget: { maxIterations: 1, maxTokens: 1000 },
    

💰 Value Audit

🟠 LoopJournal adapters duplicate ConversationJournal adapters [duplication]

src/runtime/loop-journal.ts:58-193 mirrors src/conversation/journal.ts:58-194 almost line-for-line: Map storage, defensive copy, JSONL file adapter with fsync, record kind shapes, corruption guards, and isNoEntError. Better approach: extract a generic append-only journal substrate in this repo, or move directly to the shared durability package the PR says will replace this. This also closes the SQL-adapter gap: ConversationJournal has SqlConversationJournal (src/conversation/journal-sql.ts) w
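
One way to picture the "generic append-only journal substrate" suggested above is a single JSONL adapter parameterised over the record type, with each journal reduced to a thin typed alias. This is a sketch under assumptions: `JsonlJournal` and `LoopIterationRecord` are hypothetical, and `isNoEntError` only borrows the helper name mentioned in the audit, not its actual implementation.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed implementation; only the helper's name appears in the audit.
function isNoEntError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "ENOENT"
  );
}

// Hypothetical shared substrate: one JSON record per line, fsync on append.
class JsonlJournal<R> {
  private readonly dir: string;

  constructor(dir: string) {
    this.dir = dir;
  }

  private fileFor(id: string): string {
    return path.join(this.dir, `${encodeURIComponent(id)}.jsonl`);
  }

  append(id: string, record: R): void {
    fs.mkdirSync(this.dir, { recursive: true });
    const fd = fs.openSync(this.fileFor(id), "a");
    try {
      fs.writeSync(fd, JSON.stringify(record) + "\n");
      fs.fsyncSync(fd); // flush to disk before the append is reported as done
    } finally {
      fs.closeSync(fd);
    }
  }

  load(id: string): R[] {
    let text: string;
    try {
      text = fs.readFileSync(this.fileFor(id), "utf8");
    } catch (err) {
      if (isNoEntError(err)) return []; // no journal yet means a fresh run
      throw err;
    }
    return text
      .split("\n")
      .filter((line) => line.length > 0)
      .map((line) => JSON.parse(line) as R);
  }
}

// A loop journal becomes a typed use of the shared substrate.
type LoopIterationRecord = { kind: "iteration"; index: number; output: string };

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "journal-sketch-"));
const writer = new JsonlJournal<LoopIterationRecord>(dir);
writer.append("run-1", { kind: "iteration", index: 0, output: "a" });
writer.append("run-1", { kind: "iteration", index: 1, output: "b" });

// A fresh instance stands in for a restarted process.
const reloaded = new JsonlJournal<LoopIterationRecord>(dir).load("run-1");
```

The sketch omits the corruption guards and the startedAt-mismatch check the audit lists; those would live in the shared class rather than in each adapter.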


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260620T141416Z

@drewstone drewstone merged commit edc1d54 into main Jun 20, 2026
1 check failed
drewstone added a commit that referenced this pull request Jun 20, 2026
…cs that can't lie, supervise() one-call (#347)

* Revert "feat(runtime): durable run loop — wire supervisor resume + journal the kernel loop (#346)"

This reverts commit edc1d54.

* docs(simplification): master tracker — converged design, scratch list, full doc/module/example inventory + completion criteria

* docs(simplification): red-team corrections — 4 verbs (run/improve/certify/refuse), steer-in-run, milestone-oracle gap, 8 skills to vendor

* docs(simplification): improve is ONE verb with a PLUGGABLE CandidateGenerator (GEPA/skillOpt/autoresearch) + surface param — not 'one engine'

* refactor(runtime): extract the canonical runToolLoop; routerToolLoop becomes a thin adapter (keystone 1/4)

* refactor(runtime): unify the supervisor brain on the canonical ToolLoopChat seam (keystone)

Delete DriverChat + routerDriverChat; the coordination-driver brain is now the canonical
ToolLoopChat and its loop runs through runToolLoop (routerBrain = 4 lines, was 60). The
equal-k driver-inference metering is preserved exactly. Three tool-loop copies collapse to one.

* docs(simplification): keystone WS1 is two phases — 1a (seam unified) done, 1b (brain-from-profile/harness-as-data, sandbox supervisor) next

* refactor(runtime): internalize leaked recursion/seam/journal/trace plumbing from the public barrel

* refactor(runtime): internalize durable spawn-journal + spawn-tree types from the public barrel

* refactor(api): collapse public export subpaths 13→6 (fold audit into profiles; drop unused duplicates)

* docs: fix 3 stale/fabricated symbol references (DriverChat, runSteeringExperiment, refineGepa label)

* docs: consolidate 26→19 + archive (shrink canonical-api 984→76, merge 4 architecture docs→1, merge PLAIN→README, archive 5 niche notes)

* feat(docs-gate): CLASS 6 prose-symbol check — every backticked symbol in curated docs must resolve

Scans canonical-api/concepts/architecture for backticked symbols outside code fences;
reddens on any call-shaped or PascalCase symbol that resolves to no src/bench/substrate
export or concept-whitelist entry. Walks every substrate dist/**/*.d.ts (not just index
barrels). Closes the gap that let gepaDriver/refineGepa live in the docs unchecked.

* chore(profiles): sort barrel exports after the audit fold (biome)

* docs(simplification): mark WS1a/WS3/WS5 shipped

* feat(runtime): supervisorAgent resolves the brain from profile.harness (WS1b)

A supervisor is now an AgentProfile: harness null -> the in-process router tool-loop
(coordinationDriverAgent; routerBrain becomes an internal detail), a coding-CLI harness
(claude-code/opencode/codex) -> a sandboxed harness driving the coordination verbs via
serveCoordinationMcp. Both arms share makeWorkerAgent + the keep-best-delivered oracle.
Closes the critique's A2 (driver brain was router-only). Proven offline both arms.

* docs(simplification): mark WS1b shipped (supervisorAgent — brain from profile.harness)

* feat(runtime): supervise() one-call convenience + workerFromBackend

supervise(profile, task, { backend|makeWorkerAgent, budget }) defaults blobs/perWorker/
journal/executors/maxDepth so 'just invoke the supervisor' is a one-liner. workerFromBackend
derives the worker seam from a backend config + an optional completion oracle (settled⟺delivered).
The raw seams (supervisorAgent + createSupervisor().run) stay for power use.

* docs(simplification): table the supervisor/driver/worker multi-round design (round vs turn, prompt-policy retry, real-time trace self-correction)

* docs(examples): canonical supervise() one-call example (the DX payoff — profile + goal, scaffolding defaulted)

* refactor(mcp): rename spawn_worker → spawn_agent (the verb spawns ANY agent, incl. a sub-supervisor)

The coordination verb always took a worker OR a driver profile and resolves a sub-supervisor
via the role marker — the name lied. Renamed across the tool def, the LLM-facing descriptions,
the scripted-brain tests, the examples, and the hand docs. WS4 (naming taxonomy).

* refactor(mcp): rename observe_worker/steer_worker → observe_agent/steer_agent (consistent verb family)

The coordination verbs operate on any spawned agent (a leaf worker OR a sub-supervisor), so the
family is now spawn_agent / observe_agent / steer_agent. WS4 (naming taxonomy).

* refactor(runtime): depthDriver/breadthDriver→depthStrategy/breadthStrategy, supervisorSkill→supervisorInstructions (WS4)

They are strategy combinators and a prompt-instruction builder, not 'drivers'/'skills' — reserving
'Driver' for the agent-orchestration layer (coordinationDriverAgent/driverChild).

* refactor(mcp): rename createDriveTurnResumeDriver → createDetachedTurnResumeDriver (WS4)

* feat(improvement): improve() — the one pluggable RSI verb (facade over selfImprove; generator defaulted from surface)

* test(improvement): offline improve() facade test (scripted generator, no creds)

* refactor(examples): delete run-router.ts + loop.ts, rewrite sandbox/bridge runners onto supervise()

run-router.ts duplicated examples/supervise/supervise.ts (router brain + router-tools
backend). loop.ts's runSupervisorLoop/makeWorkerAgent duplicated supervise()/workerFromBackend.
The sandbox + bridge runners now call supervise() with only their load-bearing per-backend
seam; the shared demo task + scripted brain move to shared.ts.

* refactor(examples): rewrite run-supervisor-mcp onto workerFromBackend() (single-sourced worker seam)

Replaces the bespoke makeWorker (executor construction + per-worker file plumbing) with
workerFromBackend(backend, deliverable); the deployable check now reads the worker's real
output for ANSWER=42 (completion oracle, not a self-report). Keeps the cli-bridge harness
supervisor arm that drives spawn_agent natively over the coordination MCP.

* style(improvement,mcp): biome import ordering after WS4 rename + improve() exports

* docs(examples): point READMEs at the pruned set + add an offline supervise() example test

* docs(simplification): record WS4/WS2 closed decisions (AgentRunSpec deferred, improvementDriver/runLoop kept, improve surface boundary)

* fix(scripts): align verify-package-exports with the 6-subpath surface (drop stale ./workflow)

The 13→6 export collapse (e6ff2a2) removed the ./workflow subpath but left
verify-package-exports.mjs asserting it (requiredExports + a runtime import), so the gate
failed on a subpath the package intentionally no longer exposes. Verify the real subpaths
(., ./agent, ./intelligence, ./loops, ./profiles, ./mcp) instead.

* refactor: post-audit cleanups — rename internal runToolLoop→runBrainLoop, fix improve() default model

- runToolLoop name collided with the public streaming runToolLoop; the internal brain-loop seam is now runBrainLoop (one grep = one concept).
- improve()'s zero-config default reflection model was the dead anthropic/claude-sonnet-4.6 → deepseek-v4-flash (router-served).

* docs: front-door supervise()/improve() — the #1 audit fix

The two flagship verbs were invisible in every gated doc, so a reader was routed back onto the
verbose legacy path the PR replaced. README now leads with the 3 entry points (chat turn /
supervise / improve); canonical-api §2 makes supervise() the 'just run a supervisor' START-HERE
row and routes self-improvement to improve().

* refactor: delete the standalone workflow-script engine (src/workflow, 2775 LOC + tests)

A third orchestration substrate (a workflow-as-a-script DSL runner with its own checkpoints/budget/
delegates) that does NOT use the supervisor and is NOT self-improving — redundant with the
Scope/Supervisor + supervise() path (the architecture's 'two substrates, do not invent a third').
Zero in-repo or fleet consumers; its ./workflow subpath was already dropped in WS3.

* feat(mcp): spawn_agent accepts an optional per-spawn budget (supervisor can vary budget per worker)

* feat(runtime): allowedModels guard on supervise()/improve() (fail-loud model-subset restriction)

* docs(examples): strategy-evolution — the policy-search research journey (runStrategyEvolution + promotionGate)

* feat(conversation): evalPersona() facade + examples/product-eval (user-sim product evals, one-call)

* docs(examples): improve() — the RSI verb, offline scripted example

* docs(examples): intelligence-recommend — connect traces→findings→improve() (the intelligence loop)

* docs(examples): list the 4 new examples in the index README

* docs(api): regenerate API reference for allowedModels guard + evalPersona facade

* fix: address PR #347 review — restore contentAddress export, guard improve() JSON.parse, test flagged fns

- HIGH: contentAddress was dropped from the runtime barrel by WS3 → bench/atom-humaneval + atom-mcp-e2e fail to compile (a content-addressing helper bench legitimately uses). Re-exported from the barrel.
- MEDIUM: applyWinnerToProfile's JSON.parse threw a raw SyntaxError after a ship verdict on a malformed winner → parseWinnerJson guards it with a typed ConfigError + a test.
- MEDIUM: finalizeBestDelivered + runBrainLoop had no direct tests → added focused unit tests (the blob store's content-address invariant is exercised).
- LOW: supervise() decision-table/README rows implied backend is required (it's optional) → { budget, backend? }.
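
The MEDIUM fix above (guarding `JSON.parse` so a malformed winner surfaces as a typed error rather than a raw `SyntaxError`) follows a common pattern, sketched here. The commit names `parseWinnerJson` and `ConfigError`, but their real signatures are not shown, so the shapes below are assumptions.

```typescript
// Assumed shape; the repository's ConfigError may carry different fields.
class ConfigError extends Error {
  readonly causedBy: unknown;

  constructor(message: string, causedBy?: unknown) {
    super(message);
    this.name = "ConfigError";
    this.causedBy = causedBy;
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Assumed signature: parse the winner payload or fail with a typed error.
function parseWinnerJson(raw: string): Record<string, unknown> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new ConfigError("winner is not valid JSON", err);
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new ConfigError("winner JSON must be an object");
  }
  return parsed as Record<string, unknown>;
}
```

Callers can then branch on `err instanceof ConfigError` instead of matching on `SyntaxError` messages.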
@drewstone drewstone mentioned this pull request Jun 21, 2026