refactor(supervise): substrate-agnostic TraceSource — sandbox-first trace analysis #320
Closed
drewstone wants to merge 10 commits into
… resume_worker
Close the bus to 100% bidirectional. The parent→child down-leg routes to the child
inbox (scope.send→deliver) AND records a queue:false event on the same bus: it lands
in history() + reaches subscribers for the audit trail, but is never pulled back by
the parent. New: resume_worker (continue a parked worker — the protocol had {resume}
but no verb); answer_question now routes the answer DOWN to the asking worker, unparking
it. EventBus gains PublishOptions.queue for record-only events.
down-leg + bidirectional history tests; full suite 1000 pass; typecheck/build/lint clean.
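A minimal sketch of the record-only publish described above, assuming a bus with a pull queue, subscribers, and a history log. `SketchEventBus`, `BusEvent`, `pull`, and `subscribe` are hypothetical names for illustration; only `PublishOptions.queue` and `history()` come from the commit message, and the real EventBus may be shaped differently.

```typescript
// Hypothetical shapes: only PublishOptions.queue and history() are named in the commit.
interface BusEvent {
  kind: string;
  payload: unknown;
}

interface PublishOptions {
  /** When false, the event is recorded and broadcast but never enters the pull queue. */
  queue?: boolean;
}

class SketchEventBus {
  private readonly log: BusEvent[] = [];
  private readonly pending: BusEvent[] = [];
  private readonly subscribers = new Set<(event: BusEvent) => void>();

  publish(event: BusEvent, opts: PublishOptions = {}): void {
    this.log.push(event); // every event lands in history
    this.subscribers.forEach((notify) => notify(event)); // every event reaches subscribers
    if (opts.queue !== false) this.pending.push(event); // only queued events can be pulled
  }

  subscribe(fn: (event: BusEvent) => void): () => void {
    this.subscribers.add(fn);
    return () => {
      this.subscribers.delete(fn);
    };
  }

  /** Next queued event for the parent, or undefined when the queue is empty. */
  pull(): BusEvent | undefined {
    return this.pending.shift();
  }

  history(): readonly BusEvent[] {
    return [...this.log];
  }
}
```

In this sketch a down-leg steer published with `{ queue: false }` shows up in `history()` and in subscriber callbacks, while `pull()` only ever returns up-leg events.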
…iew gaps
Address PR #318 review:
- BLOCKING: answer_question computed `delivered` but returned only { question }; it now returns { question, delivered }, consistent with steer_worker/resume_worker (no longer hides whether the answer reached a live worker).
- tests: answer routed down to a LIVE worker (delivered:true happy path); resume_worker delivered:false path; a focused event-bus queue:false unit test (history+subscribers see it, pull queue never does).
- resume_worker added to OPERATOR_TOOLS + the driver system prompt so the driver is actually prompted to use it.
Make the down-leg actually move a live worker (was observable-only). New createInbox (supervise/inbox.ts) is the receive end an executor exposes as Executor.deliver; the owned tool-loop (routerToolsInlineExecutor) drains it two ways:
- QUEUED (default): flush at each step boundary AND before the worker may settle, so it can't finish while a steer/answer it never read is pending.
- FORCEFUL (steer_worker interrupt:true): aborts the in-flight turn so the worker re-plans immediately, breaking it off a wrong path mid-task.
Black-box CLI harnesses can't be interrupted mid-step → down-leg degrades to next spawn.
inbox 4 + executor-drains-inbox integration test (flush-before-settle proven end to end through the real executor); full suite 1008 pass; typecheck/build/lint clean.
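The two drain modes can be sketched as follows. This is an illustration under assumptions, not the code in supervise/inbox.ts: `createSketchInbox`, `DownMessage`, `turnSignal`, and `hasPending` are hypothetical names, and the real createInbox may expose a different surface.

```typescript
// Hypothetical inbox sketch: queued delivery waits for a drain, forceful delivery also aborts the turn.
type DownMessage = { kind: "steer" | "answer"; text: string };

interface SketchInbox {
  deliver(msg: DownMessage, opts?: { interrupt?: boolean }): void;
  drain(): DownMessage[];
  hasPending(): boolean;
  /** Call at the start of each turn; the returned signal aborts on a forceful deliver. */
  turnSignal(): AbortSignal;
}

function createSketchInbox(): SketchInbox {
  const pending: DownMessage[] = [];
  let current = new AbortController();
  return {
    deliver(msg, opts = {}) {
      pending.push(msg); // QUEUED: picked up at the next drain
      if (opts.interrupt) current.abort(); // FORCEFUL: also cut the in-flight turn short
    },
    drain() {
      return pending.splice(0, pending.length);
    },
    hasPending() {
      return pending.length > 0;
    },
    turnSignal() {
      current = new AbortController();
      return current.signal;
    },
  };
}

/** Flush-before-settle: a worker may only finish once nothing unread remains. */
function maySettle(inbox: SketchInbox): boolean {
  return !inbox.hasPending();
}
```

A loop built on this would call `drain()` at each step boundary and check `maySettle` before returning, which is the property the commit's integration test is described as proving.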
…sendDown covers answer
PR #318 audit follow-ups (non-blocking):
- resume_worker description no longer implies a park/resume lifecycle the scope model lacks: a settled (drained) worker is gone; says so and points to spawning fresh.
- sendDown now covers the 'answer' down-leg too (removes the inline bus.publish duplication; one helper for all three down kinds).
- history() docstring lists the down-leg event kinds.
full suite 1008 pass; typecheck/lint clean.
Simplify without losing capability:
- MERGE steer_worker + resume_worker → one steer_worker (any live worker; the only
real axis was interrupt forceful-vs-queued, already a param). 'Resume' = a non-
interrupt steer. Removes a redundant verb + dissolves the resume-vs-steer prompt nits.
- REMOVE await_next — it was a strict subset of await_event({kinds:['settled']}).
One wait-verb now; callers/prompts pass kinds:['settled'] for the next finished worker.
- DROP bus.peek() — speculative, only its own test used it (YAGNI).
Down-leg event union + inbox shed the dead 'resume' kind. Full suite 1007 pass;
typecheck/build/lint clean.
…gent-eval kernel)
createDetectorMonitor (supervise/detector-monitor.ts): the online analyst on the live worker pipe. Folds each tool step through agent-eval 0.93.0's published streaming kernel (repeatedActionDetector/errorStreakDetector, the SAME kernel control-runtime folds; no detection logic reimplemented) and fires onSignal → a finding on the bus the moment a worker loops or error-storms. routerToolsInlineExecutor feeds it via a new onToolStep seam. Bumps agent-eval ^0.93.0.
monitor tests (4); full suite 1011 pass; typecheck/build/lint clean.
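To illustrate the idea of folding tool steps through a streaming detector, here is a standalone sketch of a consecutive-repeat check. It is not agent-eval's repeatedActionDetector and does not use its API; `createRepeatWatcher`, `ToolStepSketch`, and the `hashArgs` helper are hypothetical names chosen for this example.

```typescript
// Hypothetical stand-in for a streaming repeated-action check; not agent-eval's kernel.
interface ToolStepSketch {
  tool: string;
  args: unknown;
  status: "ok" | "error";
}

/** Stable-enough key for args; falls back to a marker when args cannot be serialized. */
function hashArgs(args: unknown): string {
  try {
    return String(JSON.stringify(args));
  } catch {
    return "[unhashable]"; // e.g. circular structures
  }
}

function createRepeatWatcher(
  threshold: number,
  onSignal: (message: string) => void,
): (step: ToolStepSketch) => void {
  let lastKey = "";
  let streak = 0;
  return (step) => {
    const key = `${step.tool}:${hashArgs(step.args)}`;
    streak = key === lastKey ? streak + 1 : 1;
    lastKey = key;
    // Fire once, at the moment the streak reaches the threshold.
    if (streak === threshold) {
      onSignal(`${step.tool} called ${streak} times in a row with identical args`);
    }
  };
}
```

The watcher keeps only the last key and a counter, so it can run on a live pipe without buffering the whole trace.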
Last mile: createCoordinationTools.raiseFinding (exposed on the MCP handle), the seam an ONLINE detector uses to publish a finding on the live bus mid-run. Proven end-to-end: a stuck-loop on the worker pipe → monitor → raiseFinding → await_event surfaces it.
Review fixes (audit on the earlier commit):
- HIGH: AbortSignal.any (needs Node 20.3, floor is 20) → portable mergeAbortSignals.
- forceful interrupt: docstring no longer overpromises (aborts in-flight inference, a tool mid-exec finishes first); interrupted turns no longer count toward maxTurns; added the e2e test (forceful steer aborts the turn, re-plans, aborted turn is free).
- answer to a BLOCKING question is now delivered forcefully (interrupt) to unpark the worker immediately, not at its next boundary.
- sendDown 'answer' now REQUIRES questionId (overload; no silent ?? '' mask).
- tool-step status captured (error vs ok) for the error-streak detector.
- stale await_next purged from bench prompts + docs; history() docstring drops 'resume'.
- added tests: answer delivered:false + return asserted; await_event idle-on-mismatch.
full suite 1014 pass; typecheck/build/lint clean.
…es agent-eval)
createTrajectoryRecorder (supervise/trajectory-recorder.ts): the post-hoc half of the analyst pipe. Replays a worker's captured tool steps as agent-eval spans (InMemoryTraceStore) and runs its PUBLISHED batch analyzers: buildTrajectory (structured run summary), stuckLoopView (full-run repeated-call view, complementing the online consecutive detector), toolWasteView. No analysis reimplemented; the thin bridge from live tool steps to the substrate trace model. Feeds from the same onToolStep seam as the online monitor.
3 recorder tests (real spans → real agent-eval findings); full suite 1017 pass; typecheck/build/lint clean. Closes both legs: online (mid-run) + settle (post-hoc).
…, comment accuracy)
- mergeAbortSignals listener leak: pre-link external signals ONCE; per-turn add+remove the
listener (no accumulation on long-lived signals over maxTurns).
- interrupt catch now requires a real AbortError (DOMException) — a network fault coincident
with an interrupt is no longer swallowed; rethrown.
- corrected the comment: an interrupted+re-planned turn DOES consume a maxTurns slot (bounded
backstop, not a hang) — it just doesn't bill a turn.
- onToolStep is an observability side-channel: wrapped so a throwing monitor can't crash the
worker loop; detector-monitor.observeToolStep also defends argHash on circular/unhashable args.
- projectEvent preserves questionId on the answer branch.
- stale await_next purged from skills/{supervise,loop-writer}; trimmed CLAUDE.md redundancy;
softened the recorder's per-span-duration claim.
full suite 1018 pass; typecheck/build/lint clean.
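A portable replacement for AbortSignal.any with explicit listener cleanup might look like the sketch below. It is written under assumptions: `mergeAbortSignalsSketch` and its `{ signal, dispose }` return shape are hypothetical, and the repository's mergeAbortSignals may differ in signature and in how it pre-links long-lived signals.

```typescript
// Hypothetical portable merge for runtimes without AbortSignal.any (added in Node 20.3).
function mergeAbortSignalsSketch(signals: AbortSignal[]): {
  signal: AbortSignal;
  dispose: () => void;
} {
  const controller = new AbortController();
  const cleanups: Array<() => void> = [];

  // Remove every listener this merge attached; safe to call more than once.
  const dispose = (): void => {
    for (const cleanup of cleanups.splice(0)) cleanup();
  };

  for (const source of signals) {
    if (source.aborted) {
      controller.abort(source.reason); // already aborted: propagate immediately
      break;
    }
    const onAbort = (): void => {
      controller.abort(source.reason);
      dispose(); // once merged signal fires, no listener needs to stay attached
    };
    source.addEventListener("abort", onAbort, { once: true });
    cleanups.push(() => source.removeEventListener("abort", onAbort));
  }

  if (controller.signal.aborted) dispose();
  return { signal: controller.signal, dispose };
}
```

Calling `dispose()` at the end of each turn is what keeps listeners from accumulating on a long-lived external signal across many turns.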
… replace router-only seam
The detector/analyzer were built router-only (onToolStep/ToolStep), which was premature; production is sandbox/fleet. Corrected to one interface over agent-eval's ToolSpan:
- TraceSource (trace-source.ts): a worker's tool calls as ToolSpans, from an OWNED loop (createPushTraceSource, for router/cli-bridge dispatch) OR a SANDBOX box (sandboxSessionTraceSource(box, sessionId) → box.messages() session parts → decodeToolPart, defensive across OpenAI + harness shapes). The SDK exposes tool calls via the session (SessionMessage.parts / streamPrompt), NOT exportTrace (sandbox telemetry); corrected.
- watchTrace (online) + analyzeTrace (settle) now consume a TraceSource, not a router seam.
- DELETED the router-only createDetectorMonitor/ToolStep/createTrajectoryRecorder/RecordedToolStep.
Common currency = ToolSpan; same agent-eval detectors + batch analyzers over any substrate.
trace-source 11 + watchTrace 5 + analyzeTrace 2 tests incl. the sandbox box path (mock box → session parts → loop detected); full suite 1023 pass; typecheck/build/lint clean. Live-box validation of the exact harness part-shape pending (decoder is defensive).
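The shape of the abstraction can be sketched like this. Everything here is a stand-in: `ToolSpanSketch` is not agent-eval's ToolSpan, the `TraceSourceSketch` methods are guesses at what a source needs, and the field names the decoder probes (`function.name`, `function.arguments`, `name`, `tool`, `input`, `error`) are assumptions about part shapes rather than the confirmed SDK schema.

```typescript
// Hypothetical stand-ins; the real ToolSpan and TraceSource types live in agent-eval and trace-source.ts.
interface ToolSpanSketch {
  name: string;
  args: unknown;
  status: "ok" | "error";
}

interface TraceSourceSketch {
  spans(): ToolSpanSketch[];
  subscribe(fn: (span: ToolSpanSketch) => void): () => void;
}

/** Owned-loop source: the loop pushes each tool call as it dispatches it. */
function createPushTraceSourceSketch(): TraceSourceSketch & {
  record(span: ToolSpanSketch): void;
} {
  const recorded: ToolSpanSketch[] = [];
  const listeners: Array<(span: ToolSpanSketch) => void> = [];
  return {
    record(span) {
      recorded.push(span);
      listeners.forEach((fn) => fn(span));
    },
    spans() {
      return [...recorded];
    },
    subscribe(fn) {
      listeners.push(fn);
      return () => {
        const index = listeners.indexOf(fn);
        if (index >= 0) listeners.splice(index, 1);
      };
    },
  };
}

function parseMaybeJson(value: unknown): unknown {
  if (typeof value !== "string") return value;
  try {
    return JSON.parse(value);
  } catch {
    return value; // keep the raw string when it is not valid JSON
  }
}

/** Defensive decoder: returns undefined for any part it does not recognize. */
function decodeToolPartSketch(part: unknown): ToolSpanSketch | undefined {
  if (typeof part !== "object" || part === null) return undefined;
  const p = part as Record<string, unknown>;
  // Assumed OpenAI-style part: { type: "function", function: { name, arguments } }
  if (p.type === "function" && typeof p.function === "object" && p.function !== null) {
    const fn = p.function as Record<string, unknown>;
    if (typeof fn.name !== "string") return undefined;
    return { name: fn.name, args: parseMaybeJson(fn.arguments), status: "ok" };
  }
  // Assumed harness-style part: { type: "tool" | "tool_use", name or tool, input }
  if (p.type === "tool" || p.type === "tool_use") {
    const name =
      typeof p.name === "string" ? p.name : typeof p.tool === "string" ? p.tool : undefined;
    if (name === undefined) return undefined;
    return { name, args: p.input, status: p.error ? "error" : "ok" };
  }
  return undefined;
}
```

A sandbox-backed source would implement the same `TraceSourceSketch` by decoding session parts, so an online watcher or a settle-time analyzer written against the interface would not need to know which substrate produced the spans.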
tangletools
approved these changes
Jun 17, 2026
tangletools
left a comment
Contributor
✅ Auto-approved PR — 1e7d7ffc
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-17T10:08:32Z
Contributor
Author
Re-opening from a correctly-based branch (this one carried the unsquashed #318 commits → false conflict).
Corrects the trace-analysis layer I built router-only. Production is sandbox/fleet: the detectors must run there, and the SDK exposes the tool calls via the session (SessionMessage.parts / streamPrompt), not exportTrace (which is sandbox telemetry, my earlier red herring).
The fix
One interface over agent-eval's ToolSpan (the common currency), two source implementations:
- createPushTraceSource: owned loops (router-tools, cli-bridge tool dispatch); the loop records each tool call.
- sandboxSessionTraceSource(box, sessionId): the box; box.messages({sessionId}) → session parts → decodeToolPart (defensive across OpenAI function + harness tool/tool_use shapes) → spans.
Two consumers ride the source: watchTrace (online → finding on the bus) and analyzeTrace (settle → agent-eval batch analyzers buildTrajectory / stuckLoopView / toolWasteView).
Deleted the premature router-only createDetectorMonitor / ToolStep / createTrajectoryRecorder / RecordedToolStep.
Why
This is the §1.5 'author the interface, materialize per substrate' rule I violated by building a router-only onToolStep seam. ToolSpan is the shared currency; a new substrate implements one interface; the same published agent-eval detectors + analyzers run everywhere. Local testing via cli-bridge/router; staging/prod via sandboxes.
Verification
parts schema is confirmed against the running SDK, not the .d.ts.
🤖 Generated with Claude Code