feat(supervise): profile-richness gate + per-tool-call timing + real worker metering#348
…unified timeline
assessAuthoredProfile(profile) OBSERVES an authored AgentProfile (no judge verdict read, so it steers past assertTraceDerivedFindings) and flags THIN (a short/few-line system prompt, or no tools, or no skills, or no MCP when the task needs one), closing the gap where the existing gates only reject a fully EMPTY prompt. profileRichnessFinding turns the verdict into a bus-routable AnalystFinding (area profile-quality) so a supervisor can self-correct and re-author. Both are re-exported from /loops; makeFinding/computeFindingId/AnalystFinding + CoordinationEvent are re-surfaced there too.
routerToolsInlineExecutor now stamps real startedAt/endedAt/durationMs on each tool call and threads them through ToolStepInput -> toToolSpan, so a push TraceSource carries non-zero span durations onto the unified timeline instead of collapsing every span to a single instant.
tangletools
left a comment
✅ Auto-approved PR — 3c87c8d0
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T12:02:20Z
✅ No Blockers —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 79 | 73 | 73 |
| Confidence | 65 | 65 | 65 |
| Correctness | 79 | 73 | 73 |
| Security | 79 | 73 | 73 |
| Testing | 79 | 73 | 73 |
| Architecture | 79 | 73 | 73 |
Full multi-shot audit completed 1/1 planned shots over 4 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM No test coverage for assessAuthoredProfile / profileRichnessFinding — src/runtime/supervise/authoring.ts
The existing tests/loops/supervisor-authoring.test.ts imports only asAuthoredProfile/supervisorInstructions (line 4-8) and does not exercise assessAuthoredProfile or profileRichnessFinding. These are public exports (index.ts:289-294) that emit bus-routable AnalystFindings capable of steering the supervisor. The thin/rich classification (authoring.ts:221), the needsMcp branch (line 215), severity scaling ([lines 252-256](https://github.com/tangle-network/agent-runtime/blob/3c87c8d0967b4afdcdd7ccde448428565b4d44a3/src/run
🟠 MEDIUM No test coverage for profile-richness gate or timing changes — src/runtime/supervise/authoring.ts
assessAuthoredProfile (line 180) and profileRichnessFinding (line 243) — 165 net new lines of logic with zero tests. tests/loops/supervisor-authoring.test.ts tests asAuthoredProfile and supervisorInstructions but imports none of the 6 new exports. The toToolSpan timing fallback (line 49-50) and onToolStep timing capture ([line 441-460](https://github.com/tangle-network/agent-runtime/
🟡 LOW Dead runId parameter in profileRichnessFinding — src/runtime/supervise/authoring.ts
profileRichnessFinding accepts `opts?: { analystId?: string; runId?: string }` (line 245) but `runId` is never read in the function body. The `id_basis` passed to computeFindingId does not include runId. Either wire it into the finding or remove it from the opts type.
🟡 LOW Redundant double-hashing via id_basis = computeFindingId(...) — src/runtime/supervise/authoring.ts
At line 274, profileRichnessFinding passes `id_basis: computeFindingId({analyst_id, area, subject, claim: 'richness:...'})` to makeFinding. Verified against the agent-eval impl (chunk-45EEMHTC.js:15-29): makeFinding internally calls computeFindingId again using id_basis as the claim-override, so the final finding_id is sha256(sha256(...)). The result is stable (cross-run diffing works), so this is not a correctness bug, but it is semantically wrong and wasteful. The agent-eval docstring (types-2VVIL04s.d.ts:259-261) documents id_basis as a static claim-overrides string, not a pre-computed hash. Fix: `id_basis: richness.thin ? 'richness:thin' : 'richness:r
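The difference between the two patterns can be sketched as follows. This is a minimal illustration under stated assumptions: `computeFindingIdSketch` and `findingIdSketch` are hypothetical stand-ins that hash the joined fields with SHA-256, not the agent-eval implementation, and the `parts` values are made up.

```typescript
import { createHash } from "node:crypto";

// Fields that feed the id hash in this sketch (names mirror the review comment).
interface IdParts {
  analyst_id: string;
  area: string;
  subject: string;
  claim: string;
}

// Simplified stand-in for computeFindingId: NOT the agent-eval implementation.
function computeFindingIdSketch(parts: IdParts): string {
  const basis = [parts.analyst_id, parts.area, parts.subject, parts.claim].join("\u0000");
  return createHash("sha256").update(basis).digest("hex");
}

// Stand-in for makeFinding's id derivation: id_basis, when given, overrides the claim.
function findingIdSketch(parts: IdParts, idBasis?: string): string {
  return computeFindingIdSketch({ ...parts, claim: idBasis ?? parts.claim });
}

// Hypothetical example values.
const parts: IdParts = {
  analyst_id: "profile-richness",
  area: "profile-quality",
  subject: "worker-1",
  claim: "system prompt is short",
};

// Pattern flagged in the review: a pre-computed hash passed as id_basis (hash of a hash).
const nestedId = findingIdSketch(
  parts,
  computeFindingIdSketch({ ...parts, claim: "richness:thin" }),
);

// Suggested pattern: a static string passed as id_basis (hashed once).
const flatId = findingIdSketch(parts, "richness:thin");
```

Both ids are deterministic, so cross-run diffing works either way; the static-string form simply drops the inner hash.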
🟡 LOW Thin gate checks hasMcp unconditionally but reasons only surface when needsMcp — src/runtime/supervise/authoring.ts
Line 221: `const thin = promptThin || (!hasTools && !hasSkills && !hasMcp)`. hasMcp factors into thinness regardless of `needsMcp`. But line 215: the reasons array only adds an MCP-related reason when `opts?.needsMcp && !hasMcp`. A profile with tools+skills but no MCP could be marked thin without an MCP reason appearing in `reasons`. If the intent is that MCP is not a lever when needsMcp is false, the thin gate should also condition on needsMcp: `(!hasTools && !hasSkills && (!opts?.needsMcp || !hasMcp))`.
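The two predicates quoted above can be compared side by side in a small sketch. The `RichnessInputs` shape and both helper names are hypothetical; only the boolean expressions are taken from the comment, and the real function derives these flags from an AgentProfile.

```typescript
// Hypothetical input shape for illustration only.
interface RichnessInputs {
  promptThin: boolean;
  hasTools: boolean;
  hasSkills: boolean;
  hasMcp: boolean;
  needsMcp?: boolean;
}

// Gate as quoted from line 221: hasMcp counts regardless of needsMcp.
function isThinCurrent(i: RichnessInputs): boolean {
  return i.promptThin || (!i.hasTools && !i.hasSkills && !i.hasMcp);
}

// Gate as suggested in the review: hasMcp only counts when needsMcp is set.
function isThinSuggested(i: RichnessInputs): boolean {
  return i.promptThin || (!i.hasTools && !i.hasSkills && (!i.needsMcp || !i.hasMcp));
}

// A profile that acts only through an MCP, with needsMcp left unset.
const mcpOnly: RichnessInputs = {
  promptThin: false,
  hasTools: false,
  hasSkills: false,
  hasMcp: true,
};

const currentVerdict = isThinCurrent(mcpOnly);     // false: the MCP keeps it rich
const suggestedVerdict = isThinSuggested(mcpOnly); // true: the MCP is ignored without needsMcp
```

Under these assumptions the two gates disagree on an MCP-only profile when `needsMcp` is unset, which is worth checking against the intended behavior before adopting the suggested expression.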
🟡 LOW resolveSystemPrompt reads non-canonical prompt.system field — src/runtime/supervise/authoring.ts
At line 173, the function checks `o.system` as a fallback, but the published AgentProfilePrompt (agent-interface agent-profile.d.ts:127-136) defines only `systemPrompt?` and `instructions?`; there is no `system` field. The branch is behind a runtime typeof guard so it never crashes, but against canonical agent-interface profiles it is dead code and the docstring claim ('the sandbox prompt.system convention') is unverified against any in-repo profile shape. Low impact; flag for accuracy.
🟡 LOW sentenceCount / hasSubagents computed but never drive the verdict — src/runtime/supervise/authoring.ts
sentenceCount (line 191-193) and hasSubagents (line 203) are returned in ProfileRichness but are not part of the `signals` array (line 218) nor the `thin` predicate (line 221). They exist for rendering only. Additionally, the sentenceCount express
🟡 LOW onToolStep timing not bridged to createPushTraceSource().record() within this shot — src/runtime/supervise/runtime.ts
runtime.ts:454 now emits {startedAt, endedAt, durationMs} into seam.onToolStep, and trace-source.ts:49-50 now honors startedAt/endedAt in toToolSpan — but the pipe that forwards an onToolStep callback into createPushTraceSource().record({startedAt, endedAt}) does not exist in any of the 4 changed files (grep of src/ + tests/ shows onToolStep is referenced only at the interface def and the single call site in runtime.ts). The feature is correct at both endpoints; the bridge lives outside this shot. Flagging as an integration note for the global verifier: if no caller wires onToolStep→record with the new timing fields, the per-call durations never reach the unified ToolSpan timeline and the trace-source.ts change is inert. No defect in the changed files themselves.
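The missing bridge described above could look roughly like this. It is a sketch under assumptions, not the repository's API: `ToolStepSketch`, `ToolSpanSketch`, and `createPushSourceSketch` are hypothetical stand-ins, and timestamps are assumed to be epoch milliseconds.

```typescript
// Hypothetical step shape: timing fields are optional, as in the follow-up commit.
interface ToolStepSketch {
  toolName: string;
  startedAt?: number; // epoch ms (assumption)
  endedAt?: number;   // epoch ms (assumption)
  durationMs?: number;
}

// Hypothetical span shape for the unified timeline.
interface ToolSpanSketch {
  name: string;
  startMs: number;
  endMs: number;
}

// Minimal stand-in for a push trace source with a record() method.
function createPushSourceSketch(now: () => number = Date.now) {
  const spans: ToolSpanSketch[] = [];
  return {
    record(step: ToolStepSketch): void {
      const start = step.startedAt ?? now();
      // Missing timing collapses to a single-instant span.
      const end = step.endedAt ?? start;
      spans.push({ name: step.toolName, startMs: start, endMs: end });
    },
    spans: (): readonly ToolSpanSketch[] => spans,
  };
}

// The bridge: forward every onToolStep callback into record().
const source = createPushSourceSketch(() => 0);
const seam = { onToolStep: (step: ToolStepSketch): void => source.record(step) };

seam.onToolStep({ toolName: "read_file", startedAt: 1000, endedAt: 1042, durationMs: 42 });
seam.onToolStep({ toolName: "legacy_step" }); // no timing supplied
```

With a bridge of this shape the first span keeps its 42 ms duration, while the untimed step still records as a zero-length span.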
tangletools · 2026-06-21T12:09:02Z · trace
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 3 (3 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 366.7s (2 bridge agents) |
| Total | 366.7s |
💰 Value — sound-with-nits
Adds a heuristic profile-richness gate that emits AnalystFindings, real per-tool-call wall-clock timing through TraceSource, and re-exports coordination/finding primitives from /loops — coherent observability substrate with no existing equivalent.
- What it does: 1) `assessAuthoredProfile` scores an authored `AgentProfile` against tunable thresholds (system-prompt length/lines, presence of tools/skills/MCP/description/subagents) and returns a `ProfileRichness` verdict. 2) `profileRichnessFinding` turns that verdict into a bus-routable `AnalystFinding` (area `profile-quality`). 3) `routerToolsInlineExecutor` now stamps real `startedAt`/`endedAt`/`durationMs
- Goals it achieves: Make supervisor-authored worker profile quality measurable and gate-able automatically (closing the gap where a two-sentence stub passes the existing empty-string check). Make the unified tool-call timeline carry true per-action latency instead of zero-duration spans. Surface the substrate primitives a harness needs to compute metrics like `thinProfileRatio` and observe the coordination bus from t
- Assessment: Good change. It is additive, fits the codebase's grain (`AnalystFinding`-based observability, `TraceSource` abstraction, /loops as the canonical loop surface), and addresses real gaps. The implementation is straightforward and backward-compatible for consumers who do not set `onToolStep`. No existing equivalent found.
- Better / existing approach: none; this is the right approach. I searched: `src/intelligence/` for existing profile validation/richness assessment (none); `src/runtime/supervise/` for existing profile-quality gates (only `asAuthoredProfile`, which rejects an empty prompt); `src/mcp/tools/checks.ts` for existing finding factories (pattern exists for trace/run checks, not for profile authoring); the whole tree for `profileRich
- Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
🎯 Usefulness — sound-with-nits
Coherent, additive observability substrate — gate + timing + re-exports are well-formed and in the grain of the codebase; no production wiring exists yet but the named caller (supervisor-lab) is plausible and the surfaces are correctly shaped for it.
- Integration: All three pieces are reachable from `/runtime` (index.ts:15, 41, 289-294). The richness gate has zero internal callers: only its own definition + re-export (grep across `src/` returns matches only in `authoring.ts` and `index.ts`). The per-tool timing is correctly emitted by `routerToolsInlineExecutor` (runtime.ts:454-461) but ZERO `RouterToolsSeam` construction site in repo supplies `onToolStep`
- Fit with existing patterns: Matches the codebase cleanly. The richness gate emits a real `AnalystFinding` via the same `makeFinding`/`computeFindingId` path already used by `observe.ts:165` and `checks.ts:149`, on the same coordination bus pulled via `await_event({kinds:['finding']})`; no competing pattern. `assessAuthoredProfile` reads fields that are canonical on `AgentProfile` (`prompt.systemPrompt`, `tools`, `resources.
- Real-world viability: Internally robust on happy and error paths. The timing stamp at runtime.ts:441/448 wraps `executeToolCall` in try/catch and computes `endedAt` regardless of outcome, so errored calls still get real durations. The `onToolStep` invocation is itself try/catch-wrapped (runtime.ts:453-464) so a throwing observer can't crash the worker loop. `assessAuthoredProfile` degrades safely on missing fields (all
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
💰 Value Audit
🟡 New profile-richness primitives have no in-repo test coverage [maintenance]
I searched `tests/`, `bench/`, and `examples/` for `assessAuthoredProfile`, `profileRichnessFinding`, and `defaultProfileRichnessThresholds` and found zero usages. The 1052 passing tests cited in the PR do not exercise these new code paths. Suggest adding focused unit tests in `tests/loops/supervisor-authoring.test.ts` or a new file to cover thin/rich verdicts, threshold overrides, and the generated `AnalystFinding` shape.
🟡 RouterToolsSeam.onToolStep callback signature is a breaking type change [maintenance]
The callback parameter now requires `startedAt`, `endedAt`, and `durationMs`. No in-repo callers set `onToolStep`, so tests pass, but external consumers who typed their callback narrowly will need to update. This is acceptable for an `@experimental` package and is the correct type for the new real-timing behavior; noting it only so the changelog/API surface is clear.
🎯 Usefulness Audit
🟡 No in-repo wiring example for onToolStep → push-trace-source → timeline [ergonomics]
The per-call timing is emitted at runtime.ts:454-461 and the propagation slot exists at trace-source.ts:33-34/48-50, but composing them is left entirely to a future caller: nothing in repo constructs a `RouterToolsSeam` with `onToolStep`, nothing calls `createPushTraceSource().record({startedAt, endedAt})`, and no test exercises the through-path (tests at coordination.test.ts:526, trajectory-recorder.test.ts:27, detector-monitor.test.ts:24-52 record steps WITHOUT startedAt/endedAt). A harness au
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
…pure manager
A supervisor/driver is a full agent: it can ACT (do work with its own tools) OR SPAWN (delegate). coordinationDriverAgent carried only the coordination verbs, so it was a pure manager that could do nothing itself. Add optional work tools:
- CoordinationDriverOptions.extraTools + executeExtraTool. Extra specs merge into the tool set alongside the coordination verbs; execute tries executeExtraTool FIRST (a non-null return is the result), else falls through to the coordination dispatch. Fail loud: extraTools needs executeExtraTool, and a work tool that shadows a reserved coordination verb throws at construction (exported coordinationVerbNames), not buried in a swallowed act() throw.
- Threaded the same two options through supervisorAgent (router arm) and supervise().
- Additive: default behavior unchanged when unset.
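The dispatch order and fail-loud checks in this commit message can be sketched as below. All names ending in `Sketch` are hypothetical, the reserved-verb list is an illustrative subset (`spawn_agent` and `await_event` appear elsewhere in this thread), and the executor is synchronous for brevity even though the real one is presumably async.

```typescript
interface ToolSpecSketch {
  name: string;
}

type ExecuteSketch = (name: string, args: unknown) => unknown;

interface DriverOptionsSketch {
  extraTools?: ToolSpecSketch[];
  // Returns null to signal "not my tool"; any other value is the result.
  executeExtraTool?: ExecuteSketch;
}

// Illustrative subset of reserved coordination verbs (assumption).
const coordinationVerbNamesSketch = ["spawn_agent", "await_event"] as const;

function makeDriverSketch(opts: DriverOptionsSketch, coordinationDispatch: ExecuteSketch) {
  const extra = opts.extraTools ?? [];

  // Fail loud at construction: extra tools need an executor.
  if (extra.length > 0 && !opts.executeExtraTool) {
    throw new Error("extraTools requires executeExtraTool");
  }

  // Fail loud at construction: a work tool must not shadow a reserved verb.
  const reserved = new Set<string>(coordinationVerbNamesSketch);
  for (const tool of extra) {
    if (reserved.has(tool.name)) {
      throw new Error(`work tool "${tool.name}" shadows a reserved coordination verb`);
    }
  }

  return {
    toolNames: [...coordinationVerbNamesSketch, ...extra.map((t) => t.name)],
    execute(name: string, args: unknown): unknown {
      // Try the work-tool executor first; fall through on null.
      const result = opts.executeExtraTool?.(name, args) ?? null;
      return result !== null ? result : coordinationDispatch(name, args);
    },
  };
}

const driver = makeDriverSketch(
  {
    extraTools: [{ name: "read_file" }],
    executeExtraTool: (name) => (name === "read_file" ? "file contents" : null),
  },
  (name) => `coordination:${name}`,
);
```

Throwing at construction keeps a misconfigured tool set from surfacing later as a swallowed error inside the agent loop.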
…dress #348 review nits
- Rename the in-process driver: coordinationDriverAgent->driverAgent, CoordinationDriverOptions->DriverAgentOptions. 'coordination' became misleading once the driver gained extraTools: it ACTS or SPAWNS now, it is not a pure coordinator. The three-noun name also reified the 'driver' role into a type-name.
- nit: onToolStep step timing (startedAt/endedAt/durationMs) is now OPTIONAL: additive + non-breaking for any external RouterToolsSeam; the owned-loop executor still always supplies it (toToolSpan already defaulted gracefully).
- nit: add tests/loops/profile-richness.test.ts: assessAuthoredProfile thin/rich + threshold override + needsMcp, profileRichnessFinding shape, and the toToolSpan timing through-path.
- Regenerate docs/api for the rename + the CoordinationEvent re-export.
tangletools
left a comment
✅ Auto-approved PR — dc2c9118
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T12:36:31Z
tangletools
left a comment
✅ Auto-approved PR — 4b381824
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T12:38:23Z
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 5 (1 low, 4 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 309.3s (2 bridge agents) |
| Total | 309.3s |
💰 Value — sound-with-nits
Adds coherent supervise-path observability: a tunable profile-richness gate, real per-tool-call timing on the owned router-tools loop, driver work-tools, and re-exports so harnesses can read the bus and build findings — but the gate is not auto-fired and the public driver name is broken without a sh
- What it does: 1) `assessAuthoredProfile`/`profileRichnessFinding` (`src/runtime/supervise/authoring.ts:180`, `:243`) statically scores an authored `AgentProfile` for prompt depth + tools/skills/MCP and turns the verdict into an `AnalystFinding` riding the coordination bus. 2) `routerToolsInlineExecutor` (`src/runtime/supervise/runtime.ts:444-464`) now stamps real `startedAt`/`endedAt`/`durationMs` around each
- Goals it achieves: Makes supervisor profile quality measurable instead of eyeballed, gives the unified timeline real per-action latency, lets the driver be a full agent (act or spawn), and lets external harnesses compute metrics like `thinProfileRatio` and consume coordination-bus events without reaching into `/mcp` or `agent-eval` directly.
- Assessment: The change is in the grain of the codebase: it uses the existing `AnalystFinding`/`CoordinationEvent` bus, follows the §1.5 AgentProfile law, adds fail-loud validation, and is additive/non-breaking for `RouterToolsSeam` consumers. Tests cover thin/rich thresholds, the timing through-path, work-tool dispatch, and driver-inference metering (`tests/loops/profile-richness.test.ts`, `tests/loops/coordi
- Better / existing approach: No materially better architecture for the core capability. Two small improvements: keep a deprecated `coordinationDriverAgent` alias for the renamed export to avoid breaking consumers of the experimental API, and optionally wire `assessAuthoredProfile` into `createCoordinationTools` (`src/mcp/tools/coordination.ts:154`) so the richness gate fires automatically on `spawn_agent` rather than requirin
- Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
🎯 Usefulness — sound-with-nits
Three additive observability primitives (profile-richness analyst, per-tool-call timing, finding/bus re-exports) plus an unmentioned extraTools driver capability — all aligned with existing patterns, fail-loud guarded, and backward-compatible; the only gaps are integration wiring that legitimately
- Integration: All new symbols are exported from `src/runtime/index.ts:15,41,289-294,311-314` and reachable from the public surface. The producer side of per-tool-call timing is wired (`routerToolsInlineExecutor` calls `seam.onToolStep` at `src/runtime/supervise/runtime.ts:457` with real `startedAt`/`endedAt`/`durationMs`); the substrate side (`toToolSpan` at `trace-source.ts:48-63`) propagates timing into `ToolSpan
- Fit with existing patterns: Excellent. `profileRichnessFinding` uses the same `makeFinding` substrate from `@tangle-network/agent-eval` that `observe.ts:165`, `checks.ts:149`, and the `improve`/`intelligence-recommend` examples already use; `area: 'profile-quality'` extends the existing area taxonomy (failure-mode/correctness/safety/cost/tool-use) without colliding. The `onToolStep` timing rides the existing observ
- Real-world viability: Defensive where it needs to be. The `onToolStep` call at `runtime.ts:456-467` is wrapped in try/catch ('monitoring must not break the worker'). `toToolSpan` is backward-compatible: missing timing collapses to the historical single-instant span (`trace-source.ts:49-50`). `assessAuthoredProfile` handles all three prompt shapes (canonical `prompt.systemPrompt`, sandbox `prompt.system`, bare-string) v
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: magic number added tests/loops/driver-inference-metering.test.ts
budget: { maxIterations: 1000, maxTokens: 10_000_000, maxUsd: 0.1 }, // ~2-3 turns of $0.04 fit
🎯 Usefulness Audit
🟡 PR body omits the extraTools/executeExtraTool capability (a 4th feature) [ergonomics]
The PR title and body list three primitives (richness gate, per-tool timing, re-exports), but `b3adf3f` also lands a new `extraTools` WORK-tools seam on `driverAgent`/`supervise()` (`coordination-driver.ts:54-68`, `supervisor-agent.ts:59-71`, `supervise.ts:63-75`). It is additive, well-tested, and fail-loud (not a blocker), but the body undersells the surface area a reviewer is being asked to evaluate. Suggest a one-line add to the PR body so the change matches what shipped.
🟡 Richness reasons lists 'no tools' / 'no skills' even for rich MCP-only profiles [problem-fit]
In `authoring.ts:212-214`, `reasons.push('no tools granted …')` and `'no skills attached …'` fire whenever `!hasTools`/`!hasSkills`, independent of whether the profile has an MCP. A rich-prompt profile that legitimately acts through an MCP (the sandbox/cli-bridge arm in `supervisor-agent.ts:99-124`) is scored RICH (the `thin` flag is gated on `!hasTools && !hasSkills && !hasMcp`), but its `reasons` array still carries misleading 'no tools' / 'no skills' entries. Today this is harmless: `profi
💰 Value Audit
🟡 Public export renamed without a back-compat alias [maintenance]
`coordinationDriverAgent`/`CoordinationDriverOptions` were renamed to `driverAgent`/`DriverAgentOptions` across `src/runtime/supervise/coordination-driver.ts:110`, `src/runtime/index.ts:311`, and all callers (`bench/src/atom-humaneval.mts`, tests). The symbol is `@experimental`, but it is a published `/loops` export. A one-release deprecated alias would let downstream code migrate without a hard break.
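A one-release alias of the kind suggested here might be as small as the following sketch. The option shape and function body are placeholders, not the real driver; in the package these declarations would be exported from the `/loops` surface.

```typescript
// Placeholder option shape (hypothetical; the real options are not shown in this thread).
interface DriverAgentOptions {
  name?: string;
}

// Placeholder body standing in for the renamed driver factory.
function driverAgent(opts: DriverAgentOptions = {}): { name: string } {
  return { name: opts.name ?? "driver" };
}

/** @deprecated Use `driverAgent`. Kept for one release to ease migration. */
const coordinationDriverAgent = driverAgent;

/** @deprecated Use `DriverAgentOptions`. Kept for one release to ease migration. */
type CoordinationDriverOptions = DriverAgentOptions;

// Downstream code written against the old names keeps compiling.
const legacyOpts: CoordinationDriverOptions = { name: "legacy-caller" };
const legacyAgent = coordinationDriverAgent(legacyOpts);
```

Because the alias is the same function object, behavior cannot drift between the two names, and the `@deprecated` JSDoc tag makes editors flag the old spelling.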
🟡 Profile-richness gate is exported but not auto-fired on spawn [proportion]
`assessAuthoredProfile` and `profileRichnessFinding` (`src/runtime/supervise/authoring.ts:180`, `:243`) provide the substrate, yet nothing in the supervise path calls them. For the stated goal of turning profile quality into an automated gate, an optional `assessOnSpawn` hook in `createCoordinationTools` would make the gate self-firing instead of leaving it to every harness.
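One possible shape for such a hook is sketched below. Everything here is hypothetical: `withAssessOnSpawn`, the profile/verdict/finding shapes, the toy assessor, and its 200-character threshold are illustrations rather than the repository's API, and the hook only observes (it publishes a finding and never blocks the spawn).

```typescript
// Hypothetical, simplified shapes for illustration.
interface ProfileSketch {
  systemPrompt: string;
  tools: string[];
}

interface VerdictSketch {
  thin: boolean;
  reasons: string[];
}

interface FindingSketch {
  area: "profile-quality";
  thin: boolean;
  reasons: string[];
}

type SpawnFn = (profile: ProfileSketch) => string; // returns a hypothetical agent id

// Toy assessor: the threshold and the rules are assumptions, not the real gate.
function assessSketch(profile: ProfileSketch, minPromptChars = 200): VerdictSketch {
  const reasons: string[] = [];
  if (profile.systemPrompt.length < minPromptChars) reasons.push("system prompt is short");
  if (profile.tools.length === 0) reasons.push("no tools granted");
  return { thin: reasons.length > 0, reasons };
}

// Wrap a spawn function so the assessor fires on every spawn when supplied.
function withAssessOnSpawn(
  spawn: SpawnFn,
  publish: (finding: FindingSketch) => void,
  assessOnSpawn?: (profile: ProfileSketch) => VerdictSketch,
): SpawnFn {
  return (profile) => {
    if (assessOnSpawn) {
      const verdict = assessOnSpawn(profile);
      publish({ area: "profile-quality", thin: verdict.thin, reasons: verdict.reasons });
    }
    return spawn(profile); // observe only: the spawn always proceeds
  };
}

const bus: FindingSketch[] = [];
const spawnAgent = withAssessOnSpawn(
  () => "agent-1",
  (finding) => {
    bus.push(finding);
  },
  assessSketch,
);

const spawnedId = spawnAgent({ systemPrompt: "Fix the bug.", tools: [] });
```

Keeping the hook optional preserves the additive character of the change: callers that pass no assessor get the existing spawn behavior unchanged.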
What
Three small, additive observability primitives on the supervise path — built to make "does the supervisor author rich worker profiles, and how long does every action take?" measurable.
- `assessAuthoredProfile(profile)` + `profileRichnessFinding` (`src/runtime/supervise/authoring.ts`): flags a worker profile as THIN when the system prompt is short / it has no skills / no tools, and emits a real `AnalystFinding` (rides the existing coordination bus). The only prior check was empty-string, so a 2-sentence stub passed silently.
- `routerToolsInlineExecutor` stamps real `startedAt`/`endedAt`/`durationMs` per tool call (`runtime.ts`); `toToolSpan` carries the real duration (`trace-source.ts`, was hard-zero). The unified timeline now has true per-action latency.
- `makeFinding`/`computeFindingId`/`AnalystFinding`/`CoordinationEvent` on `/loops` so a harness can compute `thinProfileRatio` and read the bus.
Why
This is the substrate behind the supervisor proof: it turns "eyeball the prompts" into an automated richness gate, and "faked $0 spend" into a real timeline.
Proof
Green from a clean tree: `tsc --noEmit` 0, `biome` clean, `pnpm test` 1052 passed. Proven live (supervisor-lab): the gate fires on thin profiles and confirms rich ones; per-tool-call durations populate the timeline.