feat(supervise): profile-richness gate + per-tool-call timing + real worker metering#348

Merged
drewstone merged 4 commits into main from feat/profile-richness-gate-and-tool-timing
Jun 21, 2026

Conversation

@drewstone

What

Three small, additive observability primitives on the supervise path — built to make "does the supervisor author rich worker profiles, and how long does every action take?" measurable.

  • assessAuthoredProfile(profile) + profileRichnessFinding (src/runtime/supervise/authoring.ts) — flags a worker profile as THIN when the system prompt is short / it has no skills / no tools, and emits a real AnalystFinding (rides the existing coordination bus). The only prior check was empty-string, so a 2-sentence stub passed silently.
  • Per-tool-call timing: routerToolsInlineExecutor stamps real startedAt/endedAt/durationMs per tool call (runtime.ts); toToolSpan carries the real duration (trace-source.ts, was hard-zero). The unified timeline now has true per-action latency.
  • Real worker metering — exports makeFinding/computeFindingId/AnalystFinding/CoordinationEvent on /loops so a harness can compute thinProfileRatio and read the bus.
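
A harness-side sketch of the thinProfileRatio metric named in the last bullet. The `RichnessVerdict` shape and the `thinProfileRatio` helper are assumptions for illustration; the PR only states that the re-exports let a harness compute this ratio.

```typescript
// Hypothetical stand-in for the verdict returned by assessAuthoredProfile;
// only the `thin` flag is taken from the PR description.
interface RichnessVerdict {
  thin: boolean;
}

// Fraction of authored worker profiles flagged THIN (0 when nothing was authored).
function thinProfileRatio(verdicts: readonly RichnessVerdict[]): number {
  if (verdicts.length === 0) return 0;
  const thinCount = verdicts.filter((v) => v.thin).length;
  return thinCount / verdicts.length;
}

const sampleVerdicts: RichnessVerdict[] = [
  { thin: true },
  { thin: false },
  { thin: false },
  { thin: true },
];
console.log(thinProfileRatio(sampleVerdicts)); // 0.5
```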

Why

This is the substrate behind the supervisor proof: it turns "eyeball the prompts" into an automated richness gate, and "faked $0 spend" into a real timeline.

Proof

Green from a clean tree: tsc --noEmit 0, biome clean, pnpm test 1052 passed. Proven live (supervisor-lab): the gate fires on thin profiles and confirms rich ones; per-tool-call durations populate the timeline.

…unified timeline

assessAuthoredProfile(profile) OBSERVES an authored AgentProfile (no judge
verdict read, so it steers past assertTraceDerivedFindings) and flags THIN —
a short/few-line system prompt, or no tools, or no skills, or no MCP when the
task needs one — closing the gap where the existing gates only reject a fully
EMPTY prompt. profileRichnessFinding turns the verdict into a bus-routable
AnalystFinding (area profile-quality) so a supervisor can self-correct and
re-author. Both re-exported from /loops; makeFinding/computeFindingId/
AnalystFinding + CoordinationEvent re-surfaced there too.
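
A minimal sketch of the kind of check this commit describes, not the code in authoring.ts. The `SketchProfile` shape, the threshold values, and the reason strings are assumptions; the thin predicate follows the form quoted later in this thread.

```typescript
// Simplified, hypothetical profile shape; the real AgentProfile has more fields.
interface SketchProfile {
  systemPrompt: string;
  tools: string[];
  skills: string[];
  mcpServers: string[];
}

// Threshold values are illustrative assumptions, not the PR's defaults.
interface SketchThresholds {
  minPromptChars: number;
  minPromptLines: number;
}

interface SketchRichness {
  thin: boolean;
  reasons: string[];
}

function assessSketch(
  profile: SketchProfile,
  thresholds: SketchThresholds = { minPromptChars: 200, minPromptLines: 3 },
  opts?: { needsMcp?: boolean },
): SketchRichness {
  const promptLines = profile.systemPrompt
    .split("\n")
    .filter((line) => line.trim() !== "").length;
  const promptThin =
    profile.systemPrompt.trim().length < thresholds.minPromptChars ||
    promptLines < thresholds.minPromptLines;
  const hasTools = profile.tools.length > 0;
  const hasSkills = profile.skills.length > 0;
  const hasMcp = profile.mcpServers.length > 0;

  const reasons: string[] = [];
  if (promptThin) reasons.push("system prompt is short");
  if (!hasTools) reasons.push("no tools granted");
  if (!hasSkills) reasons.push("no skills attached");
  if (opts?.needsMcp && !hasMcp) reasons.push("task needs an MCP but none is attached");

  // Same form as the predicate quoted in the review of authoring.ts.
  const thin = promptThin || (!hasTools && !hasSkills && !hasMcp);
  return { thin, reasons };
}

const stubVerdict = assessSketch({
  systemPrompt: "You are a helper. Do the task.",
  tools: [],
  skills: [],
  mcpServers: [],
});
console.log(stubVerdict.thin); // true
```

A two-sentence stub like the one above passes an empty-string check but fails the length threshold, which is the gap the commit message describes.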

routerToolsInlineExecutor now stamps real startedAt/endedAt/durationMs on each
tool call and threads them through ToolStepInput -> toToolSpan, so a push
TraceSource carries non-zero span durations onto the unified timeline instead
of collapsing every span to a single instant.
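
A sketch of the span conversion described above, assuming trimmed-down `StepInput` and `Span` shapes (the real ToolStepInput and ToolSpan are not shown in this thread). It illustrates how optional timing can fall back to a single-instant span when a caller does not supply it.

```typescript
// Hypothetical, trimmed-down shapes; the real ToolStepInput/ToolSpan carry more fields.
interface StepInput {
  toolName: string;
  recordedAt: number; // when the step reached the trace source (ms since epoch)
  startedAt?: number;
  endedAt?: number;
}

interface Span {
  toolName: string;
  startMs: number;
  endMs: number;
  durationMs: number;
}

function toSpanSketch(step: StepInput): Span {
  // Missing timing falls back to the historical single-instant span.
  const startMs = step.startedAt ?? step.recordedAt;
  const endMs = step.endedAt ?? startMs;
  return {
    toolName: step.toolName,
    startMs,
    endMs,
    durationMs: Math.max(0, endMs - startMs),
  };
}

const timedSpan = toSpanSketch({ toolName: "read_file", recordedAt: 1_000, startedAt: 1_000, endedAt: 1_250 });
const untimedSpan = toSpanSketch({ toolName: "read_file", recordedAt: 1_000 });
console.log(timedSpan.durationMs, untimedSpan.durationMs); // 250 0
```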
tangletools previously approved these changes Jun 21, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 3c87c8d0

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T12:02:20Z

@tangletools

✅ No Blockers — 3c87c8d0

Readiness 73/100 · Confidence 65/100 · 8 findings (2 medium, 6 low)

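| Metric | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 79 | 73 | 73 |
| Confidence | 65 | 65 | 65 |
| Correctness | 79 | 73 | 73 |
| Security | 79 | 73 | 73 |
| Testing | 79 | 73 | 73 |
| Architecture | 79 | 73 | 73 |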

Full multi-shot audit completed 1/1 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM No test coverage for assessAuthoredProfile / profileRichnessFinding — src/runtime/supervise/authoring.ts

The existing tests/loops/supervisor-authoring.test.ts imports only asAuthoredProfile/supervisorInstructions (line 4-8) and does not exercise assessAuthoredProfile or profileRichnessFinding. These are public exports (index.ts:289-294) that emit bus-routable AnalystFindings capable of steering the supervisor. The thin/rich classification (authoring.ts:221), the needsMcp branch (line 215), severity scaling ([lines 252-256](https://github.com/tangle-network/agent-runtime/blob/3c87c8d0967b4afdcdd7ccde448428565b4d44a3/src/run

🟠 MEDIUM No test coverage for profile-richness gate or timing changes — src/runtime/supervise/authoring.ts

assessAuthoredProfile (line 180) and profileRichnessFinding (line 243) — 165 net new lines of logic with zero tests. tests/loops/supervisor-authoring.test.ts tests asAuthoredProfile and supervisorInstructions but imports none of the 6 new exports. The toToolSpan timing fallback (line 49-50) and onToolStep timing capture ([line 441-460](https://github.com/tangle-network/agent-runtime/

🟡 LOW Dead runId parameter in profileRichnessFinding — src/runtime/supervise/authoring.ts

profileRichnessFinding accepts opts?: { analystId?: string; runId?: string } (line 245) but runId is never read in the function body. The id_basis passed to computeFindingId does not include runId. Either wire it into the finding or remove it from the opts type.

🟡 LOW Redundant double-hashing via id_basis = computeFindingId(...) — src/runtime/supervise/authoring.ts

At line 274, profileRichnessFinding passes id_basis: computeFindingId({analyst_id, area, subject, claim: 'richness:...'}) to makeFinding. Verified against the agent-eval impl (chunk-45EEMHTC.js:15-29): makeFinding internally calls computeFindingId again using id_basis as the claim-override, so the final finding_id is sha256(sha256(...)). The result is stable (cross-run diffing works), so this is not a correctness bug, but it is semantically wrong and wasteful. The agent-eval docstring (types-2VVIL04s.d.ts:259-261) documents id_basis as a static claim-overrides string, not a pre-computed hash. Fix: `id_basis: richness.thin ? 'richness:thin' : 'richness:r

🟡 LOW Thin gate checks hasMcp unconditionally but reasons only surface when needsMcp — src/runtime/supervise/authoring.ts

Line 221: const thin = promptThin || (!hasTools && !hasSkills && !hasMcp) — hasMcp factors into thinness regardless of needsMcp. But line 215: the reasons array only adds an MCP-related reason when opts?.needsMcp && !hasMcp. A profile with tools+skills but no MCP could be marked thin without an MCP reason appearing in reasons. If the intent is that MCP is not a lever when needsMcp is false, the thin gate should also condition on needsMcp: (!hasTools && !hasSkills && (!opts?.needsMcp || !hasMcp)).
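
To make the difference between the two predicates concrete, here is a small sketch that evaluates both as quoted in this finding. The `Signals` record is a hypothetical container for the booleans named above; nothing else about authoring.ts is assumed.

```typescript
// Hypothetical container for the booleans named in the finding.
interface Signals {
  promptThin: boolean;
  hasTools: boolean;
  hasSkills: boolean;
  hasMcp: boolean;
  needsMcp: boolean;
}

// Predicate as quoted from authoring.ts line 221.
const thinAsMerged = (s: Signals): boolean =>
  s.promptThin || (!s.hasTools && !s.hasSkills && !s.hasMcp);

// Predicate with the suggested needsMcp condition.
const thinAsSuggested = (s: Signals): boolean =>
  s.promptThin || (!s.hasTools && !s.hasSkills && (!s.needsMcp || !s.hasMcp));

// An MCP-only profile whose task does not declare needsMcp.
const mcpOnly: Signals = {
  promptThin: false,
  hasTools: false,
  hasSkills: false,
  hasMcp: true,
  needsMcp: false,
};
console.log(thinAsMerged(mcpOnly), thinAsSuggested(mcpOnly)); // false true
```

Under these definitions the two predicates disagree only for a profile with an adequate prompt, no tools, no skills, and an MCP attached while needsMcp is false: the merged form scores it rich, the suggested form scores it thin.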

🟡 LOW resolveSystemPrompt reads non-canonical prompt.system field — src/runtime/supervise/authoring.ts

At line 173, the function checks o.system as a fallback, but the published AgentProfilePrompt (agent-interface agent-profile.d.ts:127-136) defines only systemPrompt? and instructions? — there is no system field. The branch is behind a runtime typeof guard so it never crashes, but against canonical agent-interface profiles it is dead code and the docstring claim ('the sandbox prompt.system convention') is unverified against any in-repo profile shape. Low impact; flag for accuracy.

🟡 LOW sentenceCount / hasSubagents computed but never drive the verdict — src/runtime/supervise/authoring.ts

sentenceCount (line 191-193) and hasSubagents (line 203) are returned in ProfileRichness but are not part of the signals array (line 218) nor the thin predicate (line 221). They exist for rendering only. Additionally, the sentenceCount express

🟡 LOW onToolStep timing not bridged to createPushTraceSource().record() within this shot — src/runtime/supervise/runtime.ts

runtime.ts:454 now emits {startedAt, endedAt, durationMs} into seam.onToolStep, and trace-source.ts:49-50 now honors startedAt/endedAt in toToolSpan — but the pipe that forwards an onToolStep callback into createPushTraceSource().record({startedAt, endedAt}) does not exist in any of the 4 changed files (grep of src/ + tests/ shows onToolStep is referenced only at the interface def and the single call site in runtime.ts). The feature is correct at both endpoints; the bridge lives outside this shot. Flagging as an integration note for the global verifier: if no caller wires onToolStep→record with the new timing fields, the per-call durations never reach the unified ToolSpan timeline and the trace-source.ts change is inert. No defect in the changed files themselves.
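
A sketch of the missing bridge this note describes, under stated assumptions: `ToolStep`, `PushSource`, and `createPushSourceSketch` are hypothetical stand-ins, since the real seam and createPushTraceSource signatures are not shown in this thread.

```typescript
// Hypothetical stand-ins for the seam step and the push trace source.
interface ToolStep {
  toolName: string;
  startedAt?: number;
  endedAt?: number;
  durationMs?: number;
}

interface PushSource {
  record(step: ToolStep): void;
  spans(): { toolName: string; durationMs: number }[];
}

function createPushSourceSketch(): PushSource {
  const recorded: ToolStep[] = [];
  return {
    record: (step) => {
      recorded.push(step);
    },
    spans: () =>
      recorded.map((s) => ({
        toolName: s.toolName,
        // Without both timestamps the span keeps a zero duration.
        durationMs:
          s.startedAt !== undefined && s.endedAt !== undefined ? s.endedAt - s.startedAt : 0,
      })),
  };
}

// The bridge: the seam's onToolStep callback forwards every step, timing included.
const pushSource = createPushSourceSketch();
const seamSketch = {
  onToolStep: (step: ToolStep): void => pushSource.record(step),
};

seamSketch.onToolStep({ toolName: "grep", startedAt: 10, endedAt: 45, durationMs: 35 });
console.log(pushSource.spans()); // one span for "grep" with durationMs 35
```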


tangletools · 2026-06-21T12:09:02Z · trace

@tangletools tangletools left a comment

🟡 Value Audit — sound-with-nits

| Field | Value |
| --- | --- |
| Verdict | sound-with-nits |
| Concerns | 3 (3 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 366.7s (2 bridge agents) |
| Total | 366.7s |

💰 Value — sound-with-nits

Adds a heuristic profile-richness gate that emits AnalystFindings, real per-tool-call wall-clock timing through TraceSource, and re-exports coordination/finding primitives from /loops — coherent observability substrate with no existing equivalent.

  • What it does: 1) assessAuthoredProfile scores an authored AgentProfile against tunable thresholds (system-prompt length/lines, presence of tools/skills/MCP/description/subagents) and returns a ProfileRichness verdict. 2) profileRichnessFinding turns that verdict into a bus-routable AnalystFinding (area profile-quality). 3) routerToolsInlineExecutor now stamps real startedAt/endedAt/`durationMs
  • Goals it achieves: Make supervisor-authored worker profile quality measurable and gate-able automatically (closing the gap where a two-sentence stub passes the existing empty-string check). Make the unified tool-call timeline carry true per-action latency instead of zero-duration spans. Surface the substrate primitives a harness needs to compute metrics like thinProfileRatio and observe the coordination bus from t
  • Assessment: Good change. It is additive, fits the codebase's grain (AnalystFinding-based observability, TraceSource abstraction, /loops as the canonical loop surface), and addresses real gaps. The implementation is straightforward and backward-compatible for consumers who do not set onToolStep. No existing equivalent found.
  • Better / existing approach: none — this is the right approach. I searched: src/intelligence/ for existing profile validation/richness assessment (none); src/runtime/supervise/ for existing profile-quality gates (only asAuthoredProfile, which rejects an empty prompt); src/mcp/tools/checks.ts for existing finding factories (pattern exists for trace/run checks, not for profile authoring); the whole tree for `profileRich
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound-with-nits

Coherent, additive observability substrate — gate + timing + re-exports are well-formed and in the grain of the codebase; no production wiring exists yet but the named caller (supervisor-lab) is plausible and the surfaces are correctly shaped for it.

  • Integration: All three pieces are reachable from /runtime (index.ts:15, 41, 289-294). The richness gate has zero internal callers — only its own definition + re-export (grep across src/ returns matches only in authoring.ts and index.ts). The per-tool timing is correctly emitted by routerToolsInlineExecutor (runtime.ts:454-461) but ZERO RouterToolsSeam construction site in repo supplies onToolStep
  • Fit with existing patterns: Matches the codebase cleanly. The richness gate emits a real AnalystFinding via the same makeFinding/computeFindingId path already used by observe.ts:165 and checks.ts:149, on the same coordination bus pulled via await_event({kinds:['finding']}) — no competing pattern. assessAuthoredProfile reads fields that are canonical on AgentProfile (prompt.systemPrompt, tools, `resources.
  • Real-world viability: Internally robust on happy and error paths. The timing stamp at runtime.ts:441/448 wraps executeToolCall in try/catch and computes endedAt regardless of outcome, so errored calls still get real durations. The onToolStep invocation is itself try/catch-wrapped (runtime.ts:453-464) so a throwing observer can't crash the worker loop. assessAuthoredProfile degrades safely on missing fields (all
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1
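
The viability notes above (timing computed regardless of outcome, observer wrapped in try/catch) can be sketched as follows. This is a synchronous simplification with an injectable clock; `runTimed` and `StepTiming` are hypothetical names, not the code in runtime.ts.

```typescript
interface StepTiming {
  toolName: string;
  ok: boolean;
  startedAt: number;
  endedAt: number;
  durationMs: number;
}

// Runs one tool call, stamping timing whether it succeeds or throws, and
// shields the worker from a throwing observer.
function runTimed<T>(
  toolName: string,
  call: () => T,
  onToolStep: (step: StepTiming) => void,
  now: () => number = Date.now,
): T | undefined {
  const startedAt = now();
  let result: T | undefined;
  let ok = true;
  try {
    result = call();
  } catch {
    ok = false; // an errored call still gets a real duration below
  }
  const endedAt = now();
  try {
    onToolStep({ toolName, ok, startedAt, endedAt, durationMs: endedAt - startedAt });
  } catch {
    // Monitoring must not break the worker.
  }
  return result;
}

// Deterministic fake clock: each read advances by 100 ms.
let fakeTick = 0;
const fakeNow = (): number => (fakeTick += 100);

const seenSteps: StepTiming[] = [];
runTimed(
  "boom",
  () => {
    throw new Error("tool failed");
  },
  (step) => {
    seenSteps.push(step);
  },
  fakeNow,
);
console.log(seenSteps[0]?.ok, seenSteps[0]?.durationMs); // false 100
```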

💰 Value Audit

🟡 New profile-richness primitives have no in-repo test coverage [maintenance]

I searched tests/, bench/, and examples/ for assessAuthoredProfile, profileRichnessFinding, and defaultProfileRichnessThresholds and found zero usages. The 1052 passing tests cited in the PR do not exercise these new code paths. Suggest adding focused unit tests in tests/loops/supervisor-authoring.test.ts or a new file to cover thin/rich verdicts, threshold overrides, and the generated AnalystFinding shape.

🟡 RouterToolsSeam.onToolStep callback signature is a breaking type change [maintenance]

The callback parameter now requires startedAt, endedAt, and durationMs. No in-repo callers set onToolStep, so tests pass, but external consumers who typed their callback narrowly will need to update. This is acceptable for an @experimental package and is the correct type for the new real-timing behavior; noting it only so the changelog/API surface is clear.

🎯 Usefulness Audit

🟡 No in-repo wiring example for onToolStep → push-trace-source → timeline [ergonomics]

The per-call timing is emitted at runtime.ts:454-461 and the propagation slot exists at trace-source.ts:33-34/48-50, but composing them is left entirely to a future caller: nothing in repo constructs a RouterToolsSeam with onToolStep, nothing calls createPushTraceSource().record({startedAt, endedAt}), and no test exercises the through-path (tests at coordination.test.ts:526, trajectory-recorder.test.ts:27, detector-monitor.test.ts:24-52 record steps WITHOUT startedAt/endedAt). A harness au


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260621T121015Z

…pure manager

A supervisor/driver is a full agent: it can ACT (do work with its own tools) OR
SPAWN (delegate). coordinationDriverAgent carried only the coordination verbs, so
it was a pure manager that could do nothing itself. Add optional work tools:

- CoordinationDriverOptions.extraTools + executeExtraTool. Extra specs merge into
  the tool set alongside the coordination verbs; execute tries executeExtraTool
  FIRST (a non-null return is the result), else falls through to the coordination
  dispatch. Fail loud: extraTools needs executeExtraTool, and a work tool that
  shadows a reserved coordination verb throws at construction (exported
  coordinationVerbNames) — not buried in a swallowed act() throw.
- Threaded the same two options through supervisorAgent (router arm) and supervise().
- Additive: default behavior unchanged when unset.
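
A sketch of the dispatch order and fail-loud checks described in this commit message. All names here (`makeDriverSketch`, `DriverSketchOptions`, the two entries in `reservedVerbs`) are illustrative assumptions; the PR exports the real reserved list as coordinationVerbNames and its contents are not shown in this thread.

```typescript
type ToolResult = string;

// Illustrative reserved verbs; the real list is the exported coordinationVerbNames.
const reservedVerbs: readonly string[] = ["spawn_agent", "await_event"];

interface DriverSketchOptions {
  extraTools?: { name: string }[];
  // Returns null when the call is not one of the extra tools.
  executeExtraTool?: (toolName: string, args: unknown) => ToolResult | null;
}

function makeDriverSketch(opts: DriverSketchOptions = {}) {
  const extra = opts.extraTools ?? [];
  // Fail loud at construction instead of inside a swallowed act() throw.
  if (extra.length > 0 && !opts.executeExtraTool) {
    throw new Error("extraTools requires executeExtraTool");
  }
  for (const tool of extra) {
    if (reservedVerbs.includes(tool.name)) {
      throw new Error(`work tool "${tool.name}" shadows a reserved coordination verb`);
    }
  }
  return {
    toolNames: [...reservedVerbs, ...extra.map((t) => t.name)],
    execute(toolName: string, args: unknown): ToolResult {
      // Work tools get the first chance; a non-null return is the result.
      const extraResult = opts.executeExtraTool?.(toolName, args) ?? null;
      if (extraResult !== null) return extraResult;
      return `coordination:${toolName}`; // stand-in for the coordination dispatch
    },
  };
}

const driverSketch = makeDriverSketch({
  extraTools: [{ name: "run_tests" }],
  executeExtraTool: (toolName) => (toolName === "run_tests" ? "tests passed" : null),
});
console.log(driverSketch.execute("run_tests", {})); // tests passed
console.log(driverSketch.execute("spawn_agent", {})); // coordination:spawn_agent
```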
…dress #348 review nits

- Rename the in-process driver: coordinationDriverAgent->driverAgent, CoordinationDriverOptions->DriverAgentOptions.
  'coordination' became misleading once the driver gained extraTools — it ACTS or SPAWNS now, it is not a pure coordinator. The three-noun name also reified the 'driver' role into a type-name.
- nit: onToolStep step timing (startedAt/endedAt/durationMs) is now OPTIONAL — additive + non-breaking for any external RouterToolsSeam; the owned-loop executor still always supplies it (toToolSpan already defaulted gracefully).
- nit: add tests/loops/profile-richness.test.ts — assessAuthoredProfile thin/rich + threshold override + needsMcp, profileRichnessFinding shape, and the toToolSpan timing through-path.
- Regenerate docs/api for the rename + the CoordinationEvent re-export.
tangletools previously approved these changes Jun 21, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — dc2c9118

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T12:36:31Z

@tangletools tangletools left a comment

✅ Auto-approved PR — 4b381824

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T12:38:23Z

@tangletools tangletools left a comment

🟡 Value Audit — sound-with-nits

| Field | Value |
| --- | --- |
| Verdict | sound-with-nits |
| Concerns | 5 (1 low, 4 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 309.3s (2 bridge agents) |
| Total | 309.3s |

💰 Value — sound-with-nits

Adds coherent supervise-path observability: a tunable profile-richness gate, real per-tool-call timing on the owned router-tools loop, driver work-tools, and re-exports so harnesses can read the bus and build findings — but the gate is not auto-fired and the public driver name is broken without a sh

  • What it does: 1) assessAuthoredProfile / profileRichnessFinding (src/runtime/supervise/authoring.ts:180, :243) statically scores an authored AgentProfile for prompt depth + tools/skills/MCP and turns the verdict into an AnalystFinding riding the coordination bus. 2) routerToolsInlineExecutor (src/runtime/supervise/runtime.ts:444-464) now stamps real startedAt/endedAt/durationMs around each
  • Goals it achieves: Makes supervisor profile quality measurable instead of eyeballed, gives the unified timeline real per-action latency, lets the driver be a full agent (act or spawn), and lets external harnesses compute metrics like thinProfileRatio and consume coordination-bus events without reaching into /mcp or agent-eval directly.
  • Assessment: The change is in the grain of the codebase: it uses the existing AnalystFinding/CoordinationEvent bus, follows the §1.5 AgentProfile law, adds fail-loud validation, and is additive/non-breaking for RouterToolsSeam consumers. Tests cover thin/rich thresholds, the timing through-path, work-tool dispatch, and driver-inference metering (tests/loops/profile-richness.test.ts, `tests/loops/coordi
  • Better / existing approach: No materially better architecture for the core capability. Two small improvements: keep a deprecated coordinationDriverAgent alias for the renamed export to avoid breaking consumers of the experimental API, and optionally wire assessAuthoredProfile into createCoordinationTools (src/mcp/tools/coordination.ts:154) so the richness gate fires automatically on spawn_agent rather than requirin
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound-with-nits

Three additive observability primitives (profile-richness analyst, per-tool-call timing, finding/bus re-exports) plus an unmentioned extraTools driver capability — all aligned with existing patterns, fail-loud guarded, and backward-compatible; the only gaps are integration wiring that legitimately

  • Integration: All new symbols are exported from src/runtime/index.ts:15,41,289-294,311-314 and reachable from the public surface. The producer side of per-tool-call timing is wired (routerToolsInlineExecutor calls seam.onToolStep at src/runtime/supervise/runtime.ts:457 with real startedAt/endedAt/durationMs); the substrate side (toToolSpan at trace-source.ts:48-63) propagates timing into `ToolSpan
  • Fit with existing patterns: Excellent. profileRichnessFinding uses the same makeFinding substrate from @tangle-network/agent-eval that observe.ts:165, checks.ts:149, and the improve/intelligence-recommend examples already use; area: 'profile-quality' extends the existing area taxonomy (failure-mode/correctness/safety/cost/tool-use) without colliding. The onToolStep timing rides the existing observ
  • Real-world viability: Defensive where it needs to be. The onToolStep call at runtime.ts:456-467 is wrapped in try/catch ('monitoring must not break the worker'). toToolSpan is backward-compatible: missing timing collapses to the historical single-instant span (trace-source.ts:49-50). assessAuthoredProfile handles all three prompt shapes (canonical prompt.systemPrompt, sandbox prompt.system, bare-string) v
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: magic number added tests/loops/driver-inference-metering.test.ts

  •  budget: { maxIterations: 1000, maxTokens: 10_000_000, maxUsd: 0.1 }, // ~2-3 turns of $0.04 fit
    

🎯 Usefulness Audit

🟡 PR body omits the extraTools/executeExtraTool capability (a 4th feature) [ergonomics]

The PR title and body list three primitives (richness gate, per-tool timing, re-exports), but b3adf3f also lands a new extraTools WORK-tools seam on driverAgent/supervise() (coordination-driver.ts:54-68, supervisor-agent.ts:59-71, supervise.ts:63-75). It is additive, well-tested, and fail-loud — not a blocker — but the body undersells the surface area a reviewer is being asked to evaluate. Suggest a one-line add to the PR body so the change matches what shipped.

🟡 Richness reasons lists 'no tools' / 'no skills' even for rich MCP-only profiles [problem-fit]

In authoring.ts:212-214, reasons.push('no tools granted …') and 'no skills attached …' fire whenever !hasTools / !hasSkills, independent of whether the profile has an MCP. A rich-prompt profile that legitimately acts through an MCP (the sandbox/cli-bridge arm in supervisor-agent.ts:99-124) is scored RICH (the thin flag is gated on !hasTools && !hasSkills && !hasMcp), but its reasons array still carries misleading 'no tools' / 'no skills' entries. Today this is harmless — `profi

💰 Value Audit

🟡 Public export renamed without a back-compat alias [maintenance]

coordinationDriverAgent / CoordinationDriverOptions were renamed to driverAgent / DriverAgentOptions across src/runtime/supervise/coordination-driver.ts:110, src/runtime/index.ts:311, and all callers (bench/src/atom-humaneval.mts, tests). The symbol is @experimental, but it is a published /loops export. A one-release deprecated alias would let downstream code migrate without a hard break.
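
One way such an alias could look, as a sketch only. The `...Sketch` names and the function body are placeholders so the block is self-contained; in the real module the alias would be exported next to driverAgent.

```typescript
// Placeholder for the renamed options type.
interface DriverAgentOptionsSketch {
  extraTools?: string[];
}

// Placeholder for the renamed driver factory.
function driverAgentSketch(opts: DriverAgentOptionsSketch = {}): { toolCount: number } {
  return { toolCount: opts.extraTools?.length ?? 0 };
}

/** @deprecated Use driverAgentSketch; kept for one release to ease migration. */
const coordinationDriverAgentSketch = driverAgentSketch;

/** @deprecated Use DriverAgentOptionsSketch. */
type CoordinationDriverOptionsSketch = DriverAgentOptionsSketch;

// Downstream code written against the old names keeps compiling and running.
const legacyOpts: CoordinationDriverOptionsSketch = { extraTools: ["run_tests"] };
console.log(coordinationDriverAgentSketch(legacyOpts).toolCount); // 1
```

Because the alias is the same function value, behavior cannot drift between the two names; editors that honor the JSDoc `@deprecated` tag will flag remaining uses of the old one.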

🟡 Profile-richness gate is exported but not auto-fired on spawn [proportion]

assessAuthoredProfile and profileRichnessFinding (src/runtime/supervise/authoring.ts:180, :243) provide the substrate, yet nothing in the supervise path calls them. For the stated goal of turning profile quality into an automated gate, an optional assessOnSpawn hook in createCoordinationTools would make the gate self-firing instead of leaving it to every harness.
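
A sketch of what an optional assessOnSpawn hook could look like. Everything here is hypothetical: `createSpawnSketch`, the option names other than assessOnSpawn, and the finding shape are invented for illustration and do not reflect createCoordinationTools.

```typescript
interface SpawnProfile {
  systemPrompt: string;
  tools: string[];
}

interface SpawnFinding {
  area: "profile-quality";
  subject: string;
  thin: boolean;
}

interface SpawnToolsOptions {
  // Optional gate; when unset, spawning behaves exactly as before.
  assessOnSpawn?: (profile: SpawnProfile) => { thin: boolean };
  publishFinding?: (finding: SpawnFinding) => void;
}

function createSpawnSketch(opts: SpawnToolsOptions = {}) {
  return function spawnAgent(workerId: string, profile: SpawnProfile): string {
    if (opts.assessOnSpawn) {
      const verdict = opts.assessOnSpawn(profile);
      opts.publishFinding?.({ area: "profile-quality", subject: workerId, thin: verdict.thin });
    }
    return workerId; // stand-in for the real spawn
  };
}

const busFindings: SpawnFinding[] = [];
const spawnWithGate = createSpawnSketch({
  assessOnSpawn: (p) => ({ thin: p.systemPrompt.length < 200 || p.tools.length === 0 }),
  publishFinding: (finding) => {
    busFindings.push(finding);
  },
});
spawnWithGate("worker-1", { systemPrompt: "Fix the bug.", tools: [] });
console.log(busFindings.length, busFindings[0]?.thin); // 1 true
```

Keeping the hook optional preserves the additive character of the PR: harnesses that want the gate opt in, and existing callers see no change.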


value-audit · 20260621T124520Z

@drewstone drewstone merged commit cb07490 into main Jun 21, 2026
1 check passed
@drewstone drewstone mentioned this pull request Jun 21, 2026
2 participants