feat(conversation): runPersonaConversation — the persona loop runner (kills hand-rolled eval dispatch)#282
The keystone that lets ANY AgentProfile be evaluated as a real multi-round
conversation and drops into runProfileMatrix as its dispatch, replacing the
per-agent hand-rolled 2-turn dispatchWithSurface bridges that
runProfileMatrix's own contract forbids.
- runPersonaConversation({ worker, persona, backendFor, systemPromptOf }):
profiles-vs-profiles over the shipped conversation engine (turnOrder
'alternate'); persona is a driver AgentProfile (LLM user-sim) OR scripted
turns (deterministic fast-path). Only the WORKER is metered (the side under
test); persona-driver is the harness. Returns the persistent transcript +
worker-only cost/tokens.
- runPersonaDispatch(cfg): thin ProfileDispatchFn wrapper for runProfileMatrix
— meters the worker through ctx.cost so the matrix integrity guard sees real
usage. Same loop serves one cell and the whole matrix.
- 6 tests (multi-round ordering, worker-only metering, prompt injection,
fail-loud on empty/missing config). Built on createIterableBackend +
defineConversation + runConversation — no new engine.
tangletools
left a comment
✅ Auto-approved PR — 6f84aff4
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T00:42:12Z
tangletools
left a comment
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 6 (6 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 158.7s (2 bridge agents) |
| Total | 158.7s |
💰 Value — sound-with-nits
Adds a reusable persona-vs-worker multi-round conversation runner plus a runProfileMatrix dispatch adapter, both built on the existing defineConversation/runConversation engine rather than a new engine.
- What it does: Introduces `runPersonaConversation` (`src/conversation/run-persona.ts:130`) that takes a worker `AgentProfile`, a `PersonaDriver` (either scripted user turns or another driver `AgentProfile`), and runs them as a two-party conversation where the persona leads and the worker answers, using the shipped `defineConversation` + `runConversation` primitives (`src/conversation/run-persona.ts:160-172`). On
- Goals it achieves: The change turns every fleet agent eval into `(workerProfile, personaProfiles, judges)` → one dispatch, one matrix, one gate, eliminating per-agent hand-rolled 2-turn `dispatchWithSurface` bridges. It gives eval/self-improvement a single, reusable multi-round conversation harness that meters only the side under test so the matrix backend-integrity guard sees real usage.
- Assessment: The change is coherent and in-grain. It reuses the stable conversation engine (`src/conversation/run-conversation.ts`, `src/conversation/define-conversation.ts`) and follows the same eval-dispatch adapter pattern already established by `loopDispatch` (`src/runtime/loop-dispatch.ts`). The tests cover multi-round ordering, worker-only metering, profile-prompt injection, and fail-loud validation. The
- Better / existing approach: none; this is the right approach. I searched for an existing conversation-to-campaign dispatch adapter (`Grep` for `ProfileDispatchFn`, `runProfileMatrix`, `dispatchWithSurface` across `src/`) and found only `runtime/loop-dispatch.ts` (`src/runtime/loop-dispatch.ts:114`) for `runLoop`; no conversation/persona runner existed. The runtime `personify` layer (`src/runtime/personify/types.ts`, `src/ru
🎯 Usefulness — sound-with-nits
A coherent, well-placed bridge that adapts the shipped conversation engine to agent-eval's runProfileMatrix, mirroring the existing loop-dispatch pattern and directly addressing the hand-rolled dispatch bridge problem documented in agent-eval itself.
- Integration: Wires correctly and is reachable. The new exports surface through src/conversation/index.ts:55-62 and the package root at src/index.ts:81-89. No in-repo caller is present yet, but the PR explicitly states the next step is rewiring downstream tax/legal matrix builders, and the adapter's return type matches the agent-eval ProfileDispatchFn contract exactly (agent-eval/campaign/index.d.ts:944). It is
- Fit with existing patterns: Fits the codebase's grain. It builds on the existing defineConversation/runConversation engine (src/conversation/run-conversation.ts, define-conversation.ts) rather than introducing a new engine, and follows the same adapter pattern as loop-dispatch. The agent-eval runProfileMatrix docstring (agent-eval/campaign/index.d.ts:906-927) describes the exact anti-pattern this PR targets, confirming it so
- Real-world viability: Will hold up on the happy path and most error paths because it delegates to runConversation, which already handles abort signals, per-turn retries/circuit-breakers, and credit ceilings. The worker-only metering works when the worker backend emits llm_call events. Two realistic rough edges: (1) runPersonaDispatch does not bridge conversation turn events into ctx.trace, so a matrix run will report c
💰 Value Audit
🟡 Conversation events are not forwarded to the campaign trace [maintenance]
`runPersonaDispatch` reports cost/tokens to `ctx.cost` but does not forward conversation stream events into `ctx.trace`, unlike `loopDispatch`, which maps loop trace events into campaign spans (`src/runtime/loop-dispatch.ts:78-86` and `:100`). For matrices that inspect per-cell traces, this leaves a gap. A better finish would run `runConversationStream` with `onEvent` and emit spans under `ctx.trace`.
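As a rough illustration of the suggested bridge, the sketch below maps conversation events onto spans. The event names `turn_start`, `turn_end`, and `conversation_end` come from the review text; everything else (`ConversationEvent`, `Span`, `TraceSink`, `bridgeToTrace`) is a hypothetical stand-in, since the real `runConversationStream` event payloads and the `ctx.trace` API are not shown here and may differ.

```typescript
// Hypothetical event and trace shapes, assumed for illustration only.
type ConversationEvent =
  | { type: "turn_start"; turn: number; speaker: string }
  | { type: "turn_end"; turn: number; speaker: string }
  | { type: "conversation_end"; turns: number };

interface Span {
  name: string;
  attributes: Record<string, string | number>;
}

interface TraceSink {
  emit(span: Span): void;
}

// Build an onEvent handler that mirrors each conversation event as one span.
function bridgeToTrace(trace: TraceSink): (event: ConversationEvent) => void {
  return (event) => {
    switch (event.type) {
      case "turn_start":
      case "turn_end":
        trace.emit({
          name: `conversation.${event.type}`,
          attributes: { turn: event.turn, speaker: event.speaker },
        });
        break;
      case "conversation_end":
        trace.emit({
          name: "conversation.conversation_end",
          attributes: { turns: event.turns },
        });
        break;
    }
  };
}

// Usage with an in-memory sink standing in for ctx.trace.
const spans: Span[] = [];
const onEvent = bridgeToTrace({ emit: (span) => { spans.push(span); } });
onEvent({ type: "turn_start", turn: 0, speaker: "persona" });
onEvent({ type: "turn_end", turn: 0, speaker: "persona" });
onEvent({ type: "conversation_end", turns: 1 });
```

Keeping the bridge as a pure function of the event makes it easy to unit-test without a live conversation.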
🟡 Cost fallback can include persona spend [maintenance]
`runPersonaConversation` returns `counter.costUsd > 0 ? counter.costUsd : result.spentCreditsCents / 100` (`src/conversation/run-persona.ts:177`). The counter captures the metered worker, but the fallback uses the conversation-wide `spentCreditsCents`, which for a `profile`-kind persona includes the persona driver's own `llm_call` events. That contradicts the documented worker-only metering contract when the worker backend emits no `llm_call`. A safer fallback would sum `transcript` usage for th
🟡 Scripted persona backend is not retry-safe [maintenance]
`scriptedPersonaBackend` advances a mutable `idx` on every `stream()` call (`src/conversation/run-persona.ts:102-123`). Because `runConversation` can retry a failed turn via `callPolicy`, a retry would advance to the next scripted turn instead of replaying the current one. If retries are ever enabled for the harness, derive the turn from context/turn index rather than mutating state.
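One way to read that suggestion is sketched below: pick the scripted line from the transcript so far instead of from a counter. This is a minimal sketch under assumptions, not the code in `run-persona.ts`; `TranscriptEntry`, `statelessScriptedPersona`, and the idea that the backend can see prior turns tagged by speaker are all hypothetical.

```typescript
// Hypothetical transcript entry; the real backend input shape may differ.
interface TranscriptEntry {
  speaker: "persona" | "worker";
  text: string;
}

// Returns a stateless picker: the scripted line is chosen from how many
// persona turns are already in the transcript, so calling it again with the
// same transcript (a retry) replays the same line instead of skipping ahead.
function statelessScriptedPersona(turns: readonly string[]) {
  if (turns.length === 0) {
    throw new Error("scripted persona needs at least one turn");
  }
  return (transcript: readonly TranscriptEntry[]): string | undefined => {
    const completed = transcript.filter((t) => t.speaker === "persona").length;
    return turns[completed]; // undefined once the script is exhausted
  };
}

const nextLine = statelessScriptedPersona(["hello", "what about deductions?"]);

// First attempt and a retry of the same turn see the same empty transcript.
const firstAttempt = nextLine([]);
const retryAttempt = nextLine([]);

// After one full exchange, the second scripted line is selected.
const secondTurn = nextLine([
  { speaker: "persona", text: "hello" },
  { speaker: "worker", text: "hi, how can I help?" },
]);
```

The trade-off is that the picker depends on the transcript being passed in faithfully; a mutable index does not, but it cannot tell a retry from a new turn.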
🟡 Term 'persona' overlaps with the runtime personify layer [duplication]
The codebase already exports a `Persona` type from `src/runtime/personify/types.ts:67` (`@tangle-network/agent-runtime/runtime`). The new `PersonaDriver` (`src/conversation/run-persona.ts:32`) uses the same word for a different concept (eval user-sim vs. loop content seam). The types do not collide, but consider a qualifier like `EvalPersona` or `UserPersona` to avoid confusing consumers.
🎯 Usefulness Audit
🟡 No trace bridge from conversation events to campaign trace [robustness]
loop-dispatch forwards runLoop trace events into ctx.trace so matrix cells are observable (tests/loops/loop-dispatch.test.ts:138-139). runPersonaDispatch reports cost/tokens but never forwards runConversation's turn_start/turn_end/conversation_end events. For production eval debugging, a matrix cell will show spend without the conversation structure that produced it. Consider adding an onEvent handler that writes conversation spans to ctx.trace, or document that conversation telemetry lives outs
🟡 costUsd fallback can fold persona spend into worker-only result [robustness]
`run-persona.ts:177` returns `counter.costUsd > 0 ? counter.costUsd : result.spentCreditsCents / 100`. `spentCreditsCents` is aggregated across all participants by runConversation. If a profile-kind persona LLM emits cost but the worker backend emits only tokens (e.g., createOpenAICompatibleBackend emits usage but leaves costUsd undefined), the fallback silently includes persona spend in the returned worker cost. Since the design intent is worker-only metering, consider falling back only when no l
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 79 | 69 | 69 |
| Confidence | 70 | 70 | 70 |
| Correctness | 79 | 69 | 69 |
| Security | 79 | 69 | 69 |
| Testing | 79 | 69 | 69 |
| Architecture | 79 | 69 | 69 |
Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Profile-driven persona execution path has zero test coverage — src/conversation/run-persona.test.ts
The module's stated purpose is 'profiles-vs-profiles' (lines 6-10: 'the persona is itself a driver AgentProfile — an LLM role-playing the user'). All 5 runPersonaConversation tests use scripted personas. The profile-kind persona is tested ONLY for its error throw ('requires maxTurns', line 109). The actual execution path (run-persona.ts:149-158 — withProfilePrompt applied to the persona without a counter, persona system-prompt injection, turn alternation with variable-length persona responses) has no test. A profil
🟠 MEDIUM costUsd fallback double-counts harness (persona) credits when worker emits no llm_call events — src/conversation/run-persona.ts
Line 177: `costUsd: counter.costUsd > 0 ? counter.costUsd : result.spentCreditsCents / 100`. When the worker backend produces no `llm_call` events (counter remains 0), the fallback reads `ConversationResult.spentCreditsCents`, which is the TOTAL conversation credit spend, including the persona's LLM calls when `persona.kind === 'profile'`. This inflates the worker's reported cost by including harness/infra costs. For scripted personas the fallback is correct (no persona LLM spend). For profile-driven personas this over-reports worker cost, which propagates through `runPersonaDispatch` into `ctx.cost.observe()` ([line 222](https://github.com/tangle-network/ag
🟠 MEDIUM costUsd fallback leaks persona spend into worker metering for profile-driven personas — src/conversation/run-persona.ts
Line 177: `costUsd: counter.costUsd > 0 ? counter.costUsd : result.spentCreditsCents / 100`. `result.spentCreditsCents` is the aggregate across ALL participants (run-conversation.ts:100,333 sums every participant's llm_call.costUsd). The module docstring (lines 8-10) states 'Only the WORKER is metered', and PersonaConversationResult.costUsd is documented as 'Worker-only spend'. For a profile-driven persona ([lines 149-153](https://github.com/tangle-network/agent-runtime/blob/6f84aff468d9229ef9c0da03802b45365241a43b/src/conve
🟡 LOW No happy-path test for profile-driven persona (kind: 'profile') — src/conversation/run-persona.test.ts
Tests cover the error case (maxTurns rejection at line 109-117) but never exercise a successful run with `persona.kind === 'profile'`. The `withProfilePrompt` path for personas (line 150-153), `backendFor` called with role `'persona'`, and the profile-driven conversation loop are entirely untested. This leaves the LLM-driven persona path, the primary use case that differentiates this from a simple scripted-runner, without any integration-level confidence.
🟡 LOW Test stub fakeCtx().cost.observe has wrong arity vs real UsageSink contract — src/conversation/run-persona.test.ts
Line 56: `observe(usd: number)` takes one arg, but the real `UsageSink.observe(amountUsd, source)` takes two (report-usage.ts:25). Production calls `ctx.cost.observe(result.costUsd, 'persona-conversation')` (run-persona.ts:222) with the source label. The `as unknown as DispatchContext` cast (line 64) silences the type error, so the test never verifies the source parameter. The test passes because JS ignores extra args, but the contract is undertested. Fix: `observe(usd: number, _source?: string)` in the stub.
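A stub that goes one step beyond the suggested fix could record the source label so a test can assert on it. The sketch below assumes only the two-argument `observe(amountUsd, source)` shape quoted in this finding; the local `UsageSink` interface and `recordingSink` helper are hypothetical stand-ins, not the definitions in `report-usage.ts`.

```typescript
// Local stand-in for the two-argument contract described above (assumed shape).
interface UsageSink {
  observe(amountUsd: number, source: string): void;
}

interface ObservedUsage {
  amountUsd: number;
  source: string;
}

// Test double that keeps every call, including the source label.
function recordingSink(): { sink: UsageSink; calls: ObservedUsage[] } {
  const calls: ObservedUsage[] = [];
  const sink: UsageSink = {
    observe(amountUsd, source) {
      calls.push({ amountUsd, source });
    },
  };
  return { sink, calls };
}

// Usage: mimic the production call quoted in the finding.
const { sink, calls } = recordingSink();
sink.observe(0.02, "persona-conversation");
```

Because the stub implements the interface directly, it needs no `as unknown as` cast, so an arity mismatch would surface as a type error.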
🟡 LOW spentCreditsCents fallback path has zero test coverage — src/conversation/run-persona.test.ts
The cost fallback at line 177 (`counter.costUsd > 0 ? ... : result.spentCreditsCents / 100`) is never exercised. The fakeWorker always emits `llm_call` events with `costUsd: 0.02`, so `counter.costUsd` is always > 0 in every test. A test using a worker that emits text_delta but no llm_call would verify the fallback works and reveal the double-counting issue (finding #1).
🟡 LOW withProfilePrompt prepends system prompt in stream but not in start/resume — src/conversation/run-persona.ts
Lines 81-83: start/resume/stop pass `input` through unchanged; only stream (line 89) injects the system prompt. run-conversation.ts:412-423 calls start with the same backendInput that stream receives, so start sees messages WITHOUT the system prompt while stream sees them WITH it. Current backends (createIterableBackend identity, createSandboxPromptBackend, createOpenAICompatibleBackend) don't inspect messages in start, so this is latent. But a future backend that pre-fills a session from input.messages in start would miss
tangletools · 2026-06-14T00:51:04Z · trace
…profile path
Addresses the two MEDIUM review findings on the persona loop runner:
- costUsd no longer falls back to the conversation's aggregate spend for a profile-driven persona (that includes the persona-driver's spend → over-counts the worker). Aggregate fallback now applies ONLY to scripted personas (no LLM harness cost); profile personas report the worker's metered spend.
- adds end-to-end coverage of the profile-driven persona path (previously only its error throw was tested): multi-round execution, per-side prompt injection (worker vs persona each get their own profile prompt), and worker-only metering proving the persona-driver's spend never leaks into the worker's usage.
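The fallback rule described in this commit can be sketched as a small pure function. This is a reading of the commit message, not the actual diff: `PersonaKind`, `WorkerCounter`, and `workerCostUsd` are hypothetical names, and the real code in `run-persona.ts` may be structured differently.

```typescript
// Hypothetical names; only the rule itself comes from the commit message.
type PersonaKind = "scripted" | "profile";

interface WorkerCounter {
  // Spend accumulated from the worker's own llm_call events.
  costUsd: number;
}

function workerCostUsd(
  kind: PersonaKind,
  counter: WorkerCounter,
  spentCreditsCents: number,
): number {
  // Metered worker spend always wins when present.
  if (counter.costUsd > 0) return counter.costUsd;
  // A scripted persona makes no LLM calls, so the conversation aggregate
  // can only be worker spend and is a safe fallback.
  if (kind === "scripted") return spentCreditsCents / 100;
  // A profile persona's own calls are in the aggregate, so do not use it.
  return counter.costUsd;
}

const meteredCost = workerCostUsd("profile", { costUsd: 0.02 }, 500);
const scriptedFallback = workerCostUsd("scripted", { costUsd: 0 }, 500);
const profileNoLeak = workerCostUsd("profile", { costUsd: 0 }, 500);
```

With this rule a profile-driven run whose worker emits no `llm_call` events reports zero cost rather than an inflated one, which is the under-reporting side of the trade-off.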
tangletools
left a comment
✅ Auto-approved PR — 795d40ca
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T01:02:34Z
tangletools
left a comment
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 0 (none) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 30.0s (2 bridge agents) |
| Total | 30.0s |
No concerns — sound change, no better or existing approach found. ✅
- feat(conversation): runPersonaConversation + runPersonaDispatch — the persona loop runner; any AgentProfile evaluated as a multi-round conversation, drops into runProfileMatrix as dispatch (#282)
- feat(personify): connect the dormant analyst→steer wire + registryScopeAnalyst (#284)
- feat(skills): build-with-agent-runtime canonical spine (#285)
- fix(tool-loop): strict-model tool-call history (#286)
- docs: canonical-api.md API reference (#283)
…rsation/runConversation) (#289)
Another session shipped src/conversation/ (#282): runPersonaConversation (worker under test ⟷ simulated-user persona driver, worker-only metered, drops into runProfileMatrix via runPersonaDispatch) + runConversation (two profiles head-to-head). These are a DIFFERENT layer from runPersonified/loopUntil (recursive-atom execution); the doc now distinguishes them and stops listing 'runConversation' as a thing-to-avoid (it's a shipped canonical primitive). Adds the decision-table rows + a §3.1 subsection.
What
The keystone for fleet-wide eval + self-improvement: a persona loop runner that runs any worker `AgentProfile` as a real multi-round conversation against a persona, over the persistent transcript, and drops straight into `runProfileMatrix` as its `dispatch`.
This replaces the per-agent hand-rolled 2-turn `dispatchWithSurface` bridges (tax + legal each copy-paste one) that `runProfileMatrix`'s own docstring calls "exactly the pile of bespoke eval scripts the adoption skills keep trying (and failing) to forbid."
API
- `runPersonaConversation({ worker, persona, backendFor, systemPromptOf, maxTurns? })`: the loop runner. Profiles vs profiles: the persona is a driver `AgentProfile` (an LLM role-playing the user) or scripted turns (deterministic fast-path). Runs on the shipped conversation engine (`turnOrder: 'alternate'`). Only the worker is metered (the side under test); the persona-driver is the harness. Returns `{ transcript, turns, halted, costUsd, tokensIn, tokensOut }`.
- `runPersonaDispatch(cfg)`: thin `ProfileDispatchFn` wrapper for the matrix; meters the worker through `ctx.cost` so the backend-integrity guard sees real usage. One loop serves a single cell and the whole matrix.
Why this shape
A persona is the user side (driver), not the agent's prompt; the worker's `AgentProfile` is what self-improvement optimizes; some personas are held out for generalization. The runner makes every fleet agent just `(workerProfile, personaProfiles, judges)` → one dispatch, one matrix, one gate.
Tests
6 tests: multi-round ordering (persona leads, worker answers each), worker-only metering from `llm_call` events, profile-prompt injection into the worker, fail-loud on empty/missing config. `tsc` clean on `src`; builds; exported from the package root. Built on `createIterableBackend` + `defineConversation` + `runConversation`; no new engine.
Next
Rewire tax's `buildMatrix`/`dispatchWithSurface` onto `runProfileMatrix({ dispatch: runPersonaDispatch(...) })` (parity proof), then the rest of the fleet, then layer the surface drivers (skills/memory/rag/tools) on the clean base.
Scope
Eval/runtime primitive; no auth/billing/lifecycle surface. Pure addition.
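To make the loop shape described under API concrete, here is a self-contained toy model of the scripted fast-path: the persona leads, the worker answers each turn, and only the worker's usage is accumulated. It does not call the real `runPersonaConversation`; `Turn`, `Usage`, `WorkerFn`, and `runScriptedLoop` are hypothetical stand-ins, and the real runner is built on `defineConversation` + `runConversation` rather than a plain loop.

```typescript
// Toy model only; none of these names exist in the package.
interface Turn {
  speaker: "persona" | "worker";
  text: string;
}

interface Usage {
  costUsd: number;
  tokensIn: number;
  tokensOut: number;
}

// A worker sees the transcript so far and returns a reply plus its usage.
type WorkerFn = (transcript: readonly Turn[]) => { text: string; usage: Usage };

function runScriptedLoop(worker: WorkerFn, personaTurns: readonly string[]) {
  if (personaTurns.length === 0) {
    throw new Error("persona needs at least one scripted turn"); // fail loud
  }
  const transcript: Turn[] = [];
  const total: Usage = { costUsd: 0, tokensIn: 0, tokensOut: 0 };
  for (const line of personaTurns) {
    // Persona leads each round; scripted lines carry no metered usage.
    transcript.push({ speaker: "persona", text: line });
    const reply = worker(transcript);
    transcript.push({ speaker: "worker", text: reply.text });
    // Only the side under test contributes to the totals.
    total.costUsd += reply.usage.costUsd;
    total.tokensIn += reply.usage.tokensIn;
    total.tokensOut += reply.usage.tokensOut;
  }
  return { transcript, ...total };
}

// Usage with a fake worker that echoes the latest persona line.
const echoWorker: WorkerFn = (transcript) => ({
  text: `ack: ${transcript[transcript.length - 1]?.text ?? ""}`,
  usage: { costUsd: 0.5, tokensIn: 10, tokensOut: 4 },
});
const loopResult = runScriptedLoop(echoWorker, ["first question", "follow-up"]);
```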