feat(conversation): runPersonaConversation — the persona loop runner (kills hand-rolled eval dispatch) by drewstone · Pull Request #282 · tangle-network/agent-runtime

drewstone · 2026-06-14T00:42:05Z

What

The keystone for fleet-wide eval + self-improvement: a persona loop runner that runs any worker AgentProfile as a real multi-round conversation against a persona, over the persistent transcript — and drops straight into runProfileMatrix as its dispatch.

This replaces the per-agent hand-rolled 2-turn dispatchWithSurface bridges (tax + legal each copy-paste one) that runProfileMatrix's own docstring calls "exactly the pile of bespoke eval scripts the adoption skills keep trying (and failing) to forbid."

API

runPersonaConversation({ worker, persona, backendFor, systemPromptOf, maxTurns? }) — the loop runner. Profiles vs profiles: the persona is a driver AgentProfile (an LLM role-playing the user) or scripted turns (deterministic fast-path). Runs on the shipped conversation engine (turnOrder: 'alternate'). Only the worker is metered (the side under test); the persona-driver is the harness. Returns { transcript, turns, halted, costUsd, tokensIn, tokensOut }.
runPersonaDispatch(cfg) — thin ProfileDispatchFn wrapper for the matrix; meters the worker through ctx.cost so the backend-integrity guard sees real usage. One loop serves a single cell and the whole matrix.
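The loop described above can be pictured with a small, self-contained model. This is only a sketch under stated assumptions: `Speaker`, `scriptedPersona`, and `runPersonaLoopSketch` are hypothetical names invented for illustration, the loop is synchronous for brevity, and none of it is the package's real implementation (which runs on the shipped conversation engine). It shows the two properties the description stresses: the persona leads each round, and only the worker's usage is accumulated.

```typescript
// Simplified, synchronous model of the persona loop. All names and shapes below
// are illustrative assumptions, not the package's real types.
type Role = "persona" | "worker";

interface Message {
  role: Role;
  content: string;
}

interface Reply {
  content: string;
  costUsd: number;
  tokensIn: number;
  tokensOut: number;
}

// One side of the conversation: given the transcript so far, produce a reply.
type Speaker = (transcript: readonly Message[]) => Reply;

interface LoopResult {
  transcript: Message[];
  turns: number;
  costUsd: number;
  tokensIn: number;
  tokensOut: number;
}

// Deterministic fast-path: a persona that replays fixed user turns at zero cost.
function scriptedPersona(turns: readonly string[]): Speaker {
  return (transcript) => {
    // Derive the position from the transcript rather than from mutable state.
    const spoken = transcript.filter((m) => m.role === "persona").length;
    const content = turns[spoken];
    if (content === undefined) throw new Error(`no scripted turn at index ${spoken}`);
    return { content, costUsd: 0, tokensIn: 0, tokensOut: 0 };
  };
}

function runPersonaLoopSketch(persona: Speaker, worker: Speaker, maxTurns: number): LoopResult {
  if (maxTurns <= 0) throw new Error("maxTurns must be positive"); // fail loud on bad config
  const result: LoopResult = { transcript: [], turns: 0, costUsd: 0, tokensIn: 0, tokensOut: 0 };
  for (let round = 0; round < maxTurns; round++) {
    // The persona leads; its usage is harness cost and is deliberately not metered.
    const ask = persona(result.transcript);
    result.transcript.push({ role: "persona", content: ask.content });
    // The worker answers; only its usage is added to the totals.
    const answer = worker(result.transcript);
    result.transcript.push({ role: "worker", content: answer.content });
    result.costUsd += answer.costUsd;
    result.tokensIn += answer.tokensIn;
    result.tokensOut += answer.tokensOut;
    result.turns += 1;
  }
  return result;
}
```

In this model a persona that reports its own spend still leaves `costUsd` equal to the worker's spend alone, which is the worker-only metering contract in miniature.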

Why this shape

A persona is the user side (driver), not the agent's prompt; the worker's AgentProfile is what self-improvement optimizes; some personas are held out for generalization. The runner makes every fleet agent just (workerProfile, personaProfiles, judges) → one dispatch, one matrix, one gate.

Tests

6 tests: multi-round ordering (persona leads, worker answers each), worker-only metering from llm_call events, profile-prompt injection into the worker, fail-loud on empty/missing config. tsc clean on src; builds; exported from the package root. Built on createIterableBackend + defineConversation + runConversation — no new engine.

Scope

Eval/runtime primitive; no auth/billing/lifecycle surface. Pure addition.

The keystone that lets ANY AgentProfile be evaluated as a real multi-round conversation and drops into runProfileMatrix as its dispatch, replacing the per-agent hand-rolled 2-turn dispatchWithSurface bridges runProfileMatrix's own contract forbids.

- runPersonaConversation({ worker, persona, backendFor, systemPromptOf }): profiles-vs-profiles over the shipped conversation engine (turnOrder 'alternate'); persona is a driver AgentProfile (LLM user-sim) OR scripted turns (deterministic fast-path). Only the WORKER is metered (the side under test); persona-driver is the harness. Returns the persistent transcript + worker-only cost/tokens.
- runPersonaDispatch(cfg): thin ProfileDispatchFn wrapper for runProfileMatrix; meters the worker through ctx.cost so the matrix integrity guard sees real usage. Same loop serves one cell and the whole matrix.
- 6 tests (multi-round ordering, worker-only metering, prompt injection, fail-loud on empty/missing config). Built on createIterableBackend + defineConversation + runConversation; no new engine.

tangletools

✅ Auto-approved PR — `6f84aff4`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T00:42:12Z}

tangletools

🟡 Value Audit — sound-with-nits


Verdict	sound-with-nits
Concerns	6 (6 weak-concern)
Heuristic	0.0s
Duplication	0.0s
Interrogation	158.7s (2 bridge agents)
Total	158.7s

💰 Value — sound-with-nits

Adds a reusable persona-vs-worker multi-round conversation runner plus a runProfileMatrix dispatch adapter, both built on the existing defineConversation/runConversation engine rather than a new engine.

What it does: Introduces runPersonaConversation (src/conversation/run-persona.ts:130) that takes a worker AgentProfile, a PersonaDriver (either scripted user turns or another driver AgentProfile), and runs them as a two-party conversation where the persona leads and the worker answers, using the shipped defineConversation + runConversation primitives (src/conversation/run-persona.ts:160-172). On
Goals it achieves: The change turns every fleet agent eval into (workerProfile, personaProfiles, judges) → one dispatch, one matrix, one gate, eliminating per-agent hand-rolled 2-turn dispatchWithSurface bridges. It gives eval/self-improvement a single, reusable multi-round conversation harness that meters only the side under test so the matrix backend-integrity guard sees real usage.
Assessment: The change is coherent and in-grain. It reuses the stable conversation engine (src/conversation/run-conversation.ts, src/conversation/define-conversation.ts) and follows the same eval-dispatch adapter pattern already established by loopDispatch (src/runtime/loop-dispatch.ts). The tests cover multi-round ordering, worker-only metering, profile-prompt injection, and fail-loud validation. The
Better / existing approach: none — this is the right approach. I searched for an existing conversation-to-campaign dispatch adapter (Grep for ProfileDispatchFn, runProfileMatrix, dispatchWithSurface across src/) and found only runtime/loop-dispatch.ts (src/runtime/loop-dispatch.ts:114) for runLoop; no conversation/persona runner existed. The runtime personify layer (src/runtime/personify/types.ts, `src/ru

🎯 Usefulness — sound-with-nits

A coherent, well-placed bridge that adapts the shipped conversation engine to agent-eval's runProfileMatrix, mirroring the existing loop-dispatch pattern and directly addressing the hand-rolled dispatch bridge problem documented in agent-eval itself.

Integration: Wires correctly and is reachable. The new exports surface through src/conversation/index.ts:55-62 and the package root at src/index.ts:81-89. No in-repo caller is present yet, but the PR explicitly states the next step is rewiring downstream tax/legal matrix builders, and the adapter's return type matches the agent-eval ProfileDispatchFn contract exactly (agent-eval/campaign/index.d.ts:944). It is
Fit with existing patterns: Fits the codebase's grain. It builds on the existing defineConversation/runConversation engine (src/conversation/run-conversation.ts, define-conversation.ts) rather than introducing a new engine, and follows the same adapter pattern as loop-dispatch. The agent-eval runProfileMatrix docstring (agent-eval/campaign/index.d.ts:906-927) describes the exact anti-pattern this PR targets, confirming it so
Real-world viability: Will hold up on the happy path and most error paths because it delegates to runConversation, which already handles abort signals, per-turn retries/circuit-breakers, and credit ceilings. The worker-only metering works when the worker backend emits llm_call events. Two realistic rough edges: (1) runPersonaDispatch does not bridge conversation turn events into ctx.trace, so a matrix run will report c

💰 Value Audit

🟡 Conversation events are not forwarded to the campaign trace [maintenance]

runPersonaDispatch reports cost/tokens to ctx.cost but does not forward conversation stream events into ctx.trace, unlike loopDispatch which maps loop trace events into campaign spans (src/runtime/loop-dispatch.ts:78-86 and :100). For matrices that inspect per-cell traces, this leaves a gap. A better finish would run runConversationStream with onEvent and emit spans under ctx.trace.
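One way to picture the suggested bridge is an event handler that turns conversation events into trace spans. The sketch below is an assumption-heavy illustration: the event names `turn_start`, `turn_end`, and `conversation_end` come from the audit text, but the event payloads, the `SpanSink` interface, and `makeTraceBridge` are hypothetical and do not describe the real `ctx.trace` or `runConversationStream` signatures.

```typescript
// Hypothetical event and sink shapes; the real engine's payloads and the real
// ctx.trace API may differ.
type ConversationEvent =
  | { type: "turn_start"; turn: number; speaker: string }
  | { type: "turn_end"; turn: number; speaker: string }
  | { type: "conversation_end"; turns: number };

interface SpanRecord {
  name: string;
  attrs: Record<string, string | number>;
}

interface SpanSink {
  record(span: SpanRecord): void;
}

// Returns an onEvent-style handler that writes one span per finished turn and
// one closing span for the whole conversation.
function makeTraceBridge(sink: SpanSink): (event: ConversationEvent) => void {
  return (event) => {
    switch (event.type) {
      case "turn_start":
        // Nothing to emit yet; a span is written when the turn ends.
        break;
      case "turn_end":
        sink.record({
          name: "conversation.turn",
          attrs: { turn: event.turn, speaker: event.speaker },
        });
        break;
      case "conversation_end":
        sink.record({ name: "conversation.end", attrs: { turns: event.turns } });
        break;
    }
  };
}
```

A dispatch wrapper could pass such a handler wherever the engine accepts an event callback, so a matrix cell would carry the conversation structure alongside its spend.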

🟡 Cost fallback can include persona spend [maintenance]

runPersonaConversation returns counter.costUsd > 0 ? counter.costUsd : result.spentCreditsCents / 100 (src/conversation/run-persona.ts:177). The counter captures the metered worker, but the fallback uses the conversation-wide spentCreditsCents, which for a profile-kind persona includes the persona driver's own llm_call events. That contradicts the documented worker-only metering contract when the worker backend emits no llm_call. A safer fallback would sum transcript usage for th

🟡 Scripted persona backend is not retry-safe [maintenance]

scriptedPersonaBackend advances a mutable idx on every stream() call (src/conversation/run-persona.ts:102-123). Because runConversation can retry a failed turn via callPolicy, a retry would advance to the next scripted turn instead of replaying the current one. If retries are ever enabled for the harness, derive the turn from context/turn index rather than mutating state.
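The difference between the two approaches can be shown with a toy example. This is a sketch, not the code in run-persona.ts: `TurnContext`, `statefulScript`, and `indexedScript` are hypothetical names, and the real backend streams events rather than returning strings. The stateful version advances on every call, so a retried call skips a turn; the indexed version replays the same turn for the same index.

```typescript
// Hypothetical context shape: assumes the engine can tell the backend which
// turn it is asking for.
interface TurnContext {
  turnIndex: number;
}

// Not retry-safe: every call advances the cursor, including retried calls.
function statefulScript(turns: readonly string[]): () => string {
  let idx = 0;
  return () => {
    const turn = turns[idx];
    if (turn === undefined) throw new Error("script exhausted");
    idx += 1;
    return turn;
  };
}

// Retry-safe: the turn is a pure function of the requested index.
function indexedScript(turns: readonly string[]): (ctx: TurnContext) => string {
  return (ctx) => {
    const turn = turns[ctx.turnIndex];
    if (turn === undefined) throw new Error(`no scripted turn at index ${ctx.turnIndex}`);
    return turn;
  };
}
```

With the indexed form, a retry policy can call the backend any number of times for turn 0 and always get the first scripted line back.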

🟡 Term 'persona' overlaps with the runtime personify layer [duplication]

The codebase already exports a Persona type from src/runtime/personify/types.ts:67 (@tangle-network/agent-runtime/runtime). The new PersonaDriver (src/conversation/run-persona.ts:32) uses the same word for a different concept (eval user-sim vs. loop content seam). The types do not collide, but consider a qualifier like EvalPersona or UserPersona to avoid confusing consumers.

🎯 Usefulness Audit

🟡 No trace bridge from conversation events to campaign trace [robustness]

loop-dispatch forwards runLoop trace events into ctx.trace so matrix cells are observable (tests/loops/loop-dispatch.test.ts:138-139). runPersonaDispatch reports cost/tokens but never forwards runConversation's turn_start/turn_end/conversation_end events. For production eval debugging, a matrix cell will show spend without the conversation structure that produced it. Consider adding an onEvent handler that writes conversation spans to ctx.trace, or document that conversation telemetry lives outs

🟡 costUsd fallback can fold persona spend into worker-only result [robustness]

run-persona.ts:177 returns counter.costUsd > 0 ? counter.costUsd : result.spentCreditsCents / 100. spentCreditsCents is aggregated across all participants by runConversation. If a profile-kind persona LLM emits cost but the worker backend emits only tokens (e.g., createOpenAICompatibleBackend emits usage but leaves costUsd undefined), the fallback silently includes persona spend in the returned worker cost. Since the design intent is worker-only metering, consider falling back only when no l

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260614T004744Z}

tangletools · 2026-06-14T00:51:06Z

✅ No Blockers — `6f84aff4`

Readiness 69/100 · Confidence 70/100 · 7 findings (3 medium, 4 low)

	deepseek	glm	aggregate
Readiness	79	69	69
Confidence	70	70	70
Correctness	79	69	69
Security	79	69	69
Testing	79	69	69
Architecture	79	69	69

Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Profile-driven persona execution path has zero test coverage — src/conversation/run-persona.test.ts

The module's stated purpose is 'profiles-vs-profiles' (lines 6-10: 'the persona is itself a driver AgentProfile — an LLM role-playing the user'). All 5 runPersonaConversation tests use scripted personas. The profile-kind persona is tested ONLY for its error throw ('requires maxTurns', line 109). The actual execution path (run-persona.ts:149-158 — withProfilePrompt applied to the persona without a counter, persona system-prompt injection, turn alternation with variable-length persona responses) has no test. A profil

🟠 MEDIUM costUsd fallback double-counts harness (persona) credits when worker emits no llm_call events — src/conversation/run-persona.ts

Line 177: costUsd: counter.costUsd > 0 ? counter.costUsd : result.spentCreditsCents / 100. When the worker backend produces no llm_call events (counter remains 0), the fallback reads ConversationResult.spentCreditsCents which is the TOTAL conversation credit spend — including the persona's LLM calls when persona.kind === 'profile'. This inflates the worker's reported cost by including harness/infra costs. For scripted personas the fallback is correct (no persona LLM spend). For profile-driven personas this over-reports worker cost, which propagates through runPersonaDispatch into ctx.cost.observe() ([line 222](https://github.com/tangle-network/ag

🟠 MEDIUM costUsd fallback leaks persona spend into worker metering for profile-driven personas — src/conversation/run-persona.ts

Line 177: costUsd: counter.costUsd > 0 ? counter.costUsd : result.spentCreditsCents / 100. result.spentCreditsCents is the aggregate across ALL participants (run-conversation.ts:100,333 sums every participant's llm_call.costUsd). The module docstring (lines 8-10) states 'Only the WORKER is metered', and PersonaConversationResult.costUsd is documented as 'Worker-only spend'. For a profile-driven persona ([lines 149-153](https://github.com/tangle-network/agent-runtime/blob/6f84aff468d9229ef9c0da03802b45365241a43b/src/conve

🟡 LOW No happy-path test for profile-driven persona (kind: 'profile') — src/conversation/run-persona.test.ts

Tests cover the error case (maxTurns rejection at line 109-117) but never exercise a successful run with persona.kind === 'profile'. The withProfilePrompt path for personas (line 150-153), backendFor called with role 'persona', and the profile-driven conversation loop are entirely untested. This leaves the LLM-driven persona path — the primary use case that differentiates this from a simple scripted-runner — without any integration-level confidence.

🟡 LOW Test stub fakeCtx().cost.observe has wrong arity vs real UsageSink contract — src/conversation/run-persona.test.ts

Line 56: observe(usd: number) takes one arg, but the real UsageSink.observe(amountUsd, source) takes two (report-usage.ts:25). Production calls ctx.cost.observe(result.costUsd, 'persona-conversation') (run-persona.ts:222) with the source label. The as unknown as DispatchContext cast (line 64) silences the type error, so the test never verifies the source parameter. The test passes because JS ignores extra args, but the contract is undertested. Fix: observe(usd: number, _source?: string) in the stub.
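A stub that matches the two-argument contract and records the source label might look like the following. This is a sketch only: `UsageSinkLike` and `recordingSink` are hypothetical names, and the interface is modeled on the `observe(amountUsd, source)` arity quoted in the finding rather than copied from report-usage.ts.

```typescript
// Modeled on the arity described in the finding; the real UsageSink type may
// carry more members.
interface UsageSinkLike {
  observe(amountUsd: number, source: string): void;
}

interface ObservedUsage {
  amountUsd: number;
  source: string;
}

// Test double that keeps every call so assertions can check the source label,
// not just the amount.
function recordingSink(): UsageSinkLike & { calls: ObservedUsage[] } {
  const calls: ObservedUsage[] = [];
  return {
    calls,
    observe(amountUsd: number, source: string): void {
      calls.push({ amountUsd, source });
    },
  };
}
```

A test can then assert that the dispatch reported its usage under the `'persona-conversation'` label instead of relying on JavaScript silently dropping the extra argument.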

🟡 LOW spentCreditsCents fallback path has zero test coverage — src/conversation/run-persona.test.ts

The cost fallback at line 177 (counter.costUsd > 0 ? ... : result.spentCreditsCents / 100) is never exercised. The fakeWorker always emits llm_call events with costUsd: 0.02, so counter.costUsd is always > 0 in every test. A test using a worker that emits text_delta but no llm_call would verify the fallback works and reveal the double-counting issue (finding #1).

🟡 LOW withProfilePrompt prepends system prompt in stream but not in start/resume — src/conversation/run-persona.ts

Lines 81-83: start/resume/stop pass input through unchanged — only stream (line 89) injects the system prompt. run-conversation.ts:412-423 calls start with the same backendInput that stream receives, so start sees messages WITHOUT the system prompt while stream sees them WITH it. Current backends (createIterableBackend identity, createSandboxPromptBackend, createOpenAICompatibleBackend) don't inspect messages in start, so this is latent. But a future backend that pre-fills a session from input.messages in start would miss

_{tangletools · 2026-06-14T00:51:04Z · trace}

…profile path

Addresses the two MEDIUM review findings on the persona loop runner:

- costUsd no longer falls back to the conversation's aggregate spend for a profile-driven persona (that includes the persona-driver's spend → over-counts the worker). Aggregate fallback now applies ONLY to scripted personas (no LLM harness cost); profile personas report the worker's metered spend.
- adds end-to-end coverage of the profile-driven persona path (previously only its error throw was tested): multi-round execution, per-side prompt injection (worker vs persona each get their own profile prompt), and worker-only metering proving the persona-driver's spend never leaks into the worker's usage.
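The cost rule described in this commit message can be sketched as a small pure function. The names `PersonaKind` and `workerCostUsd` and the argument shape are assumptions made for illustration; only the rule itself (aggregate fallback for scripted personas only, division of `spentCreditsCents` by 100) comes from the text above and the review findings.

```typescript
type PersonaKind = "scripted" | "profile";

interface CostInputs {
  personaKind: PersonaKind;
  // Spend summed from the worker's own llm_call events.
  meteredWorkerUsd: number;
  // Conversation-wide spend across ALL participants, in cents.
  spentCreditsCents: number;
}

function workerCostUsd(inputs: CostInputs): number {
  // Prefer the worker's own metered spend whenever it exists.
  if (inputs.meteredWorkerUsd > 0) return inputs.meteredWorkerUsd;
  // A scripted persona spends nothing, so the aggregate is worker-only.
  if (inputs.personaKind === "scripted") return inputs.spentCreditsCents / 100;
  // A profile persona's spend is inside the aggregate, so do not fall back.
  return inputs.meteredWorkerUsd;
}
```

For a profile-driven persona whose worker emits no cost, this returns 0 rather than attributing the persona-driver's spend to the worker.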

tangletools

✅ Auto-approved PR — `795d40ca`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T01:02:34Z}

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	0 (none)
Heuristic	0.0s
Duplication	0.0s
Interrogation	30.0s (2 bridge agents)
Total	30.0s

No concerns — sound change, no better or existing approach found. ✅

_{value-audit · 20260614T010445Z}

- feat(conversation): runPersonaConversation + runPersonaDispatch, the persona loop runner; any AgentProfile evaluated as a multi-round conversation, drops into runProfileMatrix as dispatch (#282)
- feat(personify): connect the dormant analyst→steer wire + registryScopeAnalyst (#284)
- feat(skills): build-with-agent-runtime canonical spine (#285)
- fix(tool-loop): strict-model tool-call history (#286)
- docs: canonical-api.md API reference (#283)

…rsation/runConversation) (#289) Another session shipped src/conversation/ (#282): runPersonaConversation (worker under test ⟷ simulated-user persona driver, worker-only metered, drops into runProfileMatrix via runPersonaDispatch) + runConversation (two profiles head-to-head). These are a DIFFERENT layer from runPersonified/loopUntil (recursive-atom execution) — the doc now distinguishes them and stops listing 'runConversation' as a thing-to-avoid (it's a shipped canonical primitive). Adds the decision-table rows + a §3.1 subsection.

tangletools previously approved these changes Jun 14, 2026

tangletools reviewed Jun 14, 2026

drewstone dismissed tangletools’s stale review via 795d40c June 14, 2026 01:02

tangletools approved these changes Jun 14, 2026

tangletools reviewed Jun 14, 2026

drewstone merged commit 19ba908 into main Jun 14, 2026
1 check passed

drewstone mentioned this pull request Jun 14, 2026

docs(canonical-api): add the conversation/eval layer (runPersonaConversation) #289

Merged

drewstone merged 2 commits into main from feat/run-persona-loop-runner
