feat(conversation): runPersonaConversation — the persona loop runner (kills hand-rolled eval dispatch)#282

Merged: drewstone merged 2 commits into main from feat/run-persona-loop-runner on Jun 14, 2026

Conversation

@drewstone

Contributor

What

The keystone for fleet-wide eval + self-improvement: a persona loop runner that runs any worker AgentProfile as a real multi-round conversation against a persona, over the persistent transcript — and drops straight into runProfileMatrix as its dispatch.

This replaces the per-agent hand-rolled 2-turn dispatchWithSurface bridges (tax + legal each copy-paste one) that runProfileMatrix's own docstring calls "exactly the pile of bespoke eval scripts the adoption skills keep trying (and failing) to forbid."

API

  • runPersonaConversation({ worker, persona, backendFor, systemPromptOf, maxTurns? }) — the loop runner. Profiles vs profiles: the persona is a driver AgentProfile (an LLM role-playing the user) or scripted turns (deterministic fast-path). Runs on the shipped conversation engine (turnOrder: 'alternate'). Only the worker is metered (the side under test); the persona-driver is the harness. Returns { transcript, turns, halted, costUsd, tokensIn, tokensOut }.
  • runPersonaDispatch(cfg) — thin ProfileDispatchFn wrapper for the matrix; meters the worker through ctx.cost so the backend-integrity guard sees real usage. One loop serves a single cell and the whole matrix.
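
The loop described above can be sketched in a few lines. This is a toy stand-in, not the package's implementation: `runPersonaSketch`, `Responder`, and the exact meaning given to `turns` and `halted` are assumptions made for illustration. It only shows the shape of the contract: the persona leads, the worker answers each persona turn, and only the worker's usage is accumulated.

```typescript
// Hypothetical local types for illustration; the real package types may differ.
type Role = 'persona' | 'worker';
interface Turn { role: Role; text: string }
interface Reply { text: string; costUsd: number; tokensIn: number; tokensOut: number }
type Responder = (transcript: readonly Turn[]) => Reply;

interface PersonaRunResult {
  transcript: Turn[];
  turns: number;    // assumption: number of completed persona/worker rounds
  halted: boolean;  // assumption: true when the persona stopped before maxTurns
  costUsd: number;
  tokensIn: number;
  tokensOut: number;
}

function runPersonaSketch(persona: Responder, worker: Responder, maxTurns: number): PersonaRunResult {
  // Fail loud on a missing or nonsensical turn budget.
  if (!Number.isInteger(maxTurns) || maxTurns <= 0) {
    throw new Error('maxTurns must be a positive integer');
  }
  const transcript: Turn[] = [];
  let turns = 0;
  let halted = false;
  let costUsd = 0;
  let tokensIn = 0;
  let tokensOut = 0;
  while (turns < maxTurns) {
    const ask = persona(transcript); // harness side: usage is ignored
    if (ask.text === '') {
      halted = true; // persona has nothing more to say
      break;
    }
    transcript.push({ role: 'persona', text: ask.text });
    const answer = worker(transcript); // side under test: usage is metered
    transcript.push({ role: 'worker', text: answer.text });
    costUsd += answer.costUsd;
    tokensIn += answer.tokensIn;
    tokensOut += answer.tokensOut;
    turns += 1;
  }
  return { transcript, turns, halted, costUsd, tokensIn, tokensOut };
}

// Scripted persona: picks its line from how many rounds are already recorded.
const script = ['How do I file?', 'What is the deadline?'];
const scriptedPersona: Responder = (t) => ({
  text: script[t.length / 2] ?? '',
  costUsd: 1, // deliberately large, to show it never reaches the result
  tokensIn: 100,
  tokensOut: 100,
});
const echoWorker: Responder = (t) => ({
  text: `answer to: ${t[t.length - 1]?.text ?? ''}`,
  costUsd: 0.25,
  tokensIn: 10,
  tokensOut: 5,
});

const personaRun = runPersonaSketch(scriptedPersona, echoWorker, 3);
```

With a two-line script and a budget of three rounds, the sketch stops after the second round and reports only the worker's spend.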

Why this shape

A persona is the user side (driver), not the agent's prompt; the worker's AgentProfile is what self-improvement optimizes; some personas are held out for generalization. The runner makes every fleet agent just (workerProfile, personaProfiles, judges) → one dispatch, one matrix, one gate.

Tests

6 tests: multi-round ordering (persona leads, worker answers each), worker-only metering from llm_call events, profile-prompt injection into the worker, fail-loud on empty/missing config. tsc clean on src; builds; exported from the package root. Built on createIterableBackend + defineConversation + runConversation — no new engine.

Next

Rewire tax's buildMatrix/dispatchWithSurface onto runProfileMatrix({ dispatch: runPersonaDispatch(...) }) (parity proof), then the rest of the fleet, then layer the surface drivers (skills/memory/rag/tools) on the clean base.

Scope

Eval/runtime primitive; no auth/billing/lifecycle surface. Pure addition.

The keystone that lets ANY AgentProfile be evaluated as a real multi-round
conversation and drops into runProfileMatrix as its dispatch — replacing the
per-agent hand-rolled 2-turn dispatchWithSurface bridges runProfileMatrix's
own contract forbids.

- runPersonaConversation({ worker, persona, backendFor, systemPromptOf }):
  profiles-vs-profiles over the shipped conversation engine (turnOrder
  'alternate'); persona is a driver AgentProfile (LLM user-sim) OR scripted
  turns (deterministic fast-path). Only the WORKER is metered (the side under
  test); persona-driver is the harness. Returns the persistent transcript +
  worker-only cost/tokens.
- runPersonaDispatch(cfg): thin ProfileDispatchFn wrapper for runProfileMatrix
  — meters the worker through ctx.cost so the matrix integrity guard sees real
  usage. Same loop serves one cell and the whole matrix.
- 6 tests (multi-round ordering, worker-only metering, prompt injection,
  fail-loud on empty/missing config). Built on createIterableBackend +
  defineConversation + runConversation — no new engine.
tangletools previously approved these changes Jun 14, 2026

@tangletools tangletools left a comment

Contributor

✅ Auto-approved PR — 6f84aff4

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T00:42:12Z

@tangletools tangletools left a comment

Contributor

🟡 Value Audit — sound-with-nits

| Metric | Value |
| --- | --- |
| Verdict | sound-with-nits |
| Concerns | 6 (6 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 158.7s (2 bridge agents) |
| Total | 158.7s |

💰 Value — sound-with-nits

Adds a reusable persona-vs-worker multi-round conversation runner plus a runProfileMatrix dispatch adapter, both built on the existing defineConversation/runConversation engine rather than a new engine.

  • What it does: Introduces runPersonaConversation (src/conversation/run-persona.ts:130) that takes a worker AgentProfile, a PersonaDriver (either scripted user turns or another driver AgentProfile), and runs them as a two-party conversation where the persona leads and the worker answers, using the shipped defineConversation + runConversation primitives (src/conversation/run-persona.ts:160-172). On
  • Goals it achieves: The change turns every fleet agent eval into (workerProfile, personaProfiles, judges) → one dispatch, one matrix, one gate, eliminating per-agent hand-rolled 2-turn dispatchWithSurface bridges. It gives eval/self-improvement a single, reusable multi-round conversation harness that meters only the side under test so the matrix backend-integrity guard sees real usage.
  • Assessment: The change is coherent and in-grain. It reuses the stable conversation engine (src/conversation/run-conversation.ts, src/conversation/define-conversation.ts) and follows the same eval-dispatch adapter pattern already established by loopDispatch (src/runtime/loop-dispatch.ts). The tests cover multi-round ordering, worker-only metering, profile-prompt injection, and fail-loud validation. The
  • Better / existing approach: none — this is the right approach. I searched for an existing conversation-to-campaign dispatch adapter (Grep for ProfileDispatchFn, runProfileMatrix, dispatchWithSurface across src/) and found only runtime/loop-dispatch.ts (src/runtime/loop-dispatch.ts:114) for runLoop; no conversation/persona runner existed. The runtime personify layer (src/runtime/personify/types.ts, `src/ru

🎯 Usefulness — sound-with-nits

A coherent, well-placed bridge that adapts the shipped conversation engine to agent-eval's runProfileMatrix, mirroring the existing loop-dispatch pattern and directly addressing the hand-rolled dispatch bridge problem documented in agent-eval itself.

  • Integration: Wires correctly and is reachable. The new exports surface through src/conversation/index.ts:55-62 and the package root at src/index.ts:81-89. No in-repo caller is present yet, but the PR explicitly states the next step is rewiring downstream tax/legal matrix builders, and the adapter's return type matches the agent-eval ProfileDispatchFn contract exactly (agent-eval/campaign/index.d.ts:944). It is
  • Fit with existing patterns: Fits the codebase's grain. It builds on the existing defineConversation/runConversation engine (src/conversation/run-conversation.ts, define-conversation.ts) rather than introducing a new engine, and follows the same adapter pattern as loop-dispatch. The agent-eval runProfileMatrix docstring (agent-eval/campaign/index.d.ts:906-927) describes the exact anti-pattern this PR targets, confirming it so
  • Real-world viability: Will hold up on the happy path and most error paths because it delegates to runConversation, which already handles abort signals, per-turn retries/circuit-breakers, and credit ceilings. The worker-only metering works when the worker backend emits llm_call events. Two realistic rough edges: (1) runPersonaDispatch does not bridge conversation turn events into ctx.trace, so a matrix run will report c

💰 Value Audit

🟡 Conversation events are not forwarded to the campaign trace [maintenance]

runPersonaDispatch reports cost/tokens to ctx.cost but does not forward conversation stream events into ctx.trace, unlike loopDispatch which maps loop trace events into campaign spans (src/runtime/loop-dispatch.ts:78-86 and :100). For matrices that inspect per-cell traces, this leaves a gap. A better finish would run runConversationStream with onEvent and emit spans under ctx.trace.
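
One way to picture the suggested bridge is an `onEvent` handler that turns conversation events into spans. The sketch below is hypothetical: the event shapes, the `Span` type, and `bridgeToTrace` are made up for illustration (only the event names `turn_start`, `turn_end`, and `conversation_end` come from this review), and the real `ctx.trace` interface may look quite different.

```typescript
// Hypothetical event and span shapes; not the package's actual types.
type ConversationEvent =
  | { type: 'turn_start'; turn: number; speaker: string }
  | { type: 'turn_end'; turn: number; speaker: string }
  | { type: 'conversation_end'; turns: number };

interface Span {
  name: string;
  attrs: Record<string, string | number>;
}

// Returns an onEvent-style handler that forwards each event to `emit` as a span.
function bridgeToTrace(emit: (span: Span) => void): (event: ConversationEvent) => void {
  return (event) => {
    switch (event.type) {
      case 'turn_start':
      case 'turn_end':
        emit({
          name: `conversation.${event.type}`,
          attrs: { turn: event.turn, speaker: event.speaker },
        });
        break;
      case 'conversation_end':
        emit({ name: 'conversation.conversation_end', attrs: { turns: event.turns } });
        break;
    }
  };
}

// Usage: collect spans in memory, as a stand-in for a campaign trace sink.
const collectedSpans: Span[] = [];
const onConversationEvent = bridgeToTrace((span) => collectedSpans.push(span));
onConversationEvent({ type: 'turn_start', turn: 0, speaker: 'persona' });
onConversationEvent({ type: 'turn_end', turn: 0, speaker: 'persona' });
onConversationEvent({ type: 'conversation_end', turns: 1 });
```

Keeping the bridge as a pure mapping function means it can be unit-tested without a live conversation or a real trace backend.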

🟡 Cost fallback can include persona spend [maintenance]

runPersonaConversation returns counter.costUsd > 0 ? counter.costUsd : result.spentCreditsCents / 100 (src/conversation/run-persona.ts:177). The counter captures the metered worker, but the fallback uses the conversation-wide spentCreditsCents, which for a profile-kind persona includes the persona driver's own llm_call events. That contradicts the documented worker-only metering contract when the worker backend emits no llm_call. A safer fallback would sum transcript usage for th

🟡 Scripted persona backend is not retry-safe [maintenance]

scriptedPersonaBackend advances a mutable idx on every stream() call (src/conversation/run-persona.ts:102-123). Because runConversation can retry a failed turn via callPolicy, a retry would advance to the next scripted turn instead of replaying the current one. If retries are ever enabled for the harness, derive the turn from context/turn index rather than mutating state.
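
The difference between the two approaches can be shown with a small sketch. Both helpers below are hypothetical stand-ins, not the code in `run-persona.ts`: `makeStatefulPersona` mimics a backend that advances a mutable index on every call, and `scriptedTurnAt` derives the line from an explicit turn index, so a retried call replays the same line.

```typescript
// Stateful variant: each call consumes the next scripted line, so a retry
// of a failed turn silently skips ahead.
function makeStatefulPersona(script: readonly string[]): () => string | undefined {
  let idx = 0;
  return () => script[idx++];
}

// Index-derived variant: the caller passes the persona turn index, so calling
// it twice for the same turn (for example on retry) yields the same line.
function scriptedTurnAt(script: readonly string[], personaTurnIndex: number): string {
  const line = script[personaTurnIndex];
  if (line === undefined) {
    throw new Error(`no scripted turn at index ${personaTurnIndex}`);
  }
  return line;
}

const demoScript = ['first question', 'second question'];

// Simulate "turn 0 fails once and is retried".
const stateful = makeStatefulPersona(demoScript);
const statefulFirstAttempt = stateful();
const statefulRetry = stateful();

const derivedFirstAttempt = scriptedTurnAt(demoScript, 0);
const derivedRetry = scriptedTurnAt(demoScript, 0);
```

In the stateful variant the retry returns the second scripted line; in the index-derived variant both attempts return the first.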

🟡 Term 'persona' overlaps with the runtime personify layer [duplication]

The codebase already exports a Persona type from src/runtime/personify/types.ts:67 (@tangle-network/agent-runtime/runtime). The new PersonaDriver (src/conversation/run-persona.ts:32) uses the same word for a different concept (eval user-sim vs. loop content seam). The types do not collide, but consider a qualifier like EvalPersona or UserPersona to avoid confusing consumers.

🎯 Usefulness Audit

🟡 No trace bridge from conversation events to campaign trace [robustness]

loop-dispatch forwards runLoop trace events into ctx.trace so matrix cells are observable (tests/loops/loop-dispatch.test.ts:138-139). runPersonaDispatch reports cost/tokens but never forwards runConversation's turn_start/turn_end/conversation_end events. For production eval debugging, a matrix cell will show spend without the conversation structure that produced it. Consider adding an onEvent handler that writes conversation spans to ctx.trace, or document that conversation telemetry lives outs

🟡 costUsd fallback can fold persona spend into worker-only result [robustness]

run-persona.ts:177 returns counter.costUsd > 0 ? counter.costUsd : result.spentCreditsCents / 100. spentCreditsCents is aggregated across all participants by runConversation. If a profile-kind persona LLM emits cost but the worker backend emits only tokens (e.g., createOpenAICompatibleBackend emits usage but leaves costUsd undefined), the fallback silently includes persona spend in the returned worker cost. Since the design intent is worker-only metering, consider falling back only when no l


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260614T004744Z

@tangletools

Contributor

✅ No Blockers — 6f84aff4

Readiness 69/100 · Confidence 70/100 · 7 findings (3 medium, 4 low)

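| Metric | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 79 | 69 | 69 |
| Confidence | 70 | 70 | 70 |
| Correctness | 79 | 69 | 69 |
| Security | 79 | 69 | 69 |
| Testing | 79 | 69 | 69 |
| Architecture | 79 | 69 | 69 |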

Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Profile-driven persona execution path has zero test coverage — src/conversation/run-persona.test.ts

The module's stated purpose is 'profiles-vs-profiles' (lines 6-10: 'the persona is itself a driver AgentProfile — an LLM role-playing the user'). All 5 runPersonaConversation tests use scripted personas. The profile-kind persona is tested ONLY for its error throw ('requires maxTurns', line 109). The actual execution path (run-persona.ts:149-158 — withProfilePrompt applied to the persona without a counter, persona system-prompt injection, turn alternation with variable-length persona responses) has no test. A profil

🟠 MEDIUM costUsd fallback double-counts harness (persona) credits when worker emits no llm_call events — src/conversation/run-persona.ts

Line 177: costUsd: counter.costUsd > 0 ? counter.costUsd : result.spentCreditsCents / 100. When the worker backend produces no llm_call events (counter remains 0), the fallback reads ConversationResult.spentCreditsCents which is the TOTAL conversation credit spend — including the persona's LLM calls when persona.kind === 'profile'. This inflates the worker's reported cost by including harness/infra costs. For scripted personas the fallback is correct (no persona LLM spend). For profile-driven personas this over-reports worker cost, which propagates through runPersonaDispatch into ctx.cost.observe() ([line 222](https://github.com/tangle-network/ag

🟠 MEDIUM costUsd fallback leaks persona spend into worker metering for profile-driven personas — src/conversation/run-persona.ts

Line 177: costUsd: counter.costUsd > 0 ? counter.costUsd : result.spentCreditsCents / 100. result.spentCreditsCents is the aggregate across ALL participants (run-conversation.ts:100,333 sums every participant's llm_call.costUsd). The module docstring (lines 8-10) states 'Only the WORKER is metered', and PersonaConversationResult.costUsd is documented as 'Worker-only spend'. For a profile-driven persona ([lines 149-153](https://github.com/tangle-network/agent-runtime/blob/6f84aff468d9229ef9c0da03802b45365241a43b/src/conve

🟡 LOW No happy-path test for profile-driven persona (kind: 'profile') — src/conversation/run-persona.test.ts

Tests cover the error case (maxTurns rejection at line 109-117) but never exercise a successful run with persona.kind === 'profile'. The withProfilePrompt path for personas (line 150-153), backendFor called with role 'persona', and the profile-driven conversation loop are entirely untested. This leaves the LLM-driven persona path — the primary use case that differentiates this from a simple scripted-runner — without any integration-level confidence.

🟡 LOW Test stub fakeCtx().cost.observe has wrong arity vs real UsageSink contract — src/conversation/run-persona.test.ts

Line 56: observe(usd: number) takes one arg, but the real UsageSink.observe(amountUsd, source) takes two (report-usage.ts:25). Production calls ctx.cost.observe(result.costUsd, 'persona-conversation') (run-persona.ts:222) with the source label. The as unknown as DispatchContext cast (line 64) silences the type error, so the test never verifies the source parameter. The test passes because JS ignores extra args, but the contract is undertested. Fix: observe(usd: number, _source?: string) in the stub.
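
A stub that records both arguments would let the test assert on the source label. The sketch below is an illustration only: `UsageSinkLike` is a local, hypothetical copy of the two-argument shape quoted in this finding, and `reportWorkerCost` stands in for the production call site.

```typescript
// Local stand-in for the two-argument sink shape described in the finding.
interface UsageSinkLike {
  observe(amountUsd: number, source: string): void;
}

interface ObservedCall {
  amountUsd: number;
  source: string;
}

// Test double that records every observe() call, including the source label.
function makeRecordingSink(): { sink: UsageSinkLike; calls: ObservedCall[] } {
  const calls: ObservedCall[] = [];
  const sink: UsageSinkLike = {
    observe(amountUsd, source) {
      calls.push({ amountUsd, source });
    },
  };
  return { sink, calls };
}

// Stand-in for the production call site quoted above.
function reportWorkerCost(sink: UsageSinkLike, costUsd: number): void {
  sink.observe(costUsd, 'persona-conversation');
}

const recording = makeRecordingSink();
reportWorkerCost(recording.sink, 0.25);
```

Because the stub is typed against the full interface, no `as unknown as` cast is needed and an arity mismatch would surface at compile time.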

🟡 LOW spentCreditsCents fallback path has zero test coverage — src/conversation/run-persona.test.ts

The cost fallback at line 177 (counter.costUsd > 0 ? ... : result.spentCreditsCents / 100) is never exercised. The fakeWorker always emits llm_call events with costUsd: 0.02, so counter.costUsd is always > 0 in every test. A test using a worker that emits text_delta but no llm_call would verify the fallback works and reveal the double-counting issue (finding #1).

🟡 LOW withProfilePrompt prepends system prompt in stream but not in start/resume — src/conversation/run-persona.ts

Lines 81-83: start/resume/stop pass input through unchanged — only stream (line 89) injects the system prompt. run-conversation.ts:412-423 calls start with the same backendInput that stream receives, so start sees messages WITHOUT the system prompt while stream sees them WITH it. Current backends (createIterableBackend identity, createSandboxPromptBackend, createOpenAICompatibleBackend) don't inspect messages in start, so this is latent. But a future backend that pre-fills a session from input.messages in start would miss
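
One way to close this kind of gap is to route every entry point through the same prompt-injecting helper. The sketch below uses made-up, simplified types (`BackendLike`, `BackendInput`, synchronous methods) rather than the package's real backend interface, so treat it as an illustration of the idea only.

```typescript
// Simplified, hypothetical shapes; the real backend interface is richer.
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}
interface BackendInput {
  messages: Message[];
}
interface BackendLike {
  start(input: BackendInput): void;
  stream(input: BackendInput): string[];
}

// Single helper so that every entry point sees identical messages.
function prependSystemPrompt(input: BackendInput, prompt: string): BackendInput {
  return { ...input, messages: [{ role: 'system', content: prompt }, ...input.messages] };
}

// Wrapper that applies the helper in start as well as stream.
function withPromptEverywhere(inner: BackendLike, prompt: string): BackendLike {
  return {
    start: (input) => inner.start(prependSystemPrompt(input, prompt)),
    stream: (input) => inner.stream(prependSystemPrompt(input, prompt)),
  };
}

// Recording backend used to observe what each entry point receives.
const seenByStart: Message[][] = [];
const seenByStream: Message[][] = [];
const recordingBackend: BackendLike = {
  start(input) {
    seenByStart.push(input.messages);
  },
  stream(input) {
    seenByStream.push(input.messages);
    return ['ok'];
  },
};

const wrappedBackend = withPromptEverywhere(recordingBackend, 'You are a tax assistant.');
const userInput: BackendInput = { messages: [{ role: 'user', content: 'hello' }] };
wrappedBackend.start(userInput);
wrappedBackend.stream(userInput);
```

The helper returns a new input object instead of mutating the caller's, so calling start and then stream does not prepend the prompt twice.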


tangletools · 2026-06-14T00:51:04Z · trace

…profile path

Addresses the two MEDIUM review findings on the persona loop runner:
- costUsd no longer falls back to the conversation's aggregate spend for a
  profile-driven persona (that includes the persona-driver's spend → over-counts
  the worker). Aggregate fallback now applies ONLY to scripted personas (no LLM
  harness cost); profile personas report the worker's metered spend.
- adds end-to-end coverage of the profile-driven persona path (previously only
  its error throw was tested): multi-round execution, per-side prompt injection
  (worker vs persona each get their own profile prompt), and worker-only metering
  proving the persona-driver's spend never leaks into the worker's usage.
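
The cost rule described in the first bullet can be sketched as a small pure function. This is an assumed reading of the fix, not the committed code: `workerCostUsd`, `PersonaKind`, and the parameter names are hypothetical.

```typescript
type PersonaKind = 'scripted' | 'profile';

// meteredWorkerUsd: spend counted from the worker's own llm_call events.
// spentCreditsCents: conversation-wide spend across ALL participants.
function workerCostUsd(
  meteredWorkerUsd: number,
  spentCreditsCents: number,
  personaKind: PersonaKind,
): number {
  if (meteredWorkerUsd > 0) return meteredWorkerUsd;
  // A scripted persona makes no LLM calls, so the aggregate is worker-only
  // and safe to use as a fallback. A profile persona has its own spend in
  // the aggregate, so the fallback would over-count the worker.
  return personaKind === 'scripted' ? spentCreditsCents / 100 : meteredWorkerUsd;
}

const scriptedFallback = workerCostUsd(0, 150, 'scripted');
const profileNoFallback = workerCostUsd(0, 150, 'profile');
const meteredWins = workerCostUsd(0.25, 150, 'profile');
```

Under this reading, a profile persona paired with a worker that emits no cost reports zero rather than the conversation total.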

@tangletools tangletools left a comment

Contributor

✅ Auto-approved PR — 795d40ca

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-14T01:02:34Z

@tangletools tangletools left a comment

Contributor

🟢 Value Audit — sound

| Metric | Value |
| --- | --- |
| Verdict | sound |
| Concerns | 0 (none) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 30.0s (2 bridge agents) |
| Total | 30.0s |

No concerns — sound change, no better or existing approach found. ✅


value-audit · 20260614T010445Z

@drewstone drewstone merged commit 19ba908 into main Jun 14, 2026
1 check passed
drewstone added a commit that referenced this pull request Jun 14, 2026
- feat(conversation): runPersonaConversation + runPersonaDispatch — the persona
  loop runner; any AgentProfile evaluated as a multi-round conversation, drops
  into runProfileMatrix as dispatch (#282)
- feat(personify): connect the dormant analyst→steer wire + registryScopeAnalyst (#284)
- feat(skills): build-with-agent-runtime canonical spine (#285)
- fix(tool-loop): strict-model tool-call history (#286)
- docs: canonical-api.md API reference (#283)
drewstone added a commit that referenced this pull request Jun 14, 2026
…rsation/runConversation) (#289)

Another session shipped src/conversation/ (#282): runPersonaConversation (worker
under test ⟷ simulated-user persona driver, worker-only metered, drops into
runProfileMatrix via runPersonaDispatch) + runConversation (two profiles head-to-head).
These are a DIFFERENT layer from runPersonified/loopUntil (recursive-atom execution) —
the doc now distinguishes them and stops listing 'runConversation' as a thing-to-avoid
(it's a shipped canonical primitive). Adds the decision-table rows + a §3.1 subsection.

2 participants