Stabilize Tailnet Evo X2 draft quality and runtime metrics #40

## Problem

Tailnet Evo X2 is the correct primary runtime path, but generation quality and latency vary so much between runs that validating a single case is no longer sufficient. The product now supports multiple publishing targets, so runtime quality must be measured across the whole media matrix rather than against a single Terisuke/note-style scenario.

Evidence from 2026-05-02:

- Endpoint: `http://evo-x2:11434/v1`
- Transport: Tailscale VPN/MagicDNS, no SSH tunnel
- Model set: `gemma4:31b` draft, `gemma4:latest` style, `gemma4:e2b` brief
- Elapsed: `1396.80s` (about 23 minutes)
- Result: failed quality gates
- Score: `82.0`
- Draft length: `2653` runes, below the 2800-rune scenario gate
- First-person score: `49`, below the 60 strict threshold
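
The two gates that failed in this run can be expressed as a small check. This is a minimal sketch in Go, not the project's evaluator: the function name, constant names, and metric labels are hypothetical, and only the thresholds (2800 runes, first-person score 60) come from the evidence above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Thresholds quoted in the evidence above; the constant names are hypothetical.
const (
	minDraftRunes       = 2800 // scenario length gate, counted in runes
	minFirstPersonScore = 60   // strict first-person threshold
)

// failedGates returns the (hypothetical) names of the gates a draft misses.
// Length is counted in runes, not bytes, so Japanese text is measured fairly.
func failedGates(draft string, firstPersonScore int) []string {
	var failed []string
	if utf8.RuneCountInString(draft) < minDraftRunes {
		failed = append(failed, "draft_length")
	}
	if firstPersonScore < minFirstPersonScore {
		failed = append(failed, "first_person")
	}
	return failed
}

func main() {
	// Stand-in draft with the rune count reported on 2026-05-02.
	draft := strings.Repeat("あ", 2653)
	fmt.Println(failedGates(draft, 49)) // [draft_length first_person]
}
```

Returning the list of failed gates, rather than a single boolean, keeps the per-metric detail that the scenario output is expected to record.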

Evidence from 2026-05-03:

- `cmd/scenario/media_matrix` now defines varied cases for note, Cor blog, Zenn, Qiita, and homepage output.
- Live source acquisition succeeded for `note:cor_instrument`, `zenn:cloudia`, `qiita:Cloudia_Cor_Inc`, `rss:https://cor-jp.com/rss.xml`, and `github:Cor-Incorporated/corsweb2024/src/content/blog/ja`.
- The full live LLM run should use this matrix but should not be forced into every PR because Evo X2 runs are long and can occupy the primary runtime.

This is separate from Issue #36, which is about workstation-local llama.cpp fallback quality. This issue is for the primary Tailnet Evo X2 path.

## Scope

- Make Tailnet Evo X2 scenarios repeatable enough to distinguish model variance from prompt/regression bugs.
- Add scenario output that records endpoint, model, elapsed time, attempts, rune count, score, failed metrics, verification result, and whether the run was primary or fallback.
- Tune draft generation for primary Evo X2 so it reliably hits length, first-person, and format-specific requirements.
- Use `cmd/scenario/media_matrix` as the canonical cross-media input set.
- Track the live runner implementation under the child issue "Add live LLM media-matrix runner and aggregate evaluator".
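
One possible shape for the per-run scenario record is sketched below in Go. The struct, its field names, and the JSON keys are assumptions made for illustration; only the list of recorded values comes from the scope above. The sample values reuse the 2026-05-02 evidence where it exists and are marked as placeholders where it does not.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// RunRecord is a hypothetical per-run record covering the fields listed in
// the scope: endpoint, model, elapsed time, attempts, rune count, score,
// failed metrics, verification result, and runtime path.
type RunRecord struct {
	Endpoint           string   `json:"endpoint"`
	Model              string   `json:"model"`
	ElapsedSeconds     float64  `json:"elapsed_seconds"`
	Attempts           int      `json:"attempts"`
	RuneCount          int      `json:"rune_count"`
	Score              float64  `json:"score"`
	FailedMetrics      []string `json:"failed_metrics"`
	VerificationResult string   `json:"verification_result"`
	RuntimePath        string   `json:"runtime_path"` // "primary" or "fallback"
}

// encodeRecord renders one record as a single JSON line.
func encodeRecord(rec RunRecord) (string, error) {
	b, err := json.Marshal(rec)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	elapsed := 1396800 * time.Millisecond // 1396.80s from the evidence
	rec := RunRecord{
		Endpoint:           "http://evo-x2:11434/v1",
		Model:              "gemma4:31b",
		ElapsedSeconds:     elapsed.Seconds(),
		Attempts:           1, // placeholder: not stated in the evidence
		RuneCount:          2653,
		Score:              82.0,
		FailedMetrics:      []string{"draft_length", "first_person"}, // hypothetical labels
		VerificationResult: "unknown",                                // placeholder: not stated in the evidence
		RuntimePath:        "primary",
	}
	line, err := encodeRecord(rec)
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

Emitting one JSON object per run makes rows from repeated runs easy to append and compare, which is what separating model variance from prompt or regression bugs requires.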

## Media matrix cases

- `terisuke_note_essay` — note, reflective essay, `note:cor_instrument`
- `cor_blog_technical_report` — company blog, technical report, Cor GitHub Markdown
- `cor_blog_vision_sharing` — company blog, vision sharing, Cor GitHub Markdown
- `cloudia_zenn_tutorial` — Zenn tutorial, `zenn:cloudia`
- `cloudia_qiita_how_to` — Qiita how-to, `qiita:Cloudia_Cor_Inc`
- `cor_homepage_section` — homepage section, format sanity check

The final user-facing publishing comparison is note / Qiita / Zenn / company blog. Homepage remains useful as a shorter format-control case.
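
The case list above could be held as a small table in Go, as sketched here. This is illustrative only and does not reproduce the actual contents of `cmd/scenario/media_matrix`: the `MediaCase` type, the medium and format labels, and the helper name are hypothetical. Mapping the two company-blog cases to the `github:` selector from the 2026-05-03 evidence is also an assumption.

```go
package main

import "fmt"

// MediaCase is a hypothetical row of the media matrix.
type MediaCase struct {
	Name   string
	Medium string // "note", "company_blog", "zenn", "qiita", "homepage"
	Format string
	Source string // source selector; empty when the issue names none
}

// Assumed selector for "Cor GitHub Markdown", taken from the live-source evidence.
const corBlogSource = "github:Cor-Incorporated/corsweb2024/src/content/blog/ja"

var mediaMatrix = []MediaCase{
	{"terisuke_note_essay", "note", "reflective_essay", "note:cor_instrument"},
	{"cor_blog_technical_report", "company_blog", "technical_report", corBlogSource},
	{"cor_blog_vision_sharing", "company_blog", "vision_sharing", corBlogSource},
	{"cloudia_zenn_tutorial", "zenn", "tutorial", "zenn:cloudia"},
	{"cloudia_qiita_how_to", "qiita", "how_to", "qiita:Cloudia_Cor_Inc"},
	{"cor_homepage_section", "homepage", "section", ""},
}

// publishingComparison drops the homepage format-control case and keeps the
// note / Qiita / Zenn / company blog cases used for the user-facing comparison.
func publishingComparison(cases []MediaCase) []MediaCase {
	var out []MediaCase
	for _, c := range cases {
		if c.Medium != "homepage" {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	for _, c := range publishingComparison(mediaMatrix) {
		fmt.Printf("%-28s %-13s %s\n", c.Name, c.Medium, c.Format)
	}
}
```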

## Acceptance criteria

- [ ] Three consecutive Tailnet Evo X2 scenario runs produce at least two passes under the strict gates for the selected phase slice.
- [ ] Each run records endpoint, model, elapsed time, score, rune count, failed metrics, verification result, and attempt count.
- [ ] Media-matrix live runs produce comparable rows for note, Qiita, Zenn, and company blog.
- [ ] The draft service does not silently return a failed draft as successful.
- [ ] The UI can show failed evaluation details without discarding the generated draft.
- [ ] Failures are grouped by source selector, persona, output format, target length, verifier result, and runtime path.
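
The first and last criteria can be sketched as two small Go functions. This is an assumption-laden illustration, not the planned aggregate evaluator: the type and function names are hypothetical, "three consecutive runs" is read as the three most recent runs, and the sample persona, target length, and verifier values are placeholders.

```go
package main

import "fmt"

// FailureKey holds the six grouping dimensions named in the criteria.
// All fields are comparable, so the struct can be used directly as a map key.
type FailureKey struct {
	SourceSelector string
	Persona        string
	OutputFormat   string
	TargetLength   int
	VerifierResult string
	RuntimePath    string
}

// RunResult is a hypothetical outcome of one scenario run.
type RunResult struct {
	Key    FailureKey
	Passed bool
}

// stableOverLastThree reports whether the three most recent runs contain at
// least two passes. Fewer than three runs never satisfies the criterion.
func stableOverLastThree(runs []RunResult) bool {
	if len(runs) < 3 {
		return false
	}
	passes := 0
	for _, r := range runs[len(runs)-3:] {
		if r.Passed {
			passes++
		}
	}
	return passes >= 2
}

// groupFailures counts failed runs per grouping key.
func groupFailures(runs []RunResult) map[FailureKey]int {
	groups := make(map[FailureKey]int)
	for _, r := range runs {
		if !r.Passed {
			groups[r.Key]++
		}
	}
	return groups
}

func main() {
	// Placeholder values for illustration only.
	key := FailureKey{
		SourceSelector: "note:cor_instrument",
		Persona:        "terisuke",
		OutputFormat:   "reflective_essay",
		TargetLength:   2800,
		VerifierResult: "failed",
		RuntimePath:    "primary",
	}
	runs := []RunResult{{key, true}, {key, false}, {key, true}}
	fmt.Println(stableOverLastThree(runs)) // true
	fmt.Println(len(groupFailures(runs)))  // 1
}
```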

## Related

- #18: streaming/cancellation makes long primary-runtime runs usable.
- #26: persistence should land before the full expensive media run if possible.
- #36: local llama.cpp fallback quality; separate from this issue.

