Problem
Tailnet Evo X2 is the correct primary runtime path, but generation quality and latency vary enough that single-case validation is no longer sufficient. The product now supports multiple publishing targets, so runtime quality must be measured across the media matrix rather than only a Terisuke/note-style scenario.
Evidence from 2026-05-02:
Endpoint: http://evo-x2:11434/v1
Transport: Tailscale VPN/MagicDNS, no SSH tunnel
Model set: gemma4:31b (draft), gemma4:latest (style), gemma4:e2b (brief)
Elapsed: 1396.80s
Result: failed quality gates
Score: 82.0
Draft length: 2653 runes, below the 2800 scenario gate
First-person score: 49, below the 60 strict threshold
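The two gates that failed on 2026-05-02 can be sketched as a small check. This is a minimal illustration, assuming the length gate counts Unicode code points (runes) and the first-person score is an integer computed elsewhere. The function name and the metric labels `draft_length` and `first_person` are hypothetical, not the project's identifiers.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Thresholds taken from the 2026-05-02 evidence.
const (
	minDraftRunes       = 2800 // scenario length gate
	minFirstPersonScore = 60   // strict first-person threshold
)

// failedGates returns the labels of the gates a draft misses.
// The labels are hypothetical; the real evaluator may name them differently.
func failedGates(draft string, firstPersonScore int) []string {
	var failed []string
	// Count runes, not bytes, so Japanese text is measured per character.
	if utf8.RuneCountInString(draft) < minDraftRunes {
		failed = append(failed, "draft_length")
	}
	if firstPersonScore < minFirstPersonScore {
		failed = append(failed, "first_person")
	}
	return failed
}

func main() {
	// Synthetic draft with the rune count reported in the evidence.
	draft := strings.Repeat("あ", 2653)
	fmt.Println(failedGates(draft, 49)) // prints [draft_length first_person]
}
```

Counting runes rather than bytes matters here because each Japanese character occupies three bytes in UTF-8, so a byte-length gate would pass drafts that are far too short.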
Evidence from 2026-05-03:
cmd/scenario/media_matrix now defines varied cases for note, Cor blog, Zenn, Qiita, and homepage output.
Live source acquisition succeeded for note:cor_instrument, zenn:cloudia, qiita:Cloudia_Cor_Inc, rss:https://cor-jp.com/rss.xml, and github:Cor-Incorporated/corsweb2024/src/content/blog/ja.
The full live LLM run should use this matrix but should not be forced into every PR because Evo X2 runs are long and can occupy the primary runtime.
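One way to keep the long live run out of ordinary PR checks is an explicit opt-in switch. The sketch below assumes an environment variable named `RUN_LIVE_LLM`; that name and the `shouldRunLive` helper are hypothetical and only illustrate the idea.

```go
package main

import (
	"fmt"
	"os"
)

// liveRunEnv is a hypothetical variable name; the project may use another switch.
const liveRunEnv = "RUN_LIVE_LLM"

// shouldRunLive reports whether the live media-matrix run was explicitly requested.
// The lookup function is injected so the decision can be tested without touching
// the real environment.
func shouldRunLive(getenv func(string) string) bool {
	return getenv(liveRunEnv) == "1"
}

func main() {
	if !shouldRunLive(os.Getenv) {
		fmt.Println("skipping live Evo X2 media-matrix run")
		return
	}
	fmt.Println("running live Evo X2 media-matrix run")
}
```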
This is separate from Issue #36, which is about workstation-local llama.cpp fallback quality. This issue is for the primary Tailnet Evo X2 path.
Scope
Make Tailnet Evo X2 scenarios repeatable enough to distinguish model variance from prompt/regression bugs.
Add scenario output that records endpoint, model, elapsed time, attempts, rune count, score, failed metrics, verification result, and whether the run was primary or fallback.
Tune draft generation for primary Evo X2 so it reliably hits length, first-person, and format-specific requirements.
Use cmd/scenario/media_matrix as the canonical cross-media input set.
Track the live runner implementation under the child issue "Add live LLM media-matrix runner and aggregate evaluator".
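The per-run output described in the scope could be captured as one JSON record per scenario. The following is a sketch only: the `RunRecord` type, its field names, and the `encode`/`decode` helpers are assumptions for illustration. The sample values come from the 2026-05-02 evidence where it records them; `Attempts` and `Verified` are placeholders because the evidence does not state them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RunRecord is a hypothetical shape for one scenario result.
type RunRecord struct {
	Endpoint      string   `json:"endpoint"`
	Model         string   `json:"model"`
	ElapsedSec    float64  `json:"elapsed_sec"`
	Attempts      int      `json:"attempts"`
	RuneCount     int      `json:"rune_count"`
	Score         float64  `json:"score"`
	FailedMetrics []string `json:"failed_metrics"`
	Verified      bool     `json:"verified"`
	Runtime       string   `json:"runtime"` // "primary" or "fallback"
}

// encode serializes a record as a single JSON line.
func encode(r RunRecord) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

// decode parses a JSON line back into a record.
func decode(s string) (RunRecord, error) {
	var r RunRecord
	err := json.Unmarshal([]byte(s), &r)
	return r, err
}

func main() {
	rec := RunRecord{
		Endpoint:      "http://evo-x2:11434/v1",
		Model:         "gemma4:31b",
		ElapsedSec:    1396.80,
		Attempts:      1, // placeholder: not recorded in the evidence
		RuneCount:     2653,
		Score:         82.0,
		FailedMetrics: []string{"draft_length", "first_person"}, // hypothetical labels
		Verified:      false, // placeholder: not recorded in the evidence
		Runtime:       "primary",
	}
	line, err := encode(rec)
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

Writing one record per line keeps repeated runs easy to append and compare, which helps separate run-to-run model variance from a genuine prompt regression.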
Media matrix cases
terisuke_note_essay: note, reflective essay, note:cor_instrument
cor_blog_technical_report: company blog, technical report, Cor GitHub Markdown
cor_blog_vision_sharing: company blog, vision sharing, Cor GitHub Markdown
cloudia_zenn_tutorial: Zenn tutorial, zenn:cloudia
cloudia_qiita_how_to: Qiita how-to, qiita:Cloudia_Cor_Inc
cor_homepage_section: homepage section, format sanity check
The final user-facing publishing comparison is note / Qiita / Zenn / company blog. Homepage remains useful as a shorter format-control case.
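The case list can be pictured as a small table in code. This is an illustrative sketch, not the actual contents of cmd/scenario/media_matrix: the `Case` type, its fields, and `comparisonCases` are hypothetical. Mapping "Cor GitHub Markdown" to the github source named in the 2026-05-03 evidence is an assumption, and the homepage source is left empty because it is not stated.

```go
package main

import "fmt"

// Case is a hypothetical representation of one media-matrix entry.
type Case struct {
	ID     string
	Medium string
	Format string
	Source string
}

// corGitHub is assumed to be the source behind "Cor GitHub Markdown".
const corGitHub = "github:Cor-Incorporated/corsweb2024/src/content/blog/ja"

var mediaMatrix = []Case{
	{"terisuke_note_essay", "note", "reflective essay", "note:cor_instrument"},
	{"cor_blog_technical_report", "company blog", "technical report", corGitHub},
	{"cor_blog_vision_sharing", "company blog", "vision sharing", corGitHub},
	{"cloudia_zenn_tutorial", "Zenn", "tutorial", "zenn:cloudia"},
	{"cloudia_qiita_how_to", "Qiita", "how-to", "qiita:Cloudia_Cor_Inc"},
	{"cor_homepage_section", "homepage", "section", ""}, // format sanity check
}

// comparisonCases drops the homepage control case, leaving the
// note / Qiita / Zenn / company blog comparison set.
func comparisonCases(all []Case) []Case {
	var out []Case
	for _, c := range all {
		if c.Medium != "homepage" {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	for _, c := range comparisonCases(mediaMatrix) {
		fmt.Printf("%s\t%s\t%s\n", c.ID, c.Medium, c.Format)
	}
}
```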
Acceptance criteria
Related