Stabilize Tailnet Evo X2 draft quality and runtime metrics #40

@terisuke

Problem

Tailnet Evo X2 is the correct primary runtime path, but generation quality and latency vary enough that single-case validation is no longer sufficient. The product now supports multiple publishing targets, so runtime quality must be measured across the media matrix rather than against a single Terisuke/note-style scenario.

Evidence from 2026-05-02:

  • Endpoint: http://evo-x2:11434/v1
  • Transport: Tailscale VPN/MagicDNS, no SSH tunnel
  • Model set: gemma4:31b (draft), gemma4:latest (style), gemma4:e2b (brief)
  • Elapsed: 1396.80s
  • Result: failed quality gates
  • Score: 82.0
  • Draft length: 2653 runes, below the 2800-rune scenario gate
  • First-person score: 49, below the 60 strict threshold

Evidence from 2026-05-03:

  • cmd/scenario/media_matrix now defines varied cases for note, Cor blog, Zenn, Qiita, and homepage output.
  • Live source acquisition succeeded for note:cor_instrument, zenn:cloudia, qiita:Cloudia_Cor_Inc, rss:https://cor-jp.com/rss.xml, and github:Cor-Incorporated/corsweb2024/src/content/blog/ja.
  • The full live LLM run should use this matrix but should not be required on every PR, because Evo X2 runs are long and occupy the primary runtime while they execute.

This is separate from Issue #36, which is about workstation-local llama.cpp fallback quality. This issue is for the primary Tailnet Evo X2 path.

Scope

  • Make Tailnet Evo X2 scenarios repeatable enough to distinguish model variance from prompt/regression bugs.
  • Add scenario output that records endpoint, model, elapsed time, attempts, rune count, score, failed metrics, verification result, and whether the run was primary or fallback.
  • Tune draft generation for primary Evo X2 so it reliably hits length, first-person, and format-specific requirements.
  • Use cmd/scenario/media_matrix as the canonical cross-media input set.
  • Track the live runner implementation under the child issue "Add live LLM media-matrix runner and aggregate evaluator".

Media matrix cases

  • terisuke_note_essay — note, reflective essay, note:cor_instrument
  • cor_blog_technical_report — company blog, technical report, Cor GitHub Markdown
  • cor_blog_vision_sharing — company blog, vision sharing, Cor GitHub Markdown
  • cloudia_zenn_tutorial — Zenn tutorial, zenn:cloudia
  • cloudia_qiita_how_to — Qiita how-to, qiita:Cloudia_Cor_Inc
  • cor_homepage_section — homepage section, format sanity check

The final user-facing publishing comparison is note / Qiita / Zenn / company blog. Homepage remains useful as a shorter format-control case.

Acceptance criteria

  • Three consecutive Tailnet Evo X2 scenario runs produce at least two passes under the strict gates for the selected phase slice.
  • Each run records endpoint, model, elapsed time, score, rune count, failed metrics, verification result, and attempt count.
  • Media-matrix live runs produce comparable rows for note, Qiita, Zenn, and company blog.
  • The draft service does not silently return a failed draft as successful.
  • The UI can show failed evaluation details without discarding the generated draft.
  • Failures are grouped by source selector, persona, output format, target length, verifier result, and runtime path.
