Add live LLM media-matrix runner and aggregate evaluator #57

@terisuke

Problem

cmd/scenario/media_matrix defines the planned cross-media cases, but there is no live runner yet that executes those cases one by one against Evo X2 Tailnet and aggregates comparable metrics. Running every case manually is error-prone, expensive, and hard to resume when a long run fails partway through.

This is a child implementation issue for #40.

Scope

Acceptance criteria

  • Offline mode remains the default.
  • Live mode requires explicit env vars and refuses to run against workstation-local fallback unless fallback mode is explicitly requested.
  • All six cases produce comparable rows.
  • Note, Qiita, Zenn, and company-blog rows are clearly marked as the final publishing-target comparison.
  • Failures are grouped by source, persona, format, target length, verifier, and runtime path.
  • The aggregate report links each generated draft and verification artifact.
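
The first two criteria (offline by default, live only on explicit opt-in, no silent fallback) could be enforced by a small mode gate like the sketch below. This is only an illustration under stated assumptions: the env var names `MEDIA_MATRIX_LIVE`, `MEDIA_MATRIX_ENDPOINT`, and `MEDIA_MATRIX_ALLOW_FALLBACK`, the `Mode` type, and `resolveMode` are all hypothetical and not taken from the repository. It also assumes that "fallback" means running live without a configured remote endpoint.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Mode is the resolved run mode for the matrix runner (hypothetical type).
type Mode string

const (
	ModeOffline  Mode = "offline"
	ModeLive     Mode = "live"
	ModeFallback Mode = "live-fallback"
)

// Hypothetical env var names; the real runner may use different ones.
const (
	envLive          = "MEDIA_MATRIX_LIVE"
	envEndpoint      = "MEDIA_MATRIX_ENDPOINT"
	envAllowFallback = "MEDIA_MATRIX_ALLOW_FALLBACK"
)

var errFallbackNotRequested = errors.New(
	"live mode requested without an endpoint, and fallback was not explicitly requested")

// resolveMode decides the run mode from environment lookups.
// It takes the lookup function as a parameter so it can be exercised
// without touching the real process environment.
func resolveMode(getenv func(string) string) (Mode, error) {
	// Offline stays the default: live needs an explicit opt-in.
	if getenv(envLive) != "1" {
		return ModeOffline, nil
	}
	// Live with a configured remote endpoint.
	if getenv(envEndpoint) != "" {
		return ModeLive, nil
	}
	// No endpoint: only use the workstation-local fallback if asked to.
	if getenv(envAllowFallback) == "1" {
		return ModeFallback, nil
	}
	return "", errFallbackNotRequested
}

func main() {
	mode, err := resolveMode(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "refusing to run:", err)
		os.Exit(1)
	}
	fmt.Println("mode:", mode)
}
```

Returning an error instead of silently downgrading keeps a misconfigured live run from producing rows that look comparable but came from a different runtime path.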

Dependencies
