Add live LLM media-matrix runner and aggregate evaluator #57

## Problem

`cmd/scenario/media_matrix` defines the planned cross-media cases, but there is not yet a live runner that executes those cases one by one against the Evo X2 Tailnet endpoint and aggregates comparable metrics. Running every case manually is error-prone and expensive, and a failure partway through a long run is hard to resume from.

This is a child implementation issue for #40.

## Scope

- Consume `cmd/scenario/media_matrix` outputs.
- Run the six live cases one by one against the Evo X2 Tailnet OpenAI-compatible API.
- Reuse source selectors proven by #22/#56.
- Record:
  - endpoint and runtime path,
  - model per phase,
  - elapsed seconds,
  - first-token or first-chunk latency when streaming is available,
  - style score,
  - output length in runes,
  - verification result,
  - failed metrics,
  - output path,
  - scenario case id.
- Support resume/skip so a failed long run does not restart the whole matrix.
- Write one aggregate JSON report and one Markdown report for #40.

## Acceptance criteria

- [ ] Offline mode remains the default.
- [ ] Live mode requires explicit env vars and refuses to run against the workstation-local fallback unless fallback mode is explicitly requested.
- [ ] All six cases produce comparable rows.
- [ ] Note, Qiita, Zenn, and company-blog rows are clearly marked as the final publishing-target comparison.
- [ ] Failures are grouped by source, persona, format, target length, verifier, and runtime path.
- [ ] The aggregate report links each generated draft and verification artifact.
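The first two criteria (offline by default, live only on explicit opt-in, fallback only on a second explicit opt-in) might be enforced with a guard along these lines. This is a sketch, not the project's actual configuration: the env var names `MEDIA_MATRIX_LIVE`, `MEDIA_MATRIX_BASE_URL`, and `MEDIA_MATRIX_ALLOW_FALLBACK` are invented for illustration, and treating "workstation-local" as "loopback host" is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
	"os"
	"strings"
)

// Mode is the resolved run mode. All names here are hypothetical.
type Mode string

const (
	ModeOffline  Mode = "offline"
	ModeLive     Mode = "live"
	ModeFallback Mode = "fallback"
)

// isLoopback reports whether rawURL points at localhost or a loopback IP.
// Assumption: "workstation-local" means a loopback host.
func isLoopback(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	host := u.Hostname()
	if host == "" {
		return false, errors.New("base URL must be absolute, e.g. http://host/v1")
	}
	if strings.EqualFold(host, "localhost") {
		return true, nil
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback(), nil
}

// resolveMode decides the run mode from environment lookups. It takes a
// getenv function so the logic can be exercised without touching os.Environ.
func resolveMode(getenv func(string) string) (Mode, string, error) {
	if getenv("MEDIA_MATRIX_LIVE") != "1" {
		return ModeOffline, "", nil // offline stays the default
	}
	baseURL := strings.TrimSpace(getenv("MEDIA_MATRIX_BASE_URL"))
	if baseURL == "" {
		return "", "", errors.New("live mode requires MEDIA_MATRIX_BASE_URL")
	}
	local, err := isLoopback(baseURL)
	if err != nil {
		return "", "", fmt.Errorf("invalid MEDIA_MATRIX_BASE_URL: %w", err)
	}
	if !local {
		return ModeLive, baseURL, nil
	}
	if getenv("MEDIA_MATRIX_ALLOW_FALLBACK") != "1" {
		return "", "", errors.New("refusing workstation-local endpoint without MEDIA_MATRIX_ALLOW_FALLBACK=1")
	}
	return ModeFallback, baseURL, nil
}

func main() {
	mode, baseURL, err := resolveMode(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Printf("mode=%s base_url=%q\n", mode, baseURL)
}
```

Recording the resolved mode alongside each row would also give the "runtime path" grouping in the failure report something concrete to key on.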

## Dependencies

- #40 owns the runtime-quality target.
- #26 should land before the full expensive run if we want the results to become product history instead of loose artifacts.

