11 changes: 7 additions & 4 deletions docs/adrs/0002-multi-persona-multi-format-extension.md
@@ -217,13 +217,15 @@ Current implementation status as of 2026-05-03:
- The 2026-05-03 browser 500 analysis showed an implementation drift: plain web-app startup still defaulted to workstation-local `127.0.0.1:8081`, while this ADR requires Evo X2 Tailnet as primary. Issue [#63](https://github.com/terisuke/note_maker/issues/63) restores the default order to Evo X2 Ollama over Tailnet → Evo X2 llama.cpp → workstation-local llama.cpp and makes the UI show the actual endpoint/model reported by SSE.
- The interview question set was simplified before the next Evo X2 run ([#66](https://github.com/terisuke/note_maker/issues/66)): broad editorial questions are now split into smaller plain-Japanese prompts, medium-specific prompts cover note/Zenn/Qiita/Cor blog needs, and optional questions can be advanced as `未定`. Validation is recorded in [Issue 66 plain brief questions validation](../validation/issue-66-plain-brief-questions-2026-05-03.md).
- Style analysis is now persona/format-aware ([#68](https://github.com/terisuke/note_maker/issues/68)): the web UI shows a general `文体ソース` selector instead of `Noteユーザー名`, defaults it to note/Zenn/Qiita/Cor GitHub Markdown based on the selected mode, and makes persona presets include output-format notes. Validation is recorded in [Issue 68 media-aware style source validation](../validation/issue-68-media-aware-style-source-2026-05-03.md).
- The 2026-05-03 full Tailnet Evo X2 media-matrix run proved that the runtime path works but also proved that the current scenario is not sufficient as an interview-template acceptance test: only `terisuke_note_essay` passed, Cor blog failed on assistant preamble leakage, Zenn/Qiita failed on cross-format notation leakage, and homepage failed long-form gates despite being a short HTML section. Runtime stabilization is therefore decomposed under epic [#40](https://github.com/terisuke/note_maker/issues/40) into [#70](https://github.com/terisuke/note_maker/issues/70) template/brief scenario coverage, [#71](https://github.com/terisuke/note_maker/issues/71) failed draft artifacts, [#72](https://github.com/terisuke/note_maker/issues/72) bounded format repair, [#73](https://github.com/terisuke/note_maker/issues/73) output-format-specific gates, and [#74](https://github.com/terisuke/note_maker/issues/74) staged Evo X2 reruns.
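
The endpoint priority restored by #63 (Evo X2 Ollama over Tailnet, then Evo X2 llama.cpp, then workstation-local llama.cpp) can be sketched as an ordered candidate list in which the first reachable entry wins. This is a minimal illustration, not the project's implementation: `Endpoint`, `defaultOrder`, `pickEndpoint`, and the `evo-x2.example.ts.net` host names and ports are hypothetical placeholders. Only the local `127.0.0.1:8081` address comes from the documents in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Endpoint is a hypothetical description of one LLM runtime candidate.
type Endpoint struct {
	Name    string
	BaseURL string
}

// defaultOrder mirrors the priority stated in the ADR.
// The Tailnet host names and ports are placeholders, not real values.
func defaultOrder() []Endpoint {
	return []Endpoint{
		{Name: "evo-x2-ollama", BaseURL: "http://evo-x2.example.ts.net:11434/v1"},
		{Name: "evo-x2-llamacpp", BaseURL: "http://evo-x2.example.ts.net:8081/v1"},
		{Name: "local-llamacpp", BaseURL: "http://127.0.0.1:8081/v1"},
	}
}

// pickEndpoint returns the first candidate that the probe reports as reachable.
func pickEndpoint(candidates []Endpoint, reachable func(Endpoint) bool) (Endpoint, error) {
	for _, c := range candidates {
		if reachable(c) {
			return c, nil
		}
	}
	return Endpoint{}, errors.New("no LLM endpoint reachable")
}

func main() {
	// Simulate the Tailnet Ollama endpoint being down.
	probe := func(e Endpoint) bool { return e.Name != "evo-x2-ollama" }
	chosen, err := pickEndpoint(defaultOrder(), probe)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The chosen endpoint is what the UI would display.
	fmt.Println("using", chosen.Name, chosen.BaseURL)
}
```

Passing the probe as a function keeps the ordering logic testable without network access, and the returned `Endpoint` is the value a UI could show as the actual endpoint in use.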

Near-term execution order:

1. Browser sanity check for the #66/#68 setup — confirm both the smaller questions and the style source change when switching note/Zenn/Qiita/Cor blog modes.
2. Phase C2/C3 ([#27](https://github.com/terisuke/note_maker/issues/27), [#28](https://github.com/terisuke/note_maker/issues/28)) — expose persisted sessions, guides, briefs, drafts, and verification artifacts in the web app.
3. Runtime stabilization ([#40](https://github.com/terisuke/note_maker/issues/40)) — first run one bounded media-matrix case through `cmd/scenario/live_media_matrix`, then run the full Note/Qiita/Zenn/Cor blog Evo X2 comparison once the UI can reuse the stored outputs.
4. Browser E2E ([#13](https://github.com/terisuke/note_maker/issues/13)) — cover persona/format switching, edit/fork, streaming, section regeneration, and persisted-history recovery after C2/C3 has visible browser surface.
1. Add an interview-template scenario ([#70](https://github.com/terisuke/note_maker/issues/70)) before spending more Evo X2 runtime. It must prove that the simplified questions are small, medium-specific, and able to produce distinct `ArticleBrief` outputs for note, Cor blog, Zenn, Qiita, and homepage.
2. Make failed generation diagnosable ([#71](https://github.com/terisuke/note_maker/issues/71)) and recoverable when the issue is format-only ([#72](https://github.com/terisuke/note_maker/issues/72)).
3. Split scenario gates by output format ([#73](https://github.com/terisuke/note_maker/issues/73)) so homepage HTML is not judged as a long article while note/Zenn/Qiita/Cor blog remain strict.
4. Re-run Evo X2 in stages ([#74](https://github.com/terisuke/note_maker/issues/74)): template scenario, offline media matrix, one previously failing live case, then the full note/Qiita/Zenn/Cor blog comparison.
5. Continue Phase C2/C3 ([#27](https://github.com/terisuke/note_maker/issues/27), [#28](https://github.com/terisuke/note_maker/issues/28)) and Browser E2E ([#13](https://github.com/terisuke/note_maker/issues/13)) in parallel where write scopes do not conflict.

## Tracked issues

@@ -243,6 +245,7 @@ Filed 2026-05-02 as part of the PR that introduced this ADR.
- C3 — [#28](https://github.com/terisuke/note_maker/issues/28) Render brief and style guide as human-readable cards
- D1 — [#29](https://github.com/terisuke/note_maker/issues/29) HTTP handler tests for `internal/handlers/workflow.go` — implemented in the current cut with 80.0% handler package coverage.
- Runtime runner — [#57](https://github.com/terisuke/note_maker/issues/57) Add live LLM media-matrix runner and aggregate evaluator, feeding [#40](https://github.com/terisuke/note_maker/issues/40) — implemented in the current cut.
- Runtime stabilization epic — [#40](https://github.com/terisuke/note_maker/issues/40) Stabilize Tailnet Evo X2 draft quality and runtime metrics. Sub-issues: [#70](https://github.com/terisuke/note_maker/issues/70), [#71](https://github.com/terisuke/note_maker/issues/71), [#72](https://github.com/terisuke/note_maker/issues/72), [#73](https://github.com/terisuke/note_maker/issues/73), [#74](https://github.com/terisuke/note_maker/issues/74).

## Consequences

10 changes: 8 additions & 2 deletions docs/implementation-plans/issue-adr-guardrails.md
@@ -21,14 +21,20 @@ Open issues that ADR 0002 reframes (see [ADR 0002 — Tracked issues](../adrs/00
| [#14](https://github.com/terisuke/note_maker/issues/14) | Persistent queryable database | ADR 0002 §Persistence direction | SQLite migration is the acceptance for #14; multi-persona schema is mandatory. |
| [#15](https://github.com/terisuke/note_maker/issues/15) | Desktop launcher packaging | Out of ADR 0002 scope | Tracked separately; depends on Phase C completion before packaging makes sense. |
| [#36](https://github.com/terisuke/note_maker/issues/36) | local llama.cpp fallback quality | ADR 0001/0002 runtime validation | Non-blocking for Phase A. Do not promote fallback as production-quality until it passes strict draft thresholds. |
| [#40](https://github.com/terisuke/note_maker/issues/40) | Tailnet Evo X2 primary quality and runtime metrics | ADR 0001/0002 runtime validation | Primary runtime must record endpoint/model/elapsed/score/runes and distinguish generation variance from transport failures. It now owns live runs from `cmd/scenario/media_matrix` across note, Qiita, Zenn, and Cor blog. |
| [#40](https://github.com/terisuke/note_maker/issues/40) | Tailnet Evo X2 primary quality and runtime metrics epic | ADR 0001/0002 runtime validation | Primary runtime must record endpoint/model/elapsed/score/runes and distinguish generation variance from transport failures. It owns live runs from `cmd/scenario/media_matrix`, but the 2026-05-03 result showed that template usability, failure artifacts, repair, and format-specific gates must land before claiming the full media-matrix result. |
| [#57](https://github.com/terisuke/note_maker/issues/57) | Live media-matrix runner and aggregate evaluator | ADR 0001/0002 runtime validation | Child of #40. Offline mode remains default; live mode must require explicit env vars and must refuse accidental workstation-local fallback for primary Evo X2 validation. |
| [#70](https://github.com/terisuke/note_maker/issues/70) | Interview-template scenario before Evo X2 media runs | ADR 0002 §Testing Strategy | The question-template change must be tested before draft-only live runs. Scenario output must prove small plain-Japanese questions and medium-specific `ArticleBrief` artifacts. |
| [#71](https://github.com/terisuke/note_maker/issues/71) | Failed draft artifacts and runtime metrics | ADR 0001/0002 runtime validation | Early validation failures must preserve raw output, elapsed time, endpoint, model, and failure JSON. Do not discard unusable drafts before diagnosis. |
| [#72](https://github.com/terisuke/note_maker/issues/72) | Bounded format-repair retry | ADR 0002 §Format-specific output | Validators remain strict. One repair retry may be attempted for recoverable preamble or cross-format notation failures, with original and repaired attempts preserved. |
| [#73](https://github.com/terisuke/note_maker/issues/73) | Output-format-specific scenario gates | ADR 0002 §Testing Strategy | Long-form note/Zenn/Qiita/Cor blog gates stay strict, while homepage HTML uses short-form structure and CTA gates instead of long-article length assumptions. |
| [#74](https://github.com/terisuke/note_maker/issues/74) | Staged Tailnet Evo X2 validation rerun | ADR 0001/0002 runtime validation | Re-run order is template scenario → offline media matrix → one previously failing live case → full note/Qiita/Zenn/Cor blog live comparison. |
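
The #72 guardrail (validators stay strict, at most one repair retry, both attempts preserved) might look roughly like the sketch below. Everything here is an assumption made for illustration: the `Attempt` type, the failure codes, the `validate` rules (an English "Sure," preamble and a Zenn-style `:::message` block appearing in a Qiita draft), and `dropFirstLine`, which stands in for a real LLM repair call.

```go
package main

import (
	"fmt"
	"strings"
)

// Attempt keeps one generation result together with its validation failures,
// so the original output survives even when a repair follows.
type Attempt struct {
	Kind     string // "original" or "repair"
	Output   string
	Failures []string
}

// validate is an illustrative strict validator, not the project's real one.
func validate(format, draft string) []string {
	var failures []string
	if strings.HasPrefix(draft, "Sure,") {
		failures = append(failures, "preamble")
	}
	if format == "qiita" && strings.Contains(draft, ":::message") {
		failures = append(failures, "cross_format_notation")
	}
	if strings.TrimSpace(draft) == "" {
		failures = append(failures, "empty")
	}
	return failures
}

// recoverable reports whether every failure is a format-only problem.
func recoverable(failures []string) bool {
	for _, f := range failures {
		if f != "preamble" && f != "cross_format_notation" {
			return false
		}
	}
	return len(failures) > 0
}

// generateWithRepair allows at most one repair call and returns every attempt.
func generateWithRepair(format string, generate func() string, repair func(string) string) ([]Attempt, bool) {
	first := generate()
	failures := validate(format, first)
	attempts := []Attempt{{Kind: "original", Output: first, Failures: failures}}
	if len(failures) == 0 {
		return attempts, true
	}
	if !recoverable(failures) {
		return attempts, false
	}
	repaired := repair(first)
	failures = validate(format, repaired)
	attempts = append(attempts, Attempt{Kind: "repair", Output: repaired, Failures: failures})
	return attempts, len(failures) == 0
}

// dropFirstLine stands in for an LLM repair call by removing the preamble line.
func dropFirstLine(draft string) string {
	_, rest, _ := strings.Cut(draft, "\n")
	return rest
}

func main() {
	gen := func() string { return "Sure, here is the article.\n# Title\nBody" }
	attempts, ok := generateWithRepair("qiita", gen, dropFirstLine)
	fmt.Println("attempts:", len(attempts), "passed:", ok)
}
```

The repaired draft goes through the same `validate` call as the original, which is one way to keep the retry from relaxing the gate. Non-recoverable failures return immediately with the original attempt intact, in line with the #71 requirement that unusable drafts are not discarded before diagnosis.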

Current cut status:

- [#26](https://github.com/terisuke/note_maker/issues/26) is implemented as `internal/infrastructure/repository/sqlite` plus `WORKFLOW_STORE_DRIVER=sqlite` web-app opt-in. [#14](https://github.com/terisuke/note_maker/issues/14) remains the broader queryable-history umbrella until the UI/API surface is exposed.
- [#29](https://github.com/terisuke/note_maker/issues/29) reaches the handler coverage gate: `go test ./internal/handlers -cover` reports 80.0%.
- [#57](https://github.com/terisuke/note_maker/issues/57) is implemented as `cmd/scenario/live_media_matrix`; it defaults to offline planned aggregate output and requires `RUN_LIVE_MEDIA_MATRIX=1` or `make scenario-media-matrix-live` for Evo X2 calls.
- [#40](https://github.com/terisuke/note_maker/issues/40) is now an epic with sub-issues [#70](https://github.com/terisuke/note_maker/issues/70) through [#74](https://github.com/terisuke/note_maker/issues/74). Do not close #40 until the staged validation and consecutive-run acceptance criteria are met.
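
The offline-by-default behavior of the #57 runner, combined with the rule that live validation must not silently fall back to a workstation-local endpoint, could be expressed as a small gate like the following. `planRun` and its error messages are hypothetical. `RUN_LIVE_MEDIA_MATRIX` and `LLM_BASE_URL` are the variable names used in these docs, but how the real runner reads them is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// planRun decides the run mode from environment variables.
// Offline is the default; live mode needs an explicit opt-in and a non-local base URL.
func planRun(getenv func(string) string) (string, error) {
	if getenv("RUN_LIVE_MEDIA_MATRIX") != "1" {
		return "offline", nil
	}
	base := getenv("LLM_BASE_URL")
	if base == "" {
		return "", errors.New("live mode requires LLM_BASE_URL to be set explicitly")
	}
	if strings.Contains(base, "127.0.0.1") || strings.Contains(base, "localhost") {
		return "", errors.New("refusing workstation-local endpoint for primary Evo X2 validation")
	}
	return "live", nil
}

func main() {
	mode, err := planRun(os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("mode:", mode)
}
```

Injecting `getenv` rather than calling `os.Getenv` inside the function lets a test cover all three outcomes without touching the process environment.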

Closed historical issues:

@@ -74,7 +80,7 @@ The phases in [ADR 0002](../adrs/0002-multi-persona-multi-format-extension.md) (
- SSH tunnels are allowed only as explicit developer diagnostics, not as the product default, because they depend on per-device SSH setup.
- Local llama.cpp (`http://127.0.0.1:8081/v1`) is fallback only. Do not set `LLM_BASE_URL` to local Ollama or local llama.cpp for Evo X2 validation unless the test is explicitly measuring fallback behavior.
- Runtime validation must report base URL, model, elapsed time, score, and draft length.
- Each implementation PR that touches interview, prompt, draft, or runtime behavior should add one scenario datapoint with a deliberately varied medium/persona/format. Do not force every PR to rerun every scenario; build averages by collecting one different slice per phase. Use `cmd/scenario/media_matrix` as the canonical matrix for final Note/Qiita/Zenn/Cor blog comparison.
- Each implementation PR that touches interview, prompt, draft, or runtime behavior should add one scenario datapoint with a deliberately varied medium/persona/format. If the PR touches question templates, the datapoint must come from the interview-template scenario rather than only draft generation. Do not force every PR to rerun every live scenario; build averages by collecting one different slice per phase. Use `cmd/scenario/media_matrix` as the canonical matrix for final Note/Qiita/Zenn/Cor blog comparison.
- Draft generation must run the lightweight final verification step before returning the final result; if verification reports NEEDS_REVIEW, surface the report instead of hiding it.
- If fallback validation fails the strict draft thresholds, keep Evo X2 primary enabled and track fallback hardening separately (Issue [#36](https://github.com/terisuke/note_maker/issues/36)).
- If Tailnet Evo X2 reaches the API but misses quality gates, track it under Issue [#40](https://github.com/terisuke/note_maker/issues/40), not as a transport regression.
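
The reporting rule above (base URL, model, elapsed time, score, and draft length) and the split between transport failures and quality misses can be sketched as a single record type. The field names, JSON keys, the 0 to 1 score scale, and the threshold value are assumptions for illustration. The project's actual aggregate format may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
	"unicode/utf8"
)

// RunRecord is a hypothetical shape for one runtime validation datapoint.
type RunRecord struct {
	BaseURL   string  `json:"base_url"`
	Model     string  `json:"model"`
	ElapsedMS int64   `json:"elapsed_ms"`
	Score     float64 `json:"score"`
	Runes     int     `json:"runes"`
	Failure   string  `json:"failure,omitempty"` // "", "transport", or "quality"
}

// classify separates "never reached the API" from "reached it but missed the gate".
func classify(reachedAPI bool, score, threshold float64) string {
	switch {
	case !reachedAPI:
		return "transport"
	case score < threshold:
		return "quality"
	default:
		return ""
	}
}

// newRecord measures draft length in runes, which suits Japanese text better than bytes.
func newRecord(baseURL, model string, elapsed time.Duration, draft string, score, threshold float64, reachedAPI bool) RunRecord {
	return RunRecord{
		BaseURL:   baseURL,
		Model:     model,
		ElapsedMS: elapsed.Milliseconds(),
		Score:     score,
		Runes:     utf8.RuneCountInString(draft),
		Failure:   classify(reachedAPI, score, threshold),
	}
}

func main() {
	// Placeholder endpoint and model name; the score and threshold are made-up inputs.
	rec := newRecord("http://evo-x2.example.ts.net:11434/v1", "example-model",
		90*time.Second, "こんにちは", 0.62, 0.80, true)
	out, err := json.MarshalIndent(rec, "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

A record whose `failure` is `"quality"` would be filed under #40, while `"transport"` points at connectivity rather than generation variance.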
37 changes: 23 additions & 14 deletions docs/implementation-plans/next-implementation-cut.md
@@ -30,6 +30,7 @@ Open and active:
- History UI and readable artifacts: [#27](https://github.com/terisuke/note_maker/issues/27), [#28](https://github.com/terisuke/note_maker/issues/28).
- Browser E2E coverage: [#13](https://github.com/terisuke/note_maker/issues/13).
- Runtime evaluation: [#40](https://github.com/terisuke/note_maker/issues/40).
- Runtime evaluation sub-issues: [#70](https://github.com/terisuke/note_maker/issues/70), [#71](https://github.com/terisuke/note_maker/issues/71), [#72](https://github.com/terisuke/note_maker/issues/72), [#73](https://github.com/terisuke/note_maker/issues/73), [#74](https://github.com/terisuke/note_maker/issues/74).
- Fallback and packaging follow-up: [#36](https://github.com/terisuke/note_maker/issues/36), [#45](https://github.com/terisuke/note_maker/issues/45), [#15](https://github.com/terisuke/note_maker/issues/15).
- Runtime defect fixed by this cut: [#63](https://github.com/terisuke/note_maker/issues/63) makes the plain web-app default match the intended Evo X2 Tailnet primary path and records the 2026-05-03 draft-generation 500 root cause.
- Documentation and DDD audit: [#64](https://github.com/terisuke/note_maker/issues/64), with details in [Runtime and DDD alignment audit](../validation/runtime-ui-ddd-audit-2026-05-03.md).
@@ -62,33 +63,41 @@ Each live run must record:

## Before the full Evo X2 media run

The three prerequisites before running the full multi-medium Evo X2 evaluation are now mostly in place:
The previous prerequisites are in place, but the 2026-05-03 live result exposed a missing layer in the validation plan. A draft-only media matrix cannot prove that the revised question templates are usable, because it starts from completed `ArticleBrief` fixtures.

1. **Persistence first**: #26 adds SQLite storage for sessions, briefs, source snapshots, drafts, verification, and section-regeneration versions. #61/#62 makes the storage driver visible and switchable from the settings UI, so users do not have to choose it only through make/env setup.
2. **Handler coverage gate**: #29 raises `internal/handlers` coverage to 80.0%, including SSE, edit/fork, template, regenerate-section, and SQLite driver selection paths.
3. **Scenario ownership**: #57 adds the reusable live runner/aggregate evaluator. #40 remains the owner for actual Evo X2 Tailnet quality results.
The runtime stabilization work is now split under epic #40:

| Order | Issue | Purpose | Done when |
|---:|---|---|---|
| 1 | [#70](https://github.com/terisuke/note_maker/issues/70) | Add an interview-template scenario | note/Cor blog/Zenn/Qiita/homepage questions and generated briefs differ by mode and remain small enough to answer |
| 2 | [#71](https://github.com/terisuke/note_maker/issues/71) | Preserve failed draft artifacts | unusable drafts still write raw output, failure JSON, elapsed time, endpoint, and model |
| 3 | [#72](https://github.com/terisuke/note_maker/issues/72) | Add bounded format repair | preamble leakage and Zenn/Qiita notation leakage get one strict repair retry without relaxing validators |
| 4 | [#73](https://github.com/terisuke/note_maker/issues/73) | Split scenario gates by output format | homepage uses short HTML gates while long-form media keep strict length/style gates |
| 5 | [#74](https://github.com/terisuke/note_maker/issues/74) | Re-run staged Evo X2 validation | one previously failing medium passes first, then the full note/Qiita/Zenn/Cor blog live matrix is rerun |
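
To make the #73 row concrete, the sketch below selects a gate by output format so that a short homepage HTML section is not judged by long-article length rules. The `Gate` type, the 2000-rune minimum, and the `<a ` substring check for a CTA are illustrative assumptions only. The real thresholds and CTA detection are not specified in this document.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Gate is a hypothetical set of acceptance rules for one output format.
type Gate struct {
	MinRunes   int
	RequireCTA bool
}

// gateFor returns short-form rules for homepage HTML and strict long-form rules otherwise.
// The numbers are placeholders, not the project's thresholds.
func gateFor(format string) Gate {
	if format == "homepage" {
		return Gate{MinRunes: 0, RequireCTA: true}
	}
	return Gate{MinRunes: 2000}
}

// check applies the gate for the given format and lists every failed rule.
func check(format, draft string) []string {
	g := gateFor(format)
	var failures []string
	if utf8.RuneCountInString(draft) < g.MinRunes {
		failures = append(failures, "too short for long-form gate")
	}
	if g.RequireCTA && !strings.Contains(draft, "<a ") {
		failures = append(failures, "missing CTA link")
	}
	return failures
}

func main() {
	section := `<section><h2>Contact</h2><a href="/contact">Talk to us</a></section>`
	fmt.Println("homepage failures:", check("homepage", section))
	fmt.Println("zenn failures:", check("zenn", section))
}
```

Keeping the gate lookup in one function means each scenario result can record which gate was applied, so a homepage pass is not mistaken for a long-form pass.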

## Parallel implementation plan

Use subagents with disjoint write scopes:

| Lane | Issue | Subagent role | Write scope | Done when |
|---|---|---|---|---|
| A | [#27](https://github.com/terisuke/note_maker/issues/27) / [#28](https://github.com/terisuke/note_maker/issues/28) | History/artifact UI worker | `static/*`, read APIs for projects/sessions/drafts once exposed | persona/session picker and human-readable brief/style cards use persisted state |
| B | [#13](https://github.com/terisuke/note_maker/issues/13) | Browser E2E worker | browser tests and fixtures | persona/format switching, edit/fork, streaming, regenerate-section, and legacy localStorage migration are covered |
| C | [#40](https://github.com/terisuke/note_maker/issues/40) | Scenario metrics worker | `docs/validation/*`, live run artifacts | media-matrix live runner records endpoint/model/elapsed/score/runes/verification in aggregate JSON/Markdown for actual Evo X2 runs |
| A | [#70](https://github.com/terisuke/note_maker/issues/70) | Template scenario worker | `cmd/scenario/*`, `internal/domain/brief/*`, validation docs | question-template usability is measured before draft-only live runs |
| B | [#71](https://github.com/terisuke/note_maker/issues/71) / [#72](https://github.com/terisuke/note_maker/issues/72) | Draft recovery worker | `internal/application/draft/*`, `internal/domain/article/*`, scenario output paths | failed drafts are diagnosable and recoverable format errors get one repair attempt |
| C | [#73](https://github.com/terisuke/note_maker/issues/73) | Scenario gate worker | `cmd/scenario/*`, validation docs | long-form and homepage gates are explicit and recorded |
| D | [#27](https://github.com/terisuke/note_maker/issues/27) / [#28](https://github.com/terisuke/note_maker/issues/28) | History/artifact UI worker | `static/*`, read APIs for projects/sessions/drafts once exposed | persona/session picker and human-readable brief/style cards use persisted state |
| E | [#13](https://github.com/terisuke/note_maker/issues/13) | Browser E2E worker | browser tests and fixtures | persona/format switching, edit/fork, streaming, regenerate-section, and legacy localStorage migration are covered |

Lane A and Lane B can run immediately in parallel. Lane C can start by implementing offline/resumable runner mechanics now, but the full multi-case Evo X2 run should wait until Lane A provides persistence or until the user explicitly wants a one-off artifact-file run.
Lanes A, B, and C can run in parallel if their write scopes stay separate. Lanes D and E can continue in parallel with them as long as they do not touch the same frontend files.

## Recommended order

1. Browser-check the #66/#68 setup with note, one technical format, and Cor company blog. Confirm both the question template and `文体ソース` default change before spending Evo X2 runtime.
2. Run one bounded Evo X2 live case through #57 and attach it to #40 to verify the runner with real latency/score data.
3. Start #27 and #28 in parallel so persisted sessions, guides, and draft artifacts become visible in the web app.
4. Start #13 once the history/artifact UI has enough stable browser surface.
5. Run the full note/Qiita/Zenn/company-blog media matrix under #40.
1. Implement #70 first. This proves the revised questions and generated briefs before any more expensive live draft runs.
2. Implement #71/#72/#73 in parallel where possible. These directly address the failures observed on 2026-05-03.
3. Run one bounded Evo X2 live case from a previously failing medium, not the already-passing note case.
4. Start or continue #27/#28 so expensive live outputs can be viewed and reused from the web app.
5. Run the full note/Qiita/Zenn/company-blog matrix under #74, then update #40 with the aggregate.
6. Keep #36/#45 as fallback/runtime P2 work and #15 as packaging after persistence/history are usable.
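
The staged rerun under #74 (template scenario, offline media matrix, one previously failing live case, full live comparison) amounts to a pipeline that stops at the first failing stage, so expensive Evo X2 time is spent only after the cheaper stages pass. The following is a sketch under that reading. `Stage`, `runStaged`, and the stage names are hypothetical, and the stage bodies are stubs.

```go
package main

import (
	"errors"
	"fmt"
)

// Stage is a hypothetical named validation step.
type Stage struct {
	Name string
	Run  func() error
}

// errStage is a stub failure used to demonstrate early stopping.
var errStage = errors.New("gate not met")

// runStaged executes stages in order and stops at the first failure,
// returning the names of the stages that completed.
func runStaged(stages []Stage) ([]string, error) {
	var completed []string
	for _, s := range stages {
		if err := s.Run(); err != nil {
			return completed, fmt.Errorf("stage %q failed: %w", s.Name, err)
		}
		completed = append(completed, s.Name)
	}
	return completed, nil
}

func main() {
	pass := func() error { return nil }
	fail := func() error { return errStage }
	stages := []Stage{
		{Name: "template-scenario", Run: pass},
		{Name: "offline-media-matrix", Run: pass},
		{Name: "one-failing-live-case", Run: fail},
		{Name: "full-live-comparison", Run: pass},
	}
	completed, err := runStaged(stages)
	fmt.Println("completed:", completed)
	if err != nil {
		fmt.Println("stopped:", err)
	}
}
```

In this stubbed run the full live comparison is never started because the single live case fails first, which is the cost-saving behavior the staged order is meant to provide.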

## Why not run the full Evo X2 matrix now?

The source and prompt matrix is ready, but full Evo X2 draft generation is expensive and can take 20+ minutes per run. Running all media cases before persistence would produce useful files but not durable product memory. The better sequence is to make the system capable of storing those expensive results, then use #40 to evaluate one varied slice per phase and finally run the full comparison table.
The source and prompt matrix is ready, but full Evo X2 draft generation is expensive and can take 20+ minutes per run. The 2026-05-03 full run also showed that draft-only evaluation can miss whether interview templates are actually usable. The better sequence is to prove the question-to-brief layer first, preserve failed outputs, repair recoverable format mistakes, then use #40/#74 to evaluate one varied failing slice before the full comparison table.