feat(loops): runStrategyEvolution — population × multi-generation strategy search#223
- promotionGate: the statistical promotion decision as a package primitive: seeded paired bootstrap (agent-eval heldoutSignificance) over per-task holdout deltas, deterministic verdict, minimum-evidence floor (6 paired tasks), CI lower bound must clear the threshold. Replaces the bench-local unseeded pairedBootstrap, whose verdict varied from run to run.
- authorStrategy: named fallbackModel retry (one attempt when the primary fails or returns no code block); temperature/maxTokens are now passed through.
- assertAuthoredCodeSafe -> assertStrategyContract: the lint enforces the harness's measurement invariants (author blindness + conserved dose) at the module boundary; the docstring now says so in those terms.
- bench: strategy-author.mts drops its duplicate authorStrategy/contractDoc and becomes the R0->R2 ladder CLI over the package primitive; flywheel-run authors and gates through the package; authored run artifacts are gitignored and excluded from typecheck.
- tests: regression coverage for harness-verified scoring, the empty-messages rule, the contract lint, and the gate's determinism/floor.
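The gate described in the first bullet can be illustrated with a minimal sketch. This is not the package's promotionGate or agent-eval's heldoutSignificance: the function name, option names, the default of 2000 resamples, the two-sided 95% percentile interval, and the mulberry32 generator are all assumptions chosen for illustration. Only the 6-pair floor, the seeding, and the "CI lower bound must clear the threshold" rule come from the commit message.

```typescript
// Hypothetical sketch of a seeded paired-bootstrap gate; names and defaults are illustrative.
type SketchVerdict = "PROMOTE" | "HOLD";

interface SketchGateResult {
  verdict: SketchVerdict;
  meanDelta: number;
  ciLower: number;
  pairs: number;
}

// mulberry32: a small deterministic PRNG, so the same seed yields the same resamples.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function sketchPromotionGate(
  deltas: number[], // per-task holdout delta: candidate score minus incumbent score
  opts: { seed?: number; resamples?: number; threshold?: number; minPairs?: number; alpha?: number } = {},
): SketchGateResult {
  const { seed = 1, resamples = 2000, threshold = 0, minPairs = 6, alpha = 0.05 } = opts;
  const n = deltas.length;
  const mean = (xs: number[]): number => xs.reduce((s, x) => s + x, 0) / xs.length;

  // Minimum-evidence floor: too few paired tasks can never promote.
  if (n < minPairs) {
    return { verdict: "HOLD", meanDelta: n > 0 ? mean(deltas) : 0, ciLower: Number.NEGATIVE_INFINITY, pairs: n };
  }

  // Resample the paired deltas with replacement and collect the resampled means.
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  // Percentile lower bound of the (1 - alpha) interval; it must clear the threshold.
  const ciLower = means[Math.floor((alpha / 2) * resamples)];
  return { verdict: ciLower > threshold ? "PROMOTE" : "HOLD", meanDelta: mean(deltas), ciLower, pairs: n };
}
```

Because the generator is seeded, two calls with the same deltas and seed return the same bound and verdict, which is the property the unseeded bench-local bootstrap lacked.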
- strategyAuthorContract documents ShotSpec.persona: the LLM author can now write multi-agent strategies (researcher/engineer hand-offs, persona panels) over the same conserved budget; previously the suite's own multi-agent primitive was invisible to the authored path.
- AuthorStrategyOptions.contract: caller-supplied contract text, making the author prompt itself a gateable optimization coordinate.
- AgenticOptions.analystModel: the critic can run on a different model than the worker (stronger critic, cheaper worker).
- BenchmarkConfig.hooks: RuntimeHooks pass through runBenchmark to every cell's runAgentic (the watchdog/route-auditor seam was unreachable from the benchmark path).
- vitest excludes .claude/worktrees/** (worktree agents' copies were swept into the root test run).
- tests: persona-in-contract pin, analystModel routing, hooks pass-through, contract override.
The convergence onto the package authorStrategy dropped the transport-level max_tokens the bench client sent by default; deepseek-v4-pro returns EMPTY content on the authoring prompt without it (reproduced), and with it can still hit the edge 524 on a long generation. maxTokens restored at the call sites; the fallback default becomes deepseek-v4-flash — fast enough to clear both failure modes (verified: authors a loadable strategy with and without maxTokens).
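The two failure modes above (empty content, or a transport error on a long generation) are what a single fallback retry is meant to absorb. A minimal sketch under stated assumptions: CompleteFn, sketchAuthorWithFallback, and the option names are hypothetical stand-ins, not the package's authorStrategy signature, and the code-block extraction is deliberately simplistic.

```typescript
// Hypothetical transport signature; the real client is not shown in this PR text.
type CompleteFn = (
  model: string,
  prompt: string,
  opts: { temperature?: number; maxTokens?: number },
) => Promise<string>;

// Built programmatically so this example can itself live inside a fenced block.
const FENCE = "`".repeat(3);
const CODE_BLOCK = new RegExp(FENCE + "[a-zA-Z]*\\n([\\s\\S]*?)" + FENCE);

// Returns the first fenced code block's body, or null if the reply has none.
function extractCodeBlock(text: string): string | null {
  const match = CODE_BLOCK.exec(text);
  return match !== null && match[1].trim().length > 0 ? match[1] : null;
}

async function sketchAuthorWithFallback(
  complete: CompleteFn,
  prompt: string,
  opts: { primaryModel: string; fallbackModel?: string; temperature?: number; maxTokens?: number },
): Promise<{ code: string; model: string }> {
  // Primary first, then at most one attempt on the fallback model.
  const models = opts.fallbackModel ? [opts.primaryModel, opts.fallbackModel] : [opts.primaryModel];
  let lastError: unknown = new Error("no model attempted");
  for (const model of models) {
    try {
      // temperature and maxTokens are passed through to the transport on every attempt.
      const text = await complete(model, prompt, { temperature: opts.temperature, maxTokens: opts.maxTokens });
      const code = extractCodeBlock(text);
      if (code !== null) return { code, model };
      lastError = new Error(model + " returned no code block");
    } catch (err) {
      lastError = err; // a transport failure also triggers the fallback
    }
  }
  throw lastError;
}
```

Returning the model that actually produced the code keeps the retry visible to whatever records the authored artifact.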
… available
Installed/dev pin to ^0.89.0 (bench too); the peer floor stays >=0.83.0 because the package consumes no 0.89-only API yet. The deltas this unlocks for follow-ups: anytime-valid sequential gates (multi-generation promotion), preflightModels (operator preflight), experimentTracker/CostLedger/partitionHeldOut (evolution bookkeeping).
…ategy search
The multi-generation closed loop as a package primitive: per generation the system authors POP candidate strategies from the latest tournament's losses, plays them against the incumbent at equal budget, and advances a champion; ONE promotion decision runs on a never-before-used holdout slice through promotionGate (no adaptive reuse of evaluation data enters the verdict).
- selectChampion: search-side champion policy, 'score' or 'costAware' (scores within epsilon are ties; the cheapest wins; a scalar champion rule hid a Pareto-frontier authored strategy in the first clean run).
- Archive nodes carry lineage (generation, parent) so a descendant-productivity parent selector can land without a schema change; every authored artifact records its gzip bits (description-length analysis comes free with every run).
- Author failures are recorded per candidate and survivors still compete; an all-failed evolution throws.
- bench/src/flywheel-evolve.mts: the EOPS runner (GENS/POP/CHAMPION envs).
- vitest aliases the package's own /loops subpath so authored modules dynamically import under test.
- chore: biome 2.4.16 mechanical fixes from the dep refresh.
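The 'costAware' tie rule in the selectChampion bullet can be sketched as follows. The Contender shape, the function name, and the epsilon default of 0.01 are assumptions for illustration; the PR text does not give the real signature or the epsilon value.

```typescript
// Hypothetical contender shape: one tournament entry with its score and spend.
interface Contender {
  name: string;
  score: number;
  cost: number;
}

type SketchChampionPolicy = "score" | "costAware";

function sketchSelectChampion(
  contenders: Contender[],
  policy: SketchChampionPolicy = "costAware",
  epsilon = 0.01, // assumed tie width; not taken from the PR
): Contender {
  if (contenders.length === 0) throw new Error("no contenders to select from");

  // Highest score; on exact ties the earlier entry is kept.
  const best = contenders.reduce((a, b) => (b.score > a.score ? b : a));
  if (policy === "score") return best;

  // costAware: everything within epsilon of the best score counts as tied,
  // and the cheapest of the tied entries wins.
  const tied = contenders.filter((c) => best.score - c.score <= epsilon);
  return tied.reduce((a, b) => (b.cost < a.cost ? b : a));
}
```

With a 'score' policy a strategy that matches the incumbent at a fraction of the cost never surfaces; the tie band is what lets a Pareto-frontier candidate advance in the search. The promotion verdict itself still belongs to the gate.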
tangletools
left a comment
✅ Auto-approved PR — 1369e473
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T12:19:41Z
…safe labels, honest telemetry
- compactLosses replaces the pretty-printed prefix slice: the author now sees EVERY train task (a prefix slice hid the tail, biasing which failure modes the author could target).
- The name-collision wrapper renames the returned agent AND its deliverable mode label, so report keys and observability labels can no longer diverge for a renamed candidate; regression test added.
- trajectory documented as search telemetry (per-generation re-measurements, unpaired across generations); the verdict is the only evidence-grade comparison. Archive score documented as 0-until-first-tournament.
tangletools
left a comment
✅ Auto-approved PR — f06eb4f7
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T12:25:08Z
tangletools
left a comment
✅ Auto-approved PR — c0d647a9
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T12:45:00Z
Stacked on #221 (← #220). Retarget as the stack merges.
What
The multi-generation closed loop as a package primitive — the evidence-justified next mechanism after Clean Run #1 (HOLD verdict: a single candidate in a single generation couldn't displace a strong baseline, and a scalar champion rule hid a Pareto-frontier authored strategy).
- runStrategyEvolution (src/runtime/strategy-evolution.ts): gen0 baselines → per generation, author POP candidates from the latest tournament's losses → candidates vs incumbent at equal budget → champion advances → ONE promotion decision on a never-before-used holdout slice through promotionGate. The no-adaptive-reuse rule is structural: the holdout offset is drawn past the train slice only after all authoring is done.
- selectChampion: search-side champion policy, 'score' or 'costAware' (default): scores within ε are ties, the cheapest wins. Search selection never owns the verdict; the gate does.
- Archive nodes carry lineage (generation, parent) so a descendant-productivity parent selector (the HGM idea) can land without a schema change; every authored artifact records gzip bits, so description-length-vs-holdout-gap (E2) is analyzable from any run's artifact.
- bench/src/flywheel-evolve.mts: the EOPS runner (GENS/POP/CHAMPION/HOLDOUT_OFFSET envs), maxTokens 8192 + flash fallback baked in.

Tests (6 new; suite 708 + 1 skipped)
The flagship test runs the complete cycle in-process: a scripted author emits a depth strategy, it displaces sample, and promotes through the real seeded gate on a disjoint slice. Also covered: per-candidate failure recording, all-failed throws, costAware tie-breaking, the divergence instruction reaching the author, and the train/holdout offset-disjointness assertion.

Verification
pnpm run typecheck ✓, pnpm run lint ✓ (13 pre-existing warnings), pnpm test 708 ✓, cd bench && npx tsc --noEmit ✓.
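The no-adaptive-reuse rule described under What (the holdout offset is drawn past the train slice) comes down to a disjointness check on two index ranges. A minimal sketch, assuming tasks are addressed by offset and size; the TaskSlice shape and function names are hypothetical and not taken from strategy-evolution.ts.

```typescript
// Hypothetical slice over an ordered task list: [offset, offset + size).
interface TaskSlice {
  offset: number;
  size: number;
}

// Two half-open ranges overlap when each starts before the other ends.
function slicesOverlap(a: TaskSlice, b: TaskSlice): boolean {
  return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}

// Picks the holdout slice after authoring is finished. By default it starts at the
// first index past the train slice; an explicit offset is accepted only if disjoint.
function sketchHoldoutSlice(train: TaskSlice, holdoutSize: number, requestedOffset?: number): TaskSlice {
  const firstFree = train.offset + train.size;
  const holdout: TaskSlice = { offset: requestedOffset ?? firstFree, size: holdoutSize };
  if (slicesOverlap(train, holdout)) {
    throw new Error(
      "holdout [" + holdout.offset + ", " + (holdout.offset + holdout.size) +
        ") overlaps train [" + train.offset + ", " + firstFree + ")",
    );
  }
  return holdout;
}
```

Failing loudly on overlap, rather than silently shifting the offset, keeps a misconfigured run from producing a verdict on tasks the author has already seen.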