feat(loops): runStrategyEvolution — population × multi-generation strategy search#223

Merged
drewstone merged 9 commits into main from feat/strategy-evolution on Jun 10, 2026

@drewstone

Stacked on #221 (← #220). Retarget as the stack merges.

What

The multi-generation closed loop as a package primitive: the evidence-justified next mechanism after Clean Run #1. That run ended in a HOLD verdict for two reasons: a single candidate in a single generation couldn't displace a strong baseline, and a scalar champion rule hid a Pareto-frontier authored strategy.

  • runStrategyEvolution (src/runtime/strategy-evolution.ts): gen0 baselines → per generation, author POP candidates from the latest tournament's losses → candidates vs incumbent at equal budget → champion advances → ONE promotion decision on a never-before-used holdout slice through promotionGate. The no-adaptive-reuse rule is structural: the holdout offset is drawn past the train slice only after all authoring is done.
  • selectChampion: search-side champion policy, either 'score' or 'costAware' (the default). Under 'costAware', scores within ε of each other are ties and the cheapest candidate wins. Search selection never owns the verdict; the gate does.
  • Lineage + MDL built in: archive nodes carry (generation, parent) so a descendant-productivity parent selector (the HGM idea) can land without schema change; every authored artifact records gzip bits, so description-length-vs-holdout-gap (E2) is analyzable from any run's artifact.
  • Author resilience: per-candidate failures recorded (never silent), survivors compete, all-failed evolution throws.
  • bench/src/flywheel-evolve.mts: the EOPS runner (GENS/POP/CHAMPION/HOLDOUT_OFFSET envs), maxTokens 8192 + flash fallback baked in.
  • deps: agent-eval → 0.89 (typecheck + 708 tests green with zero changes; peer floor unchanged at >=0.83 since no 0.89-only API is consumed — the sequential e-process gates are the noted upgrade path for per-generation gating).
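
The costAware tie-breaking rule in the list above can be sketched as follows. This is a minimal illustration, not the package's selectChampion: the `CandidateResult` shape, the `selectChampionSketch` name, and the default epsilon of 0.01 are assumptions made for the example.

```typescript
// Hypothetical result shape; the package's real types may differ.
interface CandidateResult {
  name: string;
  score: number; // higher is better
  cost: number; // spend at the shared budget, in any consistent unit
}

type ChampionPolicy = "score" | "costAware";

function selectChampionSketch(
  results: CandidateResult[],
  policy: ChampionPolicy = "costAware",
  epsilon = 0.01, // assumed tie width, not taken from the PR
): CandidateResult {
  if (results.length === 0) throw new Error("no candidates to select from");
  // Highest raw score; the first entry wins exact score ties.
  const best = results.reduce((a, b) => (b.score > a.score ? b : a));
  if (policy === "score") return best;
  // costAware: everything within epsilon of the top score is tied,
  // and the cheapest of the tied candidates wins.
  const tied = results.filter((r) => best.score - r.score <= epsilon);
  return tied.reduce((a, b) => (b.cost < a.cost ? b : a));
}
```

With a more expensive candidate that leads by less than epsilon, 'score' picks the leader while 'costAware' picks the cheaper near-tie, which is the case a scalar rule would hide.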

Tests (6 new; suite 708 + 1 skipped)

The flagship test runs the complete cycle in-process: a scripted author emits a depth strategy, it displaces sample, and promotes through the real seeded gate on a disjoint slice — plus: per-candidate failure recording, all-failed throws, costAware tie-breaking, the divergence instruction reaching the author, and the train/holdout offset-disjointness assertion.
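
The train/holdout offset-disjointness property can be pictured with a small sketch. It assumes each slice is a contiguous window over an ordered task list; the `TaskSlice` shape and both function names are hypothetical and are not the package's API.

```typescript
// Hypothetical slice shape: the window [offset, offset + count) over the task list.
interface TaskSlice {
  offset: number;
  count: number;
}

// Two half-open windows overlap when each starts before the other ends.
function slicesOverlap(a: TaskSlice, b: TaskSlice): boolean {
  return a.offset < b.offset + b.count && b.offset < a.offset + a.count;
}

// Draw the holdout window at or past the end of the train window.
// An explicit offset (for example from an env override) is accepted only
// if it still lands past the train slice.
function drawHoldoutSlice(
  train: TaskSlice,
  holdoutCount: number,
  requestedOffset?: number,
): TaskSlice {
  if (holdoutCount <= 0) throw new Error("holdout slice must be non-empty");
  const trainEnd = train.offset + train.count;
  const holdout: TaskSlice = { offset: requestedOffset ?? trainEnd, count: holdoutCount };
  if (holdout.offset < trainEnd || slicesOverlap(train, holdout)) {
    throw new Error("holdout slice overlaps the train slice");
  }
  return holdout;
}
```

Calling `drawHoldoutSlice` only after the last authoring step is what would make the no-adaptive-reuse rule structural in this sketch: no authored candidate can have seen a holdout task.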

Verification

pnpm run typecheck ✓, pnpm run lint ✓ (13 pre-existing warnings), pnpm test 708 ✓, cd bench && npx tsc --noEmit ✓.

- promotionGate: the statistical promotion decision as a package primitive —
  seeded paired bootstrap (agent-eval heldoutSignificance) over per-task
  holdout deltas, deterministic verdict, minimum-evidence floor (6 paired
  tasks), CI lower bound must clear the threshold. Replaces the bench-local
  unseeded pairedBootstrap whose verdict varied re-run to re-run.
- authorStrategy: named fallbackModel retry (one attempt when the primary
  fails or returns no code block), temperature/maxTokens now passed through.
- assertAuthoredCodeSafe -> assertStrategyContract: the lint enforces the
  harness's measurement invariants (author blindness + conserved dose) at the
  module boundary; docstring now says so in those terms.
- bench: strategy-author.mts drops its duplicate authorStrategy/contractDoc
  and becomes the R0->R2 ladder CLI over the package primitive; flywheel-run
  authors and gates through the package; authored run artifacts gitignored
  and excluded from typecheck.
- tests: regression coverage for harness-verified scoring, the
  empty-messages rule, the contract lint, and the gate's determinism/floor.
- strategyAuthorContract documents ShotSpec.persona — the LLM author can now
  write multi-agent strategies (researcher/engineer hand-offs, persona panels)
  over the same conserved budget; previously the suite's own multi-agent
  primitive was invisible to the authored path.
- AuthorStrategyOptions.contract — caller-supplied contract text, making the
  author prompt itself a gateable optimization coordinate.
- AgenticOptions.analystModel — the critic can run on a different model than
  the worker (stronger critic, cheaper worker).
- BenchmarkConfig.hooks — RuntimeHooks pass through runBenchmark to every
  cell's runAgentic (the watchdog/route-auditor seam was unreachable from the
  benchmark path).
- vitest excludes .claude/worktrees/** (worktree agents' copies were swept
  into the root test run).
- tests: persona-in-contract pin, analystModel routing, hooks pass-through,
  contract override.
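
The promotionGate entry above (seeded paired bootstrap, evidence floor, CI lower bound against a threshold) can be sketched as below. This does not call agent-eval's heldoutSignificance; it substitutes a hand-rolled mulberry32 PRNG and a plain percentile bootstrap so the example is self-contained. The function name, option names, and the defaults for threshold, resample count, and alpha are assumptions; only the floor of 6 paired tasks comes from the commit message.

```typescript
// Small seeded PRNG (mulberry32) so the same seed yields the same resamples.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface GateVerdictSketch {
  promote: boolean;
  reason: string;
  ciLower?: number;
}

interface GateOptionsSketch {
  seed: number;
  threshold?: number; // assumed default: 0
  minPairs?: number; // floor of 6 paired tasks, as stated in the commit message
  resamples?: number; // assumed default: 2000
  alpha?: number; // assumed default: 0.05 (two-sided)
}

// deltas[i] = candidate score minus incumbent score on holdout task i.
function promotionGateSketch(deltas: number[], opts: GateOptionsSketch): GateVerdictSketch {
  const { seed, threshold = 0, minPairs = 6, resamples = 2000, alpha = 0.05 } = opts;
  if (deltas.length < minPairs) return { promote: false, reason: "insufficient-evidence" };
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    // Resample tasks with replacement; pairing is kept because we resample deltas.
    for (let i = 0; i < deltas.length; i++) {
      sum += deltas[Math.floor(rand() * deltas.length)];
    }
    means.push(sum / deltas.length);
  }
  means.sort((x, y) => x - y);
  const ciLower = means[Math.floor((alpha / 2) * resamples)];
  return ciLower > threshold
    ? { promote: true, reason: "ci-lower-clears-threshold", ciLower }
    : { promote: false, reason: "ci-lower-below-threshold", ciLower };
}
```

Because the only randomness comes from the seeded generator, re-running the sketch on the same deltas and seed reproduces the same verdict, which is the property the unseeded bench-local bootstrap lacked.
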
The convergence onto the package authorStrategy dropped the transport-level
max_tokens the bench client sent by default; deepseek-v4-pro returns EMPTY
content on the authoring prompt without it (reproduced), and with it can
still hit the edge 524 on a long generation. maxTokens restored at the call
sites; the fallback default becomes deepseek-v4-flash — fast enough to clear
both failure modes (verified: authors a loadable strategy with and without
maxTokens).
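
The fallback behavior described above (one retry on a named fallback model when the primary fails or returns no code block) might look roughly like this. The `Complete` callback, `authorWithFallback`, and `extractCodeBlock` are hypothetical names for the sketch, and the model transport is injected so nothing here depends on a real client.

```typescript
// Hypothetical transport: resolves to the model's raw text for a prompt.
type Complete = (model: string, prompt: string) => Promise<string>;

// Built at runtime so this example never contains a literal code fence.
const FENCE = "`".repeat(3);
const CODE_BLOCK = new RegExp(FENCE + "[\\w-]*\\n([\\s\\S]*?)" + FENCE);

// Return the first fenced block's body, or null when there is none or it is empty.
function extractCodeBlock(text: string): string | null {
  const body = CODE_BLOCK.exec(text)?.[1]?.trim();
  return body ? body : null;
}

async function authorWithFallback(
  complete: Complete,
  prompt: string,
  primaryModel: string,
  fallbackModel?: string,
): Promise<{ model: string; code: string }> {
  // A thrown transport error and an empty or fence-less reply are treated alike.
  const attempt = async (model: string): Promise<string | null> => {
    try {
      return extractCodeBlock(await complete(model, prompt));
    } catch {
      return null;
    }
  };
  const primary = await attempt(primaryModel);
  if (primary !== null) return { model: primaryModel, code: primary };
  if (fallbackModel !== undefined) {
    // Exactly one fallback attempt, no further retries.
    const fallback = await attempt(fallbackModel);
    if (fallback !== null) return { model: fallbackModel, code: fallback };
  }
  throw new Error("author returned no code block on primary or fallback");
}
```

Returning the model name alongside the code lets the caller record which model actually authored a candidate.
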
… available

Installed/dev pin to ^0.89.0 (bench too); the peer floor stays >=0.83.0
because the package consumes no 0.89-only API yet. The deltas this unlocks
for follow-ups: anytime-valid sequential gates (multi-generation promotion),
preflightModels (operator preflight), experimentTracker/CostLedger/
partitionHeldOut (evolution bookkeeping).
…ategy search

The multi-generation closed loop as a package primitive: per generation the
system authors POP candidate strategies from the latest tournament's losses,
plays them against the incumbent at equal budget, and advances a champion;
ONE promotion decision runs on a never-before-used holdout slice through
promotionGate (no adaptive reuse of evaluation data enters the verdict).

- selectChampion: search-side champion policy — 'score' or 'costAware'
  (scores within epsilon are ties; the cheapest wins — a scalar champion
  rule hid a Pareto-frontier authored strategy in the first clean run).
- Archive nodes carry lineage (generation, parent) so a descendant-
  productivity parent selector can land without a schema change; every
  authored artifact records its gzip bits (description-length analysis
  comes free with every run).
- Author failures are recorded per candidate and survivors still compete;
  an all-failed evolution throws.
- bench/src/flywheel-evolve.mts: the EOPS runner (GENS/POP/CHAMPION envs).
- vitest aliases the package's own /loops subpath so authored modules
  dynamically import under test.
- chore: biome 2.4.16 mechanical fixes from the dep refresh.
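
The lineage and description-length bookkeeping in the list above can be illustrated with a short sketch. The `ArchiveNodeSketch` shape and both helpers are assumptions, not the package's schema, and counting descendants is only a crude stand-in for a real descendant-productivity selector.

```typescript
import { gzipSync } from "node:zlib";

// Hypothetical archive node: lineage is just (generation, parent).
interface ArchiveNodeSketch {
  id: string;
  generation: number;
  parent: string | null; // null for gen0 baselines
  gzipBits: number; // description length of the authored source
  score: number; // 0 until the node has played its first tournament
}

// Description length as the gzip-compressed size of the source, in bits.
function gzipBitsOf(source: string): number {
  return gzipSync(Buffer.from(source, "utf8")).length * 8;
}

// Count descendants per node by walking each node's parent chain.
// Assumes the lineage is acyclic, which holds when parents always
// come from an earlier generation.
function descendantCounts(nodes: ArchiveNodeSketch[]): Map<string, number> {
  const byId = new Map<string, ArchiveNodeSketch>();
  const counts = new Map<string, number>();
  for (const node of nodes) {
    byId.set(node.id, node);
    counts.set(node.id, 0);
  }
  for (const node of nodes) {
    let ancestor = node.parent;
    while (ancestor !== null) {
      counts.set(ancestor, (counts.get(ancestor) ?? 0) + 1);
      ancestor = byId.get(ancestor)?.parent ?? null;
    }
  }
  return counts;
}
```

Because the parent pointer already lives on each node, a selector built on these counts (or on descendant scores) needs no new fields.
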

@tangletools left a comment

✅ Auto-approved PR — 1369e473

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T12:19:41Z

…safe labels, honest telemetry

- compactLosses replaces the pretty-printed prefix slice: the author now sees
  EVERY train task (a prefix slice hid the tail, biasing which failure modes
  the author could target).
- The name-collision wrapper renames the returned agent AND its deliverable
  mode label — report keys and observability labels can no longer diverge
  for a renamed candidate; regression test added.
- trajectory documented as search telemetry (per-generation re-measurements,
  unpaired across generations); the verdict is the only evidence-grade
  comparison. Archive score documented as 0-until-first-tournament.
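
The compactLosses change above is easiest to see next to the prefix slice it replaced. The sketch below is an illustration under assumptions: the `TaskLoss` shape, the tab-separated line format, and the per-note character cap are invented for the example and are not the package's format.

```typescript
// Hypothetical per-task loss record.
interface TaskLoss {
  taskId: string;
  score: number;
  note: string; // free-text failure description
}

// The replaced approach: pretty-print everything, then cut at a character budget.
// Tasks near the end of the list can disappear from the author's view entirely.
function prefixSliceLosses(losses: TaskLoss[], maxChars: number): string {
  return JSON.stringify(losses, null, 2).slice(0, maxChars);
}

// Compact alternative: one line per task, trimming each note instead of the list,
// so every train task stays visible to the author.
function compactLossesSketch(losses: TaskLoss[], maxNoteChars = 80): string {
  return losses
    .map((l) => {
      const note = l.note.replace(/\s+/g, " ").trim().slice(0, maxNoteChars);
      return `${l.taskId}\t${l.score.toFixed(2)}\t${note}`;
    })
    .join("\n");
}
```

The trade is per-task detail for coverage: each note is shortened, but no task is dropped.
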

tangletools previously approved these changes Jun 10, 2026

@tangletools left a comment

✅ Auto-approved PR — f06eb4f7

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T12:25:08Z

@drewstone changed the base branch from feat/author-surface-unlocks to main on June 10, 2026 12:44
@drewstone dismissed tangletools’s stale review on June 10, 2026 12:44

The base branch was changed.

@tangletools left a comment

✅ Auto-approved PR — c0d647a9

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T12:45:00Z

@drewstone merged commit 8c41f59 into main on Jun 10, 2026
1 check passed
@drewstone deleted the feat/strategy-evolution branch on June 10, 2026 12:47