feat(loops): runStrategyEvolution — population × multi-generation strategy search by drewstone · Pull Request #223 · tangle-network/agent-runtime

drewstone · 2026-06-10T12:19:35Z

Stacked on #221 (← #220). Retarget as the stack merges.

What

The multi-generation closed loop as a package primitive — the evidence-justified next mechanism after Clean Run #1 (HOLD verdict: a single candidate in a single generation couldn't displace a strong baseline, and a scalar champion rule hid a Pareto-frontier authored strategy).

- runStrategyEvolution (src/runtime/strategy-evolution.ts): gen0 baselines → per generation, author POP candidates from the latest tournament's losses → candidates vs incumbent at equal budget → champion advances → ONE promotion decision on a never-before-used holdout slice through promotionGate. The no-adaptive-reuse rule is structural: the holdout offset is drawn past the train slice only after all authoring is done.
- selectChampion: search-side champion policy, either 'score' or 'costAware' (default). Under 'costAware', scores within ε are ties and the cheapest wins. Search selection never owns the verdict; the gate does.
- Lineage + MDL built in: archive nodes carry (generation, parent) so a descendant-productivity parent selector (the HGM idea) can land without a schema change; every authored artifact records gzip bits, so description-length-vs-holdout-gap (E2) is analyzable from any run's artifact.
- Author resilience: per-candidate failures are recorded (never silent), survivors compete, and an all-failed evolution throws.
- bench/src/flywheel-evolve.mts: the EOPS runner (GENS/POP/CHAMPION/HOLDOUT_OFFSET envs), with maxTokens 8192 and the flash fallback baked in.
- deps: agent-eval → 0.89 (typecheck + 708 tests green with zero changes; peer floor unchanged at >=0.83 since no 0.89-only API is consumed; the sequential e-process gates are the noted upgrade path for per-generation gating).
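The 'costAware' rule described above can be illustrated with a minimal sketch. The `Contender` shape, the `pickChampion` name, and the default `epsilon` are assumptions made for illustration; they are not the package's actual `selectChampion` signature.

```typescript
// Hypothetical shapes for illustration; the real types in
// src/runtime/strategy-evolution.ts may differ.
interface Contender {
  name: string;
  score: number; // higher is better
  cost: number; // lower is better
}

type ChampionPolicy = "score" | "costAware";

// Pick a search-side champion. This only steers the search; the promotion
// verdict is still left to a separate gate.
function pickChampion(
  contenders: Contender[],
  policy: ChampionPolicy = "costAware",
  epsilon = 0.01, // assumed tie width, not taken from the PR
): Contender {
  if (contenders.length === 0) throw new Error("no contenders");
  const best = contenders.reduce((a, b) => (b.score > a.score ? b : a));
  if (policy === "score") return best;
  // costAware: everyone within epsilon of the top score is tied; cheapest wins.
  const tied = contenders.filter((c) => best.score - c.score <= epsilon);
  return tied.reduce((a, b) => (b.cost < a.cost ? b : a));
}
```

With a pure 'score' policy, a slightly higher-scoring but much more expensive strategy always wins; the tie band is what lets a cheaper near-equal strategy surface.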

Tests (6 new; suite 708 + 1 skipped)

The flagship test runs the complete cycle in-process: a scripted author emits a depth strategy, it displaces sample, and promotes through the real seeded gate on a disjoint slice — plus: per-candidate failure recording, all-failed throws, costAware tie-breaking, the divergence instruction reaching the author, and the train/holdout offset-disjointness assertion.
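The train/holdout offset-disjointness idea can be sketched as follows. The `Slice` shape and both helper names are hypothetical, and the sketch assumes tasks are addressed by a contiguous offset and length; the package's own assertion may be expressed differently.

```typescript
// Hypothetical slice shape: a contiguous window over an ordered task list.
interface Slice {
  offset: number;
  length: number;
}

// Place the holdout window immediately past the train window so the two
// cannot share a task, and fail loudly if the task list is too short.
function holdoutAfterTrain(train: Slice, holdoutLength: number, totalTasks: number): Slice {
  const offset = train.offset + train.length;
  if (offset + holdoutLength > totalTasks) {
    throw new Error("not enough tasks for a disjoint holdout slice");
  }
  return { offset, length: holdoutLength };
}

// Throw if two half-open windows [offset, offset + length) overlap.
function assertDisjoint(a: Slice, b: Slice): void {
  const overlaps = a.offset < b.offset + b.length && b.offset < a.offset + a.length;
  if (overlaps) throw new Error("train and holdout slices overlap");
}
```

Computing the holdout offset from the train slice, rather than accepting it independently, is one way to make the no-reuse property hold by construction instead of by convention.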

Verification

pnpm run typecheck ✓, pnpm run lint ✓ (13 pre-existing warnings), pnpm test 708 ✓, cd bench && npx tsc --noEmit ✓.

- promotionGate: the statistical promotion decision as a package primitive: seeded paired bootstrap (agent-eval heldoutSignificance) over per-task holdout deltas, deterministic verdict, minimum-evidence floor (6 paired tasks), CI lower bound must clear the threshold. Replaces the bench-local unseeded pairedBootstrap whose verdict varied re-run to re-run.
- authorStrategy: named fallbackModel retry (one attempt when the primary fails or returns no code block), temperature/maxTokens now passed through.
- assertAuthoredCodeSafe -> assertStrategyContract: the lint enforces the harness's measurement invariants (author blindness + conserved dose) at the module boundary; docstring now says so in those terms.
- bench: strategy-author.mts drops its duplicate authorStrategy/contractDoc and becomes the R0->R2 ladder CLI over the package primitive; flywheel-run authors and gates through the package; authored run artifacts gitignored and excluded from typecheck.
- tests: regression coverage for harness-verified scoring, the empty-messages rule, the contract lint, and the gate's determinism/floor.
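To make the gate's behaviour concrete, here is a minimal standalone sketch of a seeded paired bootstrap with an evidence floor. It does not call agent-eval's heldoutSignificance; the `gateSketch` name, the mulberry32 generator, the percentile interval, the resample count, and the strict `>` comparison are all assumptions chosen for illustration.

```typescript
// Small seeded PRNG (mulberry32) so the same seed yields the same resamples.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface GateVerdict {
  promote: boolean;
  ciLower: number | null; // null when the evidence floor is not met
  reason: "below-evidence-floor" | "ci-clears-threshold" | "ci-below-threshold";
}

// deltas[i] = candidate score minus incumbent score on holdout task i (paired).
function gateSketch(
  deltas: number[],
  seed: number,
  threshold = 0,
  minPairs = 6, // floor quoted in the commit message
  resamples = 2000, // assumed
  alpha = 0.05, // assumed
): GateVerdict {
  if (deltas.length < minPairs) {
    return { promote: false, ciLower: null, reason: "below-evidence-floor" };
  }
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < deltas.length; i++) {
      // Resample paired deltas with replacement.
      sum += deltas[Math.floor(rand() * deltas.length)] as number;
    }
    means.push(sum / deltas.length);
  }
  means.sort((x, y) => x - y);
  // Lower end of a two-sided percentile interval over the resampled means.
  const ciLower = means[Math.floor((alpha / 2) * resamples)] as number;
  const promote = ciLower > threshold;
  return { promote, ciLower, reason: promote ? "ci-clears-threshold" : "ci-below-threshold" };
}
```

Because the only randomness comes from the seeded generator, re-running with the same deltas and seed reproduces the same verdict, which is the property the unseeded bench-local version lacked.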

- strategyAuthorContract documents ShotSpec.persona: the LLM author can now write multi-agent strategies (researcher/engineer hand-offs, persona panels) over the same conserved budget; previously the suite's own multi-agent primitive was invisible to the authored path.
- AuthorStrategyOptions.contract: caller-supplied contract text, making the author prompt itself a gateable optimization coordinate.
- AgenticOptions.analystModel: the critic can run on a different model than the worker (stronger critic, cheaper worker).
- BenchmarkConfig.hooks: RuntimeHooks pass through runBenchmark to every cell's runAgentic (the watchdog/route-auditor seam was unreachable from the benchmark path).
- vitest excludes .claude/worktrees/** (worktree agents' copies were swept into the root test run).
- tests: persona-in-contract pin, analystModel routing, hooks pass-through, contract override.

The convergence onto the package authorStrategy dropped the transport-level max_tokens the bench client sent by default; deepseek-v4-pro returns EMPTY content on the authoring prompt without it (reproduced), and with it can still hit the edge 524 on a long generation. maxTokens restored at the call sites; the fallback default becomes deepseek-v4-flash — fast enough to clear both failure modes (verified: authors a loadable strategy with and without maxTokens).
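The single-fallback retry can be sketched as below. This is not the package's authorStrategy: the `Complete` transport type, `extractCodeBlock`, and `authorWithFallback` are hypothetical names, and the sketch assumes the model reply carries the strategy in one fenced code block.

```typescript
// Hypothetical transport: send a prompt to a named model, get raw text back.
type Complete = (model: string, prompt: string) => Promise<string>;

// Three backticks, built at runtime so this example stays fence-safe.
const FENCE = "`".repeat(3);
const CODE_BLOCK = new RegExp(FENCE + "[\\w-]*\\n([\\s\\S]*?)" + FENCE);

// Return the body of the first fenced block, or null if there is none.
function extractCodeBlock(text: string): string | null {
  const code = CODE_BLOCK.exec(text)?.[1]?.trim();
  return code ? code : null;
}

// Try the primary model once; if it throws or returns no code block,
// try the fallback model once. No further retries.
async function authorWithFallback(
  complete: Complete,
  prompt: string,
  primary: string,
  fallback: string,
): Promise<{ code: string; model: string }> {
  for (const model of [primary, fallback]) {
    try {
      const code = extractCodeBlock(await complete(model, prompt));
      if (code !== null) return { code, model };
    } catch {
      // Transport failure (for example a timeout): fall through to the next model.
    }
  }
  throw new Error("author failed on both primary and fallback models");
}
```

Treating an empty or code-free reply the same as a thrown error matters here, since one of the failure modes described above is a successful response with empty content.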

… available

Installed/dev pin to ^0.89.0 (bench too); the peer floor stays >=0.83.0 because the package consumes no 0.89-only API yet. The deltas this unlocks for follow-ups: anytime-valid sequential gates (multi-generation promotion), preflightModels (operator preflight), experimentTracker/CostLedger/partitionHeldOut (evolution bookkeeping).

…ategy search

The multi-generation closed loop as a package primitive: per generation the system authors POP candidate strategies from the latest tournament's losses, plays them against the incumbent at equal budget, and advances a champion; ONE promotion decision runs on a never-before-used holdout slice through promotionGate (no adaptive reuse of evaluation data enters the verdict).

- selectChampion: search-side champion policy, either 'score' or 'costAware' (scores within epsilon are ties; the cheapest wins). A scalar champion rule hid a Pareto-frontier authored strategy in the first clean run.
- Archive nodes carry lineage (generation, parent) so a descendant-productivity parent selector can land without a schema change; every authored artifact records its gzip bits (description-length analysis comes free with every run).
- Author failures are recorded per candidate and survivors still compete; an all-failed evolution throws.
- bench/src/flywheel-evolve.mts: the EOPS runner (GENS/POP/CHAMPION envs).
- vitest aliases the package's own /loops subpath so authored modules dynamically import under test.
- chore: biome 2.4.16 mechanical fixes from the dep refresh.
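As a rough illustration of the "gzip bits" measurement, the sketch below computes the compressed size of a strategy's source text in bits using Node's built-in zlib. The `gzipBits` name and the choice to measure UTF-8 source with default compression settings are assumptions; the package may record the figure differently.

```typescript
import { gzipSync } from "node:zlib";

// Approximate description length of an authored artifact: the size of its
// gzip-compressed UTF-8 source, expressed in bits. This includes the fixed
// gzip header and trailer, so it is only meaningful as a relative measure.
function gzipBits(source: string): number {
  return gzipSync(Buffer.from(source, "utf8")).length * 8;
}
```

Recording this alongside each candidate allows a later comparison of description length against the train-to-holdout gap without re-running anything.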

tangletools

✅ Auto-approved PR — `1369e473`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T12:19:41Z}

…safe labels, honest telemetry

- compactLosses replaces the pretty-printed prefix slice: the author now sees EVERY train task (a prefix slice hid the tail, biasing which failure modes the author could target).
- The name-collision wrapper renames the returned agent AND its deliverable mode label; report keys and observability labels can no longer diverge for a renamed candidate; regression test added.
- trajectory documented as search telemetry (per-generation re-measurements, unpaired across generations); the verdict is the only evidence-grade comparison. Archive score documented as 0-until-first-tournament.

tangletools

✅ Auto-approved PR — `f06eb4f7`

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T12:25:08Z}

The base branch was changed.

tangletools

✅ Auto-approved PR — `c0d647a9`

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T12:45:00Z}

drewstone added 6 commits June 10, 2026 04:16

merge: author parity fix from the base branch

e600187

tangletools approved these changes Jun 10, 2026

tangletools previously approved these changes Jun 10, 2026

merge main (squashed #220) — branch side carries the superset

024c43e

drewstone changed the base branch from feat/author-surface-unlocks to main June 10, 2026 12:44

merge: main resolution from the unlocks branch

c0d647a

tangletools approved these changes Jun 10, 2026

drewstone mentioned this pull request Jun 10, 2026

feat(loops): make the remaining optimizer coordinates addressable #224

Closed

drewstone merged commit 8c41f59 into main Jun 10, 2026
1 check passed

drewstone deleted the feat/strategy-evolution branch June 10, 2026 12:47

drewstone merged 9 commits into main from feat/strategy-evolution