feat(loops): make the remaining optimizer coordinates addressable #224
Closed
drewstone wants to merge 5 commits into
- promotionGate: the statistical promotion decision as a package primitive. Seeded paired bootstrap (agent-eval heldoutSignificance) over per-task holdout deltas, deterministic verdict, minimum-evidence floor (6 paired tasks); the CI lower bound must clear the threshold. Replaces the bench-local unseeded pairedBootstrap, whose verdict varied from run to run.
- authorStrategy: named fallbackModel retry (one attempt when the primary fails or returns no code block); temperature/maxTokens now passed through.
- assertAuthoredCodeSafe -> assertStrategyContract: the lint enforces the harness's measurement invariants (author blindness + conserved dose) at the module boundary; docstring now says so in those terms.
- bench: strategy-author.mts drops its duplicate authorStrategy/contractDoc and becomes the R0->R2 ladder CLI over the package primitive; flywheel-run authors and gates through the package; authored run artifacts gitignored and excluded from typecheck.
- tests: regression coverage for harness-verified scoring, the empty-messages rule, the contract lint, and the gate's determinism/floor.
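The promotion gate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's promotionGate or agent-eval's heldoutSignificance: the name `sketchPromotionGate`, its options, the mulberry32 generator, the percentile interval, and the default threshold of 0 are all hypothetical. Only the seeded resampling, the six-pair floor, and the lower-bound rule come from the commit message.

```typescript
// Hypothetical sketch: names and defaults are illustrative assumptions.

interface GateVerdict {
  promote: boolean;
  reason: "insufficient-evidence" | "ci-below-threshold" | "ci-clears-threshold";
  ciLower: number | null;
  meanDelta: number | null;
}

interface GateOptions {
  seed?: number;       // fixed seed makes the verdict reproducible
  threshold?: number;  // value the CI lower bound must exceed (assumed strict >)
  minPairs?: number;   // minimum-evidence floor
  resamples?: number;
  confidence?: number;
}

// Small seeded PRNG (mulberry32) returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// deltas[i] = candidate score minus baseline score on holdout task i.
function sketchPromotionGate(deltas: number[], opts: GateOptions = {}): GateVerdict {
  const { seed = 1, threshold = 0, minPairs = 6, resamples = 2000, confidence = 0.95 } = opts;
  const n = deltas.length;
  if (n < minPairs) {
    return { promote: false, reason: "insufficient-evidence", ciLower: null, meanDelta: null };
  }

  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    // Resample tasks with replacement; pairing is preserved because
    // each delta already combines both arms for one task.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  // Percentile interval: take the lower tail of the resampled means.
  const ciLower = means[Math.floor(((1 - confidence) / 2) * resamples)];
  const meanDelta = deltas.reduce((s, d) => s + d, 0) / n;
  const promote = ciLower > threshold;
  return {
    promote,
    reason: promote ? "ci-clears-threshold" : "ci-below-threshold",
    ciLower,
    meanDelta,
  };
}
```

Because the generator is seeded, calling the function twice on the same deltas with the same seed yields the same interval and therefore the same verdict, which is the property the unseeded bench-local version lacked.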
- strategyAuthorContract documents ShotSpec.persona: the LLM author can now write multi-agent strategies (researcher/engineer hand-offs, persona panels) over the same conserved budget; previously the suite's own multi-agent primitive was invisible to the authored path.
- AuthorStrategyOptions.contract: caller-supplied contract text, making the author prompt itself a gateable optimization coordinate.
- AgenticOptions.analystModel: the critic can run on a different model than the worker (stronger critic, cheaper worker).
- BenchmarkConfig.hooks: RuntimeHooks pass through runBenchmark to every cell's runAgentic (the watchdog/route-auditor seam was unreachable from the benchmark path).
- vitest excludes .claude/worktrees/** (worktree agents' copies were swept into the root test run).
- tests: persona-in-contract pin, analystModel routing, hooks pass-through, contract override.
The convergence onto the package authorStrategy dropped the transport-level max_tokens the bench client sent by default; deepseek-v4-pro returns EMPTY content on the authoring prompt without it (reproduced), and with it can still hit the edge 524 on a long generation. maxTokens restored at the call sites; the fallback default becomes deepseek-v4-flash — fast enough to clear both failure modes (verified: authors a loadable strategy with and without maxTokens).
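The retry behaviour described above (one fallback attempt when the primary model fails or returns no code block, with temperature/maxTokens passed through) might look roughly like the sketch below. It is not the package's authorStrategy: `authorWithFallback`, `extractCodeBlock`, the `Complete` callback type, and the option shape are hypothetical names introduced for illustration, and the model client is injected so the sketch makes no assumption about the real transport.

```typescript
// Hypothetical sketch: every name here is an illustrative assumption.

interface CallOptions {
  temperature?: number;
  maxTokens?: number;
}

interface AuthorOptions extends CallOptions {
  model: string;
  fallbackModel?: string;
}

// Injected model client: returns the raw completion text.
type Complete = (model: string, prompt: string, opts: CallOptions) => Promise<string>;

// Built at runtime so this example does not embed a literal fence.
const FENCE = "`".repeat(3);

// Returns the body of the first fenced block, or null if there is none
// (this also covers an empty completion).
function extractCodeBlock(text: string): string | null {
  const open = text.indexOf(FENCE);
  if (open < 0) return null;
  const bodyStart = text.indexOf("\n", open);
  if (bodyStart < 0) return null;
  const close = text.indexOf(FENCE, bodyStart + 1);
  if (close < 0) return null;
  const body = text.slice(bodyStart + 1, close).trim();
  return body.length > 0 ? body : null;
}

async function authorWithFallback(
  complete: Complete,
  prompt: string,
  opts: AuthorOptions,
): Promise<{ code: string; model: string }> {
  const { model, fallbackModel, temperature, maxTokens } = opts;
  // Primary first, then at most one fallback attempt.
  const models = fallbackModel ? [model, fallbackModel] : [model];
  let lastError: unknown = new Error("no model attempted");

  for (const m of models) {
    try {
      const text = await complete(m, prompt, { temperature, maxTokens });
      const code = extractCodeBlock(text);
      if (code !== null) return { code, model: m };
      lastError = new Error(`${m} returned no code block`);
    } catch (err) {
      lastError = err; // transport failure: fall through to the next model
    }
  }
  throw lastError;
}
```

Treating "empty content" and "no code block" as the same failure keeps the retry condition to a single check, so a primary that returns an empty body falls through to the fallback exactly as a thrown transport error does.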
tangletools
approved these changes
Jun 10, 2026
tangletools
left a comment
Contributor
✅ Auto-approved PR — 024c43ee
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T12:43:58Z
Contributor
Author
Consolidated into #223 (one PR carrying the full line: addressability unlocks + author-parity fix + agent-eval 0.89 + runStrategyEvolution + review fixes).
Supersedes #221 (auto-closed when its base branch was deleted post-#220-merge). Contains #221's changes + the author-parity fix (maxTokens 8192 + flash fallback — main's #220 squash predates it) + the merge of main.
What
Four unlocks so every genome coordinate is reachable by the one author→gate pipeline:
- strategyAuthorContract documents ShotSpec.persona: authored strategies can now be multi-agent.
- AuthorStrategyOptions.contract (caller-supplied contract text): the authoring contract is meta-optimizable, gated like any candidate.
- AgenticOptions.analystModel: the firewalled critic can run on a different model than the worker.
- BenchmarkConfig.hooks: RuntimeHooks flow through runBenchmark to every cell (the observability seam was unreachable from the benchmark path).
- maxTokens: 8192 restored at the call sites (deepseek-v4-pro returns empty content without it; reproduced live) + fallback default deepseek-v4-flash (fast enough to clear the edge-524 mode; verified authors loadable strategies).
- **/.claude/worktrees/**.
Verification
typecheck ✓, lint ✓, 702 tests ✓ (+4: persona-in-contract pin, analystModel routing, hooks pass-through, contract override). Verified live: the relaunched clean flywheel authored critique-refine through this exact path and completed end-to-end.