feat(loops): make the remaining optimizer coordinates addressable #224

Closed
drewstone wants to merge 5 commits into main from feat/author-surface-unlocks

@drewstone

Supersedes #221, which was auto-closed when its base branch was deleted after #220 merged. This PR contains #221's changes, the author-parity fix (maxTokens 8192 plus the flash fallback; main's #220 squash predates it), and a merge of main.

What

Four unlocks so every genome coordinate is reachable by the one author→gate pipeline, plus an author-parity fix and a test-config exclusion:

  • Persona reaches the LLM author: strategyAuthorContract documents ShotSpec.persona — authored strategies can now be multi-agent.
  • The author prompt is a coordinate: AuthorStrategyOptions.contract (caller-supplied contract text) — the authoring contract is meta-optimizable, gated like any candidate.
  • AgenticOptions.analystModel: the firewalled critic can run on a different model than the worker.
  • BenchmarkConfig.hooks: RuntimeHooks flow through runBenchmark to every cell (the observability seam was unreachable from the benchmark path).
  • fix(bench): author parity — maxTokens: 8192 restored at the call sites (deepseek-v4-pro returns empty content without it; reproduced live) + fallback default deepseek-v4-flash (fast enough to clear the edge-524 mode; verified authors loadable strategies).
  • vitest excludes **/.claude/worktrees/**.
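
As a rough illustration of the analystModel and hooks unlocks listed above, the following sketch threads an optional critic model and an optional hooks object from a benchmark-level config down to every cell. All type and function names here (`SketchHooks`, `SketchAgenticOptions`, `SketchBenchmarkConfig`, `runCellSketch`, `runBenchmarkSketch`) are hypothetical stand-ins, not the package's actual `RuntimeHooks`, `AgenticOptions`, or `runBenchmark` signatures. The hook names and the rule that the analyst falls back to the worker model when unset are assumptions.

```typescript
// Hypothetical shapes for illustration only; the real RuntimeHooks,
// AgenticOptions and BenchmarkConfig in the package may differ.
interface SketchHooks {
  onCellStart?: (cell: string) => void;
  onCellEnd?: (cell: string) => void;
}

interface SketchAgenticOptions {
  model: string;          // worker model
  analystModel?: string;  // critic model; assumed to default to the worker model
  hooks?: SketchHooks;
}

interface SketchBenchmarkConfig {
  cells: string[];
  model: string;
  analystModel?: string;
  hooks?: SketchHooks;
}

interface CellResult {
  cell: string;
  worker: string;
  analyst: string;
}

// Stand-in for a per-cell agentic run: fires the hooks and reports which
// model each role would use. No LLM call is made in this sketch.
function runCellSketch(cell: string, opts: SketchAgenticOptions): CellResult {
  opts.hooks?.onCellStart?.(cell);
  const result: CellResult = {
    cell,
    worker: opts.model,
    analyst: opts.analystModel ?? opts.model,
  };
  opts.hooks?.onCellEnd?.(cell);
  return result;
}

// Stand-in for the benchmark driver: the point is only that hooks and
// analystModel are forwarded to every cell instead of being dropped.
function runBenchmarkSketch(config: SketchBenchmarkConfig): CellResult[] {
  return config.cells.map((cell) =>
    runCellSketch(cell, {
      model: config.model,
      analystModel: config.analystModel,
      hooks: config.hooks,
    }),
  );
}
```

With this shape, a watchdog or route auditor attached at the benchmark level observes every cell, and a stronger critic can be paired with a cheaper worker by setting only `analystModel`.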

Verification

typecheck ✓, lint ✓, 702 tests ✓ (+4: persona-in-contract pin, analystModel routing, hooks pass-through, contract override). Verified live: the relaunched clean flywheel authored critique-refine through this exact path and completed end-to-end.

- promotionGate: the statistical promotion decision as a package primitive —
  seeded paired bootstrap (agent-eval heldoutSignificance) over per-task
  holdout deltas, deterministic verdict, minimum-evidence floor (6 paired
  tasks), CI lower bound must clear the threshold. Replaces the bench-local
  unseeded pairedBootstrap whose verdict varied re-run to re-run.
- authorStrategy: named fallbackModel retry (one attempt when the primary
  fails or returns no code block), temperature/maxTokens now passed through.
- assertAuthoredCodeSafe -> assertStrategyContract: the lint enforces the
  harness's measurement invariants (author blindness + conserved dose) at the
  module boundary; docstring now says so in those terms.
- bench: strategy-author.mts drops its duplicate authorStrategy/contractDoc
  and becomes the R0->R2 ladder CLI over the package primitive; flywheel-run
  authors and gates through the package; authored run artifacts gitignored
  and excluded from typecheck.
- tests: regression coverage for harness-verified scoring, the
  empty-messages rule, the contract lint, and the gate's determinism/floor.
- strategyAuthorContract documents ShotSpec.persona — the LLM author can now
  write multi-agent strategies (researcher/engineer hand-offs, persona panels)
  over the same conserved budget; previously the suite's own multi-agent
  primitive was invisible to the authored path.
- AuthorStrategyOptions.contract — caller-supplied contract text, making the
  author prompt itself a gateable optimization coordinate.
- AgenticOptions.analystModel — the critic can run on a different model than
  the worker (stronger critic, cheaper worker).
- BenchmarkConfig.hooks — RuntimeHooks pass through runBenchmark to every
  cell's runAgentic (the watchdog/route-auditor seam was unreachable from the
  benchmark path).
- vitest excludes .claude/worktrees/** (worktree agents' copies were swept
  into the root test run).
- tests: persona-in-contract pin, analystModel routing, hooks pass-through,
  contract override.
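
The promotion gate described in the first item of this list (seeded paired bootstrap, minimum-evidence floor, CI lower bound against a threshold) can be sketched roughly as follows. This is not the package's `promotionGate` and does not call agent-eval's `heldoutSignificance`; it swaps in a plain percentile bootstrap over the per-task deltas with a small seeded PRNG (mulberry32). The names `sketchPromotionGate`, `GateOptions`, and `GateVerdict`, the resample count, the 95% interval, and the strict `>` comparison are all assumptions made for illustration. Only the floor of 6 paired tasks comes from the description above.

```typescript
// Illustrative only: hypothetical names, not the package's promotionGate API.
interface GateOptions {
  threshold: number;   // improvement the CI lower bound must clear
  seed: number;        // fixes the resampling so the verdict is reproducible
  resamples?: number;  // assumed default: 2000
  minPairs?: number;   // evidence floor; 6 per the PR description
  alpha?: number;      // assumed default: 0.05 (two-sided 95% interval)
}

interface GateVerdict {
  promote: boolean;
  reason: "insufficient-evidence" | "ci-below-threshold" | "ci-clears-threshold";
  meanDelta: number;
  ciLower: number | null;
}

// mulberry32: a small deterministic PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// deltas[i] = candidate score minus baseline score on holdout task i.
function sketchPromotionGate(deltas: number[], opts: GateOptions): GateVerdict {
  const { threshold, seed, resamples = 2000, minPairs = 6, alpha = 0.05 } = opts;
  const n = deltas.length;
  const meanDelta = n === 0 ? 0 : deltas.reduce((s, x) => s + x, 0) / n;

  // Refuse to decide on too few paired tasks.
  if (n < minPairs) {
    return { promote: false, reason: "insufficient-evidence", meanDelta, ciLower: null };
  }

  // Resample task indices with replacement; the same seed gives the same draws.
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  // Lower end of the percentile interval over the bootstrap means.
  const ciLower = means[Math.floor((alpha / 2) * resamples)];
  const promote = ciLower > threshold;
  return {
    promote,
    reason: promote ? "ci-clears-threshold" : "ci-below-threshold",
    meanDelta,
    ciLower,
  };
}
```

Because the only source of randomness is the seeded generator, two calls with the same deltas and seed return the same verdict, which is the property the unseeded bench-local bootstrap lacked.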

The convergence onto the package authorStrategy dropped the transport-level
max_tokens that the bench client sent by default. Without it, deepseek-v4-pro
returns EMPTY content on the authoring prompt (reproduced); with it, a long
generation can still hit the edge 524. maxTokens is restored at the call
sites, and the fallback default becomes deepseek-v4-flash, which is fast
enough to clear both failure modes (verified: it authors a loadable strategy
with and without maxTokens).
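
The retry behaviour described above (one attempt on a named fallback model when the primary fails or returns no code block, with maxTokens always sent) might look something like this minimal sketch. The `Complete` callback, `authorWithFallback`, and `extractCodeBlock` are hypothetical names introduced here; the real `authorStrategy` signature and its code-block parsing are not shown in this PR and may differ. The 8192 default is taken from the description; everything else is an assumption.

```typescript
// Hypothetical transport: any function that sends a prompt to a model and
// resolves with the raw text of the reply.
type Complete = (req: {
  model: string;
  prompt: string;
  maxTokens: number;
  temperature?: number;
}) => Promise<string>;

// Built at runtime so this example does not embed a literal fence marker.
const FENCE = "`".repeat(3);

// Returns the body of the first fenced code block, or null if none is found.
function extractCodeBlock(text: string): string | null {
  const re = new RegExp(FENCE + "[a-zA-Z]*\\n([\\s\\S]*?)" + FENCE);
  const m = re.exec(text);
  if (!m) return null;
  const code = (m[1] ?? "").trim();
  return code.length > 0 ? code : null;
}

interface AuthorSketchOptions {
  primaryModel: string;
  fallbackModel: string;
  prompt: string;
  maxTokens?: number;    // defaults to 8192, the value restored in this PR
  temperature?: number;  // passed through unchanged when provided
}

// Tries the primary model, then the fallback exactly once. An empty reply,
// a reply without a code block, and a thrown transport error all count as
// a failed attempt.
async function authorWithFallback(
  complete: Complete,
  opts: AuthorSketchOptions,
): Promise<{ model: string; code: string }> {
  const maxTokens = opts.maxTokens ?? 8192;
  let lastError: unknown = new Error("no attempt made");
  for (const model of [opts.primaryModel, opts.fallbackModel]) {
    try {
      const text = await complete({
        model,
        prompt: opts.prompt,
        maxTokens,
        temperature: opts.temperature,
      });
      const code = extractCodeBlock(text);
      if (code !== null) return { model, code };
      lastError = new Error(model + " returned no code block");
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Treating an empty reply the same as a transport error means the empty-content mode and the timeout mode both fall through to the faster fallback model without separate handling.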

@tangletools tangletools left a comment

✅ Auto-approved PR — 024c43ee

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T12:43:58Z

@drewstone

Consolidated into #223 (one PR carrying the full line: addressability unlocks + author-parity fix + agent-eval 0.89 + runStrategyEvolution + review fixes).

@drewstone drewstone closed this Jun 10, 2026
@drewstone drewstone deleted the feat/author-surface-unlocks branch June 10, 2026 12:47