feat(loops): make the remaining optimizer coordinates addressable#221
feat(loops): make the remaining optimizer coordinates addressable#221drewstone wants to merge 2 commits into
Conversation
- strategyAuthorContract documents ShotSpec.persona — the LLM author can now write multi-agent strategies (researcher/engineer hand-offs, persona panels) over the same conserved budget; previously the suite's own multi-agent primitive was invisible to the authored path. - AuthorStrategyOptions.contract — caller-supplied contract text, making the author prompt itself a gateable optimization coordinate. - AgenticOptions.analystModel — the critic can run on a different model than the worker (stronger critic, cheaper worker). - BenchmarkConfig.hooks — RuntimeHooks pass through runBenchmark to every cell's runAgentic (the watchdog/route-auditor seam was unreachable from the benchmark path). - vitest excludes .claude/worktrees/** (worktree agents' copies were swept into the root test run). - tests: persona-in-contract pin, analystModel routing, hooks pass-through, contract override.
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — e20925f6
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:25:54Z
✅ No Blockers —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 79 | 77 | 77 |
| Confidence | 75 | 75 | 75 |
| Correctness | 79 | 77 | 77 |
| Security | 79 | 77 | 77 |
| Testing | 79 | 77 | 77 |
| Architecture | 79 | 77 | 77 |
Full multi-shot audit completed 3/3 planned shots over 5 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 3/3 planned shots over 5 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Hook propagation test uses non-existent e.type property, assertion is vacuous — tests/loops/strategy-suite.test.ts
Line 305:
hooks: { onEvent: (e) => void events.push(e.type) }— RuntimeHookEvent (src/runtime-hooks.ts:35-46) has notypeproperty. The event interface hastarget,phase,id,runId,timestamp, etc. bute.typeis alwaysundefined. The assertionexpect(events.length).toBeGreaterThan(0)still passes becausepush(undefined)increments length, but it never checks WHAT events fired or with what data. A regression that changes hook event shapes or stops emitting certain event targets would pass uncaught. Fix: usee.phaseore.targetinstead, and assert on the actual values (e.g.expect(events).toContain('agent.spawn')).
🟡 LOW strategyAuthorContract inlined in requestAuthoredCode fallback still uses string concat — src/runtime/strategy-author.ts
Line 145: The template literal
${opts.contract ?? strategyAuthorContract}\n\nBASELINE RESULTS...is a long single-line expression. The contract text itself is already aconstat module level; the override works correctly viaopts.contract ?? strategyAuthorContract. No issue, just noting the line is 230+ chars wide. Purely cosmetic.
🟡 LOW Analyst token spend is always zeroed regardless of analystModel — src/runtime/strategy.ts
The analystExecutor at strategy.ts:344 reports
spent: spend(1)— which hardcodes 0 tokens and 0 USD. When analystModel selects a more expensive model than the worker, the actual router spend is invisible to the conserved pool and cost vector. This is pre-existing (the PR just adds the model-selector knob), but since the feature explicitly enables running the analyst on a different model, it's worth a comment or TODO. The observe() call at strategy.ts:228-241 makes a real chat completion whose token usage is never captured.
🟡 LOW observe() receives model via both chat.defaultModel and opts.model — redundant but harmless — src/runtime/strategy.ts
Lines 222-238:
createChatClientsetsdefaultModel: analystModeland thenobserve()is called withmodel: analystModel. Theobserve()function at observe.ts:143 does...(opts.model ? { model: opts.model } : {}), which overrides the chat client's defaultModel. Both paths converge on the same value so this is functionally correct, but it means the chat client's defaultModel is always overridden — the double-setting is a mild code smell. Not a bug.
🟡 LOW Temp directory leak in authorStrategy test — tests/loops/strategy-suite.test.ts
Lines 311:
mkdtempSync(join(tmpdir(), 'authored-test-'))creates a temp dir that is never cleaned up (normSyncin afterEach or after the test). Confirmed 14 leftover dirs in /tmp from test runs. Should addafterEach(() => { rmSync(dir, { recursive: true }) })or use a tracked cleanup in the existingafterEachblock. Low severity — test-only, no production impact, but pollutes CI runners over time.
🟡 LOW Temp directory never cleaned up, leaks on every test run — tests/loops/strategy-suite.test.ts
Line 311:
mkdtempSync(join(tmpdir(), 'authored-test-'))creates a real directory on disk that is never removed. Over repeated test runs this accumulates directories in $TMPDIR. Fix: addrmSync(dir, { recursive: true })in anafterEachorafterAllblock, or wrap in a try/finally that cleans up.
🟡 LOW authorStrategy test stub may break on non-unix platforms — tests/loops/strategy-suite.test.ts
Line 311:
join(tmpdir(), 'authored-test-')and the dynamicimport(file://${file})on strategy-author.ts:174 — thefile://URL construction works on linux/mac but may need encoding on Windows (spaces in temp path, backslashes). Non-blocking for current CI but worth noting for portability.
🟡 LOW critiqued test assertion assumes router model field in request body — tests/loops/strategy-suite.test.ts
Lines 285-287:
captured.map((r) => (r as { model?: string }).model)— this asserts the model name appears in the raw request body'smodelfield. This is correct becauserunShotat strategy.ts:156 sends{ model: modelOverride ?? opts.model, ... }viafetch. The assertion is valid and the cast is fine for test fixtures. No issue, just documenting the dependency chain is sound.
🟡 LOW Narrower exclude possible but not necessary — vitest.config.ts
The pattern '/.claude/worktrees/' is correct. A more precise alternative would be '.claude/worktrees/**' (without leading glob) since worktrees are always at repo root, but the double-star prefix is harmless and covers edge cases like nested clones. No action required.
tangletools · 2026-06-10T10:40:59Z · trace
✅ No Blockers —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 76 | 76 | 76 |
| Confidence | 75 | 75 | 75 |
| Correctness | 76 | 76 | 76 |
| Security | 76 | 76 | 76 |
| Testing | 76 | 76 | 76 |
| Architecture | 76 | 76 | 76 |
Full multi-shot audit completed 3/3 planned shots over 5 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 3/3 planned shots over 5 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Cost estimation uses wrong model when persona.model overrides the worker model — src/runtime/strategy.ts
In
shotExecutor(line 318-319),estimateCostandisModelPricedare called withopts.model, but the shot may have run witht.persona?.model(passed asmodelOverridetorunShotat line 298, used at line 156). When a persona-specified model has different pricing, the USD cost in the conserved pool is wrong — it estimates against the worker model's price table, not the actual model used. Impact: benchmar
🟠 MEDIUM mkdtempSync temp directory never cleaned up — tests/loops/strategy-suite.test.ts
Line 311:
mkdtempSync(join(tmpdir(), 'authored-test-'))creates a real temp directory on disk. There is noafterEachcleanup,fs.rmSync, ortry/finallyto remove it after the test. This is a resource leak — repeated test runs accumulate directories in /tmp. Add cleanup: store the dir in aletcaptured by anafterEach, or use vitest'stmpdirfixture if available.
🟡 LOW Contract string injection has no length/validation guard — src/runtime/strategy-author.ts
The new
opts.contractfield is interpolated directly into the LLM prompt at line 145 with no length cap or content validation. A caller passing an extremely long or malicious contract string could bloat the prompt, potentially pushing the losses table out of context or inflating token costs. Low severity becauseAuthorStrategyOptionsis a trusted caller API (not user-facing), and the existinglossesJsonalready has no length guard either, but worth noting for defense-in-depth if the contract coordinate is ever exposed to less-trusted callers.
🟡 LOW Empty-string analystModel bypasses default — src/runtime/strategy.ts
Line 221:
const analystModel = opts.analystModel ?? opts.model— the??operator only guards null/undefined, not empty string. If a caller passesanalystModel: '', the empty string reachescreateChatClientandobserveas a model name, which may produce opaque router errors. The contract says 'Omitted ⇒ the worker's model' so this is a caller-serviceability nit, not a code bug per se.
🟡 LOW Temp directory leak in authorStrategy test — tests/loops/strategy-suite.test.ts
Line 311:
mkdtempSync(join(tmpdir(), 'authored-test-'))creates a temp dir that is never cleaned up (normSyncorafterEachcleanup). Each test run leaks a directory under /tmp. Add cleanup after the assertion:rmSync(dir, { recursive: true })in a finally block.
🟡 LOW analystModel test checks model presence, not correct per-call routing — tests/loops/strategy-suite.test.ts
Lines 269-288: The assertion
expect(models).toContain('test-model')andexpect(models).toContain('critic-model')only verifies both models appear somewhere in all captured fetch calls — it does not verify that the shot call (index 0) usedtest-modeland the analyst call (index >=1) usedcritic-model. A bug where models were swapped or where an extra call used the wrong model could pass. Consider asserting on specific captured indices:expect(captured[0]).toHaveProperty('model', 'test-model')andexpect(captured[1]).toHaveProperty('model', 'critic-model').
🟡 LOW analystModel test couples to agent-eval transport internals — tests/loops/strategy-suite.test.ts
Lines 285-287: The test asserts that
capturedfetch bodies contain both 'test-model' and 'critic-model'. This works only becausecreateChatClient({ transport: 'router' })from@tangle-network/agent-evalhappens to use globalfetchinternally. If agent-eval switches to a different HTTP transport (e.g. undici, node:http), the stub won't capture the critic-model request and the test will fail with a misleading assertion error. Consider adding a comment documenting this coupling, or mocking at theChatClientinterface level instead.
🟡 LOW hooks test assertion is trivially true, doesn't match test name — tests/loops/strategy-suite.test.ts
Lines 290-308: The test name says 'passes lifecycle hooks through to every cell' but only asserts
expect(events.length).toBeGreaterThan(0). With budget=1, concurrency=1, tasks=[task], strategies=[oneShot], there is exactly 1 cell — the assertion is correct by construction but doesn't verify the 'every cell' property. If the test is expanded to multiple tasks/strategies, this assertion would pass even if hooks only fired for the first cell. Consider collecting events per-cell (e.g. resetting the array per task) or adding a stronger assertion likeevents.length >= strategies.length * tasks.length.
tangletools · 2026-06-10T11:34:21Z · trace
Stacked on #220 (retarget to main when it merges).
What
Four small unlocks so every genome coordinate is reachable by the one author→gate pipeline (search stays admission-gated; this PR only makes the coordinates addressable):
strategyAuthorContractnow documentsShotSpec.persona({ systemPrompt?, model? }per shot). The suite shipped multi-agent loops in feat(runtime): ShotPersona — multi-agent loops #214 but the contract never mentioned them — an authored strategy literally could not be multi-agent. One paragraph fixes that.AuthorStrategyOptions.contractaccepts caller-supplied contract text (default unchanged) — a GEPA/skill loop can now evolve the authoring contract and gate variants on the same frozen holdout.AgenticOptions.analystModel: the firewalled critic can run on a different model than the worker (stronger critic / cheaper worker). Omitted ⇒ worker model, as before.BenchmarkConfig.hooks:RuntimeHookspass throughrunBenchmark→runAgenticfor every cell — the feat(runtime): live observability (hooks passthrough) + the route-rigor auditor (auditIntent) #216 observability seam was unreachable from the benchmark/flywheel path (onlyonTaskexisted).Plus: vitest excludes
**/.claude/worktrees/**(background-agent worktree copies were being swept into the root test run).Verification
pnpm run typecheckclean,pnpm run lintclean,pnpm test702 passed / 1 skipped (+4 new tests: persona-in-contract pin, analystModel routing, hooks pass-through, contract override).