Skip to content

feat(loops): make the remaining optimizer coordinates addressable#221

Closed
drewstone wants to merge 2 commits into
feat/loops-promotion-gate-author-convergencefrom
feat/author-surface-unlocks
Closed

feat(loops): make the remaining optimizer coordinates addressable#221
drewstone wants to merge 2 commits into
feat/loops-promotion-gate-author-convergencefrom
feat/author-surface-unlocks

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

Stacked on #220 (retarget to main when it merges).

What

Four small unlocks so every genome coordinate is reachable by the one author→gate pipeline (search stays admission-gated; this PR only makes the coordinates addressable):

  • Persona reaches the LLM author: strategyAuthorContract now documents ShotSpec.persona ({ systemPrompt?, model? } per shot). The suite shipped multi-agent loops in feat(runtime): ShotPersona — multi-agent loops #214 but the contract never mentioned them — an authored strategy literally could not be multi-agent. One paragraph fixes that.
  • The author prompt is a coordinate: AuthorStrategyOptions.contract accepts caller-supplied contract text (default unchanged) — a GEPA/skill loop can now evolve the authoring contract and gate variants on the same frozen holdout.
  • AgenticOptions.analystModel: the firewalled critic can run on a different model than the worker (stronger critic / cheaper worker). Omitted ⇒ worker model, as before.
  • BenchmarkConfig.hooks: RuntimeHooks pass through runBenchmarkrunAgentic for every cell — the feat(runtime): live observability (hooks passthrough) + the route-rigor auditor (auditIntent) #216 observability seam was unreachable from the benchmark/flywheel path (only onTask existed).

Plus: vitest excludes **/.claude/worktrees/** (background-agent worktree copies were being swept into the root test run).

Verification

pnpm run typecheck clean, pnpm run lint clean, pnpm test 702 passed / 1 skipped (+4 new tests: persona-in-contract pin, analystModel routing, hooks pass-through, contract override).

- strategyAuthorContract documents ShotSpec.persona — the LLM author can now
  write multi-agent strategies (researcher/engineer hand-offs, persona panels)
  over the same conserved budget; previously the suite's own multi-agent
  primitive was invisible to the authored path.
- AuthorStrategyOptions.contract — caller-supplied contract text, making the
  author prompt itself a gateable optimization coordinate.
- AgenticOptions.analystModel — the critic can run on a different model than
  the worker (stronger critic, cheaper worker).
- BenchmarkConfig.hooks — RuntimeHooks pass through runBenchmark to every
  cell's runAgentic (the watchdog/route-auditor seam was unreachable from the
  benchmark path).
- vitest excludes .claude/worktrees/** (worktree agents' copies were swept
  into the root test run).
- tests: persona-in-contract pin, analystModel routing, hooks pass-through,
  contract override.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — e20925f6

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:25:54Z

@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — e20925f6

Readiness 77/100 · Confidence 75/100 · 9 findings (1 medium, 8 low)

deepseek glm aggregate
Readiness 79 77 77
Confidence 75 75 75
Correctness 79 77 77
Security 79 77 77
Testing 79 77 77
Architecture 79 77 77

Full multi-shot audit completed 3/3 planned shots over 5 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 3/3 planned shots over 5 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Hook propagation test uses non-existent e.type property, assertion is vacuous — tests/loops/strategy-suite.test.ts

Line 305: hooks: { onEvent: (e) => void events.push(e.type) } — RuntimeHookEvent (src/runtime-hooks.ts:35-46) has no type property. The event interface has target, phase, id, runId, timestamp, etc. but e.type is always undefined. The assertion expect(events.length).toBeGreaterThan(0) still passes because push(undefined) increments length, but it never checks WHAT events fired or with what data. A regression that changes hook event shapes or stops emitting certain event targets would pass uncaught. Fix: use e.phase or e.target instead, and assert on the actual values (e.g. expect(events).toContain('agent.spawn')).

🟡 LOW strategyAuthorContract inlined in requestAuthoredCode fallback still uses string concat — src/runtime/strategy-author.ts

Line 145: The template literal ${opts.contract ?? strategyAuthorContract}\n\nBASELINE RESULTS... is a long single-line expression. The contract text itself is already a const at module level; the override works correctly via opts.contract ?? strategyAuthorContract. No issue, just noting the line is 230+ chars wide. Purely cosmetic.

🟡 LOW Analyst token spend is always zeroed regardless of analystModel — src/runtime/strategy.ts

The analystExecutor at strategy.ts:344 reports spent: spend(1) — which hardcodes 0 tokens and 0 USD. When analystModel selects a more expensive model than the worker, the actual router spend is invisible to the conserved pool and cost vector. This is pre-existing (the PR just adds the model-selector knob), but since the feature explicitly enables running the analyst on a different model, it's worth a comment or TODO. The observe() call at strategy.ts:228-241 makes a real chat completion whose token usage is never captured.

🟡 LOW observe() receives model via both chat.defaultModel and opts.model — redundant but harmless — src/runtime/strategy.ts

Lines 222-238: createChatClient sets defaultModel: analystModel and then observe() is called with model: analystModel. The observe() function at observe.ts:143 does ...(opts.model ? { model: opts.model } : {}), which overrides the chat client's defaultModel. Both paths converge on the same value so this is functionally correct, but it means the chat client's defaultModel is always overridden — the double-setting is a mild code smell. Not a bug.

🟡 LOW Temp directory leak in authorStrategy test — tests/loops/strategy-suite.test.ts

Lines 311: mkdtempSync(join(tmpdir(), 'authored-test-')) creates a temp dir that is never cleaned up (no rmSync in afterEach or after the test). Confirmed 14 leftover dirs in /tmp from test runs. Should add afterEach(() => { rmSync(dir, { recursive: true }) }) or use a tracked cleanup in the existing afterEach block. Low severity — test-only, no production impact, but pollutes CI runners over time.

🟡 LOW Temp directory never cleaned up, leaks on every test run — tests/loops/strategy-suite.test.ts

Line 311: mkdtempSync(join(tmpdir(), 'authored-test-')) creates a real directory on disk that is never removed. Over repeated test runs this accumulates directories in $TMPDIR. Fix: add rmSync(dir, { recursive: true }) in an afterEach or afterAll block, or wrap in a try/finally that cleans up.

🟡 LOW authorStrategy test stub may break on non-unix platforms — tests/loops/strategy-suite.test.ts

Line 311: join(tmpdir(), 'authored-test-') and the dynamic import(file://${file}) on strategy-author.ts:174 — the file:// URL construction works on linux/mac but may need encoding on Windows (spaces in temp path, backslashes). Non-blocking for current CI but worth noting for portability.

🟡 LOW critiqued test assertion assumes router model field in request body — tests/loops/strategy-suite.test.ts

Lines 285-287: captured.map((r) => (r as { model?: string }).model) — this asserts the model name appears in the raw request body's model field. This is correct because runShot at strategy.ts:156 sends { model: modelOverride ?? opts.model, ... } via fetch. The assertion is valid and the cast is fine for test fixtures. No issue, just documenting the dependency chain is sound.

🟡 LOW Narrower exclude possible but not necessary — vitest.config.ts

The pattern '/.claude/worktrees/' is correct. A more precise alternative would be '.claude/worktrees/**' (without leading glob) since worktrees are always at repo root, but the double-star prefix is harmless and covers edge cases like nested clones. No action required.


tangletools · 2026-06-10T10:40:59Z · trace

@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — e6001872

Readiness 76/100 · Confidence 75/100 · 8 findings (2 medium, 6 low)

deepseek glm aggregate
Readiness 76 76 76
Confidence 75 75 75
Correctness 76 76 76
Security 76 76 76
Testing 76 76 76
Architecture 76 76 76

Full multi-shot audit completed 3/3 planned shots over 5 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 3/3 planned shots over 5 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Cost estimation uses wrong model when persona.model overrides the worker model — src/runtime/strategy.ts

In shotExecutor (line 318-319), estimateCost and isModelPriced are called with opts.model, but the shot may have run with t.persona?.model (passed as modelOverride to runShot at line 298, used at line 156). When a persona-specified model has different pricing, the USD cost in the conserved pool is wrong — it estimates against the worker model's price table, not the actual model used. Impact: benchmar

🟠 MEDIUM mkdtempSync temp directory never cleaned up — tests/loops/strategy-suite.test.ts

Line 311: mkdtempSync(join(tmpdir(), 'authored-test-')) creates a real temp directory on disk. There is no afterEach cleanup, fs.rmSync, or try/finally to remove it after the test. This is a resource leak — repeated test runs accumulate directories in /tmp. Add cleanup: store the dir in a let captured by an afterEach, or use vitest's tmpdir fixture if available.

🟡 LOW Contract string injection has no length/validation guard — src/runtime/strategy-author.ts

The new opts.contract field is interpolated directly into the LLM prompt at line 145 with no length cap or content validation. A caller passing an extremely long or malicious contract string could bloat the prompt, potentially pushing the losses table out of context or inflating token costs. Low severity because AuthorStrategyOptions is a trusted caller API (not user-facing), and the existing lossesJson already has no length guard either, but worth noting for defense-in-depth if the contract coordinate is ever exposed to less-trusted callers.

🟡 LOW Empty-string analystModel bypasses default — src/runtime/strategy.ts

Line 221: const analystModel = opts.analystModel ?? opts.model — the ?? operator only guards null/undefined, not empty string. If a caller passes analystModel: '', the empty string reaches createChatClient and observe as a model name, which may produce opaque router errors. The contract says 'Omitted ⇒ the worker's model' so this is a caller-serviceability nit, not a code bug per se.

🟡 LOW Temp directory leak in authorStrategy test — tests/loops/strategy-suite.test.ts

Line 311: mkdtempSync(join(tmpdir(), 'authored-test-')) creates a temp dir that is never cleaned up (no rmSync or afterEach cleanup). Each test run leaks a directory under /tmp. Add cleanup after the assertion: rmSync(dir, { recursive: true }) in a finally block.

🟡 LOW analystModel test checks model presence, not correct per-call routing — tests/loops/strategy-suite.test.ts

Lines 269-288: The assertion expect(models).toContain('test-model') and expect(models).toContain('critic-model') only verifies both models appear somewhere in all captured fetch calls — it does not verify that the shot call (index 0) used test-model and the analyst call (index >=1) used critic-model. A bug where models were swapped or where an extra call used the wrong model could pass. Consider asserting on specific captured indices: expect(captured[0]).toHaveProperty('model', 'test-model') and expect(captured[1]).toHaveProperty('model', 'critic-model').

🟡 LOW analystModel test couples to agent-eval transport internals — tests/loops/strategy-suite.test.ts

Lines 285-287: The test asserts that captured fetch bodies contain both 'test-model' and 'critic-model'. This works only because createChatClient({ transport: 'router' }) from @tangle-network/agent-eval happens to use global fetch internally. If agent-eval switches to a different HTTP transport (e.g. undici, node:http), the stub won't capture the critic-model request and the test will fail with a misleading assertion error. Consider adding a comment documenting this coupling, or mocking at the ChatClient interface level instead.

🟡 LOW hooks test assertion is trivially true, doesn't match test name — tests/loops/strategy-suite.test.ts

Lines 290-308: The test name says 'passes lifecycle hooks through to every cell' but only asserts expect(events.length).toBeGreaterThan(0). With budget=1, concurrency=1, tasks=[task], strategies=[oneShot], there is exactly 1 cell — the assertion is correct by construction but doesn't verify the 'every cell' property. If the test is expanded to multiple tasks/strategies, this assertion would pass even if hooks only fired for the first cell. Consider collecting events per-cell (e.g. resetting the array per task) or adding a stronger assertion like events.length >= strategies.length * tasks.length.


tangletools · 2026-06-10T11:34:21Z · trace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants