docs(examples): scientifically-rigorous coding benchmark across harnesses with controlled tool use #369

Merged: drewstone merged 7 commits into main from docs/coding-benchmark-example on Jun 24, 2026
@drewstone

Contributor

What

examples/coding-benchmark/ — a copy-pasteable example that runs one coding task across coding-agent harnesses, fairly and honestly, with real statistics, in ~8 small files of pure composition. Every moving part is an agent-runtime or agent-eval primitive; zero bespoke harness code.

What it does

  • Matrix sweep: runProfileMatrix over harness × baseline-profile × scenario: claude-code / opencode / codex / cli, each on its bare default profile (we measure the harness, not our scaffolding), across two held-out tasks (token-bucket rate limiter, RFC-4180 CSV parser).
  • One-line tool knob: withTools(profile, 'none'|'web'|'search-mcp') authors native web tools / a mounted MCP onto the profile; the sandbox substrate materializes it per harness. CLI: --tools.
  • Multi-round refine: openSandboxRun gives one persistent, resumable box; each round's prompt is built from the prior round's deterministic-check failures (.start() / .resume()).
  • Validators before judge: 3 deterministic checks (typecheck/test/lint, ~$0, in-box) + a realness anchor (scoreAuthenticity/gateRealness), then the LLM judge only on the band the checks can't resolve.
  • Judge layer: 1 judge by default (singleCodeJudge); --ensemble → 3 cross-family models (ensembleJudge, crossFamily:true), reduced by aggregateJudgeVerdicts. 4-dim weighted rubric (correctness 0.40 / completeness 0.25 / code_quality 0.20 / robustness 0.15).
  • Real stats: confidenceInterval (per-harness composite CI), wilson (pass-rate binomial CI), pairedBootstrap on matched scenarios, benjaminiHochberg to BH-correct across all harness pairs.
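
For readers unfamiliar with the pass-rate interval named in the last bullet, here is a minimal sketch of a Wilson score interval. `wilsonInterval` and its return shape are hypothetical names chosen for illustration; the signature of the agent-eval `wilson` primitive is not shown in this PR and may differ.

```typescript
// Illustrative Wilson score interval for a binomial pass rate.
// NOT the agent-eval `wilson` implementation; names here are assumptions.
interface Interval {
  low: number;
  high: number;
  center: number;
}

function wilsonInterval(passes: number, trials: number, z = 1.96): Interval {
  if (trials <= 0) throw new RangeError("trials must be positive");
  if (passes < 0 || passes > trials) throw new RangeError("passes out of range");
  const p = passes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  // The interval is centred on a shrunken estimate, not on the raw rate p.
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return {
    low: Math.max(0, center - half),
    high: Math.min(1, center + half),
    center,
  };
}

// 8 passes out of 10 cells gives roughly [0.49, 0.94] at z = 1.96.
console.log(wilsonInterval(8, 10));
```

Unlike the naive normal approximation, this interval stays inside [0, 1] and remains usable at the small per-harness sample sizes a two-scenario benchmark produces.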

The no-cheat firewall

The agent's context is scenario.prompt only. The validator commands, realness signals, and rubric note live on eval-only fields of the scenario, read after the loop by the validators/judge — never written into the box. The firewall is a labeled block in dispatch.ts (THE NO-CHEAT FIREWALL LIVES HERE); the realness anchor is write-only to the record (ctx.artifacts). Verified: the realness scan scores a real impl 85 vs a return true stub 35 (gated) on the sample tasks.
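
The field-level split described above can be sketched as follows. Only `prompt` comes from the PR text; `evalOnly`, `checkCommands`, `rubricNote`, and both helper functions are hypothetical names for illustration, not the shapes used in scenarios.ts or dispatch.ts.

```typescript
// Hypothetical scenario shape: one agent-visible field, the rest eval-only.
interface CodingScenario {
  id: string;
  prompt: string; // the ONLY field the agent may see
  evalOnly: {
    checkCommands: string[]; // typecheck / test / lint commands
    rubricNote: string; // guidance for the judge
  };
}

// Everything that reaches the box goes through this one function.
function agentContext(scenario: CodingScenario): string {
  return scenario.prompt;
}

// Validators and the judge read eval-only fields after the loop has finished.
function evalInputs(scenario: CodingScenario): CodingScenario["evalOnly"] {
  return scenario.evalOnly;
}

const example: CodingScenario = {
  id: "rate-limiter",
  prompt: "Implement a token-bucket rate limiter.",
  evalOnly: {
    checkCommands: ["node --test test/rate-limiter.test.js"],
    rubricNote: "Refill must be time-based.",
  },
};

console.log(agentContext(example));
```

Keeping the split structural means a reviewer can audit the firewall by reading one function rather than tracing every prompt-building call site.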

Offline by default, faithful live

Ships an in-process SandboxClient (offline-box.ts) so the whole pipeline compiles and runs with no creds — proven end-to-end (default: 8 records; --tools web --ensemble --reps 2: 16 records). --live swaps in a real @tangle-network/sandbox client + a real judge model; nothing else changes.

Verification

  • pnpm run lint — clean (328 files)
  • pnpm run typecheck (src) + pnpm run typecheck:examples — clean
  • Ran offline in all four modes (default / --ensemble / --tools web / --reps) — full leaderboard + significance matrix produced

One runtime gap worked around

runProfileMatrix rejects a bare model id; it requires a snapshot-versioned id (name@YYYY-MM-DD or name-YYYYMMDD) so records are reproducible. profiles.ts now uses dated, env-overridable model ids and documents why. (Caught by actually running it — typecheck was green but the first run threw ProfileMatrixError: ... lacks a snapshot version.)
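
A pre-flight check along these lines would surface the problem before the first cell runs. This is a sketch under assumptions: the regex and both function names are hypothetical and are not the runtime's own validation; the third accepted form (`name-YYYY-MM-DD`) is included because the review later in this thread notes that the runtime accepts it.

```typescript
// Hypothetical pre-flight check; the runtime's real rule may be stricter or looser.
// Accepts name@YYYY-MM-DD, name-YYYYMMDD, and name-YYYY-MM-DD suffixes.
const SNAPSHOT_SUFFIX = /(?:@\d{4}-\d{2}-\d{2}|-\d{8}|-\d{4}-\d{2}-\d{2})$/;

function hasSnapshotVersion(modelId: string): boolean {
  return SNAPSHOT_SUFFIX.test(modelId);
}

// `override` stands in for an environment variable value (undefined when unset).
function resolveModelId(override: string | undefined, fallback: string): string {
  const id = override?.trim() || fallback;
  if (!hasSnapshotVersion(id)) {
    throw new Error(`model id "${id}" lacks a snapshot version`);
  }
  return id;
}

console.log(resolveModelId(undefined, "anthropic/claude-sonnet-4-5-2025-09-29"));
```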

Operator + partner-facing — please review, do not merge yet.

…sses with controlled tool use

Add examples/coding-benchmark/ — runs one coding task across
claude-code/opencode/codex/cli baseline profiles × scenarios via
runProfileMatrix, with a one-line tool-surface knob, validators-before-judge
scoring, a 1-or-3 (ensemble) judge layer, and real paired-bootstrap + Wilson +
BH stats. The no-cheat firewall (agent context = scenario.prompt only) is
enforced and pinpointed in dispatch.ts. Ships an in-process SandboxClient so
the whole pipeline compiles and runs offline with no creds, and runs faithfully
live with --live + TANGLE_API_KEY.
tangletools previously approved these changes Jun 23, 2026

@tangletools tangletools left a comment

Contributor

✅ Auto-approved PR — 3b335ea1

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T23:02:36Z

@tangletools tangletools left a comment

Contributor

✅ Auto-approved PR — 9e98af79

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T23:09:45Z

@tangletools tangletools left a comment

Contributor

🟢 Value Audit — sound

Verdict: sound
Concerns: 2 (2 low)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 323.5s (2 bridge agents)
Total: 323.5s

💰 Value — sound

Adds a well-structured, copy-pasteable example showing how to compare coding-agent harnesses via runProfileMatrix with validators-before-judge, a no-cheat firewall, and real stats — a gap in the examples learning path.

  • What it does: Introduces examples/coding-benchmark/ (9 small files + README) that runs one coding task across claude-code / opencode / codex / cli baseline profiles via runProfileMatrix. Supports a one-line tool-surface knob (none/web/search-mcp), multi-round refine in one persistent box, deterministic in-box checks, a write-only realness anchor, an optional 3-model cross-family ensemble judge, and pa
  • Goals it achieves: Makes the "neutral harness measurer" concept from docs/eval-substrate.md concrete and runnable. Demonstrates the canonical campaign-substrate pattern for coding harness comparison, filling a real gap: no existing example shows runProfileMatrix used for coding tasks with the eval primitives (scoreAuthenticity, ensembleJudge, inMemoryCampaignStorage, paired stats). It is educational, copy-
  • Assessment: Good on its merits. The code composes existing agent-runtime / agent-eval primitives rather than inventing new substrate. The decomposition across files is clear (scenarios, profiles, tools, validators, judges, dispatch, stats, offline box, entrypoint). The no-cheat firewall is a useful, visible design invariant. It follows repo conventions: offline-by-default with --live, imports from packa
  • Better / existing approach: none — this is the right approach. I checked for existing equivalents: src/runtime/run-benchmark.ts is the optimization suite (compares strategies like sample/refine on one worker, not harnesses); examples/strategy-suite/ demos that. examples/product-eval/ uses runProfileMatrix but for persona-conversation product evals, not coding. bench/ contains real benchmark adapters (SWE-bench, Hum
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound

A coherent, well-integrated coding-benchmark example that composes real substrate primitives (runProfileMatrix, openSandboxRun, ensembleJudge, scoreAuthenticity, pairedBootstrap/Wilson/BH) with zero bespoke harness code — every import resolves, it follows the established product-eval pattern, and it

  • Integration: Fully wired and reachable. Verified against published substrate (agent-eval@0.99.0 + sandbox@0.9.0): every named import exists with the expected signature — runProfileMatrix/ProfileDispatchFn/DispatchContext/JudgeConfig/inMemoryCampaignStorage (campaign), scoreAuthenticity/gateRealness/AuthenticitySignals/ProducedFile (authenticity), confidenceInterval/wilson/pairedBootstrap/benjaminiHochberg/RunR
  • Fit with existing patterns: Hits the grain dead-center. It is the same runProfileMatrix<Scenario,Artifact> cell pattern already established by examples/product-eval/product-eval.ts:104, just richer (judges, multi-round refine, stats). No competing pattern, no reinvention — every moving part is a substrate primitive composed, not reimplemented. The no-cheat firewall is expressed as a structural field-level split on CodingScen
  • Real-world viability: Holds up on both paths. Offline: the in-process box (offline-box.ts) implements exactly the SandboxInstance surface openSandboxRun/settle call (streamPrompt yielding a terminal result event, fs.read/write, exec, delete); missing toolchain honestly reads as check-FAIL → loop exhausts maxRounds → stub judge → deterministic leaderboard (documented in offline-box.ts:11-13). Live: lazy-requires real Sa
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/coding-benchmark/benchmark.ts

  • console.log(

🟡 Cruft: magic number added examples/coding-benchmark/benchmark.ts

  •  `    this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.refillPerSec)\n` +
    

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260623T233115Z

@tangletools

Contributor

❌ Needs Work — 9e98af79

Readiness 26/100 · Confidence 70/100 · 17 findings (3 high, 4 medium, 10 low)

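| | opencode-kimi | glm | deepseek | aggregate |
|---|---|---|---|---|
| Readiness | 26 | 35 | 79 | 26 |
| Confidence | 70 | 70 | 70 | 70 |
| Correctness | 26 | 35 | 79 | 26 |
| Security | 26 | 35 | 79 | 26 |
| Testing | 26 | 35 | 79 | 26 |
| Architecture | 26 | 35 | 79 | 26 |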

Full multi-shot audit completed 2/2 planned shots over 11 changed files. Global verifier still owns final merge decision.

Blocking

🔴 HIGH sumTokens misses data.usage shape → --live cells fail integrity guard — examples/coding-benchmark/dispatch.ts

dispatch.ts:148-158 sumTokens() reads only event.data.tokenUsage.{input,output}Tokens. The substrate's canonical metering shapes (src/runtime/sandbox-events.ts:25-70, extractLlmCallEvent) are: type:'result' → data.usage.{input,output}Tokens; type:'done' → data.tokenUsage; type:'llm_call' → data.tokensIn/tokensOut. The offline-box.ts stub emits data.tokenUsage so offline meters correctly, but a real sandbox emitting type:'result' with data.usage yields sum=0 → the if (usage.input || usage.output) ctx.cost.observeTokens(usage) guard at dispatch.ts:115 skips observation → runProfileMatrix with integrity:'assert' throws BackendIntegrityError on every cell. README claims 'Nothing else in the example changes' between offline and live, which is not true as shipped. Fix: reuse the runtime's extr
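
The three metering shapes listed in this finding can be normalized as sketched below. `EventLike`, `usageOf`, and `sumTokens` here are hypothetical illustrations of the shape mismatch, not the substrate's `extractLlmCallEvent`, which remains the recommended fix. Whether a real stream reports the same tokens under more than one event type is not established in this thread, so a production version would need to guard against double counting.

```typescript
// Hypothetical minimal event shape; real sandbox events carry more fields.
interface EventLike {
  type: string;
  data?: Record<string, unknown>;
}

interface Usage {
  input: number;
  output: number;
}

const num = (v: unknown): number =>
  typeof v === "number" && Number.isFinite(v) ? v : 0;

const obj = (v: unknown): Record<string, unknown> =>
  typeof v === "object" && v !== null ? (v as Record<string, unknown>) : {};

// Map each documented shape onto one { input, output } pair.
function usageOf(ev: EventLike): Usage {
  const data = obj(ev.data);
  switch (ev.type) {
    case "result": {
      const u = obj(data["usage"]);
      return { input: num(u["inputTokens"]), output: num(u["outputTokens"]) };
    }
    case "done": {
      const u = obj(data["tokenUsage"]);
      return { input: num(u["inputTokens"]), output: num(u["outputTokens"]) };
    }
    case "llm_call":
      return { input: num(data["tokensIn"]), output: num(data["tokensOut"]) };
    default:
      return { input: 0, output: 0 };
  }
}

// Assumption: each event reports distinct tokens (no overlap between types).
function sumTokens(events: EventLike[]): Usage {
  return events.reduce<Usage>(
    (acc, ev) => {
      const u = usageOf(ev);
      return { input: acc.input + u.input, output: acc.output + u.output };
    },
    { input: 0, output: 0 },
  );
}

console.log(
  sumTokens([{ type: "result", data: { usage: { inputTokens: 120, outputTokens: 45 } } }]),
);
```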

🔴 HIGH missing test files break offline deterministic checks — examples/coding-benchmark/scenarios.ts

scenarios.ts:75 and :106 run node --test test/rate-limiter.test.js and node --test test/csv.test.js, but those files are not present in the PR. Verified empirically: node --test <missing-file> exits 1 with 'Could not find ...'. Offline, the test validator therefore always fails, the refine loop never breaks early, and the example cannot demonstrate passing deterministic checks. Fix by adding the referenced test files (and any fixtures) or by documenting that offline mode intentionally skips tests.

🔴 HIGH paired bootstrap mishandles reps > 1 — examples/coding-benchmark/stats.ts

stats.ts:71-82 builds bByScenario = new Map(b.map((r) => [r.scenarioId ?? '', r])). With reps>1, multiple records share the same scenarioId, so the Map keeps only the last record per scenario. The resulting paired arrays contain duplicated, arbitrarily matched observations instead of matched (scenario, rep) pairs, violating the paired-bootstrap assumption documented in pairedBootstrap. Reproduced conceptually: with reps=2, aScores has four entries while bScores repeats the same two values. Fix by matching on a unique (scenarioId, rep) key or aggregate to one score per scenario before pairing.
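
The first of the two suggested fixes, matching on a unique (scenarioId, rep) key, might look like the sketch below. The `ScoreRecord` shape, including the `rep` field, and the `pairScores` helper are hypothetical; the real records in stats.ts carry more fields and may name the repetition index differently.

```typescript
// Hypothetical record shape for illustration only.
interface ScoreRecord {
  scenarioId: string;
  rep: number;
  score: number;
}

interface PairedScores {
  aScores: number[];
  bScores: number[];
}

// Pair observations from two harnesses on a composite (scenarioId, rep) key,
// so reps > 1 no longer collapse onto the last record per scenario.
function pairScores(a: ScoreRecord[], b: ScoreRecord[]): PairedScores {
  const key = (r: ScoreRecord): string => `${r.scenarioId}#${r.rep}`;
  const bByKey = new Map(b.map((r) => [key(r), r.score] as const));
  const aScores: number[] = [];
  const bScores: number[] = [];
  for (const r of a) {
    const other = bByKey.get(key(r));
    if (other === undefined) continue; // unmatched cells are dropped, not guessed
    aScores.push(r.score);
    bScores.push(other);
  }
  return { aScores, bScores };
}

const harnessA: ScoreRecord[] = [
  { scenarioId: "csv", rep: 0, score: 0.7 },
  { scenarioId: "csv", rep: 1, score: 0.8 },
];
const harnessB: ScoreRecord[] = [
  { scenarioId: "csv", rep: 1, score: 0.6 },
  { scenarioId: "csv", rep: 0, score: 0.5 },
];
console.log(pairScores(harnessA, harnessB));
```

The two arrays always come back the same length and index-aligned, which is the precondition a paired test relies on.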

Other

🟠 MEDIUM No tests for the example; offline smoke would have caught the live-mode gap — examples/coding-benchmark/benchmark.ts

tsconfig.examples.json + biome.json include examples in typecheck/lint, but there is no test (no .test.ts, no vitest spec) asserting the offline pipeline produces a realness-scored, judge-scored, stats-rendered output. A single vitest that runs main() in offline mode and asserts allRecords.length === harnessProfiles.length * scenarios.length * reps and that pairwiseStats returns a defined leaderboard would cost ~20 lines and lock in the contract the README claims. The repo already ships vitest (vitest.config.ts at root).

🟠 MEDIUM require() used in ESM live path — examples/coding-benchmark/benchmark.ts

benchmark.ts:86-88 uses require('@tangle-network/sandbox') inside if (live). The package has "type": "module". While pnpm tsx polyfills require() for .ts files, running the compiled ESM output directly throws ReferenceError: require is not defined. Replace with dynamic import('@tangle-network/sandbox') and an async client factory.
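
A minimal sketch of the suggested replacement, assuming an async factory is acceptable at the call site. `loadSandboxClient` is a hypothetical helper name; the module specifier is passed as a variable so the sketch does not assume the SDK is installed, and the only export it relies on is the `SandboxClient` name that the finding itself quotes.

```typescript
// Hypothetical async factory: resolves the SDK only when --live is set.
async function loadSandboxClient(
  live: boolean,
  specifier = "@tangle-network/sandbox",
): Promise<unknown> {
  if (!live) return null; // offline path never touches the SDK
  // Dynamic import() works in both ESM and CommonJS builds, unlike require().
  const mod = await import(specifier);
  return mod.SandboxClient ?? null;
}

loadSandboxClient(false).then((client) => {
  console.log("offline client:", client);
});
```

Because the import is deferred until `live` is true, an offline run does not need the SDK to resolve at all.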

🟠 MEDIUM ensemble judge has less context than single judge — examples/coding-benchmark/judges.ts

singleCodeJudge (line 113-124) builds judgeUserPrompt that includes the task prompt, the rubric note, AND deterministic check results (typecheck/test/lint pass/fail). But ensembleCodeJudge (line 130-146) passes only artifact.solution and scenario?.prompt to its scoreOne callback — the rubric note and check results are dropped. This means the 3-model cross-family ensemble judges on LESS information than the single judge, making it strictly worse for accuracy even though it's presented as the more r

🟠 MEDIUM realness validator is recorded but never gates the judge — examples/coding-benchmark/validators.ts

validators.ts:74-89 realnessValidator returns {valid,score} and dispatch.ts:128 writes it to ctx.artifacts, but no consumer reads artifact.realness. judges.ts:107-116 singleCodeJudge scores artifact.solution directly with no reference to artifact.realness.valid. README.md:46 claims 'Realness anchor ... catches a stub that compiles but fakes the hard part' and the validators-before-judge ordering implies the anchor gates the judge band. As shipped, a gated stub still receives a full judge score; only the offline writeJson records the gate. Either route artifact.realness into the judge prompt (so it can downweight) or short-circuit the judge when realness.valid===false (return composite 0). At minimum, tighten the README to say the anchor is recorded-for-honesty, not judge-gating.
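
The short-circuit option can be sketched with the rubric weights from the PR description. Everything else is an assumption for illustration: `gatedJudge`, the `Verdict` shape, and the 0-to-1 score scale are hypothetical, and the judge is modeled as a plain callback rather than a model call.

```typescript
// Weights from the PR description; dimension scores are assumed to lie in [0, 1].
const WEIGHTS = {
  correctness: 0.4,
  completeness: 0.25,
  code_quality: 0.2,
  robustness: 0.15,
} as const;

type Dimension = keyof typeof WEIGHTS;
type DimensionScores = Record<Dimension, number>;

function composite(scores: DimensionScores): number {
  return (Object.keys(WEIGHTS) as Dimension[]).reduce(
    (sum, d) => sum + WEIGHTS[d] * scores[d],
    0,
  );
}

interface Verdict {
  composite: number;
  judged: boolean; // false when the gate short-circuited the judge
}

// Hypothetical gate: a failed realness check returns 0 without calling the judge.
function gatedJudge(realnessValid: boolean, judge: () => DimensionScores): Verdict {
  if (!realnessValid) return { composite: 0, judged: false };
  return { composite: composite(judge()), judged: true };
}

const scores: DimensionScores = {
  correctness: 1,
  completeness: 0.8,
  code_quality: 0.5,
  robustness: 0.4,
};
console.log(gatedJudge(true, () => scores)); // composite ≈ 0.76
console.log(gatedJudge(false, () => scores));
```

Short-circuiting also saves the judge call for gated stubs, which matters once the ensemble is enabled.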

🟡 LOW Tool-knob description is imprecise — examples/README.md

Line 47 says 'a one-line tool knob (websearch / webfetch / MCP)'. The actual preset names in examples/coding-benchmark/tools.ts:29 are 'none' | 'web' | 'search-mcp'; 'websearch' and 'webfetch' are toggled together by the 'web' preset, and the MCP preset is specifically 'search-mcp', not a generic MCP knob. Also, the Run section (lines 109-110) does not demonstrate the --tools flag despite highlighting it as a feature. Fix: change parenthetical to '(none / web / search-mcp)' and optionally add a command like `pnpm tsx examples/coding-benchmark/ben

🟡 LOW Offline refine loop is a no-op: script.solutionFor ignores the round argument — examples/coding-benchmark/benchmark.ts

benchmark.ts:34-58 offlineSolutions.{rate-limiter,csv-parser}.solutionFor is declared as (round: number) => string but the implementations close over no round state and return the same source each call. offline-box.ts:43-45 increments round and calls script.solutionFor(round), so rounds 2 and 3 write byte-identical content to the same path. The dispatch's multi-round refine loop (dispatch.ts:99-115) therefore produces the same checks output every iteration. offline-box.ts:33 docstring even claims 'round 2 can differ from round 1 (refine demo)' — the capability is in the interface but unexercised. Either drop the round parameter from OfflineScript.solutionFor (clarity), or seed a 2-round refine script for one scenario (demo value).

🟡 LOW require('@tangle-network/sandbox') only resolves under tsx shim, not pure ESM node — examples/coding-benchmark/benchmark.ts

benchmark.ts:79 const { SandboxClient: RealClient } = require('@tangle-network/sandbox'). The repo is "type": "module" (package.json:34). Pure node would throw ReferenceError: require is not defined. tsx (the documented runner) shims it, so the README command works, but a partner copying the snippet into an ESM script breaks. Use a top-level await import('@tangle-network/sandbox') guarded behind the live flag, or document the tsx-only constraint in the README's 'Going live' section.

🟡 LOW iteration: maxRounds passes a constant 3 instead of the actual final round — examples/coding-benchmark/dispatch.ts

dispatch.ts:124-127 calls realnessValidator(...).validate(artifact, { iteration: maxRounds, signal: ctx.signal }). maxRounds is the constant 3 (dispatch.ts:31), but ValidationCtx.iteration is documented (src/runtime/types.ts:31) as 'Iteration index this output came from (0-based)'. The realnessValidator implementation ignores ctx (validators.ts:75-76 declares only artifact), so this is benign today, but the value is semantically wrong and would mislead a future validator that reads ctx.iteration. Capture the loop's round variable into a const before the validator call and pass that.

🟡 LOW run.box as unknown as RunBox declares an fs.read that is never used — examples/coding-benchmark/dispatch.ts

dispatch.ts:46 type RunBox = CheckBox & { fs: { read(path: string): Promise } }. dispatch.ts:117 cast run.box as unknown as RunBox. runBoxChecks (validators.ts:62-70) only calls box.exec(command); the fs.read member is dead. The real SandboxInstance does expose get fs(): FileSystem (sandbox-CNcyBhPp.d.ts:4545) so the cast is harmless at runtime, but the type assertion is wider than the actual call surface and invites a reader to think runBoxChecks reads files. Drop fs.read from RunBox (it's just CheckBox).

🟡 LOW exported OfflineQuality type is dead code — examples/coding-benchmark/offline-box.ts

Line 35 exports OfflineQuality = 'real' | 'stub' but grep across the entire repo returns only the definition site. The type is referenced nowhere — no import, no usage. The comment on line 32-34 describes it as a fidelity level for canned solutions, but the concept is never wired. Dead code in an example is misleading.

🟡 LOW model snapshot format comment mismatch — examples/coding-benchmark/profiles.ts

profiles.ts:43-45 states model ids must be name@YYYY-MM-DD or name-YYYYMMDD, but the actual defaults (e.g., anthropic/claude-sonnet-4-5-2025-09-29) use provider/name-YYYY-MM-DD. Runtime accepts them, so the comment is misleading. Update the comment to the accepted format.

🟡 LOW Benjamini-Hochberg correction is degenerate with constant p-proxy — examples/coding-benchmark/stats.ts

Line 118 uses r.low > 0 || r.high < 0 ? 0.04 : 0.5 as a p-value proxy for all pairs whose bootstrap CI excludes zero. Because all 'significant' pairs share the exact same proxy p-value (0.04), the BH procedure cannot distinguish between them by true effect size. The result: for n harnesses with C(n,2) pairs, any pair needs to rank at position k >= 0.04 * C(n,2) / 0.05 to pass BH — e.g., with 4 harnesses (6 pairs), at least 5 of 6 pairs must have CI excluding zero before ANY pair passes BH. This makes BH correction overly conservative, effectively masking real pairwise differences. For the offline stub this is harmless (all deltas are 0.000), but for a l

🟡 LOW leaderboard harness names include opaque hash suffixes — examples/coding-benchmark/stats.ts

stats.ts:58-67 keys grouping on r.agentProfile?.profileId ?? r.candidateId. Actual run output shows labels like claude-code-baseline-91b9b8dc6fe218e5 instead of claude-code-baseline, making the leaderboard harder to read. Use a human-readable profile name for display while keeping the stable id for grouping if needed.

🟡 LOW pProxy 0.04/0.5 is a hand-rolled p-value feed into benjaminiHochberg — examples/coding-benchmark/stats.ts

stats.ts:118 const pProxy = raw.map((r) => (r.low > 0 || r.high < 0 ? 0.04 : 0.5)). The agent-eval substrate ships real significance primitives (pairedTTest, mannWhitneyU, wilcoxonSignedRank, mcnemar — all exported from index.d.ts). Feeding BH a binary 0.04/0.5 proxy loses power information and is explicitly a placeholder. Comment labels it 'proxy' which is honest; for a benchmark marketed as 'scientifically rigorous', wire a real paired test on the matched scores instead.
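
The degeneracy described in the two pProxy findings can be checked with a small step-up sketch. `bhReject` is a hypothetical stand-alone implementation of the Benjamini-Hochberg procedure written for this illustration; it is not the agent-eval `benjaminiHochberg` export, whose signature is not shown in this thread.

```typescript
// Illustrative Benjamini-Hochberg step-up procedure.
// Returns, per input position, whether that hypothesis is rejected at level alpha.
function bhReject(pValues: number[], alpha = 0.05): boolean[] {
  const m = pValues.length;
  const order = pValues.map((p, i) => ({ p, i })).sort((x, y) => x.p - y.p);
  // Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
  let cutoff = -1;
  order.forEach(({ p }, rank) => {
    if (p <= ((rank + 1) / m) * alpha) cutoff = rank;
  });
  const rejected = new Array<boolean>(m).fill(false);
  for (const { i } of order.slice(0, cutoff + 1)) rejected[i] = true;
  return rejected;
}

// Six harness pairs, constant 0.04 proxy for four of them: nothing passes.
console.log(bhReject([0.04, 0.04, 0.04, 0.04, 0.5, 0.5]));
// Same six pairs with distinct p-values from a real paired test: three pass.
console.log(bhReject([0.001, 0.01, 0.02, 0.04, 0.5, 0.5]));
```

With a constant 0.04 the largest threshold reached by four tied pairs is (4/6) × 0.05 ≈ 0.033, so none is rejected; only at five tied pairs does the threshold (5/6) × 0.05 ≈ 0.042 exceed the proxy.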


tangletools · 2026-06-23T23:36:53Z · trace

@tangletools tangletools left a comment

Contributor

❌ 3 Blocking Findings — 9e98af79

Full multi-shot audit completed 2/2 planned shots over 11 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-23T23:36:53Z · immutable trace

…dings, drop 3 reinventions

Reworks examples/coding-benchmark to compose published agent-eval / agent-runtime
primitives instead of hand-rolling them, fixing every PR-369 review finding.

Reinventions removed (now substrate primitives):
- hand-rolled judge (judgeSystemPrompt/judgeUserPrompt/score) -> llmJudge / ensembleJudge
- sumTokens (read only data.tokenUsage; summed 0 under a real sandbox emitting
  data.usage -> integrity:'assert' threw on every --live cell) -> extractLlmCallEvent
- runBoxChecks flat pass/fail -> MultiLayerVerifier ordered pipeline (typecheck ->
  test -> lint, dependency-based skip, blended score)

Findings: --live token shapes fixed + verified; real seeded fixture tests so the
deterministic checks have a file to run; correct paired-scenario stats keying; an
offline vitest smoke; dynamic import() of the SDK behind --live; the ensemble now
sees the same full context as the single judge; the realness gate ACTUALLY gates
the judge (short-circuits to composite 0, no model call); honest tool-knob docs +
--tools command; a real cross-round offline refine demo; dead RunBox.fs.read /
OfflineQuality removed; real paired test (pairedTTest) feeding BH real p-values;
human-readable harness names on the leaderboard.

Consolidated validators+judges -> eval.ts and tools -> profiles.ts (9 source files
-> 7 + a smoke test). Every README claim is now matched to the code.

Bumps the agent-eval devDependency to >=0.99.0 (llmJudge). The peerDependency floor
is unchanged — only the example uses llmJudge, src does not.
tangletools previously approved these changes Jun 24, 2026

@tangletools tangletools left a comment

Contributor

✅ Auto-approved PR — 543881c8

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T08:59:17Z

@tangletools tangletools left a comment

Contributor

🟢 Value Audit — sound

Verdict: sound
Concerns: 2 (2 low)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 366.9s (2 bridge agents)
Total: 366.9s

💰 Value — sound

Adds a copy-pasteable examples/coding-benchmark/ that composes agent-runtime/agent-eval primitives to run a rigorous, firewalled coding benchmark across harnesses with validators-before-judge, realness gating, and honest paired statistics — a good addition with no existing equivalent.

  • What it does: Introduces examples/coding-benchmark/ (~8 files) demonstrating how to run one coding task across a matrix of harnesses × baseline profiles × scenarios, with a one-line tool preset knob, multi-round refinement in one persistent sandbox, deterministic checks (typecheck/test/lint via MultiLayerVerifier), a structural realness gate that can short-circuit the judge, optional cross-family ensemble judgi
  • Goals it achieves: Gives partners and contributors a canonical, copy-pasteable example of a scientifically honest coding-agent benchmark using the substrate primitives; demonstrates the no-cheat firewall (agent only sees scenario.prompt, eval criteria stay outside the box); shows how to swap tool surfaces via the profile without per-harness config; and validates the wiring with an offline vitest smoke test.
  • Assessment: Good change, built in the grain of the codebase. It uses the published primitives the docs point to: runProfileMatrix (benchmark.ts:32,202), openSandboxRun for persistent resumable boxes (dispatch.ts:86), MultiLayerVerifier (eval.ts:170), llmJudge/ensembleJudge (eval.ts:244,264), scoreAuthenticity/gateRealness (eval.ts:199-200), and the stats primitives (stats.ts:22-28). The AgentPro
  • Better / existing approach: none — this is the right approach. I checked examples/ and bench/ for existing equivalents: examples/product-eval uses runProfileMatrix but for persona conversations, not coding; examples/ui-audit uses runLoop over a UI-audit client; bench/search-bench/run.mts compares coding harnesses but is internal research infrastructure, not a campaign-matrix example, and lacks the validators-before-judge
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound

A well-integrated, offline-runnable example that composes real agent-eval/agent-runtime primitives to compare coding harnesses fairly with rigorous statistics; nothing existing does this and every integration point verifies.

  • Integration: Fully reachable and wired. Registered in examples/README.md entry 9b (Tier 2) with CLI run commands; covered by tsconfig.examples.json (pnpm run typecheck:examples passes clean); has a passing vitest smoke test (3 tests). I installed deps and ran pnpm tsx examples/coding-benchmark/benchmark.ts after pnpm build — it produced exactly the documented output (8 records = 4 harnesses × 2 scenarios ×
  • Fit with existing patterns: Composes the right primitives in the grain of the codebase. Follows the established examples convention (imports from published package surface, not relative paths; runs from the repo tsx). Uses the documented agent-eval campaign pattern (runProfileMatrix + ProfileDispatchFn + JudgeConfig) — the same pattern product-eval/ uses for a different domain (user-sim persona eval), so it is consistent, no
  • Real-world viability: Holds up. The offline path genuinely exercises the whole pipeline: the real matrix runs, MultiLayerVerifier executes real tsc/biome/node --test commands (which fail fast offline because the toolchain isn't on PATH — the documented honest signal, not a fake pass), the realness gate runs for real and the smoke test asserts it both catches a return true stub (gated→0) and passes a real token-bucket
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/coding-benchmark/benchmark.ts

  • console.log(

🟡 Cruft: magic number added examples/coding-benchmark/benchmark.ts

  •      `    this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.refillPerSec)\n` +
    

value-audit · 20260624T103113Z

@tangletools

Contributor

✅ No Blockers — 543881c8

Readiness 60/100 · Confidence 80/100 · 17 findings (2 medium, 15 low)

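| | opencode-kimi | glm | deepseek | aggregate |
|---|---|---|---|---|
| Readiness | 80 | 80 | 60 | 60 |
| Confidence | 80 | 80 | 80 | 80 |
| Correctness | 80 | 80 | 60 | 60 |
| Security | 80 | 80 | 60 | 60 |
| Testing | 80 | 80 | 60 | 60 |
| Architecture | 80 | 80 | 60 | 60 |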

Full multi-shot audit completed 4/4 planned shots over 12 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM parseArgs produces NaN/incorrect values when flags follow value options — examples/coding-benchmark/benchmark.ts

Lines 53-61: The opt() helper doesn't validate that the next argument is not a flag. --reps --live produces reps = NaN (Number('--live')), and --tools --live sets toolPreset = '--live' (invalid ToolPreset). Verified: node -e 'parseArgs(["--reps","--live"]).reps' → NaN, 'parseArgs(["--tools","--live"]).toolPreset' → "--live". Fix: check that argv[i+1] doesn't start with --, or use a proper arg parser.
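
One way to apply the first suggested fix is sketched below. This is not the code in benchmark.ts: the helper bodies, the error messages, and the defaults (`reps` of 1, `toolPreset` of 'none') are assumptions for illustration; only the flag names and the three preset names come from the PR.

```typescript
const TOOL_PRESETS = ["none", "web", "search-mcp"] as const;
type ToolPreset = (typeof TOOL_PRESETS)[number];

// Read the value following `flag`, refusing a missing value or another flag.
function opt(argv: string[], flag: string): string | undefined {
  const i = argv.indexOf(flag);
  if (i === -1) return undefined;
  const value = argv[i + 1];
  if (value === undefined || value.startsWith("--")) {
    throw new Error(`${flag} requires a value`);
  }
  return value;
}

interface Args {
  reps: number;
  toolPreset: ToolPreset;
  live: boolean;
  ensemble: boolean;
}

function parseArgs(argv: string[]): Args {
  const reps = Number(opt(argv, "--reps") ?? "1"); // default is an assumption
  if (!Number.isInteger(reps) || reps < 1) {
    throw new Error("--reps must be a positive integer");
  }
  const tools = opt(argv, "--tools") ?? "none"; // default is an assumption
  if (!(TOOL_PRESETS as readonly string[]).includes(tools)) {
    throw new Error(`--tools must be one of ${TOOL_PRESETS.join(", ")}`);
  }
  return {
    reps,
    toolPreset: tools as ToolPreset,
    live: argv.includes("--live"),
    ensemble: argv.includes("--ensemble"),
  };
}

console.log(parseArgs(["--reps", "2", "--tools", "web", "--ensemble"]));
```

Failing loudly on `--reps --live` is preferable to silently running a matrix with `NaN` repetitions.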

🟠 MEDIUM devDependency version (>=0.99.0) diverges from peerDependency (>=0.97.0) for @tangle-network/agent-eval — package.json

Line 92 (devDependencies): "@tangle-network/agent-eval": ">=0.99.0 <1.0.0" — bumped from ">=0.97.0". Line 120 (peerDependencies): "@tangle-network/agent-eval": ">=0.97.0 <1.0.0" — NOT bumped. agent-eval is a REQUIRED peer dependency (not listed in peerDependenciesMeta optional, line 126-138). Gap: consumers told they can use 0.97.0–0.98.x but dev now requires 0.99.0. Verified: no src/ changes in this PR, so library API is unchanged and consumers w

🟡 LOW Non-standard table key '9b' — examples/README.md

Line 47: The new row key 9b is non-standard in an otherwise numeric sequence (1-22). It's intentional (interleaves with 9 for ui-audit to avoid renumbering the full tier list), but could confuse readers expecting pure numeric ordering. Consider 10 and shift everything down, or add a footnote explaining the b suffix.

🟡 LOW README command depends on undeclared tsx — examples/coding-benchmark/README.md

Lines 6-16 document pnpm tsx examples/coding-benchmark/benchmark.ts, but tsx is not listed in this repo's devDependencies and node_modules/.bin/tsx does not exist after pnpm install --frozen-lockfile. The command happens to work in this environment only because a global tsx binary is on PATH (/home/drew/code/cli-bridge/node_modules/.bin/tsx). In a clean checkout without a global tsx, the documented command will fail. Fix: add tsx to devDependencies, or change the documented command to a runner that is guaranteed (e.g. node --experimental-strip-types or pnpm exec tsc && node ...).

🟡 LOW Offline csv-parser cannot detect row separators — examples/coding-benchmark/benchmark.ts

Lines 88-96: The offline CSV parser checks c === '\\n' to detect row splits, but c is a single character from charAt() and '\\n' is a two-character string — the comparison is always false. Verified via node simulation: multi-row input 'a,b\\nc,d' produces one row [['a','b\\nc','d']] instead of two. The test fixture only tests single-row cases so this is latent, but the example code is incorrect for multi-row CSV.
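
For illustration, a small parser that compares each character against the real one-character newline. This is a sketch and not the offline solution from benchmark.ts; the function body is an assumption, and it simplifies RFC 4180 by dropping carriage returns outside quoted fields.

```typescript
// Sketch of a quote-aware CSV parser; not the code from benchmark.ts.
function parseCsv(input: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = '';
  let inQuotes = false;

  for (let i = 0; i < input.length; i++) {
    const c = input.charAt(i);
    if (inQuotes) {
      if (c === '"' && input.charAt(i + 1) === '"') {
        field += '"'; // escaped quote inside a quoted field
        i++;
      } else if (c === '"') {
        inQuotes = false;
      } else {
        field += c; // commas and newlines are literal while quoted
      }
    } else if (c === '"') {
      inQuotes = true;
    } else if (c === ',') {
      row.push(field);
      field = '';
    } else if (c === '\n') {
      // '\n' is ONE character here, so it can equal a charAt() result.
      row.push(field);
      rows.push(row);
      row = [];
      field = '';
    } else if (c !== '\r') {
      field += c; // simplification: CR outside quotes is dropped
    }
  }
  // Flush the last row when the input does not end with a newline.
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}
```

A multi-row fixture such as `parseCsv('a,b\nc,d')` would have exposed the original bug, since it should yield two rows rather than one.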

🟡 LOW benchmark runDir is created but never cleaned up — examples/coding-benchmark/benchmark.ts

Line 176 creates a temp directory with mkdtempSync(join(tmpdir(), 'coding-benchmark-')) and passes it to runProfileMatrix, but main never removes it. After running the test suite multiple times, /tmp/coding-benchmark-* directories accumulate. Fix: wrap the matrix run in a try/finally and rmSync(runDir, { recursive: true, force: true }) after records are returned.
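
One way to apply the suggested try/finally, sketched with Node's standard `fs`, `os`, and `path` modules. `withTempDir` and `demo` are hypothetical names introduced here, and the callback stands in for the matrix run.

```typescript
import { existsSync, mkdtempSync, rmSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper: create a temp dir, hand it to `fn`, and always remove it.
async function withTempDir<T>(prefix: string, fn: (dir: string) => Promise<T>): Promise<T> {
  const dir = mkdtempSync(join(tmpdir(), prefix));
  try {
    return await fn(dir);
  } finally {
    // Runs on success and on throw; `force` tolerates an already-missing dir.
    rmSync(dir, { recursive: true, force: true });
  }
}

// Usage sketch: record the directory, then check it is gone after the run.
async function demo(): Promise<boolean> {
  let seen = '';
  await withTempDir('coding-benchmark-', async (dir) => {
    seen = dir; // a real caller would pass `dir` to the matrix run here
  });
  return existsSync(seen);
}
```

Returning the callback's result from inside the `try` keeps the records available to the caller while the cleanup still runs.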

🟡 LOW Advisory lint failures are invisible to the refine prompt — examples/coding-benchmark/dispatch.ts

nextPrompt at dispatch.ts:51 calls layerOutput(report, 'lint'), whose passed field is true because checkLayer for advisory layers returns status: 'pass' regardless of the biome result (eval.ts:135-144). This keeps lint from blocking allPass (correct), but it also means biome warnings never reach the agent in the refine prompt, so the agent cannot correct style issues. Fix: include the lint layer's actual ok/output in the refine prompt while still treating lint as advisory for allPass.
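
A sketch of separating the gating decision from the feedback text. The `LayerResult` shape and both function names are assumptions for illustration; they are not the `checkLayer` or `layerOutput` signatures from the example.

```typescript
// Hypothetical per-layer result; `ok` is the tool's real outcome.
interface LayerResult {
  name: string;
  ok: boolean;
  output: string;
  advisory: boolean; // advisory layers never block allPass
}

// Gating: only non-advisory layers can fail the round.
function allPass(layers: LayerResult[]): boolean {
  return layers.every((l) => l.advisory || l.ok);
}

// Feedback: every failing layer, advisory or not, reaches the refine prompt.
function refineFeedback(layers: LayerResult[]): string {
  return layers
    .filter((l) => !l.ok)
    .map((l) => `[${l.name}${l.advisory ? ' (advisory)' : ''}] ${l.output}`)
    .join('\n');
}
```

Keeping the raw `ok` on the result and applying the advisory rule only inside `allPass` means the lint output is not lost before the prompt is built.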

🟡 LOW Multiple unsafe 'as unknown as' type casts bypass type safety — examples/coding-benchmark/dispatch.ts

Line 120: run.box as unknown as CheckBox — double-casts the sandbox instance to the minimal CheckBox interface. Line 168: extractLlmCallEvent(ev as never, 'agent') — casts each stream event to never before passing to the extractor. In offline-box.ts lines 64, 100: as unknown as SandboxEvent and as unknown as SandboxInstance. These all suppress type checking. While the

🟡 LOW Ensemble judge model names lack snapshot dates — contradicts profiles.ts own rule — examples/coding-benchmark/eval.ts

eval.ts:ensembleCodeJudge (line ~213) hardcodes models: ['deepseek-chat', 'gpt-4o-mini', 'gemini-flash'] — bare aliases with no snapshot date. profiles.ts:harnessModel (line ~43) explicitly requires provider/name-YYYY-MM-DD form and documents why: 'a run record without the exact model snapshot is not reproducible.' The ensemble judges also produce scored records; using aliases undermines the reproducibility property the example otherwise enforces. If the live router rejects aliases, the ensemble silently fails. Fix: use snapshot-dated IDs consistent with profiles.ts.

🟡 LOW Unquoted paths in shell-executed check and seed commands — examples/coding-benchmark/eval.ts

eval.ts:seedFile (line ~103): box.exec(`mkdir -p ${dir} && printf %s '${b64}' | base64 -d > ${file.path}`) interpolates file.path and dir unquoted. scenarios.ts:typecheckCmd/testCmd/lintCmd (lines ~55-60) follow the same pattern: tsc ... ${path}, node --test ${fixturePath}, biome check ${path}. Current corpus paths (test/rate-limiter.test.js, src/csv.ts) are safe and eval-controlled (never agent-influenced), so this is not an active vulnerability. But as example code that partners will copy, it sets a latent injection precedent. Fix: single-quote paths: mkdir -p '${dir}' ... > '${file.path}'.
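
A sketch of POSIX single-quoting that also survives a path containing a single quote. `shellQuote` and `seedCommand` are hypothetical helpers, not functions from eval.ts, and the command string only mirrors the one quoted in the finding.

```typescript
// Wrap a word in single quotes for a POSIX shell. An embedded single quote
// is written as '\'' (close the quote, escaped quote, reopen the quote).
function shellQuote(word: string): string {
  return "'" + word.replace(/'/g, "'\\''") + "'";
}

// Hypothetical builder for the seeding command described above.
function seedCommand(path: string, b64: string): string {
  const slash = path.lastIndexOf('/');
  const dir = slash >= 0 ? path.slice(0, slash) : '.';
  return `mkdir -p ${shellQuote(dir)} && printf %s ${shellQuote(b64)} | base64 -d > ${shellQuote(path)}`;
}
```

Inside single quotes the shell performs no expansion, so spaces, `$`, and backticks in a path stay literal. Writing the file through a structured filesystem API, where one is available, avoids the shell entirely.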

🟡 LOW llmJudge imported from main index breaks peer-floor (>=0.97.0) consumers — examples/coding-benchmark/eval.ts

Line 1: import { ... llmJudge ... } from '@tangle-network/agent-eval'. Verified that llmJudge IS exported from the main index in 0.99.0 (the devDep floor >=0.99.0) but is NOT in 0.97.x/0.98.x (within the declared peerDependencies range >=0.97.0). A downstream consumer who pins agent-eval to the peer floor and compiles this example gets an "llmJudge is not exported" error. Fix: import llmJudge from @tangle-network/agent-eval/campaign (exported in both versions). Every other symbol used exists in both.

🟡 LOW seedFile shells out with unquoted file paths — examples/coding-benchmark/eval.ts

Line 105 builds mkdir -p ${dir} && printf %s '${b64}' | base64 -d > ${file.path}. The current scenario paths are fixed (src/..., test/...), but the path is interpolated unquoted, so a scenario path containing spaces or shell metacharacters would break or inject. Fix: quote the path (e.g., use JSON.stringify(file.path) as the shell word) or, better, seed fixtures via box.fs.write instead of box.exec.

🟡 LOW seedFile uses unsanitized paths in shell command — examples/coding-benchmark/eval.ts

Line 105: await box.exec('mkdir -p ${dir} && printf %s ... base64 -d > ${file.path}') interpolates file.path and dir directly into a shell command. Currently safe because scenarios.ts uses hardcoded paths, but this is a latent injection vector if scenarios are ever loaded from external config. The base64 encoding itself is safe (standard b64 has no quotes/shell metachar). Note: scenario.fixture.content is NOT escaped against single quotes in the printf %s '${b64}' wrapper, but base64 output never contains single quotes.

🟡 LOW Test fixture .js imports .ts — node --test needs a TS loader — examples/coding-benchmark/scenarios.ts

Lines 64-68: node --test test/rate-limiter.test.js where the .js fixture does import { RateLimiter } from '../src/rate-limiter.ts'. Without type stripping enabled (for example the --experimental-strip-types flag on Node 22.6+) or a loader such as --loader ts-node/esm or --import tsx, Node.js cannot resolve TypeScript imports from JavaScript. The README acknowledges this fails offline, but the command itself is structurally incorrect for any vanilla Node.js environment. In a real harness box the toolchain may be pre-configured, but the check commands offer no fallback.

🟡 LOW Test fixtures import .ts files from .js — requires Node 23.6+ type stripping — examples/coding-benchmark/scenarios.ts

Fixture test/rate-limiter.test.js (line ~82) does import { RateLimiter } from '../src/rate-limiter.ts' and test/csv.test.js (line ~143) does import { parseCsv } from '../src/csv.ts'. The check command is node --test test/rate-limiter.test.js (no --experimental-strip-types flag). Importing a .ts file from .js requires Node 23.6+ (unflagged type stripping) or the --experimental-strip-types flag on 22.6–23.5. In live mode with an older Node, the test layer always fails regardless of solution quality, making the typecheck→test→lint pipeline's test layer a permanent red. Offline this is moot (toolchain absent anyway). The README's claim that checks 'run for real' in a live box depends on the box Node version.

🟡 LOW devDep lower bound bumped (0.97→0.99) but peerDep stays at 0.97.0 — package.json

package.json:92 now requires @tangle-network/agent-eval >=0.99.0 <1.0.0 for dev, while package.json:120 still declares peerDependency >=0.97.0 <1.0.0. Currently safe: no src/ code changed in this PR and files field excludes examples/, so nothing published needs 0.99. Risk is latent — any future PR that imports a 0.99-only agent-eval API into src/ would silently allow consumers on 0.97/0.98 and break at runtime. Suggest a comment in the PR description noting the asymmetry is intentional, or bump peer to >=0.99.0 proactively if any 0.99 API is planned for src/ soon. Not blocking.

🟡 LOW devDependency minimum exceeds peerDependency minimum for @tangle-network/agent-eval — package.json

Line 92 devDependencies specifies @tangle-network/agent-eval >=0.99.0 <1.0.0, but line 120 peerDependencies still allows >=0.97.0 <1.0.0. After this change CI/develop will install 0.99.0 while consumers can legally install 0.97.0 or 0.98.0. If 0.98.0/0.99.0 introduced breaking type or runtime changes in re-exported substrate primitives, consumers on the older peer range may see failures that CI no longer catches. Either align peerDependencies to >=0.99.0 <1.0.0 if the library truly requires it, or add a CI matrix job that verifies against the lowest supported peer


tangletools · 2026-06-24T10:42:57Z · trace

…ta, reps stop pseudo-replicating, TS test runner, runnable on clean clone

Four partner-blocking defects against the honesty pitch:

- HIGH: the round-0 offline stub the dispatch wrote scored composite 0.6
  (gated:false) because its `refillPerSec` param matched the realImpl regex,
  so the "stub → gated → composite 0" demo never fired on a real run — only
  the unit test's separate strawman gated. Make the round-0 stub genuinely
  hollow (inert `_capacity`/`_ratePerSec` args, no refill math) so gateRealness
  gates it to composite 0 on the benchmark's own data. Export `offlineSolutions`
  and assert the gate against the EXACT dispatch stub in the smoke test.
- HIGH: the leaderboard CI/Wilson were computed over every raw rep record, so
  identical reps faked a narrower interval (pass-CI [34%,100%] → [61%,100%] at
  reps=3). Collapse reps to one mean per (harness,scenario) before the CI/Wilson,
  matching the pairing path. Add a regression test that identical reps leave the
  CI unchanged.
- MEDIUM: the test check ran plain `node --test`; the fixture imports the
  solution as `.ts`, and Node strip-only mode throws ERR_UNSUPPORTED_TYPESCRIPT_SYNTAX
  on constructor parameter properties (the canonical impl's style), false-failing
  a correct solution. Run `node --experimental-transform-types --test`.
- MEDIUM: `tsx` was undeclared, so the documented `pnpm tsx ...benchmark.ts`
  faceplanted on a clean clone. Pin `tsx@^4.22.4` in devDependencies and update
  the lockfile (agent-eval stays 0.99.0).

README updated so every claim matches: the gate fires on real data, reps are
honest, the test runner handles TS param properties, and the run command works.
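
The rep-collapsing fix can be sketched as follows. The `RepRecord` shape, `collapseReps`, and `passRateCI` are hypothetical, and the rule that a cell passes when its mean pass rate is at least 0.5 is an assumption made only for this illustration. The Wilson formula itself is the standard score interval.

```typescript
interface RepRecord { harness: string; scenarioId: string; rep: number; pass: boolean }

// Wilson score interval for `successes` out of `n` trials (z = 1.96 for ~95%).
function wilson(successes: number, n: number, z = 1.96): { lower: number; upper: number } {
  if (n <= 0) throw new Error('wilson: n must be positive');
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { lower: Math.max(0, center - half), upper: Math.min(1, center + half) };
}

// Collapse reps to one mean pass rate per (harness, scenario) cell.
function collapseReps(records: RepRecord[]): number[] {
  const cells = new Map<string, { passed: number; total: number }>();
  for (const r of records) {
    const key = `${r.harness}\u0000${r.scenarioId}`;
    const cell = cells.get(key) ?? { passed: 0, total: 0 };
    cell.passed += r.pass ? 1 : 0;
    cell.total += 1;
    cells.set(key, cell);
  }
  return Array.from(cells.values(), (c) => c.passed / c.total);
}

// ASSUMPTION: a cell counts as a pass when its mean pass rate is >= 0.5.
function passRateCI(records: RepRecord[]): { lower: number; upper: number } {
  const means = collapseReps(records);
  const passes = means.filter((m) => m >= 0.5).length;
  return wilson(passes, means.length); // n = number of cells, not raw records
}
```

With two scenarios that pass on all three reps, the collapsed interval uses n = 2 and its lower bound is about 0.34, while feeding the six raw records to `wilson` gives about 0.61. That matches the [34%,100%] versus [61%,100%] figures quoted above.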
tangletools previously approved these changes Jun 24, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 818e73db

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T10:52:43Z

…ewall claims, runnable-clone hardening

Address the union of four critical-audit lenses on the coding-benchmark example.

Realness gate (HIGH): the gate fired on one strawman stub shape, not the natural
cheat. Tightened each task's realImpl to require the actual hard-part work (refill
MATH, quote-state tracking, capacity eviction) so a decoy token — a `refillPerSec`
param name, a `for (` loop, a passthrough Map — no longer reads as a real impl, and
added a third LRU scenario whose only passing path is the real eviction algorithm.
The smoke test now asserts each NATURAL cheat is gated, per task.

Firewall claims (HIGH/MEDIUM): the README/dispatch/scenarios claimed the agent
"literally cannot read the answer key". The seeded test fixture is intentionally
visible (TDD-style) and a multi-round agent can read it; only the LLM-judge rubric
and realness signals are firewalled. Softened the claim to that precise truth, and
rephrased "gates to composite 0 on the actual run" to "the smoke test asserts the
gate; the run refines past it, so no leaderboard cell ends gated".

DX/runtime hardening: parseArgs no longer swallows a following flag as an option
value (`--reps --live` → reps=1, not NaN) and clamps reps to a positive integer;
the matrix runDir is cleaned in a finally; the LLM judge is imported from
/campaign so it resolves across the whole peer range; ensemble panel models are
snapshot-dated; seedFile prefers the structured fs.write seam (no shell injection
surface); the realness scan takes the seeded fixture as a reference so a real
solution carries no spurious DEAD_ARTIFACT; stats fails loud on a missing
scenarioId instead of merging into one bucket; renderStats prints a power caveat
when n<6; advisory lint warnings now reach the refine prompt without gating
allPass; casts narrowed; README notes the Node>=22.6 test-layer floor, the
live-box PATH requirements, and the offline-ensemble degeneracy.

@tangletools tangletools left a comment

🟢 Value Audit — sound

Verdict: sound
Concerns: 2 (2 low)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 279.1s (2 bridge agents)
Total: 279.1s

💰 Value — sound

Adds a well-scoped, runnable examples/coding-benchmark/ that composes agent-runtime/agent-eval primitives to benchmark coding-agent harnesses across tasks with validators, realness gating, LLM judging, and honest statistics — a good, additive example with no existing equivalent.

  • What it does: Introduces examples/coding-benchmark/ (~9 files): an example that runs a matrix sweep of coding tasks across harness baseline profiles (profiles.ts:70-75), dispatches each cell through a persistent multi-round sandbox run (dispatch.ts:84-162), scores results with deterministic checks → realness gate → LLM judge (eval.ts), and reports paired-bootstrap + Wilson + BH-corrected stats (`stats.t
  • Goals it achieves: 1) Provide a canonical, copy-pasteable pattern for comparing coding-agent harnesses fairly. 2) Demonstrate validators-before-judge, a grading-criteria firewall (dispatch.ts:16-30), controlled tool surfaces (profiles.ts:77-123), and statistically honest comparison (stats.ts). 3) Show that all the heavy primitives (runProfileMatrix, MultiLayerVerifier, scoreAuthenticity, llmJudge, `ens
  • Assessment: Good change on its merits. It is coherent, additive to the examples directory, composes existing substrate primitives rather than reimplementing them, and includes offline smoke tests that verify the load-bearing honesty claims (coding-benchmark.test.ts:43-185). Typecheck (pnpm run typecheck:examples) and the new tests pass. The README is precise about scope limits (3 tasks is underpowered, fi
  • Better / existing approach: none — this is the right approach. I searched the repo for runProfileMatrix usage (grep found only examples/coding-benchmark/benchmark.ts:232, examples/product-eval/product-eval.ts:104, and internal runtime adapters in src/runtime/loop-dispatch.ts and src/conversation/run-persona.ts). The only other matrix example is product-eval, which is a user-sim conversation eval, not a coding b
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound

A well-composed reference example that runs a coding-benchmark matrix purely over agent-runtime/agent-eval primitives; all 16 offline tests pass, every imported symbol resolves in the installed packages, and it follows the two established example patterns (runProfileMatrix from product-eval, in-proc

  • Integration: Reachable and wired: examples/README.md:47 lists it as example 9b and :109-110 prints the exact CLI. It runs via pnpm tsx examples/coding-benchmark/benchmark.ts and its smoke test (examples/coding-benchmark/coding-benchmark.test.ts) executes in CI — confirmed, 16/16 pass, producing the expected 4 harnesses × 3 scenarios × 1 rep = 12 records and a defined leaderboard. Its caller is the developer
  • Fit with existing patterns: Fits the grain exactly. It composes only substrate primitives, each verified present: runProfileMatrix/inMemoryCampaignStorage/JudgeConfig/llmJudge/Scenario from @tangle-network/agent-eval/campaign; MultiLayerVerifier/ensembleJudge/stats fns from agent-eval; scoreAuthenticity/gateRealness from agent-eval/authenticity; openSandboxRun/extractLlmCallEvent/AgentRunSpec from agent-runtime/loops (signat
  • Real-world viability: Holds up: offline path degrades honestly (missing tsc/biome/node toolchain → fast non-zero exit, not a fake pass; all 3 refine rounds run), the firewall is structural (scenario.prompt is the only field the dispatch copies into the box; rubric/realness are read post-loop in eval.ts), and the stats layer refuses to overclaim (renderStats prints a power caveat when n<6 and the test proves identical r
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/coding-benchmark/benchmark.ts

  • console.log(

🟡 Cruft: magic number added examples/coding-benchmark/benchmark.ts

  •      `    this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.refillPerSec)\n` +
    

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260624T122332Z

…tion, delete the realness regex gate

The realness regex gate could never prove code is real — it scans text, so a
comment or dead code evades it. Replace it with held-out test execution
(SWE-bench / HumanEval style): the agent develops against a few visible example
tests, then is graded on a hidden suite it never saw and cannot hardcode. A real
solution passes; a hardcode-the-visible cheat fails the held-out inputs.

Deleted entirely: realnessGate, scoreAuthenticity / gateRealness usage, the
gatedByRealness judge wrapper, realnessSignals (realImpl/fakeShim per scenario),
the DEAD_ARTIFACT handling, and the AuthenticitySignals / ProducedFile imports.

Built: per-scenario visibleTest (seeded during the turn) + heldoutTest (seeded
ONLY at grading — the firewall). runHeldout copies the hidden suite into the box
after the loop and runs `node --experimental-transform-types --test`; the
held-out pass rate is the PRIMARY, ungameable correctness score. Composite =
0.7 * held-out + 0.3 * judge-quality (the LLM judge stays as a secondary
code-quality signal; MultiLayerVerifier stays as advisory dev checks).

stats: suppress the SIGNIFICANT tag below the power floor (n<6) and on a
zero-variance pair, so small-n / no-variance never prints a bare SIGNIFICANT.
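
A sketch of the suppression rule described above. The function name, the returned labels other than SIGNIFICANT, and the 0.05 threshold are assumptions; only the n<6 power floor and the zero-variance condition come from the commit message.

```typescript
const POWER_FLOOR = 6; // below this many paired scenarios, never claim significance

// `diffs` holds the per-scenario paired differences for one harness pair;
// `adjustedP` is that pair's multiplicity-corrected p-value.
function significanceTag(diffs: number[], adjustedP: number, alpha = 0.05): string {
  if (diffs.length < POWER_FLOOR) return 'underpowered';
  // All differences identical means the pair has zero variance.
  const zeroVariance = diffs.every((d) => d === diffs[0]);
  if (zeroVariance) return 'no variance';
  return adjustedP < alpha ? 'SIGNIFICANT' : 'not significant';
}
```

Checking the guards before the p-value means a tiny or degenerate sample can never reach the SIGNIFICANT branch, whatever p-value it produced.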

Offline-proven: a hardcode-the-visible cheat scores held-out 2/4 -> composite
0.59; the real impl scores held-out 4/4 -> composite 0.94 (judge held at 0.80).
Held-out tests are never seeded during the turn (firewall, asserted per
scenario). README rewritten honestly; no realness/regex/authenticity claims.
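
The composite arithmetic above can be checked with a few lines. The function and constant names are hypothetical; the 0.7 and 0.3 weights and the sample inputs are the ones stated in the commit message.

```typescript
const HELDOUT_WEIGHT = 0.7; // primary: held-out test pass rate
const JUDGE_WEIGHT = 0.3;   // secondary: LLM judge code-quality score in [0, 1]

function composite(heldoutPassed: number, heldoutTotal: number, judgeQuality: number): number {
  if (heldoutTotal <= 0) throw new Error('composite: heldoutTotal must be positive');
  const heldoutRate = heldoutPassed / heldoutTotal;
  return HELDOUT_WEIGHT * heldoutRate + JUDGE_WEIGHT * judgeQuality;
}

// Hardcode-the-visible cheat: 2 of 4 held-out tests, judge held at 0.80.
const cheatScore = composite(2, 4, 0.8).toFixed(2); // "0.59"
// Real implementation: 4 of 4 held-out tests, same judge score.
const realScore = composite(4, 4, 0.8).toFixed(2);  // "0.94"
```

Because the judge score is held equal in both cases, the whole 0.35 gap comes from the held-out pass rate, which is the point of weighting it at 0.7.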

@tangletools tangletools left a comment

✅ Auto-approved PR — d5fa3a7f

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T12:51:59Z

@drewstone drewstone merged commit 9bf8016 into main Jun 24, 2026
1 check failed
2 participants