docs(examples): scientifically-rigorous coding benchmark across harnesses with controlled tool use#369
Conversation
…sses with controlled tool use Add examples/coding-benchmark/ — runs one coding task across claude-code/opencode/codex/cli baseline profiles × scenarios via runProfileMatrix, with a one-line tool-surface knob, validators-before-judge scoring, a 1-or-3 (ensemble) judge layer, and real paired-bootstrap + Wilson + BH stats. The no-cheat firewall (agent context = scenario.prompt only) is enforced and pinpointed in dispatch.ts. Ships an in-process SandboxClient so the whole pipeline compiles and runs offline with no creds, and runs faithfully live with --live + TANGLE_API_KEY.
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 3b335ea1
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T23:02:36Z
# Conflicts: # examples/README.md
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 9e98af79
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T23:09:45Z
tangletools
left a comment
There was a problem hiding this comment.
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 2 (2 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 323.5s (2 bridge agents) |
| Total | 323.5s |
💰 Value — sound
Adds a well-structured, copy-pasteable example showing how to compare coding-agent harnesses via runProfileMatrix with validators-before-judge, a no-cheat firewall, and real stats — a gap in the examples learning path.
- What it does: Introduces
examples/coding-benchmark/(9 small files + README) that runs one coding task across claude-code / opencode / codex / cli baseline profiles viarunProfileMatrix. Supports a one-line tool-surface knob (none/web/search-mcp), multi-round refine in one persistent box, deterministic in-box checks, a write-only realness anchor, an optional 3-model cross-family ensemble judge, and pa - Goals it achieves: Makes the "neutral harness measurer" concept from
docs/eval-substrate.mdconcrete and runnable. Demonstrates the canonical campaign-substrate pattern for coding harness comparison, filling a real gap: no existing example showsrunProfileMatrixused for coding tasks with the eval primitives (scoreAuthenticity,ensembleJudge,inMemoryCampaignStorage, paired stats). It is educational, copy- - Assessment: Good on its merits. The code composes existing
agent-runtime/agent-evalprimitives rather than inventing new substrate. The decomposition across files is clear (scenarios, profiles, tools, validators, judges, dispatch, stats, offline box, entrypoint). The no-cheat firewall is a useful, visible design invariant. It follows repo conventions: offline-by-default with--live, imports from packa - Better / existing approach: none — this is the right approach. I checked for existing equivalents:
src/runtime/run-benchmark.tsis the optimization suite (compares strategies like sample/refine on one worker, not harnesses);examples/strategy-suite/demos that.examples/product-eval/usesrunProfileMatrixbut for persona-conversation product evals, not coding.bench/contains real benchmark adapters (SWE-bench, Hum - Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
🎯 Usefulness — sound
A coherent, well-integrated coding-benchmark example that composes real substrate primitives (runProfileMatrix, openSandboxRun, ensembleJudge, scoreAuthenticity, pairedBootstrap/Wilson/BH) with zero bespoke harness code — every import resolves, it follows the established product-eval pattern, and it
- Integration: Fully wired and reachable. Verified against published substrate (agent-eval@0.99.0 + sandbox@0.9.0): every named import exists with the expected signature — runProfileMatrix/ProfileDispatchFn/DispatchContext/JudgeConfig/inMemoryCampaignStorage (campaign), scoreAuthenticity/gateRealness/AuthenticitySignals/ProducedFile (authenticity), confidenceInterval/wilson/pairedBootstrap/benjaminiHochberg/RunR
- Fit with existing patterns: Hits the grain dead-center. It is the same runProfileMatrix<Scenario,Artifact> cell pattern already established by examples/product-eval/product-eval.ts:104, just richer (judges, multi-round refine, stats). No competing pattern, no reinvention — every moving part is a substrate primitive composed, not reimplemented. The no-cheat firewall is expressed as a structural field-level split on CodingScen
- Real-world viability: Holds up on both paths. Offline: the in-process box (offline-box.ts) implements exactly the SandboxInstance surface openSandboxRun/settle call (streamPrompt yielding a terminal result event, fs.read/write, exec, delete); missing toolchain honestly reads as check-FAIL → loop exhausts maxRounds → stub judge → deterministic leaderboard (documented in offline-box.ts:11-13). Live: lazy-requires real Sa
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/coding-benchmark/benchmark.ts
- console.log(
🟡 Cruft: magic number added examples/coding-benchmark/benchmark.ts
` this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.refillPerSec)\n` +
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
❌ Needs Work —
|
| opencode-kimi | glm | deepseek | aggregate | |
|---|---|---|---|---|
| Readiness | 26 | 35 | 79 | 26 |
| Confidence | 70 | 70 | 70 | 70 |
| Correctness | 26 | 35 | 79 | 26 |
| Security | 26 | 35 | 79 | 26 |
| Testing | 26 | 35 | 79 | 26 |
| Architecture | 26 | 35 | 79 | 26 |
Full multi-shot audit completed 2/2 planned shots over 11 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 11 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 11 changed files. Global verifier still owns final merge decision.
Blocking
🔴 HIGH sumTokens misses data.usage shape → --live cells fail integrity guard — examples/coding-benchmark/dispatch.ts
dispatch.ts:148-158 sumTokens() reads only event.data.tokenUsage.{input,output}Tokens. The substrate's canonical metering shapes (src/runtime/sandbox-events.ts:25-70, extractLlmCallEvent) are: type:'result' → data.usage.{input,output}Tokens; type:'done' → data.tokenUsage; type:'llm_call' → data.tokensIn/tokensOut. The offline-box.ts stub emits data.tokenUsage so offline meters correctly, but a real sandbox emitting type:'result' with data.usage yields sum=0 → the
if (usage.input || usage.output) ctx.cost.observeTokens(usage)guard at dispatch.ts:115 skips observation → runProfileMatrix with integrity:'assert' throws BackendIntegrityError on every cell. README claims 'Nothing else in the example changes' between offline and live, which is not true as shipped. Fix: reuse the runtime's extr
🔴 HIGH missing test files break offline deterministic checks — examples/coding-benchmark/scenarios.ts
scenarios.ts:75 and :106 run
node --test test/rate-limiter.test.jsandnode --test test/csv.test.js, but those files are not present in the PR. Verified empirically:node --test <missing-file>exits 1 with 'Could not find ...'. Offline, thetestvalidator therefore always fails, the refine loop never breaks early, and the example cannot demonstrate passing deterministic checks. Fix by adding the referenced test files (and any fixtures) or by documenting that offline mode intentionally skips tests.
🔴 HIGH paired bootstrap mishandles reps > 1 — examples/coding-benchmark/stats.ts
stats.ts:71-82 builds
bByScenario = new Map(b.map((r) => [r.scenarioId ?? '', r])). With reps>1, multiple records share the same scenarioId, so the Map keeps only the last record per scenario. The resulting paired arrays contain duplicated, arbitrarily matched observations instead of matched (scenario, rep) pairs, violating the paired-bootstrap assumption documented inpairedBootstrap. Reproduced conceptually: with reps=2, aScores has four entries while bScores repeats the same two values. Fix by matching on a unique (scenarioId, rep) key or aggregate to one score per scenario before pairing.
Other
🟠 MEDIUM No tests for the example; offline smoke would have caught the live-mode gap — examples/coding-benchmark/benchmark.ts
tsconfig.examples.json + biome.json include examples in typecheck/lint, but there is no test (no .test.ts, no vitest spec) asserting the offline pipeline produces a realness-scored, judge-scored, stats-rendered output. A single vitest that runs main() in offline mode and asserts allRecords.length === harnessProfiles.length * scenarios.length * reps and that pairwiseStats returns a defined leaderboard would cost ~20 lines and lock in the contract the README claims. The repo already ships vitest (vitest.config.ts at root).
🟠 MEDIUM require() used in ESM live path — examples/coding-benchmark/benchmark.ts
benchmark.ts:86-88 uses
require('@tangle-network/sandbox')insideif (live). The package has"type": "module". Whilepnpm tsxpolyfillsrequire()for .ts files, running the compiled ESM output directly throwsReferenceError: require is not defined. Replace with dynamicimport('@tangle-network/sandbox')and an async client factory.
🟠 MEDIUM ensemble judge has less context than single judge — examples/coding-benchmark/judges.ts
singleCodeJudge(line 113-124) buildsjudgeUserPromptthat includes the task prompt, the rubric note, AND deterministic check results (typecheck/test/lint pass/fail). ButensembleCodeJudge(line 130-146) passes onlyartifact.solutionandscenario?.promptto itsscoreOnecallback — the rubric note and check results are dropped. This means the 3-model cross-family ensemble judges on LESS information than the single judge, making it strictly worse for accuracy even though it's presented as the more r
🟠 MEDIUM realness validator is recorded but never gates the judge — examples/coding-benchmark/validators.ts
validators.ts:74-89 realnessValidator returns {valid,score} and dispatch.ts:128 writes it to ctx.artifacts, but no consumer reads artifact.realness. judges.ts:107-116 singleCodeJudge scores artifact.solution directly with no reference to artifact.realness.valid. README.md:46 claims 'Realness anchor ... catches a stub that compiles but fakes the hard part' and the validators-before-judge ordering implies the anchor gates the judge band. As shipped, a gated stub still receives a full judge score; only the offline writeJson records the gate. Either route artifact.realness into the judge prompt (so it can downweight) or short-circuit the judge when realness.valid===false (return composite 0). At minimum, tighten the README to say the anchor is recorded-for-honesty, not judge-gating.
🟡 LOW Tool-knob description is imprecise — examples/README.md
Line 47 says 'a one-line tool knob (websearch / webfetch / MCP)'. The actual preset names in examples/coding-benchmark/tools.ts:29 are 'none' | 'web' | 'search-mcp'; 'websearch' and 'webfetch' are toggled together by the 'web' preset, and the MCP preset is specifically 'search-mcp', not a generic MCP knob. Also, the Run section (lines 109-110) does not demonstrate the
--toolsflag despite highlighting it as a feature. Fix: change parenthetical to '(none / web / search-mcp)' and optionally add a command like `pnpm tsx examples/coding-benchmark/ben
🟡 LOW Offline refine loop is a no-op: script.solutionFor ignores the round argument — examples/coding-benchmark/benchmark.ts
benchmark.ts:34-58 offlineSolutions.{rate-limiter,csv-parser}.solutionFor is declared as (round: number) => string but the implementations close over no round state and return the same source each call. offline-box.ts:43-45 increments
roundand calls script.solutionFor(round), so rounds 2 and 3 write byte-identical content to the same path. The dispatch's multi-round refine loop (dispatch.ts:99-115) therefore produces the same checks output every iteration. offline-box.ts:33 docstring even claims 'round 2 can differ from round 1 (refine demo)' — the capability is in the interface but unexercised. Either drop the round parameter from OfflineScript.solutionFor (clarity), or seed a 2-round refine script for one scenario (demo value).
🟡 LOW require('@tangle-network/sandbox') only resolves under tsx shim, not pure ESM node — examples/coding-benchmark/benchmark.ts
benchmark.ts:79 const { SandboxClient: RealClient } = require('@tangle-network/sandbox'). The repo is "type": "module" (package.json:34). Pure
nodewould throw ReferenceError: require is not defined. tsx (the documented runner) shims it, so the README command works, but a partner copying the snippet into an ESM script breaks. Use a top-levelawait import('@tangle-network/sandbox')guarded behind the live flag, or document the tsx-only constraint in the README's 'Going live' section.
🟡 LOW iteration: maxRounds passes a constant 3 instead of the actual final round — examples/coding-benchmark/dispatch.ts
dispatch.ts:124-127 calls realnessValidator(...).validate(artifact, { iteration: maxRounds, signal: ctx.signal }). maxRounds is the constant 3 (dispatch.ts:31), but ValidationCtx.iteration is documented (src/runtime/types.ts:31) as 'Iteration index this output came from (0-based)'. The realnessValidator implementation ignores ctx (validators.ts:75-76 declares only
artifact), so this is benign today, but the value is semantically wrong and would mislead a future validator that reads ctx.iteration. Capture the loop'sroundvariable into a const before the validator call and pass that.
🟡 LOW run.box as unknown as RunBox declares an fs.read that is never used — examples/coding-benchmark/dispatch.ts
dispatch.ts:46 type RunBox = CheckBox & { fs: { read(path: string): Promise } }. dispatch.ts:117 cast run.box as unknown as RunBox. runBoxChecks (validators.ts:62-70) only calls box.exec(command); the fs.read member is dead. The real SandboxInstance does expose get fs(): FileSystem (sandbox-CNcyBhPp.d.ts:4545) so the cast is harmless at runtime, but the type assertion is wider than the actual call surface and invites a reader to think runBoxChecks reads files. Drop fs.read from RunBox (it's just CheckBox).
🟡 LOW exported OfflineQuality type is dead code — examples/coding-benchmark/offline-box.ts
Line 35 exports
OfflineQuality = 'real' | 'stub'but grep across the entire repo returns only the definition site. The type is referenced nowhere — no import, no usage. The comment on line 32-34 describes it as a fidelity level for canned solutions, but the concept is never wired. Dead code in an example is misleading.
🟡 LOW model snapshot format comment mismatch — examples/coding-benchmark/profiles.ts
profiles.ts:43-45 states model ids must be
name@YYYY-MM-DDorname-YYYYMMDD, but the actual defaults (e.g.,anthropic/claude-sonnet-4-5-2025-09-29) useprovider/name-YYYY-MM-DD. Runtime accepts them, so the comment is misleading. Update the comment to the accepted format.
🟡 LOW Benjamini-Hochberg correction is degenerate with constant p-proxy — examples/coding-benchmark/stats.ts
Line 118 uses
r.low > 0 || r.high < 0 ? 0.04 : 0.5as a p-value proxy for all pairs whose bootstrap CI excludes zero. Because all 'significant' pairs share the exact same proxy p-value (0.04), the BH procedure cannot distinguish between them by true effect size. The result: for n harnesses with C(n,2) pairs, any pair needs to rank at position k >= 0.04 * C(n,2) / 0.05 to pass BH — e.g., with 4 harnesses (6 pairs), at least 5 of 6 pairs must have CI excluding zero before ANY pair passes BH. This makes BH correction overly conservative, effectively masking real pairwise differences. For the offline stub this is harmless (all deltas are 0.000), but for a l
🟡 LOW leaderboard harness names include opaque hash suffixes — examples/coding-benchmark/stats.ts
stats.ts:58-67 keys grouping on
r.agentProfile?.profileId ?? r.candidateId. Actual run output shows labels likeclaude-code-baseline-91b9b8dc6fe218e5instead ofclaude-code-baseline, making the leaderboard harder to read. Use a human-readable profile name for display while keeping the stable id for grouping if needed.
🟡 LOW pProxy 0.04/0.5 is a hand-rolled p-value feed into benjaminiHochberg — examples/coding-benchmark/stats.ts
stats.ts:118 const pProxy = raw.map((r) => (r.low > 0 || r.high < 0 ? 0.04 : 0.5)). The agent-eval substrate ships real significance primitives (pairedTTest, mannWhitneyU, wilcoxonSignedRank, mcnemar — all exported from index.d.ts). Feeding BH a binary 0.04/0.5 proxy loses power information and is explicitly a placeholder. Comment labels it 'proxy' which is honest; for a benchmark marketed as 'scientifically rigorous', wire a real paired test on the matched scores instead.
tangletools · 2026-06-23T23:36:53Z · trace
tangletools
left a comment
There was a problem hiding this comment.
❌ 3 Blocking Findings — 9e98af79
Full multi-shot audit completed 2/2 planned shots over 11 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 11 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 2/2 planned shots over 11 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-23T23:36:53Z · immutable trace
…dings, drop 3 reinventions Reworks examples/coding-benchmark to compose published agent-eval / agent-runtime primitives instead of hand-rolling them, fixing every PR-369 review finding. Reinventions removed (now substrate primitives): - hand-rolled judge (judgeSystemPrompt/judgeUserPrompt/score) -> llmJudge / ensembleJudge - sumTokens (read only data.tokenUsage; summed 0 under a real sandbox emitting data.usage -> integrity:'assert' threw on every --live cell) -> extractLlmCallEvent - runBoxChecks flat pass/fail -> MultiLayerVerifier ordered pipeline (typecheck -> test -> lint, dependency-based skip, blended score) Findings: --live token shapes fixed + verified; real seeded fixture tests so the deterministic checks have a file to run; correct paired-scenario stats keying; an offline vitest smoke; dynamic import() of the SDK behind --live; the ensemble now sees the same full context as the single judge; the realness gate ACTUALLY gates the judge (short-circuits to composite 0, no model call); honest tool-knob docs + --tools command; a real cross-round offline refine demo; dead RunBox.fs.read / OfflineQuality removed; real paired test (pairedTTest) feeding BH real p-values; human-readable harness names on the leaderboard. Consolidated validators+judges -> eval.ts and tools -> profiles.ts (9 source files -> 7 + a smoke test). README every claim matched to the code. Bumps the agent-eval devDependency to >=0.99.0 (llmJudge). The peerDependency floor is unchanged — only the example uses llmJudge, src does not.
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 543881c8
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T08:59:17Z
tangletools
left a comment
There was a problem hiding this comment.
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 2 (2 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 366.9s (2 bridge agents) |
| Total | 366.9s |
💰 Value — sound
Adds a copy-pasteable examples/coding-benchmark/ that composes agent-runtime/agent-eval primitives to run a rigorous, firewalled coding benchmark across harnesses with validators-before-judge, realness gating, and honest paired statistics — a good addition with no existing equivalent.
- What it does: Introduces examples/coding-benchmark/ (~8 files) demonstrating how to run one coding task across a matrix of harnesses × baseline profiles × scenarios, with a one-line tool preset knob, multi-round refinement in one persistent sandbox, deterministic checks (typecheck/test/lint via MultiLayerVerifier), a structural realness gate that can short-circuit the judge, optional cross-family ensemble judgi
- Goals it achieves: Gives partners and contributors a canonical, copy-pasteable example of a scientifically honest coding-agent benchmark using the substrate primitives; demonstrates the no-cheat firewall (agent only sees
scenario.prompt, eval criteria stay outside the box); shows how to swap tool surfaces via the profile without per-harness config; and validates the wiring with an offline vitest smoke test. - Assessment: Good change, built in the grain of the codebase. It uses the published primitives the docs point to:
runProfileMatrix(benchmark.ts:32,202),openSandboxRunfor persistent resumable boxes (dispatch.ts:86),MultiLayerVerifier(eval.ts:170),llmJudge/ensembleJudge(eval.ts:244,264),scoreAuthenticity/gateRealness(eval.ts:199-200), and the stats primitives (stats.ts:22-28). The AgentPro - Better / existing approach: none — this is the right approach. I checked examples/ and bench/ for existing equivalents: examples/product-eval uses
runProfileMatrixbut for persona conversations, not coding; examples/ui-audit usesrunLoopover a UI-audit client; bench/search-bench/run.mts compares coding harnesses but is internal research infrastructure, not a campaign-matrix example, and lacks the validators-before-judge - Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
🎯 Usefulness — sound
A well-integrated, offline-runnable example that composes real agent-eval/agent-runtime primitives to compare coding harnesses fairly with rigorous statistics; nothing existing does this and every integration point verifies.
- Integration: Fully reachable and wired. Registered in examples/README.md entry 9b (Tier 2) with CLI run commands; covered by tsconfig.examples.json (pnpm run typecheck:examples passes clean); has a passing vitest smoke test (3 tests). I installed deps and ran
pnpm tsx examples/coding-benchmark/benchmark.tsafterpnpm build— it produced exactly the documented output (8 records = 4 harnesses × 2 scenarios × - Fit with existing patterns: Composes the right primitives in the grain of the codebase. Follows the established examples convention (imports from published package surface, not relative paths; runs from the repo tsx). Uses the documented agent-eval campaign pattern (runProfileMatrix + ProfileDispatchFn + JudgeConfig) — the same pattern product-eval/ uses for a different domain (user-sim persona eval), so it is consistent, no
- Real-world viability: Holds up. The offline path genuinely exercises the whole pipeline: the real matrix runs, MultiLayerVerifier executes real tsc/biome/node --test commands (which fail fast offline because the toolchain isn't on PATH — the documented honest signal, not a fake pass), the realness gate runs for real and the smoke test asserts it both catches a
return truestub (gated→0) and passes a real token-bucket - Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/coding-benchmark/benchmark.ts
- console.log(
🟡 Cruft: magic number added examples/coding-benchmark/benchmark.ts
` this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.refillPerSec)\n` +
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
|
| opencode-kimi | glm | deepseek | aggregate | |
|---|---|---|---|---|
| Readiness | 80 | 80 | 60 | 60 |
| Confidence | 80 | 80 | 80 | 80 |
| Correctness | 80 | 80 | 60 | 60 |
| Security | 80 | 80 | 60 | 60 |
| Testing | 80 | 80 | 60 | 60 |
| Architecture | 80 | 80 | 60 | 60 |
Full multi-shot audit completed 4/4 planned shots over 12 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 12 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 4/4 planned shots over 12 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM parseArgs produces NaN/incorrect values when flags follow value options — examples/coding-benchmark/benchmark.ts
Lines 53-61: The
opt()helper doesn't validate that the next argument is not a flag.--reps --liveproducesreps = NaN(Number('--live')), and--tools --livesetstoolPreset = '--live'(invalid ToolPreset). Verified: node -e 'parseArgs(["--reps","--live"]).reps' → NaN, 'parseArgs(["--tools","--live"]).toolPreset' → "--live". Fix: check thatargv[i+1]doesn't start with--, or use a proper arg parser.
🟠 MEDIUM devDependency version (>=0.99.0) diverges from peerDependency (>=0.97.0) for @tangle-network/agent-eval — package.json
Line 92 (devDependencies): "@tangle-network/agent-eval": ">=0.99.0 <1.0.0" — bumped from ">=0.97.0". Line 120 (peerDependencies): "@tangle-network/agent-eval": ">=0.97.0 <1.0.0" — NOT bumped. agent-eval is a REQUIRED peer dependency (not listed in peerDependenciesMeta optional, line 126-138). Gap: consumers told they can use 0.97.0–0.98.x but dev now requires 0.99.0. Verified: no src/ changes in this PR, so library API is unchanged and consumers w
🟡 LOW Non-standard table key '9b' — examples/README.md
Line 47: The new row key
9bis non-standard in an otherwise numeric sequence (1-22). It's intentional (interleaves with9for ui-audit to avoid renumbering the full tier list), but could confuse readers expecting pure numeric ordering. Consider10and shift everything down, or add a footnote explaining thebsuffix.
🟡 LOW README command depends on undeclared tsx — examples/coding-benchmark/README.md
Lines 6-16 document
pnpm tsx examples/coding-benchmark/benchmark.ts, buttsxis not listed in this repo's devDependencies andnode_modules/.bin/tsxdoes not exist afterpnpm install --frozen-lockfile. The command happens to work in this environment only because a globaltsxbinary is on PATH (/home/drew/code/cli-bridge/node_modules/.bin/tsx). In a clean checkout without a global tsx, the documented command will fail. Fix: addtsxto devDependencies, or change the documented command to a runner that is guaranteed (e.g.node --experimental-strip-typesorpnpm exec tsc && node ...).
🟡 LOW Offline csv-parser cannot detect row separators — examples/coding-benchmark/benchmark.ts
Lines 88-96: The offline CSV parser checks
c === '\\n'to detect row splits, butcis a single character fromcharAt()and'\\n'is a two-character string — the comparison is always false. Verified via node simulation: multi-row input'a,b\\nc,d'produces one row[['a','b\\nc','d']]instead of two. The test fixture only tests single-row cases so this is latent, but the example code is incorrect for multi-row CSV.
🟡 LOW benchmark runDir is created but never cleaned up — examples/coding-benchmark/benchmark.ts
Line 176 creates a temp directory with
mkdtempSync(join(tmpdir(), 'coding-benchmark-'))and passes it torunProfileMatrix, butmainnever removes it. After running the test suite multiple times,/tmp/coding-benchmark-*directories accumulate. Fix: wrap the matrix run in a try/finally andrmSync(runDir, { recursive: true, force: true })after records are returned.
🟡 LOW Advisory lint failures are invisible to the refine prompt — examples/coding-benchmark/dispatch.ts
nextPromptat dispatch.ts:51 callslayerOutput(report, 'lint'), whosepassedfield is true becausecheckLayerfor advisory layers returnsstatus: 'pass'regardless of the biome result (eval.ts:135-144). This keeps lint from blockingallPass(correct), but it also means biome warnings never reach the agent in the refine prompt, so the agent cannot correct style issues. Fix: include the lint layer's actualok/output in the refine prompt while still treating lint as advisory forallPass.
🟡 LOW Multiple unsafe 'as unknown as' type casts bypass type safety — examples/coding-benchmark/dispatch.ts
Line 120:
run.box as unknown as CheckBox— double-casts the sandbox instance to the minimal CheckBox interface. Line 168:extractLlmCallEvent(ev as never, 'agent')— casts each stream event toneverbefore passing to the extractor. In offline-box.ts lines 64, 100:as unknown as SandboxEventandas unknown as SandboxInstance. These all suppress type checking. While the
🟡 LOW Ensemble judge model names lack snapshot dates — contradicts profiles.ts own rule — examples/coding-benchmark/eval.ts
eval.ts:ensembleCodeJudge (line ~213) hardcodes
models: ['deepseek-chat', 'gpt-4o-mini', 'gemini-flash']— bare aliases with no snapshot date. profiles.ts:harnessModel (line ~43) explicitly requiresprovider/name-YYYY-MM-DDform and documents why: 'a run record without the exact model snapshot is not reproducible.' The ensemble judges also produce scored records; using aliases undermines the reproducibility property the example otherwise enforces. If the live router rejects aliases, the ensemble silently fails. Fix: use snapshot-dated IDs consistent with profiles.ts.
🟡 LOW Unquoted paths in shell-executed check and seed commands — examples/coding-benchmark/eval.ts
eval.ts:seedFile (line ~103):
box.exec(\mkdir -p ${dir} && printf %s '${b64}' | base64 -d > ${file.path}`)— file.path and dir are interpolated unquoted. scenarios.ts:typecheckCmd/testCmd/lintCmd (lines ~55-60):tsc ... ${path},node --test ${fixturePath},biome check ${path}— same pattern. Current corpus paths (test/rate-limiter.test.js, src/csv.ts) are safe and eval-controlled (never agent-influenced), so this is not an active vulnerability. But as example code that partners will copy, it sets a latent injection precedent. Fix: single-quote paths:mkdir -p '${dir}' ... > '${file.path}'`.
🟡 LOW llmJudge imported from main index breaks peer-floor (>=0.97.0) consumers — examples/coding-benchmark/eval.ts
Line 1:
import { ... llmJudge ... } from '@tangle-network/agent-eval'. Verified that llmJudge IS exported from the main index in 0.99.0 (the devDep floor>=0.99.0) but is NOT in 0.97.x/0.98.x (within the declared peerDependencies range>=0.97.0). A downstream consumer pinning agent-eval to the peer floor who compiles this example getsllmJudge is not exported. Fix: import llmJudge from@tangle-network/agent-eval/campaign(exported in both versions). Every other symbol used exists in both.
🟡 LOW seedFile shells out with unquoted file paths — examples/coding-benchmark/eval.ts
Line 105 builds
mkdir -p ${dir} && printf %s '${b64}' | base64 -d > ${file.path}. The current scenario paths are fixed (src/...,test/...), but the path is interpolated unquoted, so a scenario path containing spaces or shell metacharacters would break or inject. Fix: quote the path (e.g., useJSON.stringify(file.path)as the shell word) or, better, seed fixtures viabox.fs.writeinstead ofbox.exec.
🟡 LOW seedFile uses unsanitized paths in shell command — examples/coding-benchmark/eval.ts
Line 105:
await box.exec('mkdir -p ${dir} && printf %s ... base64 -d > ${file.path}')—file.pathanddirare interpolated directly into a shell command. Currently safe because scenarios.ts uses hardcoded paths, but this is a latent injection vector if scenarios are ever loaded from external config. The base64 encoding itself is safe (standard b64 has no quotes/shell metachar). Note: scenario.fixture.content is NOT escaped against single quotes in theprintf %s '${b64}'wrapper, but base64 output never contains single quotes.
🟡 LOW Test fixture .js imports .ts — node --test needs a TS loader — examples/coding-benchmark/scenarios.ts
Lines 64-68:
node --test test/rate-limiter.test.jswhere the .js fixture doesimport { RateLimiter } from '../src/rate-limiter.ts'. Node.js 22+ cannot natively resolve TypeScript imports from JavaScript without--loader ts-node/esmor--import tsx. The README acknowledges this fails offline, but the command itself is structurally incorrect for any vanilla Node.js environment. In a real harness box the toolchain may be pre-configured, but the check commands offer no fallback.
🟡 LOW Test fixtures import .ts files from .js — requires Node 23.6+ type stripping — examples/coding-benchmark/scenarios.ts
Fixture test/rate-limiter.test.js (line ~82) does
import { RateLimiter } from '../src/rate-limiter.ts'and test/csv.test.js (line ~143) doesimport { parseCsv } from '../src/csv.ts'. The check command isnode --test test/rate-limiter.test.js(no --experimental-strip-types flag). Importing a .ts file from .js requires Node 23.6+ (unflagged type stripping) or the --experimental-strip-types flag on 22.6–23.5. In live mode with an older Node, the test layer always fails regardless of solution quality, making the typecheck→test→lint pipeline's test layer a permanent red. Offline this is moot (toolchain absent anyway). The README's claim that checks 'run for real' in a live box depends on the box Node version.
🟡 LOW devDep lower bound bumped (0.97→0.99) but peerDep stays at 0.97.0 — package.json
package.json:92 now requires
@tangle-network/agent-eval >=0.99.0 <1.0.0for dev, while package.json:120 still declares peerDependency>=0.97.0 <1.0.0. Currently safe: nosrc/code changed in this PR andfilesfield excludesexamples/, so nothing published needs 0.99. Risk is latent — any future PR that imports a 0.99-only agent-eval API intosrc/would silently allow consumers on 0.97/0.98 and break at runtime. Suggest a comment in the PR description noting the asymmetry is intentional, or bump peer to>=0.99.0proactively if any 0.99 API is planned for src/ soon. Not blocking.
🟡 LOW devDependency minimum exceeds peerDependency minimum for @tangle-network/agent-eval — package.json
Line 92 devDependencies specifies @tangle-network/agent-eval >=0.99.0 <1.0.0, but line 120 peerDependencies still allows >=0.97.0 <1.0.0. After this change CI/develop will install 0.99.0 while consumers can legally install 0.97.0 or 0.98.0. If 0.98.0/0.99.0 introduced breaking type or runtime changes in re-exported substrate primitives, consumers on the older peer range may see failures that CI no longer catches. Either align peerDependencies to >=0.99.0 <1.0.0 if the library truly requires it, or add a CI matrix job that verifies against the lowest supported peer
tangletools · 2026-06-24T10:42:57Z · trace
…ta, reps stop pseudo-replicating, TS test runner, runnable on clean clone Four partner-blocking defects against the honesty pitch: - HIGH: the round-0 offline stub the dispatch wrote scored composite 0.6 (gated:false) because its `refillPerSec` param matched the realImpl regex, so the "stub → gated → composite 0" demo never fired on a real run — only the unit test's separate strawman gated. Make the round-0 stub genuinely hollow (inert `_capacity`/`_ratePerSec` args, no refill math) so gateRealness gates it to composite 0 on the benchmark's own data. Export `offlineSolutions` and assert the gate against the EXACT dispatch stub in the smoke test. - HIGH: the leaderboard CI/Wilson were computed over every raw rep record, so identical reps faked a narrower interval (pass-CI [34%,100%] → [61%,100%] at reps=3). Collapse reps to one mean per (harness,scenario) before the CI/Wilson, matching the pairing path. Add a regression test that identical reps leave the CI unchanged. - MEDIUM: the test check ran plain `node --test`; the fixture imports the solution as `.ts`, and Node strip-only mode throws ERR_UNSUPPORTED_TYPESCRIPT_SYNTAX on constructor parameter properties (the canonical impl's style), false-failing a correct solution. Run `node --experimental-transform-types --test`. - MEDIUM: `tsx` was undeclared, so the documented `pnpm tsx ...benchmark.ts` faceplanted on a clean clone. Pin `tsx@^4.22.4` in devDependencies and update the lockfile (agent-eval stays 0.99.0). README updated so every claim matches: the gate fires on real data, reps are honest, the test runner handles TS param properties, and the run command works.
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 818e73db
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T10:52:43Z
…ewall claims, runnable-clone hardening Address the union of four critical-audit lenses on the coding-benchmark example. Realness gate (HIGH): the gate fired on one strawman stub shape, not the natural cheat. Tightened each task's realImpl to require the actual hard-part work (refill MATH, quote-state tracking, capacity eviction) so a decoy token — a `refillPerSec` param name, a `for (` loop, a passthrough Map — no longer reads as a real impl, and added a third LRU scenario whose only passing path is the real eviction algorithm. The smoke test now asserts each NATURAL cheat is gated, per task. Firewall claims (HIGH/MEDIUM): the README/dispatch/scenarios claimed the agent "literally cannot read the answer key". The seeded test fixture is intentionally visible (TDD-style) and a multi-round agent can read it; only the LLM-judge rubric and realness signals are firewalled. Softened the claim to that precise truth, and rephrased "gates to composite 0 on the actual run" to "the smoke test asserts the gate; the run refines past it, so no leaderboard cell ends gated". DX/runtime hardening: parseArgs no longer swallows a following flag as an option value (`--reps --live` → reps=1, not NaN) and clamps reps to a positive integer; the matrix runDir is cleaned in a finally; the LLM judge is imported from /campaign so it resolves across the whole peer range; ensemble panel models are snapshot-dated; seedFile prefers the structured fs.write seam (no shell injection surface); the realness scan takes the seeded fixture as a reference so a real solution carries no spurious DEAD_ARTIFACT; stats fails loud on a missing scenarioId instead of merging into one bucket; renderStats prints a power caveat when n<6; advisory lint warnings now reach the refine prompt without gating allPass; casts narrowed; README notes the Node>=22.6 test-layer floor, the live-box PATH requirements, and the offline-ensemble degeneracy.
tangletools
left a comment
There was a problem hiding this comment.
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 2 (2 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 279.1s (2 bridge agents) |
| Total | 279.1s |
💰 Value — sound
Adds a well-scoped, runnable examples/coding-benchmark/ that composes agent-runtime/agent-eval primitives to benchmark coding-agent harnesses across tasks with validators, realness gating, LLM judging, and honest statistics — a good, additive example with no existing equivalent.
- What it does: Introduces
examples/coding-benchmark/(~9 files): an example that runs a matrix sweep of coding tasks across harness baseline profiles (profiles.ts:70-75), dispatches each cell through a persistent multi-round sandbox run (dispatch.ts:84-162), scores results with deterministic checks → realness gate → LLM judge (eval.ts), and reports paired-bootstrap + Wilson + BH-corrected stats (`stats.t - Goals it achieves: 1) Provide a canonical, copy-pasteable pattern for comparing coding-agent harnesses fairly. 2) Demonstrate validators-before-judge, a grading-criteria firewall (
dispatch.ts:16-30), controlled tool surfaces (profiles.ts:77-123), and statistically honest comparison (stats.ts). 3) Show that all the heavy primitives (runProfileMatrix,MultiLayerVerifier,scoreAuthenticity,llmJudge, `ens - Assessment: Good change on its merits. It is coherent, additive to the examples directory, composes existing substrate primitives rather than reimplementing them, and includes offline smoke tests that verify the load-bearing honesty claims (
coding-benchmark.test.ts:43-185). Typecheck (pnpm run typecheck:examples) and the new tests pass. The README is precise about scope limits (3 tasks is underpowered, fi - Better / existing approach: none — this is the right approach. I searched the repo for
runProfileMatrixusage (grepfound onlyexamples/coding-benchmark/benchmark.ts:232,examples/product-eval/product-eval.ts:104, and internal runtime adapters insrc/runtime/loop-dispatch.tsandsrc/conversation/run-persona.ts). The only other matrix example isproduct-eval, which is a user-sim conversation eval, not a coding b - Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
🎯 Usefulness — sound
A well-composed reference example that runs a coding-benchmark matrix purely over agent-runtime/agent-eval primitives; all 16 offline tests pass, every imported symbol resolves in the installed packages, and it follows the two established example patterns (runProfileMatrix from product-eval, in-proc
- Integration: Reachable and wired: examples/README.md:47 lists it as example 9b and :109-110 prints the exact CLI. It runs via
pnpm tsx examples/coding-benchmark/benchmark.tsand its smoke test (examples/coding-benchmark/coding-benchmark.test.ts) executes in CI — confirmed, 16/16 pass, producing the expected 4 harnesses × 3 scenarios × 1 rep = 12 records and a defined leaderboard. Its caller is the developer - Fit with existing patterns: Fits the grain exactly. It composes only substrate primitives, each verified present: runProfileMatrix/inMemoryCampaignStorage/JudgeConfig/llmJudge/Scenario from @tangle-network/agent-eval/campaign; MultiLayerVerifier/ensembleJudge/stats fns from agent-eval; scoreAuthenticity/gateRealness from agent-eval/authenticity; openSandboxRun/extractLlmCallEvent/AgentRunSpec from agent-runtime/loops (signat
- Real-world viability: Holds up: offline path degrades honestly (missing tsc/biome/node toolchain → fast non-zero exit, not a fake pass; all 3 refine rounds run), the firewall is structural (scenario.prompt is the only field the dispatch copies into the box; rubric/realness are read post-loop in eval.ts), and the stats layer refuses to overclaim (renderStats prints a power caveat when n<6 and the test proves identical r
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/coding-benchmark/benchmark.ts
- console.log(
🟡 Cruft: magic number added examples/coding-benchmark/benchmark.ts
` this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.refillPerSec)\n` +
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
…tion, delete the realness regex gate The realness regex gate could never prove code is real — it scans text, so a comment or dead code evades it. Replace it with held-out test execution (SWE-bench / HumanEval style): the agent develops against a few visible example tests, then is graded on a hidden suite it never saw and cannot hardcode. A real solution passes; a hardcode-the-visible cheat fails the held-out inputs. Deleted entirely: realnessGate, scoreAuthenticity / gateRealness usage, the gatedByRealness judge wrapper, realnessSignals (realImpl/fakeShim per scenario), the DEAD_ARTIFACT handling, and the AuthenticitySignals / ProducedFile imports. Built: per-scenario visibleTest (seeded during the turn) + heldoutTest (seeded ONLY at grading — the firewall). runHeldout copies the hidden suite into the box after the loop and runs `node --experimental-transform-types --test`; the held-out pass rate is the PRIMARY, ungameable correctness score. Composite = 0.7 * held-out + 0.3 * judge-quality (the LLM judge stays as a secondary code-quality signal; MultiLayerVerifier stays as advisory dev checks). stats: suppress the SIGNIFICANT tag below the power floor (n<6) and on a zero-variance pair, so small-n / no-variance never prints a bare SIGNIFICANT. Offline-proven: a hardcode-the-visible cheat scores held-out 2/4 -> composite 0.59; the real impl scores held-out 4/4 -> composite 0.94 (judge held at 0.80). Held-out tests are never seeded during the turn (firewall, asserted per scenario). README rewritten honestly; no realness/regex/authenticity claims.
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — d5fa3a7f
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T12:51:59Z
What
examples/coding-benchmark/— a copy-pasteable example that runs one coding task across coding-agent harnesses, fairly and honestly, with real statistics, in ~8 small files of pure composition. Every moving part is anagent-runtimeoragent-evalprimitive; zero bespoke harness code.What it does
runProfileMatrixover harness × baseline-profile × scenario: claude-code / opencode / codex / cli, each on its bare default profile (we measure the harness, not our scaffolding), across two held-out tasks (token-bucket rate limiter, RFC-4180 CSV parser).withTools(profile, 'none'|'web'|'search-mcp')authors native web tools / a mounted MCP onto the profile; the sandbox substrate materializes it per harness. CLI:--tools.openSandboxRungives one persistent, resumable box; each round's prompt is built from the prior round's deterministic-check failures (.start()/.resume()).typecheck/test/lint, ~$0, in-box) + a realness anchor (scoreAuthenticity/gateRealness), then the LLM judge only on the band the checks can't resolve.singleCodeJudge);--ensemble→ 3 cross-family models (ensembleJudge,crossFamily:true), reduced byaggregateJudgeVerdicts. 4-dim weighted rubric (correctness 0.40 / completeness 0.25 / code_quality 0.20 / robustness 0.15).confidenceInterval(per-harness composite CI),wilson(pass-rate binomial CI),pairedBootstrapon matched scenarios,benjaminiHochbergto BH-correct across all harness pairs.The no-cheat firewall
The agent's context is
scenario.promptonly. The validator commands, realness signals, and rubric note live on eval-only fields of the scenario, read after the loop by the validators/judge — never written into the box. The firewall is a labeled block indispatch.ts(THE NO-CHEAT FIREWALL LIVES HERE); the realness anchor is write-only to the record (ctx.artifacts). Verified: the realness scan scores a real impl 85 vs areturn truestub 35 (gated) on the sample tasks.Offline by default, faithful live
Ships an in-process
SandboxClient(offline-box.ts) so the whole pipeline compiles and runs with no creds — proven end-to-end (default: 8 records;--tools web --ensemble --reps 2: 16 records).--liveswaps in a real@tangle-network/sandboxclient + a real judge model; nothing else changes.Verification
pnpm run lint— clean (328 files)pnpm run typecheck(src) +pnpm run typecheck:examples— clean--ensemble/--tools web/--reps) — full leaderboard + significance matrix producedOne runtime gap worked around
runProfileMatrixrejects a bare model id; it requires a snapshot-versioned id (name@YYYY-MM-DDorname-YYYYMMDD) so records are reproducible.profiles.tsnow uses dated, env-overridable model ids and documents why. (Caught by actually running it — typecheck was green but the first run threwProfileMatrixError: ... lacks a snapshot version.)Operator + partner-facing — please review, do not merge yet.