docs(examples): scientifically-rigorous coding benchmark across harnesses with controlled tool use by drewstone · Pull Request #369 · tangle-network/agent-runtime

drewstone · 2026-06-23T23:02:29Z

What

examples/coding-benchmark/ — a copy-pasteable example that runs one coding task across coding-agent harnesses, fairly and honestly, with real statistics, in ~8 small files of pure composition. Every moving part is an agent-runtime or agent-eval primitive; zero bespoke harness code.

What it does

Matrix sweep — runProfileMatrix over harness × baseline-profile × scenario: claude-code / opencode / codex / cli, each on its bare default profile (we measure the harness, not our scaffolding), across two held-out tasks (token-bucket rate limiter, RFC-4180 CSV parser).
One-line tool knob — withTools(profile, 'none'|'web'|'search-mcp') authors native web tools / a mounted MCP onto the profile; the sandbox substrate materializes it per harness. CLI: --tools.
Multi-round refine — openSandboxRun gives one persistent, resumable box; each round's prompt is built from the prior round's deterministic-check failures (.start() / .resume()).
Validators before judge — 3 deterministic checks (typecheck/test/lint, ~$0, in-box) + a realness anchor (scoreAuthenticity/gateRealness), then the LLM judge only on the band the checks can't resolve.
Judge layer — 1 judge by default (singleCodeJudge); --ensemble → 3 cross-family models (ensembleJudge, crossFamily:true), reduced by aggregateJudgeVerdicts. 4-dim weighted rubric (correctness 0.40 / completeness 0.25 / code_quality 0.20 / robustness 0.15).
Real stats — confidenceInterval (per-harness composite CI), wilson (pass-rate binomial CI), pairedBootstrap on matched scenarios, benjaminiHochberg to BH-correct across all harness pairs.
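
For readers unfamiliar with the pass-rate interval named above, here is a minimal standalone sketch of a Wilson score interval. It is not the agent-eval `wilson` implementation; the function name `wilsonInterval`, the `Interval` shape, and the default `z = 1.96` (about 95%) are assumptions made for illustration.

```typescript
// Standalone sketch of a Wilson score interval for a binomial pass rate.
// Assumption: this is NOT agent-eval's `wilson`; names and defaults are illustrative.
interface Interval {
  low: number
  high: number
}

function wilsonInterval(passes: number, trials: number, z = 1.96): Interval {
  if (trials <= 0) throw new RangeError('trials must be positive')
  if (passes < 0 || passes > trials) throw new RangeError('passes must be in [0, trials]')
  const p = passes / trials
  const z2 = z * z
  const denom = 1 + z2 / trials
  // Center is the observed rate shrunk toward 0.5.
  const center = (p + z2 / (2 * trials)) / denom
  const half = (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) }
}
```

With only two trials per harness (two scenarios, assuming a single rep), even a 2/2 pass rate gives a lower bound near 0.34, which is why the leaderboard intervals stay wide until `--reps` is raised.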

The no-cheat firewall

The agent's context is scenario.prompt only. The validator commands, realness signals, and rubric note live on eval-only fields of the scenario, read after the loop by the validators/judge — never written into the box. The firewall is a labeled block in dispatch.ts (THE NO-CHEAT FIREWALL LIVES HERE); the realness anchor is write-only to the record (ctx.artifacts). Verified: the realness scan scores a real impl 85 vs a return true stub 35 (gated) on the sample tasks.
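
One way to picture the field-level split is the sketch below. Every name in it (`AgentVisible`, `EvalOnly`, `CodingScenarioSketch`, `buildAgentInput`, and the eval-only field names) is hypothetical; only the idea that the agent sees `prompt` and nothing else comes from the description above.

```typescript
// Hypothetical shapes: only `prompt` is taken from the PR text; the rest is illustrative.
interface AgentVisible {
  prompt: string
}

interface EvalOnly {
  checkCommands: string[]
  rubricNote: string
}

interface CodingScenarioSketch {
  id: string
  agent: AgentVisible
  evalOnly: EvalOnly
}

// The only function that builds agent input accepts the agent-visible slice,
// so eval-only fields cannot reach the box by accident.
function buildAgentInput(visible: AgentVisible): string {
  return visible.prompt
}

const scenario: CodingScenarioSketch = {
  id: 'rate-limiter',
  agent: { prompt: 'Implement a token-bucket rate limiter in src/rate-limiter.ts.' },
  evalOnly: {
    checkCommands: ['node --test test/rate-limiter.test.js'],
    rubricNote: 'Refill must be time-based, not call-based.',
  },
}

const agentInput = buildAgentInput(scenario.agent)
```

Passing the narrow slice rather than the whole scenario lets the type checker enforce the firewall instead of relying on a comment.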

Offline by default, faithful live

Ships an in-process SandboxClient (offline-box.ts) so the whole pipeline compiles and runs with no creds — proven end-to-end (default: 8 records; --tools web --ensemble --reps 2: 16 records). --live swaps in a real @tangle-network/sandbox client + a real judge model; nothing else changes.

Verification

pnpm run lint — clean (328 files)
pnpm run typecheck (src) + pnpm run typecheck:examples — clean
Ran offline in all four modes (default / --ensemble / --tools web / --reps) — full leaderboard + significance matrix produced

One runtime gap worked around

runProfileMatrix rejects a bare model id; it requires a snapshot-versioned id (name@YYYY-MM-DD or name-YYYYMMDD) so records are reproducible. profiles.ts now uses dated, env-overridable model ids and documents why. (Caught by actually running it — typecheck was green but the first run threw ProfileMatrixError: ... lacks a snapshot version.)
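
A pre-flight check along these lines would surface the problem before the first cell runs. This is a sketch under stated assumptions: the two patterns mirror the formats named above, the real runProfileMatrix rule is not shown here and may differ, and `hasSnapshotVersion`, `resolveModelId`, and the model ids in the usage note are hypothetical.

```typescript
// Assumption: these patterns mirror the two formats named in the PR text;
// the actual runProfileMatrix validation may accept more or fewer forms.
const SNAPSHOT_SUFFIXES: RegExp[] = [
  /@\d{4}-\d{2}-\d{2}$/, // name@YYYY-MM-DD
  /-\d{8}$/, // name-YYYYMMDD
]

function hasSnapshotVersion(modelId: string): boolean {
  return SNAPSHOT_SUFFIXES.some((re) => re.test(modelId))
}

// Env-overridable model id: an empty or unset override falls back to the default,
// and the resolved id is rejected early if it is a bare alias.
function resolveModelId(envValue: string | undefined, fallback: string): string {
  const id = envValue?.trim() || fallback
  if (!hasSnapshotVersion(id)) {
    throw new Error(`model id "${id}" lacks a snapshot version`)
  }
  return id
}
```

For example, `resolveModelId(undefined, 'example-model@2026-01-15')` returns the fallback unchanged, while a bare `'example-model'` throws before any sandbox is opened.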

Operator + partner-facing — please review, do not merge yet.

…sses with controlled tool use

Add examples/coding-benchmark/: runs one coding task across claude-code/opencode/codex/cli baseline profiles × scenarios via runProfileMatrix, with a one-line tool-surface knob, validators-before-judge scoring, a 1-or-3 (ensemble) judge layer, and real paired-bootstrap + Wilson + BH stats. The no-cheat firewall (agent context = scenario.prompt only) is enforced and pinpointed in dispatch.ts. Ships an in-process SandboxClient so the whole pipeline compiles and runs offline with no creds, and runs faithfully live with --live + TANGLE_API_KEY.

tangletools

✅ Auto-approved PR — `3b335ea1`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T23:02:36Z}

# Conflicts: # examples/README.md

tangletools

✅ Auto-approved PR — `9e98af79`

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T23:09:45Z}

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	2 (2 low)
Heuristic	0.0s
Duplication	0.0s
Interrogation	323.5s (2 bridge agents)
Total	323.5s

💰 Value — sound

Adds a well-structured, copy-pasteable example showing how to compare coding-agent harnesses via runProfileMatrix with validators-before-judge, a no-cheat firewall, and real stats — a gap in the examples learning path.

What it does: Introduces examples/coding-benchmark/ (9 small files + README) that runs one coding task across claude-code / opencode / codex / cli baseline profiles via runProfileMatrix. Supports a one-line tool-surface knob (none/web/search-mcp), multi-round refine in one persistent box, deterministic in-box checks, a write-only realness anchor, an optional 3-model cross-family ensemble judge, and pa
Goals it achieves: Makes the "neutral harness measurer" concept from docs/eval-substrate.md concrete and runnable. Demonstrates the canonical campaign-substrate pattern for coding harness comparison, filling a real gap: no existing example shows runProfileMatrix used for coding tasks with the eval primitives (scoreAuthenticity, ensembleJudge, inMemoryCampaignStorage, paired stats). It is educational, copy-
Assessment: Good on its merits. The code composes existing agent-runtime / agent-eval primitives rather than inventing new substrate. The decomposition across files is clear (scenarios, profiles, tools, validators, judges, dispatch, stats, offline box, entrypoint). The no-cheat firewall is a useful, visible design invariant. It follows repo conventions: offline-by-default with --live, imports from packa
Better / existing approach: none — this is the right approach. I checked for existing equivalents: src/runtime/run-benchmark.ts is the optimization suite (compares strategies like sample/refine on one worker, not harnesses); examples/strategy-suite/ demos that. examples/product-eval/ uses runProfileMatrix but for persona-conversation product evals, not coding. bench/ contains real benchmark adapters (SWE-bench, Hum
Model: opencode/kimi-for-coding/k2p7
Bridge attempts: 1

🎯 Usefulness — sound

A coherent, well-integrated coding-benchmark example that composes real substrate primitives (runProfileMatrix, openSandboxRun, ensembleJudge, scoreAuthenticity, pairedBootstrap/Wilson/BH) with zero bespoke harness code — every import resolves, it follows the established product-eval pattern, and it

Integration: Fully wired and reachable. Verified against published substrate (agent-eval@0.99.0 + sandbox@0.9.0): every named import exists with the expected signature — runProfileMatrix/ProfileDispatchFn/DispatchContext/JudgeConfig/inMemoryCampaignStorage (campaign), scoreAuthenticity/gateRealness/AuthenticitySignals/ProducedFile (authenticity), confidenceInterval/wilson/pairedBootstrap/benjaminiHochberg/RunR
Fit with existing patterns: Hits the grain dead-center. It is the same runProfileMatrix<Scenario,Artifact> cell pattern already established by examples/product-eval/product-eval.ts:104, just richer (judges, multi-round refine, stats). No competing pattern, no reinvention — every moving part is a substrate primitive composed, not reimplemented. The no-cheat firewall is expressed as a structural field-level split on CodingScen
Real-world viability: Holds up on both paths. Offline: the in-process box (offline-box.ts) implements exactly the SandboxInstance surface openSandboxRun/settle call (streamPrompt yielding a terminal result event, fs.read/write, exec, delete); missing toolchain honestly reads as check-FAIL → loop exhausts maxRounds → stub judge → deterministic leaderboard (documented in offline-box.ts:11-13). Live: lazy-requires real Sa
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/coding-benchmark/benchmark.ts

console.log(

🟡 Cruft: magic number added examples/coding-benchmark/benchmark.ts

 `    this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.refillPerSec)\n` +

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass	What it asks
Heuristic	Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication	Do added function/class names already exist elsewhere in the repo?
Value Audit	What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit	Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260623T233115Z}

tangletools · 2026-06-23T23:36:58Z

❌ Needs Work — `9e98af79`

Readiness 26/100 · Confidence 70/100 · 17 findings (3 high, 4 medium, 10 low)

	opencode-kimi	glm	deepseek	aggregate
Readiness	26	35	79	26
Confidence	70	70	70	70
Correctness	26	35	79	26
Security	26	35	79	26
Testing	26	35	79	26
Architecture	26	35	79	26

Full multi-shot audit completed 2/2 planned shots over 11 changed files. Global verifier still owns final merge decision.

Blocking

🔴 HIGH sumTokens misses data.usage shape → --live cells fail integrity guard — examples/coding-benchmark/dispatch.ts

dispatch.ts:148-158 sumTokens() reads only event.data.tokenUsage.{input,output}Tokens. The substrate's canonical metering shapes (src/runtime/sandbox-events.ts:25-70, extractLlmCallEvent) are: type:'result' → data.usage.{input,output}Tokens; type:'done' → data.tokenUsage; type:'llm_call' → data.tokensIn/tokensOut. The offline-box.ts stub emits data.tokenUsage so offline meters correctly, but a real sandbox emitting type:'result' with data.usage yields sum=0 → the if (usage.input || usage.output) ctx.cost.observeTokens(usage) guard at dispatch.ts:115 skips observation → runProfileMatrix with integrity:'assert' throws BackendIntegrityError on every cell. README claims 'Nothing else in the example changes' between offline and live, which is not true as shipped. Fix: reuse the runtime's extr
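
To make the three metering shapes concrete, here is a standalone sketch that reads all of them. It is not the runtime's extractor; the `SandboxEventSketch` typing and the helper names are assumptions, and only the field paths are taken from the finding above.

```typescript
// Standalone sketch covering the three shapes listed in the finding.
// Assumption: event typing and helper names are illustrative, not the runtime's API.
interface TokenUsage {
  input: number
  output: number
}

interface SandboxEventSketch {
  type: string
  data?: Record<string, unknown>
}

function num(value: unknown): number {
  return typeof value === 'number' && Number.isFinite(value) ? value : 0
}

function asRecord(value: unknown): Record<string, unknown> {
  return typeof value === 'object' && value !== null ? (value as Record<string, unknown>) : {}
}

function usageOf(event: SandboxEventSketch): TokenUsage {
  const data = asRecord(event.data)
  switch (event.type) {
    case 'result': {
      const u = asRecord(data.usage) // data.usage.{input,output}Tokens
      return { input: num(u.inputTokens), output: num(u.outputTokens) }
    }
    case 'done': {
      const u = asRecord(data.tokenUsage) // data.tokenUsage.{input,output}Tokens
      return { input: num(u.inputTokens), output: num(u.outputTokens) }
    }
    case 'llm_call':
      return { input: num(data.tokensIn), output: num(data.tokensOut) }
    default:
      return { input: 0, output: 0 }
  }
}

function sumTokens(events: SandboxEventSketch[]): TokenUsage {
  return events.reduce<TokenUsage>(
    (acc, ev) => {
      const u = usageOf(ev)
      return { input: acc.input + u.input, output: acc.output + u.output }
    },
    { input: 0, output: 0 },
  )
}
```

Whether a real stream reports the same call under more than one of these shapes is not stated here, so the sketch does no de-duplication; a production path would need to settle that before summing.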

🔴 HIGH missing test files break offline deterministic checks — examples/coding-benchmark/scenarios.ts

scenarios.ts:75 and :106 run node --test test/rate-limiter.test.js and node --test test/csv.test.js, but those files are not present in the PR. Verified empirically: node --test <missing-file> exits 1 with 'Could not find ...'. Offline, the test validator therefore always fails, the refine loop never breaks early, and the example cannot demonstrate passing deterministic checks. Fix by adding the referenced test files (and any fixtures) or by documenting that offline mode intentionally skips tests.

🔴 HIGH paired bootstrap mishandles reps > 1 — examples/coding-benchmark/stats.ts

stats.ts:71-82 builds bByScenario = new Map(b.map((r) => [r.scenarioId ?? '', r])). With reps>1, multiple records share the same scenarioId, so the Map keeps only the last record per scenario. The resulting paired arrays contain duplicated, arbitrarily matched observations instead of matched (scenario, rep) pairs, violating the paired-bootstrap assumption documented in pairedBootstrap. Reproduced conceptually: with reps=2, aScores has four entries while bScores repeats the same two values. Fix by matching on a unique (scenarioId, rep) key or by aggregating to one score per scenario before pairing.
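
The first suggested fix could look roughly like the sketch below. The `ScoredRecord` shape is hypothetical: `rep` is an assumed field, since the finding only requires that the key be unique per (scenarioId, rep).

```typescript
// Hypothetical record shape; `rep` is an assumed field used to build a unique pairing key.
interface ScoredRecord {
  scenarioId: string
  rep: number
  score: number
}

// Pair harness A with harness B on (scenarioId, rep) so that reps > 1 do not
// collapse onto the last record per scenario.
function pairByScenarioAndRep(a: ScoredRecord[], b: ScoredRecord[]): Array<[number, number]> {
  const key = (r: ScoredRecord): string => `${r.scenarioId}#${r.rep}`
  const bByKey = new Map(b.map((r) => [key(r), r.score] as const))
  const pairs: Array<[number, number]> = []
  for (const r of a) {
    const other = bByKey.get(key(r))
    // Unmatched records are dropped rather than paired arbitrarily.
    if (other !== undefined) pairs.push([r.score, other])
  }
  return pairs
}
```

Dropping unmatched records keeps the two arrays the same length, which is what a paired resampling step expects.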

Other

🟠 MEDIUM No tests for the example; offline smoke would have caught the live-mode gap — examples/coding-benchmark/benchmark.ts

tsconfig.examples.json + biome.json include examples in typecheck/lint, but there is no test (no .test.ts, no vitest spec) asserting the offline pipeline produces a realness-scored, judge-scored, stats-rendered output. A single vitest that runs main() in offline mode and asserts allRecords.length === harnessProfiles.length * scenarios.length * reps and that pairwiseStats returns a defined leaderboard would cost ~20 lines and lock in the contract the README claims. The repo already ships vitest (vitest.config.ts at root).

🟠 MEDIUM require() used in ESM live path — examples/coding-benchmark/benchmark.ts

benchmark.ts:86-88 uses require('@tangle-network/sandbox') inside if (live). The package has "type": "module". While pnpm tsx polyfills require() for .ts files, running the compiled ESM output directly throws ReferenceError: require is not defined. Replace with dynamic import('@tangle-network/sandbox') and an async client factory.

🟠 MEDIUM ensemble judge has less context than single judge — examples/coding-benchmark/judges.ts

singleCodeJudge (lines 113-124) builds judgeUserPrompt that includes the task prompt, the rubric note, AND deterministic check results (typecheck/test/lint pass/fail). But ensembleCodeJudge (lines 130-146) passes only artifact.solution and scenario?.prompt to its scoreOne callback, so the rubric note and check results are dropped. This means the 3-model cross-family ensemble judges on LESS information than the single judge, making it strictly worse for accuracy even though it's presented as the more r

🟠 MEDIUM realness validator is recorded but never gates the judge — examples/coding-benchmark/validators.ts

validators.ts:74-89 realnessValidator returns {valid,score} and dispatch.ts:128 writes it to ctx.artifacts, but no consumer reads artifact.realness. judges.ts:107-116 singleCodeJudge scores artifact.solution directly with no reference to artifact.realness.valid. README.md:46 claims 'Realness anchor ... catches a stub that compiles but fakes the hard part' and the validators-before-judge ordering implies the anchor gates the judge band. As shipped, a gated stub still receives a full judge score; only the offline writeJson records the gate. Either route artifact.realness into the judge prompt (so it can downweight) or short-circuit the judge when realness.valid===false (return composite 0). At minimum, tighten the README to say the anchor is recorded-for-honesty, not judge-gating.
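
The short-circuit option can be sketched in a few lines. The shapes and the `gateThenJudge` name are hypothetical, and the judge is kept synchronous for brevity; a real model call would be async.

```typescript
// Hypothetical shapes; a real judge call would be async and return a richer verdict.
interface RealnessResult {
  valid: boolean
  score: number
}

interface JudgeVerdict {
  composite: number
  judged: boolean
}

// If the realness gate fails, return composite 0 without invoking the judge at all,
// so a gated stub can never pick up a model-assigned score.
function gateThenJudge(realness: RealnessResult, judge: () => number): JudgeVerdict {
  if (!realness.valid) return { composite: 0, judged: false }
  return { composite: judge(), judged: true }
}
```

Keeping a `judged` flag on the verdict makes it visible in the record whether a zero came from the gate or from the judge.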

🟡 LOW Tool-knob description is imprecise — examples/README.md

Line 47 says 'a one-line tool knob (websearch / webfetch / MCP)'. The actual preset names in examples/coding-benchmark/tools.ts:29 are 'none' | 'web' | 'search-mcp'; 'websearch' and 'webfetch' are toggled together by the 'web' preset, and the MCP preset is specifically 'search-mcp', not a generic MCP knob. Also, the Run section (lines 109-110) does not demonstrate the --tools flag despite highlighting it as a feature. Fix: change parenthetical to '(none / web / search-mcp)' and optionally add a command like `pnpm tsx examples/coding-benchmark/ben

🟡 LOW Offline refine loop is a no-op: script.solutionFor ignores the round argument — examples/coding-benchmark/benchmark.ts

benchmark.ts:34-58 offlineSolutions.{rate-limiter,csv-parser}.solutionFor is declared as (round: number) => string but the implementations close over no round state and return the same source each call. offline-box.ts:43-45 increments round and calls script.solutionFor(round), so rounds 2 and 3 write byte-identical content to the same path. The dispatch's multi-round refine loop (dispatch.ts:99-115) therefore produces the same checks output every iteration. offline-box.ts:33 docstring even claims 'round 2 can differ from round 1 (refine demo)' — the capability is in the interface but unexercised. Either drop the round parameter from OfflineScript.solutionFor (clarity), or seed a 2-round refine script for one scenario (demo value).

🟡 LOW require('@tangle-network/sandbox') only resolves under tsx shim, not pure ESM node — examples/coding-benchmark/benchmark.ts

benchmark.ts:79 const { SandboxClient: RealClient } = require('@tangle-network/sandbox'). The repo is "type": "module" (package.json:34). Pure node would throw ReferenceError: require is not defined. tsx (the documented runner) shims it, so the README command works, but a partner copying the snippet into an ESM script breaks. Use a top-level await import('@tangle-network/sandbox') guarded behind the live flag, or document the tsx-only constraint in the README's 'Going live' section.

🟡 LOW iteration: maxRounds passes a constant 3 instead of the actual final round — examples/coding-benchmark/dispatch.ts

dispatch.ts:124-127 calls realnessValidator(...).validate(artifact, { iteration: maxRounds, signal: ctx.signal }). maxRounds is the constant 3 (dispatch.ts:31), but ValidationCtx.iteration is documented (src/runtime/types.ts:31) as 'Iteration index this output came from (0-based)'. The realnessValidator implementation ignores ctx (validators.ts:75-76 declares only artifact), so this is benign today, but the value is semantically wrong and would mislead a future validator that reads ctx.iteration. Capture the loop's round variable into a const before the validator call and pass that.

🟡 LOW run.box as unknown as RunBox declares an fs.read that is never used — examples/coding-benchmark/dispatch.ts

dispatch.ts:46 type RunBox = CheckBox & { fs: { read(path: string): Promise } }. dispatch.ts:117 cast run.box as unknown as RunBox. runBoxChecks (validators.ts:62-70) only calls box.exec(command); the fs.read member is dead. The real SandboxInstance does expose get fs(): FileSystem (sandbox-CNcyBhPp.d.ts:4545) so the cast is harmless at runtime, but the type assertion is wider than the actual call surface and invites a reader to think runBoxChecks reads files. Drop fs.read from RunBox (it's just CheckBox).

🟡 LOW exported OfflineQuality type is dead code — examples/coding-benchmark/offline-box.ts

Line 35 exports OfflineQuality = 'real' | 'stub' but grep across the entire repo returns only the definition site. The type is referenced nowhere — no import, no usage. The comment on line 32-34 describes it as a fidelity level for canned solutions, but the concept is never wired. Dead code in an example is misleading.

🟡 LOW model snapshot format comment mismatch — examples/coding-benchmark/profiles.ts

profiles.ts:43-45 states model ids must be name@YYYY-MM-DD or name-YYYYMMDD, but the actual defaults (e.g., anthropic/claude-sonnet-4-5-2025-09-29) use provider/name-YYYY-MM-DD. Runtime accepts them, so the comment is misleading. Update the comment to the accepted format.

🟡 LOW Benjamini-Hochberg correction is degenerate with constant p-proxy — examples/coding-benchmark/stats.ts

Line 118 uses r.low > 0 || r.high < 0 ? 0.04 : 0.5 as a p-value proxy for all pairs whose bootstrap CI excludes zero. Because all 'significant' pairs share the exact same proxy p-value (0.04), the BH procedure cannot distinguish between them by true effect size. The result: for n harnesses with C(n,2) pairs, any pair needs to rank at position k >= 0.04 * C(n,2) / 0.05 to pass BH — e.g., with 4 harnesses (6 pairs), at least 5 of 6 pairs must have CI excluding zero before ANY pair passes BH. This makes BH correction overly conservative, effectively masking real pairwise differences. For the offline stub this is harmless (all deltas are 0.000), but for a l

🟡 LOW leaderboard harness names include opaque hash suffixes — examples/coding-benchmark/stats.ts

stats.ts:58-67 keys grouping on r.agentProfile?.profileId ?? r.candidateId. Actual run output shows labels like claude-code-baseline-91b9b8dc6fe218e5 instead of claude-code-baseline, making the leaderboard harder to read. Use a human-readable profile name for display while keeping the stable id for grouping if needed.

🟡 LOW pProxy 0.04/0.5 is a hand-rolled p-value feed into benjaminiHochberg — examples/coding-benchmark/stats.ts

stats.ts:118 const pProxy = raw.map((r) => (r.low > 0 || r.high < 0 ? 0.04 : 0.5)). The agent-eval substrate ships real significance primitives (pairedTTest, mannWhitneyU, wilcoxonSignedRank, mcnemar — all exported from index.d.ts). Feeding BH a binary 0.04/0.5 proxy loses power information and is explicitly a placeholder. Comment labels it 'proxy' which is honest; for a benchmark marketed as 'scientifically rigorous', wire a real paired test on the matched scores instead.
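
To see why a constant proxy starves the correction, the sketch below implements a plain Benjamini-Hochberg step-up and can be fed 0.04/0.5 values for six pairs (four harnesses). It is a standalone illustration, not agent-eval's benjaminiHochberg; the function name `bhReject` and the default `alpha = 0.05` are assumptions.

```typescript
// Standalone Benjamini-Hochberg step-up sketch.
// Assumption: NOT agent-eval's `benjaminiHochberg`; returns a reject flag per input p-value.
function bhReject(pValues: number[], alpha = 0.05): boolean[] {
  const m = pValues.length
  const order = pValues.map((p, i) => ({ p, i })).sort((x, y) => x.p - y.p)
  // Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
  let cutoff = 0
  order.forEach(({ p }, idx) => {
    if (p <= ((idx + 1) / m) * alpha) cutoff = idx + 1
  })
  const reject = new Array<boolean>(m).fill(false)
  // Every hypothesis ranked at or below the cutoff is rejected.
  order.slice(0, cutoff).forEach(({ i }) => {
    reject[i] = true
  })
  return reject
}
```

With four of six pairs at the proxy value 0.04, the rank-4 threshold is 4/6 × 0.05 ≈ 0.033, so nothing is rejected; with five of six, the rank-5 threshold is about 0.042 and all five pass together. Real p-values from a paired test spread across the ranks, so a single strong pair (say 0.001) can pass on its own.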

_{tangletools · 2026-06-23T23:36:53Z · trace}

tangletools

❌ 3 Blocking Findings — `9e98af79`

Full multi-shot audit completed 2/2 planned shots over 11 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-23T23:36:53Z · immutable trace}

…dings, drop 3 reinventions

Reworks examples/coding-benchmark to compose published agent-eval / agent-runtime primitives instead of hand-rolling them, fixing every PR-369 review finding.

Reinventions removed (now substrate primitives):
- hand-rolled judge (judgeSystemPrompt/judgeUserPrompt/score) -> llmJudge / ensembleJudge
- sumTokens (read only data.tokenUsage; summed 0 under a real sandbox emitting data.usage -> integrity:'assert' threw on every --live cell) -> extractLlmCallEvent
- runBoxChecks flat pass/fail -> MultiLayerVerifier ordered pipeline (typecheck -> test -> lint, dependency-based skip, blended score)

Findings: --live token shapes fixed + verified; real seeded fixture tests so the deterministic checks have a file to run; correct paired-scenario stats keying; an offline vitest smoke; dynamic import() of the SDK behind --live; the ensemble now sees the same full context as the single judge; the realness gate ACTUALLY gates the judge (short-circuits to composite 0, no model call); honest tool-knob docs + --tools command; a real cross-round offline refine demo; dead RunBox.fs.read / OfflineQuality removed; real paired test (pairedTTest) feeding BH real p-values; human-readable harness names on the leaderboard.

Consolidated validators+judges -> eval.ts and tools -> profiles.ts (9 source files -> 7 + a smoke test). README every claim matched to the code. Bumps the agent-eval devDependency to >=0.99.0 (llmJudge). The peerDependency floor is unchanged: only the example uses llmJudge, src does not.

tangletools

✅ Auto-approved PR — `543881c8`

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T08:59:17Z}

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	2 (2 low)
Heuristic	0.0s
Duplication	0.0s
Interrogation	366.9s (2 bridge agents)
Total	366.9s

💰 Value — sound

Adds a copy-pasteable examples/coding-benchmark/ that composes agent-runtime/agent-eval primitives to run a rigorous, firewalled coding benchmark across harnesses with validators-before-judge, realness gating, and honest paired statistics — a good addition with no existing equivalent.

What it does: Introduces examples/coding-benchmark/ (~8 files) demonstrating how to run one coding task across a matrix of harnesses × baseline profiles × scenarios, with a one-line tool preset knob, multi-round refinement in one persistent sandbox, deterministic checks (typecheck/test/lint via MultiLayerVerifier), a structural realness gate that can short-circuit the judge, optional cross-family ensemble judgi
Goals it achieves: Gives partners and contributors a canonical, copy-pasteable example of a scientifically honest coding-agent benchmark using the substrate primitives; demonstrates the no-cheat firewall (agent only sees scenario.prompt, eval criteria stay outside the box); shows how to swap tool surfaces via the profile without per-harness config; and validates the wiring with an offline vitest smoke test.
Assessment: Good change, built in the grain of the codebase. It uses the published primitives the docs point to: runProfileMatrix (benchmark.ts:32,202), openSandboxRun for persistent resumable boxes (dispatch.ts:86), MultiLayerVerifier (eval.ts:170), llmJudge/ensembleJudge (eval.ts:244,264), scoreAuthenticity/gateRealness (eval.ts:199-200), and the stats primitives (stats.ts:22-28). The AgentPro
Better / existing approach: none — this is the right approach. I checked examples/ and bench/ for existing equivalents: examples/product-eval uses runProfileMatrix but for persona conversations, not coding; examples/ui-audit uses runLoop over a UI-audit client; bench/search-bench/run.mts compares coding harnesses but is internal research infrastructure, not a campaign-matrix example, and lacks the validators-before-judge
Model: opencode/kimi-for-coding/k2p7
Bridge attempts: 1

🎯 Usefulness — sound

A well-integrated, offline-runnable example that composes real agent-eval/agent-runtime primitives to compare coding harnesses fairly with rigorous statistics; nothing existing does this and every integration point verifies.

Integration: Fully reachable and wired. Registered in examples/README.md entry 9b (Tier 2) with CLI run commands; covered by tsconfig.examples.json (pnpm run typecheck:examples passes clean); has a passing vitest smoke test (3 tests). I installed deps and ran pnpm tsx examples/coding-benchmark/benchmark.ts after pnpm build — it produced exactly the documented output (8 records = 4 harnesses × 2 scenarios ×
Fit with existing patterns: Composes the right primitives in the grain of the codebase. Follows the established examples convention (imports from published package surface, not relative paths; runs from the repo tsx). Uses the documented agent-eval campaign pattern (runProfileMatrix + ProfileDispatchFn + JudgeConfig) — the same pattern product-eval/ uses for a different domain (user-sim persona eval), so it is consistent, no
Real-world viability: Holds up. The offline path genuinely exercises the whole pipeline: the real matrix runs, MultiLayerVerifier executes real tsc/biome/node --test commands (which fail fast offline because the toolchain isn't on PATH — the documented honest signal, not a fake pass), the realness gate runs for real and the smoke test asserts it both catches a return true stub (gated→0) and passes a real token-bucket
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/coding-benchmark/benchmark.ts

console.log(

🟡 Cruft: magic number added examples/coding-benchmark/benchmark.ts

     `    this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.refillPerSec)\n` +

_{value-audit · 20260624T103113Z}

tangletools · 2026-06-24T10:43:00Z

✅ No Blockers — `543881c8`

Readiness 60/100 · Confidence 80/100 · 17 findings (2 medium, 15 low)

	opencode-kimi	glm	deepseek	aggregate
Readiness	80	80	60	60
Confidence	80	80	80	80
Correctness	80	80	60	60
Security	80	80	60	60
Testing	80	80	60	60
Architecture	80	80	60	60

Full multi-shot audit completed 4/4 planned shots over 12 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM parseArgs produces NaN/incorrect values when flags follow value options — examples/coding-benchmark/benchmark.ts

Lines 53-61: The opt() helper doesn't validate that the next argument is not a flag. --reps --live produces reps = NaN (Number('--live')), and --tools --live sets toolPreset = '--live' (invalid ToolPreset). Verified: node -e 'parseArgs(["--reps","--live"]).reps' → NaN, 'parseArgs(["--tools","--live"]).toolPreset' → "--live". Fix: check that argv[i+1] doesn't start with --, or use a proper arg parser.
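
A guarded version of the option helper might look like the sketch below. The flag names come from the PR; the function names, the `BenchArgs` shape, and the defaults (`reps` 1, tools `'none'`) are assumptions for illustration.

```typescript
// Sketch of a flag-aware option reader. Flag names are from the PR;
// helper names, the BenchArgs shape, and the defaults are assumptions.
type ToolPreset = 'none' | 'web' | 'search-mcp'
const TOOL_PRESETS: readonly ToolPreset[] = ['none', 'web', 'search-mcp']

interface BenchArgs {
  reps: number
  toolPreset: ToolPreset
  live: boolean
  ensemble: boolean
}

function optionValue(argv: string[], name: string): string | undefined {
  const i = argv.indexOf(name)
  if (i === -1) return undefined
  const value = argv[i + 1]
  // Reject a missing value or a following flag instead of silently consuming it.
  if (value === undefined || value.startsWith('--')) {
    throw new Error(`${name} requires a value`)
  }
  return value
}

function isToolPreset(value: string): value is ToolPreset {
  return (TOOL_PRESETS as readonly string[]).includes(value)
}

function parseArgsSketch(argv: string[]): BenchArgs {
  const reps = Number(optionValue(argv, '--reps') ?? '1')
  if (!Number.isInteger(reps) || reps < 1) throw new Error('--reps must be a positive integer')
  const tools = optionValue(argv, '--tools') ?? 'none'
  if (!isToolPreset(tools)) throw new Error(`--tools must be one of ${TOOL_PRESETS.join(', ')}`)
  return {
    reps,
    toolPreset: tools,
    live: argv.includes('--live'),
    ensemble: argv.includes('--ensemble'),
  }
}
```

Failing loudly on `--reps --live` is preferable to a `NaN` rep count flowing into the matrix size.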

🟠 MEDIUM devDependency version (>=0.99.0) diverges from peerDependency (>=0.97.0) for @tangle-network/agent-eval — package.json

Line 92 (devDependencies): "@tangle-network/agent-eval": ">=0.99.0 <1.0.0" — bumped from ">=0.97.0". Line 120 (peerDependencies): "@tangle-network/agent-eval": ">=0.97.0 <1.0.0" — NOT bumped. agent-eval is a REQUIRED peer dependency (not listed in peerDependenciesMeta optional, line 126-138). Gap: consumers told they can use 0.97.0–0.98.x but dev now requires 0.99.0. Verified: no src/ changes in this PR, so library API is unchanged and consumers w

🟡 LOW Non-standard table key '9b' — examples/README.md

Line 47: The new row key 9b is non-standard in an otherwise numeric sequence (1-22). It's intentional (interleaves with 9 for ui-audit to avoid renumbering the full tier list), but could confuse readers expecting pure numeric ordering. Consider 10 and shift everything down, or add a footnote explaining the b suffix.

🟡 LOW README command depends on undeclared tsx — examples/coding-benchmark/README.md

Lines 6-16 document pnpm tsx examples/coding-benchmark/benchmark.ts, but tsx is not listed in this repo's devDependencies and node_modules/.bin/tsx does not exist after pnpm install --frozen-lockfile. The command happens to work in this environment only because a global tsx binary is on PATH (/home/drew/code/cli-bridge/node_modules/.bin/tsx). In a clean checkout without a global tsx, the documented command will fail. Fix: add tsx to devDependencies, or change the documented command to a runner that is guaranteed (e.g. node --experimental-strip-types or pnpm exec tsc && node ...).

🟡 LOW Offline csv-parser cannot detect row separators — examples/coding-benchmark/benchmark.ts

Lines 88-96: The offline CSV parser checks c === '\\n' to detect row splits, but c is a single character from charAt() and '\\n' is a two-character string — the comparison is always false. Verified via node simulation: multi-row input 'a,b\\nc,d' produces one row [['a','b\\nc','d']] instead of two. The test fixture only tests single-row cases so this is latent, but the example code is incorrect for multi-row CSV.
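
For comparison, a minimal parser that does split rows is sketched below. It is not the PR's offline solution: it handles quoted fields, doubled-quote escapes, and LF / CRLF row separators, and leaves other RFC 4180 edge cases out.

```typescript
// Minimal sketch, not the PR's offline solution. Handles quoted fields, "" escapes,
// and LF / CRLF row separators; other RFC 4180 edge cases are left out.
function parseCsv(text: string): string[][] {
  const rows: string[][] = []
  let row: string[] = []
  let field = ''
  let inQuotes = false
  for (let i = 0; i < text.length; i++) {
    const c = text.charAt(i)
    if (inQuotes) {
      if (c === '"' && text.charAt(i + 1) === '"') {
        field += '"' // doubled quote inside a quoted field
        i++
      } else if (c === '"') {
        inQuotes = false
      } else {
        field += c // commas and newlines are literal inside quotes
      }
    } else if (c === '"') {
      inQuotes = true
    } else if (c === ',') {
      row.push(field)
      field = ''
    } else if (c === '\n') {
      // '\n' is the single LF character; '\\n' would be backslash + n and never match charAt().
      row.push(field)
      rows.push(row)
      row = []
      field = ''
    } else if (c !== '\r') {
      field += c
    }
  }
  // Flush the last row when the input does not end with a newline.
  if (field !== '' || row.length > 0) {
    row.push(field)
    rows.push(row)
  }
  return rows
}
```

A multi-row fixture such as `parseCsv('a,b\nc,d')` is the kind of case that would have exposed the latent bug.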

🟡 LOW benchmark runDir is created but never cleaned up — examples/coding-benchmark/benchmark.ts

Line 176 creates a temp directory with mkdtempSync(join(tmpdir(), 'coding-benchmark-')) and passes it to runProfileMatrix, but main never removes it. After running the test suite multiple times, /tmp/coding-benchmark-* directories accumulate. Fix: wrap the matrix run in a try/finally and rmSync(runDir, { recursive: true, force: true }) after records are returned.
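
The suggested try/finally can be wrapped in a small helper, sketched below. `withRunDir` is a hypothetical name, and the helper is synchronous for brevity; for an async matrix run the callback would return a promise and the helper would need to `await` it inside the `try` so cleanup waits for completion.

```typescript
import { existsSync, mkdtempSync, rmSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Hypothetical helper: creates a temp run directory, hands it to `work`,
// and removes it afterwards even if `work` throws.
// Synchronous sketch; an async run must be awaited inside the try block.
function withRunDir<T>(work: (runDir: string) => T): T {
  const runDir = mkdtempSync(join(tmpdir(), 'coding-benchmark-'))
  try {
    return work(runDir)
  } finally {
    rmSync(runDir, { recursive: true, force: true })
  }
}

// Usage: the directory exists during the callback and is gone afterwards.
let seenDir = ''
const existedDuringRun = withRunDir((dir) => {
  seenDir = dir
  return existsSync(dir)
})
```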

🟡 LOW Advisory lint failures are invisible to the refine prompt — examples/coding-benchmark/dispatch.ts

nextPrompt at dispatch.ts:51 calls layerOutput(report, 'lint'), whose passed field is true because checkLayer for advisory layers returns status: 'pass' regardless of the biome result (eval.ts:135-144). This keeps lint from blocking allPass (correct), but it also means biome warnings never reach the agent in the refine prompt, so the agent cannot correct style issues. Fix: include the lint layer's actual ok/output in the refine prompt while still treating lint as advisory for allPass.
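
The two concerns can be separated as in the sketch below: one function decides whether the round passes, another decides what the agent is told. The `LayerResult` shape and both function names are hypothetical and do not reflect MultiLayerVerifier's actual report type.

```typescript
// Hypothetical layer shape; not MultiLayerVerifier's report type.
interface LayerResult {
  name: string
  ok: boolean // the tool's real exit status
  advisory: boolean // advisory layers never block the round
  output: string
}

// Advisory failures do not block: the round passes if every blocking layer is ok.
function allPass(layers: LayerResult[]): boolean {
  return layers.every((l) => l.ok || l.advisory)
}

// Feedback uses the real `ok` flag, so advisory lint output still reaches the agent.
function refineFeedback(layers: LayerResult[]): string {
  return layers
    .filter((l) => !l.ok)
    .map((l) => `[${l.name}${l.advisory ? ', advisory' : ''}] ${l.output}`)
    .join('\n')
}
```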

🟡 LOW Multiple unsafe 'as unknown as' type casts bypass type safety — examples/coding-benchmark/dispatch.ts

Line 120: run.box as unknown as CheckBox — double-casts the sandbox instance to the minimal CheckBox interface. Line 168: extractLlmCallEvent(ev as never, 'agent') — casts each stream event to never before passing to the extractor. In offline-box.ts lines 64, 100: as unknown as SandboxEvent and as unknown as SandboxInstance. These all suppress type checking. While the

🟡 LOW Ensemble judge model names lack snapshot dates — contradicts profiles.ts own rule — examples/coding-benchmark/eval.ts

eval.ts:ensembleCodeJudge (line ~213) hardcodes models: ['deepseek-chat', 'gpt-4o-mini', 'gemini-flash'] — bare aliases with no snapshot date. profiles.ts:harnessModel (line ~43) explicitly requires provider/name-YYYY-MM-DD form and documents why: 'a run record without the exact model snapshot is not reproducible.' The ensemble judges also produce scored records; using aliases undermines the reproducibility property the example otherwise enforces. If the live router rejects aliases, the ensemble silently fails. Fix: use snapshot-dated IDs consistent with profiles.ts.

🟡 LOW Unquoted paths in shell-executed check and seed commands — examples/coding-benchmark/eval.ts

eval.ts:seedFile (line ~103): box.exec(`mkdir -p ${dir} && printf %s '${b64}' | base64 -d > ${file.path}`) interpolates file.path and dir unquoted. scenarios.ts:typecheckCmd/testCmd/lintCmd (lines ~55-60): `tsc ... ${path}`, `node --test ${fixturePath}`, `biome check ${path}` follow the same pattern. Current corpus paths (test/rate-limiter.test.js, src/csv.ts) are safe and eval-controlled (never agent-influenced), so this is not an active vulnerability. But as example code that partners will copy, it sets a latent injection precedent. Fix: single-quote paths: `mkdir -p '${dir}' ... > '${file.path}'`.
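
Plain single quotes stop protecting a path as soon as the path itself contains a single quote. The sketch below shows the usual POSIX-shell workaround; `shellQuote` and `seedCommand` are hypothetical helpers, not functions from the example, and `seedCommand` only mirrors the command shape quoted in the finding.

```typescript
// Wraps a value in single quotes for a POSIX shell. An embedded single quote
// is written as '\'' (close the quote, emit an escaped quote, reopen).
function shellQuote(value: string): string {
  return "'" + value.replace(/'/g, "'\\''") + "'";
}

// Hypothetical builder mirroring the seed command from the finding above.
function seedCommand(dir: string, path: string, b64: string): string {
  return (
    `mkdir -p ${shellQuote(dir)} && ` +
    `printf %s ${shellQuote(b64)} | base64 -d > ${shellQuote(path)}`
  );
}

console.log(seedCommand('src/my dir', "src/my dir/it's.ts", 'aGVsbG8='));
```

Writing the fixture through a structured file API avoids the shell entirely and is the more robust option where the sandbox offers one.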

🟡 LOW llmJudge imported from main index breaks peer-floor (>=0.97.0) consumers — examples/coding-benchmark/eval.ts

Line 1: import { ... llmJudge ... } from '@tangle-network/agent-eval'. Verified that llmJudge IS exported from the main index in 0.99.0 (the devDep floor >=0.99.0) but is NOT in 0.97.x/0.98.x (within the declared peerDependencies range >=0.97.0). A downstream consumer pinning agent-eval to the peer floor who compiles this example gets llmJudge is not exported. Fix: import llmJudge from @tangle-network/agent-eval/campaign (exported in both versions). Every other symbol used exists in both.

🟡 LOW seedFile shells out with unquoted file paths — examples/coding-benchmark/eval.ts

Line 105 builds mkdir -p ${dir} && printf %s '${b64}' | base64 -d > ${file.path}. The current scenario paths are fixed (src/..., test/...), but the path is interpolated unquoted, so a scenario path containing spaces or shell metacharacters would break or inject. Fix: quote the path (e.g., use JSON.stringify(file.path) as the shell word) or, better, seed fixtures via box.fs.write instead of box.exec.

🟡 LOW seedFile uses unsanitized paths in shell command — examples/coding-benchmark/eval.ts

Line 105: await box.exec('mkdir -p ${dir} && printf %s ... base64 -d > ${file.path}') — file.path and dir are interpolated directly into a shell command. Currently safe because scenarios.ts uses hardcoded paths, but this is a latent injection vector if scenarios are ever loaded from external config. The base64 encoding itself is safe (standard b64 has no quotes/shell metachar). Note: scenario.fixture.content is NOT escaped against single quotes in the printf %s '${b64}' wrapper, but base64 output never contains single quotes.

🟡 LOW Test fixture .js imports .ts — node --test needs a TS loader — examples/coding-benchmark/scenarios.ts

Lines 64-68: node --test test/rate-limiter.test.js where the .js fixture does import { RateLimiter } from '../src/rate-limiter.ts'. Node.js 22+ cannot natively resolve TypeScript imports from JavaScript without --loader ts-node/esm or --import tsx. The README acknowledges this fails offline, but the command itself is structurally incorrect for any vanilla Node.js environment. In a real harness box the toolchain may be pre-configured, but the check commands offer no fallback.

🟡 LOW Test fixtures import .ts files from .js — requires Node 23.6+ type stripping — examples/coding-benchmark/scenarios.ts

Fixture test/rate-limiter.test.js (line ~82) does import { RateLimiter } from '../src/rate-limiter.ts' and test/csv.test.js (line ~143) does import { parseCsv } from '../src/csv.ts'. The check command is node --test test/rate-limiter.test.js (no --experimental-strip-types flag). Importing a .ts file from .js requires Node 23.6+ (unflagged type stripping) or the --experimental-strip-types flag on 22.6–23.5. In live mode with an older Node, the test layer always fails regardless of solution quality, making the typecheck→test→lint pipeline's test layer a permanent red. Offline this is moot (toolchain absent anyway). The README's claim that checks 'run for real' in a live box depends on the box Node version.
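
A check command could refuse to run on a Node version that cannot import `.ts` files, rather than reporting a red test layer for every solution. The sketch below takes the 22.6 floor from the finding above. `meetsTsImportFloor` and `testCommand` are hypothetical names, and passing `--experimental-strip-types` unconditionally is an assumption of this sketch.

```typescript
// Parses "major.minor.patch"; an optional leading "v" is tolerated.
function parseNodeVersion(version: string): [number, number] {
  const parts = version.replace(/^v/, '').split('.').map(Number);
  return [parts[0] ?? 0, parts[1] ?? 0];
}

// Floor taken from the finding above: .ts imports need Node 22.6 or newer.
function meetsTsImportFloor(version: string): boolean {
  const [major, minor] = parseNodeVersion(version);
  return major > 22 || (major === 22 && minor >= 6);
}

// Hypothetical command builder: fails loudly on an old Node instead of
// letting the test layer fail for every solution.
function testCommand(fixturePath: string, nodeVersion: string): string {
  if (!meetsTsImportFloor(nodeVersion)) {
    throw new Error(`Node ${nodeVersion} cannot import .ts fixtures; need >= 22.6`);
  }
  return `node --experimental-strip-types --test ${fixturePath}`;
}

console.log(testCommand('test/rate-limiter.test.js', '22.6.0'));
```

In a live box the version argument would come from the box's own `node --version` output rather than a literal.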

🟡 LOW devDep lower bound bumped (0.97→0.99) but peerDep stays at 0.97.0 — package.json

package.json:92 now requires @tangle-network/agent-eval >=0.99.0 <1.0.0 for dev, while package.json:120 still declares peerDependency >=0.97.0 <1.0.0. Currently safe: no src/ code changed in this PR and files field excludes examples/, so nothing published needs 0.99. Risk is latent — any future PR that imports a 0.99-only agent-eval API into src/ would silently allow consumers on 0.97/0.98 and break at runtime. Suggest a comment in the PR description noting the asymmetry is intentional, or bump peer to >=0.99.0 proactively if any 0.99 API is planned for src/ soon. Not blocking.

🟡 LOW devDependency minimum exceeds peerDependency minimum for @tangle-network/agent-eval — package.json

Line 92 devDependencies specifies @tangle-network/agent-eval >=0.99.0 <1.0.0, but line 120 peerDependencies still allows >=0.97.0 <1.0.0. After this change CI/develop will install 0.99.0 while consumers can legally install 0.97.0 or 0.98.0. If 0.98.0/0.99.0 introduced breaking type or runtime changes in re-exported substrate primitives, consumers on the older peer range may see failures that CI no longer catches. Either align peerDependencies to >=0.99.0 <1.0.0 if the library truly requires it, or add a CI matrix job that verifies against the lowest supported peer

_{tangletools · 2026-06-24T10:42:57Z · trace}

…ta, reps stop pseudo-replicating, TS test runner, runnable on clean clone

Four partner-blocking defects against the honesty pitch:

- HIGH: the round-0 offline stub the dispatch wrote scored composite 0.6 (gated:false) because its `refillPerSec` param matched the realImpl regex, so the "stub → gated → composite 0" demo never fired on a real run; only the unit test's separate strawman gated. Make the round-0 stub genuinely hollow (inert `_capacity`/`_ratePerSec` args, no refill math) so gateRealness gates it to composite 0 on the benchmark's own data. Export `offlineSolutions` and assert the gate against the EXACT dispatch stub in the smoke test.
- HIGH: the leaderboard CI/Wilson were computed over every raw rep record, so identical reps faked a narrower interval (pass-CI [34%,100%] → [61%,100%] at reps=3). Collapse reps to one mean per (harness,scenario) before the CI/Wilson, matching the pairing path. Add a regression test that identical reps leave the CI unchanged.
- MEDIUM: the test check ran plain `node --test`; the fixture imports the solution as `.ts`, and Node strip-only mode throws ERR_UNSUPPORTED_TYPESCRIPT_SYNTAX on constructor parameter properties (the canonical impl's style), false-failing a correct solution. Run `node --experimental-transform-types --test`.
- MEDIUM: `tsx` was undeclared, so the documented `pnpm tsx ...benchmark.ts` faceplanted on a clean clone. Pin `tsx@^4.22.4` in devDependencies and update the lockfile (agent-eval stays 0.99.0).

README updated so every claim matches: the gate fires on real data, reps are honest, the test runner handles TS param properties, and the run command works.
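
The pseudo-replication effect in the second HIGH item can be checked by hand. The sketch below uses a textbook Wilson score interval and a hypothetical `collapseReps` helper, not the agent-eval `wilson` primitive. It assumes one harness, two scenarios, and three identical passing reps each, which is consistent with the quoted bounds.

```typescript
// Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%).
function wilsonInterval(passes: number, n: number, z = 1.96): [number, number] {
  const p = passes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

interface RepRecord {
  harness: string;
  scenario: string;
  passed: boolean;
}

// Hypothetical helper: one mean pass rate per (harness, scenario) cell.
function collapseReps(records: RepRecord[]): Map<string, number> {
  const sums = new Map<string, { total: number; n: number }>();
  for (const r of records) {
    const key = `${r.harness}::${r.scenario}`;
    const cell = sums.get(key) ?? { total: 0, n: 0 };
    cell.total += r.passed ? 1 : 0;
    cell.n += 1;
    sums.set(key, cell);
  }
  const means = new Map<string, number>();
  sums.forEach((cell, key) => means.set(key, cell.total / cell.n));
  return means;
}

// One harness, two scenarios, three identical passing reps each.
const records: RepRecord[] = [];
for (const scenario of ['rate-limiter', 'csv']) {
  for (let rep = 0; rep < 3; rep++) {
    records.push({ harness: 'codex', scenario, passed: true });
  }
}

// Treating every rep as an independent trial: n = 6.
const rawPasses = records.filter((r) => r.passed).length;
const raw = wilsonInterval(rawPasses, records.length);

// Collapsing first: n = 2 cells, passes = sum of the per-cell means.
const cells = collapseReps(records);
let cellPasses = 0;
cells.forEach((mean) => (cellPasses += mean));
const collapsed = wilsonInterval(cellPasses, cells.size);

console.log(Math.round(raw[0] * 100), Math.round(collapsed[0] * 100)); // 61 34
```

The lower bound moves from 34% to 61% only because the same outcome was counted three times, which is the narrowing the fix removes.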

tangletools

✅ Auto-approved PR — `818e73db`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T10:52:43Z}

…ewall claims, runnable-clone hardening

Address the union of four critical-audit lenses on the coding-benchmark example.

Realness gate (HIGH): the gate fired on one strawman stub shape, not the natural cheat. Tightened each task's realImpl to require the actual hard-part work (refill MATH, quote-state tracking, capacity eviction) so a decoy token (a `refillPerSec` param name, a `for (` loop, a passthrough Map) no longer reads as a real impl, and added a third LRU scenario whose only passing path is the real eviction algorithm. The smoke test now asserts each NATURAL cheat is gated, per task.

Firewall claims (HIGH/MEDIUM): the README/dispatch/scenarios claimed the agent "literally cannot read the answer key". The seeded test fixture is intentionally visible (TDD-style) and a multi-round agent can read it; only the LLM-judge rubric and realness signals are firewalled. Softened the claim to that precise truth, and rephrased "gates to composite 0 on the actual run" to "the smoke test asserts the gate; the run refines past it, so no leaderboard cell ends gated".

DX/runtime hardening:

- parseArgs no longer swallows a following flag as an option value (`--reps --live` → reps=1, not NaN) and clamps reps to a positive integer
- the matrix runDir is cleaned in a finally
- the LLM judge is imported from /campaign so it resolves across the whole peer range
- ensemble panel models are snapshot-dated
- seedFile prefers the structured fs.write seam (no shell injection surface)
- the realness scan takes the seeded fixture as a reference so a real solution carries no spurious DEAD_ARTIFACT
- stats fails loud on a missing scenarioId instead of merging into one bucket
- renderStats prints a power caveat when n<6
- advisory lint warnings now reach the refine prompt without gating allPass
- casts narrowed
- README notes the Node>=22.6 test-layer floor, the live-box PATH requirements, and the offline-ensemble degeneracy
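
The parseArgs behavior described above can be sketched as follows. This is a hypothetical reimplementation for illustration, not the example's code: the `Args` shape, the default of one rep, and the `clampReps` helper are assumptions.

```typescript
interface Args {
  reps: number;
  live: boolean;
}

// Falls back to 1 for NaN, zero, or negative input; truncates fractions.
function clampReps(n: number): number {
  return Number.isFinite(n) && n >= 1 ? Math.floor(n) : 1;
}

function parseArgs(argv: string[]): Args {
  const args: Args = { reps: 1, live: false };
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag === '--live') {
      args.live = true;
    } else if (flag === '--reps') {
      const next = argv[i + 1];
      // Consume the next token only when it is a value, not another flag.
      if (next !== undefined && !next.startsWith('--')) {
        args.reps = clampReps(Number(next));
        i++;
      }
    }
  }
  return args;
}

console.log(parseArgs(['--reps', '--live'])); // reps stays 1, live is true
console.log(parseArgs(['--reps', '3', '--live'])); // reps is 3, live is true
```

Leaving `--live` unconsumed lets the loop see it on the next iteration, so a missing value for `--reps` no longer hides the following flag.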

tangletools

🟢 Value Audit — sound


Verdict	sound
Concerns	2 (2 low)
Heuristic	0.0s
Duplication	0.0s
Interrogation	279.1s (2 bridge agents)
Total	279.1s

💰 Value — sound

Adds a well-scoped, runnable examples/coding-benchmark/ that composes agent-runtime/agent-eval primitives to benchmark coding-agent harnesses across tasks with validators, realness gating, LLM judging, and honest statistics — a good, additive example with no existing equivalent.

What it does: Introduces examples/coding-benchmark/ (~9 files): an example that runs a matrix sweep of coding tasks across harness baseline profiles (profiles.ts:70-75), dispatches each cell through a persistent multi-round sandbox run (dispatch.ts:84-162), scores results with deterministic checks → realness gate → LLM judge (eval.ts), and reports paired-bootstrap + Wilson + BH-corrected stats (`stats.t
Goals it achieves: 1) Provide a canonical, copy-pasteable pattern for comparing coding-agent harnesses fairly. 2) Demonstrate validators-before-judge, a grading-criteria firewall (dispatch.ts:16-30), controlled tool surfaces (profiles.ts:77-123), and statistically honest comparison (stats.ts). 3) Show that all the heavy primitives (runProfileMatrix, MultiLayerVerifier, scoreAuthenticity, llmJudge, `ens
Assessment: Good change on its merits. It is coherent, additive to the examples directory, composes existing substrate primitives rather than reimplementing them, and includes offline smoke tests that verify the load-bearing honesty claims (coding-benchmark.test.ts:43-185). Typecheck (pnpm run typecheck:examples) and the new tests pass. The README is precise about scope limits (3 tasks is underpowered, fi
Better / existing approach: none — this is the right approach. I searched the repo for runProfileMatrix usage (grep found only examples/coding-benchmark/benchmark.ts:232, examples/product-eval/product-eval.ts:104, and internal runtime adapters in src/runtime/loop-dispatch.ts and src/conversation/run-persona.ts). The only other matrix example is product-eval, which is a user-sim conversation eval, not a coding b
Model: opencode/kimi-for-coding/k2p7
Bridge attempts: 1

🎯 Usefulness — sound

A well-composed reference example that runs a coding-benchmark matrix purely over agent-runtime/agent-eval primitives; all 16 offline tests pass, every imported symbol resolves in the installed packages, and it follows the two established example patterns (runProfileMatrix from product-eval, in-proc

Integration: Reachable and wired: examples/README.md:47 lists it as example 9b and :109-110 prints the exact CLI. It runs via pnpm tsx examples/coding-benchmark/benchmark.ts and its smoke test (examples/coding-benchmark/coding-benchmark.test.ts) executes in CI — confirmed, 16/16 pass, producing the expected 4 harnesses × 3 scenarios × 1 rep = 12 records and a defined leaderboard. Its caller is the developer
Fit with existing patterns: Fits the grain exactly. It composes only substrate primitives, each verified present: runProfileMatrix/inMemoryCampaignStorage/JudgeConfig/llmJudge/Scenario from @tangle-network/agent-eval/campaign; MultiLayerVerifier/ensembleJudge/stats fns from agent-eval; scoreAuthenticity/gateRealness from agent-eval/authenticity; openSandboxRun/extractLlmCallEvent/AgentRunSpec from agent-runtime/loops (signat
Real-world viability: Holds up: offline path degrades honestly (missing tsc/biome/node toolchain → fast non-zero exit, not a fake pass; all 3 refine rounds run), the firewall is structural (scenario.prompt is the only field the dispatch copies into the box; rubric/realness are read post-loop in eval.ts), and the stats layer refuses to overclaim (renderStats prints a power caveat when n<6 and the test proves identical r
Model: opencode/zai-coding-plan/glm-5.2
Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/coding-benchmark/benchmark.ts

console.log(

🟡 Cruft: magic number added examples/coding-benchmark/benchmark.ts

     `    this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.refillPerSec)\n` +

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

_{value-audit · 20260624T122332Z}

…tion, delete the realness regex gate

The realness regex gate could never prove code is real: it scans text, so a comment or dead code evades it. Replace it with held-out test execution (SWE-bench / HumanEval style): the agent develops against a few visible example tests, then is graded on a hidden suite it never saw and cannot hardcode. A real solution passes; a hardcode-the-visible cheat fails the held-out inputs.

Deleted entirely: realnessGate, scoreAuthenticity / gateRealness usage, the gatedByRealness judge wrapper, realnessSignals (realImpl/fakeShim per scenario), the DEAD_ARTIFACT handling, and the AuthenticitySignals / ProducedFile imports.

Built: per-scenario visibleTest (seeded during the turn) + heldoutTest (seeded ONLY at grading, which is the firewall). runHeldout copies the hidden suite into the box after the loop and runs `node --experimental-transform-types --test`; the held-out pass rate is the PRIMARY, ungameable correctness score. Composite = 0.7 * held-out + 0.3 * judge-quality (the LLM judge stays as a secondary code-quality signal; MultiLayerVerifier stays as advisory dev checks).

stats: suppress the SIGNIFICANT tag below the power floor (n<6) and on a zero-variance pair, so small-n / no-variance never prints a bare SIGNIFICANT.

Offline-proven: a hardcode-the-visible cheat scores held-out 2/4 -> composite 0.59; the real impl scores held-out 4/4 -> composite 0.94 (judge held at 0.80). Held-out tests are never seeded during the turn (firewall, asserted per scenario). README rewritten honestly; no realness/regex/authenticity claims.
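
The composite arithmetic quoted above can be reproduced directly. The function name and signature below are hypothetical; only the 0.7 / 0.3 weights and the sample inputs come from the commit message.

```typescript
const HELDOUT_WEIGHT = 0.7;
const JUDGE_WEIGHT = 0.3;

// Composite = 0.7 * held-out pass rate + 0.3 * judge quality (both in [0, 1]).
function composite(heldoutPassed: number, heldoutTotal: number, judgeQuality: number): number {
  const heldoutRate = heldoutTotal === 0 ? 0 : heldoutPassed / heldoutTotal;
  return HELDOUT_WEIGHT * heldoutRate + JUDGE_WEIGHT * judgeQuality;
}

// Hardcode-the-visible cheat: 2 of 4 held-out tests pass, judge held at 0.80.
console.log(composite(2, 4, 0.8).toFixed(2)); // "0.59"
// Real implementation: 4 of 4 held-out tests pass, same judge score.
console.log(composite(4, 4, 0.8).toFixed(2)); // "0.94"
```

Because the judge score is held constant, the whole 0.35 gap between the two composites comes from the held-out pass rate.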

tangletools

✅ Auto-approved PR — `d5fa3a7f`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-24T12:51:59Z}

tangletools previously approved these changes Jun 23, 2026

Merge remote-tracking branch 'origin/main' into fix/cb-rebase

9e98af7

# Conflicts: # examples/README.md

drewstone dismissed tangletools’s stale review via 9e98af7 June 23, 2026 23:09

tangletools approved these changes Jun 23, 2026

tangletools reviewed Jun 23, 2026

tangletools requested changes Jun 23, 2026

tangletools previously approved these changes Jun 24, 2026

tangletools reviewed Jun 24, 2026

drewstone dismissed tangletools’s stale review via 818e73d June 24, 2026 10:52

tangletools previously approved these changes Jun 24, 2026

drewstone dismissed tangletools’s stale review via 2bd13d9 June 24, 2026 11:40

tangletools reviewed Jun 24, 2026

tangletools approved these changes Jun 24, 2026

Merge remote-tracking branch 'origin/main' into merge/369

3fda2d7

drewstone merged commit 9bf8016 into main Jun 24, 2026
1 check failed

drewstone merged 7 commits into main from docs/coding-benchmark-example

✅ Auto-approved PR — `3b335ea1`

✅ Auto-approved PR — `9e98af79`

❌ Needs Work — `9e98af79`

❌ 3 Blocking Findings — `9e98af79`

✅ Auto-approved PR — `543881c8`

✅ No Blockers — `543881c8`

✅ Auto-approved PR — `818e73db`

✅ Auto-approved PR — `d5fa3a7f`