fix(runtime): harness-verified scoring (close self-report exploit) + statistical promotion gate by drewstone · Pull Request #217 · tangle-network/agent-runtime

drewstone · 2026-06-10T01:44:14Z

The adversarial theory review found a live exploit in our own published code, and it's the highest-impact fix on the board.

The hole

defineStrategy bodies self-reported score; runBenchmark ranked on the self-report, while built-in drivers compute score from surface.score(). An authored/adversarial strategy could return {score:1} doing nothing and win the train set and the frozen holdout — falsifying the "structurally safe by construction" claim and invalidating any authored-strategy result.

The fix

defineStrategy.act tracks the harness-verified best score across the shots it brokered (each ShotResult is scored by surface.score in the executor) and overrides the body's self-reported score/resolved. A body can only report what its real shots achieved. Proven: a {score:1}-with-zero-shots strategy now scores 0.
StrategyCtx.surface narrowed to open/close (no raw call/score to the body — scores reach it only through shot()'s verified channel).
Flywheel promotion gate: raw h1>h0 (coin-flip false certification at m≈8) → paired-bootstrap CI on the per-task holdout lift, must exclude 0 + rotating holdout slice (HOLDOUT_OFFSET).

Test

680 tests pass (built-in strategies unchanged — they already verified); exploit test confirms fabricators score 0.

…e hole + statistical promotion gate The adversarial theory review found a live exploit in our own code: defineStrategy bodies SELF-REPORTED their `score`, and runBenchmark ranked on it — while the built-in drivers compute score from surface.score(). So an authored (or adversarial) strategy could `return {score:1}` having done NOTHING and win both the train set AND the frozen holdout. The "structurally safe by construction" claim was FALSE for the authored path, invalidating any authored-strategy result. Fix (the load-bearing one): defineStrategy's act now tracks the harness-VERIFIED best score across the shots it actually brokered (each ShotResult is scored by surface.score inside the executor) and OVERRIDES the body's self-reported score/resolved in the deliverable. A body can only report what its real shots achieved. Proven: a strategy returning {score:1} with zero shots now scores 0. Also: StrategyCtx.surface narrowed to open/close only (no raw call()/score() to the body — scores reach it solely through shot()'s verified channel). And the flywheel promotion gate replaced raw `h1>h0` (a no-margin point comparison on m≈8 tasks ≈ coin-flip false certification) with a paired-bootstrap CI on the per-task holdout lift that must EXCLUDE zero, plus a rotating holdout slice (HOLDOUT_OFFSET) — reused frozen slices are an unforced overfitting channel when tasks stream ~free. 680 tests pass (built-in strategies score identically — they already verified); exploit test confirms fabricators now score 0.

tangletools

✅ Auto-approved PR — `2fc1df3f`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T01:44:21Z}

…adversarial review (#218) A 20-agent first-principles investigation (5 theory lenses × adversarial attack with literature + source-code access) asked whether this program leapfrogs SOTA. Verdict: NO new theorem does — 0 breakthroughs, 7 claims survived only cut to narrower forms, 8 killed (program-space-gradient law refuted by GEPA; category-theory functor constructively falsified; spend-ratchet/dose-κ/Pandora/bit-metered reduce to known work). The doc records what survived honestly: (S1) channel-factorization — critique carries zero check-bits, all pressure factors through the typed score surface; (S2) the selection functional π as a first-class SIGNED term of the eval estimand (the genuinely-unclaimed piece; same data flips sign under keep-best vs final-state); (S3) retention≠retrieval — store certified programs, not prose. Plus the one sharp idea: short programs can't overfit a small holdout (description-length handle), an argument for program- over prompt-space on generalization grounds. The corrected E1–E5 slate; the deployable-check boundary stated. The meta-finding: the program's real edge is measurement INTEGRITY, not a sharper formalism — the attack found a live self-reported-score exploit in our own code, now fixed (#217). A harness that adversarial review hardens is the scarce asset.

…ot-gun (#219) Two fixes the trusted flywheel run + audit surfaced (all our own code): 1. Empty-messages foot-gun (the real cause of the authored strategy scoring 0/12): the shot executor treated `messages: []` as a CARRIED conversation, so an authored body passing an empty array started the worker with a BLANK prompt (no system, no task). Fixed at the executor chokepoint (covers every caller): empty-or-absent messages = a fresh conversation. The author contract now states it explicitly. 2. Breach 1 (unconfined authored import — was prompt-only): assertAuthoredCodeSafe is a runtime static lint run before the dynamic import — rejects foreign imports, require, eval, new Function, process/globalThis, fetch, node builtins; allows only the defineStrategy import. NOT a sandbox (semi-trusted authors); fully untrusted authors still need a container, documented. Verified: all five escape-hatch cases blocked, a legit strategy allowed. Breach 2 (trusted self-report) was fixed in #217; Breach 3 (ShotResult.score in body control flow) is by design — bodies SHOULD branch on the verified score; the firewall is that they never see the verifier/expected values, which holds (StrategyCtx.surface is open/close only). 680 tests pass.

…ion cannot poison the next generation (#261) The deeper cost run crashed at gen2 authoring: an authored body returned a StrategyResult without progression (advisory, unvalidated since #217 made score/resolved harness-owned), the undefined rode through runBenchmark into the losses table, and compactLosses threw on .map — killing the run a generation AFTER the offending candidate ran. defineStrategy now normalizes progression/completions/shots on the deliverable (the source fix); compactLosses tolerates absence anyway (depth). Test: a body returning only {score, resolved} yields a well-formed cell.

drewstone merged commit be674fe into main Jun 10, 2026

drewstone deleted the fix/harness-integrity branch June 10, 2026 01:44

tangletools approved these changes Jun 10, 2026

View reviewed changes

drewstone mentioned this pull request Jun 10, 2026

docs(research): the leapfrog theory verdict — honest synthesis after adversarial review #218

Merged

drewstone mentioned this pull request Jun 10, 2026

fix(runtime): authored-code import enforcement + empty-messages foot-gun #219

Merged

This was referenced Jun 10, 2026

feat(loops): seeded promotion gate + authored-path convergence #220

Merged

fix(loops): normalize advisory deliverable fields — an authored omission cannot poison the next generation #261

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(runtime): harness-verified scoring (close self-report exploit) + statistical promotion gate#217

fix(runtime): harness-verified scoring (close self-report exploit) + statistical promotion gate#217
drewstone merged 1 commit into
mainfrom
fix/harness-integrity

drewstone commented Jun 10, 2026

Uh oh!

tangletools left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

drewstone commented Jun 10, 2026

The hole

The fix

Test

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

✅ Auto-approved PR — 2fc1df3f

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

✅ Auto-approved PR — `2fc1df3f`