fix(runtime): harness-verified scoring (close self-report exploit) + statistical promotion gate #217
Merged
…e hole + statistical promotion gate
The adversarial theory review found a live exploit in our own code: defineStrategy
bodies SELF-REPORTED their `score`, and runBenchmark ranked on it — while the built-in
drivers compute score from surface.score(). So an authored (or adversarial) strategy
could `return {score:1}` having done NOTHING and win both the train set AND the frozen
holdout. The "structurally safe by construction" claim was FALSE for the authored path,
invalidating any authored-strategy result.
Fix (the load-bearing one): defineStrategy's act now tracks the harness-VERIFIED best
score across the shots it actually brokered (each ShotResult is scored by surface.score
inside the executor) and OVERRIDES the body's self-reported score/resolved in the
deliverable. A body can only report what its real shots achieved. Proven: a strategy
returning {score:1} with zero shots now scores 0.
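
The override described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the type shapes, the `Executor` stand-in, the name `defineStrategySketch`, and the rule that `resolved` is true when any brokered shot resolved. The real `defineStrategy` may differ on all of these.

```typescript
// Hypothetical, simplified shapes; the repository's real types may differ.
interface ShotResult {
  score: number; // assigned harness-side (surface.score inside the executor)
  resolved: boolean;
}

interface StrategyResult {
  score: number;
  resolved: boolean;
}

interface StrategyCtx {
  // The body's only route to a score is a brokered shot.
  shot(prompt: string): Promise<ShotResult>;
}

type StrategyBody = (ctx: StrategyCtx) => Promise<StrategyResult>;

// Stand-in for the executor: runs one shot and returns a harness-scored result.
type Executor = (prompt: string) => Promise<ShotResult>;

function defineStrategySketch(body: StrategyBody) {
  return async function act(execute: Executor): Promise<StrategyResult> {
    let bestScore = 0; // zero shots means a verified score of 0
    let anyResolved = false; // assumption: resolved if any brokered shot resolved

    const ctx: StrategyCtx = {
      async shot(prompt) {
        const result = await execute(prompt);
        bestScore = Math.max(bestScore, result.score);
        anyResolved = anyResolved || result.resolved;
        return result;
      },
    };

    const reported = await body(ctx);
    // Keep any other fields the body returned, but replace its claims
    // about score/resolved with what the harness actually observed.
    return { ...reported, score: bestScore, resolved: anyResolved };
  };
}
```

Under these assumptions, a body that returns `{ score: 1, resolved: true }` without calling `ctx.shot` yields `{ score: 0, resolved: false }`, while a body that brokers several shots yields the highest harness-assigned score among them.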
Also: StrategyCtx.surface narrowed to open/close only (no raw call()/score() to the body
— scores reach it solely through shot()'s verified channel). And the flywheel promotion
gate replaced raw `h1>h0` (a no-margin point comparison on m≈8 tasks ≈ coin-flip false
certification) with a paired-bootstrap CI on the per-task holdout lift that must EXCLUDE
zero, plus a rotating holdout slice (HOLDOUT_OFFSET) — reused frozen slices are an
unforced overfitting channel when tasks stream ~free.
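
As a rough illustration of the gate change, the sketch below computes a percentile paired-bootstrap confidence interval on the per-task holdout lift. The function name, the 2000 resamples, the 95% level, the seeded PRNG, and the choice to promote only when the lower bound is above zero are assumptions for this example, not the merged implementation.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface LiftCI {
  meanLift: number;
  lo: number;
  hi: number;
  promote: boolean;
}

// h0[i] and h1[i] are the baseline and candidate scores on the SAME holdout task i.
function pairedBootstrapLiftCI(
  h0: number[],
  h1: number[],
  resamples = 2000,
  alpha = 0.05,
  seed = 1,
): LiftCI {
  if (h0.length === 0 || h0.length !== h1.length) {
    throw new Error("need equal-length, non-empty paired score arrays");
  }
  const n = h0.length;
  const lifts = h1.map((s, i) => s - h0[i]);
  const meanLift = lifts.reduce((acc, x) => acc + x, 0) / n;

  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    // Resample tasks with replacement, keeping each (h0, h1) pair together.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += lifts[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  const lo = means[Math.floor((alpha / 2) * resamples)];
  const hi = means[Math.min(resamples - 1, Math.ceil((1 - alpha / 2) * resamples) - 1)];
  // Assumption: "excludes zero" is read one-sidedly, since an interval
  // entirely below zero is a regression rather than a promotion.
  return { meanLift, lo, hi, promote: lo > 0 };
}
```

With 8 binary-scored tasks where the candidate wins 4, loses 3, and ties 1, the raw comparison `h1>h0` passes, yet the interval from this sketch straddles zero and the candidate is not promoted.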
680 tests pass (built-in strategies score identically — they already verified); exploit
test confirms fabricators now score 0.
tangletools approved these changes Jun 10, 2026
tangletools (Contributor) left a comment
✅ Auto-approved PR — 2fc1df3f
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T01:44:21Z
drewstone added a commit that referenced this pull request Jun 10, 2026
…adversarial review (#218)

A 20-agent first-principles investigation (5 theory lenses × adversarial attack with literature + source-code access) asked whether this program leapfrogs SOTA. Verdict: NO new theorem does. 0 breakthroughs, 7 claims survived only cut to narrower forms, 8 killed (program-space-gradient law refuted by GEPA; category-theory functor constructively falsified; spend-ratchet/dose-κ/Pandora/bit-metered reduce to known work).

The doc records what survived honestly:
(S1) channel-factorization: critique carries zero check-bits, all pressure factors through the typed score surface;
(S2) the selection functional π as a first-class SIGNED term of the eval estimand (the genuinely-unclaimed piece; same data flips sign under keep-best vs final-state);
(S3) retention≠retrieval: store certified programs, not prose.

Plus the one sharp idea: short programs can't overfit a small holdout (description-length handle), an argument for program- over prompt-space on generalization grounds. The corrected E1–E5 slate; the deployable-check boundary stated.

The meta-finding: the program's real edge is measurement INTEGRITY, not a sharper formalism. The attack found a live self-reported-score exploit in our own code, now fixed (#217). A harness that adversarial review hardens is the scarce asset.
drewstone added a commit that referenced this pull request Jun 10, 2026
…ot-gun (#219)

Two fixes the trusted flywheel run + audit surfaced (all our own code):

1. Empty-messages foot-gun (the real cause of the authored strategy scoring 0/12): the shot executor treated `messages: []` as a CARRIED conversation, so an authored body passing an empty array started the worker with a BLANK prompt (no system, no task). Fixed at the executor chokepoint (covers every caller): empty-or-absent messages = a fresh conversation. The author contract now states it explicitly.
2. Breach 1 (unconfined authored import, previously prompt-only): assertAuthoredCodeSafe is a runtime static lint run before the dynamic import. It rejects foreign imports, require, eval, new Function, process/globalThis, fetch, node builtins; allows only the defineStrategy import. NOT a sandbox (semi-trusted authors); fully untrusted authors still need a container, documented. Verified: all five escape-hatch cases blocked, a legit strategy allowed.

Breach 2 (trusted self-report) was fixed in #217; Breach 3 (ShotResult.score in body control flow) is by design: bodies SHOULD branch on the verified score; the firewall is that they never see the verifier/expected values, which holds (StrategyCtx.surface is open/close only). 680 tests pass.
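
The empty-messages rule from fix 1 can be sketched in a few lines. The `ShotRequest` shape, the `systemPrompt`/`task` fields, and the name `resolveConversation` are hypothetical and only illustrate the stated contract (empty-or-absent messages means a fresh conversation), not the executor's actual code.

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical request shape, for illustration only.
interface ShotRequest {
  systemPrompt: string;
  task: string;
  messages?: Message[];
}

// Decide which conversation the worker starts with.
function resolveConversation(req: ShotRequest): Message[] {
  const carried = req.messages;
  // Absent OR empty is treated as "no carried conversation": start fresh
  // with the system prompt and the task instead of a blank prompt.
  if (carried === undefined || carried.length === 0) {
    return [
      { role: "system", content: req.systemPrompt },
      { role: "user", content: req.task },
    ];
  }
  // A non-empty array is a carried conversation and is passed through as-is.
  return carried;
}
```

Placing the check in one function that every caller goes through mirrors the "executor chokepoint" idea: authored bodies cannot reintroduce the blank-prompt case by passing `messages: []`.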
This was referenced Jun 10, 2026
drewstone added a commit that referenced this pull request Jun 11, 2026
…ion cannot poison the next generation (#261)

The deeper cost run crashed at gen2 authoring: an authored body returned a StrategyResult without progression (advisory, unvalidated since #217 made score/resolved harness-owned), the undefined rode through runBenchmark into the losses table, and compactLosses threw on .map, killing the run a generation AFTER the offending candidate ran. defineStrategy now normalizes progression/completions/shots on the deliverable (the source fix); compactLosses tolerates absence anyway (defense in depth). Test: a body returning only {score, resolved} yields a well-formed cell.
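
The two layers of that fix might look roughly like the sketch below. The field types (`progression` as an array of numbers, `completions` and `shots` as counters), the defaults, and both function names are assumptions made for illustration. The commit message only establishes that `progression` is something `.map` is called on.

```typescript
// What an authored body may hand back: advisory fields can be missing.
interface RawStrategyResult {
  score: number;
  resolved: boolean;
  progression?: number[];
  completions?: number;
  shots?: number;
}

// What downstream code (benchmark cells, loss tables) can rely on.
interface NormalizedCell {
  score: number;
  resolved: boolean;
  progression: number[];
  completions: number;
  shots: number;
}

// Source fix: normalize once, where the deliverable is produced.
function normalizeDeliverable(raw: RawStrategyResult): NormalizedCell {
  return {
    score: raw.score,
    resolved: raw.resolved,
    progression: Array.isArray(raw.progression) ? raw.progression : [],
    completions: typeof raw.completions === "number" ? raw.completions : 0,
    shots: typeof raw.shots === "number" ? raw.shots : 0,
  };
}

// Second layer: the consumer also tolerates a missing progression, so one
// malformed cell cannot crash a later generation.
function compactLossesSketch(cells: Array<{ progression?: number[] }>): string[] {
  return cells.map((cell) =>
    (cell.progression ?? []).map((p) => p.toFixed(2)).join(","),
  );
}
```

Normalizing at the producer keeps every consumer simple, and the tolerant consumer covers cells that reach it by some other path.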
The adversarial theory review found a live exploit in our own published code, and it's the highest-impact fix on the board.
The hole
`defineStrategy` bodies self-reported `score`; `runBenchmark` ranked on the self-report, while built-in drivers compute score from `surface.score()`. An authored/adversarial strategy could `return {score:1}` doing nothing and win the train set and the frozen holdout, falsifying the "structurally safe by construction" claim and invalidating any authored-strategy result.
The fix
- `defineStrategy.act` tracks the harness-verified best score across the shots it brokered (each `ShotResult` is scored by `surface.score` in the executor) and overrides the body's self-reported `score`/`resolved`. A body can only report what its real shots achieved. Proven: a `{score:1}`-with-zero-shots strategy now scores 0.
- `StrategyCtx.surface` narrowed to `open`/`close` (no raw `call`/`score` to the body; scores reach it only through `shot()`'s verified channel).
- Flywheel promotion gate: raw `h1>h0` (coin-flip false certification at m≈8) → paired-bootstrap CI on the per-task holdout lift, which must exclude 0, plus a rotating holdout slice (`HOLDOUT_OFFSET`).
Test
680 tests pass (built-in strategies unchanged; they already verified); exploit test confirms fabricators score 0.
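
One way to picture the rotating holdout slice is shown below. This is a sketch under a loud assumption: that `HOLDOUT_OFFSET` is an integer choosing where a fixed-size holdout window starts in the task list, wrapping around at the end. The function name and signature are hypothetical, and the repository may rotate the slice differently.

```typescript
// Split tasks into train/holdout, with the holdout window starting at `offset`.
function splitWithRotatingHoldout<T>(
  tasks: readonly T[],
  holdoutSize: number,
  offset: number,
): { train: T[]; holdout: T[] } {
  const n = tasks.length;
  if (holdoutSize <= 0 || holdoutSize >= n) {
    throw new Error("holdoutSize must be between 1 and tasks.length - 1");
  }
  // Normalize the offset so negative or oversized values still wrap correctly.
  const start = ((offset % n) + n) % n;
  const holdoutIdx = new Set<number>();
  for (let k = 0; k < holdoutSize; k++) holdoutIdx.add((start + k) % n);

  const train: T[] = [];
  const holdout: T[] = [];
  tasks.forEach((task, i) => (holdoutIdx.has(i) ? holdout : train).push(task));
  return { train, holdout };
}
```

Changing the offset between runs means a candidate cannot keep being certified against the same frozen tasks, which is the overfitting channel the change set describes.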