Skip to content

fix(runtime): harness-verified scoring (close self-report exploit) + statistical promotion gate#217

Merged
drewstone merged 1 commit into
mainfrom
fix/harness-integrity
Jun 10, 2026
Merged

fix(runtime): harness-verified scoring (close self-report exploit) + statistical promotion gate#217
drewstone merged 1 commit into
mainfrom
fix/harness-integrity

Conversation

@drewstone

Copy link
Copy Markdown
Contributor

The adversarial theory review found a live exploit in our own published code, and it's the highest-impact fix on the board.

The hole

defineStrategy bodies self-reported score; runBenchmark ranked on the self-report, while built-in drivers compute score from surface.score(). An authored/adversarial strategy could return {score:1} doing nothing and win the train set and the frozen holdout — falsifying the "structurally safe by construction" claim and invalidating any authored-strategy result.

The fix

  • defineStrategy.act tracks the harness-verified best score across the shots it brokered (each ShotResult is scored by surface.score in the executor) and overrides the body's self-reported score/resolved. A body can only report what its real shots achieved. Proven: a {score:1}-with-zero-shots strategy now scores 0.
  • StrategyCtx.surface narrowed to open/close (no raw call/score to the body — scores reach it only through shot()'s verified channel).
  • Flywheel promotion gate: raw h1>h0 (coin-flip false certification at m≈8) → paired-bootstrap CI on the per-task holdout lift, must exclude 0 + rotating holdout slice (HOLDOUT_OFFSET).

Test

680 tests pass (built-in strategies unchanged — they already verified); exploit test confirms fabricators score 0.

…e hole + statistical promotion gate

The adversarial theory review found a live exploit in our own code: defineStrategy
bodies SELF-REPORTED their `score`, and runBenchmark ranked on it — while the built-in
drivers compute score from surface.score(). So an authored (or adversarial) strategy
could `return {score:1}` having done NOTHING and win both the train set AND the frozen
holdout. The "structurally safe by construction" claim was FALSE for the authored path,
invalidating any authored-strategy result.

Fix (the load-bearing one): defineStrategy's act now tracks the harness-VERIFIED best
score across the shots it actually brokered (each ShotResult is scored by surface.score
inside the executor) and OVERRIDES the body's self-reported score/resolved in the
deliverable. A body can only report what its real shots achieved. Proven: a strategy
returning {score:1} with zero shots now scores 0.

Also: StrategyCtx.surface narrowed to open/close only (no raw call()/score() to the body
— scores reach it solely through shot()'s verified channel). And the flywheel promotion
gate replaced raw `h1>h0` (a no-margin point comparison on m≈8 tasks ≈ coin-flip false
certification) with a paired-bootstrap CI on the per-task holdout lift that must EXCLUDE
zero, plus a rotating holdout slice (HOLDOUT_OFFSET) — reused frozen slices are an
unforced overfitting channel when tasks stream ~free.

680 tests pass (built-in strategies score identically — they already verified); exploit
test confirms fabricators now score 0.
@drewstone drewstone merged commit be674fe into main Jun 10, 2026
@drewstone drewstone deleted the fix/harness-integrity branch June 10, 2026 01:44

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 2fc1df3f

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T01:44:21Z

drewstone added a commit that referenced this pull request Jun 10, 2026
…adversarial review (#218)

A 20-agent first-principles investigation (5 theory lenses × adversarial attack with
literature + source-code access) asked whether this program leapfrogs SOTA. Verdict:
NO new theorem does — 0 breakthroughs, 7 claims survived only cut to narrower forms, 8
killed (program-space-gradient law refuted by GEPA; category-theory functor
constructively falsified; spend-ratchet/dose-κ/Pandora/bit-metered reduce to known work).

The doc records what survived honestly: (S1) channel-factorization — critique carries
zero check-bits, all pressure factors through the typed score surface; (S2) the selection
functional π as a first-class SIGNED term of the eval estimand (the genuinely-unclaimed
piece; same data flips sign under keep-best vs final-state); (S3) retention≠retrieval —
store certified programs, not prose. Plus the one sharp idea: short programs can't overfit
a small holdout (description-length handle), an argument for program- over prompt-space on
generalization grounds. The corrected E1–E5 slate; the deployable-check boundary stated.

The meta-finding: the program's real edge is measurement INTEGRITY, not a sharper
formalism — the attack found a live self-reported-score exploit in our own code, now
fixed (#217). A harness that adversarial review hardens is the scarce asset.
drewstone added a commit that referenced this pull request Jun 10, 2026
…ot-gun (#219)

Two fixes the trusted flywheel run + audit surfaced (all our own code):

1. Empty-messages foot-gun (the real cause of the authored strategy scoring 0/12):
   the shot executor treated `messages: []` as a CARRIED conversation, so an authored
   body passing an empty array started the worker with a BLANK prompt (no system, no
   task). Fixed at the executor chokepoint (covers every caller): empty-or-absent
   messages = a fresh conversation. The author contract now states it explicitly.

2. Breach 1 (unconfined authored import — was prompt-only): assertAuthoredCodeSafe is a
   runtime static lint run before the dynamic import — rejects foreign imports, require,
   eval, new Function, process/globalThis, fetch, node builtins; allows only the
   defineStrategy import. NOT a sandbox (semi-trusted authors); fully untrusted authors
   still need a container, documented. Verified: all five escape-hatch cases blocked, a
   legit strategy allowed.

Breach 2 (trusted self-report) was fixed in #217; Breach 3 (ShotResult.score in body
control flow) is by design — bodies SHOULD branch on the verified score; the firewall is
that they never see the verifier/expected values, which holds (StrategyCtx.surface is
open/close only). 680 tests pass.
drewstone added a commit that referenced this pull request Jun 11, 2026
…ion cannot poison the next generation (#261)

The deeper cost run crashed at gen2 authoring: an authored body returned a
StrategyResult without progression (advisory, unvalidated since #217 made
score/resolved harness-owned), the undefined rode through runBenchmark into
the losses table, and compactLosses threw on .map — killing the run a
generation AFTER the offending candidate ran. defineStrategy now normalizes
progression/completions/shots on the deliverable (the source fix);
compactLosses tolerates absence anyway (depth). Test: a body returning only
{score, resolved} yields a well-formed cell.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants