feat(selector): fold small constants into i32.and immediates + encoder bound (VCR-RA-001)#250

Merged
avrabe merged 2 commits into main from feat/vcr-and-immediate-folding on Jun 4, 2026
avrabe commented Jun 4, 2026

The first delta-emitting codegen transform on the allocator track (VCR-RA-001, epic #242). Consolidates the encoder-bound finding (was #249) with the folding that uses it.

The waste (measured, #248)

`i32.const C; i32.and` was lowered to `movw rN,#C; and rD,rA,rN` even when C is small enough to be an AND immediate. This is the redundant materialization gale measured on flat_flight.

The fix

When the operand before i32.and is i32.const C with C ∈ 0..=0xFF and its movw is cleanly at the instruction tail (not spilled), fold to and rD, rA, #C and drop the materialization (foldable_bitwise_imm + drop_prev_const_materialization, mirroring the const-divisor pattern).
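
A minimal sketch of the fold guard described above, under a simplified instruction model. The `Instr` and `Operand2` types and the `lower_i32_and` function are hypothetical stand-ins for illustration, not the selector's real `foldable_bitwise_imm` / `drop_prev_const_materialization` code. It also assumes the register holding the constant has no other uses, so dropping the `movw` is safe.

```rust
/// Hypothetical second operand: a register or an immediate.
#[derive(Debug, Clone, PartialEq)]
enum Operand2 {
    Reg(u8),
    Imm(u32),
}

/// Hypothetical, heavily reduced instruction set for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Instr {
    Movw { rd: u8, imm: u32 },
    And { rd: u8, rn: u8, op2: Operand2 },
}

/// Largest constant the fold may hand to the encoder (the byte range).
const MAX_FOLDABLE_IMM: u32 = 0xFF;

/// Lower `i32.and` where `ra` holds the left operand and `rn` the right one.
/// If the instruction tail is the `movw` that materialized a small constant
/// into `rn`, drop it and use the constant as an immediate instead.
fn lower_i32_and(out: &mut Vec<Instr>, rd: u8, ra: u8, rn: u8) {
    // The guard: the movw must be the very last instruction (not spilled,
    // not mid-sequence), must target `rn`, and must be in the safe range.
    let foldable = match out.last() {
        Some(Instr::Movw { rd: r, imm }) if *r == rn && *imm <= MAX_FOLDABLE_IMM => Some(*imm),
        _ => None,
    };
    let op2 = match foldable {
        Some(c) => {
            out.pop(); // drop the now-redundant materialization
            Operand2::Imm(c)
        }
        None => Operand2::Reg(rn), // out of range or not at the tail: keep the register form
    };
    out.push(Instr::And { rd, rn: ra, op2 });
}
```

With `movw r3,#0x7e` at the tail the sketch emits a single `and r2,r0,#0x7e`; with `movw r3,#0x140` it keeps both instructions and the register operand.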

Encoder bound (was #249)

Bounded to 0..=0xFF because the encoder's AND-immediate path isn't yet ThumbExpandImm-complete (and r2,r0,#0x7e → 00 f0 7e 02 is correct; ≥0x100 would mis-encode). 0..=0xFF covers gale's int8 clamps. A pinning test documents the safe range; folding is guarded to it, so the encoder never sees an un-encodable immediate.
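
To illustrate why the bound sits at 0xFF, here is a sketch that models the packing as the commit message describes it (low 12 bits placed straight into `i:imm3:imm8`) next to the architectural ThumbExpandImm decoding of that field. Both function names are hypothetical and the packing function is a model of the described behavior, not the project's encoder.

```rust
/// Model of the naive path: pack the low 12 bits of `imm` straight into the
/// i:imm3:imm8 field of Thumb-2 `AND (immediate)` T1 with S = 0.
/// Returns the two halfwords as little-endian bytes.
fn pack_and_imm_raw(rd: u8, rn: u8, imm: u32) -> [u8; 4] {
    let imm12 = (imm & 0xFFF) as u16;
    let i = (imm12 >> 11) & 1;
    let imm3 = (imm12 >> 8) & 0x7;
    let imm8 = imm12 & 0xFF;
    let hw1: u16 = 0xF000 | (i << 10) | u16::from(rn);
    let hw2: u16 = (imm3 << 12) | (u16::from(rd) << 8) | imm8;
    let [a, b] = hw1.to_le_bytes();
    let [c, d] = hw2.to_le_bytes();
    [a, b, c, d]
}

/// ThumbExpandImm: the 32-bit value the CPU reconstructs from i:imm3:imm8.
/// (The replication forms with imm8 == 0 are UNPREDICTABLE in the
/// architecture; this sketch simply computes them.)
fn thumb_expand_imm(imm12: u16) -> u32 {
    let imm12 = imm12 & 0xFFF;
    let imm8 = u32::from(imm12 & 0xFF);
    if imm12 >> 10 == 0 {
        match (imm12 >> 8) & 0x3 {
            0 => imm8,                                              // 0x000000XY
            1 => (imm8 << 16) | imm8,                               // 0x00XY00XY
            2 => (imm8 << 24) | (imm8 << 8),                        // 0xXY00XY00
            _ => (imm8 << 24) | (imm8 << 16) | (imm8 << 8) | imm8,  // 0xXYXYXYXY
        }
    } else {
        // 8-bit value with its top bit forced to 1, rotated right by imm12[11:7].
        let unrotated = 0x80 | u32::from(imm12 & 0x7F);
        unrotated.rotate_right(u32::from(imm12 >> 7))
    }
}
```

For `#0x7e` the raw field decodes back to 0x7e, so the bytes `00 f0 7e 02` mean what was intended. For `#0x140` the raw field selects the `0x00XY00XY` replication form and decodes to 0x00400040, which is the silent mis-encoding the fold guard avoids.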

Measured delta

(p & 0x7e) + (p & 0x7e): 8 → 6 instructions — both movw #126 eliminated, each AND uses the immediate. Better than const-CSE (no materialization at all).

Full gate (codegen change)

  • 282 lib tests + 20 integration suites pass.
  • Three frozen differentials stay result-identical: control_step 0x00210A55, flight_seam 0x07FDF307, div_const 338/338.
  • Tests: i32_and_folds_small_const_into_immediate (folded shape), i32_and_does_not_fold_out_of_range_const (0x140 stays a register operand — the encoder safety bound), and_immediate_encodes_correctly_in_byte_range... (encoder).
  • CI fuzz (encoder_no_panic, wasm_ops_lower_or_error) adds totality.

Supersedes #248 (the evidence it pinned is now folded away) and subsumes #249 (the encoder-bound test is included here).

Part of #242.

🤖 Generated with Claude Code

avrabe and others added 2 commits June 4, 2026 21:50
…old bound (VCR-RA-001)

Investigating immediate folding (the biggest win the #248 evidence pointed to)
surfaced an encoder limitation: the `And { Operand2::Imm }` path packs the low
12 bits straight into the `i:imm3:imm8` field WITHOUT ThumbExpandImm (the
modified-immediate expansion). For imm <= 0xFF (gale's int8 clamps #0x7e/#0x7f)
that is correct — `and r2,r0,#0x7e` encodes to the canonical `00 f0 7e 02`. For
imm >= 0x100 the field needs a true rotation/replication pattern that is not
implemented, so it would silently encode a different value.

This path is currently DEAD (the selector never emits And-imm), so no live bug —
but it sets the precondition for the immediate-folding transform: **fold only
imm <= 0xFF** until the encoder is hardened to ThumbExpandImm / Ok-or-Err (the
"encoder must be Ok-or-Err, never silently wrong" principle, #180/#185). That
bound covers the measured flat_flight waste.

Pins the safe-range encoding as a regression guard; no codegen change.

Part of #242.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…001)

The FIRST delta-emitting codegen transform on the allocator track. Evidence
(#248) showed the dominant flat_flight-shape waste is redundant const
materialization: `i32.const C; i32.and` lowered to `movw rN,#C; and rD,rA,rN`
even when C is a small constant the AND instruction can take as an immediate.

Fix: when the operand pushed immediately before `i32.and` is `i32.const C` with
C in 0..=0xFF AND its `movw` is cleanly at the instruction tail (not spilled),
fold to `and rD, rA, #C` and drop the materialization (foldable_bitwise_imm +
drop_prev_const_materialization, mirroring the const-divisor pattern).

Bounded to 0..=0xFF: the encoder's AND-immediate path is not yet ThumbExpandImm-
complete (#249) — larger modified immediates await an encoder hardening to
Ok-or-Err. 0..=0xFF covers gale's int8 clamps.

Measured delta: the (p & 0x7e) + (p & 0x7e) pattern drops 8 → 6 instructions
(both `movw #126` eliminated; each AND uses the immediate). Better than const-CSE
— no materialization at all.

GATE (full, codegen change): 282 lib tests + 20 integration suites pass; the
three frozen differential fixtures stay RESULT-identical (control_step
0x00210A55, flight_seam 0x07FDF307, div_const 338/338); tests
i32_and_folds_small_const_into_immediate (folded shape) +
i32_and_does_not_fold_out_of_range_const (0x140 stays a register operand — the
encoder safety bound). CI fuzz adds totality.

Part of #242. Supersedes the #248 evidence (the redundancy it pinned is now folded away).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

avrabe commented Jun 4, 2026

Built PR #250 and measured it on the G474RE — the first codegen-application delta, and it's correct on silicon:

| bench | before | after #250 | Δ | selfcheck |
|---|---|---|---|---|
| controller_step | 169 | 168 | −1 cyc | 0x05e33e81 |
| flat_flight | 262 | 261 | −1 cyc | 0x07fdf307 |

Object-level: flat_flight 180→179 instrs (movw 33→32, .text 588→584 B); controller_step 120→119 (movw 27→26). So the fold fired and the result is bit-correct — the transform + my reflash loop are validated end-to-end. 🎉

One yield note: it folded exactly 1 AND-immediate per function, though both have ~3–4 `& 0xFF` byte-extractions in the packing. The "movw cleanly at tail, not spilled" guard is gating the rest (the other ANDs' movws are spilled or mid-sequence). So the bigger AND-fold yield is coupled to the spill work (VCR-RA-001): once spill-under-pressure keeps those consts resident, more sites qualify. And per the 262→103 gap decomposition, the dominant levers are still ahead: const-CSE on the #0x7e/#0x7f clamp bounds (×6 each), mul+add → mla fusion, and tighter clamp lowering (18→6 IT-blocks).

Net: a clean, correct first step. Both microbenches stay frozen-and-staged; I'll post the delta for each subsequent transform as it lands.

@avrabe avrabe merged commit d83abb0 into main Jun 4, 2026
12 checks passed
@avrabe avrabe deleted the feat/vcr-and-immediate-folding branch June 4, 2026 20:37
avrabe added a commit that referenced this pull request Jun 4, 2026
…nt NOP (VCR-RA-001) (#251)

Verifying the prerequisite for extending immediate folding to i32.or/xor
surfaced a latent silent-wrong path: the Thumb-2 `Orr`/`Eor` encoders handled
only `Operand2::Reg`; `Operand2::Imm` fell through to `0xBF00` (NOP). Folding
an or/xor immediate would have silently turned the operation into a no-op —
a miscompile.

Fix (Ok-or-Err, #180/#185): encode the ORR.W / EOR.W T1 immediate for the
zero-extended byte range (imm <= 0xFF) — `orr r2,r0,#0x7e → 40 f0 7e 02`,
`eor → 80 f0 7e 02` — and return an error for larger modified immediates
(ThumbExpandImm not yet implemented) rather than emit a wrong/NOP encoding.

This path was DEAD (the selector never emits Orr/Eor-imm), so no existing
codegen changes — the frozen fixtures are unaffected. It removes the
silent-wrong path and is the precondition for safe i32.or/xor immediate
folding (the #250 i32.and pattern, extended).

Test orr_eor_immediate_encode_in_byte_range_else_error pins the byte-range
bytes + the out-of-range error.
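
A minimal sketch of the Ok-or-Err shape described in this commit, assuming a hypothetical `BitOp` enum and `encode_bitop_imm` function (the project's real encoder types are not shown here). As a simplification it accepts only r0..=r12, which sidesteps the SP/PC special cases of these encodings.

```rust
/// Hypothetical selector for the two operations covered by this sketch.
#[derive(Debug, Clone, Copy)]
enum BitOp {
    Orr,
    Eor,
}

/// Encode `orr/eor rd, rn, #imm` (Thumb-2 T1 immediate, S = 0) for the
/// zero-extended byte range only. Anything larger returns Err instead of
/// a wrong or NOP encoding, since ThumbExpandImm is not modeled here.
fn encode_bitop_imm(op: BitOp, rd: u8, rn: u8, imm: u32) -> Result<[u8; 4], String> {
    if rd > 12 || rn > 12 {
        return Err(format!("sketch only handles r0..=r12 (rd={rd}, rn={rn})"));
    }
    if imm > 0xFF {
        return Err(format!("immediate {imm:#x} needs ThumbExpandImm, not implemented"));
    }
    // First halfword: 11110 i 0 <opcode> S Rn, with i = 0 and S = 0.
    let base: u16 = match op {
        BitOp::Orr => 0xF040,
        BitOp::Eor => 0xF080,
    };
    let hw1 = base | u16::from(rn);
    // Second halfword: 0 imm3 Rd imm8, with imm3 = 0 in the byte range.
    let hw2 = (u16::from(rd) << 8) | imm as u16;
    let [a, b] = hw1.to_le_bytes();
    let [c, d] = hw2.to_le_bytes();
    Ok([a, b, c, d])
}
```

Under these assumptions `encode_bitop_imm(BitOp::Orr, 2, 0, 0x7e)` yields `40 f0 7e 02` and the `Eor` variant yields `80 f0 7e 02`, matching the bytes quoted in the commit message, while `0x100` returns an error.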

Part of #242.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

codecov Bot commented Jun 4, 2026

Codecov Report

❌ Patch coverage is 95.09804% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/synth-synthesis/src/instruction_selector.rs | 94.50% | 5 Missing ⚠️ |


avrabe added a commit that referenced this pull request Jun 4, 2026
… (VCR-RA-001) (#252)

Extends #250's i32.and immediate folding to i32.or and i32.xor, now that the
encoder's ORR/EOR immediate paths are correct + Ok-or-Err (#251). Same shape:
when the operand before the op is `i32.const C` (C in 0..=0xFF) with its `movw`
cleanly at the instruction tail, fold to `orr/eor rd, a, #C` and drop the
materialization (foldable_bitwise_imm + drop_prev_const_materialization).

Bounded to 0..=0xFF by the encoder (ORR/EOR-imm > 0xFF returns Err until
ThumbExpandImm lands, #251) — the fold guard keeps it in range.

GATE (codegen change): clippy clean; 283 lib tests; the three frozen
differential fixtures stay RESULT-identical (control_step 0x00210A55, flight_seam
0x07FDF307, div_const 338/338); test i32_or_xor_fold_small_const_into_immediate
(both ops fold, no movw survives). CI fuzz adds totality.

Part of #242.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
avrabe added a commit that referenced this pull request Jun 4, 2026
…-RA-001) (#254)

Completes the arithmetic+bitwise immediate-folding family (and/or/xor #250/#252,
now add/sub). When the operand before i32.add/i32.sub is `i32.const C` with C in
0..=0xFFF and its `movw` cleanly at the tail, fold to `add/sub rd, a, #C` and
drop the materialization.

Range is the full 0..=0xFFF (4095), wider than the bitwise 0..=0xFF, because the
ADD/SUB immediate encoder now uses ADDW/SUBW (T4, plain imm12) for >0xFF (#253) —
verified correct before folding into it.
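
For reference, a sketch of the plain-imm12 T4 form that makes the wider range possible. The `AddSub` enum and `encode_addw_subw` function are hypothetical names, and unlike the encoder described above (which uses T4 only for values above 0xFF) this sketch uses T4 for the whole 0..=0xFFF range to keep it short. PC is rejected because those encodings mean something else.

```rust
/// Hypothetical selector for the two operations covered by this sketch.
#[derive(Debug, Clone, Copy)]
enum AddSub {
    Add,
    Sub,
}

/// Encode `addw/subw rd, rn, #imm` (Thumb-2 T4). The 12-bit immediate is a
/// plain zero-extended value split across i:imm3:imm8, with no
/// ThumbExpandImm rotation or replication involved.
fn encode_addw_subw(op: AddSub, rd: u8, rn: u8, imm: u32) -> Result<[u8; 4], String> {
    if rd >= 15 || rn >= 15 {
        return Err(format!("PC is not handled by this sketch (rd={rd}, rn={rn})"));
    }
    if imm > 0xFFF {
        return Err(format!("immediate {imm:#x} does not fit in imm12"));
    }
    let imm = imm as u16;
    let i = (imm >> 11) & 1;
    let imm3 = (imm >> 8) & 0x7;
    let imm8 = imm & 0xFF;
    // First halfword: 11110 i 1 0000 0 Rn (ADDW) or 11110 i 1 0101 0 Rn (SUBW).
    let base: u16 = match op {
        AddSub::Add => 0xF200,
        AddSub::Sub => 0xF2A0,
    };
    let hw1 = base | (i << 10) | u16::from(rn);
    // Second halfword: 0 imm3 Rd imm8.
    let hw2 = (imm3 << 12) | (u16::from(rd) << 8) | imm8;
    let [a, b] = hw1.to_le_bytes();
    let [c, d] = hw2.to_le_bytes();
    Ok([a, b, c, d])
}
```

Because the field is a plain integer, every value up to 4095 round-trips without the modified-immediate rules, which is why this family can fold a wider range than the bitwise ops.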

This one actively fires on real fixtures: control_step's frame `sub #16` and its
const adds now fold (the differential stays result-identical, confirming
correctness across a codegen change). Updated test_237_stack_pointer_global_is_
register_promoted to accept the frame-size 16 in its folded `sub #16` form (still
a plain scalar immediate, still not __synth_wasm_data-relocated — the property it
checks).

GATE: clippy clean; 284 lib tests; three frozen differentials RESULT-identical
(control_step 0x00210A55, flight_seam 0x07FDF307, div_const 338/338); test
i32_add_sub_fold_const_into_immediate (byte + >0xFF ADDW-path values). CI fuzz
adds totality.

Part of #242.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
avrabe added a commit that referenced this pull request Jun 5, 2026
…rectness fixes (#260)

Promote the accumulated v0.11.30 work into the CHANGELOG before tagging:
native-pointer ABI (#237) + the VCR-* constant-immediate folding (#250/#252/#254)
+ analysis foundation (#243/#245) + three latent-miscompile encoder fixes
(#251 ORR/EOR NOP, #253 ADD/SUB large-frame, #255 CMP/ADDS/SUBS ThumbExpandImm).
Adds a falsification statement covering the encoder correctness class.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>