feat(selector): fold small constants into i32.and immediates + encoder bound (VCR-RA-001) #250
…old bound (VCR-RA-001)

Investigating immediate folding (the biggest win the #248 evidence pointed to) surfaced an encoder limitation: the `And { Operand2::Imm }` path packs the low 12 bits straight into the `i:imm3:imm8` field WITHOUT ThumbExpandImm (the modified-immediate expansion). For imm <= 0xFF (gale's int8 clamps #0x7e/#0x7f) that is correct: `and r2,r0,#0x7e` encodes to the canonical `00 f0 7e 02`. For imm >= 0x100 the field needs a true rotation/replication pattern that is not implemented, so it would silently encode a different value.

This path is currently DEAD (the selector never emits And-imm), so there is no live bug. It does set the precondition for the immediate-folding transform: **fold only imm <= 0xFF** until the encoder is hardened to ThumbExpandImm / Ok-or-Err (the "encoder must be Ok-or-Err, never silently wrong" principle, #180/#185). That bound covers the measured flat_flight waste.

Pins the safe-range encoding as a regression guard; no codegen change. Part of #242.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
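For illustration, the byte-range bound described above can be sketched as a standalone encoder in the Ok-or-Err style. This is not the project's encoder: `encode_and_imm` and `UnencodableImm` are hypothetical names, and the sketch assumes the Thumb-2 `AND.W` (immediate) T1 layout with `S = 0`. For a zero-extended byte, ThumbExpandImm is the identity (`i = 0`, `imm3 = 0`), which is why `imm8` can be placed directly.

```rust
/// Hypothetical error type: the immediate is outside what this sketch encodes.
#[derive(Debug, PartialEq)]
pub struct UnencodableImm(pub u32);

/// Encode `and rd, rn, #imm` (Thumb-2 AND.W immediate, T1) for imm <= 0xFF.
/// Larger values need a ThumbExpandImm rotation/replication pattern, which
/// this sketch refuses with an error instead of emitting a wrong encoding.
pub fn encode_and_imm(rd: u8, rn: u8, imm: u32) -> Result<[u8; 4], UnencodableImm> {
    if imm > 0xFF {
        return Err(UnencodableImm(imm));
    }
    // First halfword: 11110 i 0 0000 S Rn, with i = 0 and S = 0.
    let hw1: u16 = 0xF000 | u16::from(rn & 0xF);
    // Second halfword: 0 imm3 Rd imm8, with imm3 = 0.
    let hw2: u16 = (u16::from(rd & 0xF) << 8) | imm as u16;
    // Thumb-2 stores each halfword little-endian, first halfword first.
    let [a, b] = hw1.to_le_bytes();
    let [c, d] = hw2.to_le_bytes();
    Ok([a, b, c, d])
}
```

With these assumptions, `encode_and_imm(2, 0, 0x7e)` yields the `00 f0 7e 02` bytes quoted in the commit message, and `0x140` is rejected rather than mis-encoded.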
…001)

The FIRST delta-emitting codegen transform on the allocator track. Evidence (#248) showed the dominant flat_flight-shape waste is redundant const materialization: `i32.const C; i32.and` lowered to `movw rN,#C; and rD,rA,rN` even when C is a small constant the AND instruction can take as an immediate.

Fix: when the operand pushed immediately before `i32.and` is `i32.const C` with C in 0..=0xFF AND its `movw` is cleanly at the instruction tail (not spilled), fold to `and rD, rA, #C` and drop the materialization (foldable_bitwise_imm + drop_prev_const_materialization, mirroring the const-divisor pattern).

Bounded to 0..=0xFF: the encoder's AND-immediate path is not yet ThumbExpandImm-complete (#249), so larger modified immediates await an encoder hardening to Ok-or-Err. 0..=0xFF covers gale's int8 clamps.

Measured delta: the (p & 0x7e) + (p & 0x7e) pattern drops 8 → 6 instructions (both `movw #126` eliminated; each AND uses the immediate). Better than const-CSE: no materialization at all.

GATE (full, codegen change): 282 lib tests + 20 integration suites pass; the three frozen differential fixtures stay RESULT-identical (control_step 0x00210A55, flight_seam 0x07FDF307, div_const 338/338); tests i32_and_folds_small_const_into_immediate (folded shape) + i32_and_does_not_fold_out_of_range_const (0x140 stays a register operand, the encoder safety bound). CI fuzz adds totality.

Part of #242. Supersedes the #248 evidence (the redundancy it pinned is now folded away).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
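The fold described in this commit can be sketched on a toy instruction list. Everything here is an assumption made for illustration: the `Inst` enum, the `emit_and` entry point, and the body of `foldable_bitwise_imm` are hypothetical stand-ins, and the real selector also knows from the wasm operand stack that the constant was pushed immediately before `i32.and`, which this sketch approximates by checking the tail instruction only.

```rust
/// Toy instruction set (hypothetical, not the project's IR).
#[derive(Debug, Clone, PartialEq)]
pub enum Inst {
    Movw { rd: u8, imm: u32 },
    AndReg { rd: u8, rn: u8, rm: u8 },
    AndImm { rd: u8, rn: u8, imm: u32 },
}

/// Stand-in for `foldable_bitwise_imm`: returns the constant when the tail
/// instruction is the `movw` that materialized `rm` and the value fits the
/// encoder-safe byte range 0..=0xFF.
fn foldable_bitwise_imm(code: &[Inst], rm: u8) -> Option<u32> {
    match code.last() {
        Some(Inst::Movw { rd, imm }) if *rd == rm && *imm <= 0xFF => Some(*imm),
        _ => None,
    }
}

/// Lower `i32.and` with operands in `rn` and `rm`, folding when it is safe.
pub fn emit_and(code: &mut Vec<Inst>, rd: u8, rn: u8, rm: u8) {
    if let Some(imm) = foldable_bitwise_imm(code, rm) {
        // Stand-in for `drop_prev_const_materialization`: remove the `movw`.
        code.pop();
        code.push(Inst::AndImm { rd, rn, imm });
    } else {
        // Out of range or not at the tail: keep the register operand.
        code.push(Inst::AndReg { rd, rn, rm });
    }
}
```

In this sketch a `movw r3,#0x7e` followed by the AND collapses to a single `AndImm`, while `#0x140` is left as `movw` plus `AndReg`, matching the two behaviours the commit's tests pin.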
Built PR #250 and measured it on the G474RE — the first codegen-application delta, and it's correct on silicon:
Object-level:
One yield note: it folded exactly 1 AND-immediate per function, though both have ~3–4
Net: a clean, correct first step. Both microbenches stay frozen-and-staged; I'll post the delta for each subsequent transform as it lands.
…nt NOP (VCR-RA-001) (#251)

Verifying the prerequisite for extending immediate folding to i32.or/xor surfaced a latent silent-wrong path: the Thumb-2 `Orr`/`Eor` encoders handled only `Operand2::Reg`; `Operand2::Imm` fell through to `0xBF00` (NOP). Folding an or/xor immediate would have silently turned the operation into a no-op, which is a miscompile.

Fix (Ok-or-Err, #180/#185): encode the ORR.W / EOR.W T1 immediate for the zero-extended byte range (imm <= 0xFF), with `orr r2,r0,#0x7e → 40 f0 7e 02` and `eor → 80 f0 7e 02`, and return an error for larger modified immediates (ThumbExpandImm not yet implemented) rather than emit a wrong/NOP encoding.

This path was DEAD (the selector never emits Orr/Eor-imm), so no existing codegen changes and the frozen fixtures are unaffected. It removes the silent-wrong path and is the precondition for safe i32.or/xor immediate folding (the #250 i32.and pattern, extended). Test orr_eor_immediate_encode_in_byte_range_else_error pins the byte-range bytes + the out-of-range error.

Part of #242.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
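A minimal sketch of what "Ok-or-Err" means for these two opcodes, assuming the Thumb-2 data-processing (modified immediate) T1 layout with `S = 0`. The names `BitOp` and `encode_bitop_imm` are hypothetical and the error type is a plain `String` for brevity; the project's actual types are not shown in this thread.

```rust
/// Which bitwise operation to encode (hypothetical helper enum).
#[derive(Clone, Copy)]
pub enum BitOp {
    Orr,
    Eor,
}

/// Encode `orr/eor rd, rn, #imm` for the zero-extended byte range only.
/// Anything above 0xFF returns an error instead of a wrong or NOP encoding.
pub fn encode_bitop_imm(op: BitOp, rd: u8, rn: u8, imm: u32) -> Result<[u8; 4], String> {
    if imm > 0xFF {
        return Err(format!("immediate {imm:#x} needs ThumbExpandImm, not handled in this sketch"));
    }
    // 4-bit opcode field in the first halfword: ORR = 0010, EOR = 0100.
    let opcode: u16 = match op {
        BitOp::Orr => 0b0010,
        BitOp::Eor => 0b0100,
    };
    // First halfword: 11110 i 0 op S Rn, with i = 0 and S = 0.
    let hw1: u16 = 0xF000 | (opcode << 5) | u16::from(rn & 0xF);
    // Second halfword: 0 imm3 Rd imm8, with imm3 = 0.
    let hw2: u16 = (u16::from(rd & 0xF) << 8) | imm as u16;
    let [a, b] = hw1.to_le_bytes();
    let [c, d] = hw2.to_le_bytes();
    Ok([a, b, c, d])
}
```

The design point is that every input maps either to a verified encoding or to an explicit error, so a later fold that oversteps the range fails loudly at encode time.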
… (VCR-RA-001) (#252)

Extends #250's i32.and immediate folding to i32.or and i32.xor, now that the encoder's ORR/EOR immediate paths are correct + Ok-or-Err (#251). Same shape: when the operand before the op is `i32.const C` (C in 0..=0xFF) with its `movw` cleanly at the instruction tail, fold to `orr/eor rd, a, #C` and drop the materialization (foldable_bitwise_imm + drop_prev_const_materialization).

Bounded to 0..=0xFF by the encoder (ORR/EOR-imm > 0xFF returns Err until ThumbExpandImm lands, #251); the fold guard keeps it in range.

GATE (codegen change): clippy clean; 283 lib tests; the three frozen differential fixtures stay RESULT-identical (control_step 0x00210A55, flight_seam 0x07FDF307, div_const 338/338); test i32_or_xor_fold_small_const_into_immediate (both ops fold, no movw survives). CI fuzz adds totality.

Part of #242.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…-RA-001) (#254)

Completes the arithmetic+bitwise immediate-folding family (and/or/xor #250/#252, now add/sub). When the operand before i32.add/i32.sub is `i32.const C` with C in 0..=0xFFF and its `movw` cleanly at the tail, fold to `add/sub rd, a, #C` and drop the materialization.

Range is the full 0..=0xFFF (4095), wider than the bitwise 0..=0xFF, because the ADD/SUB immediate encoder now uses ADDW/SUBW (T4, plain imm12) for >0xFF (#253), which was verified correct before folding into it.

This one actively fires on real fixtures: control_step's frame `sub #16` and its const adds now fold (the differential stays result-identical, confirming correctness across a codegen change). Updated test_237_stack_pointer_global_is_register_promoted to accept the frame-size 16 in its folded `sub #16` form (still a plain scalar immediate, still not __synth_wasm_data-relocated, which is the property it checks).

GATE: clippy clean; 284 lib tests; three frozen differentials RESULT-identical (control_step 0x00210A55, flight_seam 0x07FDF307, div_const 338/338); test i32_add_sub_fold_const_into_immediate (byte + >0xFF ADDW-path values). CI fuzz adds totality.

Part of #242.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
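To show why the add/sub range can be wider than the bitwise one, here is a hedged sketch of the plain-imm12 form. It assumes the Thumb-2 ADDW/SUBW (T4) layout, where the 12-bit value is split across `i:imm3:imm8` with no ThumbExpandImm step. `AddSub` and `encode_addsub_imm12` are hypothetical names, and unlike the encoder described above (which uses T4 only for values above 0xFF) this sketch uses T4 for the whole 0..=0xFFF range to stay short.

```rust
/// Which arithmetic operation to encode (hypothetical helper enum).
#[derive(Clone, Copy)]
pub enum AddSub {
    Add,
    Sub,
}

/// Encode `addw/subw rd, rn, #imm` (T4, plain imm12) for imm in 0..=0xFFF.
pub fn encode_addsub_imm12(op: AddSub, rd: u8, rn: u8, imm: u32) -> Result<[u8; 4], String> {
    if imm > 0xFFF {
        return Err(format!("immediate {imm:#x} does not fit in imm12"));
    }
    // First-halfword base: ADDW = 11110 i 1 0000 0 Rn, SUBW = 11110 i 1 0101 0 Rn.
    let base: u16 = match op {
        AddSub::Add => 0xF200,
        AddSub::Sub => 0xF2A0,
    };
    // Split the 12-bit value: bit 11 -> i, bits 10:8 -> imm3, bits 7:0 -> imm8.
    let i = ((imm >> 11) & 0x1) as u16;
    let imm3 = ((imm >> 8) & 0x7) as u16;
    let imm8 = (imm & 0xFF) as u16;
    let hw1: u16 = base | (i << 10) | u16::from(rn & 0xF);
    let hw2: u16 = (imm3 << 12) | (u16::from(rd & 0xF) << 8) | imm8;
    let [a, b] = hw1.to_le_bytes();
    let [c, d] = hw2.to_le_bytes();
    Ok([a, b, c, d])
}
```

Because the field holds the value directly, every constant up to 4095 is encodable, which is what lets the fold guard for add/sub use 0..=0xFFF.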
…rectness fixes (#260)

Promote the accumulated v0.11.30 work into the CHANGELOG before tagging: native-pointer ABI (#237) + the VCR-* constant-immediate folding (#250/#252/#254) + analysis foundation (#243/#245) + three latent-miscompile encoder fixes (#251 ORR/EOR NOP, #253 ADD/SUB large-frame, #255 CMP/ADDS/SUBS ThumbExpandImm). Adds a falsification statement covering the encoder correctness class.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
The first delta-emitting codegen transform on the allocator track (VCR-RA-001, epic #242). Consolidates the encoder-bound finding (was #249) with the folding that uses it.
The waste (measured, #248)
`i32.const C; i32.and` lowered to `movw rN,#C; and rD,rA,rN` even when `C` is small enough to be an `AND` immediate: the redundant materialization gale measured on `flat_flight`.
The fix
When the operand before `i32.and` is `i32.const C` with `C ∈ 0..=0xFF` and its `movw` is cleanly at the instruction tail (not spilled), fold to `and rD, rA, #C` and drop the materialization (`foldable_bitwise_imm` + `drop_prev_const_materialization`, mirroring the const-divisor pattern).
Encoder bound (was #249)
Bounded to `0..=0xFF` because the encoder's `AND`-immediate path isn't yet ThumbExpandImm-complete (`and r2,r0,#0x7e → 00 f0 7e 02` is correct; `≥0x100` would mis-encode). `0..=0xFF` covers gale's int8 clamps. A pinning test documents the safe range; folding is guarded to it, so the encoder never sees an un-encodable immediate.
Measured delta
`(p & 0x7e) + (p & 0x7e)`: 8 → 6 instructions. Both `movw #126` are eliminated and each `AND` uses the immediate. Better than const-CSE (no materialization at all).
Full gate (codegen change)
- Frozen differential fixtures stay RESULT-identical: `control_step` `0x00210A55`, `flight_seam` `0x07FDF307`, `div_const` `338/338`.
- `i32_and_folds_small_const_into_immediate` (folded shape), `i32_and_does_not_fold_out_of_range_const` (`0x140` stays a register operand, the encoder safety bound), `and_immediate_encodes_correctly_in_byte_range...` (encoder).
- CI fuzz (`encoder_no_panic`, `wasm_ops_lower_or_error`) adds totality.
Supersedes #248 (the evidence it pinned is now folded away) and subsumes #249 (the encoder-bound test is included here).
Part of #242.
🤖 Generated with Claude Code