perf: in-place select — elide keep-val2 move (#209, VCR-SEL-002)#283
Conversation
…, VCR-SEL-002) The stack selector's `Select` lowering emitted three instructions — `cmp cond,#0; selectmove dst,val1,NE; selectmove dst,val2,EQ` — into a freshly allocated `dst`. Because `select` *consumes* all three operands, `val2`'s register is dead afterward, so we reuse it as `dst`: the EQ "keep val2" branch is then a no-op move on itself and is elided. Exactly one conditional move (the NE override) remains — what native emits for a clamp `(x>k)?k:x`. Each removed `SelectMove` is one fewer IT;MOV: −2 bytes and −1 cycle (the M4 still executes a predicated-false MOV). Soundness — in-place only when `val2` is safely overwritable: - not a live param register (#193 param-clobber class) - distinct from `cond` (cmp consumed it, but a later read can't) - distinct from `val1` (degenerate; fresh dst keeps it simple) Otherwise falls back to the original fresh-dst two-move form. No instruction sits between the CMP and the conditional move in the in-place path, so the flags are intact (same property the two-move form relied on). Gate — RESULT-identical on every frozen differential (this is an optimization: bytes change, results do not): control_step 13/13 0x00210A55 PASS flight_seam inlined 0x07FDF307 MATCH flight_seam flat 0x07FDF307 MATCH div_const 338/338 PASS Measured .text reduction: control_step 378 -> 354 B (-24, -6.3%) flight_seam inlined 1058 -> 1020 B (-38, -3.6%) flight_seam flat 1294 -> 1244 B (-50, -3.9%) Tests: updated test_control_flow_select / _stack_mode_with_constants / _with_global_values to assert the NE override (present in both forms); new test_select_in_place_elides_keep_move_209 pins one-NE-zero-EQ. 323/323 synth-synthesis lib tests pass, clippy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
On-target results for the in-place select (#283) — this is the biggest lever so far, and it's a clean non-regressing win (eliding moves only):
The clamps were the dominant cost (your "18 IT-blocks vs native 6" from the gap decomposition), and eliding the keep-val2 move per flat_flight now 241 / 103 = 2.34× (was 2.54× a few days ago, 3.18× at v0.11.18). All seams bit-identical. If this is the new default (no flag), I'd update the acceptance gate to flat_flight 241 / controller 150 / control_step 151 / filter 37 once it merges — and the const-CSE/allocator deltas then stack on top of this lower baseline. Nice — the select lowering was hiding a lot. 🎯 |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
What
The stack selector's
Selectlowering emitted three instructions into a freshdst:Since
selectconsumes all three operands,val2's register is dead after the op. We reuse it asdst, so the EQ "keep val2" branch becomes a no-op move on itself and is elided — leaving exactly one conditional move (the NE override), which is what native emits for a clamp(x>k)?k:x. Each removedSelectMoveis one fewerIT;MOV: −2 bytes and −1 cycle (the M4 executes a predicated-false MOV).Soundness
In-place only when
val2is safely overwritable, else fall back to the original two-move form:cond— already consumed by the CMPval1— degenerate; fresh dst keeps it simpleNo instruction sits between the CMP and the conditional move in the in-place path, so the flags survive (the same property the two-move form relied on to avoid a flag-clobbering MOVS).
Oracle — RESULT-identical on every frozen differential
This is an optimization: bytes change, results do not.
0x00210A550x07FDF3070x07FDF307Measured
.textreductionTests
test_control_flow_select/test_select_stack_mode_with_constants/test_select_with_global_valuesto assert the NE override (emitted in both forms; the EQ move is what in-place elides).test_select_in_place_elides_keep_move_209pins one-NE-zero-EQ for a reusableval2.synth-synthesislib tests pass, clippy clean.Falsification
This change is wrong if any module produces a different result than wasmtime where
val2's register was actually still needed after the select — i.e. the live-param / aliasing guards missed a case. Watched by the four differentials above + gale on-target against the 5 frozen baselines.Traceability: VCR-SEL-002 (filed in #282). Part of the #209 codegen-quality program.
🤖 Generated with Claude Code