Lever #2 from the flat_flight gap decomposition: fuse mul + add → mla
Per the 262→103 gap decomposition, after const-CSE the next instruction-selection lever for flat_flight is multiply-accumulate fusion. Measured on current main (f6a0c96, cortex-m4f):
synth lowers the filter's gyro*980 + accel*20 as separate mul then add (2 sites — pitch + roll axes):
movw r4, #0x3d4 ; 980
mul r5, r3, r4 ; r5 = gyro * 980
movw r7, #0x14 ; 20
mul r8, r6, r7 ; r8 = accel * 20
add.w r2, r5, r8 ; r2 = r5 + r8 ← fuse this add into the mul
native (gcc -O2) uses MLA — the add is free:
mla r2, r7, r6, r2 ; r2 = 980 * accel + r2 (one instruction)
mla r4, r7, r5, r4
The transform
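Peephole: when the result of mul rD, rA, rB feeds exactly one add rE, rD, rC (rD not otherwise live, rA and rB unchanged in between), rewrite to mla rE, rA, rB, rC and drop the mul. Because add is commutative, rD may be either source operand; when both operands are mul results, as in the sequence above, only one mul is folded and the other supplies the accumulator rC. Cortex-M4 has single-cycle MLA. Saves 1 instruction + 1 temp register per site (2 sites in flat_flight), and it recurs in any a*k1 + b*k2 filter/accumulator (very common in control code).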
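To make the preconditions concrete, here is a minimal sketch of the fold over one straight-line block. It is not synth's IR or pass structure: `Instr`, `fuse_mla`, and `live_out` are hypothetical names invented for illustration, immediates are left out of the model, and it assumes non-flag-setting mul/add only (as far as I know Thumb-2 MLA has no flag-setting form, so a `muls`/`adds` pair whose flags are consumed must not be fused).

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical IR node: opcode, destination register, source registers."""
    op: str
    dst: str
    srcs: Tuple[str, ...] = ()


def _single_add_use(block: List[Instr], i: int, live_out: Set[str]) -> Optional[int]:
    """Index of the one add consuming block[i].dst, or None if fusing is unsafe."""
    mul = block[i]
    if len(mul.srcs) != 2 or mul.dst in mul.srcs:
        return None  # mul overwrites its own operand: not available at the add
    use = None
    for j in range(i + 1, len(block)):
        ins = block[j]
        if mul.dst in ins.srcs:
            ok = ins.op == "add" and len(ins.srcs) == 2 and ins.srcs.count(mul.dst) == 1
            if use is not None or not ok:
                return None  # second use, or a use that is not a plain reg+reg add
            use = j
        if use is None and ins.dst in mul.srcs:
            return None  # rA or rB clobbered before the add is reached
        if ins.dst == mul.dst:
            return use  # product overwritten; nothing later can read it
    return None if mul.dst in live_out else use


def fuse_mla(block: List[Instr], live_out: Set[str]) -> List[Instr]:
    """Fold mul + add into mla wherever the mul result has exactly one add use."""
    out = list(block)
    i = 0
    while i < len(out):
        ins = out[i]
        j = _single_add_use(out, i, live_out) if ins.op == "mul" else None
        if j is None:
            i += 1
            continue
        add = out[j]
        # add is commutative: the accumulator is whichever operand is not the product
        acc = add.srcs[0] if add.srcs[1] == ins.dst else add.srcs[1]
        out[j] = Instr("mla", add.dst, (ins.srcs[0], ins.srcs[1], acc))
        del out[i]  # drop the mul; slot i now holds the next instruction
    return out
```

Fed the five-instruction flat_flight sequence above with only r2 live out, this sketch folds the gyro multiply and keeps the accel multiply as the accumulator: movw, movw, mul r8, r6, r7, mla r2, r3, r4, r8. The mla is emitted at the add's position, which is why the check insists that rA and rB survive until then.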
Bonus adjacent: multiply-by-constant strength reduction
The multipliers here are constants (980, 20). Native strength-reduces *20 to add.w r,r,r,lsl#2; lsls #2 (= *5 *4), avoiding the movw #20; mul entirely. A mul-by-small-constant → shift/add peephole would compound with the MLA fusion. (Lower priority than the MLA fold itself.)
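A rough sketch of how such a peephole might decide when a constant is worth reducing. It assumes a budget of at most two ALU ops and only covers constants of the form (2^n ± 1) << s; the step names (`add_lsl`, `rsb_lsl`, `lsl`) and both functions are hypothetical, chosen to mirror add/rsb with a shifted operand and a plain shift.

```python
from typing import List, Optional, Tuple

# ("add_lsl", n): x = x + (x << n)   ("rsb_lsl", n): x = (x << n) - x
# ("lsl", n):     x = x << n
Step = Tuple[str, int]


def plan_mul_const(k: int) -> Optional[List[Step]]:
    """Plan x*k in at most two shift/add ops, or None to keep movw + mul."""
    if k <= 0:
        return None
    shift = (k & -k).bit_length() - 1  # count of trailing zero bits
    odd = k >> shift
    steps: List[Step] = []
    if odd != 1:
        if (odd - 1) & (odd - 2) == 0:  # odd == 2**n + 1
            steps.append(("add_lsl", (odd - 1).bit_length() - 1))
        elif (odd + 1) & odd == 0:  # odd == 2**n - 1
            steps.append(("rsb_lsl", (odd + 1).bit_length() - 1))
        else:
            return None  # needs more than two ops under this scheme
    if shift:
        steps.append(("lsl", shift))
    return steps


def apply_plan(x: int, steps: List[Step]) -> int:
    """Evaluate a plan with 32-bit wraparound, mirroring register arithmetic."""
    for op, n in steps:
        if op == "add_lsl":
            x = x + (x << n)
        elif op == "rsb_lsl":
            x = (x << n) - x
        else:
            x = x << n
        x &= 0xFFFFFFFF
    return x
```

Under these assumptions `plan_mul_const(20)` yields the same two-step shape as the native code (x + (x << 2), then << 2), while 980 gets no plan and stays a mul, which is the operand the MLA fold would then absorb.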
Scope
Pure instruction selection (no regalloc dependency), so it composes cleanly with the const-CSE/spill work on the VCR-RA-001 track. flat_flight-microbench (261) + controller (168) are staged — I'll post the silicon delta when it lands. Filing per my offer on #209; close as dup/wontfix if it's already on the roadmap.