selector: fuse mul+add → mla — flat_flight filter emits separate mul+add where native uses MLA

## Lever #2 from the flat_flight gap decomposition: fuse `mul` + `add` → `mla`

Per the [262→103 gap decomposition](https://github.com/pulseengine/synth/issues/209#issuecomment-4625712211), after const-CSE the next instruction-selection lever for `flat_flight` is **multiply-accumulate fusion**. Measured on current `main` (f6a0c96, cortex-m4f):

**synth** lowers the filter's `gyro*980 + accel*20` as separate `mul` then `add` (2 sites — pitch + roll axes):
```
movw r4, #0x3d4        ; 980
mul  r5, r3, r4        ; r5 = gyro * 980
movw r7, #0x14         ; 20
mul  r8, r6, r7        ; r8 = accel * 20
add.w r2, r5, r8       ; r2 = r5 + r8   ← fuse this add into the mul
```
**native (gcc -O2)** uses MLA — the add is free:
```
mla r2, r7, r6, r2     ; r2 = 980 * accel + r2   (one instruction)
mla r4, r7, r5, r4
```

### The transform
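Peephole: when the result of `mul rD, rA, rB` feeds exactly one `add rE, rD, rC` (in either operand position, since `add` is commutative) and `rD` is dead after that `add`, rewrite the pair to `mla rE, rA, rB, rC` and drop the `mul`. The rewrite is only safe if `rA` and `rB` are not redefined between the two instructions and neither instruction is a flag-setting form whose flags are consumed (Thumb-2 has no flag-setting `MLA`). Cortex-M4 has single-cycle `MLA`. Saves **1 instruction + 1 temp register per site**: 2 sites in `flat_flight`, and it recurs in any `a*k1 + b*k2` filter/accumulator (very common in control code).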
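
To make the matching conditions concrete, here is a minimal sketch of the peephole over a toy straight-line instruction list. It is not synth's actual IR or selector API: `Insn`, `fuse_mul_add`, and `live_out` are hypothetical names invented for illustration, the model covers a single basic block, and `add` is assumed to be the non-flag-setting form.

```python
from typing import NamedTuple, Optional

class Insn(NamedTuple):
    """Toy instruction (hypothetical, not synth's IR): opcode, dest reg, source regs."""
    op: str
    dst: str
    srcs: tuple

def _live_after(insns, reg, start, live_out):
    """True if reg is read at or after index start before being overwritten."""
    for ins in insns[start:]:
        if reg in ins.srcs:
            return True
        if ins.dst == reg:
            return False
    return reg in live_out  # never redefined: live only if the block exports it

def _find_fusable_add(insns, i, live_out) -> Optional[int]:
    """Index of the add that can absorb the mul at index i, or None."""
    mul = insns[i]
    rd = mul.dst
    if rd in mul.srcs:
        return None  # mul overwrote its own operand; it is gone by the time of the add
    for j in range(i + 1, len(insns)):
        ins = insns[j]
        if rd in ins.srcs:
            # The first reader must be an add that uses rd exactly once.
            if ins.op != "add" or ins.srcs.count(rd) != 1:
                return None
            # rd must be dead afterwards (the add redefining rd also kills it).
            if ins.dst != rd and _live_after(insns, rd, j + 1, live_out):
                return None
            return j
        if ins.dst == rd or ins.dst in mul.srcs:
            return None  # product or a multiplicand clobbered before the add
    return None

def fuse_mul_add(insns, live_out=frozenset()):
    """Rewrite mul+add pairs into mla where the conditions above hold."""
    out = list(insns)
    i = 0
    while i < len(out):
        j = _find_fusable_add(out, i, live_out) if out[i].op == "mul" else None
        if j is None:
            i += 1
            continue
        mul, add = out[i], out[j]
        acc = add.srcs[1] if add.srcs[0] == mul.dst else add.srcs[0]
        out[j] = Insn("mla", add.dst, (mul.srcs[0], mul.srcs[1], acc))
        del out[i]  # drop the mul; out[i] is now the following instruction
    return out

# The sequence from this issue: gyro*980 + accel*20.
block = [
    Insn("movw", "r4", ()),            # 980
    Insn("mul",  "r5", ("r3", "r4")),
    Insn("movw", "r7", ()),            # 20
    Insn("mul",  "r8", ("r6", "r7")),
    Insn("add",  "r2", ("r5", "r8")),
]
fused = fuse_mul_add(block, live_out=frozenset({"r2"}))
```

On this input the first `mul` folds into the `add`, leaving `mul r8, r6, r7` followed by `mla r2, r3, r4, r8`. Only one of the two multiplies can fuse per sum, because the other still has to produce the accumulator operand.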

### Bonus adjacent: multiply-by-constant strength reduction
The multipliers here are **constants** (980, 20). Native strength-reduces `*20` to `add.w r, r, r, lsl #2; lsls r, r, #2` (`r + 4r = 5r`, then `5r * 4 = 20r`), avoiding the `movw #20; mul` entirely. A `mul`-by-small-constant → shift/add peephole would compound with the MLA fusion. (Lower priority than the MLA fold itself.)
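
As a rough illustration of which constants qualify for the two-instruction form above, the sketch below tests whether a multiplier can be written as `(2^s + 1) << t`. That is the shape an `add` with a shifted operand followed by a left shift can compute. The function names are hypothetical, and a real selector would likely handle more shapes (for example `2^s - 1` via `rsb`), which this sketch deliberately leaves out.

```python
MASK32 = 0xFFFFFFFF

def shift_add_form(k):
    """Return (s, t) such that k == ((1 << s) + 1) << t, or None if no such form exists."""
    if k <= 0:
        return None
    t = (k & -k).bit_length() - 1      # number of trailing zero bits
    odd = k >> t
    if odd < 3:
        return None                    # pure power of two: a single shift, not this pattern
    s = (odd - 1).bit_length() - 1
    if odd - 1 != 1 << s:
        return None                    # odd part is not of the form 2^s + 1
    return (s, t)

def mul_by_shift_add(x, s, t):
    """Model `add r, r, r, lsl #s` then `lsl r, r, #t` on 32-bit registers."""
    r = (x + (x << s)) & MASK32
    return (r << t) & MASK32

form_20 = shift_add_form(20)    # (2, 2): (x + (x << 2)) << 2
form_980 = shift_add_form(980)  # None: 980 = 245 << 2 and 245 is not 2^s + 1
```

Under this check `20` qualifies and `980` does not, so in the filter above only the `accel*20` term would be a candidate for the shift/add form, while `gyro*980` would keep its `mul` (or `mla`).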

### Scope
Pure instruction selection (no regalloc dependency), so it composes cleanly with the const-CSE/spill work on the VCR-RA-001 track. `flat_flight-microbench` (261) + `controller` (168) are staged — I'll post the silicon delta when it lands. Filing per my offer on #209; close as dup/wontfix if it's already on the roadmap.


selector: fuse mul+add → mla — flat_flight filter emits separate mul+add where native uses MLA #257
