Context
The gale-ffi crate uses u64-packed return values for FFI decision functions (per gale's #10 LTO regression guard). When LLVM cross-language LTO processes this code, it produces tight inlined code in which the C caller accesses fields of the packed return directly. When synth processes the same wasm, it emits a generic 64-bit shift-and-mask sequence: about 1.68× the bytes (138 vs. 82) for the same logic.
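To make the packing concrete, here is a minimal Rust sketch of a u64-packed decision return. The bit layout (action in bits 0..8, new_count in bits 32..64), the `ACTION_*` values, and every function name below are assumptions chosen for illustration; gale-ffi's actual layout is not shown in this issue.

```rust
// Hypothetical layout, NOT taken from gale-ffi:
//   bits  0..8   action    (u8)
//   bits 32..64  new_count (u32)
pub const ACTION_NONE: u8 = 0;
pub const ACTION_WAKE: u8 = 1;

/// Pack an action and a count into a single u64 so the pair
/// crosses the FFI boundary as one scalar return value.
pub fn pack_decision(action: u8, new_count: u32) -> u64 {
    (action as u64) | ((new_count as u64) << 32)
}

/// Extract the action byte (low 8 bits of the packed value).
pub fn unpack_action(packed: u64) -> u8 {
    (packed & 0xFF) as u8
}

/// Extract the count (high 32 bits of the packed value).
pub fn unpack_new_count(packed: u64) -> u32 {
    (packed >> 32) as u32
}

/// Illustrative stand-in for a "give" decision function:
/// wake a waiter if there is one, otherwise bump the count up to the limit.
pub fn sem_give_decide(count: u32, limit: u32, has_waiter: bool) -> u64 {
    if has_waiter {
        pack_decision(ACTION_WAKE, count)
    } else if count < limit {
        pack_decision(ACTION_NONE, count + 1)
    } else {
        pack_decision(ACTION_NONE, count)
    }
}
```

With this assumed layout on a 32-bit Arm target, where a 64-bit return value occupies r0 (low word) and r1 (high word), the action is the low byte of r0 and new_count is all of r1, so neither field needs a 64-bit shift to read.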
Concrete numbers (from silicon-anchor experiment)
Same input function (gale_k_sem_give_decide returning u64-packed action+new_count), same caller (z_impl_k_sem_give), same Cortex-M4F target:
| route | inlined z_impl_k_sem_give body size | silicon handoff (cycles) |
| --- | --- | --- |
| LLVM cross-language LTO | 82 bytes | 471 (ADC=n) / 558 (ADC=y) |
| wasm-ld merge → synth | 138 bytes | not yet measurable (memset bug, see issue #N) |
LLVM-LTO output (the gold standard)
8004398: ldrd r2, r1, [r0, #8] ; load count, limit
800439c: cmp r2, r1
800439e: it cc
80043a0: addcc r2, #1
80043a2: str r2, [r0, #8]
5 instructions. LLVM saw the Result::ok().action == WAKE check on Rust's side and the C caller's branching, and dedup'd them.
Synth output (same input, same target)
2f61c: and.w r0, r0, r2 ; mask u64 → u8 action
2f620: and.w r1, r1, r3
2f624: and similar masking
2f63a: f1b2 0320 subs.w r3, r2, #32 ; 64-bit shift dance
2f63e: d50a bpl.n 2f662
2f640: f1c2 0320 rsb r3, r2, #32 ; the byte-extract path uses lsl/lsr/orr
2f644: fa01 f303 lsl.w r3, r1, r3
2f648: fa20 f002 lsr.w r0, r0, r2
2f64c: ea40 0003 orr.w r0, r0, r3
2f650: fa21 f102 lsr.w r1, r1, r2
~30 instructions for the same field extract.
Recommendation
When synth lowers a wasm function returning i64 and the immediate caller bit-masks into byte-boundary fields:
- Recognize the packed-struct-return pattern (one or more & with byte-mask constants on the i64 result).
- Track which byte fields are used at each callsite.
- Emit register-direct field access via uxtb/mov instead of generic 64-bit shift extraction.
This would close ~50% of the LLVM-LTO size gap for FFI-shim style code, which is a common pattern across PulseEngine kernel work (gale-ffi, smart-data, etc.).
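The recognition step above can be sketched as a small mask classifier. This is a rough illustration, not synth's implementation: synth's IR is not shown in this issue, so the sketch works on a bare mask constant, and the `FieldAccess` type and `classify_mask` function are hypothetical names. It assumes the usual 32-bit Arm convention of returning an i64 in r0 (low word) and r1 (high word).

```rust
/// Where a byte-aligned field of an i64 return value lives, assuming
/// the low word is in r0 (reg = 0) and the high word in r1 (reg = 1).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum FieldAccess {
    /// A single byte at offset 0 of a word: one `uxtb` extracts it.
    ByteZero { reg: u8 },
    /// A whole 32-bit word: the register already holds the field (`mov` or nothing).
    WholeWord { reg: u8 },
    /// Byte-aligned field inside one word that still needs an in-register
    /// extract (for example a bitfield extract such as `ubfx`).
    SubWord { reg: u8, byte_offset: u8, byte_len: u8 },
}

/// Classify an AND-mask applied to an i64 result. Returns None when the
/// mask is not a contiguous, byte-aligned run that fits in a single
/// 32-bit half, in which case the generic lowering should be kept.
pub fn classify_mask(mask: u64) -> Option<FieldAccess> {
    if mask == 0 {
        return None;
    }
    let lo_bit = mask.trailing_zeros();
    let width = 64 - mask.leading_zeros() - lo_bit;
    // Rebuild the contiguous run covering [lo_bit, lo_bit + width) and compare.
    let run = if width == 64 {
        u64::MAX
    } else {
        ((1u64 << width) - 1) << lo_bit
    };
    if run != mask || lo_bit % 8 != 0 || width % 8 != 0 {
        return None;
    }
    let hi_bit = lo_bit + width; // exclusive upper bound
    let reg: u8 = if hi_bit <= 32 {
        0
    } else if lo_bit >= 32 {
        1
    } else {
        return None; // field straddles the r0/r1 boundary
    };
    let byte_offset = ((lo_bit % 32) / 8) as u8;
    let byte_len = (width / 8) as u8;
    Some(match (byte_offset, byte_len) {
        (0, 1) => FieldAccess::ByteZero { reg },
        (0, 4) => FieldAccess::WholeWord { reg },
        _ => FieldAccess::SubWord { reg, byte_offset, byte_len },
    })
}
```

For example, `classify_mask(0xFF)` yields `ByteZero { reg: 0 }` (a `uxtb` on r0), and `classify_mask(0xFFFF_FFFF_0000_0000)` yields `WholeWord { reg: 1 }` (r1 used directly). Returning `None` for anything irregular keeps the optimization conservative: unrecognized masks simply fall back to the existing shift-and-mask path.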
Cross-references
benches/engine_control/silicon/boards/nucleo_g474re/NOTES-wasm-cross-lto-spike.md.