u64-packed FFI return: emit register-direct field access instead of generic 64-bit shift extraction

## Context

The gale-ffi crate uses u64-packed return values for FFI decision functions (per gale's #10 LTO regression guard). When LLVM cross-language LTO compiles this code, it produces tight inlined code in which the C caller accesses the fields of the packed return directly. When synth lowers the wasm build of the same code, it emits a generic 64-bit shift-and-mask sequence: about 1.68× the bytes for the same logic (138 vs. 82 bytes, see the table below).
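For readers unfamiliar with the pattern, here is a minimal sketch of a u64-packed decision value. The bit layout (action in the low byte, new count in the high 32 bits), the `ACTION_WAKE` value, and all function names are assumptions made for illustration; the actual gale-ffi layout may differ.

```rust
/// Hypothetical action code; the real value used by gale-ffi is not given in this issue.
pub const ACTION_WAKE: u8 = 1;

/// Pack an action byte and a new count into one u64.
/// Assumed layout: bits 0..8 hold the action, bits 32..64 hold the new count.
pub fn pack_decision(action: u8, new_count: u32) -> u64 {
    (u64::from(new_count) << 32) | u64::from(action)
}

/// Extract the action byte (low 8 bits of the low word).
pub fn unpack_action(packed: u64) -> u8 {
    (packed & 0xFF) as u8
}

/// Extract the new count (the whole high word).
pub fn unpack_new_count(packed: u64) -> u32 {
    (packed >> 32) as u32
}
```

With a layout like this, both fields sit on byte boundaries inside a single 32-bit half, so on a 32-bit target each field can in principle be read from one register without any 64-bit shifting.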

## Concrete numbers (from silicon-anchor experiment)

Same input function (`gale_k_sem_give_decide` returning `u64`-packed action+new_count), same caller (`z_impl_k_sem_give`), same Cortex-M4F target:

| route | inlined `z_impl_k_sem_give` body size | silicon handoff (cycles) |
|---|---:|---:|
| LLVM cross-language LTO | 82 bytes | 471 (ADC=n) / 558 (ADC=y) |
| wasm-ld merge → synth | 138 bytes | not yet measurable (memset bug, see issue #N) |

## LLVM-LTO output (the gold standard)

```
8004398: ldrd r2, r1, [r0, #8]   ; load count, limit
800439c: cmp  r2, r1
800439e: it   cc
80043a0: addcc r2, #1
80043a2: str  r2, [r0, #8]
```

Five instructions. LLVM saw the `Result::ok().action == WAKE` check on the Rust side and the C caller's branching, and deduplicated them.
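Read back into source terms, the five instructions amount to an increment of `count` bounded by `limit`. The sketch below is a rough Rust rendering. The `SemSketch` struct and its leading 8 bytes are hypothetical; they exist only so that `count` and `limit` land at offsets 8 and 12, matching the `ldrd r2, r1, [r0, #8]` in the listing.

```rust
/// Hypothetical stand-in for the semaphore object; not the real kernel struct.
#[repr(C)]
pub struct SemSketch {
    pub _head: [u32; 2], // assumed 8 bytes of unrelated fields before `count`
    pub count: u32,      // offset 8
    pub limit: u32,      // offset 12
}

/// What the LTO output computes, one comment per instruction.
pub fn give_increment(sem: &mut SemSketch) {
    let count = sem.count; // ldrd r2, r1, [r0, #8]  (r2 = count)
    let limit = sem.limit; //                        (r1 = limit)
    // cmp r2, r1 ; it cc ; addcc r2, #1
    // `cc` is the unsigned "lower" condition, so the add runs only if count < limit.
    let new_count = if count < limit { count + 1 } else { count };
    sem.count = new_count; // str r2, [r0, #8]
}
```

No packed value, mask, or shift survives in this form, which is the property the synth route currently lacks.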

## Synth output (same input, same target)

```
2f61c: and.w r0, r0, r2          ; mask u64 → u8 action
2f620: and.w r1, r1, r3
2f624: and similar masking
2f63a: f1b2 0320  subs.w r3, r2, #32   ; 64-bit shift dance
2f63e: d50a       bpl.n   2f662
2f640: f1c2 0320  rsb r3, r2, #32      ; the byte-extract path uses lsl/lsr/orr
2f644: fa01 f303  lsl.w r3, r1, r3
2f648: fa20 f002  lsr.w r0, r0, r2
2f64c: ea40 0003  orr.w r0, r0, r3
2f650: fa21 f102  lsr.w r1, r1, r2
```

~30 instructions in total for the same field extract (the listing above is an excerpt).
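The `subs`/`bpl`/`rsb`/`lsl`/`lsr`/`orr` run in the excerpt looks like the usual way to perform a variable-amount 64-bit logical right shift on a pair of 32-bit registers. The sketch below models that in Rust; the function names are illustrative and not taken from synth.

```rust
/// Generic 64-bit logical right shift on a (lo, hi) pair of 32-bit halves,
/// for a shift amount that is only known at run time.
pub fn lsr64_generic(lo: u32, hi: u32, shift: u32) -> (u32, u32) {
    assert!(shift < 64);
    if shift >= 32 {
        // Branch taken by `bpl`: the result comes from the high word only.
        (hi >> (shift - 32), 0)
    } else if shift == 0 {
        // Rust rejects a 32-bit shift by 32, so handle 0 separately.
        // (An ARM register-specified shift by 32 simply yields 0.)
        (lo, hi)
    } else {
        // rsb / lsl / lsr / orr / lsr: combine bits from both halves.
        ((lo >> shift) | (hi << (32 - shift)), hi >> shift)
    }
}

/// When the shift is the constant 32, the whole sequence collapses:
/// the wanted field already is the high register.
pub fn high_word_direct(_lo: u32, hi: u32) -> u32 {
    hi
}
```

When the field boundaries are compile-time constants on byte boundaries, none of the run-time branching in `lsr64_generic` is needed, which is the gap the recommendation targets.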

## Recommendation

When synth lowers a wasm function returning i64 and the immediate caller bit-masks into byte-boundary fields:

1. Recognize the packed-struct-return pattern (one or more `&` with byte-mask constants on the i64 result).
2. Track which byte fields are used at each callsite.
3. Emit register-direct field access via `uxtb`/`mov` instead of generic 64-bit shift extraction.

This would close an estimated ~50% of the LLVM-LTO size gap (138 vs. 82 bytes in the measurement above) for FFI-shim style code, a common pattern across PulseEngine kernel work (gale-ffi, smart-data, etc.).
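Steps 1 to 3 of the recommendation can be sketched roughly as follows. The `Expr` IR, `ByteField`, and both functions are hypothetical and say nothing about synth's real internals. The sketch assumes the usual 32-bit ARM convention that an i64 result is returned with its low word in `r0` and its high word in `r1`.

```rust
/// Toy expression IR for what a caller does with the i64 result (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    CallResult, // the i64 returned by the callee
    Const(u64),
    And(Box<Expr>, Box<Expr>),
    ShrU(Box<Expr>, Box<Expr>),
}

/// A byte-aligned field inside the packed return value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ByteField {
    pub byte_offset: u32,
    pub byte_width: u32,
}

/// Step 1 and 2: match `(result >> k) & mask` with k a multiple of 8 and a byte mask.
pub fn match_byte_field(e: &Expr) -> Option<ByteField> {
    let Expr::And(value, mask) = e else { return None };
    let byte_width = match &**mask {
        Expr::Const(0xFF) => 1,
        Expr::Const(0xFFFF) => 2,
        Expr::Const(0xFFFF_FFFF) => 4,
        _ => return None,
    };
    let shift = match &**value {
        Expr::CallResult => 0,
        Expr::ShrU(inner, amount) => match (&**inner, &**amount) {
            (Expr::CallResult, Expr::Const(k)) if *k < 64 && *k % 8 == 0 => *k as u32,
            _ => return None,
        },
        _ => return None,
    };
    let byte_offset = shift / 8;
    // Reject fields that straddle the r0/r1 boundary; those still need two registers.
    if byte_offset % 4 + byte_width > 4 {
        return None;
    }
    Some(ByteField { byte_offset, byte_width })
}

/// Step 3: pick a single-register instruction for the field (`rd` is a placeholder).
pub fn lowering_hint(f: ByteField) -> String {
    let reg = if f.byte_offset < 4 { "r0" } else { "r1" };
    let bit = (f.byte_offset % 4) * 8;
    match (f.byte_width, bit) {
        (4, 0) => format!("mov rd, {reg}"),
        (1, 0) => format!("uxtb rd, {reg}"),
        (2, 0) => format!("uxth rd, {reg}"),
        (w, b) => format!("ubfx rd, {reg}, #{b}, #{}", w * 8),
    }
}
```

Anything the matcher rejects would fall back to the existing generic lowering, so a pass like this could be introduced conservatively.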

## Cross-references

- Silicon-anchor evidence: pulseengine/gale@f6db15e, PR #40.
- Full disassembly comparison: `benches/engine_control/silicon/boards/nucleo_g474re/NOTES-wasm-cross-lto-spike.md`.

