u64-packed FFI return: emit register-direct field access instead of generic 64-bit shift extraction #94

@avrabe

Context

The gale-ffi crate uses u64-packed return values for its FFI decision functions (per gale's #10 LTO regression guard). When LLVM cross-language LTO compiles this code, it produces tight inlined code in which the C caller accesses the fields of the packed return directly. When synth compiles the corresponding wasm, it emits a generic 64-bit shift-and-mask sequence: about 1.68× the bytes (138 vs. 82) for the same logic.
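For readers unfamiliar with the convention, a u64-packed decision might look roughly like the sketch below. The bit layout (action in the low byte, new_count in the upper 32 bits), the constant values, and the helper names are all assumptions made for illustration; this issue does not state gale-ffi's actual layout.

```rust
// Hypothetical layout, for illustration only (not taken from gale-ffi):
//   bits  0..8   action    (u8)
//   bits 32..64  new_count (u32)
pub const ACTION_NONE: u8 = 0; // hypothetical value
pub const ACTION_WAKE: u8 = 1; // hypothetical value

/// Pack a decision into a single u64 so it crosses the FFI boundary
/// as one scalar return value.
pub fn pack_decision(action: u8, new_count: u32) -> u64 {
    (action as u64) | ((new_count as u64) << 32)
}

/// Caller-side extraction of the action, written as the mask on a
/// 64-bit value that a wasm caller would contain.
pub fn unpack_action(packed: u64) -> u8 {
    (packed & 0xFF) as u8
}

/// Caller-side extraction of the new count: a shift by a whole word.
pub fn unpack_new_count(packed: u64) -> u32 {
    (packed >> 32) as u32
}
```

On 32-bit Arm (AAPCS) a 64-bit return value occupies r0 (low word) and r1 (high word), so under this assumed layout both fields already sit in their own register when the call returns.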

Concrete numbers (from silicon-anchor experiment)

Same input function (gale_k_sem_give_decide returning u64-packed action+new_count), same caller (z_impl_k_sem_give), same Cortex-M4F target:

| Route | Inlined z_impl_k_sem_give body size | Silicon handoff (cycles) |
|---|---|---|
| LLVM cross-language LTO | 82 bytes | 471 (ADC=n) / 558 (ADC=y) |
| wasm-ld merge → synth | 138 bytes | not yet measurable (memset bug, see issue #N) |

LLVM-LTO output (the gold standard)

8004398: ldrd r2, r1, [r0, #8]   ; load count, limit
800439c: cmp  r2, r1
800439e: it   cc
80043a0: addcc r2, #1
80043a2: str  r2, [r0, #8]

Five instructions. LLVM saw both the Result::ok().action == WAKE check on the Rust side and the C caller's branch on the same result, and deduplicated them.
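Read literally, the five instructions above amount to a bounded increment. A minimal Rust rendering of just the visible instructions is sketched here; the `Sem` struct, its padding field, and the function name are hypothetical stand-ins, chosen only so that count and limit land at offsets 8 and 12 as in the listing.

```rust
/// Hypothetical stand-in for the semaphore object. Only the two fields
/// the listing touches are modeled: count at offset 8, limit at offset 12.
#[repr(C)]
pub struct Sem {
    _head: [u32; 2], // 8 bytes of assumed leading state
    pub count: u32,
    pub limit: u32,
}

impl Sem {
    pub fn new(count: u32, limit: u32) -> Self {
        Sem { _head: [0; 2], count, limit }
    }
}

/// What the visible instructions compute: increment count unless it
/// has already reached limit.
pub fn give_visible_path(sem: &mut Sem) {
    let mut count = sem.count; // ldrd r2, r1, [r0, #8]  (r2 = count)
    let limit = sem.limit;     //                        (r1 = limit)
    if count < limit {         // cmp r2, r1 ; it cc     (unsigned lower)
        count += 1;            // addcc r2, #1
    }
    sem.count = count;         // str r2, [r0, #8]
}
```

No trace of the packed u64 remains in this path: the pack on the Rust side and the unpack on the C side have cancelled out.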

Synth output (same input, same target)

2f61c: and.w r0, r0, r2          ; mask u64 → u8 action
2f620: and.w r1, r1, r3
2f624: and similar masking
2f63a: f1b2 0320  subs.w r3, r2, #32   ; 64-bit shift dance
2f63e: d50a       bpl.n   2f662
2f640: f1c2 0320  rsb r3, r2, #32      ; the byte-extract path uses lsl/lsr/orr
2f644: fa01 f303  lsl.w r3, r1, r3
2f648: fa20 f002  lsr.w r0, r0, r2
2f64c: ea40 0003  orr.w r0, r0, r3
2f650: fa21 f102  lsr.w r1, r1, r2

~30 instructions for the same field extract.
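The lsl/lsr/orr sequence in the listing has the shape of a 64-bit logical shift right by a run-time amount, carried out on a pair of 32-bit registers. A sketch of that generic operation in Rust follows; the function name is hypothetical, and the explicit zero-shift case exists only because Rust, unlike the Arm register-shift instructions, does not allow shifting a u32 by 32.

```rust
/// Generic 64-bit logical shift right on a (lo, hi) pair of 32-bit
/// words, for a shift amount `s` in 0..64 not known at compile time.
pub fn lsr64_pair(lo: u32, hi: u32, s: u32) -> (u32, u32) {
    debug_assert!(s < 64);
    if s == 0 {
        // Nothing moves; handled separately to avoid `hi << 32` in Rust.
        (lo, hi)
    } else if s < 32 {
        // rsb r3, r2, #32 ; lsl r3, r1, r3 ; lsr r0, r0, r2 ; orr r0, r0, r3
        // lsr r1, r1, r2
        ((lo >> s) | (hi << (32 - s)), hi >> s)
    } else {
        // subs r3, r2, #32 ; bpl ...  (shift of a whole word or more)
        (hi >> (s - 32), 0)
    }
}
```

When `s` is a compile-time constant the branches disappear: `s == 0` leaves the field in the low word and `s == 32` is a plain copy of the high word, which is the simplification the recommendation below asks for.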

Recommendation

When synth lowers a wasm function returning i64 and the immediate caller bit-masks into byte-boundary fields:

  1. Recognize the packed-struct-return pattern (one or more & with byte-mask constants on the i64 result).
  2. Track which byte fields are used at each callsite.
  3. Emit register-direct field access via uxtb/mov instead of generic 64-bit shift extraction.

This would close ~50% of the LLVM-LTO size gap for FFI-shim style code, which is a common pattern across PulseEngine kernel work (gale-ffi, smart-data, etc.).
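Steps 1 and 3 can be sketched as a small classifier over the constant shift and mask applied to the i64 result. Everything here is an assumption for illustration: the `FieldAccess` type, the `classify` function, and the choice of `(x >> shift) & mask` as the matched form are hypothetical and not synth's actual IR or API. Per-callsite tracking (step 2) is not modeled.

```rust
/// What a 32-bit backend could emit for `(x >> shift) & mask` when the
/// i64 `x` lives in r0 (low word) and r1 (high word).
#[derive(Debug, PartialEq, Eq)]
pub enum FieldAccess {
    /// 8-bit field at bit 0 of a word: `uxtb rd, rN`
    Uxtb { high_word: bool },
    /// A whole 32-bit word: `mov rd, rN`
    Mov { high_word: bool },
    /// Any other byte-aligned field inside one word: `ubfx rd, rN, #lsb, #width`
    Ubfx { high_word: bool, lsb: u32, width: u32 },
}

/// Returns Some(..) when the extraction is a byte-boundary field that
/// lies entirely within one 32-bit word, None when the generic path is needed.
pub fn classify(shift: u32, mask: u64) -> Option<FieldAccess> {
    // The mask must be a contiguous run of ones starting at bit 0 (2^w - 1).
    if shift >= 64 || mask == 0 || (mask & mask.wrapping_add(1)) != 0 {
        return None;
    }
    let width = mask.trailing_ones();
    // Only byte-boundary fields are in scope.
    if shift % 8 != 0 || width % 8 != 0 {
        return None;
    }
    // Bits shifted in from above bit 63 are zero, so clamp the width.
    let width = width.min(64 - shift);
    let high_word = shift >= 32;
    let lsb = shift % 32;
    if lsb + width > 32 {
        return None; // field straddles r0 and r1
    }
    Some(match (lsb, width) {
        (0, 32) => FieldAccess::Mov { high_word },
        (0, 8) => FieldAccess::Uxtb { high_word },
        _ => FieldAccess::Ubfx { high_word, lsb, width },
    })
}
```

For example, `classify(0, 0xFF)` yields `Uxtb { high_word: false }` (a single `uxtb` on r0), while a field that crosses the word boundary returns `None` and would keep the existing shift sequence.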
