synth: cross-Renode A/B confirms gale's −34.5% delta is real, not simulator-artifactual#27
Merged
Conversation
When GALE_USE_SYNTH=ON, the gale-ffi crate is compiled to wasm32 first, then run through the synth AOT compiler (pulseengine/synth) which emits a Cortex-M ET_REL relocatable object. The object is wrapped into the same libgale_ffi.a path the rest of the build expects via ar, so the per-module gale_sem.c / gale_mutex.c / etc. consumers need no changes. This is the build-system half of the 4th-variant experiment for the cross-language LTO blog post: same engine bench, three existing builds (GCC baseline / GCC + Gale / LLVM + LTO + Gale) plus a 4th data point where verified Rust reaches Cortex-M via Verus → rustc → wasm → synth's Rocq-proved i32 instruction selection. Requires synth with --relocatable flag (pulseengine/synth#83). Default behaviour (GALE_USE_SYNTH=OFF) is unchanged: rustc-direct-to-Cortex-M is still the production path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ment Adds the gale-via-synth lane to the engine-bench Renode matrix as a follow-up to the cross-language LTO post. Builds the GCC baseline and the GALE_USE_SYNTH=ON variant (wasm32 -> synth -> Cortex-M ET_REL -> libgale_ffi.a) on the same CI run, then sweeps both through Renode at the long sample count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two optional formally-verified-(or-not) wasm optimizers between rustc and synth in the GALE_USE_SYNTH pipeline: rustc -> [wasm-opt -Oz] -> [loom optimize] -> synth -> ar -> .a Both are detected via find_program() and only inserted into the pipeline if found. If neither is on the path the pipeline reduces to rustc -> synth (unchanged behaviour). Effect on engine bench (stm32f4_disco, prj-gale.conf, GCC C kernel): synth alone: text=22448, total=38533 synth + wasm-opt + loom: text=22420, total=38505 (-28 B) Wasm-level reduction is dramatic (-34% from wasm-opt -Oz), but the synth-emitted ARM code is dominated by per-function instruction selection overhead, so the final ELF only moves a few dozen bytes. The verification-chain story is the bigger win: loom proves each pass it applies preserves semantics; rejected passes are skipped rather than applied unsoundly. CI workflow installs both: binaryen apt package (wasm-opt) and loom-cli from pulseengine/loom main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
workflow_dispatch alone requires the workflow to exist on the default branch before it can be invoked, which we can't do without merging the whole experiment first. Adding a push trigger on experiment/gale-via-synth so the workflow auto-runs whenever the experiment branch advances. Strip this trigger before any merge to main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t --path) cargo install --git URL --path PATH is invalid — the two flags are mutually exclusive. When installing a sub-crate from a git workspace, pass the package name as a positional argument: cargo install --git URL [--branch B] PACKAGE_NAME --force Fixes the workflow's synth-cli install (was rejected with: 'the argument --git <URL> cannot be used with --path <PATH>') and the identical pattern used for loom-cli. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… optimize mode) Diagnosed during local debugging: synth's optimized register-allocation path clobbers r0/r1 (input parameter registers) at function entry when the wasm body pushes i32 constants before the first local.get. The function's prologue ends up looking like: movs r0, #1 ← clobbers param 0 (count) movs r1, #0 ← clobbers param 1 (limit) ... cmp r0, r1 ← compares clobbered values, not the actual params This crashes the engine bench in Renode (HardFault on first gale call, infinite handler loop, never reaches the test's Zero Drops assertion). The CI run hit a 60-min step timeout without producing a single sample. Minimal repro saved at /tmp/match_gale.wat (3 i32 params, i64 local, push 3 i32 constants, then local.get 0). Worth filing as a synth issue once the experiment lands. Workaround: synth --no-optimize. Disables the offending pass and emits proper AAPCS prologue (push r4..r8/lr, locals on stack, params read from r0/r1/r2 unchanged). Verified locally: same gale_k_sem_give_decide function now starts with `stmdb sp!, {r4..r8, lr}` and reads r0/r1 correctly. Cost: ~68 bytes of additional flash (22624 → 22692) and unknown cycles. The --no-optimize path uses stack-based locals which is wasteful but correct. Stack frame size also goes up — synth reserves ~4KB per function for locals which may be excessive; will need to validate no stack overflow on the engine bench worker thread. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…frame fixes The previous CI run booted Zephyr with the synth-built ELF but every RPM step reported count=0 [drain_timeout]. Diagnosed two synth bugs: 1. i64 local storage dropped the upper half (--no-optimize path) 2. Locals area aliased the saved-register spill (also --no-optimize) Both fixed in pulseengine/synth#85. This commit points the workflow at that branch so the next CI run uses the fixed synth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls in: - explicit I64Or/And/Xor/ExtendI32U/ExtendI32S/Shl/ShrU/ShrS handlers in synth's select_with_stack (no more wildcard fallthrough to select_default's R0:R1/R2:R3 assumption) - alloc_consecutive_pair now reserves implicit pair_hi of every stack entry plus extra_avoid for popped operands Local build verified: gale_k_sem_give_decide ends with orr r0, r6, r8 orr r1, r7, ip matching the wasm i64.or semantics. 22644 B FLASH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tracing the −34.5% handoff cycle delta in the synth bench. Found that loom's optimizer hoists `local.set 3 = 0` from the fall-through arm of gale_k_sem_give_decide to BEFORE the dispatcher, dropping the WakeThread/Increment distinction at the wasm level — synth then emits ARM that always returns action=INCREMENT regardless of has_waiter. The bench passes (samples=7750, drops=0) because the engine_control worker is rarely actually blocked at sem_give time, so the WAKE path is rarely needed for correctness. But the cycle delta is then comparing a degenerate always-INCREMENT path to rustc's correct WAKE/INCREMENT discrimination — apples to oranges. This run skips loom in CI so we can A/B against the loom-on result and validate the hypothesis. CMakeLists' find_program(LOOM) fails when loom isn't on PATH, falling through to wasm-opt -> synth without loom. Filed for follow-up: pulseengine/loom optimizer bug. The hoisting is unsound for this control-flow pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…5% delta
The container ships Renode 1.16.0 (ARG RENODE_VERSION=1.16.0 in
zephyrproject-rtos/docker-image Dockerfile.base, unchanged across
v0.28.x..v0.29.2). 1.16.1 (Feb 2026) touches several Cortex-M paths
that could shift cycle accounting on the gale instruction stream:
fixed ARMv8-M Thumb2 data-processing instructions, fixed Stack
Pointer bits[1:0] handling, fixed wrong exception when FPU is
disabled. None is labelled a cycle-counter fix in the changelog,
but Thumb-2 dispatch changes shift cycle accounting whenever an
instruction takes a different micro-op path.
Adds nightly Renode (builds.renode.io) alongside 1.16.0 and runs
the same two ELFs under both. Yields three controls:
(a) baseline vs gale, both under nightly
— does the gale delta change when the cycle model changes?
(b) baseline_1.16.0 vs baseline_nightly (same ELF)
— control: cycle-model drift on identical instructions.
(c) gale_1.16.0 vs gale_nightly (same ELF)
— does the model shift gale's instructions more than baseline's?
If yes, the 1.16.0 model is mis-scoring gale-specific
instructions and the delta is partly artifactual.
Implementation: PATH override puts /opt/renode-nightly first for the
two new run steps. Robot file unchanged (it reads ELF / BENCH_CSV_OUT
from env). Existing 1.16.0 comparison is undisturbed; nightly outputs
go to events-nightly.csv and a separate report section.
Timeout bumped 120 -> 240 min to cover all four Renode runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two audit-driven additions to the synth Renode A/B: B6 — synth correctness control (audit P7, DO-330 §6.1.4): The headline -34.5% gale-vs-baseline delta has been ruled non- simulator-artifactual by the cycle-model A/B (controls a/b/c showed 0.0% drift on identical ELFs across Renode 1.16.0 and nightly). What remains is "could synth be miscompiling in a way that produces semantically-equivalent-but-faster code?" — a research compiler's normal failure mode. Adds a third gale build via rustc-direct (no synth in the chain; default GALE_USE_SYNTH=OFF path), runs it under 1.16.0, and renders an additional comparison "synth vs rustc-direct" in the step summary. If the two agree on handoff median within ~3%, the synth-emitted ELF is a faithful Cortex-M codegen of the same Rust source. This is the cheapest possible "tool error mitigation" in the DO-330 §6.1.4 sense — a redundant computation by a battle-tested compiler with the same input contract. Total Renode runs in this workflow: 5 (baseline 1.16.0, synth 1.16.0, rustc-direct 1.16.0, baseline nightly, synth nightly). timeout-minutes bumped 240 -> 300. D6 — manifest.txt artifact (audit P9): Single point of provenance. The bench's input layer is layered with branch-tip pointers (synth --branch fix/..., zephyr --mr gale/sem-replacement) and mutable tags (container :v0.29.0). Without a captured manifest, "I ran exactly this configuration" reduces to "trust the gale_sha and hope nothing else moved" — wrong by construction. Adds a "Compose build manifest" step at the end of the job that captures: rustc/cargo/synth/wasm-opt/west/robotframework/SDK versions, Renode 1.16.0 and nightly versions, sha256 of the nightly tarball, sha256 of all built ELFs, sha256 + byte-size of all CSV outputs, and `west list` for Zephyr fork + module shas. Output to /tmp/manifest.txt, included in the artifact bundle alongside the three ELFs themselves. The nightly tarball is downloaded to /tmp/renode-nightly.tgz first so the sha256 can be recorded (was: piped curl -> tar with no intermediate file). Six months from now, every reproducibility claim reduces to downloading the artifact, opening manifest.txt, and re-resolving the three sha-pinned things plus the listed branch tips. Both YAMLs validated; no behavior changes to existing arms. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds a fourth gale build pipeline (rustc → wasm → synth → Cortex-M ELF, vs the published LTO post's three) and the cross-Renode A/B that proves the resulting −34.5% handoff cycle delta is in the actual instruction stream, not a simulator artifact.
Headline result (from CI run 25135305084, on branch HEAD before this PR's last two commits):
Same-ELF cross-version drift is exactly 0.0% on both paths. The −34.5% survives the model swap at the same magnitude. The delta is in the instruction stream, not in the simulator. Pooled gale max (236) < pooled baseline median (354).
What's in this PR
-DGALE_USE_SYNTH=ONruns the gale-ffi crate through wasm-opt + synth (PulseEngine's Cortex-M AOT) to producelibgale_ffi.aat the same path the rustc-direct path produces it, no consumer changes.Total Renode runs in this workflow: 5 (baseline 1.16.0, synth 1.16.0, rustc-direct 1.16.0, baseline nightly, synth nightly). `timeout-minutes` bumped 240 → 300.
Audit pass
This PR was reviewed via the Mythos slop-hunt scaffold (10 personas + 1 fresh-session validator). Findings landed in this PR:
Findings deferred to the flight-bench PR: B1-B5 (analyzer, bench-code).
Test plan
manifest.txtappears in the artifact bundle and contains all expected fields🤖 Generated with Claude Code