84 changes: 84 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,90 @@ All notable changes to LOOM will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.6.0] - 2026-05-11

This release is driven entirely by real-world findings on production
gale code (Verus-verified kernel-scheduler FFI wasm): closing v0.5.0's
+6.3% CSE size regression, lifting two early-exit guards that made
LOOM a no-op on kernel-style code, and fixing a Z3 panic that blocked
the inline pass on i64-heavy modules. Net effect on the gale_ffi
fixture: code section -0.86% (was +6.3% in v0.5.0). Net effect on
a 2.3 MB calculator.wasm component: -0.4% from the new dead-store
pass alone.

### Optimizer correctness (Z3 / inline)

- **Closed `inline_functions` Z3 SortDiffers panic on i64-heavy wasm**
(PR-D, closes #98). The verifier's symbolic-locals initialization
defaulted to 32-bit width regardless of declared type at three
sites; the gale-ffi crate (u64-packed FFI returns) crashed every
inline attempt with `SortDiffers { left: BitVec(64), right: BitVec(32) }`.
Fix: new helpers `local_type_at` + `bv_width_for_value_type`
resolve param/local types correctly at each extension site.
Defensive `match_bv_widths` zero-extend helper added as a backstop
for future binop sites.
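
As an illustration of the fix, the sketch below resolves a local's declared type across the combined param/local index space and maps it to a bit-vector width. The helper names come from this entry; the `ValueType` enum, the signatures, and the choice to model floats by their bit width are assumptions for illustration, not LOOM's actual code.

```rust
/// Toy stand-in for a wasm value type (hypothetical; LOOM's real type may differ).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ValueType {
    I32,
    I64,
    F32,
    F64,
}

/// Bit-vector width used to model a value of the given type.
/// Assumption: floats are modeled by their raw bit width.
pub fn bv_width_for_value_type(ty: ValueType) -> u32 {
    match ty {
        ValueType::I32 | ValueType::F32 => 32,
        ValueType::I64 | ValueType::F64 => 64,
    }
}

/// Resolve the declared type of local `idx`. In wasm, parameters occupy the
/// first indices of the local index space, followed by the declared locals.
pub fn local_type_at(
    params: &[ValueType],
    locals: &[ValueType],
    idx: usize,
) -> Option<ValueType> {
    if idx < params.len() {
        Some(params[idx])
    } else {
        locals.get(idx - params.len()).copied()
    }
}
```

With a width derived from the declared type, a 64-bit local is never paired with a 32-bit symbolic value, which is the mismatch the `SortDiffers` panic reported.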

### Optimizer code size on real workloads

- **CSE cost gate eliminates the gale +6.3% regression** (PR-A).
v0.5.0's enhanced CSE deduplicated every duplicate expression
including 1-2 byte constants. Replacing `i32.const -EINVAL`
(2 bytes) with `local.tee N / local.get N` (4 bytes) plus a new
local declaration was unconditionally a size regression.
New `Expr::worth_dedup(occurrences)` predicate estimates net byte
savings via the formula `net = (N-1)·(cost-2) - 4` and skips when
non-positive. Gale code section: 862 → 808 bytes.

- **`eliminate_dead_locals` pass** (PR-B): drops locals declared by
a function but never read by any `LocalGet` anywhere in the
function body. Targets the gale "default-then-override" pattern
(rustc materializes an EINVAL default that every reachable path
overwrites). The rule is path-INSENSITIVE — sound regardless of
BrIf/BrTable/early-Return control flow — so the pass DOES NOT
need the `has_dataflow_unsafe_control_flow` guard that previously
made `simplify_locals` and `coalesce_locals` no-ops on every
kernel-style function. Gale code section: 808 → 804 bytes.
Asymmetric write neutralization: `LocalSet → Drop` preserves
`[T] → []`; `LocalTee → removed` lets `[T] → [T]` pass through.

- **`eliminate_dead_stores` pass with full backward liveness** (PR-C):
per-position dead-store elimination via backward liveness walk
over the structured wasm instruction tree. Handles Block/If
precisely (`live-before-if = live-in-then ∪ live-in-else`), Br/
BrIf/BrTable via label-stack indexing, Return/Unreachable as
no-continuation. Loop bodies use a conservative approximation
(everything read anywhere in the body is live throughout) — sound
but imprecise inside loops; loop fixpoint precision is a follow-up.
Net effect on calculator.wasm: -0.4% from this pass alone (~10 KB
on a 2.3 MB component).
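
The cost gate from the first bullet can be sketched as a pure function over the formula quoted there. The free-function signatures are hypothetical (the shipped predicate is `Expr::worth_dedup(occurrences)`), and the reading of the constants is an assumption: each later occurrence becomes a 2-byte `local.get`, and the fixed 4 bytes cover the `local.tee` plus the new local declaration.

```rust
/// Net byte savings from deduplicating an expression that occurs
/// `occurrences` times and costs `cost_bytes` to encode each time.
/// Formula from the changelog entry: net = (N-1)*(cost-2) - 4.
pub fn net_savings(occurrences: i64, cost_bytes: i64) -> i64 {
    (occurrences - 1) * (cost_bytes - 2) - 4
}

/// Deduplicate only when the estimate is strictly positive.
pub fn worth_dedup(occurrences: i64, cost_bytes: i64) -> bool {
    net_savings(occurrences, cost_bytes) > 0
}
```

For a 2-byte constant the first term is always zero, so the estimate is -4 no matter how often the constant repeats; that is why the gate always skips the `i32.const -EINVAL` case.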

### Cleanup follow-ups

- **`vacuum` const+drop peephole** (PR-D). PR-B/PR-C neutralize
dead `LocalSet idx` to `Drop`, leaving the value-pusher
immediately followed by Drop. New `peephole_const_drop` folds
`pure_push;Drop` pairs (constants, LocalGet, GlobalGet), recursing
into Block/Loop/If bodies. NOT folded: memory loads, calls,
anything that can trap — discarding the result does not discard
the trap.
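
A minimal sketch of how the neutralization and the peephole compose, over a toy instruction enum. The `Instr` type, the function signatures, and the `cleanup` driver are hypothetical; removing the local declarations themselves and renumbering the survivors are omitted.

```rust
use std::collections::HashSet;

/// Toy instruction set (hypothetical; not LOOM's real IR).
#[derive(Clone, Debug, PartialEq)]
pub enum Instr {
    I32Const(i32),
    LocalGet(u32),
    LocalSet(u32),
    LocalTee(u32),
    GlobalGet(u32),
    I32Load, // pops an address and may trap: never folded
    Drop,
    Block(Vec<Instr>),
}

/// Collect every local read by a `LocalGet` anywhere in the body.
fn read_locals(body: &[Instr], read: &mut HashSet<u32>) {
    for instr in body {
        match instr {
            Instr::LocalGet(idx) => {
                read.insert(*idx);
            }
            Instr::Block(inner) => read_locals(inner, read),
            _ => {}
        }
    }
}

/// Neutralize writes to never-read locals: `LocalSet` becomes `Drop`
/// ([T] -> [] preserved), `LocalTee` is removed ([T] -> [T] passes through).
fn neutralize_dead_writes(body: Vec<Instr>, read: &HashSet<u32>) -> Vec<Instr> {
    body.into_iter()
        .filter_map(|instr| match instr {
            Instr::LocalSet(idx) if !read.contains(&idx) => Some(Instr::Drop),
            Instr::LocalTee(idx) if !read.contains(&idx) => None,
            Instr::Block(inner) => Some(Instr::Block(neutralize_dead_writes(inner, read))),
            other => Some(other),
        })
        .collect()
}

/// Pure pushes: no operands, one result, cannot trap.
fn is_pure_push(instr: &Instr) -> bool {
    matches!(instr, Instr::I32Const(_) | Instr::LocalGet(_) | Instr::GlobalGet(_))
}

/// Fold `pure_push; Drop` pairs, recursing into nested blocks.
pub fn peephole_const_drop(body: Vec<Instr>) -> Vec<Instr> {
    let mut out: Vec<Instr> = Vec::with_capacity(body.len());
    for instr in body {
        let instr = match instr {
            Instr::Block(inner) => Instr::Block(peephole_const_drop(inner)),
            other => other,
        };
        if instr == Instr::Drop && out.last().map_or(false, is_pure_push) {
            out.pop(); // the pair is stack-neutral, so both instructions go
        } else {
            out.push(instr);
        }
    }
    out
}

/// Hypothetical driver: neutralize dead writes, then clean up the residue.
pub fn cleanup(body: Vec<Instr>) -> Vec<Instr> {
    let mut read = HashSet::new();
    read_locals(&body, &mut read);
    peephole_const_drop(neutralize_dead_writes(body, &read))
}
```

In this sketch a dead `i32.const; local.set` pair disappears entirely, while `i32.load; local.set` keeps the load and a `Drop`, so the potential trap is preserved.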

### Research outputs

- `docs/research/gale-v0.5.0/source-pattern-analysis.md` — eight
optimization-relevant patterns found in gale source with
file:line citations (FSM dispatch, default-then-override, Verus
`decreases` bounded loops, tail-call dispatch, leaf-inline +
const-prop candidates, bit-mask axioms, 2D `match (state, event)`,
Verus annotations as trusted axioms).
- `docs/research/gale-v0.5.0/wasm-opt-gap-analysis.md` — ranked top
7 wasm-opt passes by expected payoff on kernel-style code, with
per-pick LOC and complexity estimates. Picks #1, #2 (narrowed),
and #3 are shipped in this release.

### Test count

303 → 317 tests passing (+14 across all v0.6.0 PRs).

## [0.5.0] - 2026-05-02

This release closes a real soundness bug discovered on production
48 changes: 48 additions & 0 deletions docs/research/mutation-perf-inference-eval.md
@@ -0,0 +1,48 @@
# WarpL (arXiv 2604.13693) — Evaluation for PulseEngine CI

**Verdict: Adopt later.** WarpL solves a problem PulseEngine does not yet have — diagnosing the *root cause* of an already-observed runtime perf regression *inside the JIT* — not detecting that a regression occurred. As a CI signal it is the wrong shape: it costs ~3 hours per case after you already know there is a slowdown, and the bug-class it isolates (Wasmtime/Cranelift instruction-lowering pathologies) is upstream of loom/meld output. Defer until we have (a) a stable wasmtime-based perf baseline in CI and (b) a confirmed regression class plausibly attributable to a loom/meld output pattern.

## 1. Paper summary

WarpL (Zeng et al., ICSE '26, arXiv:2604.13693v1, 15 Apr 2026) is a **mutation-based root-cause localizer** for Wasm-runtime performance bugs, not a regression detector. Given a bug-inducing Wasm module already known to be slow on runtime R_buggy:

- **Mutation**. Fine-grained, type-aware, single-instruction edits (Table 1): operand-instruction substitution (`t.const`/`local.get`/`global.get`/`t.load` interchange within the same type), operator-instruction substitution (replace `t.op` with another op of the same `optype`), operator-instruction deletion (delete op + its operand-producers, replace stack effect with a const). Control-flow instructions are explicitly excluded. ~500 mutants generated per case, deduped to ~350.
- **Functionally-similar selection**. Each mutant is run on R_buggy and on an **oracle runtime** R_oracle (a second Wasm engine known not to exhibit the bug). Score = `α·perfDiffScore + β·funcSimScore` where `perfDiffScore` rewards a large execution-time gap on R_buggy and `funcSimScore` rewards near-identical execution time on R_oracle (Algorithm 1). Mutants whose behavior is preserved on R_oracle but whose slowdown disappears on R_buggy are the candidates.
- **Slow-code isolation**. Dump JIT machine code from R_buggy for original vs selected mutant. Diff via Longest Common Subsequence at the x86-64/aarch64 opcode level. The differing instruction window is the "slow code" report.
- **Perf signature**. Single scalar: wall-clock execution time on each runtime. Plus a machine-instruction count from the JIT dump. No syscall histograms, no perf-event traces, no cycle counters.
- **Eval**. 12 issues across Wasmtime / Wasmer / WasmEdge; localized 10/12, found 6 previously-unknown Wasmtime bugs. **Wall time: <3 h per case for WarpL, plus ~11 h per case for `wasm-reduce` pre-shrinking.** Tool is ~400 LOC C++ (Binaryen LibTooling) + ~500 LOC Python/shell. Open source: https://github.com/BZTesting/WarpL.
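
The slow-code isolation step can be illustrated with a textbook Longest Common Subsequence diff over opcode sequences. This is a sketch of the general technique, not WarpL's implementation; the function name and the opcode strings in the usage note are made up for illustration.

```rust
/// Diff two opcode sequences via Longest Common Subsequence.
/// Returns the indices in `a` and in `b` that are not part of the LCS,
/// i.e. the instructions that differ between the two listings.
pub fn lcs_diff<T: PartialEq>(a: &[T], b: &[T]) -> (Vec<usize>, Vec<usize>) {
    let (n, m) = (a.len(), b.len());
    // dp[i][j] = LCS length of a[i..] and b[j..].
    let mut dp = vec![vec![0usize; m + 1]; n + 1];
    for i in (0..n).rev() {
        for j in (0..m).rev() {
            dp[i][j] = if a[i] == b[j] {
                dp[i + 1][j + 1] + 1
            } else {
                dp[i + 1][j].max(dp[i][j + 1])
            };
        }
    }
    // Walk forward, keeping matches and recording everything else.
    let (mut i, mut j) = (0, 0);
    let (mut only_a, mut only_b) = (Vec::new(), Vec::new());
    while i < n && j < m {
        if a[i] == b[j] {
            i += 1;
            j += 1;
        } else if dp[i + 1][j] >= dp[i][j + 1] {
            only_a.push(i);
            i += 1;
        } else {
            only_b.push(j);
            j += 1;
        }
    }
    only_a.extend(i..n);
    only_b.extend(j..m);
    (only_a, only_b)
}
```

Given an original listing `mov, add, imul, imul, ret` and a mutant listing `mov, add, shl, ret`, the function reports indices 2 and 3 of the original and index 2 of the mutant as the differing window.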

**Key assumptions / limitations** (Section 6): (1) you already know which module is slow — WarpL needs a labeled bug-inducing input; (2) you need a second Wasm runtime that does *not* trigger the issue, to act as oracle; (3) the bug must live in JIT lowering or IR optimization, not in host I/O or libc (the two failures #7973 and #7745 were both I/O-bound, outside JIT scope); (4) `wasm-reduce` dominates wall time and is the bottleneck the authors flag for future work.

## 2. PulseEngine corpus survey

Confirmed in `/Users/r/git/pulseengine/loom/tests/corpus/`:

| Fixture | Path | Role |
|---|---|---|
| httparse | `tests/corpus/httparse.wasm` | HTTP/1.x header parser (witness real-app fixture) |
| nom_numbers | `tests/corpus/nom_numbers.wasm` | nom parser-combinator numeric primitives |
| state_machine | `tests/corpus/state_machine.wasm` | finite-state-machine kernel (kiln test) |
| json_lite | `tests/corpus/json_lite.wasm` | minimal JSON tokenizer |
| loom (self) | `tests/corpus/loom.wasm` | LOOM compiled to Wasm — dogfood target |

Sibling projects `/Users/r/git/pulseengine/witness` and `/Users/r/git/pulseengine/meld` exist on disk but were outside the read-permission scope of this session; the loom-side mirror at `tests/corpus/` already holds the witness-derived fixtures the user named, so the five above are representative without cross-repo access. `loom-testing/Cargo.toml` already depends on `wasmtime = "17.0"` — the runtime WarpL primarily targets.

## 3. Feasibility

- **On `loom optimize` output**: Yes, mechanically. WarpL is a black-box harness over a Wasm binary; loom's optimized output is a valid module. But WarpL's value is conditional on a *known* slowdown on a runtime. Loom currently validates via diff vs wasm-opt (size, semantics), not runtime wall-clock — so WarpL has no trigger.
- **On `meld fuse` output**: Same answer, with extra friction. Component-Model fused outputs would need `wasm-reduce` adapted for components (today it operates on core modules) — non-trivial.
- **Runtime to capture signatures**: WarpL needs **two** runtimes (buggy + oracle). Wasmtime is already a dep; adding WasmEdge or Wasmer to CI is feasible (both have static binaries) but doubles container size. The wall-clock baseline must also be stable enough that a 1.5×–8× gap (the range WarpL reported) is detectable above noise, which is hard on shared GitHub runners.
- **Integration cost**: WarpL itself is ~900 LOC upstream, plus a Rust harness wrapping it (~300 LOC) and a perf-baseline DB (artifact-store JSON) for the trigger. Per-case CI cost is **~14 h** (~3 h for WarpL plus ~11 h for `wasm-reduce`), which is incompatible with PR-blocking CI and only viable as a nightly/manual job on a self-hosted runner.

## 4. Recommendation

**Adopt later.** WarpL is a high-quality localizer but the wrong end of the pipeline for CI: it presumes regression detection has already happened, takes hours per case, and targets bugs in the Wasm runtime's JIT — bugs we cannot fix from loom/meld even if WarpL finds them. The right CI signal for "loom emitted a Wasm whose perf changed" is a cheap wall-clock benchmark on the five fixtures above with a noise-aware threshold; WarpL becomes useful only *after* such a benchmark fires and we want to know whether the cause is in loom's transformation or in Wasmtime's lowering of it.

## 5. Implementation sketch (when triggered)

1. Add a nightly `cargo bench` harness in `loom-testing/` that runs each of the 5 fixtures through `loom optimize` then through wasmtime, recording wall-clock time and instruction counts against a committed baseline JSON. Note that `wasmtime compile --emit-clif` dumps Cranelift IR rather than machine code, so counts taken from it are a proxy for machine-instruction counts.
2. Define a regression trigger: ≥1.3× wall-clock slowdown vs baseline, reproducible across 3 runs on a self-hosted runner — only then enqueue WarpL.
3. Vendor WarpL (BZTesting/WarpL) as a submodule under `tools/warpl/`; build it in a separate Docker image with WasmEdge as oracle runtime.
4. Write a Rust wrapper (`loom-testing/src/bin/warpl_localize.rs`) that takes the regressed fixture, runs `wasm-reduce` with a wall-clock-preserving predicate, then invokes WarpL and posts the slow-code report as a GitHub issue with the JIT-diff snippet.
5. Decide per-issue whether the root cause is a loom transformation (fix in loom) or a Wasmtime lowering bug (file upstream) — WarpL's report distinguishes these because the differing mutant is at Wasm-instruction granularity.
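
The trigger in step 2 could be sketched as below. The function name and signature are hypothetical, and reading "reproducible across 3 runs" as "at least three runs, every one at or above the threshold" is an assumption.

```rust
/// Hypothetical regression trigger: enqueue WarpL only when at least
/// `REQUIRED_RUNS` measurements exist and every one of them is at least
/// `THRESHOLD` times slower than the committed baseline.
pub fn should_enqueue_warpl(baseline_ms: f64, runs_ms: &[f64]) -> bool {
    const THRESHOLD: f64 = 1.3;
    const REQUIRED_RUNS: usize = 3;
    baseline_ms > 0.0
        && runs_ms.len() >= REQUIRED_RUNS
        && runs_ms.iter().all(|&t| t / baseline_ms >= THRESHOLD)
}
```

Requiring every run to exceed the threshold trades sensitivity for fewer false alarms, which matters when each enqueued case costs hours of `wasm-reduce` time.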