Use sorted vectors for DWARF address maps — reduce memory overhead #336
lewing wants to merge 3 commits into dotnet/main
Conversation
Pull request overview
This PR reduces peak memory usage in DWARF debug processing by replacing address-to-expression/function std::unordered_map lookups with a compact sorted-vector map (SortedMap) that is built once and then queried read-only.
Changes:
- Introduces SortedMap<K, V> (sorted std::vector + binary search) for address lookups.
- Replaces AddrExprMap and FuncAddrMap internal std::unordered_map storage with SortedMap.
- Adds up-front reservation and a post-build sort() step for the new maps.
Replace std::unordered_map with SortedMap (sorted vector + binary search) for AddrExprMap and FuncAddrMap. These maps are built once during construction and only used for read-only lookups, making them ideal for this pattern.

- std::unordered_map: ~64 bytes/entry (hash buckets, linked list nodes)
- SortedMap (sorted vector): ~16 bytes/entry (contiguous, cache-friendly)

SortedMap tracks a finalized flag to assert lookups aren't performed before sort(). Duplicate keys are de-duplicated after sorting (keeps first) to handle cases like FuncAddrMap where start == declarations. DelimiterLocations reservation now counts actual non-zero entries rather than just the number of delimiter location arrays.

Output is byte-for-byte identical, confirming functional correctness.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 3b37476 to 11d339b.
lewing left a comment:
Addressed all three review comments:
- delimCount under-reservation — now iterates each DelimiterLocations entry and counts actual non-zero offsets instead of just using delimiterLocations.size().
- Duplicate keys after sort() — sort() now de-duplicates adjacent equal keys using std::unique (keeps first). This handles legitimate duplicates in FuncAddrMap where funcLocation.start == funcLocation.declarations for some functions.
Pull request overview
This PR reduces memory usage when processing wasm binaries with DWARF debug info by replacing per-address std::unordered_map lookups with a sorted std::vector-backed map and binary search, targeting OOM issues in large debug symbol workloads.
Changes:
- Introduce a SortedMap<K, V> helper (vector + sort + binary search) for read-only address lookups.
- Replace AddrExprMap and FuncAddrMap internal std::unordered_map structures with SortedMap, including pre-reservation and finalization (sort()).
- Update lookup code paths to use the new find() API and pointer-based returns.
- Remove SortedMap::count() — no callers remain after removing pre-sort assertions, and it could be misused during the build phase.
- Document that duplicate keys (e.g. FuncAddrMap start == declarations) always map to the same value, so de-dup order is irrelevant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR reduces peak memory usage when processing DWARF debug info by replacing address→node/function hash maps with a sorted-vector-based lookup structure in wasm-debug.cpp, leveraging the fact that these maps are constructed once and then queried read-only during DWARF rewriting.
Changes:
- Introduce a SortedMap (sorted std::vector + binary search) for address-keyed lookups.
- Replace std::unordered_map usages in AddrExprMap and FuncAddrMap with SortedMap.
- Pre-reserve vector capacity and finalize maps via sorting/deduplication after construction.
Add debug-time validation in SortedMap::sort() that duplicate keys map to the same value (assertUniqueValues=true by default). FuncAddrMap passes assertUniqueValues=false because contiguous functions legitimately share boundary addresses (func1.end == func2.start), matching the old unordered_map overwrite behavior. AddrExprMap uses the default (true) to catch debug info issues early. Also adds operator== to DelimiterInfo for the assertion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
lewing left a comment:
Both remaining comments addressed in 68b89a6:
- sort() duplicate validation (line 392): sort() now takes an assertUniqueValues param (default true). In debug builds, it asserts that duplicate keys map to the same value before de-duplicating. AddrExprMap uses the default strict check; FuncAddrMap passes false since contiguous functions legitimately share boundary addresses (func1.end == func2.start).
- AddrExprMap uniqueness (line 478): the strict assertion (assertUniqueValues=true) now validates during sort() that startMap and endMap entries have unique keys (or identical values for any duplicates), preserving the same invariant as the old pre-sort assertions.
I am curious, do you have some measurements for this? What is a typical size of the data — how many entries?
Problem
wasm-opt uses excessive memory when processing wasm files with DWARF debug symbols, causing OOM on .NET CI Helix agents (dotnet/runtime#125244, dotnet/runtime#125233). The AddrExprMap and FuncAddrMap structures in wasm-debug.cpp use std::unordered_map with ~64 bytes per entry overhead.

Fix

Replace std::unordered_map with a SortedMap (sorted std::vector + binary search). These maps are built once during construction and only used for read-only lookups, making them ideal for this pattern.

Why it works

- std::unordered_map: ~64 bytes/entry (hash buckets, linked list nodes, pointer chasing)
- SortedMap (sorted vector): ~16 bytes/entry (contiguous memory, cache-friendly)

Output is byte-for-byte identical, confirming functional correctness.
This is the first (and simplest) change from #335, split out for easier review.