hsakmt: Fix shmem leak in reserved_aperture_release#7621

Open
williampalacek wants to merge 4 commits into develop from users/wpalacek/ROCM-23563-fix

@williampalacek

Add munmap() before mmap() to release kernel shmem refcount for MAP_SHARED allocations. Without this,
reserved_aperture_release() leaves shmem pages pinned, causing OOM during repeated allocations.

Fixes: ROCM-23563

Motivation

KFDMemoryTest.BigSysBufferStressTest fails with OOM kills on MI210. Shmem grows unbounded across allocation iterations
because MAP_SHARED memory is not properly released.

Technical Details

reserved_aperture_release() in projects/rocr-runtime/libhsakmt/src/fmm.c remaps freed memory as PROT_NONE using
mmap(..., MAP_FIXED). For MAP_SHARED allocations, this does not decrement the kernel shmem refcount; only munmap() does.
This PR adds munmap() before the mmap() call and simplifies the ENOMEM retry path, since the memory is already unmapped by the time mmap() runs.
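
The unmap-then-reserve sequence described above can be sketched as follows. This is a minimal illustration, not the code in fmm.c: the function name release_and_rereserve, the exact mmap() flags, and the stderr logging are assumptions made for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for MAP_ANONYMOUS / MAP_NORESERVE on strict toolchains */
#endif
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not the fmm.c implementation): drop the existing
 * mapping, then re-reserve the same VA range as inaccessible anonymous
 * memory. Returns 0 on success, -1 if either step fails.
 */
int release_and_rereserve(void *address, size_t size)
{
    void *p;

    /* Unmap explicitly so a MAP_SHARED backing object loses this reference. */
    if (munmap(address, size) != 0) {
        fprintf(stderr, "munmap(%p, %zu) failed: %s\n",
                address, size, strerror(errno));
        return -1;
    }

    /* Re-reserve the range as PROT_NONE; the flag set is an assumption. */
    p = mmap(address, size, PROT_NONE,
             MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED) {
        /* One warning and no retry: the old mapping is already gone. */
        fprintf(stderr, "warning: could not re-reserve %p (%zu bytes): %s\n",
                address, size, strerror(errno));
        return -1;
    }
    return 0;
}
```

A caller would pass the base address and length of the mapping being freed. In this sketch, a failed re-reservation is reported once and left to the caller, which mirrors the simplified ENOMEM handling described in this PR.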

JIRA ID

Resolves ROCM-23563

Test Plan

Run kfdtest --gtest_filter='*BigSysBufferStressTest*' on MI210 and monitor shmem via /proc/meminfo.
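
One way to sample shmem usage between test iterations is to read the Shmem line from /proc/meminfo. The helper below is a rough sketch for that purpose; the names parse_shmem_line and read_shmem_kib are hypothetical and are not part of kfdtest or libhsakmt.

```c
#include <stdio.h>

/*
 * Parse one /proc/meminfo line. Returns 1 and stores the value (in KiB)
 * if the line is the "Shmem:" entry, otherwise returns 0. The literal
 * "Shmem:" in the format keeps "ShmemHugePages:" and similar from matching.
 */
static int parse_shmem_line(const char *line, long *kib_out)
{
    return sscanf(line, "Shmem: %ld kB", kib_out) == 1;
}

/*
 * Read the Shmem value in KiB from a meminfo-style file.
 * Returns -1 if the file cannot be opened or has no Shmem line.
 */
long read_shmem_kib(const char *path)
{
    char line[256];
    long kib = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (parse_shmem_line(line, &kib))
            break;
    }
    fclose(f);
    return kib;
}
```

Calling read_shmem_kib("/proc/meminfo") before and after each iteration and logging the difference would show whether shmem returns to its baseline once the buffers are freed.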

Test Result

| Metric | Before Fix | After Fix |
|---|---|---|
| Iteration 1 | 1953 × 128MB | 1953 × 128MB |
| Iteration 2 | OOM killed | 1953 × 128MB |
| Iteration 3 | - | 1953 × 128MB |
| Iteration 4 | - | 1953 × 128MB |
| Shmem after | 246 GiB (leaked) | 20 MiB |
| Result | FAILED | PASSED (899s) |

Submission Checklist

Add munmap() before mmap() to release kernel shmem refcount for
MAP_SHARED allocations. Without this, reserved_aperture_release()
leaves shmem pages pinned, causing OOM during repeated allocations.

Fixes: ROCM-23563
Signed-off-by: William Palacek <William.Palacek@amd.com>

Copilot AI left a comment


Pull request overview

This PR fixes a shmem leak in reserved_aperture_release() within libhsakmt by explicitly munmap()-ing CPU mappings before re-reserving the VA range with a PROT_NONE anonymous mmap(MAP_FIXED). This ensures MAP_SHARED-backed shmem refcounts are decremented, preventing unbounded shmem growth and OOM in stress tests.

Changes:

  • Add munmap(address, size) before the PROT_NONE remap to correctly drop shmem references for MAP_SHARED allocations.
  • Simplify the ENOMEM retry logic in the remap path.

Comment thread projects/rocr-runtime/libhsakmt/src/fmm.c Outdated
Comment thread projects/rocr-runtime/libhsakmt/src/fmm.c Outdated
williampalacek and others added 2 commits June 22, 2026 13:56
The retry logic was dead weight: retrying the identical mmap() call
immediately after failure won't change the outcome. Simplified to a
single warning log if the VA range can't be re-reserved.

Signed-off-by: William Palacek <William.Palacek@amd.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>