transport/libfabric: Fix put-signal deadlock #26

Closed
a-szegel wants to merge 3 commits into NVIDIA:devel from a-szegel:fix-alltoall-hang
@a-szegel a-szegel commented Nov 10, 2025

This change adds ~300ns of latency to the put-signal path, but is currently required for correctness.

Seth Howell root-caused the deadlock in the NVSHMEM perftest alltoall algorithm to a double mutex acquire. This patch should fix that issue. Great work on the debugging, Seth!

Only the first commit is needed to fix the hang we were seeing; the other two commits are good practice to avoid future hangs.

a-szegel and others added 3 commits November 14, 2025 21:53
It is undefined behavior for a thread to call lock() or try_lock() on a
std::mutex that it already owns. In nvshmemt_libfabric_progress(),
the gdrRecvMutex is taken, and while that mutex is owned, both
perform_gdrcopy_amo() and gdrcopy_amo_ack() call try_again(), which
(depending on status) will call progress() and try to lock the same
mutex again. The fix is to switch the mutex to a std::recursive_mutex,
which hurts performance but gives us correctness. This section of the
code will need to be redesigned before we can re-optimize it to use a
std::mutex.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
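
For readers following along, here is a minimal sketch of the re-entrancy pattern the commit message above describes. It is not the NVSHMEM code: `gdr_recv_mutex`, `progress()`, and `try_again()` are simplified, hypothetical stand-ins that only borrow the names from the message, and the "busy" condition is simulated. It assumes the second acquisition happens on the same thread that already holds the lock, which is the case a `std::recursive_mutex` permits.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins; names echo the commit message, bodies are illustrative.
static std::recursive_mutex gdr_recv_mutex;
static int max_lock_depth = 0;  // deepest nesting observed, for demonstration

int progress(int depth);

// Retries an operation. When the (simulated) provider reports "busy",
// it drives progress() again on the same thread.
int try_again(int depth, bool busy) {
    return busy ? progress(depth + 1) : depth;
}

int progress(int depth) {
    assert(depth >= 1);
    // With a plain std::mutex, this nested acquisition by the owning thread
    // would be undefined behavior. A recursive_mutex counts nested locks
    // and releases ownership when the outermost guard is destroyed.
    std::lock_guard<std::recursive_mutex> guard(gdr_recv_mutex);
    if (depth > max_lock_depth) max_lock_depth = depth;
    // Simulate exactly one "busy" retry from inside the locked region.
    return try_again(depth, /*busy=*/depth < 2);
}
```

Calling `progress(1)` here re-enters `progress()` once while the lock is held and then unwinds cleanly. The cost is the extra ownership bookkeeping inside the recursive mutex, which is consistent with the latency trade-off mentioned in the PR description.
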
This patch redesigns the put-signal completion logic to be less
susceptible to deadlocks by relying on the ThreadSafeQueue for atomics
and removing the extra mutex lock.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
Co-authored-by: Seth Howell <sethh@nvidia.com>
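
As a rough illustration of the idea in the commit message above (queue the atomic work and hold no lock while processing it), here is a sketch. `SimpleThreadSafeQueue`, `PendingAtomic`, and `drain()` are hypothetical names invented for this example; the real ThreadSafeQueue in the transport may have a different interface, so treat this only as a sketch of the pattern.

```cpp
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical queue: the lock is held only for the push or pop itself.
template <typename T>
class SimpleThreadSafeQueue {
public:
    void push(T item) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push_back(std::move(item));
    }

    // Returns false when the queue is empty; otherwise moves the front into out.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<T> items_;
};

// Hypothetical descriptor for a received atomic operation.
struct PendingAtomic {
    int id;
};

// Drains the queue. Each entry is handled with no lock held, so the handler
// may push more work (or trigger further progress) without self-deadlock.
template <typename Handler>
int drain(SimpleThreadSafeQueue<PendingAtomic>& queue, Handler&& handle) {
    int handled = 0;
    PendingAtomic op{};
    while (queue.try_pop(op)) {
        handle(op);
        ++handled;
    }
    return handled;
}
```

Because the handler runs outside the queue's critical section, a plain `std::mutex` is sufficient inside the queue even when handling one entry enqueues another.
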
To avoid deadlocks, posting an RX operation should never be held up
waiting for the successful submission (try_again) of a TX operation.
This commit saves the required data from the operation to be posted in
registers and then posts the RX operation before posting the fetch
atomic ack and the generic atomic 0-byte write with imm ack.

Signed-off-by: Seth Zegelstein <szegel@amazon.com>
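
The ordering described in the commit message above can be sketched as follows. Everything here is hypothetical: `FakeEndpoint`, `RecvBuffer`, and `handle_atomic()` are invented for illustration and do not use the libfabric API. The sketch assumes the ack only needs a few fields from the receive buffer, so they can be copied out before the buffer is reposted.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical receive buffer contents needed to build the ack.
struct RecvBuffer {
    uint64_t remote_addr;
    uint32_t src_pe;
};

// Hypothetical endpoint that records the order of operations.
struct FakeEndpoint {
    std::vector<std::string> log;
    int tx_busy_retries;  // number of times TX submission reports "busy"

    // Reposts the receive buffer; its previous contents are no longer valid.
    void post_recv(RecvBuffer& buf) {
        buf = RecvBuffer{};
        log.push_back("rx");
    }

    // Attempts to submit the ack; returns false while the TX queue is busy.
    bool try_post_ack(uint32_t dest_pe) {
        if (tx_busy_retries > 0) {
            --tx_busy_retries;
            log.push_back("tx-busy");
            return false;
        }
        log.push_back("ack->" + std::to_string(dest_pe));
        return true;
    }
};

uint32_t handle_atomic(FakeEndpoint& ep, RecvBuffer& buf) {
    // 1. Save what the ack needs in a local before giving the buffer back.
    const uint32_t src_pe = buf.src_pe;
    // 2. Repost the RX buffer first, so receives never wait on TX progress.
    ep.post_recv(buf);
    // 3. Only then retry the TX ack until it is accepted.
    while (!ep.try_post_ack(src_pe)) {
    }
    return src_pe;
}
```

With this ordering, a busy TX queue delays only the ack; the receive side is already reposted and can keep accepting incoming operations while the ack is retried.
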
@seth-howell
Collaborator

This has been merged into devel

@seth-howell seth-howell closed this Dec 9, 2025