transport/libfabric: Fix put-signal deadlock #26
Closed
a-szegel wants to merge 3 commits into
It is undefined behavior to call lock() or try_lock() on a std::mutex that the calling thread already owns. nvshmemt_libfabric_progress() takes gdrRecvMutex, and while that mutex is held, both perform_gdrcopy_amo() and gdrcopy_amo_ack() call try_again(), which (depending on status) calls progress() and tries to lock the same mutex again. The fix is to switch the mutex to a std::recursive_mutex, which hurts performance but gives us correctness. This section of the code will need to be redesigned before we can go back to a std::mutex and re-optimize. Signed-off-by: Seth Zegelstein <szegel@amazon.com>
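For reviewers who want the re-entrancy pattern in isolation, here is a minimal sketch of it. It is not the transport code: `recv_mutex`, `progress()`, `try_again()` and the `retries_left` parameter are hypothetical stand-ins, and the "try again" status is simulated with a counter. It only shows that a progress function which can re-enter itself through a retry helper is well defined with std::recursive_mutex.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins; not the transport's real functions or signatures.
// With a plain std::mutex, the nested lock taken below would be undefined behavior.
static std::recursive_mutex recv_mutex;
static int progress_calls = 0;

int progress(int retries_left);

// Retry helper: while the simulated status is "try again", drive progress.
int try_again(int retries_left) {
    return retries_left > 0 ? progress(retries_left - 1) : 0;
}

// Progress function: holds the lock while processing, and processing may retry,
// which re-enters progress() on the same thread.
int progress(int retries_left) {
    assert(retries_left >= 0);
    std::lock_guard<std::recursive_mutex> guard(recv_mutex);
    ++progress_calls;
    return try_again(retries_left);
}
```

Calling `progress(2)` locks the mutex three times on the same thread and unwinds cleanly. The extra ownership bookkeeping that a recursive mutex performs is the kind of overhead this commit trades for correctness.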
This patch redesigns the put-signal completion logic to be less susceptible to deadlocks: atomics now go through the ThreadSafeQueue, and the extra mutex lock is removed. Signed-off-by: Seth Zegelstein <szegel@amazon.com> Co-authored-by: Seth Howell <sethh@nvidia.com>
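As a rough illustration of why a queue helps here, the sketch below uses a hypothetical `SketchQueue` (it is not the project's ThreadSafeQueue, whose API is not shown in this PR). The queue's internal lock is held only for the push or pop itself, so the handler runs with no lock held and can safely re-enter the queue.

```cpp
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical minimal queue; the real ThreadSafeQueue may look different.
template <typename T>
class SketchQueue {
public:
    void push(T item) {
        std::lock_guard<std::mutex> guard(mutex_);
        items_.push_back(std::move(item));
    }

    // Returns false when the queue is empty, otherwise moves the front into out.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<T> items_;
};

// Drains the queue. The lock is released before handle() runs, so the handler
// may push more work (or trigger progress) without a double acquire.
template <typename T, typename Handler>
int drain(SketchQueue<T>& queue, Handler handle) {
    int handled = 0;
    T item;
    while (queue.try_pop(item)) {
        handle(item);
        ++handled;
    }
    return handled;
}
```

The design point is that no caller-visible lock spans the handler, so there is nothing for a nested progress call to re-acquire.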
To avoid deadlocks, posting an RX operation should never wait on the successful submission (try_again) of a TX operation. This commit saves the data needed from the operation being reposted in registers, then posts the RX operation before posting the fetch atomic ack and the generic atomic ack (a 0-byte write with immediate data). Signed-off-by: Seth Zegelstein <szegel@amazon.com>
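The ordering described in this commit can be sketched as follows. Everything here is hypothetical (`RxBuffer`, `post_recv()`, `send_ack()` and the event log are illustration only, not libfabric or transport calls); the point is just the order: copy what the ack needs into locals, repost the receive buffer, then submit the ack.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical receive buffer carrying the fields the ack needs.
struct RxBuffer {
    uint64_t src_pe;
    uint64_t imm_data;
    bool posted;
};

// Records the order of operations so it can be inspected.
static std::vector<std::string> event_log;

// Stand-in for reposting the receive buffer to the endpoint.
void post_recv(RxBuffer& buf) {
    buf.posted = true;
    event_log.push_back("post_recv");
}

// Stand-in for the ack; in a real transport this could need retries.
void send_ack(uint64_t dest_pe, uint64_t imm) {
    event_log.push_back("send_ack pe=" + std::to_string(dest_pe) +
                        " imm=" + std::to_string(imm));
}

void handle_atomic(RxBuffer& buf) {
    assert(!buf.posted);  // the buffer was consumed by the incoming operation

    // Save what the ack needs before the buffer is handed back.
    const uint64_t dest_pe = buf.src_pe;
    const uint64_t imm = buf.imm_data;

    post_recv(buf);          // RX first: never gated on the TX below
    send_ack(dest_pe, imm);  // TX second: may retry without starving RX
}
```

If the order were reversed and the ack had to retry, the receive buffer would stay unposted for the duration, which is the situation this commit is meant to avoid.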
This has been merged into devel.
This change adds ~300ns of latency to the put-signal path, but is currently required for correctness.
Seth Howell was able to root-cause the deadlock in the NVSHMEM alltoall perftest to a double mutex acquire. This patch should fix that issue. Great work on the debug, Seth!
Only the first commit is needed to solve the hang we were seeing; the other two commits are good practice to avoid future hangs.