Improve perf of accessing dev.compute_capability#459
Conversation
|
/ok to test |
This comment has been minimized.
This comment has been minimized.
|
|
I looked here while working on the release notes. (I looked at the code before while working on #439 a couple days ago, but didn't realize then that the code is so new.) From the PR description:
I'm almost certain that there can be a race now: https://chatgpt.com/share/67d4bc2f-6c7c-8008-9d86-425fe77a3ed9 |
|
I don't get it. This is thread local storage, not normal Python object, why do we need any lock? |
|
I can try a fwd fix later tonight. Should be relatively easy.
Multiple threads can get to the for loop.
Get Outlook for iOS<https://aka.ms/o0ukef>
…________________________________
From: Leo Fang ***@***.***>
Sent: Friday, March 14, 2025 5:43:18 PM
To: NVIDIA/cuda-python ***@***.***>
Cc: Ralf Grosse Kunstleve ***@***.***>; Mention ***@***.***>
Subject: Re: [NVIDIA/cuda-python] Improve perf of accessing `dev.compute_capability` (PR #459)
I don't get it. This is thread local storage, not normal Python object, why do we need any lock?
—
Reply to this email directly, view it on GitHub<#459 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAFUZAEZWURC3OQY4UDWR5D2UNZSNAVCNFSM6AAAAABXT6YSCGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDOMRWGA3DOMBQGU>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
[leofang]leofang left a comment (NVIDIA/cuda-python#459)<#459 (comment)>
I don't get it. This is thread local storage, not normal Python object, why do we need any lock?
—
Reply to this email directly, view it on GitHub<#459 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAFUZAEZWURC3OQY4UDWR5D2UNZSNAVCNFSM6AAAAABXT6YSCGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDOMRWGA3DOMBQGU>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
|
Let us not rush into a fix and check CPython threading.local internal first. I believe the impl already has a lock so that Python level access is guaranteed thread safe. |
|
I only have a minute right now, hoping for a piece of information that will help me when I have a block of time later:
|
|
Yes, each thread has its own |
|
See #520: no race, but it recomputes |
|
Right, this is the consequence of storing data in thread-local storage. It was already the case before this PR, and this is why having a lock makes no sense. Each thread always has its own copy of |
|
Do we want that behavior? Or would it be better if |
|
Yes. For example, |
|
Wow, thanks! I need to read up. I completely misinterpreted what the loop is supposed to do. |
|
No worries! Always good to have extra pairs of eyes 🙂 |
Part of #439.
Before this PR:
With this PR:
which I consider good enough for a pure Python implementation. Compared to the CuPy counterpart (which is Cython-based and returning a string instead of namedtuple):
Note that the perf improvement of retrieving a
Device()instance is out of scope of this PR and pending investigation (#439 (comment)).As part of this PR, I also removed a silly lock in the
Deviceconstructor. The data being protected is already placed in the thread-local storage, so it makes no sense to add another lock.