Improve perf of accessing `dev.compute_capability` by leofang · Pull Request #459 · NVIDIA/cuda-python

leofang · 2025-02-21T19:44:14Z

Part of #439.

Before this PR:

In [4]: %timeit dev.compute_capability
1.87 μs ± 2.34 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

With this PR:

In [7]: %timeit dev.compute_capability
97.2 ns ± 0.087 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

which I consider good enough for a pure Python implementation. Compared to the CuPy counterpart (which is Cython-based and returns a string instead of a namedtuple):

In [12]: %timeit dev.compute_capability
41.6 ns ± 0.0415 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
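For readers who want a picture of the caching idea named in the commit message ("cache cc to speed it up"), here is a minimal sketch. It is not the actual cuda.core code: `FakeDevice`, `_query_driver`, and the hard-coded (9, 0) value are hypothetical stand-ins for the real driver query, which is the expensive call being avoided.

```python
from collections import namedtuple

# The PR description says the property returns a namedtuple; the field names
# here are an assumption made for illustration.
ComputeCapability = namedtuple("ComputeCapability", ("major", "minor"))


class FakeDevice:
    """Hypothetical stand-in for a device object; not the cuda.core class."""

    def __init__(self, device_id):
        self._id = device_id
        self._cc = None          # cache slot, filled on first access
        self.query_count = 0     # counts simulated driver queries

    def _query_driver(self):
        # Placeholder for the expensive driver call; returns a made-up value.
        self.query_count += 1
        return ComputeCapability(9, 0)

    @property
    def compute_capability(self):
        cc = self._cc
        if cc is None:
            # First access: query once and remember the result.
            cc = self._cc = self._query_driver()
        return cc


dev = FakeDevice(0)
first = dev.compute_capability
second = dev.compute_capability
```

After the first access, every later read is an attribute lookup plus a `None` check, which fits the order of magnitude of the "With this PR" timing above.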

Note that the perf improvement of retrieving a Device() instance is out of scope of this PR and pending investigation (#439 (comment)).

As part of this PR, I also removed a silly lock in the Device constructor. The data being protected is already placed in the thread-local storage, so it makes no sense to add another lock.
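A small sketch of why data kept in `threading.local` needs no extra lock: each thread sees only its own attributes. The `tag` attribute and the `worker` function are hypothetical names used for illustration and do not appear in cuda.core.

```python
import threading

_tls = threading.local()
_tls.tag = "main"   # set in the main thread only

seen = {}


def worker():
    # The worker thread starts with an empty view of _tls.
    seen["before"] = hasattr(_tls, "tag")
    _tls.tag = "worker"          # writes to the worker's own slot
    seen["after"] = _tls.tag


t = threading.Thread(target=worker)
t.start()
t.join()

# The main thread's value is untouched by the worker's assignment.
main_tag = _tls.tag
```

Since no thread can reach another thread's attributes through `_tls`, there is no shared state for a lock to protect.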


leofang · 2025-02-21T19:45:36Z

/ok to test

rwgk · 2025-03-14T23:34:18Z

I looked here while working on the release notes. (I looked at the code before while working on #439 a couple days ago, but didn't realize then that the code is so new.)

From the PR description:

As part of this PR, I also removed a silly lock in the Device constructor. The data being protected is already placed in the thread-local storage, so it makes no sense to add another lock.

I'm almost certain that there can be a race now:

https://chatgpt.com/share/67d4bc2f-6c7c-8008-9d86-425fe77a3ed9

leofang · 2025-03-15T00:42:56Z

I don't get it. This is thread-local storage, not a normal Python object, so why do we need any lock?

rwgk · 2025-03-15T01:01:47Z

I can try a forward fix later tonight. It should be relatively easy. Multiple threads can get to the for loop.

leofang · 2025-03-15T02:34:57Z

Let us not rush into a fix; let's check the CPython threading.local internals first. I believe the implementation already has a lock, so that Python-level access is guaranteed to be thread-safe.

rwgk · 2025-03-15T03:04:22Z

I only have a minute right now, hoping for a piece of information that will help me when I have a block of time later:

I have to read up on threading.local.
But if I'm guessing correctly, a difference before/after this PR is that _tls.devices is now recomputed for each thread, while it was computed only once before. Does that sound right?
Assuming my guess is correct, is that what you wanted, or just something you accepted?

leofang · 2025-03-15T03:29:19Z

Yes, each thread has its own _tls, and both by definition (thread-local storage) and by the CPython implementation, accessing _tls will not race. _tls should not be treated as a normal Python object.

rwgk · 2025-03-15T03:31:37Z

See #520: no race, but it recomputes _tls.devices for each new thread.

leofang · 2025-03-15T03:36:30Z

Right, this is the consequence of storing data in thread-local storage. It was already the case before this PR, and this is why having a lock makes no sense. Each thread always has its own copy of _tls.devices.
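The per-thread recomputation discussed here can be sketched as follows. This is an illustration under assumptions, not the real `Device` constructor: `get_devices`, `NUM_DEVICES`, and the string placeholders are hypothetical; only the `_tls.devices` name comes from the thread.

```python
import threading

_tls = threading.local()
NUM_DEVICES = 2  # hypothetical device count, for illustration only


def get_devices():
    """Return (devices, built_now) using a lazily filled per-thread cache."""
    try:
        return _tls.devices, False
    except AttributeError:
        # First call in this thread: build this thread's own list.
        _tls.devices = [f"device-{i}" for i in range(NUM_DEVICES)]
        return _tls.devices, True


def worker(idx, results):
    _, built_first = get_devices()    # builds the list for this thread
    _, built_second = get_devices()   # reuses this thread's list
    results[idx] = (built_first, built_second)


results = {}
threads = [threading.Thread(target=worker, args=(i, results)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each of the four threads builds the list once and then reuses it, so the initialization runs once per thread rather than once per process, with no locking involved.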

rwgk · 2025-03-15T03:41:49Z

Do we want that behavior? Or would it be better if devices was computed only once per process?

leofang · 2025-03-15T03:49:18Z

Yes. For example, cudaGetDevice/cudaSetDevice are already accessing per-thread-level information:
https://developer.nvidia.com/blog/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/

rwgk · 2025-03-15T04:37:48Z

Wow, thanks! I need to read up. I completely misinterpreted what the loop is supposed to do.

leofang · 2025-03-15T05:19:08Z

No worries! Always good to have extra pairs of eyes 🙂

leofang added 2 commits February 21, 2025 00:16

cache cc to speed it up

2afcb20

avoid silly, redundant lock

95777c4

leofang added the enhancement (Any code-related improvements), P0 (High priority - Must do!), and cuda.core (Everything related to the cuda.core module) labels Feb 21, 2025

leofang added this to the cuda.core beta 3 milestone Feb 21, 2025

leofang requested a review from shwina February 21, 2025 19:44

leofang self-assigned this Feb 21, 2025

Merge branch 'main' into cache_cc

4cfd505

This was referenced Feb 22, 2025

Switch to use CUDA driver APIs in Device constructor #460

Merged

Querying current device is slow compared to CuPy #439

Closed

leofang requested a review from keenan-simpson February 25, 2025 13:32

keenan-simpson approved these changes Feb 25, 2025

leofang merged commit 440eabd into NVIDIA:main Feb 25, 2025

leofang deleted the cache_cc branch February 25, 2025 17:34

leofang mentioned this pull request Mar 14, 2025

cuda.core: release notes update #519

Merged

rwgk mentioned this pull request Mar 15, 2025

test_many_threads.py, with prints in Device.__new__ #520

Closed

leofang mentioned this pull request Feb 10, 2026

cuda.core latency benchmark suite #1579

Open
