Add `@functools.lru_cache` decorator for `get_binding_version()` by rwgk · Pull Request #512 · NVIDIA/cuda-python

rwgk · 2025-03-12T22:09:14Z

This one-line change results in a roughly 8,500-fold speedup of get_binding_version().

>>> 35.381859/0.004149
8527.804049168473

$ git stash
$ python test_slowness.py 100000
driver.cuDriverGetVersion() 12060
cuda_utils.get_binding_version() (12, 8)
driver.cuDriverGetVersion()
    0.023946 seconds for 100000 iterations
    0.24 µs per call
cuda_utils.get_binding_version()
    35.381859 seconds for 100000 iterations
    353.82 µs per call

$ git stash pop
$ python test_slowness.py 100000
driver.cuDriverGetVersion() 12060
cuda_utils.get_binding_version() (12, 8)
driver.cuDriverGetVersion()
    0.022644 seconds for 100000 iterations
    0.23 µs per call
cuda_utils.get_binding_version()
    0.004149 seconds for 100000 iterations
    0.04 µs per call

In retrospect, I should have just looked at the get_binding_version() implementation immediately.

The way I actually found this (perf version 6.8.12):

perf record -F 99 -g -- python test_slowness.py 100000
perf report
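For anyone without perf at hand, a rough Python-level alternative is cProfile from the standard library. This is a different tool from the one used above (it traces Python and builtin calls rather than sampling native stacks), and the helper names `lookup_version` and `profile_calls` are made up for this sketch:

```python
import cProfile
import importlib.metadata
import io
import pstats


def lookup_version(dist_name="cuda-bindings"):
    """Hypothetical stand-in for the uncached lookup being investigated."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        # Keeps the sketch runnable on machines without the package installed.
        return None


def profile_calls(func, num_iters, top_n=10):
    """Run func num_iters times under cProfile and return a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(num_iters):
        func()
    profiler.disable()
    out = io.StringIO()
    # Sort by cumulative time so the most expensive call chains come first.
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(top_n)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_calls(lookup_version, 100))
```

The report lists the functions under importlib.metadata that dominate the cumulative time, which should point in the same direction as the perf output, though the exact entries depend on the Python version and the environment.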

I gave the top of the perf report and the get_binding_version() implementation to ChatGPT:

https://chatgpt.com/share/67d20482-8914-8008-b382-e450fd5a4d74

That made it immediately obvious that importlib.metadata.version("cuda-bindings") is the bottleneck, mainly because it involves regex calls, but also because it triggers filesystem calls.
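To illustrate the idea (not the actual cuda.core code), here is a minimal sketch of caching a metadata lookup with `functools.lru_cache`. The function name `get_version_tuple` and its parsing of the version string are assumptions made for this example:

```python
import functools
import importlib.metadata


@functools.lru_cache
def get_version_tuple(dist_name):
    """Return (major, minor) for an installed distribution, or None if missing.

    The metadata lookup runs only on the first call for a given dist_name;
    later calls are served from the cache.
    """
    try:
        raw = importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None
    # Assumes the first two dot-separated components are plain integers.
    return tuple(int(part) for part in raw.split(".")[:2])


if __name__ == "__main__":
    get_version_tuple("cuda-bindings")  # slow path: reads package metadata
    get_version_tuple("cuda-bindings")  # fast path: cache hit
    print(get_version_tuple.cache_info())
```

The trade-off is that the cached value stays fixed for the lifetime of the process, which is fine for an installed package version that is not expected to change while the program runs.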

test_slowness.py:

import sys
import time

from cuda.bindings import driver
from cuda.core.experimental._utils import cuda_utils


class show_timings:
    """Context manager that prints the elapsed wall-clock time and the per-call cost."""

    def __init__(self, num_iters, label):
        self.label = label
        self.num_iters = num_iters

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        elapsed = time.perf_counter() - self.start
        print(self.label)
        if self.num_iters:
            print(f"    {elapsed:.6f} seconds for {self.num_iters} iterations")
            print(f"    {(elapsed / self.num_iters) * 1e6:.2f} µs per call")
        else:
            print(f"    {elapsed:.6f} seconds")


err, dv = driver.cuDriverGetVersion()
assert err == driver.CUresult.CUDA_SUCCESS
print("driver.cuDriverGetVersion()", dv, flush=True)

bv = cuda_utils.get_binding_version()
print("cuda_utils.get_binding_version()", bv, flush=True)

num_iters = int(sys.argv[1])

# Change `if 1:` to `if 0:` to skip an individual benchmark.
if 1:
    with show_timings(num_iters, "driver.cuDriverGetVersion()"):
        for _ in range(num_iters):
            driver.cuDriverGetVersion()
if 1:
    with show_timings(num_iters, "cuda_utils.get_binding_version()"):
        for _ in range(num_iters):
            cuda_utils.get_binding_version()


copy-pr-bot · 2025-03-12T22:09:18Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

rwgk · 2025-03-12T22:09:35Z

/ok to test

rwgk · 2025-03-12T22:19:55Z

@shwina for visibility

This PR does not fix #439

leofang · 2025-03-12T23:54:40Z

Thanks, Ralf! How did you notice the slowness?

github-actions · 2025-03-13T00:10:12Z

Doc Preview CI
Preview removed because the pull request was closed or merged.

rwgk · 2025-03-13T03:54:25Z

> Thanks, Ralf! How did you notice the slowness?

When I was working on this:

c789bf6

Originally I had the `_utils.get_binding_version()` call inside the loop, at the place where you now see `if _BINDING_VERSION >= (12, 0):`.

Today I was hoping that fixing this very obvious problem first would help with #439 as well. But no, that's something different, and not nearly as extreme (5x vs 8500x).

rwgk self-assigned this Mar 12, 2025

rwgk requested a review from keenan-simpson March 12, 2025 22:10

leofang added this to the cuda.core beta 3 milestone Mar 12, 2025

leofang approved these changes Mar 12, 2025

leofang added enhancement Any code-related improvements P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Mar 12, 2025

rwgk merged commit f903d98 into NVIDIA:main Mar 12, 2025

rwgk deleted the get_binding_version_8k branch March 13, 2025 03:54

leofang mentioned this pull request Feb 10, 2026

cuda.core latency benchmark suite #1579

Open
