
fix: gate CUDA directory checks on GPU vendor to prevent false CUDA detection #8942

Merged
mudler merged 2 commits into mudler:master from sozercan:fix/gate-cuda-dir-on-gpu-vendor
Mar 12, 2026

@sozercan sozercan commented Mar 10, 2026

Summary

  • Fix incorrect CUDA backend selection on CPU-only hosts that have CUDA runtime
    libraries installed (e.g., cuda-cudart-12-5 via apt), which create
    /usr/local/cuda-12 directories as a side effect
  • Reorder checks in getSystemCapabilities() so CUDA directory existence only
    refines the capability when an NVIDIA GPU is actually detected, consistent with
    the arm64 L4T code path that already gates on GPUVendor == Nvidia
  • Add unit tests covering 8 scenarios for the capability detection logic

Problem

Container images that install CUDA runtime libraries create /usr/local/cuda-12
or /usr/local/cuda-13 directories. The previous code checked for these
directories before checking whether a GPU was present, causing CPU-only hosts
to select a CUDA backend. The CUDA backend then crashes because libcuda.so.1 is
absent.

Previous PR that fixed a similar issue: #6149

Changes

pkg/system/capabilities.go — Reordered the non-arm64 path in
getSystemCapabilities():

  1. Check for no GPU → return "default" (early exit)
  2. Check for low VRAM (≤4GB) → return "default" with warning
  3. Check CUDA directories only if GPUVendor == Nvidia
  4. Fall back to GPU vendor string
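
The four steps above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code in pkg/system/capabilities.go: the function name `resolveCapability`, its parameters, and the reading of "≤4GB" as 4 GiB in bytes are all hypothetical. CUDA 13 is checked before CUDA 12 because the test table expects "nvidia-cuda-13" when both directories exist.

```go
package main

import "fmt"

// lowVRAMThreshold is an assumed interpretation of "≤4GB": 4 GiB in bytes.
const lowVRAMThreshold uint64 = 4 << 30

// resolveCapability is a hypothetical stand-in for the non-arm64 path of
// getSystemCapabilities(). Inputs are passed in explicitly so the ordering
// of the checks is easy to see and to test.
func resolveCapability(gpuVendor string, vramBytes uint64, hasCUDA12, hasCUDA13 bool) string {
	// 1. No GPU detected: exit early, regardless of CUDA directories.
	if gpuVendor == "" {
		return "default"
	}
	// 2. Low VRAM: fall back to the default backend (the real code also warns).
	if vramBytes <= lowVRAMThreshold {
		return "default"
	}
	// 3. CUDA directories only refine the result for an NVIDIA GPU.
	if gpuVendor == "nvidia" {
		if hasCUDA13 {
			return "nvidia-cuda-13"
		}
		if hasCUDA12 {
			return "nvidia-cuda-12"
		}
	}
	// 4. Otherwise report the vendor string as-is.
	return gpuVendor
}

func main() {
	const vram8GiB uint64 = 8 << 30
	// CPU-only host with a stray /usr/local/cuda-12 directory.
	fmt.Println(resolveCapability("", 0, true, false))
	// NVIDIA GPU with both CUDA directories present.
	fmt.Println(resolveCapability("nvidia", vram8GiB, true, true))
	// AMD GPU with a CUDA directory present.
	fmt.Println(resolveCapability("amd", vram8GiB, true, false))
}
```

With this ordering, a CUDA directory on its own can never select a CUDA backend; it only narrows the choice once an NVIDIA GPU with enough VRAM has been found.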

pkg/system/capabilities_test.go — New file with table-driven tests:

| Scenario                 | GPUVendor | CUDA dirs | Expected         |
|--------------------------|-----------|-----------|------------------|
| CUDA dir, no GPU         | ""        | cuda12    | "default"        |
| CUDA 12 + NVIDIA         | "nvidia"  | cuda12    | "nvidia-cuda-12" |
| CUDA 13 + NVIDIA         | "nvidia"  | cuda13    | "nvidia-cuda-13" |
| Both dirs + NVIDIA       | "nvidia"  | both      | "nvidia-cuda-13" |
| CUDA dir + AMD           | "amd"     | cuda12    | "amd"            |
| No CUDA, no GPU          | ""        | none      | "default"        |
| No CUDA + NVIDIA         | "nvidia"  | none      | "nvidia"         |
| CUDA + NVIDIA + low VRAM | "nvidia"  | cuda12    | "default"        |

Test plan

  • go test ./pkg/system/... — new unit tests pass (skipped on darwin as expected)
  • go vet ./pkg/system/... — clean
  • Verify on a CPU-only container with CUDA runtime libs installed that capability resolves to "default" instead of "nvidia-cuda-12"


netlify bot commented Mar 10, 2026

Deploy Preview for localai ready!

Name Link
🔨 Latest commit 0e28fbf
🔍 Latest deploy log https://app.netlify.com/projects/localai/deploys/69b1fb09b6f33a00086fde42
😎 Deploy Preview https://deploy-preview-8942--localai.netlify.app

@sozercan sozercan force-pushed the fix/gate-cuda-dir-on-gpu-vendor branch from 4c4ab1e to 75d9ce8 Compare March 10, 2026 20:36
@sozercan sozercan marked this pull request as ready for review March 10, 2026 20:36
@sozercan sozercan force-pushed the fix/gate-cuda-dir-on-gpu-vendor branch from 75d9ce8 to 47fb0f5 Compare March 10, 2026 20:38
"testing"
)

func TestGetSystemCapabilities(t *testing.T) {
Owner

Could you move these to use Ginkgo, to be consistent with all the other tests?

Contributor Author

Done — converted to Ginkgo with DescribeTable/Entry and added a system_suite_test.go. Follows the same pattern as pkg/vram and pkg/functions/peg.

…etection

Container images that install CUDA runtime libraries (e.g., cuda-cudart-12-5
via apt) create /usr/local/cuda-12 directories as a side effect. The previous
code checked for these directories before checking whether a GPU was present,
causing CPU-only hosts to select a CUDA backend that crashes because
libcuda.so.1 is absent.

Reorder checks so CUDA directory existence only refines the capability when
an NVIDIA GPU is actually detected, consistent with the arm64 L4T code path.

Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
@sozercan sozercan force-pushed the fix/gate-cuda-dir-on-gpu-vendor branch from 90d3b3e to eea2fd7 Compare March 11, 2026 23:29
@sozercan sozercan requested a review from mudler March 11, 2026 23:30
Owner

@mudler mudler left a comment

Thanks!

@mudler mudler merged commit 45d1881 into mudler:master Mar 12, 2026
33 checks passed
@sozercan sozercan deleted the fix/gate-cuda-dir-on-gpu-vendor branch March 12, 2026 20:04
@mudler mudler added the bug Something isn't working label Mar 14, 2026
localai-bot pushed a commit to localai-bot/LocalAI that referenced this pull request Mar 18, 2026
…etection (mudler#8942)

Signed-off-by: localai-bot <localai-bot@users.noreply.github.com>
localai-bot pushed a commit to localai-bot/LocalAI that referenced this pull request Mar 25, 2026
…etection (mudler#8942)
