Issue description
Can't run on Jetson Orin / Thor
Expected Behavior
Be able to run node-llama-cpp on a Jetson Orin / Thor.
Actual Behavior
Crashing:
/root/.nvm/versions/node/v24.13.1/lib/node_modules/node-llama-cpp/llama/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
[node-llama-cpp] CUDA error: an internal operation failed
[node-llama-cpp] current device: 0, in function ggml_cuda_op_mul_mat_cublas at /root/.nvm/versions/node/v24.13.1/lib/node_modules/node-llama-cpp/llama/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1363
[node-llama-cpp] cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
Aborted (core dumped)
Steps to reproduce
After trying for two full days, I can't get node-llama-cpp to compile a llama.cpp build that runs. By now I have tried all sorts of compile options; this is only my last attempt:
NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_CUDA_ARCHITECTURES=87 \
NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_FORCE_MMQ=ON \
NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_NO_VMM=ON \
npx --no node-llama-cpp source build --gpu
If I don't set GGML_CUDA_NO_VMM=ON, I get memory allocation errors.
If I don't set CMAKE_CUDA_ARCHITECTURES=87, some random virtual CUDA arch is detected instead.
Here is my latest attempt, using the same options as my standalone llama.cpp build from GitHub (see Additional Context below):
NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA=ON \
NODE_LLAMA_CPP_CMAKE_OPTION_DNLC_VARIANT=cuda.b8121 \
NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_CUDA_ARCHITECTURES=87 \
NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_CUB_3DOT2=ON \
NODE_LLAMA_CPP_CMAKE_OPTION_GGML_BACKEND_DL=ON \
NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_FORCE_MMQ=ON \
NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_NO_VMM=ON \
NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUBLAS=OFF \
NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
NODE_LLAMA_CPP_CMAKE_OPTION_CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
npx --no node-llama-cpp source build --gpu
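For context on the commands above: node-llama-cpp forwards any environment variable prefixed with NODE_LLAMA_CPP_CMAKE_OPTION_ to CMake as a -D option, which is why the env vars here mirror the flags of the standalone cmake build below. A minimal sketch of that mapping (my own illustration of the naming convention, not the library's actual code):

```javascript
// Sketch: how NODE_LLAMA_CPP_CMAKE_OPTION_* env vars translate into CMake
// -D flags. The real forwarding happens inside node-llama-cpp's build tool;
// this only illustrates the convention.
const PREFIX = "NODE_LLAMA_CPP_CMAKE_OPTION_";

function envToCmakeArgs(env) {
    return Object.entries(env)
        .filter(([key]) => key.startsWith(PREFIX))
        .map(([key, value]) => `-D${key.slice(PREFIX.length)}=${value}`);
}

// Example with two of the options from the build command above:
const args = envToCmakeArgs({
    NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_CUDA_ARCHITECTURES: "87",
    NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_NO_VMM: "ON"
});
console.log(args.join(" "));
// → -DCMAKE_CUDA_ARCHITECTURES=87 -DGGML_CUDA_NO_VMM=ON
```

So the node-llama-cpp build should end up invoking CMake with essentially the same -D flags as the working compile.sh further down, which is what makes the differing result surprising.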
My Environment
| Dependency | Version |
|---|---|
| Operating System | Jetson Orin |
| CPU | ARM aarch64 |
| Node.js version | 24.13.1 |
| Typescript version | ? |
| node-llama-cpp version | 3.16.2 |
$ cat /etc/nv_tegra_release
R36 (release), REVISION: 4.7, GCID: 42132812, BOARD: generic, EABI: aarch64, DATE: Thu Sep 18 22:54:44 UTC 2025
KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia
npx --yes node-llama-cpp inspect gpu output:
# npx --yes node-llama-cpp inspect gpu
OS: Ubuntu 22.04.5 LTS (arm64)
Node: 24.13.1 (arm64)
node-llama-cpp: 3.16.2
Prebuilt binaries: b8121
Cloned source: b8121
CUDA: available
Vulkan: Vulkan is detected, but using it failed
To resolve errors related to Vulkan, see the Vulkan guide: https://node-llama-cpp.withcat.ai/guide/vulkan
CUDA device: Orin
CUDA used VRAM: 5.8% (3.56GB/61.37GB)
CUDA free VRAM: 94.19% (57.81GB/61.37GB)
CPU model: Cortex-A78AE
Math cores: 12
Used RAM: 5.8% (3.56GB/61.37GB)
Free RAM: 94.19% (57.81GB/61.37GB)
Used swap: 0.67% (211MB/30.68GB)
Max swap size: 30.68GB
mmap: supported
Additional Context
Using llama.cpp compiled from source directly on the Orin works great:
$ compile.sh
cmake -B build \
    -DGGML_CUDA=ON \
    -DDNLC_VARIANT=cuda.b8121 \
    -DCMAKE_CUDA_ARCHITECTURES=87 \
    -DGGML_CUDA_CUB_3DOT2=ON \
    -DGGML_BACKEND_DL=ON \
    -DGGML_CUDA_FORCE_MMQ=ON \
    -DGGML_CUDA_NO_VMM=ON \
    -DGGML_CUBLAS=OFF \
    -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
    -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
test run:
root@jetson-orin:/usr/src/llama.cpp/build/bin# ./llama-cli -m /root/.cache/qmd/models/hf_tobil_qmd-query-expansion-1.7B-q4_k_m.gguf -p "Hello, how are you?" -n 128
ggml_cuda_init: found 1 CUDA devices:
Device 0: Orin, compute capability 8.7, VMM: no
load_backend: loaded CUDA backend from /mnt/src/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /mnt/src/llama.cpp/build/bin/libggml-cpu.so
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b8136-9051663d5
model : hf_tobil_qmd-query-expansion-1.7B-q4_k_m.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
> Hello, how are you?
[Start thinking]
lex: greetings from the virtual assistant
lex: how are you today?
vec: greetings from the virtual assistant
vec: how are you today?
hyde: The topic of hello, how are you? covers greetings from the virtual assistant. Proper implementation follows established patterns and best practices.
[ Prompt: 293.1 t/s | Generation: 60.9 t/s ]
Relevant Features Used
- Metal support
- CUDA support
- Vulkan support
- Grammar
- Function calling
Are you willing to resolve this issue by submitting a Pull Request?
Yes, but I have no idea how; I'm not a dev.