
bug: Jetson Orin compile + install #560

@mcr-ksh


Issue description

Can't run on Jetson Orin / Thor

Expected Behavior

Be able to run node-llama-cpp on a Jetson Orin / Thor.

Actual Behavior

Crashing:

/root/.nvm/versions/node/v24.13.1/lib/node_modules/node-llama-cpp/llama/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
[node-llama-cpp] CUDA error: an internal operation failed
[node-llama-cpp]   current device: 0, in function ggml_cuda_op_mul_mat_cublas at /root/.nvm/versions/node/v24.13.1/lib/node_modules/node-llama-cpp/llama/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1363
[node-llama-cpp]   cublasSgemm_v2(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
Aborted (core dumped)

Steps to reproduce

After trying for two full days, I can't get node-llama-cpp to compile into a working build.

By now I have tried all sorts of compile options. This is only my latest attempt:

NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_CUDA_ARCHITECTURES=87 NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_FORCE_MMQ=ON NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_NO_VMM=ON npx --no node-llama-cpp source build --gpu

If I don't set GGML_CUDA_NO_VMM=ON, I get memory allocation errors.
If I don't set CMAKE_CUDA_ARCHITECTURES=87, some random virtual CUDA arch is detected instead.

Here is my latest attempt, mirroring the options I used for the llama.cpp build from GitHub:
NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA=ON NODE_LLAMA_CPP_CMAKE_OPTION_DNLC_VARIANT=cuda.b8121 NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_CUDA_ARCHITECTURES=87 NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_CUB_3DOT2=ON NODE_LLAMA_CPP_CMAKE_OPTION_GGML_BACKEND_DL=ON NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_FORCE_MMQ=ON NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_NO_VMM=ON NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUBLAS=OFF NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc NODE_LLAMA_CPP_CMAKE_OPTION_CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda npx --no node-llama-cpp source build --gpu
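For context, node-llama-cpp documents that each `NODE_LLAMA_CPP_CMAKE_OPTION_<NAME>` environment variable is forwarded to CMake as `-D<NAME>=<VALUE>`. A minimal shell sketch of that mapping, using two of the options from this report (the `to_cmake_flags` helper is hypothetical, for illustration only):

```shell
# Hypothetical helper: shows how NODE_LLAMA_CPP_CMAKE_OPTION_<NAME>=<VALUE>
# environment variables translate into CMake -D<NAME>=<VALUE> flags.
to_cmake_flags() {
    env | grep '^NODE_LLAMA_CPP_CMAKE_OPTION_' | while IFS='=' read -r name value; do
        printf -- '-D%s=%s\n' "${name#NODE_LLAMA_CPP_CMAKE_OPTION_}" "$value"
    done
}

# Example: two of the options used on the Orin in this report.
export NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_CUDA_ARCHITECTURES=87
export NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_NO_VMM=ON

to_cmake_flags | sort
```

This is why the long env-var command above is equivalent to passing the same `-D` flags directly to CMake, as in the compile.sh further below.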

My Environment

Dependency              Version
Operating System        Jetson Orin
CPU                     ARM aarch64
Node.js version         24.13.1
TypeScript version      ?
node-llama-cpp version  3.16.2

$ cat /etc/nv_tegra_release
R36 (release), REVISION: 4.7, GCID: 42132812, BOARD: generic, EABI: aarch64, DATE: Thu Sep 18 22:54:44 UTC 2025
KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

npx --yes node-llama-cpp inspect gpu output:

# npx --yes node-llama-cpp inspect gpu
OS: Ubuntu 22.04.5 LTS (arm64)
Node: 24.13.1 (arm64)

node-llama-cpp: 3.16.2
Prebuilt binaries: b8121
Cloned source: b8121

CUDA: available
Vulkan: Vulkan is detected, but using it failed
To resolve errors related to Vulkan, see the Vulkan guide: https://node-llama-cpp.withcat.ai/guide/vulkan

CUDA device: Orin
CUDA used VRAM: 5.8% (3.56GB/61.37GB)
CUDA free VRAM: 94.19% (57.81GB/61.37GB)

CPU model: Cortex-A78AE
Math cores: 12
Used RAM: 5.8% (3.56GB/61.37GB)
Free RAM: 94.19% (57.81GB/61.37GB)
Used swap: 0.67% (211MB/30.68GB)
Max swap size: 30.68GB
mmap: supported

Additional Context

Using llama.cpp compiled from source directly on the Orin works great.

$ compile.sh

cmake -B build -DGGML_CUDA=ON -DDNLC_VARIANT=cuda.b8121 -DCMAKE_CUDA_ARCHITECTURES=87 -DGGML_CUDA_CUB_3DOT2=ON -DGGML_BACKEND_DL=ON -DGGML_CUDA_FORCE_MMQ=ON -DGGML_CUDA_NO_VMM=ON -DGGML_CUBLAS=OFF -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
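compile.sh above only shows the configure step; the standard CMake build step that presumably followed (a sketch, not part of the original script):

```shell
# Build the configured tree; -j parallelizes across available cores.
cmake --build build --config Release -j
```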

test run:

root@jetson-orin:/usr/src/llama.cpp/build/bin# ./llama-cli -m /root/.cache/qmd/models/hf_tobil_qmd-query-expansion-1.7B-q4_k_m.gguf -p "Hello, how are you?" -n 128
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: no
load_backend: loaded CUDA backend from /mnt/src/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /mnt/src/llama.cpp/build/bin/libggml-cpu.so

Loading model...  


[llama.cpp ASCII banner]

build      : b8136-9051663d5
model      : hf_tobil_qmd-query-expansion-1.7B-q4_k_m.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> Hello, how are you?

[Start thinking]
lex: greetings from the virtual assistant
lex: how are you today?
vec: greetings from the virtual assistant
vec: how are you today?
hyde: The topic of hello, how are you? covers greetings from the virtual assistant. Proper implementation follows established patterns and best practices.

[ Prompt: 293.1 t/s | Generation: 60.9 t/s ]

Relevant Features Used

  • Metal support
  • CUDA support
  • Vulkan support
  • Grammar
  • Function calling

Are you willing to resolve this issue by submitting a Pull Request?

Yes, but I have no idea how. I'm not a dev.


Labels: bug (Something isn't working), requires triage (Requires triaging)
