
fix: drop gguf VRAM estimation (now redundant) #8325

Merged
mudler merged 1 commit into master from chore/drop-gguf-vram-estimation
Feb 1, 2026

Conversation

Owner

@mudler mudler commented Feb 1, 2026

Cleanup: this is now handled directly in llama.cpp, so there is no need to estimate it from Go.

VRAM estimation is tricky in general, but llama.cpp ( https://github.com/ggml-org/llama.cpp/blob/41ea26144e55d23f37bb765f88c07588d786567f/src/llama.cpp#L168 ) has recently added automatic "fitting" of models to available VRAM. Since we already enable that feature, we can drop the backend-specific GGUF VRAM estimation from our code instead of trying to guess:

params.fit_params = true;

Fixes: #8302
See: #8302 (comment)


netlify bot commented Feb 1, 2026

Deploy Preview for localai ready!

Name Link
🔨 Latest commit a52f1d8
🔍 Latest deploy log https://app.netlify.com/projects/localai/deploys/697f26cf6dbf1d0008fdefb7
😎 Deploy Preview https://deploy-preview-8325--localai.netlify.app

@mudler mudler force-pushed the chore/drop-gguf-vram-estimation branch from 2a8bbc7 to 5162a40 Compare February 1, 2026 10:09
The commit message body matches the PR description above; LocalAI already enables the fitting here:

https://github.com/mudler/LocalAI/blob/397f7f0862d4105b874523e1a0105ae036db18ec/backend/cpp/llama-cpp/grpc-server.cpp#L393
@mudler mudler force-pushed the chore/drop-gguf-vram-estimation branch from ca2e280 to a52f1d8 Compare February 1, 2026 10:11
@mudler mudler changed the title from "fix: drop gguf VRAM estimation" to "fix: drop gguf VRAM estimation (now redundant)" Feb 1, 2026
@mudler mudler merged commit 800f749 into master Feb 1, 2026
39 checks passed
@mudler mudler deleted the chore/drop-gguf-vram-estimation branch February 1, 2026 16:33
@mudler mudler added the bug Something isn't working label Feb 7, 2026
localai-bot pushed a commit to localai-bot/LocalAI that referenced this pull request Mar 25, 2026
fix: drop gguf VRAM estimation


Labels

bug Something isn't working


Development

Successfully merging this pull request may close these issues.

Small model does not get loaded completely into VRAM to do GPU inference, slow CPU inference

1 participant