Conversation
Contributor
Amazing, you've done it! The pieces of the puzzle are in place. Congrats, ik, on the world's smallest working DeepSeek-R1-0528 quant! 🎉 With the new DDR5 2x64GB DIMM kits becoming available, an AM5 gaming-class rig + GPU can just barely fit this little beast! I'm going to double check that 👈

Commands and Logs

Pull and Build

git branch | grep '*'
* ik/cuda_iq1_m_r4
git rev-parse --short HEAD
8ed7825f
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
cmake --build ./build --config Release -j $(nproc)

llama-sweep-bench

model=/mnt/raid/hf/DeepSeek-R1-0528-GGUF/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-c 16384 \
-ctk f16 \
-mla 3 -fa \
-amb 512 \
-fmoe \
-ngl 99 \
-ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20)\.ffn_.*=CUDA0" \
-ot "blk\.(21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38)\.ffn_.*=CUDA1" \
-ot exps=CPU \
-b 4096 -ub 4096 \
--warmup-batch \
--threads 24
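The two layer-range -ot overrides pin the FFN tensors of layers 3-20 to CUDA0 and layers 21-38 to CUDA1, while the final -ot exps=CPU catches the remaining expert tensors and keeps them in system RAM. Assuming the overrides are applied in command-line order with first match winning, the routing can be sanity-checked by replaying the patterns in Python (the tensor names below are illustrative, not read from the actual GGUF):

```python
import re

# Override patterns in the order they appear on the command line;
# the first pattern that matches a tensor name decides its placement.
overrides = [
    (re.compile(r"blk\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20)\.ffn_.*"), "CUDA0"),
    (re.compile(r"blk\.(21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38)\.ffn_.*"), "CUDA1"),
    (re.compile(r"exps"), "CPU"),
]

def placement(tensor_name: str) -> str:
    """Return the device assigned by the first matching override."""
    for pattern, device in overrides:
        if pattern.search(tensor_name):
            return device
    return "default"

# Illustrative tensor names:
print(placement("blk.10.ffn_gate_exps.weight"))  # CUDA0 (layer range wins over exps)
print(placement("blk.30.ffn_up.weight"))         # CUDA1
print(placement("blk.45.ffn_down_exps.weight"))  # CPU (outside both layer ranges)
```

Note that an expert tensor inside layers 3-38 still lands on a GPU, because the layer-range patterns are tried before the exps catch-all.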
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q4_0: 61 tensors
llama_model_loader: - type iq4_ks: 551 tensors
llama_model_loader: - type iq1_s_r4: 116 tensors
llama_model_loader: - type iq1_m_r4: 58 tensors
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = IQ1_S_R4 - 1.5 bpw
llm_load_print_meta: model params = 672.050 B
llm_load_print_meta: model size = 130.203 GiB (1.664 BPW)
llm_load_print_meta: repeating layers = 129.285 GiB (1.657 BPW, 670.196 B parameters)
llm_load_print_meta: general.name = DeepSeek R1 0528
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 5994.06 MiB
llm_load_tensors: CPU buffer size = 44211.82 MiB
llm_load_tensors: CPU buffer size = 469.99 MiB
llm_load_tensors: CUDA0 buffer size = 42859.65 MiB
llm_load_tensors: CUDA1 buffer size = 43061.37 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 576.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 522.00 MiB
llama_new_context_with_model: KV self size = 1098.00 MiB, c^KV (f16): 1098.00 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model: CUDA0 compute buffer size = 2824.02 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 2520.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 368.05 MiB
llama_new_context_with_model: graph nodes = 5500
llama_new_context_with_model: graph splits = 111
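The reported sizes are internally consistent. A quick back-of-the-envelope check, assuming f16 storage, 61 KV-bearing layers, and DeepSeek-V3/R1's MLA dimensions (512-dim compressed KV latent plus 64 RoPE dims per token per layer — architecture values not printed in this log):

```python
# Model size: 672.050 B parameters at the reported 1.664 bits per weight.
params = 672.050e9
bpw = 1.664
model_gib = params * bpw / 8 / 2**30
print(f"model size ~ {model_gib:.1f} GiB")  # ~130.2 GiB, matching the log

# MLA KV cache: per token and layer, one f16 vector of 512 (c^KV) + 64 (RoPE) dims.
n_ctx, n_layers = 16384, 61
kv_mib = n_ctx * n_layers * (512 + 64) * 2 / 2**20
print(f"KV self size = {kv_mib:.0f} MiB")   # 1098 MiB, matching the log
```

The 1098 MiB figure also explains why long contexts stay cheap with -mla: only the compressed latent is cached, not full per-head K/V.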
Contributor
Importantly, it runs clean with no NaNs! Ship it! 🚢 🐿️ 🚀

To help the quest for the world's smallest DeepSeek model, this PR adds a CUDA implementation for IQ1_M_R4.

GEMM is done via dequantize + cuBLAS, so it may require cmake -DGGML_CUDA_IQK_FORCE_BF16=ON.

Performance is on par with, or even a tiny bit better than, IQ1_M.

Here is a sweep bench for LLaMA-3-8B on an RTX 4080:
IQ1_M
IQ1_M_R4 (PR)
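The dequantize + cuBLAS path mentioned above follows a general pattern: rather than running a dedicated quantized-matmul kernel, the quantized weights are first expanded into a dense floating-point buffer (bf16 when GGML_CUDA_IQK_FORCE_BF16 is set) and then handed to a standard GEMM. Here is a minimal NumPy sketch of that structure, using an invented block-scale format for illustration rather than the real IQ1_M_R4 layout, and np.matmul standing in for cuBLAS:

```python
import numpy as np

BLOCK = 32  # invented block size for illustration; not the real IQ1_M_R4 layout

def quantize(w: np.ndarray):
    """Per-block absmax scale + int8 codes (a stand-in for the real low-bit codes)."""
    blocks = w.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    codes = np.round(blocks / np.where(scales == 0, 1, scales)).astype(np.int8)
    return codes, scales

def dequantize(codes, scales, shape):
    """Expand codes back to a dense float buffer."""
    return (codes.astype(np.float32) * scales).reshape(shape)

def gemm_via_dequant(codes, scales, shape, x):
    # Dequantize to dense floats, then run a standard GEMM --
    # the same two-step structure as the dequantize + cuBLAS path.
    w = dequantize(codes, scales, shape)
    return w @ x

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal((64, 8)).astype(np.float32)
y = gemm_via_dequant(*quantize(w), w.shape, x)
print(np.abs(y - w @ x).max())  # small quantization error, not exact
```

The trade-off is extra memory traffic and a temporary buffer for the dequantized weights, in exchange for reusing a highly tuned GEMM instead of writing a bespoke kernel for each quant type.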