
Hadamard transforms for K-cache (CPU only) #1033

Merged
ikawrakow merged 1 commit into main from ik/k_cache_hadamard
Dec 4, 2025

Conversation

@ikawrakow
Owner

@ikawrakow ikawrakow commented Dec 3, 2025

The claim that applying Hadamard transforms to the tensors involved in Transformer models improves quantization accuracy is quite prevalent in certain groups on the Internet.

I have tried using Hadamard transforms for model weight quantization with zero success (i.e., no improvement, or even worse outcomes, compared to not using Hadamard transforms). Hence, I have been skeptical that using Hadamard transforms for a quantized KV cache would move the needle in any meaningful way.

If so, why try this now?

Well, DeepSeek-V3.2-Exp has been released, and it is being claimed that this model is exceptionally good at reasoning and tool calling. We have issue #834 asking for support, so I was reviewing the attention implementation today to see what it would take to add support. Guess what? They do use a Hadamard transform before converting the activations to fp8 for storage in the KV cache. Hence, to support the DeepSeek-V3.2-Exp attention mechanism, we do need to be able to handle Hadamard transforms. So, I decided to add this ability, just for the CPU backend for now. While at it, I decided to see if this really reduces quantization error when using a quantized KV cache. This PR is the result.

A Hadamard transform can be easily applied to the K-cache: just apply it to K and Q, store the Hadamard-transformed K in the K-cache, and then proceed with the self-attention calculation as usual. The V-cache is much more involved: one needs to either apply an inverse transform to the entire V-cache before multiplying with softmax(K*Q) (slow, and requiring a lot of extra RAM/VRAM), or incorporate the Hadamard transform directly into the softmax(K*Q) computation in the flash attention kernel (complicated). Hence, this PR only adds the ability to use Hadamard transforms for the K-cache. This should not be a big issue, as it is well known that K-cache quantization errors have a much bigger impact on model quality than V-cache errors.

The results are, at least to me, quite astounding. The table below shows PPL for LLaMA-3-8B (model quantized with Q4_0) for four different K-cache quantization types, with and without a Hadamard transform. The V-cache is always f16. As the implementation is CPU-only, these PPL calculations take quite a bit of time, so just one model for now. I have left a more detailed comparison for a future PR adding Hadamard transforms to the CUDA backend.

| K-cache | PPL (no Hadamard) | PPL (Hadamard) | Diff to f16 (no Hadamard) | Diff to f16 (Hadamard) |
| ------- | ----------------- | -------------- | ------------------------- | ---------------------- |
| f16     | 7.6077            | N/A            | N/A                       | N/A                    |
| Q8_0    | 7.6090            | 7.6083         | 0.02%                     | 0.008%                 |
| Q6_0    | 7.6192            | 7.6123         | 0.15%                     | 0.06%                  |
| IQ4_NL  | 7.7252            | 7.6717         | 1.54%                     | 0.84%                  |
| Q4_0    | 7.8131            | 7.6789         | 2.70%                     | 0.94%                  |

I did not run the f16 K-cache with a Hadamard transform, as any difference would be basically within the noise. The Q6_0 K-cache is already quite good without Hadamard, but using a Hadamard transform reduces the quantization error by more than a factor of 2. The most astounding result is the nearly 3X reduction in quantization error for the Q4_0 K-cache, which makes it a viable option when one is really short on memory. It is also interesting that with a Hadamard transform applied, the benefit of the non-linear IQ4_NL quantization over Q4_0 is much smaller than without it.

To use it, add -khad or --k-cache-hadamard to the command line. As mentioned, this is CPU-only for now: it will run on CUDA, but self-attention will be computed on the CPU, so it will be much slower.

Performance impact is negligible when running CPU-only.

Caveat: the attention head size must be a power of 2. This holds for almost all relevant models, with the notable exception of DeepSeek-V2/V3/R1 (and, by extension, Kimi-2).

@Nexesenex
Contributor

This is amazing. My favored Q6_0 becomes so pretty!

@fairydreaming

@ikawrakow Hi, I'm currently working on a DeepSeek V3.2 sparse attention (DSA) implementation for llama.cpp. Do you mind if I borrow your Hadamard transform code?

@ikawrakow
Owner Author

@fairydreaming

I don't mind, but the mainline maintainers might. If you look at what happened in PR 19726, I'm basically evil incarnate, so nothing associated with me in some way will be accepted into llama.cpp; see for instance this comment. Hence, you would need to hide where it came from, which has been done on a number of occasions, but that is kind of unethical, so perhaps you will not want to do it.

You are obviously more than welcome to do your implementation here.
