
Hadamard transforms for K-cache (CPU only) #1033

Merged
ikawrakow merged 1 commit into main from ik/k_cache_hadamard
Dec 4, 2025

Conversation

@ikawrakow
Owner

@ikawrakow ikawrakow commented Dec 3, 2025

The claim that applying Hadamard transforms to the tensors involved in Transformer models improves quantization accuracy is quite prevalent in certain groups on the Internet.

I have tried using Hadamard transforms for model weight quantization with zero success (i.e., no improvement, or even worse outcomes, compared to not using Hadamard transforms). Hence, I have been skeptical that using Hadamard transforms for a quantized KV cache would move the needle in any meaningful way.

If so, why try this now?

Well, DeepSeek-V3.2-Exp has been released, and it is being claimed that this model is exceptionally good at reasoning and tool calling. We have issue #834 asking for support, so I was reviewing the attention implementation today to see what it would take to add support. Guess what? They do use a Hadamard transform before converting the activations to fp8 for storage in the KV cache. Hence, to support the DeepSeek-V3.2-Exp attention mechanism, we do need to be able to handle Hadamard transforms. So, I decided to add this ability, just for the CPU backend for now. While at it, I decided to see if this really reduces quantization error when using a quantized KV cache. This PR is the result.

A Hadamard transform can be easily applied to the K-cache: just apply it to K and Q, store the Hadamard-transformed K in the K-cache, and then proceed with the self-attention calculation as usual. The V-cache is much more involved: one needs to either apply an inverse transform to the entire V-cache before multiplying with softmax(K*Q) (slow, and requiring a lot of extra RAM/VRAM), or incorporate the Hadamard transform directly into the softmax(K*Q) computation in the flash attention kernel (complicated). Hence, this PR only adds the ability to use Hadamard transforms for the K-cache. This should not be a big issue, as it is well known that K-cache quantization errors have a much bigger impact on model quality than V-cache errors.

The results are, at least to me, quite astounding. The table below shows PPL for LLaMA-3-8B (model quantized with Q4_0) for four different K-cache quantization types, with and without a Hadamard transform. The V-cache is always f16. As the implementation is CPU-only, these PPL calculations take quite a bit of time, so just one model for now. I have left a more detailed comparison for a future PR adding Hadamard transforms to the CUDA backend.

| K-cache | PPL (no Hadamard) | PPL (Hadamard) | Diff to f16 (no Hadamard) | Diff to f16 (Hadamard) |
| ------- | ----------------- | -------------- | ------------------------- | ---------------------- |
| f16     | 7.6077            | N/A            | N/A                       | N/A                    |
| Q8_0    | 7.6090            | 7.6083         | 0.02%                     | 0.008%                 |
| Q6_0    | 7.6192            | 7.6123         | 0.15%                     | 0.06%                  |
| IQ4_NL  | 7.7252            | 7.6717         | 1.54%                     | 0.84%                  |
| Q4_0    | 7.8131            | 7.6789         | 2.70%                     | 0.94%                  |

I did not run the f16 K-cache with a Hadamard transform, as any difference would be basically within the noise. The Q6_0 K-cache is already quite good without Hadamard, but using a Hadamard transform reduces the quantization error by more than a factor of 2. The most astounding result is the nearly 3X reduction in quantization error for the Q4_0 K-cache, which makes it a viable option when one is really short on memory. It is also interesting that with a Hadamard transform applied, the benefit of the non-linear IQ4_NL quantization over Q4_0 is much smaller than without it.

To use it, add -khad or --k-cache-hadamard to the command line. As mentioned, this is CPU-only for now: it will run on CUDA, but self-attention will be computed on the CPU, so it will be much slower.

Performance impact is negligible when running CPU-only.

Caveat: the attention head size must be a power of 2. This holds for almost all relevant models, with the notable exception of DeepSeek-V2/V3/R1 (and, by extension, Kimi-2).

@Nexesenex
Contributor

This is amazing. My favored Q6_0 becomes so pretty!

@fairydreaming

@ikawrakow Hi, I'm currently working on a DeepSeek V3.2 sparse attention (DSA) implementation for llama.cpp. Do you mind if I borrow your Hadamard transform code?

@ikawrakow
Owner Author

@fairydreaming

I don't mind, but the mainline maintainers might. If you look at what happened in PR 19726, I'm basically evil incarnate, so nothing associated with me in some way will be accepted into llama.cpp; see for instance this comment. Hence, you would need to hide where it came from, which has been done on a number of occasions, but that is kind of unethical, so perhaps you will not want to do it.

You are obviously more than welcome to do your implementation here.
