Hadamard transforms for K-cache (CPU only) #1033
Merged
Conversation
Contributor
This is amazing. My favored Q6_0 becomes so pretty!
@ikawrakow hi, I'm currently working on a DeepSeek V3.2 sparse attention (DSA) implementation for llama.cpp. Do you mind if I borrow your Hadamard transform code?
Owner, Author
I don't mind, but mainline maintainers might. If you look at what happened in PR 19726, I'm basically the incarnation of evilness, so nothing that is associated with me in some way will be accepted in mainline. You are obviously more than welcome to do your implementation here.
There is a claim, quite prevalent in certain corners of the Internet, that applying Hadamard transforms to the tensors involved in Transformer models improves quantization accuracy.

I have tried using Hadamard transforms for model weight quantization with zero success (i.e., no improvement, or even worse outcomes, compared to not using Hadamard transforms). Hence, I have been skeptical that using Hadamard transforms for a quantized KV cache would move the needle in any meaningful way.
So why try it now?
Well, DeepSeek-V3.2-Exp has been released, and it is being claimed that this model is exceptionally good at reasoning and tool calling. We have issue #834 asking for support, so I was reviewing the attention implementation today to see what it will take to add support. Guess what? They do use a Hadamard transform before converting the activations to `fp8` for storage in the KV cache. Hence, to support the DeepSeek-V3.2-Exp attention mechanism, we do need to be able to handle Hadamard transforms. So, I decided to add this ability, just for the CPU backend for now. While at it, I decided to check whether this really reduces quantization error when using a quantized KV cache. This PR is the result.

A Hadamard transform can be easily applied to the K-cache: just apply it to `K` and `Q`, store the Hadamard-transformed `K` in the K-cache, and then proceed with the self-attention calculation as usual. The V-cache is much more involved: one needs to either apply a backward transformation to the entire V-cache before multiplying with `softmax(K*Q)` (slow, and a lot of extra RAM/VRAM required), or incorporate the Hadamard transform directly into the `softmax(K*Q)` computation in the flash attention kernel (complicated). Hence, this PR only adds the ability to use Hadamard transforms for the K-cache. This should not be a big issue, as it is well known that K-cache quantization errors have a much bigger impact on model quality degradation than V-cache errors.

The results are, at least to me, quite astounding. The table below shows PPL for LLaMA-3-8B (model quantized with `Q4_0`) for 3 different K-cache quantization types, with and without a Hadamard transform. The V-cache is always `f16`. As the implementation is CPU-only, these PPL calculations take quite a bit of time, so just one model for now; I have left a more detailed comparison for a future PR adding Hadamard transforms to the CUDA backend. I did not run `Q8_0` K-cache with a Hadamard transform, as any difference would be basically within the noise. `Q6_0` K-cache is already quite good without Hadamard, but a Hadamard transform reduces the quantization error by more than a factor of 2. The most astounding result is the nearly 3X reduction in quantization error for `Q4_0` K-cache, which makes it a viable option when really short on memory. It is also interesting that, with a Hadamard transform applied, the benefit of the non-linear `IQ4_NL` quantization is much smaller than without it.

To use it, add `-khad` or `--k-cache-hadamard` to the command line. As mentioned, this is CPU-only for now: it will run on CUDA, but self-attention will be done on the CPU, so it will be much slower. The performance impact is negligible when running CPU-only.
Caveat: the attention head size must be a power of 2. This is true for almost all relevant models, with the notable exception of DeepSeek-V2/V3/R1 (and, by extension, Kimi-K2).