Add PyTorch-native K-quant pass #2479
Merged
Add KQuant pass under olive.passes.pytorch.kquant that implements the ggml K-quant weight-only search (the asymmetric make_qkx2_quants and the symmetric make_qx_quants variants) using a unified driver. It produces per-group scales and zero points compatible with WeightQuantizer, supports both asymmetric and symmetric quantization, and reuses the shared prepare_model/finalize plumbing so embeddings and lm_head can be quantized and retied like the other PyTorch quant passes. Also opt in 2-bit precisions (uint2/int2) for both KQuant and Rtn in olive_config.json; the underlying WeightQuantizer and pack/unpack helpers already support 2-bit, so only the registration was missing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
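To make the 2-bit case concrete, here is a minimal sketch of packing uint2 codes (values 0 to 3) four to a byte. The function names and the low-bits-first layout are assumptions for illustration only; Olive's actual pack/unpack helpers may use a different layout.

```python
def pack_uint2(codes):
    """Pack 2-bit codes (0..3) into bytes, four per byte, low bits first.

    Hypothetical helper for illustration; not Olive's pack routine.
    """
    if any(not 0 <= c <= 3 for c in codes):
        raise ValueError("uint2 codes must lie in [0, 3]")
    # Pad with zeros so the length is a multiple of four.
    padded = list(codes) + [0] * (-len(codes) % 4)
    packed = bytearray()
    for i in range(0, len(padded), 4):
        byte = 0
        for slot, code in enumerate(padded[i:i + 4]):
            byte |= code << (2 * slot)
        packed.append(byte)
    return bytes(packed)


def unpack_uint2(packed, count):
    """Inverse of pack_uint2; `count` drops the zero padding."""
    codes = []
    for byte in packed:
        for slot in range(4):
            codes.append((byte >> (2 * slot)) & 0b11)
    return codes[:count]


codes = [0, 1, 2, 3, 3, 2]
packed = pack_uint2(codes)
roundtrip = unpack_uint2(packed, len(codes))
```

Six codes fit in two bytes here, and `roundtrip` equals `codes`. A signed int2 range of -2 to 1 can reuse the same packing after adding an offset of 2 to each code.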
Pull request overview
Adds a new PyTorch-native KQuant pass for Hugging Face/PyTorch model weight-only quantization, using shared Olive quantization preparation/finalization flow and extending pass metadata to advertise 2-bit precisions.
Changes:
- Introduces `olive.passes.pytorch.kquant` with K-quant qparam search and pass implementation.
- Registers `KQuant` in `olive_config.json` and adds 2-bit precision support to `Rtn`.
- Adds PyTorch unit tests for KQuant qparam quality, pass execution, overrides, embeddings, and lm_head composition.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `olive/passes/pytorch/kquant.py` | Implements KQuant search, qparam generation, config validation, and pass execution. |
| `olive/olive_config.json` | Registers KQuant and updates Rtn supported precisions. |
| `test/passes/pytorch/test_kquant.py` | Adds coverage for KQuant qparams and end-to-end pass behavior. |
For asymmetric, clamp rmin <= 0 / rmax >= 0 and substitute the RTN-style (-1, 1) sentinel for all-zero groups so the normalizer is never zero. For symmetric, replace a zero normalizer with 1 (data is all zero anyway). Mirrors WeightQuantizer.find_qparams and ggml's make_qkx2_quants min<=0 clamp. Also updates the class docstring to mention 2-bit support. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
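The guards in this commit can be sketched roughly as follows for a single group of weights. The function names are hypothetical and this is plain Python rather than the tensor code in kquant.py; it only illustrates why the normalizer can never be zero.

```python
def asymmetric_range(group):
    """Return (rmin, rmax) with rmin <= 0 <= rmax.

    All-zero groups get the (-1, 1) sentinel so rmax - rmin is never zero.
    """
    rmin = min(min(group), 0.0)
    rmax = max(max(group), 0.0)
    if rmin == 0.0 and rmax == 0.0:
        rmin, rmax = -1.0, 1.0
    return rmin, rmax


def symmetric_normalizer(group):
    """Return max|x|, replaced by 1 when the group is all zero."""
    amax = max(abs(v) for v in group)
    return amax if amax > 0.0 else 1.0


def asymmetric_scale(group, maxq, minq):
    """Plain min/max scale; safe to divide by because of the guards above."""
    rmin, rmax = asymmetric_range(group)
    return (rmax - rmin) / (maxq - minq)
```

With these guards an all-zero group yields a finite scale, and its weights still dequantize to values that are all equal, so the choice of sentinel does not affect accuracy.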
xiaoyu-work approved these changes on May 29, 2026
Describe your changes

Add `KQuant` pass under `olive.passes.pytorch.kquant` that implements the ggml K-quant weight-only search in PyTorch. A unified `_kquant_search` driver covers both variants:

- Asymmetric (`make_qkx2_quants`, used by Q2_K/Q4_K/Q5_K): tracks per-group `(min, max)`, refines `(scale, offset)` via 2-D LSQ over a sweep of perturbed `iscale` factors.
- Symmetric (`make_qx_quants`, used by Q3_K/Q6_K): uses `max|x|` as normalizer, zero point fixed at `midq`, refines scale via 1-D LSQ `sumlx/suml2`.

Outputs `(scales, zero_points)` shaped to match `WeightQuantizer.find_qparams`, with the search parameterized by `(maxq, minq)` pulled directly from each module's `WeightQuantizer` (via `get_maxq_minq`) so the algorithm and `finalize` agree on the integer range. Uses the shared `prepare_model`/`finalize` plumbing so embeddings and `lm_head` can be quantized and retied like the other PyTorch quant passes (RTN, GPTQ).

Checklist before requesting a review

`lintrunner -a`

Release note: New `KQuant` PyTorch pass for weight-only K-quant quantization (asymmetric + symmetric, 2/4/8-bit). `Rtn` and `KQuant` now also advertise `uint2`/`int2` precisions.

(Optional) Issue link
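For readers unfamiliar with the two refinements named under "Describe your changes", the following is a minimal single-group sketch in plain Python. It is not the code in kquant.py: the function names, the unweighted least squares, the sweep range, and the 0.1 step are assumptions chosen for illustration, and the symmetric variant keeps signed codes instead of storing them around `midq`.

```python
import math


def nearest_int(t):
    """Round half up; a stand-in for ggml's nearest_int helper."""
    return int(math.floor(t + 0.5))


def sq_error(x, codes, scale, offset):
    return sum((scale * q + offset - v) ** 2 for q, v in zip(codes, x))


def search_asymmetric(x, maxq, sweep=range(-4, 5), rdelta=0.1):
    """Model x ~ scale * q + offset with q in [0, maxq] (unweighted sketch)."""
    rmin, rmax = min(min(x), 0.0), max(max(x), 0.0)
    if rmin == rmax:  # all-zero group: use the (-1, 1) sentinel
        rmin, rmax = -1.0, 1.0
    n, sum_x = len(x), sum(x)
    # Baseline: plain min/max quantization, kept unless the sweep beats it.
    scale, offset = (rmax - rmin) / maxq, rmin
    codes = [min(maxq, max(0, nearest_int((v - rmin) / scale))) for v in x]
    best = (sq_error(x, codes, scale, offset), scale, offset, codes)
    for step in sweep:
        iscale = (maxq + rdelta * step) / (rmax - rmin)  # perturbed inverse scale
        codes = [min(maxq, max(0, nearest_int(iscale * (v - rmin)))) for v in x]
        sum_l = sum(codes)
        sum_l2 = sum(q * q for q in codes)
        sum_xl = sum(q * v for q, v in zip(codes, x))
        det = n * sum_l2 - sum_l * sum_l
        if det <= 0:  # all codes equal: the 2-D system is singular
            continue
        # Closed-form 2-D least squares for (scale, offset) with codes fixed.
        scale = (n * sum_xl - sum_x * sum_l) / det
        offset = (sum_l2 * sum_x - sum_l * sum_xl) / det
        if offset > 0.0:  # keep min <= 0, then refit scale alone
            offset, scale = 0.0, sum_xl / sum_l2
        err = sq_error(x, codes, scale, offset)
        if err < best[0]:
            best = (err, scale, offset, codes)
    return best[1], best[2], best[3]


def search_symmetric(x, nmax, sweep=range(-4, 5), rdelta=0.1):
    """Model x ~ scale * q with signed q in [-nmax, nmax - 1]."""
    amax = max(abs(v) for v in x)
    if amax == 0.0:  # all-zero group: any non-zero normalizer works
        amax = 1.0
    best = None
    for step in sweep:
        iscale = (nmax + rdelta * step) / amax
        codes = [min(nmax - 1, max(-nmax, nearest_int(iscale * v))) for v in x]
        suml2 = sum(q * q for q in codes)
        if suml2 == 0:
            continue
        scale = sum(q * v for q, v in zip(codes, x)) / suml2  # 1-D LSQ: sumlx / suml2
        err = sq_error(x, codes, scale, 0.0)
        if best is None or err < best[0]:
            best = (err, scale, codes)
    if best is None:
        return 1.0, [0] * len(x)
    return best[1], best[2]


a_scale, a_offset, a_codes = search_asymmetric([0.0, 1.0, 2.0, 3.0], maxq=3)
s_scale, s_codes = search_symmetric([-2.0, -1.0, 0.0, 1.0], nmax=2)
```

Both example groups lie exactly on an integer grid, so the search recovers a scale of 1 with zero error. On real weights the sweep matters because a slightly perturbed `iscale` can change which codes are chosen, and the least-squares refit then finds the best scale (and offset) for that code assignment.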