This repo exposes every part of the quantization and compression process, which makes it useful for debugging, benchmarking, or building custom deployment tools.
This repository is a modular, test-driven framework for experimenting with weight quantization, linear layer replacement, and custom packing schemes in PyTorch. It includes full support for symmetric/asymmetric quantization, granular control over scaling strategies, and bit-level weight packing for memory compression.
- Modes: Symmetric, Asymmetric
- Granularity:
  - PerTensor
  - PerDimension (row/column)
  - PerGroup (e.g., 32-element groups)
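To make the modes and granularities concrete, here is a minimal sketch of int8 quantization in PyTorch. It is not the code in `linear_quantizer.py`: the function names (`quantize`, `dequantize`, `quantize_per_group`), their parameters, and the int8 range are assumptions chosen for illustration. The granularity is assumed to come down to which dimension the min/max statistics are reduced over.

```python
import torch

QMIN, QMAX = -128, 127  # assumed signed int8 range


def quantize(w, symmetric=True, dim=None):
    """Return (q, scale, zero_point) for a float tensor `w`.

    dim=None shares one scale across the tensor (per-tensor);
    dim=1 gives one scale per row, dim=0 one per column (per-dimension).
    """
    if dim is None:
        w_min, w_max = w.min(), w.max()
    else:
        w_min = w.amin(dim=dim, keepdim=True)
        w_max = w.amax(dim=dim, keepdim=True)

    if symmetric:
        # The range is centred on zero, so the zero-point is always 0.
        scale = (torch.maximum(w_min.abs(), w_max.abs()) / QMAX).clamp(min=1e-8)
        zero_point = torch.zeros_like(scale)
    else:
        # Map [w_min, w_max] onto the full integer range [QMIN, QMAX].
        scale = ((w_max - w_min) / (QMAX - QMIN)).clamp(min=1e-8)
        zero_point = torch.round(QMIN - w_min / scale).clamp(QMIN, QMAX)

    q = torch.round(w / scale + zero_point).clamp(QMIN, QMAX).to(torch.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover an approximation of the original floats."""
    return (q.to(scale.dtype) - zero_point) * scale


def quantize_per_group(w, group_size=32, symmetric=True):
    """Per-group: each run of `group_size` consecutive elements gets its own scale."""
    groups = w.reshape(-1, group_size)  # requires w.numel() % group_size == 0
    return quantize(groups, symmetric=symmetric, dim=1)


w = torch.randn(4, 64)
q, scale, zp = quantize(w, symmetric=False, dim=1)  # asymmetric, per-row
print(scale.shape, (w - dequantize(q, scale, zp)).abs().max())

qg, sg, zg = quantize_per_group(w, group_size=32)   # symmetric, per-group
w_hat = dequantize(qg, sg, zg).reshape(w.shape)
print(sg.shape, (w - w_hat).abs().max())
```

Finer granularity stores more scales but usually lowers the reconstruction error, because an outlier only stretches the range of its own row, column, or group.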
- `QWQALinearLayer`: A quantized wrapper over `nn.Linear` supporting `int8` weights with `float32`/`bfloat16` activations.
- Dynamically replaces any `nn.Linear` module in a PyTorch model with quantized counterparts using: `replace_linear_layers_with_w8a16(model, Target.Linear(QWQALinearLayer), exclude_list)`
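The replacement step can be pictured as a recursive walk over the module tree. The sketch below is an illustration only, not the repository's implementation: `QuantLinearStub`, `replace_linear_layers`, and `TinyModel` are hypothetical names, and the stub assumes per-row symmetric int8 weights that are cast to the activation dtype at call time.

```python
import torch
from torch import nn


class QuantLinearStub(nn.Module):
    """Hypothetical stand-in for a W8A16 layer: int8 weights, float activations."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.detach()
        # One symmetric scale per output row (an assumption for this sketch).
        scale = (w.abs().amax(dim=1) / 127).clamp(min=1e-8)
        self.register_buffer("int8_weight", torch.round(w / scale[:, None]).to(torch.int8))
        self.register_buffer("scale", scale)
        bias = None if linear.bias is None else linear.bias.detach().clone()
        self.register_buffer("bias", bias)

    def forward(self, x):
        # Cast weights to the activation dtype, multiply, then rescale per output.
        out = nn.functional.linear(x, self.int8_weight.to(x.dtype)) * self.scale.to(x.dtype)
        return out if self.bias is None else out + self.bias.to(x.dtype)


def replace_linear_layers(module, target_cls, exclude):
    """Swap every nn.Linear child not named in `exclude` for `target_cls`."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear) and name not in exclude:
            setattr(module, name, target_cls(child))
        else:
            replace_linear_layers(child, target_cls, exclude)  # recurse into submodules
    return module


class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 32)
        self.head = nn.Linear(32, 4)

    def forward(self, x):
        return self.head(torch.relu(self.fc2(torch.relu(self.fc1(x)))))


model = replace_linear_layers(TinyModel(), QuantLinearStub, exclude=["head"])
print(model)  # fc1 and fc2 are quantized stubs, head is still nn.Linear
```

An exclude list of this kind is commonly used to keep sensitive layers, such as a final output head, in full precision.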
- Packs 2D tensors into lower-bit formats using bitwise operations (e.g., pack 2-bit weights into `uint8`)
- Optimized for memory compression and alignment
- Includes corresponding unpack routines
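As a rough illustration of the packing idea, the sketch below stores several low-bit values in each `uint8` by shifting and OR-ing them together, then reverses the process with shifts and a mask. The names `pack_weights` and `unpack_weights`, packing along the last dimension, and the low-bits-first ordering are all assumptions; the routines in `weight_pack.py` may use a different layout.

```python
import torch


def pack_weights(values: torch.Tensor, bits: int) -> torch.Tensor:
    """Pack a 2D tensor of unsigned `bits`-wide ints along its last dim into uint8.

    Assumes every value is already in [0, 2**bits).
    """
    if 8 % bits != 0:
        raise ValueError("bits must divide 8 (1, 2, 4 or 8)")
    per_byte = 8 // bits
    rows, cols = values.shape
    if cols % per_byte != 0:
        raise ValueError(f"number of columns must be a multiple of {per_byte}")
    # Work in int32 so the shifts cannot overflow, then narrow to uint8.
    grouped = values.to(torch.int32).reshape(rows, cols // per_byte, per_byte)
    packed = torch.zeros(rows, cols // per_byte, dtype=torch.int32)
    for i in range(per_byte):
        # Slot i occupies bits [i * bits, (i + 1) * bits) of the byte.
        packed |= grouped[:, :, i] << (bits * i)
    return packed.to(torch.uint8)


def unpack_weights(packed: torch.Tensor, bits: int) -> torch.Tensor:
    """Inverse of pack_weights: recover the original 2D tensor of small ints."""
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    wide = packed.to(torch.int32)
    slots = [(wide >> (bits * i)) & mask for i in range(per_byte)]
    return torch.stack(slots, dim=-1).reshape(packed.shape[0], -1).to(torch.uint8)


values = torch.tensor([[1, 0, 3, 2], [3, 3, 3, 3]], dtype=torch.uint8)
packed = pack_weights(values, bits=2)
print(packed)                          # four 2-bit values per byte
print(unpack_weights(packed, bits=2))  # round-trips to `values`
```

With 2-bit values the packed tensor uses a quarter of the bytes of the unpacked `uint8` tensor.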
quantization/
├── linear_layer.py # Quantized LinearLayer and model replacement logic
├── linear_quantizer.py # Quantization logic: scale, zero-point, modes and granularity
├── weight_pack.py # Bitwise tensor packing/unpacking routines
├── main.py # Integration test suite for quantization and replacement
├── test_weight_pack.py # Unit tests for weight packing
├── test_linear_quantizer.py # Unit tests for linear quantization logic
└── README.md # Project overview and documentation
Run the integration test suite:
python3 main.py
Or run individual unit test modules:
python3 -m unittest test_linear_quantizer.py
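python3 -m unittest test_weight_pack.py
Example usage:
import torch
from linear_layer import QWQALinearLayer

# Arguments are assumed to mirror nn.Linear(in_features, out_features)
layer = QWQALinearLayer(16, 32)
x = torch.randn(4, 16)  # batch of 4 inputs with 16 features each
output = layer(x)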