2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -180,6 +180,8 @@
title: quanto
- local: quantization/modelopt
title: NVIDIA ModelOpt
- local: quantization/autoround
title: AutoRound
title: Quantization
- isExpanded: false
sections:
206 changes: 206 additions & 0 deletions docs/source/en/quantization/autoround.md
@@ -0,0 +1,206 @@
<!-- Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoRound

[AutoRound](https://github.com/intel/auto-round) is an advanced quantization toolkit. It uses sign-gradient descent to reach high accuracy at ultra-low bit widths (2-4 bits) with minimal tuning, and it supports a broad range of hardware. See our papers [SignRoundV1](https://arxiv.org/pdf/2309.05516) and [SignRoundV2](https://arxiv.org/abs/2512.04746) for more details.


Install `auto-round` (version ≥ 0.13.0):

```bash
pip install "auto-round>=0.13.0"
```

To use the Marlin kernel for faster CUDA inference, install `gptqmodel`:

```bash
pip install "gptqmodel>=5.8.0"
```
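
To confirm that the installed packages meet the minimum versions listed above, a small check along these lines may help. It is a minimal sketch using only the standard library; `parse_version`, `meets_minimum`, and `check_requirement` are hypothetical helper names, not part of AutoRound or Diffusers, and the version parsing is deliberately simplistic (it keeps only the leading digits of each dot-separated part).

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '0.13.1' into (0, 13, 1), keeping the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirement(package, minimum):
    """Return a one-line status string for an installed (or missing) package."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed"
    status = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
    return f"{package} {installed}: {status}"


if __name__ == "__main__":
    print(check_requirement("auto-round", "0.13.0"))
    print(check_requirement("gptqmodel", "5.8.0"))
```

The numeric comparison matters for versions such as `5.10.0`, which a plain string comparison would rank below `5.8.0`.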

## Load a quantized model

Load a pre-quantized AutoRound model by passing [`AutoRoundConfig`] to [`~ModelMixin.from_pretrained`]. The method works with any model that loads via [Accelerate](https://hf.co/docs/accelerate/index) and has `torch.nn.Linear` layers.

Use [`PipelineQuantizationConfig`] to apply the quantization config to specific components of a pipeline:

```python
import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, AutoRoundConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": AutoRoundConfig(backend="auto")}
)
pipe = DiffusionPipeline.from_pretrained(
    "INCModel/Z-Image-W4A16-AutoRound",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

image = pipe("a cat holding a sign that says hello").images[0]
image.save("output.png")
```

Or load a quantized model component directly:

```python
import torch
from diffusers import ZImageTransformer2DModel, ZImagePipeline, AutoRoundConfig

model_id = "INCModel/Z-Image-W4A16-AutoRound"

quantization_config = AutoRoundConfig(backend="auto")
transformer = ZImageTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

pipe = ZImagePipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

image = pipe("a cat holding a sign that says hello").images[0]
image.save("output.png")
```

> [!NOTE]
> AutoRound in Diffusers only supports loading *pre-quantized* models. To quantize a model from scratch, use the [AutoRound CLI or Python API](https://github.com/intel/auto-round) directly, then load the result with Diffusers.

## torch.compile

AutoRound is compatible with [`torch.compile`](../optimization/fp16#torchcompile) for faster inference. You can compile the quantized transformer (DiT) for better performance:

```python
import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, AutoRoundConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": AutoRoundConfig(backend="auto")}
)
pipe = DiffusionPipeline.from_pretrained(
    "INCModel/Z-Image-W4A16-AutoRound",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

pipe.transformer = torch.compile(pipe.transformer, mode="default", fullgraph=False)
```

## Backends

AutoRound supports multiple inference backends for weight-only quantized models. The backend controls which kernel handles dequantization during the forward pass. Set the `backend` parameter in [`AutoRoundConfig`] to choose one:

| Backend | Value | Device | Requirements | Notes |
|---------|-------|--------|--------------|-------|
| **Auto** | `"auto"` | Any | — | Default. Automatically selects the best available backend. |
| **PyTorch** | `"torch"` | CPU / CUDA | — | Pure PyTorch implementation. Broadest compatibility. |
| **Triton** | `"tritonv2"` | CUDA | `triton` | Triton-based kernel for GPU inference. |
| **ExllamaV2** | `"exllamav2"` | CUDA | `gptqmodel>=5.8.0` | Good CUDA performance via the ExllamaV2 kernel. |
| **Marlin** | `"marlin"` | CUDA | `gptqmodel>=5.8.0` | Best CUDA performance via the Marlin kernel. |


```python
from diffusers import AutoRoundConfig

# Auto-select (default)
config = AutoRoundConfig()

# Explicit Triton backend for CUDA
config = AutoRoundConfig(backend="tritonv2")

# Marlin backend for best CUDA performance (requires gptqmodel>=5.8.0)
config = AutoRoundConfig(backend="marlin")

# ExllamaV2 backend for good CUDA performance (requires gptqmodel>=5.8.0)
config = AutoRoundConfig(backend="exllamav2")

# PyTorch backend for CPU/CUDA inference
config = AutoRoundConfig(backend="torch")
```
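
The table above can also be read as a simple decision rule. The sketch below encodes one possible reading of it; `pick_backend` and `is_installed` are hypothetical helpers, not part of Diffusers or AutoRound, and `backend="auto"` already performs its own selection, so this is only useful when you want an explicit, reproducible choice.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_backend(cuda_available, has_gptqmodel, has_triton):
    """Map the environment to a `backend` value from the table (illustrative only)."""
    if not cuda_available:
        return "torch"      # the only explicitly named backend listed for CPU
    if has_gptqmodel:
        return "marlin"     # listed as the best CUDA performance
    if has_triton:
        return "tritonv2"   # Triton kernel, needs only `triton`
    return "torch"          # pure PyTorch fallback on CUDA
```

In practice you would pass `torch.cuda.is_available()`, `is_installed("gptqmodel")`, and `is_installed("triton")` as the three arguments, then hand the result to `AutoRoundConfig(backend=...)`.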


## Save and load

<hfoptions id="save-and-load">
<hfoption id="save">

AutoRound needs calibration data to quantize a model. Quantization happens outside of Diffusers, using the [AutoRound library](https://github.com/intel/auto-round) directly:

```python
from auto_round import AutoRound

autoround = AutoRound(
    "Tongyi-MAI/Z-Image",
    scheme="W4A16",  # W4G128 symmetric
    enable_torch_compile=True,
    num_inference_steps=3,
    guidance_scale=7.5,
    dataset="coco2014",
)
autoround.quantize_and_save("Z-Image-W4A16-AutoRound")
```

For more details on calibration options, see the [AutoRound documentation](https://github.com/intel/auto-round).

</hfoption>
<hfoption id="load">


```python
import torch
from diffusers import ZImagePipeline

model_id = "INCModel/Z-Image-W4A16-AutoRound"

# The inference backend will be automatically selected.
pipe = ZImagePipeline.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

image = pipe("a cat holding a sign that says hello").images[0]
image.save("output.png")
```
</hfoption>
</hfoptions>


### Supported quantization schemes

AutoRound supports several schemes:

- **W4A16** (bits: 4, group_size: 128, sym: True, act_bits: 16)
- **W8A16** (bits: 8, group_size: 128, sym: True, act_bits: 16)
- **W3A16** (bits: 3, group_size: 128, sym: True, act_bits: 16)
- **W2A16** (bits: 2, group_size: 128, sym: True, act_bits: 16)
- **GGUF:Q4_K_M** (all `Q*_K`, `Q*_0`, and `Q*_1` types provided by llama.cpp are supported)
- **NVFP4** (experimental feature; exporting to the `llm_compressor` format is recommended; data_type: nvfp4, act_data_type: nvfp4, static_global_scale, group_size: 16)
- **MXFP4** (**research feature, no real kernel**; standard MXFP4; data_type: mxfp, act_data_type: mxfp, bits: 4, act_bits: 4, group_size: 32)
- **MXINT4** (**research feature, no real kernel**; standard MXINT4; data_type: mxint, act_data_type: mxint, bits: 4, act_bits: 4, group_size: 32)
- **MXFP4_RCEIL** (**research feature, no real kernel**; NVIDIA's variant; data_type: mxfp, act_data_type: mxfp_rceil, bits: 4, act_bits: 4, group_size: 32)
- **MXFP8** (**research feature, no real kernel**; data_type: mxfp, act_data_type: mxfp_rceil, group_size: 32)
- **FPW8A16** (**research feature, no real kernel**; data_type: fp8, group_size: 0, i.e. per tensor)
- **FP8_STATIC** (**research feature, no real kernel**; data_type: fp8, act_data_type: fp8, group_size: -1, i.e. per channel; act_group_size: 0, i.e. per tensor)

You can also customize `group_size`, `bits`, `sym`, and other settings, though a custom configuration may not have a matching inference kernel.
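
To get a feel for what the integer weight-only schemes mean for storage, the following sketch does the arithmetic for group-wise quantization. The `Scheme` dataclass, `SCHEMES` table, and both functions are hypothetical illustrations, with values copied from the list above. It assumes one 16-bit scale per group and no stored zero point; real checkpoint formats may store extra metadata, so treat the numbers as rough lower bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    bits: int
    group_size: int
    sym: bool
    act_bits: int


# Integer weight-only presets, as listed in the scheme list.
SCHEMES = {
    "W4A16": Scheme(bits=4, group_size=128, sym=True, act_bits=16),
    "W8A16": Scheme(bits=8, group_size=128, sym=True, act_bits=16),
    "W3A16": Scheme(bits=3, group_size=128, sym=True, act_bits=16),
    "W2A16": Scheme(bits=2, group_size=128, sym=True, act_bits=16),
}


def bits_per_weight(scheme, scale_bits=16):
    """Average bits per weight, ASSUMING one scale of `scale_bits` per group."""
    return scheme.bits + scale_bits / scheme.group_size


def weight_gib(num_params, scheme, scale_bits=16):
    """Approximate size in GiB of `num_params` quantized weights."""
    total_bits = num_params * bits_per_weight(scheme, scale_bits)
    return total_bits / 8 / 2**30
```

Under these assumptions, W4A16 with a group size of 128 averages 4 + 16/128 = 4.125 bits per weight, compared with 16 bits for an unquantized bfloat16 weight.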

## Resources

- [Pre-quantized AutoRound models on the Hub](https://huggingface.co/models?search=autoround)
1 change: 1 addition & 0 deletions setup.py
@@ -130,6 +130,7 @@
"onnx",
"optimum_quanto>=0.2.6",
"gguf>=0.10.0",
"auto-round>=0.13.0",
"torchao>=0.7.0",
"bitsandbytes>=0.43.3",
"nvidia_modelopt[hf]>=0.33.1",
21 changes: 21 additions & 0 deletions src/diffusers/__init__.py
@@ -7,6 +7,7 @@
OptionalDependencyNotAvailable,
_LazyModule,
is_accelerate_available,
is_auto_round_available,
is_bitsandbytes_available,
is_flax_available,
is_gguf_available,
@@ -123,6 +124,18 @@
else:
_import_structure["quantizers.quantization_config"].append("NVIDIAModelOptConfig")

try:
if not is_auto_round_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from .utils import dummy_auto_round_objects

_import_structure["utils.dummy_auto_round_objects"] = [
name for name in dir(dummy_auto_round_objects) if not name.startswith("_")
]
else:
_import_structure["quantizers.quantization_config"].append("AutoRoundConfig")

try:
if not is_onnx_available():
raise OptionalDependencyNotAvailable()
@@ -982,6 +995,14 @@
else:
from .quantizers.quantization_config import NVIDIAModelOptConfig

try:
if not is_auto_round_available():
raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
from .utils.dummy_auto_round_objects import *
else:
from .quantizers.quantization_config import AutoRoundConfig

try:
if not is_onnx_available():
raise OptionalDependencyNotAvailable()
1 change: 1 addition & 0 deletions src/diffusers/dependency_versions_table.py
@@ -37,6 +37,7 @@
"onnx": "onnx",
"optimum_quanto": "optimum_quanto>=0.2.6",
"gguf": "gguf>=0.10.0",
"auto-round": "auto-round>=0.13.0",
"torchao": "torchao>=0.7.0",
"bitsandbytes": "bitsandbytes>=0.43.3",
"nvidia_modelopt[hf]": "nvidia_modelopt[hf]>=0.33.1",
17 changes: 17 additions & 0 deletions src/diffusers/quantizers/auto.py
@@ -18,10 +18,12 @@

import warnings

from .autoround import AutoRoundQuantizer
from .bitsandbytes import BnB4BitDiffusersQuantizer, BnB8BitDiffusersQuantizer
from .gguf import GGUFQuantizer
from .modelopt import NVIDIAModelOptQuantizer
from .quantization_config import (
AutoRoundConfig,
BitsAndBytesConfig,
GGUFQuantizationConfig,
NVIDIAModelOptConfig,
@@ -41,6 +43,7 @@
"quanto": QuantoQuantizer,
"torchao": TorchAoHfQuantizer,
"modelopt": NVIDIAModelOptQuantizer,
"auto-round": AutoRoundQuantizer,
}

AUTO_QUANTIZATION_CONFIG_MAPPING = {
@@ -50,6 +53,7 @@
"quanto": QuantoConfig,
"torchao": TorchAoConfig,
"modelopt": NVIDIAModelOptConfig,
"auto-round": AutoRoundConfig,
}


@@ -143,6 +147,19 @@ def merge_quantization_configs(
if isinstance(quantization_config, NVIDIAModelOptConfig):
quantization_config.check_model_patching()

if quantization_config_from_args is not None and isinstance(quantization_config, AutoRoundConfig):
# For AutoRound, allow overriding fields like `backend` from user args,
# since the model config may store a default value (e.g. backend="auto").
for key, value in quantization_config_from_args.__dict__.items():
if key in ("quant_method",):
continue
if hasattr(quantization_config, key) and getattr(quantization_config, key) != value:
warnings.warn(
f"Overriding `{key}` in the model's quantization_config with value {value!r} "
f"from the user-provided `quantization_config`."
)
setattr(quantization_config, key, value)

if warning_msg != "":
warnings.warn(warning_msg)
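
The override loop in this hunk can be exercised in isolation. The sketch below is a stand-in under stated assumptions: `DemoConfig` and `apply_user_overrides` are hypothetical names, not part of Diffusers, and it assumes the `setattr` call sits inside the conditional, which this flattened diff view does not show.

```python
import warnings
from dataclasses import dataclass


@dataclass
class DemoConfig:
    # Hypothetical stand-in for AutoRoundConfig with only a few fields.
    quant_method: str = "auto-round"
    backend: str = "auto"
    bits: int = 4


def apply_user_overrides(model_config, user_config):
    """Copy differing fields from user_config onto model_config, warning on each."""
    for key, value in vars(user_config).items():
        if key in ("quant_method",):
            continue  # never override the quantization method itself
        if hasattr(model_config, key) and getattr(model_config, key) != value:
            warnings.warn(
                f"Overriding `{key}` in the model's quantization_config with value {value!r} "
                f"from the user-provided `quantization_config`."
            )
            setattr(model_config, key, value)
    return model_config


if __name__ == "__main__":
    stored = DemoConfig()                  # what the checkpoint's config holds
    user = DemoConfig(backend="marlin")    # what the caller passed in
    merged = apply_user_overrides(stored, user)
    print(merged.backend)
```

Only fields whose values differ trigger a warning, so passing a config identical to the stored one is silent.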

1 change: 1 addition & 0 deletions src/diffusers/quantizers/autoround/__init__.py
@@ -0,0 +1 @@
from .autoround_quantizer import AutoRoundQuantizer