Add FP8 kernel acceleration for compressed-tensors quantized models by jiqing-feng · Pull Request #45699 · huggingface/transformers

jiqing-feng · 2026-04-29T08:31:50Z

What does this PR do?

This PR adds native FP8 matmul kernel support for compressed-tensors FP8 quantized models in transformers. Previously, compressed-tensors FP8 models were loaded via the compressed-tensors library and dequantized back to FP16/BF16 for inference. With this change, FP8 weights are kept in FP8 format and inference uses hardware-accelerated FP8 matmul kernels (torch._scaled_mm on XPU, fbgemm.f8f8bf16_rowwise on CUDA).

Key changes:

New file: src/transformers/integrations/compressed_tensors_fp8.py

CTFP8Linear: FP8 linear layer that stores weights in FP8 and uses row-wise FP8 matmul kernels. Activations are dynamically quantized per-row via quantize_fp8_per_row.
Weight converters (CompressedTensorsScaleConvert, CompressedTensorsFp8Dequantize) to handle the checkpoint format conversion (e.g. weight_scale → weight_scale_inv).
CTFP8PerRowQuantize: Online quantization support — quantize BF16 weights to FP8 per-row on-the-fly during model loading.

Modified: src/transformers/quantizers/quantizer_compressed_tensors.py

CompressedTensorsHfQuantizer now detects FP8 quantization configs (float type, num_bits=8) and automatically routes to the FP8 kernel path when GPU/XPU is available. Falls back to the default compressed-tensors dequantize path on CPU.
Added get_weight_conversions() and get_quantize_ops() to support both pre-quantized loading and online quantization.
No changes to the non-FP8 code path — existing INT8/INT4 compressed-tensors models are unaffected.

Modified: src/transformers/quantizers/auto.py

Minor formatting change (no functional change).

Supported models

Per-channel dynamic: e.g. RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic
Per-tensor static: e.g. RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8
Online quantization: Any BF16 model can be quantized to FP8 on-the-fly by passing a CompressedTensorsConfig with an FP8 quantization scheme.

Usage

Pre-quantized model (no config needed)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

Online quantization

import torch
from transformers import AutoModelForCausalLM, CompressedTensorsConfig
from compressed_tensors.quantization import QuantizationScheme, QuantizationArgs, QuantizationType, QuantizationStrategy

ct_config = CompressedTensorsConfig(
    config_groups={
        "group_0": QuantizationScheme(
            weights=QuantizationArgs(
                num_bits=8, type=QuantizationType.FLOAT, strategy=QuantizationStrategy.CHANNEL,
            ),
        ),
    },
    run_compressed=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=ct_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

Devices

XPU (Intel Data Center Max / Arc): uses torch._scaled_mm
CUDA (SM89+): uses fbgemm.f8f8bf16_rowwise
CPU: falls back to default compressed-tensors dequantize path
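
The device list above amounts to a small dispatch rule. The sketch below restates it in plain Python. The function name select_fp8_path and the returned labels are hypothetical, and treating CUDA devices older than SM89 like CPU is an assumption. In practice the capability tuple would come from something like torch.cuda.get_device_capability().

```python
from typing import Optional, Tuple


def select_fp8_path(device_type: str, cuda_capability: Optional[Tuple[int, int]] = None) -> str:
    """Return a label for the matmul path, following the device list above (hypothetical helper)."""
    if device_type == "xpu":
        return "torch._scaled_mm"
    if device_type == "cuda" and cuda_capability is not None and cuda_capability >= (8, 9):
        # Tuple comparison: (8, 9) and (9, 0) qualify, (8, 0) does not.
        return "fbgemm.f8f8bf16_rowwise"
    # CPU, or CUDA older than SM89 (assumption: handled like CPU in this sketch).
    return "dequantize"
```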

@sywangyi

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

Rocketknight1 · 2026-04-29T10:23:33Z

cc @SunMarc

SunMarc

Thanks, left a comment !

jiqing-feng · 2026-04-30T04:49:42Z

Hi @SunMarc . Please check if the integration is OK. I'll clean up the tests and docs once you approve the integration.

jiqing-feng · 2026-05-07T01:54:58Z

Hi @SunMarc . Would you please review the PR? Thanks!

jiqing-feng · 2026-05-08T07:35:50Z

Hi @Rocketknight1 . It seems that @SunMarc does not have bandwidth to review this PR. Would you please help to review the PR? Thanks!

stevhliu

thanks! one more minor change, otherwise docs lgtm :)

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

IlyasMoutawwakil · 2026-06-03T10:37:18Z

+        else:
+            # CUDA SM80 (A100): no FP8 hardware, dequantize weight to BF16 + normal matmul
+            w = self.weight.to(input.dtype) * self.weight_scale_inv.to(input.dtype)
+            output = F.linear(input.view(-1, input.shape[-1]), w, self.bias)


a bit opinionated but i don't think we should have a dequant path here, for me it kinda goes against the principle of quantizers. wdyt @SunMarc

I agree with keeping the quantizer pure, but removing the dequant fallback conflicts with checking hardware at runtime (the ZeroGPU requirement).

If we delay the hardware check to forward() and get assigned an SM80 (A100) GPU at runtime, the model is already initialized with CompressedTensorsFP8Linear. Without the fallback, the forward pass will hard crash.

How should we handle this?

Keep the fallback (and maybe add a logger.warning_once).

Remove the fallback and raise a RuntimeError if an unsupported GPU is detected at runtime.

IlyasMoutawwakil

sorry for the late re-review 😅, overall pretty good, just a couple of change requests around naming and a question around cpu fallback for @SunMarc

HuggingFaceDocBuilderDev · 2026-06-04T08:23:41Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

IlyasMoutawwakil

LGTM for the integration part ! cc @SunMarc for the quantizer part

ArthurZucker

Nice! Nit on walking / updating existing conversion

jiqing-feng · 2026-06-05T02:40:06Z

Hi @ArthurZucker . Would you please review my new commit to check whether it addresses your comments? Thanks!

jiqing-feng · 2026-06-11T02:57:30Z

Hi @SunMarc . Would you please let me know what needs to be changed before merging? Thanks!

SunMarc

Thanks for your work !

I think it would still be better if we don't do online quantization, no? Users can just use compressed-tensors to do it. If they want online FP8, they have finegrained-fp8 for that, and we can update the support if needed. The thing is that I don't want to introduce a new way to create CT checkpoints, plus the maintenance overhead of reverse ops that need to match both the CT implementation and ours. Right now, what you did is dequantize back to BF16 if someone wants to save the online-quantized model.

Also, for the FP8 kernels, we should probably add a new arg called dequantize, like the other quantization methods. run_compressed doesn't work anymore, as you saw: CT just decompresses the model on the first forward.

SunMarc · 2026-06-11T14:09:59Z

+            x = input.reshape(-1, input.shape[-1])
+            x_quantized, x_scale = _quantize_fp8_per_row(x)
+
+            weight_scale_float32 = self.weight_scale_inv.to(torch.float32)


The weight_scale will be in float32 if you specified it in the modeling so we don't need to do that

Suggested change (delete this line):

weight_scale_float32 = self.weight_scale_inv.to(torch.float32)

SunMarc · 2026-06-11T14:12:12Z

+            scale_b = weight_scale_float32.t()
+            if scale_b.shape[-1] == 1 and self.out_features > 1:


maybe define a variable called is_per_tensor ?
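
As an illustration of what such a variable could capture, the sketch below works on shapes only. It assumes the weight scale is stored as (1, 1) for per-tensor and (N, 1) for per-channel quantization, matching the reshape rule in the PR's scale converter (scalar to (1, 1), 1D (N,) to (N, 1)). The helper names are hypothetical.

```python
from typing import Tuple


def normalize_scale_shape(shape: Tuple[int, ...]) -> Tuple[int, int]:
    """Apply the reshape rule: scalar -> (1, 1), (N,) -> (N, 1), 2D unchanged."""
    if len(shape) == 0:
        return (1, 1)
    if len(shape) == 1:
        return (shape[0], 1)
    if len(shape) == 2:
        return (shape[0], shape[1])
    raise ValueError(f"unexpected scale shape: {shape}")


def is_per_tensor(scale_shape: Tuple[int, ...], out_features: int) -> bool:
    """Mirror the quoted condition, which inspects the transposed scale."""
    rows, cols = normalize_scale_shape(scale_shape)
    transposed = (cols, rows)  # shape of scale.t()
    return transposed[-1] == 1 and out_features > 1
```

Naming the condition once makes the later branch read as intent ("per-tensor scale, expand it") rather than as a shape check.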

SunMarc · 2026-06-11T14:15:00Z

+
+        if _can_use_fp8_kernel():
+            # XPU or CUDA SM89+: FP8 kernel path (quantize activation + scaled_mm)
+            x = input.reshape(-1, input.shape[-1])


maybe we can reshape the input before as we do for both path ?

SunMarc · 2026-06-11T14:16:08Z

+
+        module_kwargs = {} if pre_quantized else {"dtype": None}
+        if isinstance(module, nn.Linear):
+            with torch.device("meta"):


we don't need this normally as this method is already under a context manager that does this

Suggested change (delete this line):

with torch.device("meta"):

SunMarc · 2026-06-11T14:21:07Z

+def _is_fp8_config(quantization_config: CompressedTensorsConfig) -> bool:
+    """Check if a CompressedTensorsConfig describes FP8 quantization."""
+    ct_qconfig = quantization_config.quantization_config
+    if ct_qconfig is None:
+        return False
+    for group in ct_qconfig.config_groups.values():
+        weights = group.weights
+        if weights is not None and weights.type == "float" and weights.num_bits == 8:
+            return True
+    return False
+


we can move that to compressed_tensors config class maybe ?
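
A rough sketch of what moving the check onto the config class could look like. The dataclasses below are hypothetical stand-ins for the real compressed-tensors and transformers classes, and exposing the check as a property is an assumption; only the detection logic follows the quoted diff.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class WeightArgsSketch:
    """Stand-in for the weight quantization args (hypothetical)."""
    type: str
    num_bits: int


@dataclass
class GroupSketch:
    """Stand-in for one config group (hypothetical)."""
    weights: Optional[WeightArgsSketch] = None


@dataclass
class InnerConfigSketch:
    """Stand-in for the nested quantization config (hypothetical)."""
    config_groups: Dict[str, GroupSketch] = field(default_factory=dict)


@dataclass
class ConfigSketch:
    """Stand-in for the outer config class that would own the check."""
    quantization_config: Optional[InnerConfigSketch] = None

    @property
    def is_fp8(self) -> bool:
        # Same logic as the quoted helper: any group with 8-bit float weights.
        if self.quantization_config is None:
            return False
        return any(
            g.weights is not None and g.weights.type == "float" and g.weights.num_bits == 8
            for g in self.quantization_config.config_groups.values()
        )
```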

SunMarc · 2026-06-11T14:36:33Z

+        if self.is_fp8:
+            return False


depends if we dequantize or not also

SunMarc · 2026-06-11T14:36:42Z

+        if self.is_fp8:
+            return False


SunMarc · 2026-06-11T14:39:28Z

+class CompressedTensorsActivationScaleConvert(ConversionOps):
+    """Rename compressed-tensors `input_scale` to `activation_scale`."""
+
+    def convert(self, input_dict, **kwargs):
+        scale = input_dict["input_scale"][0]
+        return {"activation_scale": scale.to(torch.float32)}
+
+    @property
+    def reverse_op(self):
+        return _IdentityOp()


we can just keep the same name no ?

SunMarc · 2026-06-11T14:45:13Z

+class CompressedTensorsScaleConvert(ConversionOps):
+    """Convert compressed-tensors `weight_scale` to `weight_scale_inv`.
+
+    In compressed-tensors, `weight_scale` is the dequantization multiplier:
+        bf16_weight = fp8_weight * weight_scale
+
+    In our CompressedTensorsFP8Linear, `weight_scale_inv` has the same semantics (it's
+    multiplied with the FP8 weight to get the dequantized value), so no inversion is needed.
+    The conversion also reshapes the scale: scalar → (1, 1), 1D (N,) → (N, 1).
+    """
+


same here, we can keep the same name no ?

SunMarc · 2026-06-11T14:49:48Z

+class CompressedTensorsFp8Dequantize(ConversionOps):
+    """Dequantize compressed-tensors FP8 weights back to BF16.
+
+    Folds the per-channel / per-tensor ``weight_scale`` into the FP8 weight,
+    producing a BF16 tensor. Prepended to a converter chain for layers that
+    cannot stay in FP8 (e.g. merged MoE experts, which are not ``nn.Linear``):
+    it pairs each weight with its sibling scale *by index* and preserves the
+    per-expert list structure so the downstream merge / concat ops still see
+    one tensor per expert.
+    """
+


Do we really need this? It is not really useful to online-quantize a model only to finally save it in bf16, no?

With compressed tensors, it will dequantize the model in any case if you specify run_compressed=False for a quantized model, no?

also can you explain the moe bit, i didn't fully understand

jiqing-feng · 2026-06-12T03:10:12Z

Hi @SunMarc, thanks for the review! Addressed everything; kept the MoE dequant-before-merge on purpose (explained below).

Done

No more online quant — removed CompressedTensorsFP8PerRowQuantize; get_quantize_ops returns None and param_needs_quantization is False for FP8. Online FP8 → use finegrained-fp8.
dequantize arg — added dequantize: bool = False to CompressedTensorsConfig (like the other methods), wired through to replace_with_compressed_tensors_fp8_linear. run_compressed no longer drives this.
(1) dropped the redundant .to(float32) (weight_scale is already f32).
(2) added an is_per_tensor variable.
(3) input is reshaped to 2D once, shared by both paths.
(4) removed with torch.device("meta").
(5) moved _is_fp8_config → CompressedTensorsConfig.is_fp8.
(6) / (7) is_trainable / is_qat_trainable now return dequantize for FP8 (trainable only on the dequantized BF16 path).
(8) kept input_scale — the converter was dead code, removed it.
(9) kept weight_scale (no more weight_scale_inv); the converter only reshapes.

Side effect: CompressedTensorsFp8Dequantize.reverse_op is now _IdentityOp() (we never re-quantize on save).
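
The trainability rule from items (6) / (7) can be sketched in a few lines. QuantizerSketch is a hypothetical stand-in, not the real quantizer class, and the non-FP8 branch is a placeholder because the existing behavior for other formats is not described in this thread.

```python
from dataclasses import dataclass


@dataclass
class QuantizerSketch:
    """Hypothetical stand-in for the quantizer's trainability flags."""
    is_fp8: bool
    dequantize: bool = False
    # Placeholder: whatever the quantizer already returns for non-FP8 configs.
    non_fp8_trainable: bool = False

    @property
    def is_trainable(self) -> bool:
        if self.is_fp8:
            # FP8 layers are trainable only when they were dequantized to BF16.
            return self.dequantize
        return self.non_fp8_trainable
```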

The MoE dequant (comments 10 & 11)

MoE checkpoints store per-expert weights+scales, but transformers merges the experts (stack/cat) into a single 3D packed nn.Parameter — not an nn.Linear, so it can't hold a weight_scale, and the per-expert scales differ so they can't survive the merge. The only correct option is to dequantize each expert (weight * weight_scale) to BF16 before the merge: update_weight_conversions prepends CompressedTensorsFp8Dequantize to the merging converters, pairs weight+scale by expert index, drops the scales. Without it, MoE FP8 loading crashes. (Attention/router linears in the same model still stay FP8.) The dequantize=True flag reuses this same op to also fold plain linears to BF16 — that's the answer to comment 10.

Known limitation: since merged experts land in BF16, this PR doesn't save memory on MoE expert weights (the bulk of params) — only dense models and the attention/router part of MoE benefit. Not fundamental: experts could stay FP8 with (1) a 3D FP8 expert param, (2) a stacked per-expert scale tensor, (3) a grouped scaled-mm in the MoE forward. That's a bigger change, so I'd suggest a follow-up and keeping this dequant path as the correct baseline.

Happy to adjust naming or split differently — let me know!

github-actions · 2026-06-15T01:41:39Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: compressed_tensors_integration

github-actions · 2026-06-15T01:54:44Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45699&sha=b73018

jiqing-feng added 4 commits April 29, 2026 15:30

add compressed-tensor fp8 integeration

5f46f04

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

update

f5a3168

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

update

59762f0

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

update copyright

0facb18

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

jiqing-feng changed the title ~~Fp8~~ Add FP8 kernel acceleration for compressed-tensors quantized models Apr 29, 2026

evalstate mentioned this pull request Apr 29, 2026

Cumulative feature and defect updates from recent Transformers PRs evalstate/transformers#42

Open

SunMarc reviewed Apr 29, 2026

Comment thread src/transformers/integrations/compressed_tensors_fp8.py Outdated

evalstate mentioned this pull request Apr 29, 2026

Cumulative defect fixes from recent Transformers PRs evalstate/transformers#43

Open

jiqing-feng added 6 commits April 30, 2026 09:51

rm fbgemm kernel from compressed tensor

c243de2

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

fix format

3be6079

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

rm fbgemm rely

4be83b7

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

fix format

58f2628

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

Merge branch 'main' into fp8

7afbb7e

fix import

e952e55

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

jiqing-feng marked this pull request as ready for review April 30, 2026 03:11

Merge branch 'main' into fp8

6a010e0

Merge branch 'main' into fp8

8920b9d