[kernel] add fused_qk_rmsnorm_per_token_quant kernel by gbyu-amd · Pull Request #2958 · ROCm/aiter

gbyu-amd · 2026-04-29T09:39:27Z

Motivation

Some Quark models, e.g., amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 and amd/Kimi-K2-Thinking-MXFP4-AttnFP8, have FP8-weight linear layers in attention and adopt the ptpc quant recipe. This PR therefore adds a fused_qk_rmsnorm_per_token_quant kernel, which will be used in ATOM/vLLM-ATOM.
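For readers unfamiliar with the operation, here is a rough pure-Python reference of what "RMSNorm followed by per-token quantization" usually means. It is only a sketch under stated assumptions, not the kernel added in this PR: the function names, the `qmax=448.0` range (the largest finite value of the common FP8 E4M3 format), and the absence of real FP8 rounding are all illustrative choices that the PR description does not confirm.

```python
import math


def rmsnorm(x, weight, eps=1e-6):
    """RMSNorm over one token: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def per_token_quant(y, qmax=448.0):
    """Symmetric scaling with one scale per token (row).

    qmax is an assumption; values are scaled and clamped but NOT rounded
    to a real FP8 grid, so this only illustrates the scale computation.
    """
    amax = max(abs(v) for v in y)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, v / scale)) for v in y]
    return q, scale


def rmsnorm_per_token_quant(tokens, weight, eps=1e-6, qmax=448.0):
    """Apply RMSNorm then per-token scaling to each row of `tokens`."""
    out, scales = [], []
    for x in tokens:
        q, s = per_token_quant(rmsnorm(x, weight, eps), qmax)
        out.append(q)
        scales.append(s)
    return out, scales


tokens = [[1.0, -2.0, 3.0, -4.0], [0.5, 0.5, 0.5, 0.5]]
weight = [1.0, 1.0, 1.0, 1.0]
quantized, scales = rmsnorm_per_token_quant(tokens, weight)
```

Multiplying each quantized row by its scale recovers the normalized values, which is the property a fused kernel has to preserve while avoiding a separate pass over the data.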

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-04-29T09:40:17Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2958 --add-label <label>`

valarLip · 2026-05-08T13:04:53Z

You can merge it once the ATOM test passes.

gbyu-amd · 2026-05-09T00:34:05Z

ATOM test passed as well. Merge it now.

bingxche · 2026-05-09T04:04:45Z

Hi, this PR breaks SGLang. Could you please revert first? @valarLip @gbyu-amd

The PR renames the public fused_qk_rmsnorm in aiter/ops/fused_qk_norm_rope_cache_quant.py to a private _fused_qk_rmsnorm (with a different signature), without keeping a backward-compatible alias. SGLang imports this name at module load time in python/sglang/srt/models/deepseek_common/attention_forward_methods/forward_mla.py:

from aiter.ops.fused_qk_norm_rope_cache_quant import (
    fused_qk_rmsnorm as fused_qk_rmsnorm_bf16,
)

After your PR, this import raises ImportError: cannot import name 'fused_qk_rmsnorm'. Because it's a top-level import, ~28 SGLang model modules that transitively depend on it (deepseek_v2, deepseek_nextn, deepseek_v4, kimi_k25, glm4_moe, longcat_flash, mistral_large_3, etc.) all fail to register. SGLang then thinks DeepseekV3ForCausalLM has no native implementation, falls back to the Transformers backend, and ultimately crashes with KeyError: 'sglang'.

https://github.com/ROCm/aiter/actions/runs/25586397205/job/75123761210#step:10:317
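The import failure described above can be reproduced in isolation, along with one common way a consumer can guard a top-level import so that a missing name does not stop the whole module from loading. This is a minimal sketch: `fake_kernels` and `unfused_fallback` are hypothetical stand-ins, and nothing here is taken from the actual aiter or SGLang code.

```python
import sys
import types

# Stand-in for a library module after a rename (hypothetical, not aiter):
# the public `fused_qk_rmsnorm` is gone and only a private helper remains.
fake = types.ModuleType("fake_kernels")
fake._fused_qk_rmsnorm = lambda *args: "private-kernel"
sys.modules["fake_kernels"] = fake

try:
    # Unguarded, this line raises ImportError: cannot import name ...
    from fake_kernels import fused_qk_rmsnorm as fused_qk_rmsnorm_bf16
    HAS_FUSED_QK_RMSNORM = True
except ImportError:
    fused_qk_rmsnorm_bf16 = None
    HAS_FUSED_QK_RMSNORM = False


def unfused_fallback(q, k):
    """Hypothetical slow path used when the fused kernel is unavailable."""
    return ("fallback", q, k)


def qk_rmsnorm(q, k):
    """Pick the fused kernel when it was importable, else the fallback."""
    if HAS_FUSED_QK_RMSNORM:
        return fused_qk_rmsnorm_bf16(q, k)
    return unfused_fallback(q, k)


result = qk_rmsnorm("q", "k")
```

With the guard in place the importing module still loads, so other models that depend on it transitively can register even when the kernel name is missing.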

gbyu-amd · 2026-05-09T05:57:26Z

Hi @bingxche, this PR has unified the API under fused_qk_rmsnorm, which fuses qk_rmsnorm with or without q quant (aiter/aiter/ops/fused_qk_rmsnorm_group_quant.py, lines 175 to 196 in e22dadd):

    def fused_qk_rmsnorm(
        q_out_quantized: Optional[Tensor] = None,
        q_out_scale: Optional[Tensor] = None,
        q: Optional[Tensor] = None,
        q_weight: Optional[Tensor] = None,
        q_epsilon: float = 1e-6,
        q_out_unquantized: Optional[Tensor] = None,
        k_out: Optional[Tensor] = None,
        q_res_out: Optional[Tensor] = None,
        k: Optional[Tensor] = None,
        k_weight: Optional[Tensor] = None,
        k_epsilon: Optional[float] = None,
        q_residual: Optional[Tensor] = None,
        gemma_norm: bool = False,
        quant_type: Optional[QuantType] = QuantType.No,
        group_size: Optional[int] = None,
        transpose_scale: bool = False,
    ) -> None:
        # Centralized interface
        if quant_type == QuantType.No:
            _fused_qk_rmsnorm(
                q_out_quantized, q, q_weight, q_epsilon, k_out, k, k_weight, k_epsilon

Putting the dispatch logic inside the aiter kernel should make the framework-side code cleaner. Could you update the SGLang code to align with this API?
cc @valarLip
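As a generic illustration of the two ideas in this exchange (a centralized entry point that dispatches on the quant type, and a backward-compatible alias for an older name), the pattern might look like the sketch below. Every name here (`QuantKind`, `fused_norm`, `legacy_fused_norm`, and the stand-in kernels) is hypothetical and chosen for illustration; none of it is aiter's actual code or signature.

```python
import warnings
from enum import Enum


class QuantKind(Enum):
    """Hypothetical stand-in for a library's quant-type enum."""
    NO = 0
    PER_TOKEN = 1


def _norm_only(q, k):
    """Stand-in for a kernel that normalizes without quantizing."""
    return ("norm_only", q, k)


def _norm_per_token_quant(q, k):
    """Stand-in for a kernel that normalizes and quantizes per token."""
    return ("norm_per_token_quant", q, k)


_DISPATCH = {
    QuantKind.NO: _norm_only,
    QuantKind.PER_TOKEN: _norm_per_token_quant,
}


def fused_norm(q, k, *, quant_kind=QuantKind.NO):
    """Centralized interface: one public name, dispatch on quant_kind."""
    try:
        impl = _DISPATCH[quant_kind]
    except KeyError:
        raise ValueError(f"unsupported quant_kind: {quant_kind!r}") from None
    return impl(q, k)


def legacy_fused_norm(q, k):
    """Old entry point kept importable; forwards to the new interface."""
    warnings.warn(
        "legacy_fused_norm is deprecated; call fused_norm(...) instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return fused_norm(q, k, quant_kind=QuantKind.NO)


new_style = fused_norm("q", "k", quant_kind=QuantKind.PER_TOKEN)
```

Keeping the old name as a thin forwarding wrapper lets existing importers keep working for a deprecation period while new callers move to the unified interface.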

Both pyproject.toml (build-system) and requirements.txt (runtime) were inconsistent on this branch — pyproject was at 0.1.4 (stale, not on PyPI for manylinux_2_28), requirements at 0.1.6. Main is at 0.1.7 since #2958-era kernels need flydsl 0.1.7 IR API. Wheels rebuilt from this HEAD will declare Requires-Dist: flydsl ==0.1.7, matching what main publishes.

gbyu-amd requested a review from a team April 29, 2026 09:39

gbyu-amd mentioned this pull request Apr 29, 2026

[fix][acc] fix accuracy of fp8 attn weights model using ptpc quant recipe ROCm/ATOM#670

Merged

1 task

gbyu-amd marked this pull request as draft April 29, 2026 11:46

gbyu-amd marked this pull request as ready for review April 29, 2026 13:24

Guanbao Yu added 2 commits May 5, 2026 21:19

add fused_qk_rmsnorm_per_token_quant kernel

0f2936c

make format happy

ec8257b

gbyu-amd force-pushed the guanbao/fuse_qknorm_per_token_quant branch from e77d654 to ec8257b May 6, 2026 02:22

zhuyuhua-v and others added 2 commits May 6, 2026 10:47

Merge branch 'main' into guanbao/fuse_qknorm_per_token_quant

afde7d3

Merge branch 'main' into guanbao/fuse_qknorm_per_token_quant

c1c40f0

gbyu-amd requested review from XiaobingSuper, ganyi1996ppo, valarLip, wuhuikx, xytpai and zejunchen-zejun May 6, 2026 07:44

zejunchen-zejun previously approved these changes May 7, 2026

xytpai previously approved these changes May 7, 2026

add centralized interface

e5cd7e7

xytpai dismissed stale reviews from zejunchen-zejun and themself via e5cd7e7 May 7, 2026 04:31

xytpai and others added 9 commits May 7, 2026 06:11

refine interface

8af42e8

bugfix

cc110de

Merge branch 'main' into guanbao/fuse_qknorm_per_token_quant

b186488

Merge branch 'main' into guanbao/fuse_qknorm_per_token_quant

9bd7109

refine api

1fc2d27

fix format

6cf5fbd

refine fused_qk_rmsnorm api

81b6691

fix ut

7f6f308

Merge branch 'main' into guanbao/fuse_qknorm_per_token_quant

dca8fba

valarLip approved these changes May 8, 2026

valarLip added the ci:atom label May 8, 2026

gbyu-amd merged commit 594d1a9 into main May 9, 2026
42 of 43 checks passed

gbyu-amd deleted the guanbao/fuse_qknorm_per_token_quant branch May 9, 2026 00:34

bingxche mentioned this pull request May 9, 2026

[AMD] Fix DeepSeek import cascade by supporting both pre- and post-#2958 aiter fused_qk_rmsnorm APIs sgl-project/sglang#24799

Merged

5 tasks

amd-bot mentioned this pull request May 12, 2026

[Analyze] amd-aiter-scout.yml run #103 (a6f359d) bingxche/sglang-ci-bot#69

Open

This was referenced May 13, 2026

fused_qk_rmsnorm crashes under torch.compile: 'float' object has no attribute 'size' (blocks Kimi-K2.5-MXFP4) #3172

Open

[v0.1.14 blocker] PR #2958 broke aiter.fused_qk_rmsnorm public API for ATOM positional callers #3177

Closed

rbrugaro-amd mentioned this pull request May 14, 2026

[Issue]: fused_qk_rmsnorm renamed to private _fused_qk_rmsnorm without deprecation (PR #2958) #3207

Open

gbyu-amd merged 14 commits into main from guanbao/fuse_qknorm_per_token_quant

8 participants
