[kernel] add fused_qk_rmsnorm_per_token_quant kernel #2958
You can merge it once the ATOM test has passed.

The ATOM test passed as well. Merge it now.

Hi, this PR breaks SGLang. Could you please revert first? @valarLip @gbyu-amd The PR renames the public After your PR, this import raises
https://github.com/ROCm/aiter/actions/runs/25586397205/job/75123761210#step:10:317

Hi @bingxche, this PR has unified the API to fused_qk_rmsnorm (aiter/aiter/ops/fused_qk_rmsnorm_group_quant.py, lines 175 to 196 in e22dadd). cc @valarLip

Both pyproject.toml (build-system) and requirements.txt (runtime) were inconsistent on this branch — pyproject was at 0.1.4 (stale, not on PyPI for manylinux_2_28), requirements at 0.1.6. Main is at 0.1.7 since #2958-era kernels need flydsl 0.1.7 IR API. Wheels rebuilt from this HEAD will declare Requires-Dist: flydsl ==0.1.7, matching what main publishes.


Motivation
Some Quark models, e.g. amd/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 and amd/Kimi-K2-Thinking-MXFP4-AttnFP8, have FP8-weight linear layers in attention and adopt the PTPC quant recipe. This PR therefore adds a fused_qk_rmsnorm_per_token_quant kernel, which will be used in ATOM/vLLM-ATOM.
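
For readers unfamiliar with the pattern, here is a rough, unfused reference of what an "RMSNorm on q and k, then per-token quant" step computes. This is only a sketch under assumptions, not the kernel in this PR: the function names are hypothetical, the FP8 maximum of 448 assumes the e4m3fn format (other FP8 variants use a different maximum), both q and k are assumed to be quantized, and the final cast to an FP8 dtype is left out.

```python
import math

# Assumption: max finite value of float8 e4m3fn. Other FP8 variants differ.
FP8_MAX = 448.0


def rmsnorm(row, weight, eps=1e-6):
    """RMSNorm over one token's hidden vector."""
    mean_sq = sum(x * x for x in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * w for x, w in zip(row, weight)]


def per_token_quant(row, fp8_max=FP8_MAX):
    """Scale one token so its largest magnitude maps to fp8_max.

    Returns (scaled_values, scale); dequantize with scaled * scale.
    The cast of scaled_values to an FP8 dtype is omitted in this sketch.
    """
    amax = max(abs(x) for x in row)
    scale = amax / fp8_max if amax > 0 else 1.0
    scaled = [max(-fp8_max, min(fp8_max, x / scale)) for x in row]
    return scaled, scale


def qk_rmsnorm_per_token_quant_ref(q, k, q_weight, k_weight, eps=1e-6):
    """Hypothetical unfused reference: normalize each token, then quantize it."""
    q_out = [per_token_quant(rmsnorm(row, q_weight, eps)) for row in q]
    k_out = [per_token_quant(rmsnorm(row, k_weight, eps)) for row in k]
    return q_out, k_out


# Usage: one token each for q and k, with unit norm weights.
q_out, k_out = qk_rmsnorm_per_token_quant_ref(
    q=[[3.0, 4.0]], k=[[1.0, -2.0]], q_weight=[1.0, 1.0], k_weight=[1.0, 1.0]
)
```

A fused kernel would presumably do the normalization, the per-token max reduction, and the scaling in a single pass over each token, avoiding the intermediate write of the normalized tensor.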
Technical Details
Test Plan
Test Result
Submission Checklist