[Plugin] [Feature] Support MLA q/k norm-quant fusion with SGLang + ATOM plugin for DeepSeek by qichu-yun · Pull Request #528 · ROCm/ATOM

qichu-yun · 2026-04-09T04:04:34Z

Motivation

DeepSeek MLA preprocessing in the SGLang + ATOM plugin was still running q/k RMSNorm and q quantization as separate steps, which costs extra kernel launches and intermediate memory traffic in a hot path. ATOM already provides a gated fused norm-quant implementation for DeepSeek, so this PR integrates that path into the plugin: supported workloads benefit from the fusion, while unsupported cases continue to use the existing fallback path.
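
To make the idea concrete, here is a minimal pure-Python sketch of the two paths, not the ATOM kernel itself. The function names, the integer quantization range, and the way the `ATOM_ENABLE_DS_QKNORM_QUANT_FUSION` variable is read are all assumptions for illustration; only the variable name comes from this PR's test plan. The unfused path materializes the normalized row and then quantizes it, while the fused path computes the same result in one pass without storing the intermediate.

```python
import math
import os

EPS = 1e-6
QMAX = 127  # assumption: stand-in integer range, not the real kernel's target dtype


def rms_norm(x, weight, eps=EPS):
    """Separate step 1: RMSNorm over one row, returns the normalized row."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def quantize(x, qmax=QMAX):
    """Separate step 2: symmetric per-row quantization, returns (codes, scale)."""
    amax = max(abs(v) for v in x)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(v / scale) for v in x], scale


def fused_rms_norm_quant(x, weight, eps=EPS, qmax=QMAX):
    """Single pass: normalize and quantize without storing the normalized row."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    amax = max(abs(v * inv_rms * w) for v, w in zip(x, weight))
    scale = amax / qmax if amax > 0 else 1.0
    return [round(v * inv_rms * w / scale) for v, w in zip(x, weight)], scale


def preprocess_q(x, weight, fusion_supported=True):
    """Hypothetical gate: use the fused path only when enabled and supported."""
    enabled = os.environ.get("ATOM_ENABLE_DS_QKNORM_QUANT_FUSION", "0") == "1"
    if enabled and fusion_supported:
        return fused_rms_norm_quant(x, weight)
    return quantize(rms_norm(x, weight))  # existing fallback path
```

In this sketch both paths return identical codes and scales, so the gate only changes how the work is scheduled, not the numerical result. Whether the real fused kernel is bit-identical to the fallback is not stated in this PR.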

before: [image]

after: [image]

Test Plan

launch server:

export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export SGLANG_AITER_FP8_PREFILL_ATTN=0
export SGLANG_USE_AITER=1
export ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1

model_path=/shared/data/amd_int/models/DeepSeek-R1-0528

export SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models

TORCHINDUCTOR_COMPILE_THREADS=128 python3 -m sglang.launch_server \
    --model-path "$model_path" \
    --host localhost \
    --port 9000 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8_e4m3 \
    --mem-fraction-static 0.9 \
    --page-size 1 \
    --disable-radix-cache

client:

model_path=/shared/data/amd_int/models/DeepSeek-R1-0528-MXFP4

ISL=8000
OSL=1000
CON=4
NUM=$(( CON * 2 ))
RANGE_RATIO=1.0

PYTHONDONTWRITEBYTECODE=1 python "/home/qichu_qle/my_sgl/bench_serving/benchmark_serving.py" \
  --model="${model_path}" \
  --backend=sglang \
  --base-url=http://127.0.0.1:9000 \
  --dataset-name=random \
  --random-input-len="${ISL}" \
  --random-output-len="${OSL}" \
  --random-range-ratio "${RANGE_RATIO}" \
  --num-prompts="${NUM}" \
  --max-concurrency="${CON}" \
  --trust-remote-code \
  --request-rate=inf \
  --num-warmups="$(( 2 * CON ))" \
  --ignore-eos \
  --save-result \
  --percentile-metrics="ttft,tpot,itl,e2el" \
  --result-dir="./tmp/oot-benchmark-results" \
  --result-filename="${ISL}_${OSL}_${CON}.json" \
  --profile

Test Result

============ Serving Benchmark Result ============
Successful requests:                     8         
Benchmark duration (s):                  97.66     
Total input tokens:                      64000     
Total generated tokens:                  8000      
Request throughput (req/s):              0.08      
Output token throughput (tok/s):         81.92     
Total Token throughput (tok/s):          737.26    
---------------Time to First Token----------------
Mean TTFT (ms):                          1330.96   
Median TTFT (ms):                        1457.61   
P99 TTFT (ms):                           1891.41   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.20     
Median TPOT (ms):                        20.08     
P99 TPOT (ms):                           21.02     
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.20     
Median ITL (ms):                         19.68     
P99 ITL (ms):                            20.15     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          21514.63  
Median E2EL (ms):                        21514.56  
P99 E2EL (ms):                           21516.52  
==================================================
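
As a sanity check, the aggregate figures above can be re-derived from the client parameters in the test plan. The sketch below uses only numbers that appear in this PR (ISL, OSL, CON, the reported duration, and the mean TTFT/TPOT); the variable names are illustrative. The end-to-end estimate assumes every request generates exactly OSL tokens, which is what `--ignore-eos` with a range ratio of 1.0 implies.

```python
ISL, OSL, CON = 8000, 1000, 4
NUM = CON * 2            # number of prompts, as in the client script
duration_s = 97.66       # reported benchmark duration

total_input = NUM * ISL                  # total input tokens
total_output = NUM * OSL                 # total generated tokens
req_tput = NUM / duration_s              # requests per second
out_tput = total_output / duration_s     # output tokens per second
total_tput = (total_input + total_output) / duration_s

# Mean end-to-end latency estimated from mean TTFT and mean TPOT (ms).
mean_ttft_ms, mean_tpot_ms = 1330.96, 20.20
e2el_est_ms = mean_ttft_ms + (OSL - 1) * mean_tpot_ms

print(total_input, total_output)
print(round(req_tput, 2), round(out_tput, 2), round(total_tput, 2))
print(round(e2el_est_ms, 2))
```

The token totals match the table exactly, and the throughput and latency figures agree with the reported values to within the rounding of the printed duration and TPOT.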

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

qichu-yun · 2026-04-16T09:00:03Z

Could you please kindly help review this PR? @valarLip @wuhuikx @zejunchen-zejun

qichu-yun · 2026-04-21T02:42:01Z

The failing check is unrelated to this PR's changes. Please review the code when you're free, thanks! @valarLip

qichu-yun force-pushed the fuse_norm_quant_sgl branch from 4be8ee4 to 88ecd01 Compare April 15, 2026 03:52

qichu-yun requested review from valarLip, wuhuikx, zejunchen-zejun and zhuyuhua-v April 16, 2026 08:15

zhuyuhua-v previously approved these changes Apr 16, 2026

qichu-yun requested a review from ZhiweiYan-96 April 16, 2026 08:54

ZhiweiYan-96 previously approved these changes Apr 16, 2026

valarLip previously approved these changes Apr 17, 2026

[Plugin] [Feature] Support fused q/k norm with SGLang + ATOM plugin for DeepSeek

c210d47

qichu-yun dismissed stale reviews from valarLip, ZhiweiYan-96, and zhuyuhua-v via c210d47 April 17, 2026 10:30

qichu-yun force-pushed the fuse_norm_quant_sgl branch from 9ca5902 to c210d47 Compare April 17, 2026 10:30

qichu-yun requested a review from ZhiweiYan-96 April 17, 2026 10:31

Merge branch 'main' into fuse_norm_quant_sgl

97c8589

Merge branch 'main' into fuse_norm_quant_sgl

f1d9165

valarLip approved these changes Apr 21, 2026

valarLip merged commit c4961e3 into ROCm:main Apr 21, 2026
22 of 28 checks passed

zejunchen-zejun mentioned this pull request Jun 4, 2026

[to hattie branch] docs(release-notes): fix misattributed plugin PR citations in v0.1.3 #1061

Merged

valarLip merged 3 commits into ROCm:main from qichu-yun:fuse_norm_quant_sgl

5 participants
