[Plugin] [Feature] Support MLA q/k norm-quant fusion with SGLang + ATOM plugin for Deepseek #528

Merged
valarLip merged 3 commits into ROCm:main from qichu-yun:fuse_norm_quant_sgl on Apr 21, 2026
@qichu-yun qichu-yun commented Apr 9, 2026

Contributor

Motivation

DeepSeek MLA preprocessing in the SGLang + ATOM plugin was still doing q/k RMSNorm and q quantization in separate steps, leaving unnecessary kernel and memory overhead in a hot path. Since ATOM already provides a gated fused norm-quant implementation for DeepSeek, this PR integrates that path into the plugin so supported workloads can benefit from the fusion while unsupported cases continue to use the existing fallback path.
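The gating described above can be illustrated with a small, pure-Python sketch. This is not the ATOM or SGLang implementation: the function names, the per-token symmetric quantization to an int8-like range, and the `supported` flag are all assumptions made for illustration. Only the environment variable name `ATOM_ENABLE_DS_QKNORM_QUANT_FUSION` comes from the test plan, and how the plugin actually parses it is assumed here. The sketch shows a fallback path that normalizes and then quantizes in two steps, a fused path that produces the same result in one pass without materializing the normalized intermediate, and a gate that selects between them.

```python
import math
import os


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm over one token's hidden vector."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def quantize(x, qmax=127.0):
    """Per-token symmetric quantization (illustrative format, an assumption)."""
    amax = max(abs(v) for v in x)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(v / scale) for v in x], scale


def norm_then_quant(x, weight, eps=1e-6):
    """Fallback path: two separate steps with a materialized intermediate."""
    normed = rms_norm(x, weight, eps)
    return quantize(normed)


def fused_norm_quant(x, weight, eps=1e-6, qmax=127.0):
    """Fused path: norm and quantization in one function, no stored intermediate."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    amax = max(abs(v * inv_rms * w) for v, w in zip(x, weight))
    scale = amax / qmax if amax > 0 else 1.0
    q = [round(v * inv_rms * w / scale) for v, w in zip(x, weight)]
    return q, scale


def fusion_enabled(supported):
    """Gate: env flag from the test plan plus a hypothetical support check."""
    flag = os.environ.get("ATOM_ENABLE_DS_QKNORM_QUANT_FUSION", "0") == "1"
    return flag and supported


def preprocess_q(x, weight, supported=True):
    """Use the fused path when enabled and supported, else fall back."""
    if fusion_enabled(supported):
        return fused_norm_quant(x, weight)
    return norm_then_quant(x, weight)
```

In this sketch both paths return identical values, so the gate only changes how the work is done, which mirrors the intent that unsupported cases keep the existing behavior.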

before: [image]

after: [image]

Test Plan

launch server:

export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export SGLANG_AITER_FP8_PREFILL_ATTN=0
export SGLANG_USE_AITER=1
export ATOM_ENABLE_DS_QKNORM_QUANT_FUSION=1

model_path=/shared/data/amd_int/models/DeepSeek-R1-0528

export SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models

TORCHINDUCTOR_COMPILE_THREADS=128 python3 -m sglang.launch_server \
    --model-path $model_path \
    --host localhost \
    --port 9000 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8_e4m3 \
    --mem-fraction-static 0.9 \
    --page-size 1 \
    --disable-radix-cache

client:

model_path=/shared/data/amd_int/models/DeepSeek-R1-0528-MXFP4

ISL=8000
OSL=1000
CON=4
NUM=$(( CON * 2 ))
RANGE_RATIO=1.0

PYTHONDONTWRITEBYTECODE=1 python "/home/qichu_qle/my_sgl/bench_serving/benchmark_serving.py" \
  --model=$model_path \
  --backend=sglang \
  --base-url=http://127.0.0.1:9000 \
  --dataset-name=random \
  --random-input-len="${ISL}" \
  --random-output-len="${OSL}" \
  --random-range-ratio "${RANGE_RATIO}" \
  --num-prompts="${NUM}" \
  --max-concurrency="${CON}" \
  --trust-remote-code \
  --request-rate=inf \
  --num-warmups="$(( 2 * CON ))" \
  --ignore-eos \
  --save-result \
  --percentile-metrics="ttft,tpot,itl,e2el" \
  --result-dir="./tmp/oot-benchmark-results" \
  --result-filename="${ISL}_${OSL}_${CON}.json" \
  --profile

Test Result

============ Serving Benchmark Result ============
Successful requests:                     8         
Benchmark duration (s):                  97.66     
Total input tokens:                      64000     
Total generated tokens:                  8000      
Request throughput (req/s):              0.08      
Output token throughput (tok/s):         81.92     
Total Token throughput (tok/s):          737.26    
---------------Time to First Token----------------
Mean TTFT (ms):                          1330.96   
Median TTFT (ms):                        1457.61   
P99 TTFT (ms):                           1891.41   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          20.20     
Median TPOT (ms):                        20.08     
P99 TPOT (ms):                           21.02     
---------------Inter-token Latency----------------
Mean ITL (ms):                           20.20     
Median ITL (ms):                         19.68     
P99 ITL (ms):                            20.15     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          21514.63  
Median E2EL (ms):                        21514.56  
P99 E2EL (ms):                           21516.52  
==================================================
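As a sanity check, the aggregate figures in the table can be re-derived from the reported counts and duration. The sketch below uses only numbers from the result above. The end-to-end estimate assumes that TPOT is defined as (E2EL - TTFT) / (output tokens - 1), which is common in serving benchmarks but is not confirmed here for this script. Small differences are expected because the printed inputs are already rounded.

```python
# Values copied from the benchmark result above.
duration_s = 97.66
requests = 8
input_tokens = 64000
output_tokens = 8000
mean_ttft_ms = 1330.96
mean_tpot_ms = 20.20
output_len = 1000  # OSL from the client script

# Throughput figures re-derived from counts and duration.
request_tput = requests / duration_s
output_tput = output_tokens / duration_s
total_tput = (input_tokens + output_tokens) / duration_s

# Estimated mean end-to-end latency, under the TPOT definition assumed above.
e2el_estimate_ms = mean_ttft_ms + mean_tpot_ms * (output_len - 1)

print(f"request throughput  ~ {request_tput:.2f} req/s")
print(f"output throughput   ~ {output_tput:.2f} tok/s")
print(f"total throughput    ~ {total_tput:.2f} tok/s")
print(f"estimated mean E2EL ~ {e2el_estimate_ms:.2f} ms")
```

The derived throughputs agree with the table to within rounding, and the estimated mean E2EL lands within a few milliseconds of the reported value.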

Submission Checklist

zhuyuhua-v previously approved these changes Apr 16, 2026
@qichu-yun qichu-yun requested a review from ZhiweiYan-96 April 16, 2026 08:54
ZhiweiYan-96 previously approved these changes Apr 16, 2026
@qichu-yun

Contributor Author

Could you please kindly help review this PR? @valarLip @wuhuikx @zejunchen-zejun

valarLip previously approved these changes Apr 17, 2026
@qichu-yun qichu-yun dismissed stale reviews from valarLip, ZhiweiYan-96, and zhuyuhua-v via c210d47 April 17, 2026 10:30
@qichu-yun qichu-yun force-pushed the fuse_norm_quant_sgl branch from 9ca5902 to c210d47 Compare April 17, 2026 10:30
@qichu-yun qichu-yun requested a review from ZhiweiYan-96 April 17, 2026 10:31
@qichu-yun

Contributor Author

The failing case is unrelated to this PR's changes. Please review the code when you're free, thanks! @valarLip

@valarLip valarLip merged commit c4961e3 into ROCm:main Apr 21, 2026
22 of 28 checks passed
zejunchen-zejun added a commit that referenced this pull request Jun 4, 2026
…itations in v0.1.3 (#1061)

* docs(release-notes): fix misattributed plugin PR citations in v0.1.3

Four citations in the vLLM-ATOM sections referenced PRs that actually
belong to SGLang-ATOM or the native ATOM path. Verified each PR title
against GitHub before correcting.

- Model Support / DeepSeek V4 / R1 FP4: dropped the bullet. #650 is the
  native DeepSeek V4 triton-MoE path (already cited under ATOM Server)
  and #614 is a SGLang-ATOM R1 FP4 PR (already cited under SGLang-ATOM);
  neither supports a vLLM-ATOM V4 / R1 FP4 claim.
- Model Support / Qwen3.5 / Qwen3-Next: dropped #532 (it adds Qwen3.5 /
  Qwen3-Next to SGLang, not vLLM); keep #772 (Qwen3-Next MTP for vLLM).
- H&P / vLLM-ATOM: dropped #528 + the "Q/K norm-quant fusion" claim;
  #528 is the SGLang+ATOM qk-norm fusion PR (already cited under SGLang).
- H&P / vLLM-ATOM: dropped #614 from the DeepSeek FP4 validation bullet
  (SGLang-ATOM PR), leaving the genuine #639 TP8/EP8 case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(release-notes): correct Qwen3.5 vLLM citation and drop nonexistent SGLang V3.2

Follow-up to the citation audit, two more verified corrections in the
plugin sections:

- vLLM-ATOM / Qwen3.5: the prior pass dropped Qwen3.5 along with the
  misattributed #532, but Qwen3.5 does have real vLLM-plugin support.
  Restore it with the correct PRs: #448 (fp8 functionality/accuracy,
  touches atom/plugin/vllm/model_wrapper.py) and #593 (Qwen3.5 FP4
  nightly + benchmark, recipes/atom_vllm/Qwen3.5.md), keeping #772
  (Qwen3-Next MTP).
- SGLang-ATOM: dropped "V3.2" from the DeepSeek model list. No SGLang
  DeepSeek V3.2 PR landed in v0.1.2..v0.1.3 (V3 MTP=#643, R1 FP4=#614,
  FP4 MTP=#834/#846); the cited PRs only cover V3 and R1 FP4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(release-notes): fix 3 more cross-section PR misattributions

Verified each PR's changed files to confirm which engine path it belongs
to:

- vLLM-ATOM Engine Core: drop #793 + "handles scalar KV scales". #793
  only touches atom/model_ops/{attention_mha,base_attention}.py (native,
  no plugin files) and is already cited correctly in the native section.
- vLLM-ATOM H&P: drop the DeepSeek FP4 TP8/EP8 bullet and move #639 to
  SGLang-ATOM H&P. #639 only touches sglang_benchmark_models.json and
  atom-sglang-benchmark.yaml -> it is a SGLang benchmark PR, not vLLM.
- vLLM-ATOM H&P: drop "V4 DP benchmark coverage (#949)". #949 touches the
  native benchmark (.github/benchmark/models.json, atom-benchmark.yaml),
  already cited under ATOM Server; it is not a plugin PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5 participants