
Add Qwen3.5-FP8 GB200 SGLang disaggregated benchmark #1810

Merged
adibarra merged 6 commits into main from qwen3.5-fp8-gb200-dynamo-sglang on Jun 22, 2026
Conversation

@RohitNagraj

@RohitNagraj RohitNagraj commented Jun 16, 2026

Collaborator

Adds qwen3.5-fp8-gb200-dynamo-sglang: Qwen3.5-397B-A17B-FP8 disaggregated SGLang-via-Dynamo on GB200.

  • 6 topologies across 1k/1k and 8k/1k: 1P1D TP4 STP plus wide-EP (DEP4 prefill / DEP16 decode), from 1P1D up to 8P1D
  • Recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/qwen3.5/gb200-fp8/
  • Image: lmsysorg/sglang:nightly-dev-cu13-20260608-303757cc
  • Adds the qwen3.5/fp8 model-path branch to launch_gb200-nv.sh

Note

Low Risk
Benchmark and CI launch/recipe configuration only; no application runtime, auth, or data-path changes.

Overview
Introduces qwen3.5-fp8-gb200-dynamo-sglang so CI can sweep Qwen/Qwen3.5-397B-A17B-FP8 on GB200 with disaggregated SGLang via Dynamo (lmsysorg/sglang:nightly-dev-cu13-20260608-303757cc).

nvidia-master.yaml defines fixed-seq-len scenarios for 1k/1k and 8k/1k with six topology variants: 1P1D TP4 (pure STP) and wide-EP layouts (DEP4 prefill / DEP16 decode), scaling from 1P1D through 2P1D, 4P1D, and 8P1D, each pointing at a dedicated recipe under benchmarks/multi_node/srt-slurm-recipes/sglang/qwen3.5/gb200-fp8/.
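
As a rough illustration of how these layouts scale, the sketch below tallies GPUs per topology. It assumes each prefill worker occupies 4 GPUs (TP4 or DEP4) and the single decode worker occupies 4 GPUs (TP4) or 16 GPUs (DEP16). The `Topology` class and `total_gpus` helper are hypothetical names used only for this example; the authoritative node/GPU counts are whatever the recipe YAMLs specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical description of one disaggregated layout (not a recipe schema)."""
    name: str
    prefill_workers: int
    prefill_gpus_per_worker: int
    decode_workers: int
    decode_gpus_per_worker: int

    def total_gpus(self) -> int:
        # Assumption: a worker's GPU footprint equals its TP/DEP size.
        return (self.prefill_workers * self.prefill_gpus_per_worker
                + self.decode_workers * self.decode_gpus_per_worker)


# Distinct worker layouts named in this PR (1P1D TP4 is reused for 1k/1k and 8k/1k).
TOPOLOGIES = [
    Topology("1P1D TP4/TP4", 1, 4, 1, 4),
    Topology("1P1D DEP4/DEP16", 1, 4, 1, 16),
    Topology("2P1D DEP4/DEP16", 2, 4, 1, 16),
    Topology("4P1D DEP4/DEP16", 4, 4, 1, 16),
    Topology("8P1D DEP4/DEP16", 8, 4, 1, 16),
]

GPU_TOTALS = {t.name: t.total_gpus() for t in TOPOLOGIES}

for name, gpus in GPU_TOTALS.items():
    print(f"{name}: {gpus} GPUs")
```

Under these assumptions the prefill side grows linearly with the worker count while the decode side stays fixed at one DEP16 worker.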

Six new srt-slurm recipe YAMLs encode Dynamo frontends, node/GPU counts, SGLang prefill/decode settings (including DeepEP on wide-EP decode where rebuild-deepep.sh is used), and sa-bench concurrency sweeps matched to the master config.

launch_gb200-nv.sh maps qwen3.5 + fp8 to the Lustre model path and qwen3.5-fp8 alias, and overlays the local Qwen3.5 recipes into srt-slurm on clone (same pattern as DSV4). perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 85a1949. Bugbot is set up for automated code reviews on this repo. Configure here.

Qwen3.5-397B-A17B-FP8 GB200 disaggregated SGLang-via-Dynamo, 6 topologies
across 1k/1k and 8k/1k (1P1D TP4 STP plus wide-EP DEP4 prefill / DEP16
decode from 1P1D up to 8P1D). Adds the recipe set, the nvidia-master entry,
the gb200 launch-script model-path and recipe-copy branches, and the
perf-changelog entry.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit de00324. Configure here.

osl: 1024
req_rate: "inf"
random_range_ratio: 0.8
concurrencies: "2048x4096"

4096 concurrency exceeds decode cap

Medium Severity

The 8P1D recipe and master config sweep concurrency 4096, but decode max-running-requests and max-mamba-cache-size stay at 2048. The 2P1D recipe uses 4096 for both when benchmarking at 4096, so the 4096 sweep point cannot honor intended load.
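
A check of this kind could be sketched as follows. This is a minimal illustration, not part of the PR: `parse_concurrencies` and `points_over_cap` are hypothetical helpers, and the sketch assumes the decode settings act as a single global limit on in-flight requests (whether they apply per worker or per DP rank is not stated in this thread).

```python
def parse_concurrencies(spec: str) -> list[int]:
    """Parse an 'x'-separated sweep string such as "2048x4096"."""
    return [int(part) for part in spec.split("x")]


def points_over_cap(spec: str, decode_caps: dict[str, int]) -> dict[int, list[str]]:
    """Map each sweep point to the decode settings it exceeds."""
    over: dict[int, list[str]] = {}
    for point in parse_concurrencies(spec):
        exceeded = [name for name, cap in decode_caps.items() if point > cap]
        if exceeded:
            over[point] = exceeded
    return over


# Values quoted in the comment above for the 8P1D recipe.
caps_8p1d = {"max-running-requests": 2048, "max-mamba-cache-size": 2048}
flagged = points_over_cap("2048x4096", caps_8p1d)
print(flagged)
```

With the 2P1D values (both caps at 4096, sweep "4096") the same check returns an empty mapping.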

@RohitNagraj

Collaborator Author

/reuse-sweep-run

@RohitNagraj

Collaborator Author

/reuse-sweep-run

@RohitNagraj

Collaborator Author

@functionstackx, could you please approve this?

@functionstackx

Collaborator

@Ankur-singh

Collaborator

As a PR reviewer and CODEOWNER, I have reviewed this and have:

  • Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
  • Verified that this PR has passed PR validation.
  • Verified that this PR passes evals.
  • If a company claims that they support vLLM/SGLang as first-class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
  • Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
    • If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
  • If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

Additional detail section:

This is a disaggregated config.

Signed: ankur-singh

@cquil11

cquil11 commented Jun 22, 2026

Collaborator

@claude ensure parity for the major parameters between the new entries in master yaml and their corresponding srt slurm config file (such as tp, ep, num workers, etc)

@Klaud-Cold

Klaud-Cold commented Jun 22, 2026

Collaborator

Claude finished @cquil11's task in 1m 53s


Parity Check: Master YAML vs SRT-Slurm Recipes

  • Read new entries in nvidia-master.yaml
  • Read all 6 recipe YAML files
  • Compare major parameters (TP, EP, num workers, concurrency, ISL/OSL, dp-attn)
  • Report findings

All 6 scenarios were checked for parity between the master YAML search-space entries and their corresponding srt-slurm recipe files. Major parameters are consistent across all configs.

1k/1k Scenarios

| Parameter | Master YAML | Recipe | Match? |
|---|---|---|---|
| 1P1D STP (1k1k/1p1d-tp4-tp4.yaml) | | | |
| Prefill TP / EP | 4 / 1 | tensor-parallel-size: 4 (no EP) | Yes |
| Decode TP / EP | 4 / 1 | tensor-parallel-size: 4 (no EP) | Yes |
| Prefill num-worker | 1 | prefill_workers: 1 | Yes |
| Decode num-worker | 1 | decode_workers: 1 | Yes |
| dp-attn | false / false | Not set (absent = false) | Yes |
| Concurrency | [1,2,4,8,16,32,64] | "1x2x4x8x16x32x64" | Yes |
| ISL / OSL | 1024 / 1024 | isl: 1024, osl: 1024 | Yes |
| 1P1D wide-EP (1k1k/1p1d-dep4-dep16.yaml) | | | |
| Prefill TP / EP | 4 / 4 | tensor-parallel-size: 4, expert-parallel-size: 4 | Yes |
| Decode TP / EP | 16 / 16 | tensor-parallel-size: 16, expert-parallel-size: 16 | Yes |
| Prefill num-worker | 1 | prefill_workers: 1 | Yes |
| Decode num-worker | 1 | decode_workers: 1 | Yes |
| dp-attn | true / true | enable-dp-attention: true (both) | Yes |
| Concurrency | [512,1024,2048] | "512x1024x2048" | Yes |
| ISL / OSL | 1024 / 1024 | isl: 1024, osl: 1024 | Yes |
| 2P1D wide-EP (1k1k/2p1d-dep4-dep16.yaml) | | | |
| Prefill TP / EP | 4 / 4 | tensor-parallel-size: 4, expert-parallel-size: 4 | Yes |
| Decode TP / EP | 16 / 16 | tensor-parallel-size: 16, expert-parallel-size: 16 | Yes |
| Prefill num-worker | 2 | prefill_workers: 2 | Yes |
| Decode num-worker | 1 | decode_workers: 1 | Yes |
| dp-attn | true / true | enable-dp-attention: true (both) | Yes |
| Concurrency | [4096] | "4096" | Yes |
| ISL / OSL | 1024 / 1024 | isl: 1024, osl: 1024 | Yes |

8k/1k Scenarios

| Parameter | Master YAML | Recipe | Match? |
|---|---|---|---|
| 1P1D STP (8k1k/1p1d-tp4-tp4.yaml) | | | |
| Prefill TP / EP | 4 / 1 | tensor-parallel-size: 4 (no EP) | Yes |
| Decode TP / EP | 4 / 1 | tensor-parallel-size: 4 (no EP) | Yes |
| Prefill num-worker | 1 | prefill_workers: 1 | Yes |
| Decode num-worker | 1 | decode_workers: 1 | Yes |
| dp-attn | false / false | Not set (absent = false) | Yes |
| Concurrency | [1,2,4,8,16,32,64,128] | "1x2x4x8x16x32x64x128" | Yes |
| ISL / OSL | 8192 / 1024 | isl: 8192, osl: 1024 | Yes |
| 4P1D wide-EP (8k1k/4p1d-dep4-dep16.yaml) | | | |
| Prefill TP / EP | 4 / 4 | tensor-parallel-size: 4, expert-parallel-size: 4 | Yes |
| Decode TP / EP | 16 / 16 | tensor-parallel-size: 16, expert-parallel-size: 16 | Yes |
| Prefill num-worker | 4 | prefill_workers: 4 | Yes |
| Decode num-worker | 1 | decode_workers: 1 | Yes |
| dp-attn | true / true | enable-dp-attention: true (both) | Yes |
| Concurrency | [1024] | "1024" | Yes |
| ISL / OSL | 8192 / 1024 | isl: 8192, osl: 1024 | Yes |
| 8P1D wide-EP (8k1k/8p1d-dep4-dep16.yaml) | | | |
| Prefill TP / EP | 4 / 4 | tensor-parallel-size: 4, expert-parallel-size: 4 | Yes |
| Decode TP / EP | 16 / 16 | tensor-parallel-size: 16, expert-parallel-size: 16 | Yes |
| Prefill num-worker | 8 | prefill_workers: 8 | Yes |
| Decode num-worker | 1 | decode_workers: 1 | Yes |
| dp-attn | true / true | enable-dp-attention: true (both) | Yes |
| Concurrency | [2048, 4096] | "2048x4096" | Yes |
| ISL / OSL | 8192 / 1024 | isl: 8192, osl: 1024 | Yes |

Summary

All major parameters match across all 6 master YAML entries and their corresponding srt-slurm recipe files: TP, EP, num-worker (prefill/decode), dp-attn, concurrency lists, and ISL/OSL are all consistent.
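
For anyone wanting to repeat this comparison mechanically, the sketch below shows one possible shape for it. It is an assumption-laden illustration: the dictionary layouts for `MASTER_8P1D` and `RECIPE_8P1D` are hypothetical simplifications (the real nvidia-master.yaml and recipe schemas may nest these fields differently), and only the key names that appear in the tables above (`tensor-parallel-size`, `expert-parallel-size`, `enable-dp-attention`, `prefill_workers`, `decode_workers`, `isl`, `osl`) come from this thread.

```python
def parity_mismatches(master: dict, recipe: dict) -> list[str]:
    """Return the names of major parameters that differ between the two configs."""
    checks = {
        "isl": (master["isl"], recipe["isl"]),
        "osl": (master["osl"], recipe["osl"]),
        # The recipe stores its sweep as an 'x'-separated string.
        "concurrency": (
            master["conc"],
            [int(p) for p in recipe["concurrencies"].split("x")],
        ),
    }
    for role in ("prefill", "decode"):
        m, r = master[role], recipe[role]
        checks[f"{role} tp"] = (m["tp"], r["tensor-parallel-size"])
        # An absent EP size is treated as 1 and absent dp-attention as false.
        checks[f"{role} ep"] = (m["ep"], r.get("expert-parallel-size", 1))
        checks[f"{role} dp-attn"] = (m["dp-attn"], r.get("enable-dp-attention", False))
        checks[f"{role} workers"] = (m["num-worker"], recipe[f"{role}_workers"])
    return [name for name, (want, got) in checks.items() if want != got]


# Hypothetical in-memory versions of the 8P1D wide-EP entries from the table above.
MASTER_8P1D = {
    "isl": 8192, "osl": 1024, "conc": [2048, 4096],
    "prefill": {"tp": 4, "ep": 4, "num-worker": 8, "dp-attn": True},
    "decode": {"tp": 16, "ep": 16, "num-worker": 1, "dp-attn": True},
}
RECIPE_8P1D = {
    "isl": 8192, "osl": 1024, "concurrencies": "2048x4096",
    "prefill_workers": 8, "decode_workers": 1,
    "prefill": {"tensor-parallel-size": 4, "expert-parallel-size": 4,
                "enable-dp-attention": True},
    "decode": {"tensor-parallel-size": 16, "expert-parallel-size": 16,
               "enable-dp-attention": True},
}

print(parity_mismatches(MASTER_8P1D, RECIPE_8P1D))
```

Defaulting a missing `expert-parallel-size` to 1 and a missing `enable-dp-attention` to false mirrors how the STP rows were read in the tables ("no EP", "absent = false").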

One note (already flagged by Cursor Bugbot): in 8k1k/8p1d-dep4-dep16.yaml, the decode max-running-requests and max-mamba-cache-size are both 2048, while the master YAML sweeps concurrency up to 4096. The 2P1D recipe (1k1k) correctly uses 4096 for both when sweeping at 4096 concurrency. This is not a master↔recipe parity issue per se, but it means the 4096 concurrency point in the 8P1D sweep may be capped at the recipe level.


Comment thread runners/launch_gb200-nv.sh
@adibarra adibarra merged commit e5f44cf into main Jun 22, 2026
32 checks passed
@adibarra adibarra deleted the qwen3.5-fp8-gb200-dynamo-sglang branch June 22, 2026 23:15
arygupt added a commit that referenced this pull request Jun 22, 2026
Resolve perf-changelog.yaml: keep main's qwen3.5-fp8-gb200 entry (#1810) + the dsr1 gb200/b200 powercheck fan-out (#1791). sudo/SSSD cleanup fix (94c2add) preserved through merge.

7 participants