Add Qwen3.5-FP8 GB200 SGLang disaggregated benchmark by RohitNagraj · Pull Request #1810 · SemiAnalysisAI/InferenceX

RohitNagraj · 2026-06-16T22:16:30Z

Adds qwen3.5-fp8-gb200-dynamo-sglang: Qwen3.5-397B-A17B-FP8 disaggregated SGLang-via-Dynamo on GB200.

- 6 topologies across 1k/1k and 8k/1k: 1P1D TP4 STP plus wide-EP (DEP4 prefill / DEP16 decode), from 1P1D up to 8P1D
- Recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/qwen3.5/gb200-fp8/
- Image: lmsysorg/sglang:nightly-dev-cu13-20260608-303757cc
- Adds the qwen3.5/fp8 model-path branch to launch_gb200-nv.sh

Note

Low Risk
Benchmark and CI launch/recipe configuration only; no application runtime, auth, or data-path changes.

Overview
Introduces qwen3.5-fp8-gb200-dynamo-sglang so CI can sweep Qwen/Qwen3.5-397B-A17B-FP8 on GB200 with disaggregated SGLang via Dynamo (lmsysorg/sglang:nightly-dev-cu13-20260608-303757cc).

nvidia-master.yaml defines fixed-seq-len scenarios for 1k/1k and 8k/1k with six topology variants: 1P1D TP4 (pure STP) and wide-EP layouts (DEP4 prefill / DEP16 decode), scaling from 1P1D through 2P1D, 4P1D, and 8P1D, each pointing at a dedicated recipe under benchmarks/multi_node/srt-slurm-recipes/sglang/qwen3.5/gb200-fp8/.
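As a rough illustration of how the six topologies scale, the sketch below tallies GPUs per topology. It assumes one GPU per parallel rank (so TP4 and DEP4 workers occupy 4 GPUs and a DEP16 worker occupies 16); the `Topology` class and its field names are hypothetical and do not come from the recipes, whose actual node and GPU counts are not shown in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical container; field names are illustrative only."""
    name: str
    prefill_workers: int
    prefill_ranks: int  # ranks per prefill worker (TP4 / DEP4 -> 4)
    decode_workers: int
    decode_ranks: int   # ranks per decode worker (TP4 -> 4, DEP16 -> 16)

    def total_gpus(self) -> int:
        # Assumption: each parallel rank maps to exactly one GPU.
        return (self.prefill_workers * self.prefill_ranks
                + self.decode_workers * self.decode_ranks)


# The six topologies named in the PR, keyed by recipe file stem.
TOPOLOGIES = [
    Topology("1k1k/1p1d-tp4-tp4", 1, 4, 1, 4),
    Topology("1k1k/1p1d-dep4-dep16", 1, 4, 1, 16),
    Topology("1k1k/2p1d-dep4-dep16", 2, 4, 1, 16),
    Topology("8k1k/1p1d-tp4-tp4", 1, 4, 1, 4),
    Topology("8k1k/4p1d-dep4-dep16", 4, 4, 1, 16),
    Topology("8k1k/8p1d-dep4-dep16", 8, 4, 1, 16),
]

for topo in TOPOLOGIES:
    print(f"{topo.name}: {topo.total_gpus()} GPUs")
```

Under that assumption the wide-EP layouts grow by 4 GPUs per added prefill worker while the single DEP16 decode worker stays fixed.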

Six new srt-slurm recipe YAMLs encode Dynamo frontends, node/GPU counts, SGLang prefill/decode settings (including DeepEP on wide-EP decode where rebuild-deepep.sh is used), and sa-bench concurrency sweeps matched to the master config.

launch_gb200-nv.sh maps qwen3.5 + fp8 to the Lustre model path and qwen3.5-fp8 alias, and overlays the local Qwen3.5 recipes into srt-slurm on clone (same pattern as DSV4). perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 85a1949. Bugbot is set up for automated code reviews on this repo.

Qwen3.5-397B-A17B-FP8 GB200 disaggregated SGLang-via-Dynamo, 6 topologies across 1k/1k and 8k/1k (1P1D TP4 STP plus wide-EP DEP4 prefill / DEP16 decode from 1P1D up to 8P1D). Adds the recipe set, the nvidia-master entry, the gb200 launch-script model-path and recipe-copy branches, and the perf-changelog entry.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit de00324.

cursor · 2026-06-16T22:59:55Z

+  osl: 1024
+  req_rate: "inf"
+  random_range_ratio: 0.8
+  concurrencies: "2048x4096"


4096 concurrency exceeds decode cap

Medium Severity

The 8P1D recipe and the master config both sweep concurrency 4096, but the decode max-running-requests and max-mamba-cache-size stay at 2048, so the 4096 sweep point cannot apply the intended load. For comparison, the 2P1D recipe sets both to 4096 when benchmarking at 4096.
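A check of this kind can be sketched in a few lines. The snippet assumes the recipe's concurrency string uses `x` as a separator (as in `"2048x4096"` above) and that max-running-requests caps the decode worker as a whole; the helper names are hypothetical and not part of the repository.

```python
def parse_concurrencies(spec: str) -> list[int]:
    """Split an 'AxBxC' concurrency string into integers."""
    return [int(part) for part in spec.split("x")]


def points_over_cap(spec: str, max_running_requests: int) -> list[int]:
    """Return sweep points that exceed the decode-side request cap."""
    return [c for c in parse_concurrencies(spec) if c > max_running_requests]


# Values quoted in this review: 8P1D sweeps 2048 and 4096 with a cap of 2048.
over_8p1d = points_over_cap("2048x4096", 2048)
# 2P1D sweeps 4096 with a cap of 4096.
over_2p1d = points_over_cap("4096", 4096)

print(over_8p1d)
print(over_2p1d)
```

Any non-empty result marks sweep points where the benchmark client would offer more concurrent requests than the decode worker is configured to run.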

Additional Locations (1)

.github/configs/nvidia-master.yaml#L8975-L8977


github-actions · 2026-06-16T23:02:08Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27652043520
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27652043520

github-actions · 2026-06-17T01:57:00Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27653706015
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27653706015

RohitNagraj · 2026-06-22T17:01:39Z

/reuse-sweep-run

# Conflicts: # perf-changelog.yaml

RohitNagraj · 2026-06-22T17:04:23Z

/reuse-sweep-run

RohitNagraj · 2026-06-22T17:48:44Z

@functionstackx, could you please approve this.

functionstackx · 2026-06-22T20:15:31Z

> @functionstackx, could you please approve this.

@kedarpotdar-nv @Ankur-singh https://github.com/SemiAnalysisAI/InferenceX/blob/main/docs/PR_REVIEW_CHECKLIST.md

Ankur-singh · 2026-06-22T21:14:44Z

As a PR reviewer and CODEOWNER, I have reviewed this and have:

- Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
- Verified that this PR has passed PR validation.
- Verified that this PR passes evals.
- If a company claims that they support vLLM/SGLang as first-class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
- Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
  - If they are not, I have verified that a PR has been opened in the vLLM recipe repo or SGLang repo and linked it below in the additional detail section.
- If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

Additional detail section:

This is a disaggregated config.

Signed: ankur-singh

cquil11 · 2026-06-22T22:48:06Z

@claude ensure parity for the major parameters between the new entries in master yaml and their corresponding srt slurm config file (such as tp, ep, num workers, etc)

Klaud-Cold · 2026-06-22T22:48:29Z

Claude finished @cquil11's task in 1m 53s.

Parity Check: Master YAML vs SRT-Slurm Recipes

- Read new entries in nvidia-master.yaml
- Read all 6 recipe YAML files
- Compare major parameters (TP, EP, num workers, concurrency, ISL/OSL, dp-attn)
- Report findings

All 6 scenarios were checked for parity between the master YAML search-space entries and their corresponding srt-slurm recipe files. Major parameters are consistent across all configs.

1k/1k Scenarios

| Parameter | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| **1P1D STP (`1k1k/1p1d-tp4-tp4.yaml`)** | | | |
| Prefill TP / EP | 4 / 1 | `tensor-parallel-size: 4` (no EP) | ✅ |
| Decode TP / EP | 4 / 1 | `tensor-parallel-size: 4` (no EP) | ✅ |
| Prefill num-worker | 1 | `prefill_workers: 1` | ✅ |
| Decode num-worker | 1 | `decode_workers: 1` | ✅ |
| dp-attn | false / false | Not set (absent = false) | ✅ |
| Concurrency | [1,2,4,8,16,32,64] | `"1x2x4x8x16x32x64"` | ✅ |
| ISL / OSL | 1024 / 1024 | `isl: 1024`, `osl: 1024` | ✅ |
| **1P1D wide-EP (`1k1k/1p1d-dep4-dep16.yaml`)** | | | |
| Prefill TP / EP | 4 / 4 | `tensor-parallel-size: 4`, `expert-parallel-size: 4` | ✅ |
| Decode TP / EP | 16 / 16 | `tensor-parallel-size: 16`, `expert-parallel-size: 16` | ✅ |
| Prefill num-worker | 1 | `prefill_workers: 1` | ✅ |
| Decode num-worker | 1 | `decode_workers: 1` | ✅ |
| dp-attn | true / true | `enable-dp-attention: true` (both) | ✅ |
| Concurrency | [512,1024,2048] | `"512x1024x2048"` | ✅ |
| ISL / OSL | 1024 / 1024 | `isl: 1024`, `osl: 1024` | ✅ |
| **2P1D wide-EP (`1k1k/2p1d-dep4-dep16.yaml`)** | | | |
| Prefill TP / EP | 4 / 4 | `tensor-parallel-size: 4`, `expert-parallel-size: 4` | ✅ |
| Decode TP / EP | 16 / 16 | `tensor-parallel-size: 16`, `expert-parallel-size: 16` | ✅ |
| Prefill num-worker | 2 | `prefill_workers: 2` | ✅ |
| Decode num-worker | 1 | `decode_workers: 1` | ✅ |
| dp-attn | true / true | `enable-dp-attention: true` (both) | ✅ |
| Concurrency | [4096] | `"4096"` | ✅ |
| ISL / OSL | 1024 / 1024 | `isl: 1024`, `osl: 1024` | ✅ |

8k/1k Scenarios

| Parameter | Master YAML | Recipe | Match? |
| --- | --- | --- | --- |
| **1P1D STP (`8k1k/1p1d-tp4-tp4.yaml`)** | | | |
| Prefill TP / EP | 4 / 1 | `tensor-parallel-size: 4` (no EP) | ✅ |
| Decode TP / EP | 4 / 1 | `tensor-parallel-size: 4` (no EP) | ✅ |
| Prefill num-worker | 1 | `prefill_workers: 1` | ✅ |
| Decode num-worker | 1 | `decode_workers: 1` | ✅ |
| dp-attn | false / false | Not set (absent = false) | ✅ |
| Concurrency | [1,2,4,8,16,32,64,128] | `"1x2x4x8x16x32x64x128"` | ✅ |
| ISL / OSL | 8192 / 1024 | `isl: 8192`, `osl: 1024` | ✅ |
| **4P1D wide-EP (`8k1k/4p1d-dep4-dep16.yaml`)** | | | |
| Prefill TP / EP | 4 / 4 | `tensor-parallel-size: 4`, `expert-parallel-size: 4` | ✅ |
| Decode TP / EP | 16 / 16 | `tensor-parallel-size: 16`, `expert-parallel-size: 16` | ✅ |
| Prefill num-worker | 4 | `prefill_workers: 4` | ✅ |
| Decode num-worker | 1 | `decode_workers: 1` | ✅ |
| dp-attn | true / true | `enable-dp-attention: true` (both) | ✅ |
| Concurrency | [1024] | `"1024"` | ✅ |
| ISL / OSL | 8192 / 1024 | `isl: 8192`, `osl: 1024` | ✅ |
| **8P1D wide-EP (`8k1k/8p1d-dep4-dep16.yaml`)** | | | |
| Prefill TP / EP | 4 / 4 | `tensor-parallel-size: 4`, `expert-parallel-size: 4` | ✅ |
| Decode TP / EP | 16 / 16 | `tensor-parallel-size: 16`, `expert-parallel-size: 16` | ✅ |
| Prefill num-worker | 8 | `prefill_workers: 8` | ✅ |
| Decode num-worker | 1 | `decode_workers: 1` | ✅ |
| dp-attn | true / true | `enable-dp-attention: true` (both) | ✅ |
| Concurrency | [2048, 4096] | `"2048x4096"` | ✅ |
| ISL / OSL | 8192 / 1024 | `isl: 8192`, `osl: 1024` | ✅ |

Summary

All major parameters match across all 6 master YAML entries and their corresponding srt-slurm recipe files: TP, EP, num-worker (prefill/decode), dp-attn, concurrency lists, and ISL/OSL are all consistent.

One note (already flagged by Cursor Bugbot): in 8k1k/8p1d-dep4-dep16.yaml, the decode max-running-requests and max-mamba-cache-size are both 2048, while the master YAML sweeps concurrency up to 4096. The 2P1D recipe (1k1k) correctly uses 4096 for both when sweeping at 4096 concurrency. This is not a master↔recipe parity issue per se, but it means the 4096 concurrency point in the 8P1D sweep may be capped at the recipe level.
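The parity comparison summarized above could be automated along the following lines. This is only a sketch: the recipe key names are taken from the tables in this comment, but the nesting of both dictionaries, the master-side key names, and the helper functions are hypothetical, since neither file's full schema appears in this thread.

```python
def normalize_recipe(recipe: dict) -> dict:
    """Map recipe-style keys onto the (assumed) master-style layout."""
    def side(cfg: dict, workers: int) -> dict:
        return {
            "tp": cfg["tensor-parallel-size"],
            "ep": cfg.get("expert-parallel-size", 1),          # absent = no EP
            "dp-attn": cfg.get("enable-dp-attention", False),  # absent = false
            "num-worker": workers,
        }
    return {
        "prefill": side(recipe["prefill"], recipe["prefill_workers"]),
        "decode": side(recipe["decode"], recipe["decode_workers"]),
        "conc": [int(c) for c in recipe["concurrencies"].split("x")],
        "isl": recipe["isl"],
        "osl": recipe["osl"],
    }


def mismatches(master: dict, recipe: dict) -> list[str]:
    """Return the master keys whose values differ from the recipe."""
    normalized = normalize_recipe(recipe)
    return sorted(k for k in master if master[k] != normalized.get(k))


# 8k/1k 8P1D wide-EP values as listed in the table above.
master_8p1d = {
    "prefill": {"tp": 4, "ep": 4, "dp-attn": True, "num-worker": 8},
    "decode": {"tp": 16, "ep": 16, "dp-attn": True, "num-worker": 1},
    "conc": [2048, 4096],
    "isl": 8192,
    "osl": 1024,
}
recipe_8p1d = {
    "prefill": {"tensor-parallel-size": 4, "expert-parallel-size": 4,
                "enable-dp-attention": True},
    "decode": {"tensor-parallel-size": 16, "expert-parallel-size": 16,
               "enable-dp-attention": True},
    "prefill_workers": 8,
    "decode_workers": 1,
    "concurrencies": "2048x4096",
    "isl": 8192,
    "osl": 1024,
}

print(mismatches(master_8p1d, recipe_8p1d))
```

An empty list corresponds to a row of ✅ marks for that scenario; a check like this covers parity only and would not catch the decode cap noted above.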

Resolve perf-changelog.yaml: keep main's qwen3.5-fp8-gb200 entry (#1810) + the dsr1 gb200/b200 powercheck fan-out (#1791). sudo/SSSD cleanup fix (94c2add) preserved through merge.

RohitNagraj requested a review from a team June 16, 2026 22:16

RohitNagraj requested review from jgangani and kedarpotdar-nv as code owners June 16, 2026 22:16

github-project-automation Bot added this to InferenceMAX Board Jun 16, 2026

Update perf-changelog pr-link for #1810

b3537f8

RohitNagraj added the full-sweep-enabled label Jun 16, 2026

cursor Bot reviewed Jun 16, 2026


Comment thread benchmarks/multi_node/srt-slurm-recipes/sglang/qwen3.5/gb200-fp8/1k1k/1p1d-tp4-tp4.yaml Outdated

RohitNagraj and others added 2 commits June 16, 2026 15:22

Merge branch 'main' into qwen3.5-fp8-gb200-dynamo-sglang

2c94721

Set 1k1k 1p1d-tp4-tp4 context-length to 4096

de00324

cursor Bot reviewed Jun 16, 2026


Merge remote-tracking branch 'origin/main' into pr-1810-sync

9aec473

# Conflicts: # perf-changelog.yaml

chore: refresh PR #1810 for sweep reuse [skip-sweep]

85a1949

kedarpotdar-nv approved these changes Jun 22, 2026


cquil11 reviewed Jun 22, 2026


Comment thread runners/launch_gb200-nv.sh

adibarra approved these changes Jun 22, 2026


adibarra merged commit e5f44cf into main Jun 22, 2026
32 checks passed

adibarra deleted the qwen3.5-fp8-gb200-dynamo-sglang branch June 22, 2026 23:15

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 22, 2026


7 participants
