Add Qwen3.5-FP8 GB200 SGLang disaggregated benchmark #1810
Qwen3.5-397B-A17B-FP8 GB200 disaggregated SGLang-via-Dynamo, 6 topologies across 1k/1k and 8k/1k (1P1D TP4 STP plus wide-EP DEP4 prefill / DEP16 decode from 1P1D up to 8P1D). Adds the recipe set, the nvidia-master entry, the gb200 launch-script model-path and recipe-copy branches, and the perf-changelog entry.
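For a sense of scale, the GPU footprint of each xPyD topology can be sketched as below. This is a rough illustration only: it assumes that TP4 and DEP4 each occupy 4 GPUs per worker and that DEP16 occupies 16 GPUs per worker, which the description implies but does not state. The helper `total_gpus` is hypothetical and not part of the PR.

```python
def total_gpus(num_prefill: int, prefill_width: int,
               num_decode: int, decode_width: int) -> int:
    """GPUs used by an xPyD topology.

    Assumption: the parallel width (e.g. 4 for DEP4, 16 for DEP16)
    equals the number of GPUs held by one worker.
    """
    return num_prefill * prefill_width + num_decode * decode_width


# Wide-EP layouts from the description: DEP4 prefill, DEP16 decode.
for num_prefill in (1, 2, 4, 8):
    gpus = total_gpus(num_prefill, 4, 1, 16)
    print(f"{num_prefill}P1D wide-EP: {gpus} GPUs")

# 1P1D TP4 on both sides (assumed 4 GPUs per worker).
print("1P1D TP4:", total_gpus(1, 4, 1, 4), "GPUs")
```

Under these assumptions the wide-EP sweep grows from 20 GPUs at 1P1D to 48 GPUs at 8P1D, with all of the growth on the prefill side.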
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit de00324.
osl: 1024
req_rate: "inf"
random_range_ratio: 0.8
concurrencies: "2048x4096"
4096 concurrency exceeds decode cap
Medium Severity
The 8P1D recipe and master config sweep concurrency 4096, but decode max-running-requests and max-mamba-cache-size stay at 2048. The 2P1D recipe uses 4096 for both when benchmarking at 4096, so the 4096 sweep point cannot honor intended load.
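A check along the following lines could catch this class of mismatch before a sweep runs. It is a minimal sketch, assuming the `x`-separated `concurrencies` string lists the sweep points and that the decode limits are available as a plain mapping; `parse_concurrencies` and `find_cap_violations` are hypothetical helpers, not functions from this repo.

```python
def parse_concurrencies(spec: str) -> list[int]:
    """Split an 'x'-separated sweep string such as '2048x4096' into ints.

    Assumption: 'x' separates individual concurrency points.
    """
    return [int(part) for part in spec.split("x") if part]


def find_cap_violations(spec: str, decode_caps: dict[str, int]) -> dict[str, int]:
    """Return the decode limits that are below the largest sweep point."""
    peak = max(parse_concurrencies(spec))
    return {name: cap for name, cap in decode_caps.items() if cap < peak}


# Values as described in the review comment for the 8P1D recipe.
caps_8p1d = {"max-running-requests": 2048, "max-mamba-cache-size": 2048}
print(find_cap_violations("2048x4096", caps_8p1d))

# The 2P1D recipe reportedly raises both limits to 4096, so nothing is flagged.
caps_2p1d = {"max-running-requests": 4096, "max-mamba-cache-size": 4096}
print(find_cap_violations("2048x4096", caps_2p1d))
```

The first call reports both decode limits, the second returns an empty mapping.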

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27652043520

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27653706015

/reuse-sweep-run
# Conflicts: # perf-changelog.yaml

/reuse-sweep-run

@functionstackx, could you please approve this.

@kedarpotdar-nv @Ankur-singh https://github.com/SemiAnalysisAI/InferenceX/blob/main/docs/PR_REVIEW_CHECKLIST.md

As a PR reviewer and CODEOWNER, I have reviewed this and have:
Additional detail section: This is a dis-agg config
Signed:

@claude ensure parity for the major parameters between the new entries in master yaml and their corresponding srt slurm config file (such as tp, ep, num workers, etc)

Claude finished @cquil11's task in 1m 53s

Parity Check: Master YAML vs SRT-Slurm Recipes

All 6 scenarios were checked for parity between the master YAML search-space entries and their corresponding srt-slurm recipe files. Major parameters are consistent across all configs.

1k/1k Scenarios

8k/1k Scenarios

Summary

All major parameters match across all 6 master YAML entries and their corresponding srt-slurm recipe files: TP, EP, num-worker (prefill/decode), dp-attn, concurrency lists, and ISL/OSL are all consistent.

One note (already flagged by Cursor Bugbot): in
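A parity check of this kind can be approximated with a small field-by-field comparison. The sketch below is illustrative only: the key names in `PARITY_KEYS` and the sample values are hypothetical stand-ins for the parameters named above (TP, EP, worker counts, dp-attn, concurrency lists, ISL/OSL), not the actual schema of the master YAML or the srt-slurm recipes.

```python
# Hypothetical, flattened key names for the parameters compared in the review.
PARITY_KEYS = ("tp", "ep", "num_workers", "dp_attn", "concurrencies", "isl", "osl")


def parity_diff(master_entry: dict, recipe: dict, keys=PARITY_KEYS) -> dict:
    """Return {key: (master_value, recipe_value)} for every key that differs."""
    return {
        key: (master_entry.get(key), recipe.get(key))
        for key in keys
        if master_entry.get(key) != recipe.get(key)
    }


# Illustrative values only; they are not copied from the PR's files.
master_entry = {"tp": 16, "ep": 16, "num_workers": 1, "dp_attn": True,
                "concurrencies": [2048, 4096], "isl": 8192, "osl": 1024}
recipe = dict(master_entry, concurrencies=[2048])

print(parity_diff(master_entry, master_entry))
print(parity_diff(master_entry, recipe))
```

An empty result means the two sides agree on every compared key; otherwise each mismatch is reported with both values side by side.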


Adds qwen3.5-fp8-gb200-dynamo-sglang: Qwen3.5-397B-A17B-FP8 disaggregated SGLang-via-Dynamo on GB200.

Note
Low Risk
Benchmark and CI launch/recipe configuration only; no application runtime, auth, or data-path changes.

Overview
Introduces qwen3.5-fp8-gb200-dynamo-sglang so CI can sweep Qwen/Qwen3.5-397B-A17B-FP8 on GB200 with disaggregated SGLang via Dynamo (lmsysorg/sglang:nightly-dev-cu13-20260608-303757cc).

nvidia-master.yaml defines fixed-seq-len scenarios for 1k/1k and 8k/1k with six topology variants: 1P1D TP4 (pure STP) and wide-EP layouts (DEP4 prefill / DEP16 decode), scaling from 1P1D through 2P1D, 4P1D, and 8P1D, each pointing at a dedicated recipe under benchmarks/multi_node/srt-slurm-recipes/sglang/qwen3.5/gb200-fp8/.

Six new srt-slurm recipe YAMLs encode Dynamo frontends, node/GPU counts, SGLang prefill/decode settings (including DeepEP on wide-EP decode where rebuild-deepep.sh is used), and sa-bench concurrency sweeps matched to the master config.

launch_gb200-nv.sh maps qwen3.5+fp8 to the Lustre model path and qwen3.5-fp8 alias, and overlays the local Qwen3.5 recipes into srt-slurm on clone (same pattern as DSV4).

perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 85a1949. Bugbot is set up for automated code reviews on this repo.