[NVIDIA] feat: MiniMax M3 Day 0 support B200 #1723
MXFP8 single-node vLLM sweep (TP/TEP/DEP) for MiniMax-M3 on B200. --block-size 128 (MSA sparse attention), --language-model-only for text-only throughput, dedicated vllm/vllm-openai:minimax-m3 image (vllm-project/vllm#45381). Adds the b200-dgxc runner-type group and a launch_b200-dgxc.sh MODEL_PATH case for the gharunner-staged weights. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mirror PR #1724 review changes to B200: TP8+EP8 conc-start 128->4 (1k1k and 8k1k) to probe whether TEP8 extends the min-latency frontier below plain TP8; TP4+EP4 conc-start 128->64 (1k1k) to fill the mid-curve. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27439459853
Lower conc-start 4->1 on the latency-probing layouts (tp8, tp8+ep8, tp4) for both 1k1k and 8k1k to capture single/dual-request min-latency points. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
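A minimal sketch of how lowering conc-start widens a sweep, assuming the concurrency points double from conc-start up to a conc-end value. The helper name `conc_points` and the doubling step are assumptions for illustration; the PR only states the conc-start values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list concurrency points from START to END,
# doubling at each step (the doubling rule is an assumption).
conc_points() {
  local start="$1" end="$2" c
  local points=()
  # Guard against a non-positive start, which would never advance.
  if (( start < 1 )); then
    return 1
  fi
  for ((c = start; c <= end; c *= 2)); do
    points+=("$c")
  done
  printf '%s\n' "${points[*]}"
}

conc_points 4 8   # prints: 4 8
conc_points 1 8   # prints: 1 2 4 8
```

Under that assumption, moving conc-start from 4 to 1 adds the single-request and dual-request points at the low end of each curve.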
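A minimal sketch of how lowering conc-start widens a sweep, assuming the concurrency points double from conc-start up to a conc-end value. The helper name `conc_points` and the doubling step are assumptions for illustration; the PR only states the conc-start values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list concurrency points from START to END,
# doubling at each step (the doubling rule is an assumption).
conc_points() {
  local start="$1" end="$2" c
  local points=()
  # Guard against a non-positive start, which would never advance.
  if (( start < 1 )); then
    return 1
  fi
  for ((c = start; c <= end; c *= 2)); do
    points+=("$c")
  done
  printf '%s\n' "${points[*]}"
}

conc_points 4 8   # prints: 4 8
conc_points 1 8   # prints: 1 2 4 8
```

Under that assumption, moving conc-start from 4 to 1 adds the single-request and dual-request points at the low end of each curve.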
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27441417384
# Conflicts:
#   .github/configs/nvidia-master.yaml
#   perf-changelog.yaml
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27443445568
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27449751544
/reuse-sweep-run
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27443445568
MiniMax-M3 MXFP8 day-zero single-node vLLM sweep on B200.
- minimaxm3-fp8-b200-vllm (.github/configs/nvidia-master.yaml): TP8/TP4/TEP/DEP layouts across 1k1k and 8k1k (33 jobs).
- benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200.sh: --block-size 128 (MSA sparse attention), --language-model-only, conc-scaled cudagraph capture, MXFP8 checkpoint.
- vllm/vllm-openai:minimax-m3 (M3 support unmerged upstream: [Model] Add MiniMax M3 support vllm-project/vllm#45381).
- runners.yaml: adds the b200-dgxc runner-type group for targeted dispatch.
- launch_b200-dgxc.sh: MODEL_PATH case for the gharunner-staged weights (M3 is not SRE-staged).

Status: full sweep green (33/33 + 4/4 GSM8K).
Pareto: TP8 wins latency (~63 tok/s/user @ c4); TP4+EP4 wins 1k1k throughput (1289 tok/s/GPU @ c512); TP4 wins 8k1k.
Runs: canary, full.
🤖 Generated with Claude Code
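For readers unfamiliar with the launcher, the MODEL_PATH case could look roughly like the sketch below. The function name `resolve_model_path` and the fallback branch are assumptions for illustration; only the minimaxm3+fp8 mapping and the two /lustre/fsw roots come from this PR's review summary.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a MODEL_PATH lookup keyed on model and precision.
resolve_model_path() {
  local model="$1" precision="$2"
  case "${model}+${precision}" in
    minimaxm3+fp8)
      # M3 weights are staged under the gharunners tree, not the SRE-staged one.
      printf '%s\n' "/lustre/fsw/gharunners/models/MiniMax-M3-MXFP8"
      ;;
    *)
      # Assumed fallback layout; the real script's default is not shown in the PR.
      printf '%s\n' "/lustre/fsw/models/${model}"
      ;;
  esac
}

MODEL_PATH="$(resolve_model_path minimaxm3 fp8)"
echo "$MODEL_PATH"
```

Keeping the exception in a single case arm leaves the default path handling untouched for models that are SRE-staged.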
Note
Low Risk
Benchmark and CI wiring only (YAML, shell launcher, changelog); no changes to core inference or auth paths.
Overview
Adds day-zero MiniMax-M3 MXFP8 single-node vLLM benchmarking on B200, mirroring the existing B300 setup with runner-specific weight staging.
A new minimaxm3-fp8-b200-vllm entry in nvidia-master.yaml targets b200-dgxc, uses vllm/vllm-openai:minimax-m3, and sweeps fixed-seq-len 1k1k and 8k1k across TP4/TP8, TEP (EP), and DEP (dp-attn) concurrency ranges.
runners.yaml introduces a b200-dgxc runner-type group (the ten b200-dgxc_* labels) so jobs can dispatch to that pool. launch_b200-dgxc.sh maps minimaxm3+fp8 to /lustre/fsw/gharunners/models/MiniMax-M3-MXFP8 because M3 weights are not SRE-staged under /lustre/fsw/models. The new benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200.sh script runs vLLM with --block-size 128, --language-model-only, extended engine ready timeout, conditional TP/EP/DEP args, and concurrency-scaled CUDA graph capture.
perf-changelog.yaml documents the new config key and PR link.
Reviewed by Cursor Bugbot for commit b7e1588. Bugbot is set up for automated code reviews on this repo.
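As an illustration of the "conditional TP/EP/DEP args" mentioned above, a benchmark script might assemble its vLLM flags along these lines. The function name, its three parameters, and the mapping of DEP to data parallelism plus expert parallelism are assumptions, not taken from minimaxm3_fp8_b200.sh; --block-size 128 and --language-model-only are the flags named in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: build the vLLM argument list for one sweep layout.
# Arguments: GPU count, "true"/"false" for expert parallel, "true"/"false" for dp-attn.
build_vllm_args() {
  local gpus="$1" ep_enabled="$2" dp_attn="$3"
  # Flags named in the PR description.
  local args=(--block-size 128 --language-model-only)

  if [ "$dp_attn" = "true" ]; then
    # DEP (assumed mapping): data-parallel attention with expert parallelism.
    args+=(--data-parallel-size "$gpus" --enable-expert-parallel)
  else
    args+=(--tensor-parallel-size "$gpus")
    if [ "$ep_enabled" = "true" ]; then
      # TEP: tensor parallelism plus expert parallelism.
      args+=(--enable-expert-parallel)
    fi
  fi
  printf '%s\n' "${args[*]}"
}

build_vllm_args 8 false false   # plain TP8
build_vllm_args 8 true false    # TP8+EP8
build_vllm_args 8 false true    # DEP8
```

Building the flags in an array keeps each layout's arguments intact as separate words until they are joined for display or passed to the server command.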