[NVIDIA] feat: MiniMax M3 Day 0 support B200 #1723

Merged
cquil11 merged 6 commits into main from feat/minimax-m3-b200 on Jun 13, 2026

cquil11 (Collaborator) commented Jun 12, 2026

MiniMax-M3 MXFP8 day-zero single-node vLLM sweep on B200.

  • New config minimaxm3-fp8-b200-vllm (.github/configs/nvidia-master.yaml) — TP8/TP4/TEP/DEP layouts across 1k1k and 8k1k (33 jobs).
  • New bench script benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200.sh: --block-size 128 (MSA sparse attention), --language-model-only, conc-scaled cudagraph capture, MXFP8 checkpoint.
  • Image: dedicated vllm/vllm-openai:minimax-m3 (M3 support is not yet merged upstream; see vllm-project/vllm#45381, "[Model] Add MiniMax M3 support").
  • runners.yaml: adds the b200-dgxc runner-type group for targeted dispatch.
  • launch_b200-dgxc.sh: MODEL_PATH case for the gharunner-staged weights (M3 is not SRE-staged).
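
The layout and flag choices listed above can be sketched as a small shell helper. This is a minimal illustration, not the contents of minimaxm3_fp8_b200.sh: the function names `layout_args` and `max_capture_size`, the layout labels, the DEP mapping (data-parallel size equal to the GPU count), the power-of-two rounding, and the 512 cap are all assumptions. `--tensor-parallel-size`, `--enable-expert-parallel`, `--data-parallel-size`, and `--block-size` are standard vLLM options; `--language-model-only` is taken from this PR description. How the capture size is passed to vLLM is not shown here, so the sketch only computes it.

```bash
# Hypothetical helper: map a layout label and GPU count to vLLM parallelism flags.
# The labels (tp, tep, dep) are illustrative, not the script's real variables.
layout_args() {
    layout=$1
    gpus=$2
    case "$layout" in
        tp)  echo "--tensor-parallel-size $gpus" ;;
        tep) echo "--tensor-parallel-size $gpus --enable-expert-parallel" ;;
        # Assumption: DEP runs attention data-parallel with experts sharded.
        dep) echo "--data-parallel-size $gpus --enable-expert-parallel" ;;
        *)   echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}

# Hypothetical: scale the largest CUDA graph capture size with the benchmark
# concurrency, rounding up to a power of two and clamping at an assumed cap of 512.
max_capture_size() {
    conc=$1
    size=1
    while [ "$size" -lt "$conc" ] && [ "$size" -lt 512 ]; do
        size=$((size * 2))
    done
    echo "$size"
}

# Flags named in the PR description, shared by every layout.
COMMON_ARGS="--block-size 128 --language-model-only"

echo "vllm serve <model> $COMMON_ARGS $(layout_args tep 4)"
echo "largest capture size at concurrency 512: $(max_capture_size 512)"
```

Keeping the layout mapping in one function means each sweep job differs only in the label and GPU count it passes in.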

Status: full sweep green (33/33 + 4/4 GSM8K). Pareto: TP8 wins latency (~63 tok/s/user @ c4); TP4+EP4 wins 1k1k throughput (1289 tok/s/GPU @ c512); TP4 wins 8k1k. Runs: canary, full.

🤖 Generated with Claude Code


Note

Low Risk
Benchmark and CI wiring only (YAML, shell launcher, changelog); no changes to core inference or auth paths.

Overview
Adds day-zero MiniMax-M3 MXFP8 single-node vLLM benchmarking on B200, mirroring the existing B300 setup with runner-specific weight staging.

A new minimaxm3-fp8-b200-vllm entry in nvidia-master.yaml targets b200-dgxc, uses vllm/vllm-openai:minimax-m3, and sweeps fixed-seq-len 1k1k and 8k1k across TP4/TP8, TEP (EP), and DEP (dp-attn) concurrency ranges. runners.yaml introduces a b200-dgxc runner-type group (the ten b200-dgxc_* labels) so jobs can dispatch to that pool.

launch_b200-dgxc.sh maps minimaxm3 + fp8 to /lustre/fsw/gharunners/models/MiniMax-M3-MXFP8 because M3 weights are not SRE-staged under /lustre/fsw/models. The new benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200.sh script runs vLLM with --block-size 128, --language-model-only, extended engine ready timeout, conditional TP/EP/DEP args, and concurrency-scaled CUDA graph capture. perf-changelog.yaml documents the new config key and PR link.
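
The weight-path mapping described above might look roughly like the following. This is a sketch under stated assumptions, not the actual launch_b200-dgxc.sh: the function name `resolve_model_path`, its argument order, and the fallback directory layout are hypothetical. Only the two directory roots and the minimaxm3 + fp8 pairing come from the description.

```bash
# Hypothetical helper: resolve the weights directory for a model/precision pair.
resolve_model_path() {
    model=$1
    precision=$2
    case "${model}:${precision}" in
        minimaxm3:fp8)
            # M3 weights are staged by the gharunner, not by SRE.
            echo "/lustre/fsw/gharunners/models/MiniMax-M3-MXFP8" ;;
        *)
            # Assumed fallback: SRE-staged models live under /lustre/fsw/models.
            # The per-model subdirectory name is illustrative only.
            echo "/lustre/fsw/models/${model}" ;;
    esac
}

MODEL_PATH=$(resolve_model_path minimaxm3 fp8)
echo "$MODEL_PATH"
```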

Reviewed by Cursor Bugbot for commit b7e1588. Bugbot is set up for automated code reviews on this repo.

MXFP8 single-node vLLM sweep (TP/TEP/DEP) for MiniMax-M3 on B200.
--block-size 128 (MSA sparse attention), --language-model-only for
text-only throughput, dedicated vllm/vllm-openai:minimax-m3 image
(vllm-project/vllm#45381). Adds the b200-dgxc runner-type group and a
launch_b200-dgxc.sh MODEL_PATH case for the gharunner-staged weights.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions (Contributor) commented

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

cquil11 and others added 2 commits June 12, 2026 14:54
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mirror PR #1724 review changes to B200: TP8+EP8 conc-start 128->4
(1k1k and 8k1k) to probe whether TEP8 extends the min-latency frontier
below plain TP8; TP4+EP4 conc-start 128->64 (1k1k) to fill the mid-curve.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Lower conc-start 4->1 on the latency-probing layouts (tp8, tp8+ep8, tp4)
for both 1k1k and 8k1k to capture single/dual-request min-latency points.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
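
To illustrate the effect of lowering conc-start, the sketch below lists the points of a sweep that doubles concurrency from a start value to an end value. The doubling step, the function name `conc_points`, and the end value of 64 are assumptions for illustration; the real ranges live in nvidia-master.yaml and are not reproduced here.

```bash
# Hypothetical: list the concurrency points of a sweep that doubles from start to end.
conc_points() {
    c=$1
    end=$2
    # Guard against a non-positive start, which would never advance.
    [ "$c" -ge 1 ] || return 1
    points=""
    while [ "$c" -le "$end" ]; do
        points="${points:+$points }$c"
        c=$((c * 2))
    done
    echo "$points"
}

echo "conc-start 4: $(conc_points 4 64)"
echo "conc-start 1: $(conc_points 1 64)"
```

Under these assumptions, moving the start from 4 to 1 adds exactly the concurrency 1 and 2 points, which are the single- and dual-request cases the commit message refers to.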

# Conflicts:
#	.github/configs/nvidia-master.yaml
#	perf-changelog.yaml

cquil11 (Collaborator, Author) commented Jun 13, 2026

/reuse-sweep-run

cquil11 merged commit fa0f483 into main on Jun 13, 2026
41 of 45 checks passed
cquil11 deleted the feat/minimax-m3-b200 branch on Jun 13, 2026 01:17