[NVIDIA] feat: MiniMax M3 Day 0 support B200 by cquil11 · Pull Request #1723 · SemiAnalysisAI/InferenceX

cquil11 · 2026-06-12T19:50:43Z

MiniMax-M3 MXFP8 day-zero single-node vLLM sweep on B200.

New config minimaxm3-fp8-b200-vllm (.github/configs/nvidia-master.yaml) — TP8/TP4/TEP/DEP layouts across 1k1k and 8k1k (33 jobs).
New bench script benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200.sh — --block-size 128 (MSA sparse attention), --language-model-only, conc-scaled cudagraph capture, MXFP8 checkpoint.
Image: dedicated vllm/vllm-openai:minimax-m3 (M3 support unmerged upstream — [Model] Add MiniMax M3 support vllm-project/vllm#45381).
runners.yaml: adds the b200-dgxc runner-type group for targeted dispatch.
launch_b200-dgxc.sh: MODEL_PATH case for the gharunner-staged weights (M3 is not SRE-staged).
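
The MODEL_PATH case in the last item can be pictured as a lookup with an override table. The following is only a sketch of that idea in Python, not the launcher's actual shell code: the two directory paths come from this PR description, while `resolve_model_path`, `GHARUNNER_OVERRIDES`, and the `default_dirname` parameter are hypothetical names introduced for illustration.

```python
from pathlib import PurePosixPath

# Both roots are quoted from the PR description.
SRE_MODEL_ROOT = PurePosixPath("/lustre/fsw/models")

# Hypothetical override table: (model prefix, precision) -> staged weights.
GHARUNNER_OVERRIDES = {
    ("minimaxm3", "fp8"): PurePosixPath(
        "/lustre/fsw/gharunners/models/MiniMax-M3-MXFP8"
    ),
}


def resolve_model_path(model_prefix: str, precision: str,
                       default_dirname: str) -> PurePosixPath:
    """Return the weights directory for a (model, precision) pair.

    Models with an override entry use the gharunner-staged copy; all
    others fall back to a directory under the SRE-staged root. The
    fallback layout is an assumption made for this sketch.
    """
    override = GHARUNNER_OVERRIDES.get((model_prefix, precision))
    if override is not None:
        return override
    return SRE_MODEL_ROOT / default_dirname


if __name__ == "__main__":
    print(resolve_model_path("minimaxm3", "fp8", "unused"))
```

Keeping the exception in a table rather than in branching logic means the entry can be deleted once the weights are staged in the shared location.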

Status: full sweep green (33/33 + 4/4 GSM8K). Pareto: TP8 wins latency (~63 tok/s/user @ c4); TP4+EP4 wins 1k1k throughput (1289 tok/s/GPU @ c512); TP4 wins 8k1k. Runs: canary, full.
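
For readers unfamiliar with how a Pareto comparison like the one above is derived, here is a minimal sketch of extracting the frontier from (tok/s/user, tok/s/GPU) pairs. The sample points are synthetic placeholders, not results from this sweep, and `Point` and `pareto_frontier` are hypothetical names rather than anything in the repo.

```python
from typing import NamedTuple


class Point(NamedTuple):
    label: str
    tok_s_user: float  # interactivity, higher is better
    tok_s_gpu: float   # per-GPU throughput, higher is better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points that no other point beats on both axes."""
    frontier = []
    best_gpu = float("-inf")
    # Walk from most to least interactive; a point survives only if it
    # improves on the best per-GPU throughput seen so far.
    for p in sorted(points, key=lambda p: (-p.tok_s_user, -p.tok_s_gpu)):
        if p.tok_s_gpu > best_gpu:
            frontier.append(p)
            best_gpu = p.tok_s_gpu
    return frontier


if __name__ == "__main__":
    # Synthetic placeholder data, NOT measurements from this PR.
    samples = [
        Point("A", 60.0, 100.0),
        Point("B", 40.0, 400.0),
        Point("C", 35.0, 300.0),   # dominated by B on both axes
        Point("D", 10.0, 1200.0),
    ]
    print([p.label for p in pareto_frontier(samples)])  # ['A', 'B', 'D']
```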

🤖 Generated with Claude Code

Note

Low Risk
Benchmark and CI wiring only (YAML, shell launcher, changelog); no changes to core inference or auth paths.

Overview
Adds day-zero MiniMax-M3 MXFP8 single-node vLLM benchmarking on B200, mirroring the existing B300 setup with runner-specific weight staging.

A new minimaxm3-fp8-b200-vllm entry in nvidia-master.yaml targets b200-dgxc, uses vllm/vllm-openai:minimax-m3, and sweeps fixed-seq-len 1k1k and 8k1k across TP4/TP8, TEP (EP), and DEP (dp-attn) concurrency ranges. runners.yaml introduces a b200-dgxc runner-type group (the ten b200-dgxc_* labels) so jobs can dispatch to that pool.

launch_b200-dgxc.sh maps minimaxm3 + fp8 to /lustre/fsw/gharunners/models/MiniMax-M3-MXFP8 because M3 weights are not SRE-staged under /lustre/fsw/models. The new benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200.sh script runs vLLM with --block-size 128, --language-model-only, extended engine ready timeout, conditional TP/EP/DEP args, and concurrency-scaled CUDA graph capture. perf-changelog.yaml documents the new config key and PR link.
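
As a rough illustration of the "conditional TP/EP/DEP args" described above, the sketch below assembles a `vllm serve` argument list in Python. It is not the contents of minimaxm3_fp8_b200.sh. `--block-size 128` and `--language-model-only` are taken from the PR text; how DEP maps onto `--data-parallel-size`, the use of `--max-cudagraph-capture-size`, and the capture-size policy in `capture_size` are assumptions made for this example.

```python
def capture_size(conc: int, cap: int = 512) -> int:
    """Hypothetical policy: capture CUDA graphs up to the benchmark
    concurrency, bounded by an arbitrary cap."""
    return max(1, min(conc, cap))


def build_serve_args(model_path: str, tp: int, ep: bool = False,
                     dp_attn: bool = False, conc: int = 1) -> list[str]:
    """Build a vllm serve argv for one sweep point (sketch only)."""
    args = [
        "vllm", "serve", model_path,
        "--block-size", "128",        # from the PR: MSA sparse attention
        "--language-model-only",      # from the PR: text-only throughput
    ]
    if dp_attn:
        # DEP (assumed mapping): data-parallel attention with
        # expert-parallel MoE layers, one GPU per attention replica.
        args += ["--tensor-parallel-size", "1",
                 "--data-parallel-size", str(tp),
                 "--enable-expert-parallel"]
    else:
        args += ["--tensor-parallel-size", str(tp)]
        if ep:
            # TEP: tensor-parallel attention plus expert parallelism.
            args.append("--enable-expert-parallel")
    # Assumed flag for bounding CUDA graph capture by concurrency.
    args += ["--max-cudagraph-capture-size", str(capture_size(conc))]
    return args


if __name__ == "__main__":
    path = "/lustre/fsw/gharunners/models/MiniMax-M3-MXFP8"
    print(" ".join(build_serve_args(path, tp=4, ep=True, conc=64)))
```

Scaling the capture size with concurrency avoids spending startup time capturing graphs for batch sizes a low-concurrency run never reaches.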

Reviewed by Cursor Bugbot for commit b7e1588. Bugbot is set up for automated code reviews on this repo.

MXFP8 single-node vLLM sweep (TP/TEP/DEP) for MiniMax-M3 on B200. --block-size 128 (MSA sparse attention), --language-model-only for text-only throughput, dedicated vllm/vllm-openai:minimax-m3 image (vllm-project/vllm#45381). Adds the b200-dgxc runner-type group and a launch_b200-dgxc.sh MODEL_PATH case for the gharunner-staged weights. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-12T19:50:57Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from their respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Mirror PR #1724 review changes to B200: TP8+EP8 conc-start 128->4 (1k1k and 8k1k) to probe whether TEP8 extends the min-latency frontier below plain TP8; TP4+EP4 conc-start 128->64 (1k1k) to fill the mid-curve. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-12T20:34:27Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27439459853
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27439459853

Lower conc-start 4->1 on the latency-probing layouts (tp8, tp8+ep8, tp4) for both 1k1k and 8k1k to capture single/dual-request min-latency points. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
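
The effect of lowering conc-start can be sketched as follows, assuming the sweep steps concurrency by doubling from conc-start up to conc-end. That stepping rule is inferred from the "add conc 1 and 2" wording and is not confirmed here; `concurrency_ladder` is a hypothetical helper and the upper bound of 64 is a placeholder.

```python
def concurrency_ladder(conc_start: int, conc_end: int) -> list[int]:
    """List concurrency points from conc_start to conc_end, doubling
    each step (assumed stepping rule)."""
    if conc_start < 1 or conc_end < conc_start:
        raise ValueError("need 1 <= conc_start <= conc_end")
    points = []
    c = conc_start
    while c <= conc_end:
        points.append(c)
        c *= 2
    return points


if __name__ == "__main__":
    before = concurrency_ladder(4, 64)  # 64 is a placeholder conc-end
    after = concurrency_ladder(1, 64)
    # Points gained by lowering conc-start from 4 to 1.
    print(sorted(set(after) - set(before)))  # [1, 2]
```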

github-actions · 2026-06-12T21:14:46Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27441417384
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27441417384

# Conflicts: # .github/configs/nvidia-master.yaml # perf-changelog.yaml

github-actions · 2026-06-12T23:53:22Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27443445568
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27443445568

github-actions · 2026-06-12T23:54:44Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27449751544
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27449751544

github-actions · 2026-06-13T00:04:29Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27443445568
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27443445568

github-actions · 2026-06-13T00:18:44Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27443445568
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27443445568

github-actions · 2026-06-13T00:20:53Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27449751544
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27449751544

cquil11 · 2026-06-13T00:25:53Z

/reuse-sweep-run

github-actions · 2026-06-13T01:12:36Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27443445568
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27443445568

cquil11 requested a review from a team June 12, 2026 19:50

cquil11 requested review from jgangani and kedarpotdar-nv as code owners June 12, 2026 19:50

github-project-automation Bot added this to InferenceMAX Board Jun 12, 2026

Merge branch 'main' into feat/minimax-m3-b200

26fc0b0

cquil11 added the full-sweep-fail-fast label Jun 12, 2026

cquil11 and others added 2 commits June 12, 2026 14:54

minimaxm3-fp8-b200-vllm: add perf-changelog entry

0d27477

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

functionstackx mentioned this pull request Jun 12, 2026

[Klaud Cold][NVIDIA] feat: MiniMax M3 Day 0 support H200 #1728

Closed

minimaxm3-fp8-b200-vllm: add conc 1 and 2 to latency layouts

4aee03e

Merge remote-tracking branch 'origin/main' into feat/minimax-m3-b200

b7e1588

cquil11 merged commit fa0f483 into main Jun 13, 2026
41 of 45 checks passed

cquil11 deleted the feat/minimax-m3-b200 branch June 13, 2026 01:17

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 13, 2026

cquil11 mentioned this pull request Jun 13, 2026

[NVIDIA] feat: MiniMax M3 Day 0 MTP (EAGLE3) support B200 #1736

Closed

cquil11 merged 6 commits into main from feat/minimax-m3-b200
