Day-zero MiniMax-M3 MXFP8 single-node recipes for H200/H100 (vLLM) by Oseltamivir · Pull Request #1731 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-06-12T22:36:49Z

Summary

Adds day-zero MiniMax-M3 benchmark coverage for the Hopper SKUs, following the
vLLM recipe. Complements the
sibling day-zero branches (feat/minimax-m3-dayzero → B200/B300/MI355X
single-node, feat/minimax-m3-day0 → GB200/GB300/MI300X multi-node).

New configs: minimaxm3-fp8-h200-vllm, minimaxm3-fp8-h100-vllm in nvidia-master.yaml
New scripts: benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_{h200,h100}.sh (modeled on the sibling minimaxm3_fp8_b200.sh)
Image: vllm/vllm-openai:minimax-m3 — the dedicated day-zero image built from vLLM's m3_release branch; M3 has not shipped in a stable release (v0.23.0rc2 has no minimax_m3)
Model: MiniMaxAI/MiniMax-M3-MXFP8 (~444 GB) — lowest precision available. BF16 (~854 GB) cannot fit 8x H100 at all and is a tight fit on 8x H200. On Hopper (no native MX tensor cores) the MoE runs via the image's Marlin/DeepGEMM MXFP8 paths
Mandatory recipe flags: --block-size 128 (MSA sparse-attention block alignment); --language-model-only (text-only benchmark, frees vision-encoder VRAM)
Parallelism sweep: H200 = TP4 / TP4+EP4 / TP8 / TP8+EP8 / DEP8 (dp-attn: true → --data-parallel-size 8 --enable-expert-parallel); H100 = TP8 / TEP8 (weights need ~56 GB of each 80 GB GPU, so TP8-class only)
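
The TP8-only constraint on H100 follows from dividing the checkpoint size across GPUs. A rough sketch of that arithmetic is below. It assumes weights shard evenly across ranks and ignores activations, KV cache, and any replicated layers. The 141 GB H200 capacity is not stated in this PR (it is NVIDIA's published figure), and the function name is illustrative only.

```python
# Back-of-the-envelope per-GPU weight footprint for MiniMax-M3-MXFP8.
# Assumption: weights split evenly across every GPU in the layout.
CHECKPOINT_GB = 444  # approximate MXFP8 checkpoint size quoted in this PR

# h100 capacity is quoted in the PR; the h200 value is NVIDIA's spec (assumption here).
GPU_MEMORY_GB = {"h100": 80, "h200": 141}


def weight_share_gb(num_gpus: int) -> float:
    """Approximate weight footprint (GB) per GPU under an even split."""
    return CHECKPOINT_GB / num_gpus


for sku, gpus in [("h100", 8), ("h100", 4), ("h200", 4), ("h200", 8)]:
    share = weight_share_gb(gpus)
    fits = share < GPU_MEMORY_GB[sku]
    print(f"{sku} x{gpus}: ~{share:.1f} GB of {GPU_MEMORY_GB[sku]} GB per GPU, fits={fits}")
```

Under these assumptions a 4-GPU layout needs about 111 GB of weights per GPU, which exceeds an 80 GB H100 but fits a 141 GB H200, while an 8-GPU layout needs about 55.5 GB per GPU, in line with the ~56 GB figure above.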

Validation

Full sweep 27441767143: 53/55 jobs green, including all gsm8k eval jobs on both SKUs.

H200: every mode green through DEP8 @ conc 1024 (1k1k) and DEP8 @ conc 512 (8k1k)
H100: TP8 and TEP8 green through conc 256 (1k1k) and 256 (8k1k)
The 2 failures were H100 DEP (1k1k conc 256/512): per-DP-rank replicated attention/dense/embedding weights (~20 GB BF16-dequantized) next to the ~52 GB expert shard leave no KV headroom on 80 GB — KV-cache init fails ("No available memory for the cache blocks"). Those two search-space points are removed from the matrix (TEP8 covers high concurrency on H100); DEP remains on H200 where it passes.

Earlier 8-job smoke 27439940008 validated every parallelism mode at conc 4 (7/8; the one failure was a shared-HF-cache WeakFileLock stale-file-handle race during concurrent first-downloads of the 444 GB checkpoint, fixed by retrying hf download and serving with HF_HUB_OFFLINE=1).

🤖 Generated with Claude Code

Note

Low Risk
Benchmark-only additions (YAML sweep definitions and shell launchers); no changes to production serving paths or shared library logic.

Overview
Introduces day-zero MiniMax-M3 (MiniMaxAI/MiniMax-M3-MXFP8) single-node coverage on Hopper via vllm/vllm-openai:minimax-m3, aligned with the vLLM M3 recipe.

nvidia-master.yaml gains minimaxm3-fp8-h100-vllm and minimaxm3-fp8-h200-vllm with fixed-seq-len sweeps at 1k/1k and 8k/1k. H200 explores TP4/TP8, expert parallel (TEP), and DEP (dp-attn: true). H100 is constrained to TP8-class layouts only—DEP is excluded from the search matrix after OOM on KV init at high concurrency; TEP8 carries the high-concurrency points.

New minimaxm3_fp8_h100.sh and minimaxm3_fp8_h200.sh wire up vLLM serve with recipe-required flags (--block-size 128, --language-model-only), parallel modes (TP vs TEP vs DEP), HF download retries plus offline serve to avoid cache lock races, extended engine ready timeout, and concurrency-scaled CUDA graph capture. The H100 script adds DEP-specific memory tuning even though DEP is not swept in YAML.
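
To make the TP / TEP / DEP distinction concrete, here is a minimal sketch of how a sweep point could be turned into `vllm serve` arguments. The helper `serve_args` and its parameters are hypothetical and are not taken from the actual launch scripts, which are shell. Only the flags named in this PR are emitted, and the DEP branch leaves out `--tensor-parallel-size` because the PR text does not show how the real script sets it in that mode.

```python
from typing import List

MODEL = "MiniMaxAI/MiniMax-M3-MXFP8"


def serve_args(num_gpus: int, ep: bool = False, dp_attn: bool = False) -> List[str]:
    """Hypothetical mapping from a sweep point to vllm serve arguments."""
    # Recipe-required flags quoted in the PR description.
    args = ["vllm", "serve", MODEL, "--block-size", "128", "--language-model-only"]
    if dp_attn:
        # DEP: per the PR, dp-attn: true maps to data parallelism plus expert parallelism.
        args += ["--data-parallel-size", str(num_gpus), "--enable-expert-parallel"]
    else:
        # TP, optionally with expert parallelism on top (TEP).
        args += ["--tensor-parallel-size", str(num_gpus)]
        if ep:
            args.append("--enable-expert-parallel")
    return args


print(" ".join(serve_args(8)))                # TP8
print(" ".join(serve_args(8, ep=True)))       # TEP8
print(" ".join(serve_args(8, dp_attn=True)))  # DEP8
```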

perf-changelog.yaml documents the two new config keys.

Reviewed by Cursor Bugbot for commit 27b1d41. Bugbot is set up for automated code reviews on this repo.

Adds minimaxm3-fp8-h200-vllm and minimaxm3-fp8-h100-vllm to nvidia-master.yaml plus the matching fixed-seq-len benchmark scripts, following https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3:

- Dedicated day-zero image vllm/vllm-openai:minimax-m3 (M3 has not shipped in a stable vLLM release; v0.23.0rc2 has no minimax_m3)
- MiniMaxAI/MiniMax-M3-MXFP8 (~427 GB weights), the lowest precision available; BF16 (~854 GB) cannot fit 8x H100 and is a tight fit on H200
- --block-size 128 is mandatory (MSA sparse attention block size)
- minimax_m3 tool-call/reasoning parsers, --language-model-only for the text-only fixed-seq-len scenarios
- Sweeps TP4/TP8, TP+EP (TEP) and DP-attention+EP (DEP) on H200; H100 is TP8-only since MXFP8 weights need ~56 GB of each 80 GB GPU

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Adopt the minimaxm3_fp8_b200.sh structure from feat/minimax-m3-dayzero: pow2 cudagraph capture sized to CONC (ragged DP arrival makes the full CONC bound safer than a per-rank max-num-seqs cap), ISL*2 batched tokens, and no chat parsers (benchmark drives /v1/completions). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
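
The capture sizing described in this commit can be sketched as follows. This is an illustration rather than the script's actual code: the function names are hypothetical, rounding the top size up to the next power of two is an assumption, and reading "ISL*2 batched tokens" as a batched-token limit of twice the input sequence length is also an assumption.

```python
from typing import List


def pow2_capture_sizes(conc: int) -> List[int]:
    """Powers of two from 1 up to the first power of two that is >= conc."""
    sizes = []
    size = 1
    while size < conc:
        sizes.append(size)
        size *= 2
    sizes.append(size)  # first power of two that covers the full concurrency
    return sizes


def max_batched_tokens(isl: int) -> int:
    """Batched-token budget as twice the input sequence length (assumed reading)."""
    return isl * 2


print(pow2_capture_sizes(64))    # sizes for a conc-64 sweep point
print(max_batched_tokens(1024))  # budget for the 1k/1k scenario
```

Sizing the capture list to the full CONC rather than CONC divided by the DP size reflects the commit's point that requests do not arrive evenly across DP ranks, so a single rank can briefly hold more than its average share.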

Smoke run 27439940008: h100 tp8 died in vllm load_weights -> snapshot_download -> WeakFileLock with OSError [Errno 116] Stale file handle — sibling nodes concurrently first-downloading the 444 GB checkpoint into the shared network-FS HF cache race on lock-file deletion. Retry hf download (resumable), then launch the server with HF_HUB_OFFLINE=1 so it reads the now-complete cache without taking hub locks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
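
The retry-then-offline pattern from this commit can be sketched generically. The real scripts are shell and retry `hf download`. The helper names below are hypothetical, and the download step is passed in as a callable so the sketch stays self-contained.

```python
import os
import time
from typing import Callable, Dict


def retry(action: Callable[[], None], attempts: int = 5, delay_s: float = 10.0) -> None:
    """Run action, retrying on OSError (for example Errno 116, stale file handle)."""
    for attempt in range(1, attempts + 1):
        try:
            action()
            return
        except OSError as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print(f"attempt {attempt} failed ({exc}); retrying in {delay_s}s")
            time.sleep(delay_s)


def offline_env(base: Dict[str, str]) -> Dict[str, str]:
    """Copy of the environment with HF_HUB_OFFLINE=1 set for the server process."""
    env = dict(base)
    env["HF_HUB_OFFLINE"] = "1"
    return env


# Usage sketch: download with retries first, then launch the server with
# offline_env(dict(os.environ)) so it reads the completed cache only.
```

Because the download is resumable, each retry continues from the files already in the cache instead of starting over.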

Sweep 27441767143 h100 tp8/ep8/dp-attn conc-256 failed KV-cache init ("No available memory for the cache blocks"): each DP rank replicates ~20 GB of BF16-dequantized attention/dense/embedding weights next to its ~52 GB expert shard, so the 0.90 gmu budget (72 GB) is consumed by weights + conc-sized cudagraphs (~2.3 GiB) before a single KV block fits. DP-attention path now uses gmu 0.94 and caps decode-graph capture at 2x the per-rank batch share (CONC/DP) instead of full CONC. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
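
The memory budget in this commit message can be checked with a few lines of arithmetic. The sketch below uses only the approximate figures quoted above and holds the cudagraph footprint constant at ~2.3 GiB. That is a simplification, since the fix also shrinks graph capture. The function name is illustrative.

```python
GIB_IN_GB = 1.073741824  # 1 GiB expressed in GB


def kv_headroom_gb(gpu_gb: float, gmu: float, weights_gb: float, cudagraph_gib: float) -> float:
    """Memory left for KV cache after weights and cudagraphs, in GB."""
    budget = gpu_gb * gmu  # share of the GPU the server may use
    return budget - weights_gb - cudagraph_gib * GIB_IN_GB


# Per-rank weights under DEP on H100: ~20 GB replicated + ~52 GB expert shard.
weights_gb = 20 + 52

before = kv_headroom_gb(80, 0.90, weights_gb, 2.3)  # original gmu
after = kv_headroom_gb(80, 0.94, weights_gb, 2.3)   # raised gmu

print(f"gmu 0.90: {before:+.2f} GB for KV cache")
print(f"gmu 0.94: {after:+.2f} GB for KV cache")
```

With these inputs the headroom is negative at gmu 0.90 and under 1 GB at gmu 0.94, which is consistent with the H100 DEP points being dropped from the sweep in the next commit.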

Both DEP points failed in sweep 27441767143 with "No available memory for the cache blocks": per-rank replicated attention weights leave no KV headroom on 80 GB at high concurrency. TEP8 covers the high-concurrency regime on H100; DEP remains in the H200 sweep where it passes through conc 1024. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Oseltamivir · 2026-06-12T22:41:32Z

Final state of the pre-PR full-sweep dispatch 27441767143 (run conclusion shows "cancelled" — clarification):

53 jobs green including all gsm8k evals on both SKUs
2 failures: H100 DEP (1k1k conc 256/512) — KV-cache init failure, those search-space points are removed in this PR (see 3201938)
1 cancelled: H200 1k1k tp8 conc-128 — preempted on the runner near the end (likely by the PR-label sweep claiming the H200 pool); the config itself passed at conc 4–64 and the missing point is produced by the post-merge full sweep

🤖 Generated with Claude Code

github-actions · 2026-06-12T22:42:35Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27447107640
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27447107640

github-actions · 2026-06-13T01:50:32Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27447149961
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27447149961

github-actions · 2026-06-13T02:11:13Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27447149961
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27447149961

functionstackx · 2026-06-13T02:22:52Z

/reuse-sweep-run

github-actions · 2026-06-13T02:23:25Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27453771654
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27453771654

…) recipes (#1739)

* minimaxm3 H200+H100 MTP: day-zero MiniMax-M3 EAGLE3 recipes

Adds the spec-decoding=mtp siblings of the day-zero H200/H100 recipes (PR #1731): same MXFP8 target and serve shape, plus the Inferact/MiniMax-M3-EAGLE3 draft head via --speculative-config (method eagle3, 3 speculative tokens). The drafter is pinned to FLASH_ATTN: the EAGLE3 head is MHA, and FlashInfer only supports the mandatory page size 128 through its GQA-only trtllm-gen kernel (the failure hit on the B300 MTP canary). Cudagraph capture scales to CONC * (1 + spec tokens); benchmark prompts run through the chat template so acceptance reflects real text. Search spaces mirror the non-MTP entries, trimmed at the extreme-concurrency end (dsv4 / minimaxm3 b300-mtp precedent); H100 stays TP8-only with DEP omitted.

Also adds SPEC_SUFFIX to the three H100 launchers (cw, cr, dgxc-slurm), which hardcoded _h100.sh and never gained the _mtp routing the H200 launchers have carried since #392. Without this, an mtp config on H100 silently runs the non-MTP script. This also fixes the same latent bug for the existing qwen3.5-fp8-h100-sglang-mtp config.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* perf-changelog: fill in PR link for minimaxm3 H200/H100 MTP

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
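
As an illustration of the speculative-decoding settings in #1739, the sketch below builds a JSON value for --speculative-config and computes the capture bound CONC * (1 + spec tokens). The JSON key names (`method`, `model`, `num_speculative_tokens`) are an assumption about what vLLM accepts and are not quoted from the PR. The function names are hypothetical.

```python
import json


def speculative_config(draft_model: str, num_spec_tokens: int) -> str:
    """JSON string for --speculative-config (key names are assumed, see lead-in)."""
    return json.dumps(
        {
            "method": "eagle3",
            "model": draft_model,
            "num_speculative_tokens": num_spec_tokens,
        }
    )


def capture_bound(conc: int, num_spec_tokens: int) -> int:
    """Upper bound on decode tokens per step: each request carries 1 + spec tokens."""
    return conc * (1 + num_spec_tokens)


print(speculative_config("Inferact/MiniMax-M3-EAGLE3", 3))
print(capture_bound(64, 3))
```

The bound grows with the draft length because every running request contributes its target token plus each speculative token to a single decode step.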

Oseltamivir and others added 5 commits June 12, 2026 13:00

Oseltamivir requested a review from a team June 12, 2026 22:36

Oseltamivir requested review from jgangani and kedarpotdar-nv as code owners June 12, 2026 22:36

github-project-automation Bot added this to InferenceMAX Board Jun 12, 2026

Backfill PR link 1731 in minimaxm3 changelog entry

9b70547

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Oseltamivir added sweep-enabled full-sweep-enabled and removed sweep-enabled labels Jun 12, 2026

Oseltamivir added 3 commits June 12, 2026 15:38

Merge branch 'main' into feat/minimax-m3-hopper-dayzero

e755fd0

Merge origin/main; re-append minimaxm3 changelog entry at tail

f5aaa70

Merge remote update-branch; keep changelog append-only order (main verbatim + 1731 at tail)

2264c84

functionstackx mentioned this pull request Jun 13, 2026

[Klaud Cold] minimaxm3 H200+H100 MTP: day-zero MiniMax-M3 EAGLE3 (MTP) recipes #1739

Merged

Merge branch 'main' into feat/minimax-m3-hopper-dayzero

27b1d41

functionstackx merged commit 1ef98e3 into main Jun 13, 2026
15 of 22 checks passed

functionstackx deleted the feat/minimax-m3-hopper-dayzero branch June 13, 2026 02:23

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 13, 2026

functionstackx merged 10 commits into main from feat/minimax-m3-hopper-dayzero

2 participants
