
Day-zero MiniMax-M3 MXFP8 single-node recipes for H200/H100 (vLLM) #1731

Merged
functionstackx merged 10 commits into main from feat/minimax-m3-hopper-dayzero
Jun 13, 2026

@Oseltamivir Oseltamivir commented Jun 12, 2026

Collaborator

Summary

Adds day-zero MiniMax-M3 benchmark coverage for the Hopper SKUs, following the
vLLM recipe. Complements the
sibling day-zero branches (feat/minimax-m3-dayzero → B200/B300/MI355X
single-node, feat/minimax-m3-day0 → GB200/GB300/MI300X multi-node).

  • New configs: minimaxm3-fp8-h200-vllm, minimaxm3-fp8-h100-vllm in nvidia-master.yaml
  • New scripts: benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_{h200,h100}.sh (modeled on the sibling minimaxm3_fp8_b200.sh)
  • Image: vllm/vllm-openai:minimax-m3 — the dedicated day-zero image built from vLLM's m3_release branch; M3 has not shipped in a stable release (v0.23.0rc2 has no minimax_m3)
  • Model: MiniMaxAI/MiniMax-M3-MXFP8 (~444 GB) — lowest precision available. BF16 (~854 GB) cannot fit 8x H100 at all and is a tight fit on 8x H200. On Hopper (no native MX tensor cores) the MoE runs via the image's Marlin/DeepGEMM MXFP8 paths
  • Mandatory recipe flags: --block-size 128 (MSA sparse-attention block alignment); --language-model-only (text-only benchmark, frees vision-encoder VRAM)
  • Parallelism sweep: H200 = TP4 / TP4+EP4 / TP8 / TP8+EP8 / DEP8 (dp-attn: true → --data-parallel-size 8 --enable-expert-parallel); H100 = TP8 / TEP8 (weights need ~56 GB of each 80 GB GPU, so TP8-class only)
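
The sizing claims in the bullets above can be checked with a few lines of arithmetic. This is a rough sketch, not code from the PR: it assumes weights shard evenly across the tensor-parallel group, uses the approximate checkpoint sizes quoted above, and takes 141 GB as the H200 HBM capacity. The function names are hypothetical.

```python
# Rough per-GPU weight share under tensor parallelism.
# Checkpoint sizes are the approximate figures quoted in this PR.
# Assumption: weights shard evenly across the TP group.
CHECKPOINT_GB = {"MXFP8": 444, "BF16": 854}
HBM_GB = {"H100": 80, "H200": 141}


def weights_per_gpu_gb(checkpoint_gb: float, tp: int) -> float:
    """Checkpoint size divided evenly across `tp` GPUs."""
    return checkpoint_gb / tp


def weights_fit(precision: str, sku: str, tp: int) -> bool:
    """True if the per-GPU weight share is below raw HBM capacity.

    KV cache, activations and CUDA graphs are ignored, so True only means
    "not ruled out by weights alone", not "will serve".
    """
    return weights_per_gpu_gb(CHECKPOINT_GB[precision], tp) < HBM_GB[sku]


if __name__ == "__main__":
    for sku in ("H100", "H200"):
        for tp in (4, 8):
            share = weights_per_gpu_gb(CHECKPOINT_GB["MXFP8"], tp)
            fits = weights_fit("MXFP8", sku, tp)
            print(f"{sku} TP{tp}: {share:.1f} GB/GPU, fits={fits}")
```

Under these assumptions MXFP8 at TP8 needs about 55.5 GB per GPU, which matches the ~56 GB figure, while TP4 needs about 111 GB per GPU and so is only an option on H200.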

Validation

Full sweep 27441767143: 53/55 jobs green, including all gsm8k eval jobs on both SKUs.

  • H200: every mode green through DEP8 @ conc 1024 (1k1k) and DEP8 @ conc 512 (8k1k)
  • H100: TP8 and TEP8 green through conc 256 on both 1k1k and 8k1k
  • The 2 failures were H100 DEP (1k1k conc 256/512): per-DP-rank replicated attention/dense/embedding weights (~20 GB BF16-dequantized) next to the ~52 GB expert shard leave no KV headroom on 80 GB — KV-cache init fails ("No available memory for the cache blocks"). Those two search-space points are removed from the matrix (TEP8 covers high concurrency on H100); DEP remains on H200 where it passes.

Earlier 8-job smoke 27439940008 validated every parallelism mode at conc 4 (7/8; the one failure was a shared-HF-cache WeakFileLock stale-file-handle race during concurrent first-downloads of the 444 GB checkpoint, fixed by retrying hf download and serving with HF_HUB_OFFLINE=1).

🤖 Generated with Claude Code


Note

Low Risk
Benchmark-only additions (YAML sweep definitions and shell launchers); no changes to production serving paths or shared library logic.

Overview
Introduces day-zero MiniMax-M3 (MiniMaxAI/MiniMax-M3-MXFP8) single-node coverage on Hopper via vllm/vllm-openai:minimax-m3, aligned with the vLLM M3 recipe.

nvidia-master.yaml gains minimaxm3-fp8-h100-vllm and minimaxm3-fp8-h200-vllm with fixed-seq-len sweeps at 1k/1k and 8k/1k. H200 explores TP4/TP8, expert parallel (TEP), and DEP (dp-attn: true). H100 is constrained to TP8-class layouts only—DEP is excluded from the search matrix after OOM on KV init at high concurrency; TEP8 carries the high-concurrency points.

New minimaxm3_fp8_h100.sh and minimaxm3_fp8_h200.sh wire up vLLM serve with recipe-required flags (--block-size 128, --language-model-only), parallel modes (TP vs TEP vs DEP), HF download retries plus offline serve to avoid cache lock races, extended engine ready timeout, and concurrency-scaled CUDA graph capture. The H100 script adds DEP-specific memory tuning even though DEP is not swept in YAML.
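
A minimal sketch of how the three parallel modes could map onto vLLM serve flags. It is an illustration, not the contents of the new scripts: `parallel_flags` and `serve_command` are hypothetical helpers, and running DEP with `--tensor-parallel-size 1` is an assumption, since the PR only names `--data-parallel-size` and `--enable-expert-parallel` for that mode.

```python
from typing import List

# Flags the PR describes as mandatory for the M3 recipe.
RECIPE_FLAGS = ["--block-size", "128", "--language-model-only"]


def parallel_flags(mode: str, gpus: int) -> List[str]:
    """Map a sweep mode to vLLM parallelism flags.

    TP  : tensor parallel only
    TEP : tensor parallel plus expert parallel
    DEP : data-parallel attention plus expert parallel
    Assumption: DEP uses tensor-parallel-size 1.
    """
    if mode == "TP":
        return ["--tensor-parallel-size", str(gpus)]
    if mode == "TEP":
        return ["--tensor-parallel-size", str(gpus), "--enable-expert-parallel"]
    if mode == "DEP":
        return [
            "--tensor-parallel-size", "1",
            "--data-parallel-size", str(gpus),
            "--enable-expert-parallel",
        ]
    raise ValueError(f"unknown parallel mode: {mode}")


def serve_command(model: str, mode: str, gpus: int) -> List[str]:
    """Assemble the argv for `vllm serve` without running it."""
    return ["vllm", "serve", model, *parallel_flags(mode, gpus), *RECIPE_FLAGS]


if __name__ == "__main__":
    print(" ".join(serve_command("MiniMaxAI/MiniMax-M3-MXFP8", "DEP", 8)))
```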

perf-changelog.yaml documents the two new config keys.

Reviewed by Cursor Bugbot for commit 27b1d41. Bugbot is set up for automated code reviews on this repo.

Oseltamivir and others added 5 commits June 12, 2026 13:00
Adds minimaxm3-fp8-h200-vllm and minimaxm3-fp8-h100-vllm to
nvidia-master.yaml plus the matching fixed-seq-len benchmark scripts,
following https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3:

- Dedicated day-zero image vllm/vllm-openai:minimax-m3 (M3 has not
  shipped in a stable vLLM release; v0.23.0rc2 has no minimax_m3)
- MiniMaxAI/MiniMax-M3-MXFP8 (~427 GB weights), the lowest precision
  available; BF16 (~854 GB) cannot fit 8x H100 and is a tight fit on H200
- --block-size 128 is mandatory (MSA sparse attention block size)
- minimax_m3 tool-call/reasoning parsers, --language-model-only for the
  text-only fixed-seq-len scenarios
- Sweeps TP4/TP8, TP+EP (TEP) and DP-attention+EP (DEP) on H200;
  H100 is TP8-only since MXFP8 weights need ~56 GB of each 80 GB GPU

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Adopt the minimaxm3_fp8_b200.sh structure from feat/minimax-m3-dayzero:
pow2 cudagraph capture sized to CONC (ragged DP arrival makes the full
CONC bound safer than a per-rank max-num-seqs cap), ISL*2 batched
tokens, and no chat parsers (benchmark drives /v1/completions).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
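
The sizing rule in this commit message can be sketched as follows. This is an illustration only: the function names are hypothetical, and rounding the last capture size up to the next power of two when CONC is not itself a power of two is an assumption, since the message only says "pow2 ... sized to CONC".

```python
from typing import List


def pow2_capture_sizes(conc: int) -> List[int]:
    """Powers of two from 1 up to the first one that is >= conc.

    Assumption: a non-power-of-two CONC is covered by rounding up.
    """
    if conc < 1:
        raise ValueError("conc must be >= 1")
    sizes = []
    size = 1
    while size < conc:
        sizes.append(size)
        size *= 2
    sizes.append(size)  # first power of two that covers conc
    return sizes


def max_batched_tokens(isl: int) -> int:
    """Batched-token budget of twice the input sequence length."""
    return isl * 2


if __name__ == "__main__":
    print(pow2_capture_sizes(64))
    print(max_batched_tokens(8192))
```
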
Smoke run 27439940008: h100 tp8 died in vllm load_weights ->
snapshot_download -> WeakFileLock with OSError [Errno 116] Stale file
handle — sibling nodes concurrently first-downloading the 444 GB
checkpoint into the shared network-FS HF cache race on lock-file
deletion. Retry hf download (resumable), then launch the server with
HF_HUB_OFFLINE=1 so it reads the now-complete cache without taking
hub locks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
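
The download-then-serve-offline pattern described in this commit can be sketched as below. It is not the script from the PR: `retry`, `download_checkpoint` and `offline_env` are hypothetical helpers, and the attempt count and delay are placeholder values. The only external pieces relied on are the `hf download` CLI and the `HF_HUB_OFFLINE` environment variable, both named in the commit message.

```python
import subprocess
import time
from typing import Callable, Dict, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 5,
    delay_s: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError, subprocess.CalledProcessError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `attempts` is exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay_s)
    raise AssertionError("unreachable")


def download_checkpoint(repo_id: str) -> None:
    """Run `hf download`; a re-run resumes rather than restarting."""
    subprocess.run(["hf", "download", repo_id], check=True)


def offline_env(base: Dict[str, str]) -> Dict[str, str]:
    """Copy of `base` with HF_HUB_OFFLINE=1, so the server reads the cache only."""
    return {**base, "HF_HUB_OFFLINE": "1"}


# Intended use (not executed here):
#   retry(lambda: download_checkpoint("MiniMaxAI/MiniMax-M3-MXFP8"))
#   subprocess.Popen([...vllm serve argv...], env=offline_env(dict(os.environ)))
```

Separating the download from the serve step means the lock race can only hit the resumable download, where a retry is cheap.
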
Sweep 27441767143 h100 tp8/ep8/dp-attn conc-256 failed KV-cache init
("No available memory for the cache blocks"): each DP rank replicates
~20 GB of BF16-dequantized attention/dense/embedding weights next to
its ~52 GB expert shard, so the 0.90 gmu budget (72 GB) is consumed by
weights + conc-sized cudagraphs (~2.3 GiB) before a single KV block
fits. DP-attention path now uses gmu 0.94 and caps decode-graph
capture at 2x the per-rank batch share (CONC/DP) instead of full CONC.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
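
The memory argument in this commit is simple arithmetic, sketched below with the approximate figures from the message. Reading "gmu" as vLLM's `--gpu-memory-utilization` is an assumption, as are the function names and the ceiling division used for the per-rank share. Real allocator behavior will differ.

```python
GIB_IN_GB = 1.073741824  # 1 GiB = 2**30 bytes, expressed in GB (10**9 bytes)


def kv_headroom_gb(hbm_gb: float, gmu: float, weights_gb: float, graphs_gib: float) -> float:
    """Memory left for KV blocks after weights and CUDA graphs, in GB."""
    return hbm_gb * gmu - weights_gb - graphs_gib * GIB_IN_GB


def dep_capture_cap(conc: int, dp: int) -> int:
    """Decode-graph capture bound: 2x the per-rank share, never above conc.

    Assumption: the per-rank share CONC/DP is rounded up.
    """
    per_rank = -(-conc // dp)  # ceiling division
    return min(conc, 2 * per_rank)


if __name__ == "__main__":
    weights = 20 + 52  # replicated attention/dense/embedding + expert shard, GB
    for gmu in (0.90, 0.94):
        print(f"gmu={gmu}: headroom {kv_headroom_gb(80, gmu, weights, 2.3):+.2f} GB")
    print("capture cap at conc 256, DP 8:", dep_capture_cap(256, 8))
```

With these inputs the 0.90 budget comes out negative, consistent with the reported KV-cache init failure. The sketch only shows why 0.90 leaves nothing; it does not establish that 0.94 is sufficient in practice.
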
Both DEP points failed in sweep 27441767143 with "No available memory
for the cache blocks": per-rank replicated attention weights leave no
KV headroom on 80 GB at high concurrency. TEP8 covers the
high-concurrency regime on H100; DEP remains in the H200 sweep where
it passes through conc 1024.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Oseltamivir

Collaborator Author

Clarification on the final state of the pre-PR full-sweep dispatch 27441767143, whose run conclusion shows "cancelled":

  • 53 jobs green including all gsm8k evals on both SKUs
  • 2 failures: H100 DEP (1k1k conc 256/512) failed KV-cache init; those search-space points are removed in this PR (see 3201938)
  • 1 cancelled: H200 1k1k tp8 conc-128 — preempted on the runner near the end (likely by the PR-label sweep claiming the H200 pool); the config itself passed at conc 4–64 and the missing point is produced by the post-merge full sweep

🤖 Generated with Claude Code

@functionstackx

Collaborator

/reuse-sweep-run

@functionstackx functionstackx merged commit 1ef98e3 into main Jun 13, 2026
15 of 22 checks passed
@functionstackx functionstackx deleted the feat/minimax-m3-hopper-dayzero branch June 13, 2026 02:23

functionstackx added a commit that referenced this pull request Jun 13, 2026
…) recipes (#1739)

* minimaxm3 H200+H100 MTP: day-zero MiniMax-M3 EAGLE3 recipes

Adds the spec-decoding=mtp siblings of the day-zero H200/H100 recipes
(PR #1731): same MXFP8 target and serve shape, plus the
Inferact/MiniMax-M3-EAGLE3 draft head via --speculative-config (method
eagle3, 3 speculative tokens). The drafter is pinned to FLASH_ATTN —
the EAGLE3 head is MHA and FlashInfer only supports the mandatory page
size 128 through its GQA-only trtllm-gen kernel (the failure hit on the
B300 MTP canary). Cudagraph capture scales to CONC * (1 + spec tokens);
benchmark prompts run through the chat template so acceptance reflects
real text. Search spaces mirror the non-MTP entries trimmed at the
extreme-concurrency end (dsv4 / minimaxm3 b300-mtp precedent); H100
stays TP8-only with DEP omitted.
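
A small sketch of the two MTP-specific pieces described above: the JSON passed to `--speculative-config` and the enlarged capture bound. The helper names are hypothetical, and the key names (`method`, `model`, `num_speculative_tokens`) are assumed to match vLLM's speculative-decoding config. The drafter's attention-backend pin is not shown because the commit message does not say how it is set.

```python
import json


def speculative_config(draft_model: str, num_tokens: int) -> str:
    """JSON string for --speculative-config with an EAGLE3 draft head.

    Assumption: key names follow vLLM's speculative-decoding config.
    """
    return json.dumps(
        {
            "method": "eagle3",
            "model": draft_model,
            "num_speculative_tokens": num_tokens,
        }
    )


def mtp_capture_bound(conc: int, spec_tokens: int) -> int:
    """Each in-flight request carries 1 target token plus the draft tokens."""
    return conc * (1 + spec_tokens)


if __name__ == "__main__":
    print(speculative_config("Inferact/MiniMax-M3-EAGLE3", 3))
    print(mtp_capture_bound(64, 3))
```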

Also adds SPEC_SUFFIX to the three H100 launchers (cw, cr,
dgxc-slurm), which hardcoded _h100.sh and never gained the _mtp
routing the H200 launchers have carried since #392 — without this, an
mtp config on H100 silently runs the non-MTP script. This also fixes
the same latent bug for the existing qwen3.5-fp8-h100-sglang-mtp
config.
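
The routing bug described above comes down to how the launcher builds the script name. The sketch below illustrates the idea only: the real launchers are shell scripts, the helper names are hypothetical, and both the `_mtp` file-name suffix and the `"none"` value for non-speculative configs are assumptions.

```python
def spec_suffix(spec_decoding: str) -> str:
    """Suffix selecting the MTP variant of a benchmark script.

    Assumption: MTP scripts are named with a trailing "_mtp".
    """
    return "_mtp" if spec_decoding == "mtp" else ""


def benchmark_script(prefix: str, sku: str, spec_decoding: str) -> str:
    """Script file name for a config, e.g. prefix 'minimaxm3_fp8', sku 'h100'."""
    return f"{prefix}_{sku}{spec_suffix(spec_decoding)}.sh"


if __name__ == "__main__":
    # A launcher that hardcodes "_h100.sh" behaves like spec_decoding="none"
    # for every config, so MTP configs silently run the non-MTP script.
    print(benchmark_script("minimaxm3_fp8", "h100", "none"))
    print(benchmark_script("minimaxm3_fp8", "h100", "mtp"))
```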

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* perf-changelog: fill in PR link for minimaxm3 H200/H100 MTP

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>