Day-zero MiniMax-M3 MXFP8 single-node recipes for H200/H100 (vLLM) #1731
Adds minimaxm3-fp8-h200-vllm and minimaxm3-fp8-h100-vllm to nvidia-master.yaml plus the matching fixed-seq-len benchmark scripts, following https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3:
- Dedicated day-zero image vllm/vllm-openai:minimax-m3 (M3 has not shipped in a stable vLLM release; v0.23.0rc2 has no minimax_m3)
- MiniMaxAI/MiniMax-M3-MXFP8 (~427 GB weights), the lowest precision available; BF16 (~854 GB) cannot fit 8x H100 and is a tight fit on H200
- --block-size 128 is mandatory (MSA sparse attention block size)
- minimax_m3 tool-call/reasoning parsers, --language-model-only for the text-only fixed-seq-len scenarios
- Sweeps TP4/TP8, TP+EP (TEP) and DP-attention+EP (DEP) on H200; H100 is TP8-only since MXFP8 weights need ~56 GB of each 80 GB GPU
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
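The mapping from parallelism mode to serve flags described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual script: the helper names `parallel_flags` and `serve_command` and the mode labels `tp`/`tep`/`dep` are hypothetical, the DEP mapping follows the PR description (dp-attn: true → --data-parallel-size 8 --enable-expert-parallel), and `--language-model-only` is taken from the PR text rather than from a stable vLLM release. The command is printed rather than executed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a parallelism label to vLLM serve flags.
parallel_flags() {
  local mode=$1 gpus=$2
  case "$mode" in
    tp)  echo "--tensor-parallel-size $gpus" ;;
    tep) echo "--tensor-parallel-size $gpus --enable-expert-parallel" ;;
    dep) echo "--data-parallel-size $gpus --enable-expert-parallel" ;;
    *)   echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

# Print (not run) the serve command with the recipe-required flags.
serve_command() {
  local mode=$1 gpus=$2 flags
  flags=$(parallel_flags "$mode" "$gpus") || return 1
  echo "vllm serve MiniMaxAI/MiniMax-M3-MXFP8 $flags --block-size 128 --language-model-only"
}
```

For example, `serve_command dep 8` prints the DEP variant swept on H200, while `serve_command tp 8` prints the TP8 layout used on H100.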
Adopt the minimaxm3_fp8_b200.sh structure from feat/minimax-m3-dayzero: pow2 cudagraph capture sized to CONC (ragged DP arrival makes the full CONC bound safer than a per-rank max-num-seqs cap), ISL*2 batched tokens, and no chat parsers (benchmark drives /v1/completions). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
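A rough sketch of the sizing rules this commit describes, assuming the capture list is the powers of two below CONC plus CONC itself (the commit does not say how non-power-of-two concurrencies are handled, so appending CONC is an assumption). The function names are hypothetical, and how the resulting values are handed to vLLM is left out because the commit does not state it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: comma-separated power-of-two capture sizes up to CONC.
pow2_capture_sizes() {
  local conc=$1 size=1 sizes=()
  while (( size < conc )); do
    sizes+=("$size")
    size=$(( size * 2 ))
  done
  sizes+=("$conc")   # assumption: always include the full CONC bound
  local IFS=,
  echo "${sizes[*]}"
}

# Batched-token budget of twice the input sequence length (ISL*2).
max_batched_tokens() {
  local isl=$1
  echo $(( isl * 2 ))
}
```

With CONC=256 and ISL=8192 this yields capture sizes 1 through 256 in powers of two and a batched-token budget of 16384.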
Smoke run 27439940008: h100 tp8 died in vllm load_weights -> snapshot_download -> WeakFileLock with OSError [Errno 116] Stale file handle — sibling nodes concurrently first-downloading the 444 GB checkpoint into the shared network-FS HF cache race on lock-file deletion. Retry hf download (resumable), then launch the server with HF_HUB_OFFLINE=1 so it reads the now-complete cache without taking hub locks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
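The retry-then-offline pattern from this commit might look something like the sketch below. The `retry` helper, the attempt count, and the delay are assumptions for illustration; only `hf download` and `HF_HUB_OFFLINE=1` come from the commit message. The download and serve commands appear as comments so the block is safe to run as written.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY
# seconds between failures. Returns 0 on the first success, 1 otherwise.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local n
  for (( n = 1; n <= attempts; n++ )); do
    if "$@"; then
      return 0
    fi
    echo "attempt $n/$attempts failed: $*" >&2
    if (( n < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage sketch (not executed here; attempt count and delay are illustrative):
#   retry 5 30 hf download MiniMaxAI/MiniMax-M3-MXFP8
#   HF_HUB_OFFLINE=1 vllm serve MiniMaxAI/MiniMax-M3-MXFP8 --block-size 128
```

Because the download is resumable, a retry after a stale-handle error picks up where the previous attempt stopped, and the offline server then reads the completed cache without touching hub locks.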
Sweep 27441767143 h100 tp8/ep8/dp-attn conc-256 failed KV-cache init
("No available memory for the cache blocks"): each DP rank replicates
~20 GB of BF16-dequantized attention/dense/embedding weights next to
its ~52 GB expert shard, so the 0.90 gmu budget (72 GB) is consumed by
weights + conc-sized cudagraphs (~2.3 GiB) before a single KV block
fits. DP-attention path now uses gmu 0.94 and caps decode-graph
capture at 2x the per-rank batch share (CONC/DP) instead of full CONC.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
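The arithmetic behind this failure can be checked with a small sketch. It treats the commit's approximate figures (80 GB per GPU, ~20 GB replicated weights, ~52 GB expert shard) as exact, which is an assumption, and works in tenths of a GB to stay in integer arithmetic. The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Per-GPU memory budget, in tenths of a GB, for an 80 GB card at a given
# gpu-memory-utilization percentage (e.g. 90 for 0.90).
gmu_budget_tenths() {
  local gmu_pct=$1
  echo $(( 800 * gmu_pct / 100 ))
}

# Budget left for cudagraphs and KV cache after weights, in tenths of a GB.
# 200 = ~20 GB replicated attention/dense/embedding weights per DP rank,
# 520 = ~52 GB expert shard (both figures from the commit message).
post_weight_headroom_tenths() {
  local gmu_pct=$1 replicated=200 expert_shard=520
  echo $(( $(gmu_budget_tenths "$gmu_pct") - replicated - expert_shard ))
}

# Decode-graph capture cap for the DP-attention path: 2x the per-rank share.
dep_capture_cap() {
  local conc=$1 dp=$2
  echo $(( 2 * (conc / dp) ))
}
```

At 0.90 the headroom comes out at 0, so the ~2.3 GiB of CONC-sized cudagraphs already overshoots the budget; at 0.94 it is 32 tenths (3.2 GB). For CONC=256 and DP=8 the capture cap is 64 instead of 256.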
Both DEP points failed in sweep 27441767143 with "No available memory for the cache blocks": per-rank replicated attention weights leave no KV headroom on 80 GB at high concurrency. TEP8 covers the high-concurrency regime on H100; DEP remains in the H200 sweep where it passes through conc 1024. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Final state of the pre-PR full-sweep dispatch 27441767143 (run conclusion shows "cancelled"; clarification):
🤖 Generated with Claude Code
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27447107640
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27447149961
/reuse-sweep-run
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27453771654
…) recipes (#1739)
* minimaxm3 H200+H100 MTP: day-zero MiniMax-M3 EAGLE3 recipes
Adds the spec-decoding=mtp siblings of the day-zero H200/H100 recipes (PR #1731): same MXFP8 target and serve shape, plus the Inferact/MiniMax-M3-EAGLE3 draft head via --speculative-config (method eagle3, 3 speculative tokens). The drafter is pinned to FLASH_ATTN: the EAGLE3 head is MHA, and FlashInfer only supports the mandatory page size 128 through its GQA-only trtllm-gen kernel (the failure hit on the B300 MTP canary). Cudagraph capture scales to CONC * (1 + spec tokens); benchmark prompts run through the chat template so acceptance reflects real text. Search spaces mirror the non-MTP entries, trimmed at the extreme-concurrency end (dsv4 / minimaxm3 b300-mtp precedent); H100 stays TP8-only with DEP omitted.
Also adds SPEC_SUFFIX to the three H100 launchers (cw, cr, dgxc-slurm), which hardcoded _h100.sh and never gained the _mtp routing the H200 launchers have carried since #392. Without this, an mtp config on H100 silently runs the non-MTP script. This also fixes the same latent bug for the existing qwen3.5-fp8-h100-sglang-mtp config.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* perf-changelog: fill in PR link for minimaxm3 H200/H100 MTP
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
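The two mechanical pieces of this commit, the script routing and the capture bound, can be sketched as below. This is an illustration under assumptions: the helper names are hypothetical, the `<base>_<sku>_mtp.sh` naming is inferred from the commit's mention of `_h100.sh` and `_mtp` routing, and the JSON keys in `SPEC_CONFIG` are the usual vLLM `--speculative-config` fields rather than a copy of the PR's script.

```bash
#!/usr/bin/env bash
# Hypothetical routing helper: pick the benchmark script for a SKU, adding
# the _mtp suffix when the config asks for spec-decoding=mtp.
script_for() {
  local base=$1 sku=$2 spec_decoding=${3:-none}
  local SPEC_SUFFIX=""
  if [ "$spec_decoding" = "mtp" ]; then
    SPEC_SUFFIX="_mtp"
  fi
  echo "${base}_${sku}${SPEC_SUFFIX}.sh"
}

# Upper bound for cudagraph capture with speculative decoding:
# CONC * (1 + spec tokens).
mtp_capture_bound() {
  local conc=$1 spec_tokens=$2
  echo $(( conc * (1 + spec_tokens) ))
}

# Assumed shape of the --speculative-config argument (eagle3, 3 tokens).
SPEC_CONFIG='{"method":"eagle3","model":"Inferact/MiniMax-M3-EAGLE3","num_speculative_tokens":3}'
```

A launcher that hardcodes `_h100.sh` behaves like `script_for minimaxm3_fp8 h100` regardless of the config, which is the silent fallback the commit removes. With 3 speculative tokens, a concurrency of 64 gives a capture bound of 256.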
Summary
Adds day-zero MiniMax-M3 benchmark coverage for the Hopper SKUs, following the vLLM recipe. Complements the sibling day-zero branches (feat/minimax-m3-dayzero → B200/B300/MI355X single-node, feat/minimax-m3-day0 → GB200/GB300/MI300X multi-node).
- minimaxm3-fp8-h200-vllm, minimaxm3-fp8-h100-vllm in nvidia-master.yaml
- benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_{h200,h100}.sh (modeled on the sibling minimaxm3_fp8_b200.sh)
- vllm/vllm-openai:minimax-m3: the dedicated day-zero image built from vLLM's m3_release branch; M3 has not shipped in a stable release (v0.23.0rc2 has no minimax_m3)
- MiniMaxAI/MiniMax-M3-MXFP8 (~444 GB): lowest precision available. BF16 (~854 GB) cannot fit 8x H100 at all and is a tight fit on 8x H200. On Hopper (no native MX tensor cores) the MoE runs via the image's Marlin/DeepGEMM MXFP8 paths
- --block-size 128 (MSA sparse-attention block alignment); --language-model-only (text-only benchmark, frees vision-encoder VRAM)
- H200 = TP4 / TP8, TEP, and DEP (dp-attn: true → --data-parallel-size 8 --enable-expert-parallel); H100 = TP8 / TEP8 (weights need ~56 GB of each 80 GB GPU, so TP8-class only)
Validation
Full sweep 27441767143: 53/55 jobs green, including all gsm8k eval jobs on both SKUs.
Earlier 8-job smoke 27439940008 validated every parallelism mode at conc 4 (7/8; the one failure was a shared-HF-cache WeakFileLock stale-file-handle race during concurrent first-downloads of the 444 GB checkpoint, fixed by retrying hf download and serving with HF_HUB_OFFLINE=1).
Note
Low Risk
Benchmark-only additions (YAML sweep definitions and shell launchers); no changes to production serving paths or shared library logic.
Overview
Introduces day-zero MiniMax-M3 (MiniMaxAI/MiniMax-M3-MXFP8) single-node coverage on Hopper via vllm/vllm-openai:minimax-m3, aligned with the vLLM M3 recipe.
nvidia-master.yaml gains minimaxm3-fp8-h100-vllm and minimaxm3-fp8-h200-vllm with fixed-seq-len sweeps at 1k/1k and 8k/1k. H200 explores TP4/TP8, expert parallel (TEP), and DEP (dp-attn: true). H100 is constrained to TP8-class layouts only: DEP is excluded from the search matrix after OOM on KV init at high concurrency, and TEP8 carries the high-concurrency points.
New minimaxm3_fp8_h100.sh and minimaxm3_fp8_h200.sh wire up vLLM serve with recipe-required flags (--block-size 128, --language-model-only), parallel modes (TP vs TEP vs DEP), HF download retries plus offline serve to avoid cache lock races, extended engine ready timeout, and concurrency-scaled CUDA graph capture. The H100 script adds DEP-specific memory tuning even though DEP is not swept in YAML.
perf-changelog.yaml documents the two new config keys.
Reviewed by Cursor Bugbot for commit 27b1d41. Bugbot is set up for automated code reviews on this repo.