[Klaud Cold][NVIDIA] feat: MiniMax M3 Day 0 support H200 #1728
functionstackx wants to merge 3 commits into
Day-zero single-node vLLM recipe for MiniMaxAI/MiniMax-M3-MXFP8 on H200, following https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3. Uses the dedicated vllm/vllm-openai:minimax-m3 image (M3 has not shipped in a stable vLLM release). Sweeps TP4/TP8, TP+EP, and DP-attention+EP at 1k1k and 8k1k. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring the re-run passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
See the unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27443411976
See the unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27443433612
Summary
Day-zero single-node vLLM recipe for MiniMaxAI/MiniMax-M3-MXFP8 on H200, following the official vLLM recipe (https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3). Sibling of the B200 (#1723) / B300 (#1724) / MI355X (#1725) day-zero PRs.
- `minimaxm3-fp8-h200-vllm` (runner pool: `h200`)
- `benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_h200.sh`
- `perf-changelog.yaml` entry
Model & image
- `MiniMaxAI/MiniMax-M3-MXFP8`: 427B total / 26B active MoE with MSA sparse attention, NVIDIA-quantized MXFP8 (~427 GB weights, roughly half of BF16). Verified on HF.
- `vllm/vllm-openai:minimax-m3`: dedicated day-zero image (M3 support has not shipped in a stable vLLM release). Tag verified on Docker Hub (amd64 + arm64, pushed 2026-06-12).
Recipe details (per recipes.vllm.ai + repo conventions)
- `--block-size 128` is mandatory: MSA `sparse_block_size` is 128; the default 16 misaligns sparse indexing.
- `--language-model-only`: the benchmark is text-only, so skipping the vision encoder frees VRAM for KV cache.
- `ep > 1` → TP+EP (`--enable-expert-parallel`); `dp-attn: true` → the recipe's "DP8 + Expert Parallel" mode (`--data-parallel-size 8 --enable-expert-parallel`).
- `MAX_MODEL_LEN = isl + osl + 256` (not the model's full 1M context), and CUDA graph capture is bounded at the next power of two ≥ `CONC` (`--max-cudagraph-capture-size`), matching the other day-zero M3 recipes.
- `VLLM_ENGINE_READY_TIMEOUT_S=3600` (~444 GB weights off shared FS can exceed the default 600 s readiness window), plus a retrying `hf download` followed by an `HF_HUB_OFFLINE=1` serve to dodge the shared-FS WeakFileLock race on day-zero concurrent downloads.
Sweep space
Concurrency/parallelism chosen from the official recipe's serve modes plus existing H200 large-MoE configs (`dsv4-fp8-h200-vllm`, `minimaxm2.5-fp8-h200-vllm`). On 8x H200 (1128 GB), TP8 leaves ~70 GB/GPU of KV headroom; TP4 (~112 GB weights/GPU) is memory-tight and only swept at low/mid concurrency.
Validated with `generate_sweep_configs.py test-config` → 39 sweep points, including eval entries.
🤖 Generated with Claude Code
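The per-GPU memory figures quoted in the sweep-space notes can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is illustrative only: `per_gpu_budget` is a hypothetical helper, the 0.9 factor assumes vLLM's default `--gpu-memory-utilization` (the PR does not state which value the script uses), and the estimate ignores activations, CUDA graph memory, and uneven sharding.

```python
H200_HBM_GB = 141      # per-GPU HBM on H200; 8 x 141 = 1128 GB, as quoted in the PR
CHECKPOINT_GB = 444    # approximate checkpoint size quoted in the recipe details
GPU_MEM_UTIL = 0.9     # ASSUMPTION: vLLM's default --gpu-memory-utilization


def per_gpu_budget(tp: int) -> dict:
    """Estimate per-GPU weight footprint and leftover memory for a TP degree.

    Assumes weights shard evenly across `tp` GPUs, which is a simplification.
    """
    weights_gb = CHECKPOINT_GB / tp
    usable_gb = H200_HBM_GB * GPU_MEM_UTIL
    return {
        "weights_gb": round(weights_gb, 1),
        "headroom_gb": round(usable_gb - weights_gb, 1),
    }


for tp in (8, 4):
    print(f"TP{tp}: {per_gpu_budget(tp)}")
```

Under these assumptions TP8 comes out at about 55.5 GB of weights and about 71 GB of headroom per GPU, while TP4 comes out at about 111 GB of weights and about 16 GB of headroom, which is consistent with TP4 being swept only at low/mid concurrency.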
Note
Low Risk
Additive benchmark harness and CI config only; no changes to production serving, auth, or application runtime paths.
Overview
Adds day-zero single-node benchmarking for MiniMax-M3 MXFP8 on H200 via vLLM, aligned with the official vLLM recipe.
A new config key `minimaxm3-fp8-h200-vllm` in `nvidia-master.yaml` points at `vllm/vllm-openai:minimax-m3` and `MiniMaxAI/MiniMax-M3-MXFP8`, with fixed-seq-len sweeps at 1k/1k and 8k/1k over TP4/TP8, TP+EP, and DP-attention + EP concurrency ranges tuned for H200 memory headroom.
The companion script `minimaxm3_fp8_h200.sh` implements serve flags required for M3 (`--block-size 128`, `--language-model-only`), maps `dp-attn` / EP to the right vLLM parallel args, bounds CUDA graph capture to concurrency, retries large `hf download` on shared FS lock races, and extends engine readiness timeout for the ~444 GB checkpoint.
`perf-changelog.yaml` documents the new config.
Reviewed by Cursor Bugbot for commit 086f643. Bugbot is set up for automated code reviews on this repo.
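As a rough illustration of the flag mapping summarized above, the following Python sketch derives the serve arguments from one sweep point. It is not the PR's shell script: `next_pow2` and `build_serve_args` are hypothetical names, `--max-model-len` and `--tensor-parallel-size` are standard vLLM flags that the PR text does not quote directly, and because the PR names only `--data-parallel-size 8 --enable-expert-parallel` for the DP-attention mode, the sketch emits no tensor-parallel flag in that branch.

```python
from __future__ import annotations


def next_pow2(n: int) -> int:
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def build_serve_args(isl: int, osl: int, conc: int, tp: int,
                     ep: int = 1, dp_attn: bool = False) -> list[str]:
    """Build a vLLM serve argument list for one sweep point (illustrative)."""
    args = [
        "--block-size", "128",            # MSA sparse_block_size is 128
        "--language-model-only",          # text-only benchmark, skip vision encoder
        "--max-model-len", str(isl + osl + 256),
        "--max-cudagraph-capture-size", str(next_pow2(conc)),
    ]
    if dp_attn:
        # The PR's "DP8 + Expert Parallel" mode; only these flags are named there.
        args += ["--data-parallel-size", "8", "--enable-expert-parallel"]
    else:
        args += ["--tensor-parallel-size", str(tp)]
        if ep > 1:
            args.append("--enable-expert-parallel")
    return args


# Usage: a 1k/1k point at concurrency 48 with TP8 + EP (assuming "1k" means 1024 tokens).
print(" ".join(build_serve_args(isl=1024, osl=1024, conc=48, tp=8, ep=8)))
```

Rounding the capture size up to a power of two keeps the number of captured graph sizes small while still covering the requested concurrency; the `+ 256` margin on the context length follows the formula given in the PR description.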