[Klaud Cold] minimaxm3-fp8-b200-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) B200 recipe by functionstackx · Pull Request #1741 · SemiAnalysisAI/InferenceX

functionstackx · 2026-06-13T04:14:35Z

Summary

Adds the EAGLE3 speculative-decoding (spec-decoding: mtp) sibling of minimaxm3-fp8-b200-vllm: MiniMax-M3 MXFP8 on B200 single-node vLLM, pairing MiniMaxAI/MiniMax-M3-MXFP8 with the Inferact/MiniMax-M3-EAGLE3 draft head. The recipe is based on the B200 non-MTP recipe and mirrors the merged B300 MTP entry (#1733).

New benchmark script

benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200_mtp.sh, based on minimaxm3_fp8_b200.sh (mandatory --block-size 128 for the MSA sparse/index cache, --language-model-only, scenario-trimmed MAX_MODEL_LEN). Additions for MTP:

--speculative-config '{"method": "eagle3", "model": <draft>, "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'.
Drafter pinned to FLASH_ATTN: the EAGLE3 head is MHA, and FlashInfer only supports page size 128 (mandatory here) through its GQA-only trtllm-gen kernel — engine init dies in FlashInferMetadataBuilder otherwise. This exact failure hit the B300 MTP canary; FLASH_ATTN takes any multiple-of-16 block size and handles MHA.
Draft download follows the B200 model handling: fetched next to the pre-staged target on b200-dgxc (the /lustre/fsw/gharunners tree is writable), or into the HF cache on bare-HF-id runners (b200-cw / b200-nb).
Cudagraph capture scaled to CONC * (1 + spec tokens), capped at 2048.
--use-chat-template on the benchmark client so draft acceptance reflects real text rather than random tokens.
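The MTP additions above can be sketched in a few lines. This is a minimal illustration, not the contents of minimaxm3_fp8_b200_mtp.sh: the helper names (`speculative_config`, `max_capture_size`, `serve_flags`) are hypothetical, and the flag that actually carries the cudagraph capture size is deliberately left out because the PR text does not name it. The values (block size 128, 3 speculative tokens, cap of 2048, FLASH_ATTN for the drafter) are taken from the description.

```python
import json

BLOCK_SIZE = 128        # mandatory for the MSA sparse/index cache (from the PR text)
NUM_SPEC_TOKENS = 3     # num_speculative_tokens (from the PR text)
CAPTURE_CAP = 2048      # upper bound on cudagraph capture (from the PR text)


def speculative_config(draft_model: str, num_spec_tokens: int = NUM_SPEC_TOKENS) -> str:
    """Return the JSON string handed to --speculative-config."""
    return json.dumps({
        "method": "eagle3",
        "model": draft_model,
        "num_speculative_tokens": num_spec_tokens,
        # Drafter pinned to FLASH_ATTN: the EAGLE3 head is MHA, and the PR
        # reports that FlashInfer only handles page size 128 via a GQA-only kernel.
        "attention_backend": "FLASH_ATTN",
    })


def max_capture_size(conc: int, num_spec_tokens: int = NUM_SPEC_TOKENS,
                     cap: int = CAPTURE_CAP) -> int:
    """Cudagraph capture size: CONC * (1 + spec tokens), capped."""
    return min(conc * (1 + num_spec_tokens), cap)


def serve_flags(draft_model: str) -> list[str]:
    """Hypothetical helper: only flags that appear in the PR description."""
    # The PR states FLASH_ATTN accepts any multiple-of-16 block size.
    if BLOCK_SIZE % 16 != 0:
        raise ValueError("block size must be a multiple of 16 for FLASH_ATTN")
    return [
        "--block-size", str(BLOCK_SIZE),
        "--language-model-only",
        "--speculative-config", speculative_config(draft_model),
    ]


flags = serve_flags("Inferact/MiniMax-M3-EAGLE3")
```

At concurrency 64 the capture size is 64 × 4 = 256; the cap only binds from concurrency 512 upward, where 512 × 4 reaches 2048.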

Config (`nvidia-master.yaml`)

Adds minimaxm3-fp8-b200-vllm-mtp with the same vllm/vllm-openai:minimax-m3 image and b200-dgxc runner as the non-MTP entry. The search space mirrors the non-MTP entry but is trimmed at the extreme-concurrency end, which makes it identical to minimaxm3-fp8-b300-vllm-mtp:

1k1k: TP8 (1–64), TP8+EP8 (1–256), TP4 (1–64), TP4+EP4 (64–256), TP8+EP8 dp-attn (256–512)
8k1k: TP8 (1–64), TP8+EP8 (1–256), TP4 (1–64), TP8+EP8 dp-attn (128–256)
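As a rough sanity check, the search space above can be expanded into individual benchmark points. The sketch below is not generate_sweep_configs.py. It assumes concurrency is swept in powers of two within each range, and that max-model-len is input length + output length + 256 with 1k = 1024 tokens. Both assumptions are inferred, not stated in the PR, though they happen to reproduce the 53 configs and the 2304 / 9472 limits reported under Validation.

```python
# (scenario, parallelism label, min concurrency, max concurrency), copied from the PR text.
SEARCH_SPACE = [
    ("1k1k", "TP8", 1, 64),
    ("1k1k", "TP8+EP8", 1, 256),
    ("1k1k", "TP4", 1, 64),
    ("1k1k", "TP4+EP4", 64, 256),
    ("1k1k", "TP8+EP8 dp-attn", 256, 512),
    ("8k1k", "TP8", 1, 64),
    ("8k1k", "TP8+EP8", 1, 256),
    ("8k1k", "TP4", 1, 64),
    ("8k1k", "TP8+EP8 dp-attn", 128, 256),
]

# ASSUMPTION: 1k = 1024 tokens, 8k = 8192 tokens (input length, output length).
SCENARIOS = {"1k1k": (1024, 1024), "8k1k": (8192, 1024)}
MARGIN = 256  # ASSUMPTION: headroom inferred from the reported max-model-len values


def concurrency_points(lo: int, hi: int) -> list[int]:
    """ASSUMPTION: concurrency doubles from lo up to and including hi."""
    points = []
    c = lo
    while c <= hi:
        points.append(c)
        c *= 2
    return points


def expand(space):
    """One (scenario, parallelism, concurrency) tuple per benchmark point."""
    return [
        (scenario, parallelism, conc)
        for scenario, parallelism, lo, hi in space
        for conc in concurrency_points(lo, hi)
    ]


def max_model_len(scenario: str) -> int:
    """Scenario-trimmed context limit under the assumptions above."""
    isl, osl = SCENARIOS[scenario]
    return isl + osl + MARGIN


configs = expand(SEARCH_SPACE)
print(len(configs))                                   # 53
print(max_model_len("1k1k"), max_model_len("8k1k"))   # 2304 9472
```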

No launcher change needed — launch_b200-dgxc.sh already carries SPEC_SUFFIX and resolves minimaxm3_fp8_b200_mtp.sh.

perf-changelog

Entry for the new config key.

Validation

generate_sweep_configs.py test-config --config-keys minimaxm3-fp8-b200-vllm-mtp generates 53 configs cleanly (scenario-trimmed max-model-len 2304 / 9472, all spec-decoding=mtp on b200-dgxc).
bash -n passes on the new script.
Launcher script-name resolution simulated: falls back from _vllm_mtp to minimaxm3_fp8_b200_mtp.sh (exists).
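The fallback described in the last check can be illustrated as follows. This is a hypothetical model of the resolution order only: the real logic lives in launch_b200-dgxc.sh, which is not shown in this PR, and the function name, its parameters, and the candidate ordering are assumptions made for the example.

```python
def resolve_script(base: str, framework: str, spec_suffix: str, existing: set[str]) -> str:
    """Try the framework-qualified script name first, then the plain one.

    ASSUMPTION: the launcher prefers <base>_<framework><suffix>.sh and falls
    back to <base><suffix>.sh, as the Validation note suggests.
    """
    candidates = [
        f"{base}_{framework}{spec_suffix}.sh",
        f"{base}{spec_suffix}.sh",
    ]
    for name in candidates:
        if name in existing:
            return name
    raise FileNotFoundError(f"none of {candidates} exist")


# Scripts known to exist, per the PR description.
scripts = {"minimaxm3_fp8_b200.sh", "minimaxm3_fp8_b200_mtp.sh"}

# SPEC_SUFFIX is "_mtp" for the speculative-decoding variant.
print(resolve_script("minimaxm3_fp8_b200", "vllm", "_mtp", scripts))
# minimaxm3_fp8_b200_mtp.sh
```

With an empty suffix the same lookup returns the non-MTP script, which is why no launcher change is needed under this model.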

🤖 Generated with Claude Code

Note

Low Risk
Benchmark-only additions (YAML config, shell script, changelog) with no changes to runtime application or security-sensitive paths.

Overview
Adds the EAGLE3 speculative-decoding (spec-decoding: mtp) benchmark for MiniMax-M3 MXFP8 on B200, pairing the target MiniMaxAI/MiniMax-M3-MXFP8 with draft Inferact/MiniMax-M3-EAGLE3 (3 speculative tokens).

New fixed-seq-len script minimaxm3_fp8_b200_mtp.sh extends the non-MTP B200 serve shape with --speculative-config (eagle3) and sets the drafter attention_backend to FLASH_ATTN, since the draft head is MHA and FlashInfer is limited to a GQA-only kernel at block size 128. Draft weights are downloaded beside the pre-staged target on dgxc, or into the HF cache otherwise. Cudagraph capture is scaled to concurrency × (1 + spec tokens), and the serving benchmark client runs with --use-chat-template.

nvidia-master.yaml gains minimaxm3-fp8-b200-vllm-mtp with the same image/runner as the base B200 entry and a search space trimmed at high concurrency (aligned with the existing B300 MTP config). perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 1a66768. Bugbot is set up for automated code reviews on this repo.

Adds the spec-decoding=mtp sibling of minimaxm3-fp8-b200-vllm: same MXFP8 target and serve shape, plus the Inferact/MiniMax-M3-EAGLE3 draft head via --speculative-config (method eagle3, 3 speculative tokens). The drafter is pinned to FLASH_ATTN because the EAGLE3 head is MHA and FlashInfer only supports the mandatory page size 128 through its GQA-only trtllm-gen kernel (the failure that hit the B300 MTP canary). Cudagraph capture scales to CONC * (1 + spec tokens); benchmark prompts run through the chat template so acceptance reflects real text. The EAGLE3 draft is fetched next to the pre-staged target on b200-dgxc (gharunners tree is writable) or into the HF cache on bare-HF-id runners. Search space mirrors the non-MTP entry trimmed at the extreme-concurrency end, identical to the minimaxm3-fp8-b300-vllm-mtp precedent. Launcher needs no change: launch_b200-dgxc.sh already carries SPEC_SUFFIX and resolves the _mtp script.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-13T06:04:00Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27456333115
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27456333115

functionstackx · 2026-06-13T06:05:42Z

/reuse-sweep-run

github-actions · 2026-06-13T06:06:27Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27458619687
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27458619687

functionstackx requested a review from a team June 13, 2026 04:14

functionstackx requested review from jgangani and kedarpotdar-nv as code owners June 13, 2026 04:14

github-project-automation Bot added this to InferenceMAX Board Jun 13, 2026

functionstackx added the full-sweep-enabled label Jun 13, 2026

functionstackx and others added 2 commits June 13, 2026 00:18

perf-changelog: fill in PR link for minimaxm3-fp8-b200-vllm-mtp

2a04b15

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

functionstackx force-pushed the feat/minimax-m3-b200-mtp-dayzero branch from e31ffdf to 2a04b15 Compare June 13, 2026 04:18

functionstackx mentioned this pull request Jun 13, 2026

[NVIDIA] feat: MiniMax M3 Day 0 MTP (EAGLE3) support B200 #1736

Closed

Merge branch 'main' into feat/minimax-m3-b200-mtp-dayzero

1a66768

functionstackx merged commit 8f1f213 into main Jun 13, 2026
4 of 6 checks passed

functionstackx deleted the feat/minimax-m3-b200-mtp-dayzero branch June 13, 2026 06:06

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 13, 2026

functionstackx mentioned this pull request Jun 13, 2026

[Klaud Cold] minimaxm3-fp8-mi355x-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) MI355X recipe #1742

Closed

functionstackx merged 3 commits into main from feat/minimax-m3-b200-mtp-dayzero
