
[Klaud Cold] minimaxm3-fp8-b200-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) B200 recipe #1741

Merged
functionstackx merged 3 commits into main from feat/minimax-m3-b200-mtp-dayzero on Jun 13, 2026


@functionstackx functionstackx commented Jun 13, 2026


Summary

Adds the EAGLE3 speculative-decoding (spec-decoding: mtp) sibling of minimaxm3-fp8-b200-vllm: MiniMax-M3 MXFP8 on B200 single-node vLLM, pairing MiniMaxAI/MiniMax-M3-MXFP8 with the Inferact/MiniMax-M3-EAGLE3 draft head. Based on the B200 non-MTP recipe, mirroring the merged B300 MTP entry (#1733).

New benchmark script

benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200_mtp.sh, based on minimaxm3_fp8_b200.sh (mandatory --block-size 128 for the MSA sparse/index cache, --language-model-only, scenario-trimmed MAX_MODEL_LEN). Additions for MTP:

  • --speculative-config '{"method": "eagle3", "model": <draft>, "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'.
  • Drafter pinned to FLASH_ATTN: the EAGLE3 head uses MHA, and FlashInfer supports page size 128 (mandatory here) only through its GQA-only trtllm-gen kernel, so without the pin engine init fails in FlashInferMetadataBuilder. This is the exact failure that hit the B300 MTP canary. FLASH_ATTN accepts any block size that is a multiple of 16 and handles MHA.
  • Draft download follows the B200 model handling: fetched next to the pre-staged target on b200-dgxc (the /lustre/fsw/gharunners tree is writable), or into the HF cache on bare-HF-id runners (b200-cw / b200-nb).
  • Cudagraph capture scaled to CONC * (1 + spec tokens), capped at 2048.
  • --use-chat-template on the benchmark client so draft acceptance reflects real text rather than random tokens.
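
The first and fourth bullets can be sketched as two small helpers. This is an illustrative Python rendering, not the shell code in minimaxm3_fp8_b200_mtp.sh: the function names are hypothetical, and how the capture size is handed to vLLM is not stated in this PR, so the sketch only computes the value.

```python
import json


def build_speculative_config(draft_model: str, num_speculative_tokens: int = 3) -> str:
    """Return the JSON string passed to --speculative-config (keys as listed in the PR)."""
    config = {
        "method": "eagle3",
        "model": draft_model,
        "num_speculative_tokens": num_speculative_tokens,
        # Drafter pinned to FLASH_ATTN because the EAGLE3 head is MHA.
        "attention_backend": "FLASH_ATTN",
    }
    return json.dumps(config)


def cudagraph_capture_size(conc: int, num_speculative_tokens: int = 3, cap: int = 2048) -> int:
    """Scale capture to CONC * (1 + spec tokens), capped at 2048 as described above."""
    return min(conc * (1 + num_speculative_tokens), cap)


if __name__ == "__main__":
    print(build_speculative_config("Inferact/MiniMax-M3-EAGLE3"))
    for conc in (1, 64, 512):
        print(conc, cudagraph_capture_size(conc))
```

With 3 speculative tokens each running request can contribute up to 4 tokens per verification step, which is why the capture size grows by a factor of 4 until it reaches the cap.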

Config (nvidia-master.yaml)

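Adds minimaxm3-fp8-b200-vllm-mtp with the same vllm/vllm-openai:minimax-m3 image and b200-dgxc runner as the non-MTP entry. The search space mirrors the non-MTP entry but is trimmed at the extreme-concurrency end (identical to minimaxm3-fp8-b300-vllm-mtp); each item gives the parallelism layout followed by its concurrency range: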

  • 1k1k: TP8 (1–64), TP8+EP8 (1–256), TP4 (1–64), TP4+EP4 (64–256), TP8+EP8 dp-attn (256–512)
  • 8k1k: TP8 (1–64), TP8+EP8 (1–256), TP4 (1–64), TP8+EP8 dp-attn (128–256)
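
As a sanity check on the list above, the sweep size can be counted under one assumption: that concurrency doubles from the low end of each range to the high end. The PR does not state the step, so the doubling rule is a guess, although the resulting total agrees with the 53 configs reported under Validation. The names below are hypothetical and not taken from generate_sweep_configs.py.

```python
def doubling_range(lo: int, hi: int) -> list[int]:
    """Concurrency values lo, 2*lo, 4*lo, ... up to and including hi (assumed step)."""
    values = []
    conc = lo
    while conc <= hi:
        values.append(conc)
        conc *= 2
    return values


# (layout, min concurrency, max concurrency), copied from the two bullets above.
SEARCH_SPACE = {
    "1k1k": [
        ("TP8", 1, 64),
        ("TP8+EP8", 1, 256),
        ("TP4", 1, 64),
        ("TP4+EP4", 64, 256),
        ("TP8+EP8 dp-attn", 256, 512),
    ],
    "8k1k": [
        ("TP8", 1, 64),
        ("TP8+EP8", 1, 256),
        ("TP4", 1, 64),
        ("TP8+EP8 dp-attn", 128, 256),
    ],
}


def count_configs(space: dict) -> int:
    """Total number of (scenario, layout, concurrency) points in the sweep."""
    return sum(
        len(doubling_range(lo, hi))
        for entries in space.values()
        for _layout, lo, hi in entries
    )


if __name__ == "__main__":
    print(count_configs(SEARCH_SPACE))
```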

No launcher change needed — launch_b200-dgxc.sh already carries SPEC_SUFFIX and resolves minimaxm3_fp8_b200_mtp.sh.
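
A minimal sketch of the kind of name resolution this relies on, written in Python for illustration. The real logic lives in launch_b200-dgxc.sh and is not shown in this PR; the candidate order, the function name, and the `available` parameter are assumptions based only on the fallback described under Validation.

```python
def resolve_script_name(base: str, framework: str, spec_suffix: str, available: set[str]) -> str:
    """Pick the first candidate script name that exists.

    Assumed order: framework-qualified name first, then the plain name
    with only the speculative-decoding suffix appended.
    """
    candidates = [
        f"{base}_{framework}{spec_suffix}.sh",  # e.g. minimaxm3_fp8_b200_vllm_mtp.sh
        f"{base}{spec_suffix}.sh",              # e.g. minimaxm3_fp8_b200_mtp.sh
    ]
    for name in candidates:
        if name in available:
            return name
    raise FileNotFoundError(f"none of {candidates} found")


if __name__ == "__main__":
    scripts = {"minimaxm3_fp8_b200.sh", "minimaxm3_fp8_b200_mtp.sh"}
    print(resolve_script_name("minimaxm3_fp8_b200", "vllm", "_mtp", scripts))
```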

perf-changelog

Entry for the new config key.

Validation

  • generate_sweep_configs.py test-config --config-keys minimaxm3-fp8-b200-vllm-mtp generates 53 configs cleanly (scenario-trimmed max-model-len 2304 / 9472, all spec-decoding=mtp on b200-dgxc).
  • bash -n passes on the new script.
  • Simulated the launcher's script-name resolution: it falls back from the _vllm_mtp name to minimaxm3_fp8_b200_mtp.sh, which exists.
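
The two max-model-len values in the first bullet are consistent with input length plus output length plus 256 tokens of headroom, reading 1k as 1024 and 8k as 8192. That formula is inferred from the numbers alone, not taken from the script, so treat the sketch below as a plausibility check with a hypothetical function name.

```python
def trimmed_max_model_len(isl: int, osl: int, headroom: int = 256) -> int:
    """Assumed rule: input length + output length + a fixed headroom."""
    return isl + osl + headroom


if __name__ == "__main__":
    print(trimmed_max_model_len(1024, 1024))  # 1k1k scenario
    print(trimmed_max_model_len(8192, 1024))  # 8k1k scenario
```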

🤖 Generated with Claude Code


Note

Low Risk
Benchmark-only additions (YAML config, shell script, changelog) with no changes to runtime application or security-sensitive paths.

Overview
Adds the EAGLE3 speculative-decoding (spec-decoding: mtp) benchmark for MiniMax-M3 MXFP8 on B200, pairing the target MiniMaxAI/MiniMax-M3-MXFP8 with draft Inferact/MiniMax-M3-EAGLE3 (3 speculative tokens).

New fixed-seq-len script minimaxm3_fp8_b200_mtp.sh extends the non-MTP B200 serve shape with --speculative-config (eagle3), drafter attention_backend: FLASH_ATTN (MHA draft vs FlashInfer/GQA constraint at block size 128), draft weight download beside pre-staged target on dgxc or HF cache otherwise, cudagraph capture scaled to concurrency × (1 + spec tokens), and --use-chat-template on the serving benchmark client.

nvidia-master.yaml gains minimaxm3-fp8-b200-vllm-mtp with the same image/runner as the base B200 entry and a search space trimmed at high concurrency (aligned with the existing B300 MTP config). perf-changelog.yaml documents the new config key.
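
The draft-download rule summarized above (beside the pre-staged target on dgxc, HF cache otherwise) could look roughly like this. It is a sketch only: the function name is hypothetical, the absolute-path test is an assumed way of detecting a pre-staged target, and the directory below /lustre/fsw/gharunners used in the example is invented for illustration.

```python
import posixpath

DRAFT_REPO = "Inferact/MiniMax-M3-EAGLE3"


def resolve_draft_location(target_model: str, draft_repo: str = DRAFT_REPO) -> str:
    """Return where the draft weights should be loaded from.

    Assumption: a pre-staged target is given as an absolute path, while
    other runners pass a bare Hugging Face id such as "org/name".
    """
    if posixpath.isabs(target_model):
        # Pre-staged target: place the draft in a sibling directory.
        parent = posixpath.dirname(target_model.rstrip("/"))
        return posixpath.join(parent, draft_repo.split("/")[-1])
    # Bare HF id: keep the repo id and let the HF cache hold the weights.
    return draft_repo


if __name__ == "__main__":
    # "hypothetical-models" is an invented directory name.
    print(resolve_draft_location("/lustre/fsw/gharunners/hypothetical-models/MiniMax-M3-MXFP8"))
    print(resolve_draft_location("MiniMaxAI/MiniMax-M3-MXFP8"))
```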

Reviewed by Cursor Bugbot for commit 1a66768. Bugbot is set up for automated code reviews on this repo.

functionstackx and others added 2 commits June 13, 2026 00:18
Adds the spec-decoding=mtp sibling of minimaxm3-fp8-b200-vllm: same
MXFP8 target and serve shape, plus the Inferact/MiniMax-M3-EAGLE3 draft
head via --speculative-config (method eagle3, 3 speculative tokens).
The drafter is pinned to FLASH_ATTN — the EAGLE3 head is MHA and
FlashInfer only supports the mandatory page size 128 through its
GQA-only trtllm-gen kernel (the failure that hit the B300 MTP canary).
Cudagraph capture scales to CONC * (1 + spec tokens); benchmark prompts
run through the chat template so acceptance reflects real text. The
EAGLE3 draft is fetched next to the pre-staged target on b200-dgxc
(gharunners tree is writable) or into the HF cache on bare-HF-id
runners. Search space mirrors the non-MTP entry trimmed at the
extreme-concurrency end, identical to the minimaxm3-fp8-b300-vllm-mtp
precedent. Launcher needs no change — launch_b200-dgxc.sh already
carries SPEC_SUFFIX and resolves the _mtp script.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@functionstackx


/reuse-sweep-run

@functionstackx functionstackx merged commit 8f1f213 into main Jun 13, 2026
4 of 6 checks passed
@functionstackx functionstackx deleted the feat/minimax-m3-b200-mtp-dayzero branch June 13, 2026 06:06