[Klaud Cold] minimaxm3-fp8-b300-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) B300 recipe by functionstackx · Pull Request #1733 · SemiAnalysisAI/InferenceX

functionstackx · 2026-06-13T00:48:52Z

Summary

Adds the EAGLE3 speculative-decoding (spec-decoding: mtp) sibling of minimaxm3-fp8-b300-vllm (#1724): MiniMax-M3 MXFP8 on B300 single-node vLLM, pairing MiniMaxAI/MiniMax-M3-MXFP8 with the Inferact/MiniMax-M3-EAGLE3 draft head.

New script benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b300_mtp.sh, based on minimaxm3_fp8_b300.sh (same serve shape: mandatory --block-size 128 for the MSA sparse/index cache, --language-model-only, scenario-trimmed MAX_MODEL_LEN from the matrix). Adds:
- --speculative-config '{"method": "eagle3", "model": <draft path>, "num_speculative_tokens": 3}'.
- Draft download follows the same MODEL/MODEL_PATH split as the target: lands next to the MXFP8 weights in writable /data/models on b300, HF cache for stand-alone runs.
- Cudagraph capture scaled to CONC * (1 + NUM_SPEC_TOKENS) (each running request contributes the extra draft tokens per decode step), still capped at vLLM's 2048 ceiling.
- Benchmark prompts routed through the chat template (--use-chat-template) so draft acceptance reflects real text rather than raw random tokens, matching the other MTP recipes.
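The speculative config and the capture sizing listed above can be sketched as follows. This is a minimal illustration, not the contents of minimaxm3_fp8_b300_mtp.sh: the function names and the constant name are hypothetical, and only the JSON keys, the CONC * (1 + NUM_SPEC_TOKENS) formula, and the 2048 ceiling come from this description.

```python
import json

# Ceiling quoted in the PR description; the constant name is hypothetical.
MAX_CUDAGRAPH_CAPTURE_SIZE = 2048


def speculative_config(draft_path: str, num_spec_tokens: int = 3) -> str:
    """Build the JSON string passed to --speculative-config."""
    return json.dumps(
        {
            "method": "eagle3",
            "model": draft_path,
            "num_speculative_tokens": num_spec_tokens,
        }
    )


def cudagraph_capture_size(conc: int, num_spec_tokens: int = 3) -> int:
    """Each running request contributes 1 + num_spec_tokens tokens per decode step."""
    return min(conc * (1 + num_spec_tokens), MAX_CUDAGRAPH_CAPTURE_SIZE)


if __name__ == "__main__":
    print(speculative_config("Inferact/MiniMax-M3-EAGLE3"))
    for conc in (1, 64, 256, 512):
        print(conc, cudagraph_capture_size(conc))
```

With 3 speculative tokens the cap only binds above concurrency 512, since 512 * 4 = 2048.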
Config minimaxm3-fp8-b300-vllm-mtp in nvidia-master.yaml, same vllm/vllm-openai:minimax-m3 image. Search space mirrors the non-MTP entry trimmed at the extreme-concurrency end, per the dsv4-fp4-b300-vllm-mtp precedent (spec decode pays off at low/mid concurrency; acceptance dilutes in big batches; draft weights + draft KV shave headroom):
- 1k1k: TP8 (1–64), TP8+EP8 (1–256), TP4 (1–64), TP4+EP4 (64–256), TP8+EP8 dp-attn (256–512)
- 8k1k: TP8 (1–64), TP8+EP8 (1–256), TP4 (1–64), TP8+EP8 dp-attn (128–256)
- tp2-ep2 dropped entirely — KV headroom was already thin without a draft.
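For readers who want to check whether a given point is in the trimmed sweep, the ranges above can be written as plain data. This is only a sketch: the dictionary layout and the helper name are hypothetical and do not mirror the actual schema of nvidia-master.yaml, and the bounds are inclusive concurrency limits copied from the list above (the step between points is not stated here, so it is not modelled).

```python
# Inclusive (min_conc, max_conc) bounds per scenario and parallelism layout,
# copied from the PR description. The structure itself is illustrative.
SEARCH_SPACE = {
    "1k1k": {
        "TP8": (1, 64),
        "TP8+EP8": (1, 256),
        "TP4": (1, 64),
        "TP4+EP4": (64, 256),
        "TP8+EP8 dp-attn": (256, 512),
    },
    "8k1k": {
        "TP8": (1, 64),
        "TP8+EP8": (1, 256),
        "TP4": (1, 64),
        "TP8+EP8 dp-attn": (128, 256),
    },
}


def in_search_space(scenario: str, layout: str, conc: int) -> bool:
    """Return True if the concurrency falls inside the listed range."""
    bounds = SEARCH_SPACE.get(scenario, {}).get(layout)
    if bounds is None:
        return False  # e.g. tp2-ep2, which this recipe drops entirely
    low, high = bounds
    return low <= conc <= high


if __name__ == "__main__":
    print(in_search_space("1k1k", "TP4+EP4", 64))
    print(in_search_space("8k1k", "TP4+EP4", 64))
```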
perf-changelog entry for the new config key.

Validation

generate_sweep_configs.py test-config --config-keys minimaxm3-fp8-b300-vllm-mtp generates the full matrix cleanly (scenario-trimmed max-model-len 2304 / 9472, eval points auto-assigned).
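The two max-model-len values are consistent with input length plus output length plus a 256-token margin, taking 1k as 1024 and 8k as 8192. The sketch below only reproduces that arithmetic; the margin is inferred from the numbers, and whether generate_sweep_configs.py computes it this way is an assumption. The function name is hypothetical.

```python
def trimmed_max_model_len(isl: int, osl: int, margin: int = 256) -> int:
    """Input tokens + output tokens + a fixed margin (margin inferred, not confirmed)."""
    return isl + osl + margin


if __name__ == "__main__":
    print(trimmed_max_model_len(1024, 1024))  # 1k1k scenario
    print(trimmed_max_model_len(8192, 1024))  # 8k1k scenario
```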
bash -n on the new script passes.

🤖 Generated with Claude Code

Note

Low Risk
Benchmark-only additions (YAML matrix, shell recipe, changelog) with no changes to production services or security-sensitive paths.

Overview
Adds a spec-decoding (MTP) benchmark track for MiniMax-M3 MXFP8 on B300: target MiniMaxAI/MiniMax-M3-MXFP8 plus draft Inferact/MiniMax-M3-EAGLE3 (3 speculative tokens), alongside the existing non-MTP minimaxm3-fp8-b300-vllm entry.

New shell recipe minimaxm3_fp8_b300_mtp.sh extends the base MiniMax B300 serve shape with --speculative-config (EAGLE3), draft weight download beside the target, FLASH_ATTN on the drafter, cudagraph capture sized for CONC * (1 + spec tokens), and --use-chat-template on serving benchmarks. minimaxm3-fp8-b300-vllm-mtp in nvidia-master.yaml wires the sweep with spec-decoding: mtp and a trimmed concurrency search space (high-concurrency points and tp2-ep2 dropped vs non-MTP). perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit e032b16. Bugbot is set up for automated code reviews on this repo.

Adds the spec-decoding=mtp sibling of minimaxm3-fp8-b300-vllm: same MXFP8 target and serve shape, plus the Inferact/MiniMax-M3-EAGLE3 draft head via --speculative-config (method eagle3, 2 speculative tokens, the house default per dsv4_fp4_b300_vllm_mtp.sh).

Cudagraph capture scales to CONC * (1 + spec tokens) since each running request contributes the extra draft tokens per decode step; benchmark prompts go through the chat template so acceptance reflects real text rather than random tokens.

Search space mirrors the non-MTP entry trimmed at the extreme-concurrency end (dsv4 b300 vllm mtp precedent); tp2-ep2 is dropped because its KV headroom was already thin before adding the draft weights + draft KV.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-13T00:49:00Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-13T00:58:35Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27451583496
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27451583496

The canary died in FlashInferMetadataBuilder: the EAGLE3 head is MHA, and FlashInfer only supports page size 128 (mandatory for the target's MSA sparse/index cache) through its trtllm-gen kernel, which requires GQA/MQA. SpeculativeConfig.attention_backend exists for exactly this: the draft runs on FLASH_ATTN (any multiple-of-16 block size, MHA fine, SM10x fine) while the target keeps FlashInfer.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
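The fix in this commit message could look roughly like the sketch below, which adds a draft-only attention backend to the speculative config. The commit confirms that SpeculativeConfig.attention_backend exists and that the draft runs on FLASH_ATTN; passing it as an "attention_backend" key inside the --speculative-config JSON is an assumption here, and the function and constant names are hypothetical.

```python
import json

# Block size the target requires for its MSA sparse/index cache (from the PR).
TARGET_BLOCK_SIZE = 128


def draft_speculative_config(
    draft_path: str,
    num_spec_tokens: int = 3,
    draft_backend: str = "FLASH_ATTN",
) -> str:
    """Speculative config with a separate attention backend for the draft head."""
    # Per the commit message, FLASH_ATTN accepts any multiple-of-16 block size.
    if TARGET_BLOCK_SIZE % 16 != 0:
        raise ValueError("block size must be a multiple of 16 for FLASH_ATTN")
    return json.dumps(
        {
            "method": "eagle3",
            "model": draft_path,
            "num_speculative_tokens": num_spec_tokens,
            # Assumed JSON key, mirroring SpeculativeConfig.attention_backend.
            "attention_backend": draft_backend,
        }
    )


if __name__ == "__main__":
    print(draft_speculative_config("Inferact/MiniMax-M3-EAGLE3"))
```

The target model's backend is left untouched, so it keeps FlashInfer with the 128 page size.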

github-actions · 2026-06-13T01:44:52Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27451860491
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27451860491

functionstackx · 2026-06-13T01:48:14Z

/reuse-sweep-run


functionstackx · 2026-06-13T01:50:34Z

/reuse-sweep-run

functionstackx · 2026-06-13T01:50:44Z

/reuse-sweep-run

github-actions · 2026-06-13T01:50:52Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27453028374
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27453028374

functionstackx requested a review from a team June 13, 2026 00:48

functionstackx requested review from jgangani and kedarpotdar-nv as code owners June 13, 2026 00:48

github-project-automation Bot added this to InferenceMAX Board Jun 13, 2026

functionstackx and others added 2 commits June 12, 2026 20:49

perf-changelog: fill in PR link for minimaxm3-fp8-b300-vllm-mtp

7579422

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

minimaxm3-fp8-b300-vllm-mtp: bump to 3 speculative tokens

9d39b68

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

functionstackx added the full-sweep-enabled label Jun 13, 2026

functionstackx mentioned this pull request Jun 13, 2026

[NVIDIA] feat: MiniMax M3 Day 0 MTP (EAGLE3) support B300 #1737

Closed

Merge remote-tracking branch 'origin/main' into pr-1733-reuse

e032b16

# Conflicts:
#	.github/configs/nvidia-master.yaml
#	perf-changelog.yaml

functionstackx merged commit 9cb2670 into main Jun 13, 2026
16 of 24 checks passed

functionstackx deleted the feat/minimax-m3-b300-mtp-dayzero branch June 13, 2026 01:50

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 13, 2026

functionstackx merged 5 commits into main from feat/minimax-m3-b300-mtp-dayzero
