[Klaud Cold] minimaxm3-fp8-b300-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) B300 recipe #1733

Merged: functionstackx merged 5 commits into main from feat/minimax-m3-b300-mtp-dayzero on Jun 13, 2026

@functionstackx functionstackx commented Jun 13, 2026

Collaborator

Summary

Adds the EAGLE3 speculative-decoding (spec-decoding: mtp) sibling of minimaxm3-fp8-b300-vllm (#1724): MiniMax-M3 MXFP8 on B300 single-node vLLM, pairing MiniMaxAI/MiniMax-M3-MXFP8 with the Inferact/MiniMax-M3-EAGLE3 draft head.

  • New script benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b300_mtp.sh, based on minimaxm3_fp8_b300.sh (same serve shape: mandatory --block-size 128 for the MSA sparse/index cache, --language-model-only, scenario-trimmed MAX_MODEL_LEN from the matrix). Adds:
    • --speculative-config '{"method": "eagle3", "model": <draft path>, "num_speculative_tokens": 3}'.
    • Draft download follows the same MODEL/MODEL_PATH split as the target: lands next to the MXFP8 weights in writable /data/models on b300, HF cache for stand-alone runs.
    • Cudagraph capture scaled to CONC * (1 + NUM_SPEC_TOKENS) (each running request contributes the extra draft tokens per decode step), still capped at vLLM's 2048 ceiling.
    • Benchmark prompts routed through the chat template (--use-chat-template) so draft acceptance reflects real text rather than raw random tokens, matching the other MTP recipes.
  • Config minimaxm3-fp8-b300-vllm-mtp in nvidia-master.yaml, same vllm/vllm-openai:minimax-m3 image. Search space mirrors the non-MTP entry trimmed at the extreme-concurrency end, per the dsv4-fp4-b300-vllm-mtp precedent (spec decode pays off at low/mid concurrency; acceptance dilutes in big batches; draft weights + draft KV shave headroom):
    • 1k1k: TP8 (1–64), TP8+EP8 (1–256), TP4 (1–64), TP4+EP4 (64–256), TP8+EP8 dp-attn (256–512)
    • 8k1k: TP8 (1–64), TP8+EP8 (1–256), TP4 (1–64), TP8+EP8 dp-attn (128–256)
    • tp2-ep2 dropped entirely — KV headroom was already thin without a draft.
  • perf-changelog entry for the new config key.
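
The two script additions described above, the `--speculative-config` payload and the scaled cudagraph capture size, can be sketched roughly as follows. This is not the content of the merged script: `DRAFT_MODEL_PATH`, `CUDAGRAPH_CEILING`, and `capture_size` are hypothetical names, the draft path is filled with the Hugging Face repo id as a stand-in, and the way the resulting capture size is handed to `vllm serve` is deliberately left out.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the contents of minimaxm3_fp8_b300_mtp.sh.
# DRAFT_MODEL_PATH, CUDAGRAPH_CEILING and capture_size are hypothetical names.

CONC=64
NUM_SPEC_TOKENS=3
DRAFT_MODEL_PATH="Inferact/MiniMax-M3-EAGLE3"   # stand-in for the real draft path
CUDAGRAPH_CEILING=2048                           # ceiling quoted in the PR summary

# Each running request contributes one target token plus NUM_SPEC_TOKENS draft
# tokens per decode step, so the largest captured batch scales by (1 + spec).
capture_size() {
  local conc="$1" spec="$2" size
  size=$(( conc * (1 + spec) ))
  if (( size > CUDAGRAPH_CEILING )); then
    size=$CUDAGRAPH_CEILING
  fi
  echo "$size"
}

# JSON payload in the shape the summary gives for --speculative-config.
SPEC_CONFIG=$(printf '{"method": "eagle3", "model": "%s", "num_speculative_tokens": %d}' \
  "$DRAFT_MODEL_PATH" "$NUM_SPEC_TOKENS")

MAX_CAPTURE_SIZE=$(capture_size "$CONC" "$NUM_SPEC_TOKENS")
echo "speculative config: $SPEC_CONFIG"
echo "max cudagraph capture size: $MAX_CAPTURE_SIZE"
```

With 3 speculative tokens the cap is reached at a concurrency of 512, so in this sketch every higher concurrency point captures at the same ceiling.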

Validation

  • generate_sweep_configs.py test-config --config-keys minimaxm3-fp8-b300-vllm-mtp generates the full matrix cleanly (scenario-trimmed max-model-len 2304 / 9472, eval points auto-assigned).
  • bash -n on the new script passes.
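
The two scenario-trimmed values reported above are consistent with input length plus output length plus a 256-token margin. That margin is an inference from the numbers, not something this PR states, and `trimmed_max_model_len` is a hypothetical helper rather than a function from `generate_sweep_configs.py`. A quick arithmetic check under that assumption:

```shell
#!/usr/bin/env bash
# Hypothetical helper. The 256-token margin is inferred from the reported
# values (2304 and 9472); it is not taken from generate_sweep_configs.py.
trimmed_max_model_len() {
  local isl="$1" osl="$2" margin="${3:-256}"
  echo $(( isl + osl + margin ))
}

echo "1k1k: $(trimmed_max_model_len 1024 1024)"
echo "8k1k: $(trimmed_max_model_len 8192 1024)"
```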

🤖 Generated with Claude Code


Note

Low Risk
Benchmark-only additions (YAML matrix, shell recipe, changelog) with no changes to production services or security-sensitive paths.

Overview
Adds a spec-decoding (MTP) benchmark track for MiniMax-M3 MXFP8 on B300: target MiniMaxAI/MiniMax-M3-MXFP8 plus draft Inferact/MiniMax-M3-EAGLE3 (3 speculative tokens), alongside the existing non-MTP minimaxm3-fp8-b300-vllm entry.

New shell recipe minimaxm3_fp8_b300_mtp.sh extends the base MiniMax B300 serve shape with --speculative-config (EAGLE3), draft weight download beside the target, FLASH_ATTN on the drafter, cudagraph capture sized for CONC * (1 + spec tokens), and --use-chat-template on serving benchmarks. minimaxm3-fp8-b300-vllm-mtp in nvidia-master.yaml wires the sweep with spec-decoding: mtp and a trimmed concurrency search space (high-concurrency points and tp2-ep2 dropped vs non-MTP). perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit e032b16.

Adds the spec-decoding=mtp sibling of minimaxm3-fp8-b300-vllm: same
MXFP8 target and serve shape, plus the Inferact/MiniMax-M3-EAGLE3 draft
head via --speculative-config (method eagle3, 3 speculative tokens,
house default per dsv4_fp4_b300_vllm_mtp.sh). Cudagraph capture scales
to CONC * (1 + spec tokens) since each running request contributes the
extra draft tokens per decode step; benchmark prompts go through the
chat template so acceptance reflects real text rather than random
tokens. Search space mirrors the non-MTP entry trimmed at the
extreme-concurrency end (dsv4 b300 vllm mtp precedent); tp2-ep2 is
dropped because its KV headroom was already thin before adding the
draft weights + draft KV.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

functionstackx and others added 2 commits June 12, 2026 20:49
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The canary died in FlashInferMetadataBuilder: the EAGLE3 head is MHA,
and FlashInfer only supports page size 128 (mandatory for the target's
MSA sparse/index cache) through its trtllm-gen kernel, which requires
GQA/MQA. SpeculativeConfig.attention_backend exists for exactly this —
the draft runs on FLASH_ATTN (any multiple-of-16 block size, MHA fine,
SM10x fine) while the target keeps FlashInfer.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
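
A minimal sketch of the fix that commit message describes, assuming `SpeculativeConfig.attention_backend` can be set through a JSON key of the same name in `--speculative-config` (the commit names the field but not the exact JSON it passes). `block_size_ok_for_flash_attn` is a hypothetical guard that encodes the multiple-of-16 constraint stated above, and the draft path is again a stand-in.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names other than FLASH_ATTN, eagle3 and the
# block size of 128 are hypothetical.

BLOCK_SIZE=128                                   # mandatory for the target's MSA cache
DRAFT_ATTN_BACKEND="FLASH_ATTN"                  # draft head is MHA, so avoid FlashInfer
DRAFT_MODEL_PATH="Inferact/MiniMax-M3-EAGLE3"   # stand-in for the real draft path
NUM_SPEC_TOKENS=3

# The commit message says FLASH_ATTN accepts any block size that is a
# multiple of 16; return success only in that case.
block_size_ok_for_flash_attn() {
  (( $1 > 0 && $1 % 16 == 0 ))
}

if ! block_size_ok_for_flash_attn "$BLOCK_SIZE"; then
  echo "block size $BLOCK_SIZE is not a multiple of 16" >&2
  exit 1
fi

# Assumption: the JSON key mirrors the SpeculativeConfig field name.
SPEC_CONFIG=$(printf '{"method": "eagle3", "model": "%s", "num_speculative_tokens": %d, "attention_backend": "%s"}' \
  "$DRAFT_MODEL_PATH" "$NUM_SPEC_TOKENS" "$DRAFT_ATTN_BACKEND")
echo "$SPEC_CONFIG"
```

Only the drafter's backend is overridden here; the target's backend is untouched, which matches the commit's statement that the target keeps FlashInfer.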
@functionstackx

Collaborator Author

/reuse-sweep-run

# Conflicts:
#	.github/configs/nvidia-master.yaml
#	perf-changelog.yaml
@functionstackx

Collaborator Author

/reuse-sweep-run

@functionstackx

Collaborator Author

/reuse-sweep-run

@functionstackx functionstackx merged commit 9cb2670 into main Jun 13, 2026
16 of 24 checks passed
@functionstackx functionstackx deleted the feat/minimax-m3-b300-mtp-dayzero branch June 13, 2026 01:50