[Klaud Cold] minimaxm3-fp8-b300-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) B300 recipe #1733
Adds the spec-decoding=mtp sibling of minimaxm3-fp8-b300-vllm: same MXFP8 target and serve shape, plus the Inferact/MiniMax-M3-EAGLE3 draft head via --speculative-config (method eagle3, 3 speculative tokens, the house default per dsv4_fp4_b300_vllm_mtp.sh). Cudagraph capture scales to CONC * (1 + spec tokens), since each running request contributes its extra draft tokens per decode step. Benchmark prompts go through the chat template so acceptance reflects real text rather than random tokens. The search space mirrors the non-MTP entry, trimmed at the extreme-concurrency end (following the dsv4 b300 vllm mtp precedent); tp2-ep2 is dropped because its KV headroom was already thin before adding the draft weights and draft KV. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
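The capture sizing described above can be sketched as a small calculation. This is a minimal illustration, not the recipe's actual code: the function name `cudagraph_capture_size` and its parameters are hypothetical, and the 2048 ceiling and the 3 speculative tokens are taken from the PR summary rather than from vLLM's documentation.

```python
def cudagraph_capture_size(conc: int, num_spec_tokens: int, ceiling: int = 2048) -> int:
    """Largest batch (in tokens) to capture for `conc` concurrent requests.

    Each running request contributes 1 target token plus `num_spec_tokens`
    draft tokens per decode step. `ceiling` is the cap quoted in the PR
    summary (an assumption here, not read from vLLM itself).
    """
    if conc < 1 or num_spec_tokens < 0:
        raise ValueError("conc must be >= 1 and num_spec_tokens must be >= 0")
    return min(conc * (1 + num_spec_tokens), ceiling)


# Illustrative concurrency points (hypothetical, not the PR's search space).
for conc in (4, 64, 1024):
    print(conc, cudagraph_capture_size(conc, num_spec_tokens=3))
```

With no speculative tokens the result reduces to plain `CONC`, which matches the sizing a non-MTP recipe would use.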
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27451583496 |
The canary died in FlashInferMetadataBuilder: the EAGLE3 head is MHA, and FlashInfer only supports page size 128 (mandatory for the target's MSA sparse/index cache) through its trtllm-gen kernel, which requires GQA/MQA. SpeculativeConfig.attention_backend exists for exactly this — the draft runs on FLASH_ATTN (any multiple-of-16 block size, MHA fine, SM10x fine) while the target keeps FlashInfer. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
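As a rough sketch of the fix described above, the JSON passed to --speculative-config could be assembled as follows. The helper `build_speculative_config` is hypothetical, and the JSON key `attention_backend` is assumed to mirror the `SpeculativeConfig.attention_backend` field named in this comment; the method, draft model, and token count come from the PR summary.

```python
import json
from typing import Optional


def build_speculative_config(
    draft_model: str,
    num_spec_tokens: int,
    draft_attention_backend: Optional[str] = None,
) -> str:
    """Return the JSON string for --speculative-config.

    `draft_attention_backend` overrides the attention backend for the draft
    model only; the target keeps whatever backend the server selects.
    The key name "attention_backend" is an assumption based on the
    SpeculativeConfig field mentioned in the comment above.
    """
    config = {
        "method": "eagle3",
        "model": draft_model,
        "num_speculative_tokens": num_spec_tokens,
    }
    if draft_attention_backend is not None:
        config["attention_backend"] = draft_attention_backend
    return json.dumps(config)


spec_config = build_speculative_config(
    "Inferact/MiniMax-M3-EAGLE3", 3, draft_attention_backend="FLASH_ATTN"
)
print(spec_config)
```

Keeping the override inside the speculative config, rather than changing the global backend, leaves the target on FlashInfer with its page size of 128.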
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27451860491 |
|
/reuse-sweep-run |
# Conflicts:
#	.github/configs/nvidia-master.yaml
#	perf-changelog.yaml
|
/reuse-sweep-run |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27453028374 |
Summary
Adds the EAGLE3 speculative-decoding (spec-decoding: mtp) sibling of minimaxm3-fp8-b300-vllm (#1724): MiniMax-M3 MXFP8 on B300 single-node vLLM, pairing MiniMaxAI/MiniMax-M3-MXFP8 with the Inferact/MiniMax-M3-EAGLE3 draft head.
- benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b300_mtp.sh, based on minimaxm3_fp8_b300.sh (same serve shape: mandatory --block-size 128 for the MSA sparse/index cache, --language-model-only, scenario-trimmed MAX_MODEL_LEN from the matrix). Adds:
  - --speculative-config '{"method": "eagle3", "model": <draft path>, "num_speculative_tokens": 3}'.
  - Draft weights downloaded with the same MODEL/MODEL_PATH split as the target: lands next to the MXFP8 weights in writable /data/models on b300, HF cache for stand-alone runs.
  - Cudagraph capture sized to CONC * (1 + NUM_SPEC_TOKENS) (each running request contributes the extra draft tokens per decode step), still capped at vLLM's 2048 ceiling.
  - Benchmark prompts sent through the chat template (--use-chat-template) so draft acceptance reflects real text rather than raw random tokens, matching the other MTP recipes.
- minimaxm3-fp8-b300-vllm-mtp in nvidia-master.yaml, same vllm/vllm-openai:minimax-m3 image. Search space mirrors the non-MTP entry trimmed at the extreme-concurrency end, per the dsv4-fp4-b300-vllm-mtp precedent (spec decode pays off at low/mid concurrency; acceptance dilutes in big batches; draft weights + draft KV shave headroom).
Validation
- generate_sweep_configs.py test-config --config-keys minimaxm3-fp8-b300-vllm-mtp generates the full matrix cleanly (scenario-trimmed max-model-len 2304 / 9472, eval points auto-assigned).
- bash -n on the new script passes.
🤖 Generated with Claude Code
Note
Low Risk
Benchmark-only additions (YAML matrix, shell recipe, changelog) with no changes to production services or security-sensitive paths.
Overview
Adds a spec-decoding (MTP) benchmark track for MiniMax-M3 MXFP8 on B300: target MiniMaxAI/MiniMax-M3-MXFP8 plus draft Inferact/MiniMax-M3-EAGLE3 (3 speculative tokens), alongside the existing non-MTP minimaxm3-fp8-b300-vllm entry.
New shell recipe minimaxm3_fp8_b300_mtp.sh extends the base MiniMax B300 serve shape with --speculative-config (EAGLE3), draft weight download beside the target, FLASH_ATTN on the drafter, cudagraph capture sized for CONC * (1 + spec tokens), and --use-chat-template on serving benchmarks.
minimaxm3-fp8-b300-vllm-mtp in nvidia-master.yaml wires the sweep with spec-decoding: mtp and a trimmed concurrency search space (high-concurrency points and tp2-ep2 dropped vs non-MTP). perf-changelog.yaml documents the new config key.
Reviewed by Cursor Bugbot for commit e032b16. Bugbot is set up for automated code reviews on this repo.