[Klaud Cold] minimaxm3-fp8-b200-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) B200 recipe #1741
Merged
Adds the spec-decoding=mtp sibling of minimaxm3-fp8-b200-vllm: same MXFP8 target and serve shape, plus the Inferact/MiniMax-M3-EAGLE3 draft head via --speculative-config (method eagle3, 3 speculative tokens).

The drafter is pinned to FLASH_ATTN: the EAGLE3 head is MHA, and FlashInfer only supports the mandatory page size 128 through its GQA-only trtllm-gen kernel (the failure that hit the B300 MTP canary). Cudagraph capture scales to CONC * (1 + spec tokens); benchmark prompts run through the chat template so acceptance reflects real text.

The EAGLE3 draft is fetched next to the pre-staged target on b200-dgxc (the gharunners tree is writable) or into the HF cache on bare-HF-id runners. The search space mirrors the non-MTP entry, trimmed at the extreme-concurrency end, identical to the minimaxm3-fp8-b300-vllm-mtp precedent. The launcher needs no change: launch_b200-dgxc.sh already carries SPEC_SUFFIX and resolves the _mtp script.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
e31ffdf to 2a04b15
Contributor
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27456333115

Collaborator, Author
/reuse-sweep-run

Contributor
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27458619687
Summary
Adds the EAGLE3 speculative-decoding (`spec-decoding: mtp`) sibling of `minimaxm3-fp8-b200-vllm`: MiniMax-M3 MXFP8 on B200 single-node vLLM, pairing MiniMaxAI/MiniMax-M3-MXFP8 with the Inferact/MiniMax-M3-EAGLE3 draft head. Based on the B200 non-MTP recipe, mirroring the merged B300 MTP entry (#1733).

New benchmark script

`benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200_mtp.sh`, based on `minimaxm3_fp8_b200.sh` (mandatory `--block-size 128` for the MSA sparse/index cache, `--language-model-only`, scenario-trimmed `MAX_MODEL_LEN`). Additions for MTP:

- `--speculative-config '{"method": "eagle3", "model": <draft>, "num_speculative_tokens": 3, "attention_backend": "FLASH_ATTN"}'`.
- Drafter pinned to `FLASH_ATTN`: the EAGLE3 head is MHA, and FlashInfer only supports page size 128 (mandatory here) through its GQA-only trtllm-gen kernel; engine init dies in `FlashInferMetadataBuilder` otherwise. This exact failure hit the B300 MTP canary. FLASH_ATTN takes any multiple-of-16 block size and handles MHA.
- The EAGLE3 draft is fetched next to the pre-staged target on `b200-dgxc` (the `/lustre/fsw/gharunners` tree is writable), or into the HF cache on bare-HF-id runners (`b200-cw` / `b200-nb`).
- Cudagraph capture scales to `CONC * (1 + spec tokens)`, capped at 2048.
- `--use-chat-template` on the benchmark client so draft acceptance reflects real text rather than random tokens.

Config (`nvidia-master.yaml`)

`minimaxm3-fp8-b200-vllm-mtp`, same `vllm/vllm-openai:minimax-m3` image and `b200-dgxc` runner. Search space mirrors the non-MTP entry trimmed at the extreme-concurrency end (identical to `minimaxm3-fp8-b300-vllm-mtp`).

No launcher change needed: `launch_b200-dgxc.sh` already carries `SPEC_SUFFIX` and resolves `minimaxm3_fp8_b200_mtp.sh`.

perf-changelog

Entry for the new config key.

Validation

- `generate_sweep_configs.py test-config --config-keys minimaxm3-fp8-b200-vllm-mtp` generates 53 configs cleanly (scenario-trimmed max-model-len 2304 / 9472, all `spec-decoding=mtp` on `b200-dgxc`).
- `bash -n` passes on the new script.
- The launcher resolves `_vllm_mtp` to `minimaxm3_fp8_b200_mtp.sh` (exists).

🤖 Generated with Claude Code
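For readers who want to see the shape of the `--speculative-config` value described in this summary, here is a minimal Python sketch that assembles the same JSON. The four keys and their values come from the PR text; the helper names `build_speculative_config` and `speculative_args` are hypothetical and are not taken from the benchmark script, which is a shell script and may build the string differently.

```python
import json
import shlex


def build_speculative_config(draft_model: str, num_speculative_tokens: int = 3) -> str:
    """Return the JSON string passed to --speculative-config.

    Hypothetical helper: the keys mirror the PR description, the function
    itself is only an illustration.
    """
    config = {
        "method": "eagle3",
        "model": draft_model,
        "num_speculative_tokens": num_speculative_tokens,
        # The drafter is pinned to FLASH_ATTN because the EAGLE3 head is MHA.
        "attention_backend": "FLASH_ATTN",
    }
    return json.dumps(config)


def speculative_args(draft_model: str) -> list[str]:
    """Return the flag and its JSON value as two argv entries (hypothetical helper)."""
    return ["--speculative-config", build_speculative_config(draft_model)]


if __name__ == "__main__":
    # shlex.join quotes the JSON so the line can be pasted into a shell.
    print(shlex.join(speculative_args("Inferact/MiniMax-M3-EAGLE3")))
```

Building the value with `json.dumps` rather than hand-written quoting keeps the string valid JSON if the draft path contains characters that need escaping.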
Note
Low Risk
Benchmark-only additions (YAML config, shell script, changelog) with no changes to runtime application or security-sensitive paths.
Overview
Adds the EAGLE3 speculative-decoding (`spec-decoding: mtp`) benchmark for MiniMax-M3 MXFP8 on B200, pairing the target MiniMaxAI/MiniMax-M3-MXFP8 with draft Inferact/MiniMax-M3-EAGLE3 (3 speculative tokens).

New fixed-seq-len script `minimaxm3_fp8_b200_mtp.sh` extends the non-MTP B200 serve shape with `--speculative-config` (eagle3), drafter `attention_backend: FLASH_ATTN` (MHA draft vs FlashInfer/GQA constraint at block size 128), draft weight download beside the pre-staged target on dgxc or into the HF cache otherwise, cudagraph capture scaled to concurrency × (1 + spec tokens), and `--use-chat-template` on the serving benchmark client.

`nvidia-master.yaml` gains `minimaxm3-fp8-b200-vllm-mtp` with the same image/runner as the base B200 entry and a search space trimmed at high concurrency (aligned with the existing B300 MTP config).

`perf-changelog.yaml` documents the new config key.

Reviewed by Cursor Bugbot for commit 1a66768. Bugbot is set up for automated code reviews on this repo.
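The capture-size rule mentioned in this PR, concurrency × (1 + spec tokens) with the 2048 cap stated in the PR summary, can be sketched as a small Python function. This is an illustration of the arithmetic only: the function name and argument names are hypothetical, and how the shell script passes the resulting value to vLLM is not shown in this page.

```python
MAX_CAPTURE_SIZE = 2048  # cap stated in the PR summary


def cudagraph_capture_size(conc: int, num_speculative_tokens: int = 3,
                           cap: int = MAX_CAPTURE_SIZE) -> int:
    """Return the capture size for a given client concurrency.

    Hypothetical helper. With speculative decoding, each in-flight request
    can contribute 1 verified token plus num_speculative_tokens draft
    tokens per step, so the size scales with (1 + num_speculative_tokens).
    """
    if conc < 1:
        raise ValueError("conc must be at least 1")
    if num_speculative_tokens < 0:
        raise ValueError("num_speculative_tokens must be non-negative")
    return min(conc * (1 + num_speculative_tokens), cap)


if __name__ == "__main__":
    for conc in (4, 64, 512, 1024):
        print(conc, cudagraph_capture_size(conc))
```

With 3 speculative tokens the cap is reached at a concurrency of 512, since 512 × 4 = 2048; higher concurrencies stay at 2048.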