[WIP][NV] Use Marlin for MiniMax M3 TP-only configs #1807
jasonlizhengjian wants to merge 3 commits into
Thanks for the contribution! For vLLM and SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please open a PR there first before we can merge your single-node PR into the master branch. Let's make sure the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!
PR authors are responsible for ensuring that all GitHub Actions jobs fully pass after merging. Failures are often just flakes, and simply re-running the failed jobs will fix them. If a re-run is attempted, the PR author is responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Claude finished @jasonlizhengjian's task in 59s.
Review
LGTM: no blocking issues found. All four scripts consistently add …
/sweep test-config --config-files .github/configs/nvidia-master.yaml --config-keys minimaxm3-fp8-b200-vllm minimaxm3-fp8-b300-vllm minimaxm3-fp8-b200-vllm-mtp minimaxm3-fp8-b300-vllm-mtp
@jasonlizhengjian Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27630154690
Stacked on #1781 and #1784.
Adds --moe-backend marlin for MiniMax-M3 B200/B300 TP-only vLLM launch paths when expert parallelism is disabled.

Validation:
- bash -n benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200.sh benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b300.sh benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200_mtp.sh benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b300_mtp.sh
- git diff --check
- perf-changelog.yaml and .github/configs/nvidia-master.yaml
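For reference, the bash -n step in the validation list can be wrapped in a small helper that checks every script and reports all failures instead of stopping at the first one. This is a sketch, not part of the PR: check_scripts is a hypothetical name, it only covers the syntax check (bash -n parses without executing), git diff --check appears only as a follow-up command in the usage comment, and the YAML checks are left out because the PR does not say how they were run.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): parse each script with `bash -n`
# (no execution) and return non-zero if any of them fails to parse.
check_scripts() {
  local failed=0 f
  for f in "$@"; do
    if ! bash -n "$f"; then
      # bash -n already printed the parse error; name the file as well.
      printf 'syntax check failed: %s\n' "$f" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage from the repo root (script paths as listed in the PR):
#   check_scripts benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b{200,300}{,_mtp}.sh \
#     && git diff --check
```

Collecting failures rather than exiting early keeps the output useful when several of the four near-identical scripts share the same mistake.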
Low Risk
Benchmark-only vLLM serve flags for a specific MoE backend when expert parallelism is off; no auth, data, or core application logic changes.
Overview
For MiniMax-M3 MXFP8 B200/B300 vLLM fixed-sequence recipes, the TP-only launch path (DP_ATTENTION is false and EP_SIZE is 1) now passes --moe-backend marlin alongside --tensor-parallel-size=$TP. The DP-attention and expert-parallel branches are unchanged.

The same PARALLEL_ARGS tweak is applied to the non-MTP and EAGLE3 MTP scripts (minimaxm3_fp8_b200.sh, minimaxm3_fp8_b300.sh, and their _mtp siblings). perf-changelog.yaml records the change for minimaxm3-fp8-b200-vllm, minimaxm3-fp8-b300-vllm, and the matching MTP config keys.

Reviewed by Cursor Bugbot for commit 32c5fe0. Bugbot is set up for automated code reviews on this repo.
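As a rough illustration of the TP-only branch described in the overview, here is a minimal bash sketch of how the DP_ATTENTION/EP_SIZE check might gate the new flag. TP, DP_ATTENTION, EP_SIZE and PARALLEL_ARGS are the names used in the PR; the build_parallel_args helper is hypothetical, and the real scripts may build PARALLEL_ARGS differently. The DP-attention and expert-parallel branches, which this PR leaves unchanged, are deliberately not reproduced, so the sketch simply returns non-zero for them.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the actual script contents): emit the parallelism
# flags for the TP-only case described in the PR.
build_parallel_args() {
  local tp="$1" dp_attention="$2" ep_size="$3"
  if [[ "$dp_attention" == "false" && "$ep_size" == "1" ]]; then
    # TP-only: no DP attention and no expert parallelism, so add the
    # Marlin MoE backend flag next to the tensor-parallel size.
    printf '%s\n' "--tensor-parallel-size=$tp --moe-backend marlin"
  else
    # DP-attention / expert-parallel branches are unchanged by the PR and
    # their flags are not shown here; signal "not handled" to the caller.
    return 1
  fi
}

# Example: PARALLEL_ARGS="$(build_parallel_args "$TP" "$DP_ATTENTION" "$EP_SIZE")"
```

Keying the flag off both conditions means a config that enables either DP attention or EP never picks up the Marlin override, which matches the PR's statement that only the TP-only path changes.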