[AMD] feat: MiniMax M3 Day 0 support MI355X #1725
MXFP8 single-node vLLM sweep (TP/TEP/DEP) for MiniMax-M3 on MI355X (gfx950). --block-size 128 (MSA sparse attention; default 16 fails on AMD), --attention-backend TRITON_ATTN, --language-model-only.
Day-zero enablement: no public ROCm image carries M3 yet (vllm-project/vllm#45381 unmerged), so the bench script overlays the m3_release python tree onto the nightly-6fbfdd18 image and compiles the missing fused qknorm/rope/kv-insert _C op for gfx950 (cached on the shared mount).
launch_mi355x-amds.sh routes M3 weights to NFS /it-share/hf-hub-cache (not node-local NVMe).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
start_gpu_monitor

set -x
vllm serve $MODEL --port $PORT \
Serve starts after op failure
Medium Severity
The day-zero fused-op path ends with a Python assert that the _C symbol exists, but the script never checks that command’s exit status. If overlay, compile, or load_library fails, execution still reaches vllm serve, so jobs can run benchmarks without the mandatory fused_minimax_m3_qknorm_rope_kv_insert op loaded.
Reviewed by Cursor Bugbot for commit e803275.
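One way to address the finding above is to gate the server launch on the op check's exit status. The following is a minimal sketch, not the PR's actual fix: `check_fused_op`, `launch_if_op_loaded`, and the `FUSED_OP_CHECK_CMD` variable are hypothetical names, and the check command is a stand-in for the script's Python assert.

```shell
#!/usr/bin/env bash
# Sketch only: function and variable names here are hypothetical.

# Stand-in for the script's Python assert that the fused _C symbol exists.
# FUSED_OP_CHECK_CMD defaults to `true` so the sketch runs anywhere.
check_fused_op() {
    "${FUSED_OP_CHECK_CMD:-true}"
}

# Refuse to continue to the server launch when the check fails.
launch_if_op_loaded() {
    if ! check_fused_op; then
        echo "fused op check failed; not starting vllm serve" >&2
        return 1
    fi
    echo "fused op present; safe to start vllm serve"
}

# Abort the job instead of falling through to the benchmark.
launch_if_op_loaded || exit 1
```

An equivalent effect could be had with `set -e` around the overlay, compile, and assert steps; the explicit check is shown because it also yields a clear error message in the job log.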
# Conflicts: # perf-changelog.yaml
Claude finished @cquil11's task in 2m 30s. Review complete.
Findings: 1 blocking issue found; see inline comment.
The Cursor Bugbot findings about missing overlay/compile steps and missing TRITON_ATTN backend appear to have been addressed in subsequent commits — the current script uses a dev image ( |
vllm serve "$MODEL" --port "$PORT" \
    "${PARALLEL_ARGS[@]}" \
    --block-size 128 \
    --language-model-only \
    --max-model-len "$MAX_MODEL_LEN" \
    --attention-backend TRITON_ATTN \
    --enforce-eager \
    --tool-call-parser minimax_m3 \
    --reasoning-parser minimax_m3 \
    --enable-auto-tool-choice > "$SERVER_LOG" 2>&1 &
🔴 BLOCKING: Missing --trust-remote-code on vllm serve
Why it matters: Every other MiniMax benchmark script (M2.5 MI355X, M2.5 MI300X/MI325X/H200/B300, and M3 B300) passes --trust-remote-code to vllm serve. MiniMax models use custom modeling code that vLLM needs to download and execute. Without this flag, the server will fail to load the model. The flag on run_benchmark_serving (line 77) only applies to the benchmark client, not the server.
Fix:
Current:
vllm serve "$MODEL" --port "$PORT" \
    "${PARALLEL_ARGS[@]}" \
    --block-size 128 \
    --language-model-only \
    --max-model-len "$MAX_MODEL_LEN" \
    --attention-backend TRITON_ATTN \
    --enforce-eager \
    --tool-call-parser minimax_m3 \
    --reasoning-parser minimax_m3 \
    --enable-auto-tool-choice > "$SERVER_LOG" 2>&1 &
Suggested:
vllm serve "$MODEL" --port "$PORT" \
    "${PARALLEL_ARGS[@]}" \
    --block-size 128 \
    --language-model-only \
    --max-model-len "$MAX_MODEL_LEN" \
    --attention-backend TRITON_ATTN \
    --enforce-eager \
    --tool-call-parser minimax_m3 \
    --reasoning-parser minimax_m3 \
    --enable-auto-tool-choice \
    --trust-remote-code > "$SERVER_LOG" 2>&1 &
/reuse-sweep-run
# Conflicts: # perf-changelog.yaml
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27450194359
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
fi
Eval-only max length not applied
Medium Severity
In EVAL_ONLY mode the script calls setup_eval_context but never assigns MAX_MODEL_LEN from EVAL_MAX_MODEL_LEN before vllm serve. The server keeps the sweep’s benchmark MAX_MODEL_LEN while eval uses the capped context, which can break eval-only runs or over-allocate KV versus the model limit.
Reviewed by Cursor Bugbot for commit 2d15f24.
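The finding above could be resolved by copying the eval cap into the server's context length before launch. This is a hedged sketch: the `setup_eval_context` body is a stub (the real helper lives in the repo and is assumed here to set `EVAL_MAX_MODEL_LEN`), the 4096 default is a placeholder, and `resolve_max_model_len` is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash
# Sketch only: the stub and the wrapper function below are assumptions.

# Stub for the repo's helper; assumed to set EVAL_MAX_MODEL_LEN.
setup_eval_context() {
    EVAL_MAX_MODEL_LEN="${EVAL_MAX_MODEL_LEN:-4096}"   # placeholder default
}

# In eval-only mode, make the server use the capped eval context length.
resolve_max_model_len() {
    if [ "${EVAL_ONLY:-false}" = "true" ]; then
        setup_eval_context
        MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
    fi
}
```

Calling `resolve_max_model_len` before `vllm serve` means `--max-model-len "$MAX_MODEL_LEN"` picks up the eval cap in eval-only runs and is left unchanged for normal sweep jobs.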
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27452292029
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 3b7e102.
    --enforce-eager \
    --tool-call-parser minimax_m3 \
    --reasoning-parser minimax_m3 \
    --enable-auto-tool-choice > "$SERVER_LOG" 2>&1 &
Missing trust-remote-code on serve
Medium Severity
The new MI355X MiniMax-M3 script starts vllm serve without --trust-remote-code, while the sibling minimaxm3_fp8_b200.sh and dsv4_fp4_mi355x_vllm.sh pass it on the server command. MiniMax checkpoints often need custom model code at load time, so this mismatch can cause serve startup failures or divergent behavior on ROCm even when the benchmark client still passes --trust-remote-code.
Reviewed by Cursor Bugbot for commit 3b7e102.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27452497472
/reuse-sweep-run
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27472154647
MiniMax-M3 MXFP8 day-zero single-node vLLM sweep on MI355X (gfx950).
- minimaxm3-fp8-mi355x-vllm (.github/configs/amd-master.yaml): TP8/TP4-EP4/TEP/DEP across 1k1k and 8k1k (30 jobs).
- benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh: --block-size 128 (MSA sparse attention; default 16 fails on AMD with "No common block size for 16"), --attention-backend TRITON_ATTN, --language-model-only, MXFP8 checkpoint.
- Day-zero enablement: overlays the m3_release python tree ([Model] Add MiniMax M3 support, vllm-project/vllm#45381) onto vllm/vllm-openai-rocm:nightly-6fbfdd18 and compiles the missing fused qknorm/rope/kv-insert _C op for gfx950 (cached on the shared mount; one build per image).
- launch_mi355x-amds.sh: routes M3 weights to NFS /it-share/hf-hub-cache (not node-local NVMe).
Status: enablement works through engine load + KV alloc but is blocked on a gfx950 kernel fault: the first real forward pass faults with HSA_STATUS_ERROR_EXCEPTION 0x1016 in both eager and cudagraph mode (root-causing in progress). Sweep not yet green; do not merge until the forward-pass fault is resolved.
🤖 Generated with Claude Code
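The "cached on the shared mount; one build per image" behavior described above can be sketched as a marker-file cache keyed by image tag. This is an illustrative assumption about how such a cache might work, not the PR's implementation: `build_once`, the `.built` marker, and the directory layout are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: build_once and the .built marker layout are hypothetical.

# Usage: build_once <cache_dir> <image_tag> <build command...>
# Runs the build command once per image tag; later calls reuse the result.
build_once() {
    cache_dir="$1"
    image_tag="$2"
    shift 2
    marker="$cache_dir/$image_tag/.built"
    if [ -f "$marker" ]; then
        echo "cache hit"
        return 0
    fi
    mkdir -p "$cache_dir/$image_tag" || return 1
    # Only record success after the build command itself succeeds.
    "$@" || return 1
    touch "$marker"
    echo "built"
}
```

Because the marker is written only after a successful build, a failed compile is retried on the next job. Concurrent jobs on a shared mount could still race on the first build; a lock (for example with `flock`) would be needed to rule that out.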
Note
Medium Risk
Adds a new multi-job AMD sweep and launcher HF cache routing for a large MoE model; serving flags are specialized but changes are benchmark/infra-only with no auth or production runtime impact.
Overview
Adds day-zero MI355X (gfx950) fixed-sequence benchmarking for MiniMax-M3 MXFP8 via vLLM.
Registers minimaxm3-fp8-mi355x-vllm in amd-master.yaml with vllm/vllm-openai-rocm:minimax-m3, model MiniMaxAI/MiniMax-M3-MXFP8, and B300-style TP/EP/DEP sweeps on 1k1k and 8k1k.
Introduces minimaxm3_fp8_mi355x.sh, which serves with block size 128, TRITON_ATTN, FP8 KV cache, language-model-only, enforce-eager, and MiniMax-M3 tool/reasoning parsers, then runs the standard serving benchmark (optional lm-eval).
Updates launch_mi355x-amds.sh so MiniMaxAI/MiniMax-M3* weights use the NFS /it-share/hf-hub-cache mount instead of node-local NVMe. Documents the submission in perf-changelog.yaml.
Reviewed by Cursor Bugbot for commit e94de69.
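The cache routing described in the summary can be sketched as a pattern match on the model name. This is a hedged illustration rather than the launcher's actual code: `select_hf_cache` and `LOCAL_HF_CACHE` are hypothetical names, and the node-local fallback path is a placeholder; only the `MiniMaxAI/MiniMax-M3*` pattern and `/it-share/hf-hub-cache` come from the PR.

```shell
#!/usr/bin/env bash
# Sketch only: select_hf_cache and LOCAL_HF_CACHE are hypothetical names.

# Print the HF cache directory to use for a given model ID.
select_hf_cache() {
    case "$1" in
        MiniMaxAI/MiniMax-M3*)
            # M3 weights are served from the shared NFS mount.
            echo "/it-share/hf-hub-cache"
            ;;
        *)
            # Placeholder for the node-local NVMe cache path.
            echo "${LOCAL_HF_CACHE:-/local-nvme/hf-hub-cache}"
            ;;
    esac
}
```

A launcher could then export the result, for example `export HF_HUB_CACHE="$(select_hf_cache "$MODEL")"`, before starting the container.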