[AMD] feat: MiniMax M3 Day 0 support MI355X #1725

Merged
functionstackx merged 23 commits into main from feat/minimax-m3-mi355x
Jun 13, 2026

@cquil11 cquil11 commented Jun 12, 2026

Collaborator

MiniMax-M3 MXFP8 day-zero single-node vLLM sweep on MI355X (gfx950).

  • New config minimaxm3-fp8-mi355x-vllm (.github/configs/amd-master.yaml) — TP8/TP4-EP4/TEP/DEP across 1k1k and 8k1k (30 jobs).
  • New bench script benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh: --block-size 128 (MSA sparse attention; default 16 fails on AMD with "No common block size for 16"), --attention-backend TRITON_ATTN, --language-model-only, MXFP8 checkpoint.
  • Day-zero enablement (no public ROCm M3 image exists yet): the script overlays the unmerged m3_release python tree ([Model] Add MiniMax M3 support vllm-project/vllm#45381) onto vllm/vllm-openai-rocm:nightly-6fbfdd18 and compiles the missing fused qknorm/rope/kv-insert _C op for gfx950 (cached on the shared mount; one build per image).
  • launch_mi355x-amds.sh: routes M3 weights to NFS /it-share/hf-hub-cache (not node-local NVMe).

Status: enablement works through engine load + KV alloc, but blocked on a gfx950 kernel fault — the first real forward faults with HSA_STATUS_ERROR_EXCEPTION 0x1016 in both eager and cudagraph mode (root-causing in progress). Sweep not yet green; do not merge until the forward-pass fault is resolved.

🤖 Generated with Claude Code


Note

Medium Risk
Adds a new multi-job AMD sweep and launcher HF cache routing for a large MoE model; serving flags are specialized but changes are benchmark/infra-only with no auth or production runtime impact.

Overview
Adds day-zero MI355X (gfx950) fixed-sequence benchmarking for MiniMax-M3 MXFP8 via vLLM.

Registers minimaxm3-fp8-mi355x-vllm in amd-master.yaml with vllm/vllm-openai-rocm:minimax-m3, model MiniMaxAI/MiniMax-M3-MXFP8, and B300-style TP/EP/DEP sweeps on 1k1k and 8k1k.

Introduces minimaxm3_fp8_mi355x.sh, which serves with block size 128, TRITON_ATTN, FP8 KV cache, language-model-only, enforce-eager, and MiniMax-M3 tool/reasoning parsers, then runs the standard serving benchmark (optional lm-eval).

Updates launch_mi355x-amds.sh so MiniMaxAI/MiniMax-M3* weights use the NFS /it-share/hf-hub-cache mount instead of node-local NVMe. Documents the submission in perf-changelog.yaml.
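
The cache routing described above can be pictured with a small pattern match on the model ID. This is a minimal sketch, not the launcher's actual code: `select_hf_cache` is a hypothetical helper, and the node-local cache path is passed in by the caller because the PR does not state it. Only the `MiniMaxAI/MiniMax-M3*` pattern and the `/it-share/hf-hub-cache` mount come from the description.

```shell
# Hypothetical helper (not from the PR): print the Hugging Face cache
# directory to use for a given model ID.
#   $1 = model ID, e.g. MiniMaxAI/MiniMax-M3-MXFP8
#   $2 = node-local cache directory (caller-supplied; path is an assumption)
select_hf_cache() {
    model_id="$1"
    local_cache="$2"
    case "$model_id" in
        # MiniMax-M3 weights are routed to the shared NFS mount.
        MiniMaxAI/MiniMax-M3*) printf '%s\n' "/it-share/hf-hub-cache" ;;
        # Everything else keeps using the node-local cache.
        *) printf '%s\n' "$local_cache" ;;
    esac
}
```

A launcher could then export the result, for example `export HF_HUB_CACHE="$(select_hf_cache "$MODEL" "$LOCAL_CACHE")"`, where `LOCAL_CACHE` is again an assumed variable name.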

Reviewed by Cursor Bugbot for commit e94de69. Bugbot is set up for automated code reviews on this repo. Configure here.

MXFP8 single-node vLLM sweep (TP/TEP/DEP) for MiniMax-M3 on MI355X
(gfx950). --block-size 128 (MSA sparse attention; default 16 fails on
AMD), --attention-backend TRITON_ATTN, --language-model-only.

Day-zero enablement: no public ROCm image carries M3 yet
(vllm-project/vllm#45381 unmerged), so the bench script overlays the
m3_release python tree onto the nightly-6fbfdd18 image and compiles the
missing fused qknorm/rope/kv-insert _C op for gfx950 (cached on the
shared mount). launch_mi355x-amds.sh routes M3 weights to NFS
/it-share/hf-hub-cache (not node-local NVMe).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

start_gpu_monitor

set -x
vllm serve $MODEL --port $PORT \

Serve starts after op failure

Medium Severity

The day-zero fused-op path ends with a Python assert that the _C symbol exists, but the script never checks that command’s exit status. If overlay, compile, or load_library fails, execution still reaches vllm serve, so jobs can run benchmarks without the mandatory fused_minimax_m3_qknorm_rope_kv_insert op loaded.
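
One way to address this finding is to gate the server launch on the exit status of the preparation step. The sketch below is illustrative only: `require_step` and `prepare_fused_op` are hypothetical names, and `prepare_fused_op` stands in for the overlay, compile, and Python assert steps, which this conversation does not show in full.

```shell
# Hypothetical guard (not from the PR): run a preparation command and
# report failure so the caller can abort before `vllm serve`.
#   $1 = human-readable step description, remaining args = command to run
require_step() {
    description="$1"
    shift
    if ! "$@"; then
        echo "FATAL: ${description} failed; not starting the server" >&2
        return 1
    fi
}

# Placeholder for the real overlay/compile/assert sequence. FAKE_STATUS is an
# assumption used only to simulate success (0) or failure (non-zero) here.
prepare_fused_op() {
    return "${FAKE_STATUS:-0}"
}
```

In a benchmark script the guard would sit directly before the serve command, for example `require_step "fused op check" prepare_fused_op || exit 1`, so a failed overlay or compile stops the job instead of benchmarking without the op.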

Reviewed by Cursor Bugbot for commit e803275. Configure here.

Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh Outdated
Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh Outdated
Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh Outdated
Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh
@cquil11 cquil11 changed the title [AMD] feat: MiniMax M3 Day 0 support MI355X [AMD][needs rocm m3 vllm image] feat: MiniMax M3 Day 0 support MI355X Jun 12, 2026
@cquil11 cquil11 marked this pull request as draft June 12, 2026 20:54
@cquil11 cquil11 marked this pull request as ready for review June 13, 2026 00:05
@claude

claude Bot commented Jun 13, 2026

Contributor

Claude finished @cquil11's task in 2m 30s.


Review complete

  • Read repo guidelines
  • Read all changed files (benchmark script, amd-master.yaml, perf-changelog.yaml, launch script)
  • Validate master config and perf-changelog — both present ✅
  • Validate perf-changelog entry position — appended to end ✅
  • Validate container image accessibility — rocm/vllm-dev:v0.22.1rc-0612 (Docker Hub, public) ✅
  • Validate enroot import in launch script — present ✅
  • Validate expert parallelism — correct conditional EP_SIZE pattern ✅
  • Validate benchmark script code style — proper line continuations ✅
  • Check for bugs and logic errors
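
For readers unfamiliar with the "conditional EP_SIZE pattern" mentioned in the checklist, it could look roughly like the sketch below. This is an assumption about the general shape, not the script's actual code: `build_parallel_args` is a hypothetical helper, and the exact condition used in the repository is not shown in this conversation. The two flags are standard `vllm serve` options.

```shell
# Hypothetical helper (not from the PR): print one server argument per line.
#   $1 = tensor-parallel size, $2 = expert-parallel size
build_parallel_args() {
    tp_size="$1"
    ep_size="$2"
    printf '%s\n' "--tensor-parallel-size" "$tp_size"
    # Assumed condition: enable expert parallelism only when the sweep
    # config requests more than one EP rank.
    if [ "$ep_size" -gt 1 ]; then
        printf '%s\n' "--enable-expert-parallel"
    fi
}
```

In bash, the output can be collected into the array the serve command expands, for example `mapfile -t PARALLEL_ARGS < <(build_parallel_args "$TP" "$EP_SIZE")`, where `TP` and `EP_SIZE` are assumed variable names.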

Findings

1 blocking issue found — see inline comment:

  • 🔴 vllm serve is missing --trust-remote-code. Every other MiniMax script (M2.5 and M3 B300) passes this flag to the server. Without it, vLLM cannot load MiniMax-M3's custom model code and the server will fail to start.

The Cursor Bugbot findings about missing overlay/compile steps and missing TRITON_ATTN backend appear to have been addressed in subsequent commits — the current script uses a dev image (rocm/vllm-dev:v0.22.1rc-0612) that presumably includes M3 support, and --attention-backend TRITON_ATTN is present on line 57.

@cquil11 cquil11 changed the title [AMD][needs rocm m3 vllm image] feat: MiniMax M3 Day 0 support MI355X [AMD] feat: MiniMax M3 Day 0 support MI355X Jun 13, 2026
Comment on lines +52 to +61
vllm serve "$MODEL" --port "$PORT" \
    "${PARALLEL_ARGS[@]}" \
    --block-size 128 \
    --language-model-only \
    --max-model-len "$MAX_MODEL_LEN" \
    --attention-backend TRITON_ATTN \
    --enforce-eager \
    --tool-call-parser minimax_m3 \
    --reasoning-parser minimax_m3 \
    --enable-auto-tool-choice > "$SERVER_LOG" 2>&1 &

Contributor


🔴 BLOCKING: Missing --trust-remote-code on vllm serve

Why it matters: Every other MiniMax benchmark script (M2.5 MI355X, M2.5 MI300X/MI325X/H200/B300, and M3 B300) passes --trust-remote-code to vllm serve. MiniMax models use custom modeling code that vLLM needs to download and execute. Without this flag, the server will fail to load the model. The flag on run_benchmark_serving (line 77) only applies to the benchmark client, not the server.

Fix:

Suggested change (append --trust-remote-code as the final server flag):
vllm serve "$MODEL" --port "$PORT" \
    "${PARALLEL_ARGS[@]}" \
    --block-size 128 \
    --language-model-only \
    --max-model-len "$MAX_MODEL_LEN" \
    --attention-backend TRITON_ATTN \
    --enforce-eager \
    --tool-call-parser minimax_m3 \
    --reasoning-parser minimax_m3 \
    --enable-auto-tool-choice \
    --trust-remote-code > "$SERVER_LOG" 2>&1 &

cquil11 commented Jun 13, 2026

Collaborator Author

/reuse-sweep-run

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
fi

Eval-only max length not applied

Medium Severity

In EVAL_ONLY mode the script calls setup_eval_context but never assigns MAX_MODEL_LEN from EVAL_MAX_MODEL_LEN before vllm serve. The server keeps the sweep’s benchmark MAX_MODEL_LEN while eval uses the capped context, which can break eval-only runs or over-allocate KV versus the model limit.
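
A possible shape for the fix is to resolve the server's context length in one place before launching. This is a minimal sketch under the assumption that `setup_eval_context` leaves `EVAL_MAX_MODEL_LEN` set, as the finding implies; `resolve_max_model_len` is a hypothetical helper and not part of the PR.

```shell
# Hypothetical helper (not from the PR): print the context length the server
# should be started with. Reads EVAL_ONLY, EVAL_MAX_MODEL_LEN and
# MAX_MODEL_LEN from the environment.
resolve_max_model_len() {
    if [ "${EVAL_ONLY:-false}" = "true" ] && [ -n "${EVAL_MAX_MODEL_LEN:-}" ]; then
        # Eval-only run: use the capped eval context.
        printf '%s\n' "$EVAL_MAX_MODEL_LEN"
    else
        # Normal sweep run: keep the benchmark's value.
        printf '%s\n' "$MAX_MODEL_LEN"
    fi
}
```

The script would then assign `MAX_MODEL_LEN="$(resolve_max_model_len)"` after `setup_eval_context` and before `vllm serve`, so the server and the eval client agree on the limit.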

Reviewed by Cursor Bugbot for commit 2d15f24. Configure here.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit 3b7e102. Configure here.

--enforce-eager \
--tool-call-parser minimax_m3 \
--reasoning-parser minimax_m3 \
--enable-auto-tool-choice > "$SERVER_LOG" 2>&1 &

Missing trust-remote-code on serve

Medium Severity

The new MI355X MiniMax-M3 script starts vllm serve without --trust-remote-code, while the sibling minimaxm3_fp8_b200.sh and dsv4_fp4_mi355x_vllm.sh pass it on the server command. MiniMax checkpoints often need custom model code at load time, so this mismatch can cause serve startup failures or divergent behavior on ROCm even when the benchmark client still passes --trust-remote-code.

Reviewed by Cursor Bugbot for commit 3b7e102. Configure here.

@functionstackx

Collaborator

/reuse-sweep-run

@functionstackx functionstackx merged commit 8dc7ef6 into main Jun 13, 2026
13 of 16 checks passed
@functionstackx functionstackx deleted the feat/minimax-m3-mi355x branch June 13, 2026 16:17