[AMD] feat: MiniMax M3 Day 0 support MI355X by cquil11 · Pull Request #1725 · SemiAnalysisAI/InferenceX

cquil11 · 2026-06-12T19:51:13Z

MiniMax-M3 MXFP8 day-zero single-node vLLM sweep on MI355X (gfx950).

New config minimaxm3-fp8-mi355x-vllm (.github/configs/amd-master.yaml) — TP8/TP4-EP4/TEP/DEP across 1k1k and 8k1k (30 jobs).
New bench script benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh — --block-size 128 (MSA sparse attention; default 16 fails on AMD with "No common block size for 16"), --attention-backend TRITON_ATTN, --language-model-only, MXFP8 checkpoint.
Day-zero enablement (no public ROCm M3 image exists yet): the script overlays the unmerged m3_release python tree ([Model] Add MiniMax M3 support vllm-project/vllm#45381) onto vllm/vllm-openai-rocm:nightly-6fbfdd18 and compiles the missing fused qknorm/rope/kv-insert _C op for gfx950 (cached on the shared mount; one build per image).
launch_mi355x-amds.sh: routes M3 weights to NFS /it-share/hf-hub-cache (not node-local NVMe).
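
The "one build per image" caching mentioned above could look roughly like the sketch below. It is an illustration only: `build_once`, the `.done` marker-file convention, and the cache key are assumptions, not code taken from this PR's script, and the sketch does not handle concurrent jobs racing on the same key.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run an expensive build command at most once per cache key.
# Usage: build_once <cache_dir> <key> <command> [args...]
build_once() {
    local cache_dir="$1" key="$2"
    shift 2
    local marker="${cache_dir}/${key}.done"

    # A marker file from an earlier successful build means we can skip the work.
    if [ -f "$marker" ]; then
        echo "cache hit: ${key}"
        return 0
    fi

    mkdir -p "$cache_dir"
    # Only record success if the build command itself succeeded.
    "$@" || return 1
    touch "$marker"
}
```

Keying the marker on both the image tag and the GPU architecture (for example `nightly-6fbfdd18-gfx950`) would keep a cached artifact from being reused after the base image changes.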

Status: enablement works through engine load + KV alloc, but blocked on a gfx950 kernel fault — the first real forward faults with HSA_STATUS_ERROR_EXCEPTION 0x1016 in both eager and cudagraph mode (root-causing in progress). Sweep not yet green; do not merge until the forward-pass fault is resolved.

🤖 Generated with Claude Code

Note

Medium Risk
Adds a new multi-job AMD sweep and launcher HF cache routing for a large MoE model; serving flags are specialized but changes are benchmark/infra-only with no auth or production runtime impact.

Overview
Adds day-zero MI355X (gfx950) fixed-sequence benchmarking for MiniMax-M3 MXFP8 via vLLM.

Registers minimaxm3-fp8-mi355x-vllm in amd-master.yaml with vllm/vllm-openai-rocm:minimax-m3, model MiniMaxAI/MiniMax-M3-MXFP8, and B300-style TP/EP/DEP sweeps on 1k1k and 8k1k.

Introduces minimaxm3_fp8_mi355x.sh, which serves with block size 128, TRITON_ATTN, FP8 KV cache, language-model-only, enforce-eager, and MiniMax-M3 tool/reasoning parsers, then runs the standard serving benchmark (optional lm-eval).

Updates launch_mi355x-amds.sh so MiniMaxAI/MiniMax-M3* weights use the NFS /it-share/hf-hub-cache mount instead of node-local NVMe. Documents the submission in perf-changelog.yaml.
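
A minimal sketch of how such model-based cache routing might be expressed follows. The function name `select_hf_cache` and the `LOCAL_HF_CACHE` fallback path are placeholders chosen for illustration; only the `MiniMaxAI/MiniMax-M3*` pattern and the `/it-share/hf-hub-cache` mount come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick the Hugging Face cache directory for a model ID.
select_hf_cache() {
    local model="$1"
    case "$model" in
        # MiniMax-M3 checkpoints are routed to the shared NFS mount.
        MiniMaxAI/MiniMax-M3*)
            echo "/it-share/hf-hub-cache"
            ;;
        # Everything else stays on node-local storage (placeholder path).
        *)
            echo "${LOCAL_HF_CACHE:-/tmp/hf-hub-cache}"
            ;;
    esac
}
```

A launcher could then export the result, for example `export HF_HUB_CACHE="$(select_hf_cache "$MODEL")"`, before starting the container.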

Reviewed by Cursor Bugbot for commit e94de69.

MXFP8 single-node vLLM sweep (TP/TEP/DEP) for MiniMax-M3 on MI355X (gfx950). --block-size 128 (MSA sparse attention; default 16 fails on AMD), --attention-backend TRITON_ATTN, --language-model-only. Day-zero enablement: no public ROCm image carries M3 yet (vllm-project/vllm#45381 unmerged), so the bench script overlays the m3_release python tree onto the nightly-6fbfdd18 image and compiles the missing fused qknorm/rope/kv-insert _C op for gfx950 (cached on the shared mount). launch_mi355x-amds.sh routes M3 weights to NFS /it-share/hf-hub-cache (not node-local NVMe). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-12T19:51:23Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

cursor · 2026-06-12T19:55:07Z

+start_gpu_monitor
+
+set -x
+vllm serve $MODEL --port $PORT \


Serve starts after op failure

Medium Severity

The day-zero fused-op path ends with a Python assert that the _C symbol exists, but the script never checks that command’s exit status. If overlay, compile, or load_library fails, execution still reaches vllm serve, so jobs can run benchmarks without the mandatory fused_minimax_m3_qknorm_rope_kv_insert op loaded.

Additional Locations (1)

benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh#L95-L121

Reviewed by Cursor Bugbot for commit e803275.
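
One way to address this finding is to gate the server launch on the exit status of each setup step. The sketch below assumes a hypothetical wrapper named `run_or_abort`; the real overlay, compile, and Python assert commands from the script are not reproduced here.

```shell
#!/usr/bin/env bash
# Hypothetical guard: run a setup step and report failure instead of
# silently continuing on to `vllm serve`.
# Usage: run_or_abort <description> <command> [args...]
run_or_abort() {
    local description="$1"
    shift
    if ! "$@"; then
        echo "FATAL: ${description} failed; not starting vllm serve" >&2
        return 1
    fi
}
```

In the benchmark script this could be used as `run_or_abort "fused op check" <verification command> || exit 1` immediately before the `vllm serve` line, so a failed overlay, compile, or library load stops the job rather than producing a benchmark without the fused op.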

claude · 2026-06-13T00:05:53Z

Claude finished @cquil11's task in 2m 30s.

Review complete

Read repo guidelines
Read all changed files (benchmark script, amd-master.yaml, perf-changelog.yaml, launch script)
Validate master config and perf-changelog — both present ✅
Validate perf-changelog entry position — appended to end ✅
Validate container image accessibility — rocm/vllm-dev:v0.22.1rc-0612 (Docker Hub, public) ✅
Validate enroot import in launch script — present ✅
Validate expert parallelism — correct conditional EP_SIZE pattern ✅
Validate benchmark script code style — proper line continuations ✅
Check for bugs and logic errors

Findings

1 blocking issue found — see inline comment:

🔴 vllm serve is missing --trust-remote-code. Every other MiniMax script (M2.5 and M3 B300) passes this flag to the server. Without it, vLLM cannot load MiniMax-M3's custom model code and the server will fail to start.

The Cursor Bugbot findings about missing overlay/compile steps and missing TRITON_ATTN backend appear to have been addressed in subsequent commits — the current script uses a dev image (rocm/vllm-dev:v0.22.1rc-0612) that presumably includes M3 support, and --attention-backend TRITON_ATTN is present on line 57.
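
For readers unfamiliar with the "conditional EP_SIZE pattern" named in the checklist, a rough sketch is shown below. The function name and argument order are assumptions made for illustration, and the repository's actual variable handling may differ; `--tensor-parallel-size` and `--enable-expert-parallel` are existing vLLM options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the PARALLEL_ARGS array passed to `vllm serve`.
# Expert parallelism is only enabled when the EP size is greater than 1.
# Usage: build_parallel_args <tp_size> <ep_size>
build_parallel_args() {
    local tp="$1" ep_size="$2"
    PARALLEL_ARGS=(--tensor-parallel-size "$tp")
    if [ "$ep_size" -gt 1 ]; then
        PARALLEL_ARGS+=(--enable-expert-parallel)
    fi
}
```

Keeping the flags in an array lets the serve command expand them safely with `"${PARALLEL_ARGS[@]}"`, as the script under review does.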

claude · 2026-06-13T00:08:10Z

+vllm serve "$MODEL" --port "$PORT" \
+    "${PARALLEL_ARGS[@]}" \
+    --block-size 128 \
+    --language-model-only \
+    --max-model-len "$MAX_MODEL_LEN" \
+    --attention-backend TRITON_ATTN \
+    --enforce-eager \
+    --tool-call-parser minimax_m3 \
+    --reasoning-parser minimax_m3 \
+    --enable-auto-tool-choice > "$SERVER_LOG" 2>&1 &


🔴 BLOCKING: Missing --trust-remote-code on vllm serve

Why it matters: Every other MiniMax benchmark script (M2.5 MI355X, M2.5 MI300X/MI325X/H200/B300, and M3 B300) passes --trust-remote-code to vllm serve. MiniMax models use custom modeling code that vLLM needs to download and execute. Without this flag, the server will fail to load the model. The flag on run_benchmark_serving (line 77) only applies to the benchmark client, not the server.

Fix:

Suggested change

-vllm serve "$MODEL" --port "$PORT" \
-    "${PARALLEL_ARGS[@]}" \
-    --block-size 128 \
-    --language-model-only \
-    --max-model-len "$MAX_MODEL_LEN" \
-    --attention-backend TRITON_ATTN \
-    --enforce-eager \
-    --tool-call-parser minimax_m3 \
-    --reasoning-parser minimax_m3 \
-    --enable-auto-tool-choice > "$SERVER_LOG" 2>&1 &
+vllm serve "$MODEL" --port "$PORT" \
+    "${PARALLEL_ARGS[@]}" \
+    --block-size 128 \
+    --language-model-only \
+    --max-model-len "$MAX_MODEL_LEN" \
+    --attention-backend TRITON_ATTN \
+    --enforce-eager \
+    --tool-call-parser minimax_m3 \
+    --reasoning-parser minimax_m3 \
+    --enable-auto-tool-choice \
+    --trust-remote-code > "$SERVER_LOG" 2>&1 &
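
To catch this kind of omission before a job is launched, a script could check its own argument list for required flags. The following is a small sketch under that assumption; `has_flag` is a hypothetical helper, not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical pre-launch check: succeed only if <flag> appears in the
# remaining arguments.
# Usage: has_flag <flag> <arg>...
has_flag() {
    local want="$1" arg
    shift
    for arg in "$@"; do
        if [ "$arg" = "$want" ]; then
            return 0
        fi
    done
    return 1
}
```

If the serve flags were collected in an array (say `SERVE_ARGS`, also hypothetical), the script could run `has_flag --trust-remote-code "${SERVE_ARGS[@]}" || exit 1` before starting the server.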

cquil11 · 2026-06-13T00:25:54Z

/reuse-sweep-run

github-actions · 2026-06-13T01:21:53Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27450194359
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27450194359

cursor · 2026-06-13T01:22:36Z

+
+if [ "${EVAL_ONLY}" = "true" ]; then
+    setup_eval_context
+fi


Eval-only max length not applied

Medium Severity

In EVAL_ONLY mode the script calls setup_eval_context but never assigns MAX_MODEL_LEN from EVAL_MAX_MODEL_LEN before vllm serve. The server keeps the sweep’s benchmark MAX_MODEL_LEN while eval uses the capped context, which can break eval-only runs or over-allocate KV versus the model limit.

Reviewed by Cursor Bugbot for commit 2d15f24.
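
A possible shape for the fix is sketched below. It assumes that `setup_eval_context` leaves the capped length in `EVAL_MAX_MODEL_LEN`, as the finding implies; the stub defined here is a stand-in for the repository's real helper, and its default value is a placeholder.

```shell
#!/usr/bin/env bash
# Stand-in for the repo's setup_eval_context helper (hypothetical body):
# assume it makes EVAL_MAX_MODEL_LEN available. 4096 is a placeholder default.
setup_eval_context() {
    EVAL_MAX_MODEL_LEN="${EVAL_MAX_MODEL_LEN:-4096}"
}

# In eval-only mode, make the server use the eval context length
# instead of the sweep's benchmark length.
resolve_max_model_len() {
    if [ "${EVAL_ONLY:-false}" = "true" ]; then
        setup_eval_context
        MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
    fi
}
```

Calling `resolve_max_model_len` before `vllm serve` would keep the `--max-model-len "$MAX_MODEL_LEN"` argument consistent with the context the eval harness expects.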

github-actions · 2026-06-13T01:30:07Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27452292029
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27452292029

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit 3b7e102.

cursor · 2026-06-13T01:31:05Z

+    --enforce-eager \
+    --tool-call-parser minimax_m3 \
+    --reasoning-parser minimax_m3 \
+    --enable-auto-tool-choice > "$SERVER_LOG" 2>&1 &


Missing trust-remote-code on serve

Medium Severity

The new MI355X MiniMax-M3 script starts vllm serve without --trust-remote-code, while the sibling minimaxm3_fp8_b200.sh and dsv4_fp4_mi355x_vllm.sh pass it on the server command. MiniMax checkpoints often need custom model code at load time, so this mismatch can cause serve startup failures or divergent behavior on ROCm even when the benchmark client still passes --trust-remote-code.

Reviewed by Cursor Bugbot for commit 3b7e102.

github-actions · 2026-06-13T08:04:38Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27452497472
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27452497472

functionstackx · 2026-06-13T16:17:13Z

/reuse-sweep-run

github-actions · 2026-06-13T16:17:53Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27472154647
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27472154647

cquil11 requested a review from a team June 12, 2026 19:51

cquil11 requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners June 12, 2026 19:51

github-project-automation Bot added this to InferenceMAX Board Jun 12, 2026

cquil11 and others added 2 commits June 12, 2026 14:53

Merge branch 'main' into feat/minimax-m3-mi355x

e803275

minimaxm3-fp8-mi355x-vllm: add perf-changelog entry

447bbe5

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cursor Bot reviewed Jun 12, 2026

fix(amd): disable M3 graph capture on MI355X

ad1df2a

cursor Bot reviewed Jun 12, 2026

Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh Outdated

cquil11 added 3 commits June 12, 2026 15:25

fix(amd): keep eager flag in serve command

7209a3b

fix(amd): use wave32 mask for M3 fused op

44741e3

fix(amd): compile M3 fused op as wave32

277613f

cursor Bot reviewed Jun 12, 2026

Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh Outdated

cquil11 added 2 commits June 12, 2026 15:42

fix(amd): refresh M3 worker op loader

40a5008

refactor(amd): use upstream MiniMax M3 recipe

137b4d3

cursor Bot reviewed Jun 12, 2026

Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh Outdated

Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x.sh

fix(amd): use available MiniMax parsers

bf697c8

cquil11 changed the title ~~[AMD] feat: MiniMax M3 Day 0 support MI355X~~ [AMD][needs rocm m3 vllm image] feat: MiniMax M3 Day 0 support MI355X Jun 12, 2026

cquil11 marked this pull request as draft June 12, 2026 20:54

functionstackx mentioned this pull request Jun 12, 2026

[Klaud Cold][NVIDIA] feat: MiniMax M3 Day 0 support H200 #1728

Closed

cquil11 added 4 commits June 12, 2026 16:32

chore(amd): test M3 ROCm release candidate image

e453206

fix(amd): use M3 parsers with ROCm RC image

8f73bda

fix(amd): bypass M3 decode graph capture

a2f466e

fix(amd): use M3 recipe attention backend

e4cc6a6

cquil11 added 5 commits June 12, 2026 17:17

fix(amd): match published M3 MXFP8 recipe

8de1cd6

fix(amd): run M3 text sweeps without graph capture

a9ce153

fix(amd): force M3 Triton attention backend

18fb62e

feat(amd): expand M3 MI355X sweep matrix

567f264

Merge remote-tracking branch 'origin/main' into feat/minimax-m3-mi355x

eb48c54

# Conflicts: # perf-changelog.yaml

cquil11 added the full-sweep-fail-fast label Jun 13, 2026

cquil11 marked this pull request as ready for review June 13, 2026 00:05

cquil11 changed the title ~~[AMD][needs rocm m3 vllm image] feat: MiniMax M3 Day 0 support MI355X~~ [AMD] feat: MiniMax M3 Day 0 support MI355X Jun 13, 2026

claude Bot reviewed Jun 13, 2026

cquil11 added 2 commits June 12, 2026 20:19

fix(amd): use dedicated M3 ROCm image

4a583eb

Merge remote-tracking branch 'origin/main' into feat/minimax-m3-mi355x

2d15f24

# Conflicts: # perf-changelog.yaml

cursor Bot reviewed Jun 13, 2026

fix(amd): align M3 with official ROCm recipe

3b7e102

cursor Bot reviewed Jun 13, 2026

Merge branch 'main' into feat/minimax-m3-mi355x

e94de69

functionstackx merged commit 8dc7ef6 into main Jun 13, 2026
13 of 16 checks passed

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 13, 2026

functionstackx deleted the feat/minimax-m3-mi355x branch June 13, 2026 16:17

functionstackx mentioned this pull request Jun 13, 2026

[Klaud Cold] minimaxm3-fp8-mi355x-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) MI355X recipe #1742

Closed
