From a2f7a8ac665201fca0b183286e477626d1b6b4cd Mon Sep 17 00:00:00 2001
From: Fangzhou Ai <fangzhouai@gmail.com>
Date: Tue, 30 Jun 2026 23:47:46 +0000
Subject: [PATCH 1/3] [AMD] Enable AITER MoE for MiniMax-M3 FP4 MI355X vLLM MTP
 (incl. EP)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The MTP sweep hangs for ~8h on every expert-parallel (ep>1 / dp-attn)
config. Root cause: with EP enabled and moe_backend=auto, vLLM's MXFP4
MoE selection skips the AITER backend (it doesn't support the EP
BatchedExperts format) and falls back to Mxfp4MoeBackend.EMULATION. The
emulation path builds the Quark hw-emulation C++ kernel (kernel_ext, for
9 ROCm arches) at warmup; all EP workers build into the shared
torch_extensions dir simultaneously and deadlock on the build lock, so
engine-core never finishes startup (shm_broadcast starves until the CI
job timeout).

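Fix: mirror the STP recipe (minimaxm3_fp4_mi355x_vllm.sh) by exporting
the AITER env knobs (VLLM_ROCM_USE_AITER, VLLM_ROCM_USE_AITER_MOE, and
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS, all set to 1) and passing
--moe-backend aiter unconditionally. Forcing the AITER backend selects
AITER_MXFP4_MXFP4 (no emulation, no C++ build).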
Verified locally on the pinned tip (4559c43a9 + aiter tip) under TP4+EP4
with EAGLE3 spec decoding: server starts, serves correctly, acceptance
length ~3.5, conc=16 burst all-successful. Also pin the MTP image to
nightly-4559c43a9 (the verified tip), matching the STP recipe.
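
A minimal preflight sketch, assuming a hypothetical helper named
check_aiter_moe_env (not part of this patch or of vLLM) that a recipe
could call before `vllm serve`. It only checks that the three AITER env
knobs this patch exports are set to 1; it does not verify which MoE
backend vLLM actually selects:

```shell
# Hypothetical helper (illustration only, not in this patch): fail fast
# if any AITER knob exported by the recipe is missing or not set to 1.
check_aiter_moe_env() {
    for name in VLLM_ROCM_USE_AITER \
                VLLM_ROCM_USE_AITER_MOE \
                VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS; do
        # Indirect lookup via eval; the names come from the fixed list
        # above, so no untrusted input is evaluated.
        eval "value=\${$name:-}"
        if [ "$value" != "1" ]; then
            echo "error: $name must be 1 (got '${value:-unset}')" >&2
            return 1
        fi
    done
    return 0
}
```

In a recipe this could run just before the server launch, for example
as `check_aiter_moe_env || exit 1`.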

This differs from #1958/#1955, which disabled AITER MoE under EP
(VLLM_ROCM_USE_AITER_MOE=0, no --moe-backend aiter) on the assumption
that AITER MoE is EP-incompatible. Disabling it is precisely what
triggers the emulation-build hang; AITER MoE works under EP here.

AI assistance (Claude) was used to root-cause and verify this change.

Signed-off-by: Fangzhou Ai <fangzhouai@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/configs/amd-master.yaml                     |  2 +-
 .../fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh  | 13 +++++++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
index a437f4ecda..2633449469 100644
--- a/.github/configs/amd-master.yaml
+++ b/.github/configs/amd-master.yaml
@@ -2666,7 +2666,7 @@ minimaxm3-fp4-mi355x-vllm:
 # tokens. Search space mirrors the MI355X MXFP8 MTP entry, trimming the base
 # FP4 sweep at extreme concurrency where speculative decoding loses value.
 minimaxm3-fp4-mi355x-vllm-mtp:
-  image: vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e
+  image: vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1
   model: amd/MiniMax-M3-MXFP4
   model-prefix: minimaxm3
   runner: mi355x
diff --git a/benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh b/benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh
index 96a5604934..8a15b8c892 100755
--- a/benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh
+++ b/benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh
@@ -36,6 +36,18 @@ fi
 SERVER_LOG=/workspace/server.log
 export VLLM_ENGINE_READY_TIMEOUT_S=3600
 export VLLM_USE_BREAKABLE_CUDAGRAPH=0
+# Use AITER MoE for the MXFP4 experts, matching minimaxm3_fp4_mi355x_vllm.sh.
+# This is required for ALL configs including expert parallelism: with EP enabled
+# and moe_backend=auto, the AITER MXFP4 backend is skipped and selection falls
+# back to Mxfp4MoeBackend.EMULATION, which triggers a first-time build of the
+# Quark hw-emulation C++ kernel (kernel_ext, 9 ROCm arches) on every worker at
+# warmup. Concurrent EP workers deadlock on the shared torch_extensions build
+# lock, hanging engine-core for hours. Forcing --moe-backend aiter selects the
+# AITER_MXFP4_MXFP4 backend instead (verified working under TP4+EP4 with EAGLE3
+# spec decoding), avoiding the emulation build entirely.
+export VLLM_ROCM_USE_AITER=1
+export VLLM_ROCM_USE_AITER_MOE=1
+export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1
 
 if [ "${EVAL_ONLY}" = "true" ]; then
     setup_eval_context
@@ -65,6 +77,7 @@ vllm serve "$MODEL" --port "$PORT" \
     --language-model-only \
     --max-model-len "$MAX_MODEL_LEN" \
     --attention-backend TRITON_ATTN \
+    --moe-backend aiter \
     --speculative-config "{\"method\": \"eagle3\", \"model\": \"$DRAFT_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \
     --tool-call-parser minimax_m3 \
     --enable-auto-tool-choice \

From dcc1331bf37d7f7bc17a945bb935c32cf9c5478c Mon Sep 17 00:00:00 2001
From: Fangzhou Ai <fangzhouai@gmail.com>
Date: Tue, 30 Jun 2026 23:49:00 +0000
Subject: [PATCH 2/3] Add perf-changelog entry for
 minimaxm3-fp4-mi355x-vllm-mtp AITER MoE fix

Signed-off-by: Fangzhou Ai <fangzhouai@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 perf-changelog.yaml | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index c318f2a2ac..f2ab367105 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -4343,3 +4343,11 @@
     - "Use nvidia/MiniMax-M3-NVFP4 from /scratch/models/MiniMax-M3-NVFP4 with vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41, which includes vllm-project/vllm PR #46380; no runtime patch needed"
     - "Reuse the existing MXFP8 B300 topology and concurrency matrix across 15 srt-slurm recipes, while dropping the FP8-only Marlin override from TP4 decode"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1931
+
+- config-keys:
+    - minimaxm3-fp4-mi355x-vllm-mtp
+  description:
+    - "Enable AITER MoE on MiniMax-M3 MXFP4 MI355X single-node vLLM MTP (EAGLE3), mirroring the STP recipe: export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1; pass --moe-backend aiter unconditionally (including expert parallelism)."
+    - "Fixes the ~8h engine-core startup hang on EP configs: with moe_backend=auto, EP fell back to Mxfp4MoeBackend.EMULATION, which deadlocked all expert-parallel workers building the Quark hw-emulation C++ kernel into a shared torch_extensions dir. Forcing --moe-backend aiter selects AITER_MXFP4_MXFP4 (no emulation build)."
+    - "Pin vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1 (from nightly-3f5a1e1733200760169ff31ebe60a271072b199e), matching the STP recipe."
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1964

From f8a11653a6fe73c76ca07ef0e739a559e5f66fd5 Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Wed, 1 Jul 2026 20:49:14 -0400
Subject: [PATCH 3/3] Restore dropped config-keys header for
 minimaxm3-fp4-b300-dynamo-vllm changelog entry

The new minimaxm3-fp4-mi355x-vllm-mtp entry accidentally clobbered the
"- config-keys:" header line of the following #1966 entry, merging the
two into one changelog block. Re-add the header so that it is a distinct
entry again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 perf-changelog.yaml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 95e944bac1..450226250f 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -4351,6 +4351,8 @@
     - "Fixes the ~8h engine-core startup hang on EP configs: with moe_backend=auto, EP fell back to Mxfp4MoeBackend.EMULATION, which deadlocked all expert-parallel workers building the Quark hw-emulation C++ kernel into a shared torch_extensions dir. Forcing --moe-backend aiter selects AITER_MXFP4_MXFP4 (no emulation build)."
     - "Pin vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1 (from nightly-3f5a1e1733200760169ff31ebe60a271072b199e), matching the STP recipe."
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1964
+
+- config-keys:
     - minimaxm3-fp4-b300-dynamo-vllm
   description:
     - "Add MiniMax-M3 NVFP4 B300 disaggregated vLLM benchmarks via Dynamo for 1k1k and 8k1k STP (no MTP)"