From a2f7a8ac665201fca0b183286e477626d1b6b4cd Mon Sep 17 00:00:00 2001
From: Fangzhou Ai
Date: Tue, 30 Jun 2026 23:47:46 +0000
Subject: [PATCH 1/3] [AMD] Enable AITER MoE for MiniMax-M3 FP4 MI355X vLLM MTP (incl. EP)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The MTP sweep hangs for ~8h on every expert-parallel (ep>1 / dp-attn)
config. Root cause: with EP enabled and moe_backend=auto, vLLM's MXFP4
MoE selection skips the AITER backend (it doesn't support the EP
BatchedExperts format) and falls back to Mxfp4MoeBackend.EMULATION. The
emulation path builds the Quark hw-emulation C++ kernel (kernel_ext, for
9 ROCm arches) at warmup; all EP workers build into the shared
torch_extensions dir simultaneously and deadlock on the build lock, so
engine-core never finishes startup (shm_broadcast starves until the CI
job timeout).

Fix: mirror minimaxm3_fp4_mi355x_vllm.sh (the STP recipe) - export the
AITER env knobs and pass --moe-backend aiter unconditionally. Forcing
the AITER backend selects AITER_MXFP4_MXFP4 (no emulation, no C++
build). Verified locally on the pinned tip (4559c43a9 + aiter tip) under
TP4+EP4 with EAGLE3 spec decoding: server starts, serves correctly,
acceptance length ~3.5, conc=16 burst all-successful.

Also pin the MTP image to nightly-4559c43a9 (the verified tip), matching
the STP recipe.

This differs from #1958/#1955, which disabled AITER MoE under EP
(VLLM_ROCM_USE_AITER_MOE=0, no --moe-backend aiter) on the assumption
that AITER MoE is EP-incompatible - that is precisely what triggers the
emulation-build hang. AITER MoE works under EP here.

AI assistance (Claude) was used to root-cause and verify this change.

Signed-off-by: Fangzhou Ai
Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .github/configs/amd-master.yaml                     |  2 +-
 .../fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh  | 13 +++++++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/.github/configs/amd-master.yaml b/.github/configs/amd-master.yaml
index a437f4ecda..2633449469 100644
--- a/.github/configs/amd-master.yaml
+++ b/.github/configs/amd-master.yaml
@@ -2666,7 +2666,7 @@ minimaxm3-fp4-mi355x-vllm:
 # tokens. Search space mirrors the MI355X MXFP8 MTP entry, trimming the base
 # FP4 sweep at extreme concurrency where speculative decoding loses value.
 minimaxm3-fp4-mi355x-vllm-mtp:
-  image: vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e
+  image: vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1
   model: amd/MiniMax-M3-MXFP4
   model-prefix: minimaxm3
   runner: mi355x
diff --git a/benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh b/benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh
index 96a5604934..8a15b8c892 100755
--- a/benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh
+++ b/benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh
@@ -36,6 +36,18 @@ fi
 SERVER_LOG=/workspace/server.log
 export VLLM_ENGINE_READY_TIMEOUT_S=3600
 export VLLM_USE_BREAKABLE_CUDAGRAPH=0
+# Use AITER MoE for the MXFP4 experts, matching minimaxm3_fp4_mi355x_vllm.sh.
+# This is required for ALL configs including expert parallelism: with EP enabled
+# and moe_backend=auto, the AITER MXFP4 backend is skipped and selection falls
+# back to Mxfp4MoeBackend.EMULATION, which triggers a first-time build of the
+# Quark hw-emulation C++ kernel (kernel_ext, 9 ROCm arches) on every worker at
+# warmup. Concurrent EP workers deadlock on the shared torch_extensions build
+# lock, hanging engine-core for hours. Forcing --moe-backend aiter selects the
+# AITER_MXFP4_MXFP4 backend instead (verified working under TP4+EP4 with EAGLE3
+# spec decoding), avoiding the emulation build entirely.
+export VLLM_ROCM_USE_AITER=1
+export VLLM_ROCM_USE_AITER_MOE=1
+export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1
 
 if [ "${EVAL_ONLY}" = "true" ]; then
     setup_eval_context
@@ -65,6 +77,7 @@ vllm serve "$MODEL" --port "$PORT" \
     --language-model-only \
     --max-model-len "$MAX_MODEL_LEN" \
     --attention-backend TRITON_ATTN \
+    --moe-backend aiter \
     --speculative-config "{\"method\": \"eagle3\", \"model\": \"$DRAFT_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \
     --tool-call-parser minimax_m3 \
     --enable-auto-tool-choice \

From dcc1331bf37d7f7bc17a945bb935c32cf9c5478c Mon Sep 17 00:00:00 2001
From: Fangzhou Ai
Date: Tue, 30 Jun 2026 23:49:00 +0000
Subject: [PATCH 2/3] Add perf-changelog entry for minimaxm3-fp4-mi355x-vllm-mtp AITER MoE fix

Signed-off-by: Fangzhou Ai
Co-Authored-By: Claude Opus 4.8 (1M context)
---
 perf-changelog.yaml | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index c318f2a2ac..f2ab367105 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -4343,3 +4343,11 @@
     - "Use nvidia/MiniMax-M3-NVFP4 from /scratch/models/MiniMax-M3-NVFP4 with vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41, which includes vllm-project/vllm PR #46380; no runtime patch needed"
     - "Reuse the existing MXFP8 B300 topology and concurrency matrix across 15 srt-slurm recipes, while dropping the FP8-only Marlin override from TP4 decode"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1931
+
+- config-keys:
+    - minimaxm3-fp4-mi355x-vllm-mtp
+  description:
+    - "Enable AITER MoE on MiniMax-M3 MXFP4 MI355X single-node vLLM MTP (EAGLE3), mirroring the STP recipe: export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1; pass --moe-backend aiter unconditionally (including expert parallelism)."
+    - "Fixes the ~8h engine-core startup hang on EP configs: with moe_backend=auto, EP fell back to Mxfp4MoeBackend.EMULATION, which deadlocked all expert-parallel workers building the Quark hw-emulation C++ kernel into a shared torch_extensions dir. Forcing --moe-backend aiter selects AITER_MXFP4_MXFP4 (no emulation build)."
+    - "Pin vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1 (from nightly-3f5a1e1733200760169ff31ebe60a271072b199e), matching the STP recipe."
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1964

From f8a11653a6fe73c76ca07ef0e739a559e5f66fd5 Mon Sep 17 00:00:00 2001
From: functionstackx <47992694+functionstackx@users.noreply.github.com>
Date: Wed, 1 Jul 2026 20:49:14 -0400
Subject: [PATCH 3/3] Restore dropped config-keys header for minimaxm3-fp4-b300-dynamo-vllm changelog entry

The new minimaxm3-fp4-mi355x-vllm-mtp entry accidentally clobbered the
- config-keys: line of the following #1966 entry, merging the two into
one changelog block. Re-add the header so it's a distinct entry again.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 perf-changelog.yaml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/perf-changelog.yaml b/perf-changelog.yaml
index 95e944bac1..450226250f 100644
--- a/perf-changelog.yaml
+++ b/perf-changelog.yaml
@@ -4351,6 +4351,8 @@
     - "Fixes the ~8h engine-core startup hang on EP configs: with moe_backend=auto, EP fell back to Mxfp4MoeBackend.EMULATION, which deadlocked all expert-parallel workers building the Quark hw-emulation C++ kernel into a shared torch_extensions dir. Forcing --moe-backend aiter selects AITER_MXFP4_MXFP4 (no emulation build)."
     - "Pin vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1 (from nightly-3f5a1e1733200760169ff31ebe60a271072b199e), matching the STP recipe."
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1964
+
+- config-keys:
     - minimaxm3-fp4-b300-dynamo-vllm
   description:
     - "Add MiniMax-M3 NVFP4 B300 disaggregated vLLM benchmarks via Dynamo for 1k1k and 8k1k STP (no MTP)"