2 changes: 1 addition & 1 deletion .github/configs/amd-master.yaml
@@ -2666,7 +2666,7 @@ minimaxm3-fp4-mi355x-vllm:
# tokens. Search space mirrors the MI355X MXFP8 MTP entry, trimming the base
# FP4 sweep at extreme concurrency where speculative decoding loses value.
minimaxm3-fp4-mi355x-vllm-mtp:
-  image: vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e
+  image: vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1

🔴 This PR bumps the minimaxm3-fp4-mi355x-vllm-mtp image tag (nightly-3f5a1e173... to nightly-4559c43a9...) and adds VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, and --moe-backend aiter, but does not append the required entry to perf-changelog.yaml. Without that entry the run-sweep workflow has no trigger record for the new config on this PR, so the EP-hang fix that this PR exists to deliver will not be re-benchmarked end-to-end by CI before merge. Mirror the sister STP entry at perf-changelog.yaml lines 4320–4325 (from #1954): add a 4–6 line block under config-keys: [minimaxm3-fp4-mi355x-vllm-mtp] describing the image pin and AITER MoE enablement, with pr-link pointing at this PR.

Extended reasoning...

What's missing

AGENTS.md §"Updating Docker images" (line 126) explicitly mandates: "Update the image tag in the relevant .github/configs/*-master.yaml and/or benchmarks/*.sh, update any related env vars / config params, and append a perf-changelog.yaml entry (required - triggers benchmarks)". AGENTS.md line 58 reinforces this: "Changes to perf-changelog.yaml trigger benchmark runs."

This PR makes both of the changelog-triggering changes AGENTS.md calls out — (1) bumps the image tag at .github/configs/amd-master.yaml:2669 from nightly-3f5a1e1733200760169ff31ebe60a271072b199e to nightly-4559c43a9526597c00cbcc4f59979496500268d1, and (2) adds three VLLM_ROCM_USE_AITER* env vars plus --moe-backend aiter to benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh — but the PR diff modifies only those two files. perf-changelog.yaml is not touched.
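The rule quoted above can be approximated as a check over a PR's changed-file list. The following is a minimal sketch, not something that exists in the repository: the function name `missing_changelog_entry` is hypothetical, and the path patterns are taken from the AGENTS.md wording quoted in this comment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository). Reads changed file paths,
# one per line, on stdin. Returns 0 when a benchmark config or script changed
# without perf-changelog.yaml changing alongside it, and 1 otherwise.
missing_changelog_entry() {
    local touched_recipe=0 touched_changelog=0 path
    # The "|| [ -n ... ]" clause keeps a final line that lacks a newline.
    while IFS= read -r path || [ -n "$path" ]; do
        case "$path" in
            perf-changelog.yaml) touched_changelog=1 ;;
            # In a case pattern, "*" also matches "/" characters.
            .github/configs/*-master.yaml|benchmarks/*.sh) touched_recipe=1 ;;
        esac
    done
    [ "$touched_recipe" -eq 1 ] && [ "$touched_changelog" -eq 0 ]
}
```

Assuming the base branch is available as `origin/main`, it could be run as `git diff --name-only origin/main...HEAD | missing_changelog_entry && echo "perf-changelog.yaml entry missing"`. For the two files this PR modifies, the function returns 0, which flags the omission.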

Why this matters here (not generic policy concern)

The sister STP PR #1954, which makes byte-for-byte equivalent changes (same image pin, same three AITER env vars, same --moe-backend aiter) for the non-MTP variant minimaxm3-fp4-mi355x-vllm, did append a proper entry at perf-changelog.yaml lines 4320–4325 with the matching config-keys: [minimaxm3-fp4-mi355x-vllm] block and a pr-link to #1954. The MTP twin (this PR) omits the mirror entry. The most recent existing entry for minimaxm3-fp4-mi355x-vllm-mtp in perf-changelog.yaml (lines 4292–4298, from PR #1939) still pins the old hanging image nightly-3f5a1e173... and describes "automatic MoE backend selection", which is exactly the configuration this PR is trying to replace.

CI consequence (step-by-step)

  1. .github/workflows/run-sweep.yml (push to main and pull_request triggers) gates on paths: perf-changelog.yaml — only changes to that file cause the full sweep to fire for the affected configs.
  2. This PR does not modify perf-changelog.yaml, so on push/PR events the run-sweep workflow has no config-keys delta naming minimaxm3-fp4-mi355x-vllm-mtp to trigger on.
  3. Therefore the new AITER+aiter configuration for the MTP recipe will not be re-benchmarked end-to-end on this PR.
  4. The PR description itself promises this verification: "verified against the local vLLM @ 4559c43a9 + aiter tip; the CI re-sweep on the pinned docker image will confirm end-to-end." That CI re-sweep cannot happen without the changelog entry.
  5. Once merged, future artifact-reuse runs will keep treating the recipe as unchanged from the PR #1939 baseline (hanging image, auto MoE backend) until someone notices the omission.
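To make step 2 concrete, the sketch below extracts the config keys named on added lines of a perf-changelog.yaml diff. How run-sweep actually derives its config-keys delta is not shown in this thread, so this is only an illustration: `added_config_keys` is a hypothetical name, and the parsing assumes the two entry shapes seen in this PR (inline `config-keys: [a, b]` and a block list under `config-keys:`).

```shell
#!/usr/bin/env bash
# Hypothetical sketch: read a unified diff of perf-changelog.yaml on stdin and
# print each config key that appears on an added ("+") line, one per line.
added_config_keys() {
    awk '
        /^[+][+][+] / { next }                 # skip the "+++ b/file" header
        !/^[+]/       { in_keys = 0; next }    # only added lines count
        { line = substr($0, 2) }               # drop the leading "+"
        line ~ /^[[:space:]]*-?[[:space:]]*config-keys:/ {
            in_keys = 1
            sub(/^[^:]*:/, "", line)           # keep the text after the colon
            gsub(/\[|\]|,/, " ", line)         # inline form: strip brackets, commas
            n = split(line, parts, " ")
            for (i = 1; i <= n; i++) print parts[i]
            next
        }
        in_keys && line ~ /^[[:space:]]*-[[:space:]]+[^[:space:]]+[[:space:]]*$/ {
            gsub(/^[[:space:]]*-[[:space:]]+|[[:space:]]+$/, "", line)
            print line                         # block-list item under config-keys
            next
        }
        { in_keys = 0 }                        # any other added line ends the list
    '
}
```

A diff that never touches perf-changelog.yaml produces no output from this function, which is the "no config-keys delta" situation described in step 2.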

This is precisely the case run-sweep exists to catch: the entire reason for this PR is to fix an ~8-hour CI hang on EP configs of minimaxm3-fp4-mi355x-vllm-mtp. Merging without the changelog entry leaves the remediation unverified by the gate it was built to satisfy.

Fix

Append an entry to the end of perf-changelog.yaml mirroring the #1954 entry at lines 4320–4325:

- config-keys: [minimaxm3-fp4-mi355x-vllm-mtp]
  description: >-
    Enable AITER MoE on MiniMax-M3 MXFP4 MI355X single-node vLLM MTP:
    export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, and
    VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1; pass --moe-backend aiter.
    Pin image to nightly-4559c43a9526597c00cbcc4f59979496500268d1 to
    match the STP recipe and fix the EP startup hang.
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1964

This is the same mechanical 4–6 line append the sister STP PR #1954 already used; copying its shape is the cleanest fix.
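After appending the entry, a quick local check can confirm that the changelog names the MTP key. This is a hedged sketch with a hypothetical function name, `changelog_has_key`. It compares whole keys rather than substrings, because the STP key minimaxm3-fp4-mi355x-vllm is a prefix of the MTP key and a plain `grep` for the STP key would match both.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when changelog file $1 names config key $2,
# in either the inline form (config-keys: [a, b]) or the block-list form
# (a "- key" line directly under config-keys:).
changelog_has_key() {
    local file=$1 key=$2
    awk -v key="$key" '
        /^[[:space:]]*-?[[:space:]]*config-keys:/ {
            in_keys = 1
            line = $0
            sub(/^[^:]*:/, "", line)           # keep the text after the colon
            gsub(/\[|\]|,/, " ", line)         # inline form: strip brackets, commas
            n = split(line, parts, " ")
            for (i = 1; i <= n; i++) if (parts[i] == key) found = 1
            next
        }
        in_keys && /^[[:space:]]*-[[:space:]]+[^[:space:]]+[[:space:]]*$/ {
            item = $0
            gsub(/^[[:space:]]*-[[:space:]]+|[[:space:]]+$/, "", item)
            if (item == key) found = 1         # exact match, not a substring match
            next
        }
        { in_keys = 0 }                        # description, pr-link, etc.
        END { exit (found ? 0 : 1) }
    ' "$file"
}
```

For example, `changelog_has_key perf-changelog.yaml minimaxm3-fp4-mi355x-vllm-mtp` returns 0 once an entry in either shape is present. It only confirms that some entry names the key, not that the entry describes the new image pin.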

model: amd/MiniMax-M3-MXFP4
model-prefix: minimaxm3
runner: mi355x
13 changes: 13 additions & 0 deletions benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh
@@ -36,6 +36,18 @@ fi
SERVER_LOG=/workspace/server.log
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_USE_BREAKABLE_CUDAGRAPH=0
+# Use AITER MoE for the MXFP4 experts, matching minimaxm3_fp4_mi355x_vllm.sh.
+# This is required for ALL configs including expert parallelism: with EP enabled
+# and moe_backend=auto, the AITER MXFP4 backend is skipped and selection falls
+# back to Mxfp4MoeBackend.EMULATION, which triggers a first-time build of the
+# Quark hw-emulation C++ kernel (kernel_ext, 9 ROCm arches) on every worker at
+# warmup. Concurrent EP workers deadlock on the shared torch_extensions build
+# lock, hanging engine-core for hours. Forcing --moe-backend aiter selects the
+# AITER_MXFP4_MXFP4 backend instead (verified working under TP4+EP4 with EAGLE3
+# spec decoding), avoiding the emulation build entirely.
+export VLLM_ROCM_USE_AITER=1
+export VLLM_ROCM_USE_AITER_MOE=1
+export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1

if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
@@ -65,6 +77,7 @@ vllm serve "$MODEL" --port "$PORT" \
--language-model-only \
--max-model-len "$MAX_MODEL_LEN" \
--attention-backend TRITON_ATTN \
+--moe-backend aiter \
--speculative-config "{\"method\": \"eagle3\", \"model\": \"$DRAFT_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \
--tool-call-parser minimax_m3 \
--enable-auto-tool-choice \
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -4344,6 +4344,14 @@
- "Reuse the existing MXFP8 B300 topology and concurrency matrix across 15 srt-slurm recipes, while dropping the FP8-only Marlin override from TP4 decode"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1931

+- config-keys:
+    - minimaxm3-fp4-mi355x-vllm-mtp
+  description:
+    - "Enable AITER MoE on MiniMax-M3 MXFP4 MI355X single-node vLLM MTP (EAGLE3), mirroring the STP recipe: export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1; pass --moe-backend aiter unconditionally (including expert parallelism)."
+    - "Fixes the ~8h engine-core startup hang on EP configs: with moe_backend=auto, EP fell back to Mxfp4MoeBackend.EMULATION, which deadlocked all expert-parallel workers building the Quark hw-emulation C++ kernel into a shared torch_extensions dir. Forcing --moe-backend aiter selects AITER_MXFP4_MXFP4 (no emulation build)."
+    - "Pin vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1 (from nightly-3f5a1e1733200760169ff31ebe60a271072b199e), matching the STP recipe."
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1964

- config-keys:
- minimaxm3-fp4-b300-dynamo-vllm
description: