30 changes: 30 additions & 0 deletions .github/configs/amd-master.yaml
@@ -2636,6 +2636,36 @@ minimaxm3-fp4-mi355x-vllm-disagg:
dp-attn: false
additional-settings:
- "DECODE_NODES=1"
# MiniMax-M3 MXFP4 MI355X vLLM recipe. The pinned nightly includes upstream
# MiniMax-M3 Quark MXFP4 support (vllm-project/vllm#45794). Use the text-only
# language-model path and mirror the MXFP8 MI355X search space for a direct
# precision comparison.
minimaxm3-fp4-mi355x-vllm:
  image: vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e
  model: amd/MiniMax-M3-MXFP4
  model-prefix: minimaxm3
  runner: mi355x
  precision: fp4
  framework: vllm
  multinode: false
  scenarios:
    fixed-seq-len:
    - isl: 1024
      osl: 1024
      search-space:
      - { tp: 8, conc-start: 1, conc-end: 64 }
      - { tp: 8, ep: 8, conc-start: 1, conc-end: 512 }
      - { tp: 4, conc-start: 1, conc-end: 64 }
      - { tp: 4, ep: 4, conc-start: 64, conc-end: 512 }
      - { tp: 2, ep: 2, conc-start: 16, conc-end: 128 }
      - { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 1024 }
    - isl: 8192
      osl: 1024
      search-space:
      - { tp: 8, conc-start: 1, conc-end: 64 }
      - { tp: 8, ep: 8, conc-start: 1, conc-end: 512 }
      - { tp: 4, conc-start: 1, conc-end: 128 }
      - { tp: 8, ep: 8, dp-attn: true, conc-start: 128, conc-end: 512 }

# MiniMax-M3 MXFP4 MI355X atom recipe:
# https://github.com/ROCm/ATOM/blob/5d42d49f9e4292e5b61475917e92e7ec1b1dacb7/recipes/MiniMax-M3.md
88 changes: 88 additions & 0 deletions benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh
@@ -0,0 +1,88 @@
#!/usr/bin/env bash

# MiniMax-M3 MXFP4 MI355X (gfx950) single-node vLLM recipe.
# https://huggingface.co/amd/MiniMax-M3-MXFP4#reproduction
# Block size 128 is mandatory for MSA. This fixed-sequence benchmark uses the
# text-only language-model path and lets vLLM select the MoE backend.

source "$(dirname "$0")/../../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    EP_SIZE \
    DP_ATTENTION \
    CONC \
    ISL \
    OSL \
    MAX_MODEL_LEN \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
    export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES"
fi

SERVER_LOG=/workspace/server.log
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_USE_BREAKABLE_CUDAGRAPH=0

Comment on lines +32 to +35


🔴 The new minimaxm3_fp4_mi355x_vllm.sh is missing export VLLM_USE_BREAKABLE_CUDAGRAPH=0 after the VLLM_ENGINE_READY_TIMEOUT_S line, which every other MiniMax-M3 vLLM recipe in the repo sets (including the MXFP4 multi-node disagg entry at models_vllm.yaml:44 for the SAME amd/MiniMax-M3-MXFP4 model). Without it, the M3 decode path silently falls back to eager mode via the breakable-cudagraph fallback, invalidating the "direct precision comparison" with the MXFP8 baseline (which DOES run with CUDA graphs) that the PR description names as the motivation. Fix: add export VLLM_USE_BREAKABLE_CUDAGRAPH=0 at line 34, matching minimaxm3_fp8_mi355x.sh:33.

Extended reasoning...

The bug

The new benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh script (lines 32-34) sets SERVER_LOG and VLLM_ENGINE_READY_TIMEOUT_S=3600, but does NOT export VLLM_USE_BREAKABLE_CUDAGRAPH=0. Every other MiniMax-M3 vLLM recipe in this repo sets this env var:

| File | Line |
| --- | --- |
| minimaxm3_fp8_mi300x.sh | 35 |
| minimaxm3_fp8_mi300x_mtp.sh | 52 |
| minimaxm3_fp8_mi325x.sh | 33 |
| minimaxm3_fp8_mi325x_mtp.sh | 64 |
| minimaxm3_fp8_mi355x.sh | 33 (the direct sibling this PR claims to mirror) |
| minimaxm3_fp8_mi355x_mtp.sh | 63 |
| benchmarks/multi_node/amd_utils/models_vllm.yaml | 44 (MiniMax-M3-MXFP4 disagg) |

The inline comment in those scripts identifies it as a MiniMax-M3 model-specific (not precision-specific) workaround: "VLLM_USE_BREAKABLE_CUDAGRAPH=0 avoids the M3-decode breakable-cudagraph path that previously forced eager execution."
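
A check along these lines could catch the same omission in future recipes. This is only a sketch: `recipes_missing_export` is a hypothetical helper, not something that exists in the repo, and it only recognizes a literal `export VLLM_USE_BREAKABLE_CUDAGRAPH=0` line in shell recipes. It would not understand the env-string form used in models_vllm.yaml.

```bash
#!/usr/bin/env bash
# Hypothetical lint helper (not part of the repo): print every script passed
# as an argument that never exports VLLM_USE_BREAKABLE_CUDAGRAPH=0.
# Returns 1 if at least one script is missing the export, 0 otherwise.
recipes_missing_export() {
    local script status=0
    for script in "$@"; do
        # Match "export VLLM_USE_BREAKABLE_CUDAGRAPH=0", optionally indented
        # and optionally followed by whitespace or a trailing comment.
        if ! grep -Eq '^[[:space:]]*export[[:space:]]+VLLM_USE_BREAKABLE_CUDAGRAPH=0([[:space:]]|$)' -- "$script"; then
            printf '%s\n' "$script"
            status=1
        fi
    done
    return "$status"
}
```

The caller chooses which files to pass, so non-vLLM MiniMax-M3 recipes (which may not need the variable) can be left out of the argument list.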

Why this is not specific to MXFP8

The disagg config at benchmarks/multi_node/amd_utils/models_vllm.yaml:44 uses the exact same model (amd/MiniMax-M3-MXFP4) and explicitly sets VLLM_USE_BREAKABLE_CUDAGRAPH=0 in its env string. So the requirement is tied to the MiniMax-M3 model + ROCm decode path, not to the weight quantization. PRs #1750/#1754/#1755/#1756 (recorded in perf-changelog.yaml) landed this fix "per AMD guidance" across every MiniMax-M3 single-node recipe at the time; this new MXFP4 single-node recipe breaks the established pattern without justification.

Concrete trigger walkthrough

  1. Sweep launcher runs bash minimaxm3_fp4_mi355x_vllm.sh with one of the TP/EP shapes from amd-master.yaml.
  2. Script exports only VLLM_ENGINE_READY_TIMEOUT_S=3600; VLLM_USE_BREAKABLE_CUDAGRAPH is unset (default: enabled).
  3. vllm serve is invoked without --enforce-eager, so vLLM normally captures CUDA graphs for decode.
  4. On MiniMax-M3, the decode path hits the "breakable cudagraph" fallback (the issue the env var was added to suppress, per the inline comments in all sister scripts).
  5. Decode silently runs eager-mode while the FP8 MI355X baseline runs with CUDA graphs enabled (since its script DOES export the var).
  6. The PR description explicitly states the motivation is to "mirror the existing MXFP8 MI355X TP/EP/DP-attention sweep for direct precision comparisons" — but the comparison is no longer apples-to-apples: FP4 measures eager-mode decode while FP8 measures graph-captured decode.
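
Step 2 is the point where the problem becomes invisible, because an unset variable produces no error. A recipe could fail fast instead. The guard below is a minimal sketch, assuming the recipe wants to abort rather than benchmark in an unintended mode; `require_breakable_cudagraph_disabled` is a hypothetical name, not an existing helper in benchmark_lib.sh.

```bash
#!/usr/bin/env bash
# Hypothetical pre-launch guard: succeed only when the variable is explicitly
# set to "0". An unset or differently valued variable is reported on stderr.
require_breakable_cudagraph_disabled() {
    local value="${VLLM_USE_BREAKABLE_CUDAGRAPH:-unset}"
    if [ "$value" != "0" ]; then
        echo "VLLM_USE_BREAKABLE_CUDAGRAPH is '$value'; expected 0" >&2
        return 1
    fi
}
```

Calling it as `require_breakable_cudagraph_disabled || exit 1` just before `vllm serve` would turn the silent divergence from the FP8 baseline into an immediate failure.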

Impact

This silently invalidates the benchmark's stated purpose. The numbers will look worse than they should because eager-mode decode throughput is substantially below graph-captured decode on MoE models. Anyone comparing these results to the MXFP8 baseline will draw incorrect conclusions about MXFP4's quality/perf trade-off. This is a normal-severity bug because the measurement validity is the explicit goal of this PR.

Fix

One-line addition at benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh:34:

SERVER_LOG=/workspace/server.log
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_USE_BREAKABLE_CUDAGRAPH=0  # avoids the M3-decode breakable-cudagraph path that previously forced eager execution

This exactly mirrors the layout in minimaxm3_fp8_mi355x.sh:31-33, which the PR description claims to mirror.

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
fi

PARALLEL_ARGS=(--tensor-parallel-size "$TP")
if [ "${DP_ATTENTION}" = "true" ]; then
    PARALLEL_ARGS=(
        --tensor-parallel-size 1
        --data-parallel-size "$TP"
        --enable-expert-parallel
    )
elif [ "$EP_SIZE" -gt 1 ]; then
    PARALLEL_ARGS+=(--enable-expert-parallel)
fi

start_gpu_monitor

set -x
vllm serve "$MODEL" --port "$PORT" \
    "${PARALLEL_ARGS[@]}" \
    --trust-remote-code \
    --block-size 128 \
    --no-enable-prefix-caching \
    --language-model-only \
    --max-model-len "$MAX_MODEL_LEN" \
    --attention-backend TRITON_ATTN \
    --tool-call-parser minimax_m3 \
    --enable-auto-tool-choice \
    --reasoning-parser minimax_m3 > "$SERVER_LOG" 2>&1 &

SERVER_PID=$!
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code

if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

stop_gpu_monitor
set +x
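
The PARALLEL_ARGS branching in the recipe above decides how each search-space entry maps to `vllm serve` flags. The sketch below repeats that branching inside a hypothetical function, `build_parallel_args`, so the mapping can be printed without starting a server. It assumes the sweep launcher passes `EP_SIZE=1` for entries that have no `ep` key, which the diff does not show.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the recipe's PARALLEL_ARGS logic.
# Arguments: TP, EP_SIZE, DP_ATTENTION ("true" or anything else).
build_parallel_args() {
    local tp="$1" ep_size="$2" dp_attention="$3"
    local -a args=(--tensor-parallel-size "$tp")
    if [ "$dp_attention" = "true" ]; then
        # DP attention: TP collapses to 1 and the GPU count becomes the DP size.
        args=(--tensor-parallel-size 1 --data-parallel-size "$tp" --enable-expert-parallel)
    elif [ "$ep_size" -gt 1 ]; then
        # Expert parallelism on top of plain tensor parallelism.
        args+=(--enable-expert-parallel)
    fi
    printf '%s\n' "${args[*]}"
}

build_parallel_args 8 8 true    # { tp: 8, ep: 8, dp-attn: true }
build_parallel_args 8 8 false   # { tp: 8, ep: 8 }
build_parallel_args 4 1 false   # { tp: 4 }, with EP_SIZE assumed to be 1
```

With DP attention enabled, the `tp` value from the config is reused as the data-parallel size and expert parallelism is always on, regardless of `EP_SIZE`.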
9 changes: 9 additions & 0 deletions perf-changelog.yaml
@@ -4196,3 +4196,12 @@
  description:
    - "Initial submission: MiniMax-M3 MXFP4 disagg (prefill/decode) on MI355X with vLLM over the MoRI-IO KV connector (8k/1k)."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1914

- config-keys:
    - minimaxm3-fp4-mi355x-vllm
  description:
    - "Add a MiniMax-M3 MXFP4 single-node vLLM benchmark on MI355X using amd/MiniMax-M3-MXFP4."
    - "Pin vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e, which includes upstream Quark MXFP4 support from vllm-project/vllm#45794."
    - "Serve through the text-only language-model path with block size 128, TRITON_ATTN, MiniMax-M3 tool/reasoning parsers, automatic tool choice, and VLLM_USE_BREAKABLE_CUDAGRAPH=0; let vLLM select the MoE backend and retain the default KV-cache dtype."
    - "Mirror the MiniMax-M3 MXFP8 MI355X TP/EP/DP-attention search space at 1k1k and 8k1k for a direct precision comparison."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1935