Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions benchmarks/single_node/minimaxm2.5_fp8_b200.sh
Original file line number Diff line number Diff line change
Expand Up @@ -40,12 +40,13 @@
vllm serve $MODEL --port $PORT \
--tensor-parallel-size=$TP \
$EP \
--gpu-memory-utilization 0.95 \
--max-model-len $MAX_MODEL_LEN \
--block-size=32 \
--no-enable-prefix-caching \
--trust-remote-code > $SERVER_LOG 2>&1 &

SERVER_PID=$!

Check notice on line 49 in benchmarks/single_node/minimaxm2.5_fp8_b200.sh

View check run for this annotation

Claude / Claude Code Review

Inconsistent EP_SIZE threshold: -ge 1 vs -gt 1 in b200/h200 scripts

Pre-existing inconsistency: `minimaxm2.5_fp8_b200.sh` (line 30) and `minimaxm2.5_fp8_h200.sh` (line 25) use `-ge 1` for the EP_SIZE expert-parallel guard, enabling `--enable-expert-parallel` even when EP_SIZE=1, while `h100.sh` and `mi355x.sh` correctly use `-gt 1`. When EP_SIZE=1 is set, b200 and h200 will pass `--enable-expert-parallel` while h100 and mi355x will not, producing non-comparable benchmark configurations across hardware targets. Both b200.sh and h200.sh should change `-ge 1` to `-
Comment on lines 43 to 49

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟣 Pre-existing inconsistency: minimaxm2.5_fp8_b200.sh (line 30) and minimaxm2.5_fp8_h200.sh (line 25) use -ge 1 for the EP_SIZE expert-parallel guard, enabling --enable-expert-parallel even when EP_SIZE=1, while h100.sh and mi355x.sh correctly use -gt 1. When EP_SIZE=1 is set, b200 and h200 will pass --enable-expert-parallel while h100 and mi355x will not, producing non-comparable benchmark configurations across hardware targets. Both b200.sh and h200.sh should change -ge 1 to -gt 1 to match the rest of the suite.

Extended reasoning...

What the bug is: minimaxm2.5_fp8_b200.sh and minimaxm2.5_fp8_h200.sh use -ge 1 (greater-than-or-equal) for the EP_SIZE guard that enables --enable-expert-parallel, while minimaxm2.5_fp8_h100.sh and minimaxm2.5_fp8_mi355x.sh use -gt 1 (greater-than). Since EP_SIZE is always at least 1 in any valid configuration, -ge 1 is effectively always true — it never takes the else branch.

The specific code paths:

  • minimaxm2.5_fp8_b200.sh line 30: if [ "$EP_SIZE" -ge 1 ]; then EP=" --enable-expert-parallel"
  • minimaxm2.5_fp8_h200.sh line 25: if [ "$EP_SIZE" -ge 1 ]; then EP=" --enable-expert-parallel"
  • minimaxm2.5_fp8_h100.sh line 29: if [ "$EP_SIZE" -gt 1 ]; then EP=" --enable-expert-parallel" (correct)
  • minimaxm2.5_fp8_mi355x.sh line 32: if [ "$EP_SIZE" -gt 1 ]; then EP=" --enable-expert-parallel" (correct)

Why existing code doesn't prevent it: The condition -ge 1 is simply wrong. Any positive EP_SIZE value (including EP_SIZE=1) will satisfy it, so the guard meant to skip expert parallelism when EP_SIZE=1 is completely ineffective on b200 and h200.

Step-by-step proof: Suppose a user sets EP_SIZE=1 and runs the same MoE workload across hardware targets for a cross-platform comparison. On h100 and mi355x, the guard evaluates 1 -gt 1 → false, so EP is empty and --enable-expert-parallel is NOT passed to vLLM. On b200 and h200, the guard evaluates 1 -ge 1 → true, so --enable-expert-parallel IS passed. The b200/h200 benchmarks now run a different vLLM code path (the expert-parallel path with degree 1), while h100/mi355x run the standard path. Results from this run are not apples-to-apples.

Impact: Expert parallelism with degree 1 is semantically a no-op (all experts are on one rank) but still activates vLLM's expert-parallel dispatch logic, which can add communication overhead and use different kernels. This means benchmark numbers from b200/h200 at EP_SIZE=1 cannot be directly compared to h100/mi355x at EP_SIZE=1.

How to fix: Change -ge 1 to -gt 1 in both minimaxm2.5_fp8_b200.sh line 30 and minimaxm2.5_fp8_h200.sh line 25. This is a pre-existing bug not introduced by this PR (which only adds --no-enable-prefix-caching), but since the PR already touches both affected files, it is a good opportunity to normalize the condition across all scripts.


# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"
Expand Down
1 change: 1 addition & 0 deletions benchmarks/single_node/minimaxm2.5_fp8_h100.sh
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ $EP \
--gpu-memory-utilization 0.90 \
--max-model-len $MAX_MODEL_LEN \
--max-num-seqs 256 \
--no-enable-prefix-caching \
--trust-remote-code \
--compilation-config '{"cudagraph_mode":"PIECEWISE"}' > $SERVER_LOG 2>&1 &

Expand Down
1 change: 1 addition & 0 deletions benchmarks/single_node/minimaxm2.5_fp8_h200.sh
Original file line number Diff line number Diff line change
Expand Up @@ -34,12 +34,13 @@
set -x
vllm serve $MODEL --port $PORT \
--tensor-parallel-size=$TP \
$EP \
--gpu-memory-utilization 0.95 \
--max-model-len $MAX_MODEL_LEN \
--no-enable-prefix-caching \
--trust-remote-code > $SERVER_LOG 2>&1 &

SERVER_PID=$!

Check notice on line 43 in benchmarks/single_node/minimaxm2.5_fp8_h200.sh

View check run for this annotation

Claude / Claude Code Review

h200.sh missing nvidia-smi GPU info dump present in b200 and h100 scripts

This is a pre-existing issue: minimaxm2.5_fp8_h200.sh is missing the `nvidia-smi` call present in both b200.sh and h100.sh that captures GPU driver version, memory, and clock state before benchmarking. Adding `nvidia-smi` after `hf download` (matching the pattern in the sibling scripts) would restore this diagnostic snapshot for H200 runs.
Comment on lines 37 to 43

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟣 This is a pre-existing issue: minimaxm2.5_fp8_h200.sh is missing the nvidia-smi call present in both b200.sh and h100.sh that captures GPU driver version, memory, and clock state before benchmarking. Adding nvidia-smi after hf download (matching the pattern in the sibling scripts) would restore this diagnostic snapshot for H200 runs.

Extended reasoning...

What the bug is: The h200.sh benchmark script lacks an upfront nvidia-smi invocation that both b200.sh and h100.sh include near the top of their scripts. This call serves as a diagnostic snapshot — capturing the NVIDIA driver version, GPU memory state, and clock frequencies — logged before any benchmark workload begins.

The specific code path: In b200.sh (line ~20), nvidia-smi is called directly after the SLURM job header check. In h100.sh it appears after hf download "$MODEL". In h200.sh, however, after hf download the script jumps directly to SERVER_LOG=/workspace/server.log with no nvidia-smi call anywhere in the file.

Why existing code doesn't prevent it: The h200.sh script does call start_gpu_monitor which provides ongoing per-second GPU metrics (power, temperature, clocks) during the benchmark. This ongoing monitoring is valuable but does not substitute for the initial point-in-time snapshot. The nvidia-smi output captures the static driver version and full hardware configuration at benchmark start, which start_gpu_monitor (typically a streaming tool) does not necessarily include.

What the impact would be: When an H200 benchmark run produces unexpected results or needs to be reproduced, engineers cannot look up the driver version or initial GPU state from the logs. This makes post-hoc debugging harder. B200 and H100 runs have this information readily available, creating an inconsistency across the benchmark suite.

How to fix it: Add nvidia-smi after the hf download line in h200.sh, matching the pattern in h100.sh:

hf download "$MODEL"

nvidia-smi

SERVER_LOG=/workspace/server.log

Step-by-step proof:

  1. Open b200.sh — line ~20 contains nvidia-smi between the SLURM header and hf download.
  2. Open h100.sh — contains nvidia-smi between hf download and export PYTHONNOUSERSITE=1.
  3. Open h200.sh — after hf download "$MODEL" the very next line is SERVER_LOG=/workspace/server.log with no nvidia-smi anywhere in the file.
  4. The PR adds --no-enable-prefix-caching to h200.sh but does not add the missing nvidia-smi call, leaving the inconsistency unaddressed.


# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"
Expand Down
1 change: 1 addition & 0 deletions benchmarks/single_node/minimaxm2.5_fp8_mi300x.sh
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ vllm serve $MODEL --port $PORT \
--max-model-len $MAX_MODEL_LEN \
--block-size=32 \
--disable-log-requests \
--no-enable-prefix-caching \
--trust-remote-code > $SERVER_LOG 2>&1 &

SERVER_PID=$!
Expand Down
1 change: 1 addition & 0 deletions benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ vllm serve $MODEL --port $PORT \
--max-model-len $MAX_MODEL_LEN \
--block-size=32 \
--disable-log-requests \
--no-enable-prefix-caching \
--trust-remote-code > $SERVER_LOG 2>&1 &

SERVER_PID=$!
Expand Down
1 change: 1 addition & 0 deletions benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh
Original file line number Diff line number Diff line change
Expand Up @@ -42,12 +42,13 @@
vllm serve $MODEL --port $PORT \
--tensor-parallel-size=$TP \
$EP \
--gpu-memory-utilization 0.95 \
--max-model-len $MAX_MODEL_LEN \
--block-size=32 \
--no-enable-prefix-caching \
--trust-remote-code > $SERVER_LOG 2>&1 &

SERVER_PID=$!

Check notice on line 51 in benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh

View check run for this annotation

Claude / Claude Code Review

mi355x.sh missing --disable-log-requests unlike mi300x and mi325x

This is a pre-existing issue: `minimaxm2.5_fp8_mi355x.sh` is missing `--disable-log-requests` unlike the other two AMD ROCm scripts in the same suite (`mi300x` and `mi325x`). This PR touched `mi355x.sh` to add `--no-enable-prefix-caching` but did not add the missing flag, leaving verbose per-request logs mixed into mi355x benchmark output while mi300x/mi325x suppress them.
Comment on lines 45 to 51

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟣 This is a pre-existing issue: minimaxm2.5_fp8_mi355x.sh is missing --disable-log-requests unlike the other two AMD ROCm scripts in the same suite (mi300x and mi325x). This PR touched mi355x.sh to add --no-enable-prefix-caching but did not add the missing flag, leaving verbose per-request logs mixed into mi355x benchmark output while mi300x/mi325x suppress them.

Extended reasoning...

Bug: minimaxm2.5_fp8_mi355x.sh omits the --disable-log-requests flag that both sibling AMD ROCm scripts pass to vllm serve.

Specific code paths: In the current post-PR state, mi300x.sh (line ~40) has --disable-log-requests and mi325x.sh (line ~42) has --disable-log-requests, but mi355x.sh (lines 45-51) does not include this flag anywhere in its vllm serve invocation.

Why existing code doesn't prevent it: The PR adds --no-enable-prefix-caching to all six MiniMax M2.5 scripts, but does not align the AMD ROCm scripts with respect to --disable-log-requests. There is no shared configuration or template that would enforce flag consistency across these sibling scripts.

Impact: When running benchmarks on mi355x hardware, the vLLM server will emit a log line for every individual inference request processed. In a typical benchmark run with hundreds or thousands of requests, this floods the server log with per-request noise, making it harder to extract signal from $SERVER_LOG and inconsistent with the cleaner mi300x/mi325x logs. The benchmarks still function correctly; this is purely a log hygiene issue.

Note on scope: The NVIDIA scripts (b200, h100, h200) also lack --disable-log-requests, so this inconsistency is specifically within the AMD ROCm script family, not a universal requirement across the entire suite.

How to fix: Add --disable-log-requests \ to the vllm serve argument list in minimaxm2.5_fp8_mi355x.sh, between --block-size=32 and --no-enable-prefix-caching, mirroring the pattern in mi300x and mi325x.

Step-by-step proof:

  1. Run minimaxm2.5_fp8_mi355x.sh with a high-concurrency workload (e.g., CONC=128).
  2. Observe /workspace/server.log: each of the ~1280 requests (CONC * 10) will emit a log line like INFO ... Received request ... and INFO ... Finished request ....
  3. Run equivalent minimaxm2.5_fp8_mi300x.sh benchmark — the same log will have no per-request lines due to --disable-log-requests.
  4. The inconsistency is directly observable by diffing the two server logs.


# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"
Expand Down