[NVIDIA] Disable prefix caching for MiniMax M2.5 vLLM benchmarks #965
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -34,12 +34,13 @@ | |
| set -x | ||
| vllm serve $MODEL --port $PORT \ | ||
| --tensor-parallel-size=$TP \ | ||
| $EP \ | ||
| --gpu-memory-utilization 0.95 \ | ||
| --max-model-len $MAX_MODEL_LEN \ | ||
| --no-enable-prefix-caching \ | ||
| --trust-remote-code > $SERVER_LOG 2>&1 & | ||
| SERVER_PID=$! | ||
Check notice on line 43 in benchmarks/single_node/minimaxm2.5_fp8_h200.sh
Comment on lines 37 to 43 (Contributor)
🟣 This is a pre-existing issue: `minimaxm2.5_fp8_h200.sh` is missing the [...]

Extended reasoning...

What the bug is: The h200.sh benchmark script lacks an upfront [...]

The specific code path: In b200.sh (line ~20), [...]

Why existing code doesn't prevent it: The h200.sh script does call [...]

What the impact would be: When an H200 benchmark run produces unexpected results or needs to be reproduced, engineers cannot look up the driver version or initial GPU state from the logs. This makes post-hoc debugging harder. B200 and H100 runs have this information readily available, creating an inconsistency across the benchmark suite.

How to fix it: Add [...]

    hf download "$MODEL"
    nvidia-smi
    SERVER_LOG=/workspace/server.log

Step-by-step proof: [...]
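A minimal sketch of how the GPU state dump could be made robust on hosts where the tool is absent. The helper name `log_gpu_state` and the `command -v` guard are assumptions for illustration and do not appear in the benchmark scripts; the review comment itself only proposes a plain `nvidia-smi` call between `hf download` and the `SERVER_LOG` assignment.

```shell
# Hypothetical helper (not part of the repository): record driver version and
# initial GPU state in the job log before the server is launched.
log_gpu_state() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Print the standard nvidia-smi summary; never abort the benchmark on failure.
        nvidia-smi || echo "nvidia-smi failed with exit code $?"
    else
        echo "nvidia-smi not found; skipping GPU state dump"
    fi
}

log_gpu_state
```

Because the function always returns success, it can be called in a script that uses `set -e` without risking an early exit.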
| # Wait for server to be ready | ||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -42,12 +42,13 @@ | |
| vllm serve $MODEL --port $PORT \ | ||
| --tensor-parallel-size=$TP \ | ||
| $EP \ | ||
| --gpu-memory-utilization 0.95 \ | ||
| --max-model-len $MAX_MODEL_LEN \ | ||
| --block-size=32 \ | ||
| --no-enable-prefix-caching \ | ||
| --trust-remote-code > $SERVER_LOG 2>&1 & | ||
| SERVER_PID=$! | ||
Check notice on line 51 in benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh
Comment on lines 45 to 51 (Contributor)
🟣 This is a pre-existing issue: [...]

Extended reasoning...

Bug: [...]

Specific code paths: In the current post-PR state, [...]

Why existing code doesn't prevent it: The PR adds [...]

Impact: When running benchmarks on mi355x hardware, the vLLM server will emit a log line for every individual inference request processed. In a typical benchmark run with hundreds or thousands of requests, this floods the server log with per-request noise, making it harder to extract signal from [...]

Note on scope: The NVIDIA scripts (b200, h100, h200) also lack [...]

How to fix: Add [...]

Step-by-step proof: [...]
| # Wait for server to be ready | ||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
🟣 Pre-existing inconsistency: `minimaxm2.5_fp8_b200.sh` (line 30) and `minimaxm2.5_fp8_h200.sh` (line 25) use `-ge 1` for the EP_SIZE expert-parallel guard, enabling `--enable-expert-parallel` even when EP_SIZE=1, while `h100.sh` and `mi355x.sh` correctly use `-gt 1`. When EP_SIZE=1 is set, b200 and h200 will pass `--enable-expert-parallel` while h100 and mi355x will not, producing non-comparable benchmark configurations across hardware targets. Both b200.sh and h200.sh should change `-ge 1` to `-gt 1` to match the rest of the suite.

Extended reasoning...

What the bug is: `minimaxm2.5_fp8_b200.sh` and `minimaxm2.5_fp8_h200.sh` use `-ge 1` (greater-than-or-equal) for the EP_SIZE guard that enables `--enable-expert-parallel`, while `minimaxm2.5_fp8_h100.sh` and `minimaxm2.5_fp8_mi355x.sh` use `-gt 1` (greater-than). Since EP_SIZE is always at least 1 in any valid configuration, `-ge 1` is effectively always true, so the else branch is never taken.

The specific code paths:
- `minimaxm2.5_fp8_b200.sh` line 30: `if [ "$EP_SIZE" -ge 1 ]; then EP=" --enable-expert-parallel"`
- `minimaxm2.5_fp8_h200.sh` line 25: `if [ "$EP_SIZE" -ge 1 ]; then EP=" --enable-expert-parallel"`
- `minimaxm2.5_fp8_h100.sh` line 29: `if [ "$EP_SIZE" -gt 1 ]; then EP=" --enable-expert-parallel"` (correct)
- `minimaxm2.5_fp8_mi355x.sh` line 32: `if [ "$EP_SIZE" -gt 1 ]; then EP=" --enable-expert-parallel"` (correct)

Why existing code doesn't prevent it: The condition `-ge 1` is simply wrong. Any positive EP_SIZE value (including EP_SIZE=1) will satisfy it, so the guard meant to skip expert parallelism when EP_SIZE=1 is completely ineffective on b200 and h200.

Step-by-step proof: Suppose a user sets `EP_SIZE=1` and runs the same MoE workload across hardware targets for a cross-platform comparison. On h100 and mi355x, the guard evaluates `1 -gt 1` → false, so `EP` is empty and `--enable-expert-parallel` is NOT passed to vLLM. On b200 and h200, the guard evaluates `1 -ge 1` → true, so `--enable-expert-parallel` IS passed. The b200/h200 benchmarks now run a different vLLM code path (the expert-parallel path with degree 1), while h100/mi355x run the standard path. Results from this run are not apples-to-apples.

Impact: Expert parallelism with degree 1 is semantically a no-op (all experts are on one rank) but still activates vLLM's expert-parallel dispatch logic, which can add communication overhead and use different kernels. This means benchmark numbers from b200/h200 at EP_SIZE=1 cannot be directly compared to h100/mi355x at EP_SIZE=1.

How to fix: Change `-ge 1` to `-gt 1` in both `minimaxm2.5_fp8_b200.sh` line 30 and `minimaxm2.5_fp8_h200.sh` line 25. This is a pre-existing bug not introduced by this PR (which only adds `--no-enable-prefix-caching`), but since the PR already touches both affected files, it is a good opportunity to normalize the condition across all scripts.
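The difference between the two guards can be reproduced in isolation. The sketch below is illustrative only: the function names `ep_flag_ge` and `ep_flag_gt` are hypothetical wrappers around the two conditions quoted in the comment, not code from the benchmark scripts, and the sample EP_SIZE values are arbitrary.

```shell
# Guard as quoted for b200.sh and h200.sh: true for every EP_SIZE >= 1.
ep_flag_ge() {
    if [ "$1" -ge 1 ]; then
        printf '%s' " --enable-expert-parallel"
    fi
}

# Guard as quoted for h100.sh and mi355x.sh: true only when EP_SIZE > 1.
ep_flag_gt() {
    if [ "$1" -gt 1 ]; then
        printf '%s' " --enable-expert-parallel"
    fi
}

# Show the flag each variant would append for a few EP_SIZE values.
for size in 1 2 8; do
    printf 'EP_SIZE=%s  -ge 1: [%s]  -gt 1: [%s]\n' \
        "$size" "$(ep_flag_ge "$size")" "$(ep_flag_gt "$size")"
done
```

At EP_SIZE=1 only the `-ge 1` variant emits the flag; for larger values the two variants agree, which is why the inconsistency only affects EP_SIZE=1 runs.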