[NV] Kimi-K2.5 NVFP4 GB200 dynamo-vllm disagg benchmark refresh #1862
Force-pushed from 16c2d7e to d57f4a2.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27848658320
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27875251240
Resolve perf-changelog.yaml conflict keeping all three new entries. Extend squash-dir probe, shared-base probe, python venv pin, SRTCTL_ROOT override, and INFMAX_WORKSPACE rsync to also apply when MODEL_PREFIX == kimik2.5.
Force-pushed from eb4cb43 to d54e9e5.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d54e9e5.
compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
gpu-memory-utilization: 0.9
stream-interval: 50
max-cudagraph-capture-size: 16
Missing decode MoE all2all backend
Medium Severity
In disagg-gb200-1p4d-dep4-tep4.yaml, the decode vllm_config sets enable-expert-parallel: true for Kimi-K2.5 MoE but omits all2all-backend, while the prefill block in the same file and every other new GB200 Kimi FP4 recipe with expert-parallel decode include flashinfer_nvlink_one_sided.
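The kind of omission Bugbot flags here could be caught mechanically. The following is a minimal sketch, assuming a recipe has already been parsed into a dict with one entry per worker role, each holding a `vllm_config` mapping. That layout, the role names, and the helper `find_missing_all2all` are hypothetical illustrations, not the actual srt-slurm recipe schema.

```python
from typing import Any


def find_missing_all2all(recipe: dict[str, Any]) -> list[str]:
    """Return roles that enable expert parallelism without an all2all backend.

    Assumes (hypothetically) that each role maps to a dict with a
    "vllm_config" sub-dict keyed by the vLLM flag names.
    """
    missing = []
    for role, cfg in recipe.items():
        vllm_config = cfg.get("vllm_config", {})
        uses_ep = bool(vllm_config.get("enable-expert-parallel"))
        if uses_ep and "all2all-backend" not in vllm_config:
            missing.append(role)
    return missing


# Illustrative recipe mirroring the reported situation: prefill sets the
# backend, decode enables expert parallelism but omits it.
recipe = {
    "prefill": {
        "vllm_config": {
            "enable-expert-parallel": True,
            "all2all-backend": "flashinfer_nvlink_one_sided",
        }
    },
    "decode": {"vllm_config": {"enable-expert-parallel": True}},
}

print(find_missing_all2all(recipe))
```

Whether the omission actually changes behavior depends on the default backend vLLM picks, which this sketch does not model; it only checks for consistency across roles.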
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27885061956
Hi @kedarpotdar-nv @functionstackx, can you please review and merge?
/reuse-sweep-run
# Conflicts: # perf-changelog.yaml
The minimaxm3-fp8-mi355x-vllm-disagg entry was inserted mid-file (after the #1862 entry), which violates the append-only changelog gate ("entry 511 changed; existing entries are immutable"). Move it to the end of perf-changelog.yaml so existing entries stay byte-identical to main and the new entry is a clean append. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
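The append-only rule this commit satisfies can be sketched as a prefix check: every entry on main must reappear unchanged at the same position, and new entries may only follow them. This is a minimal sketch, not the repository's actual gate. The function name `check_append_only`, the 0-based index in the message, and treating entries as plain strings are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def check_append_only(base: Sequence[str], head: Sequence[str]) -> Optional[str]:
    """Return an error message if `head` is not `base` plus appended entries.

    `base` holds the entries on main, `head` the entries on the PR branch.
    Returns None when every existing entry is unchanged and in place.
    """
    for index, old_entry in enumerate(base):
        if index >= len(head):
            return f"entry {index} removed; existing entries are immutable"
        if head[index] != old_entry:
            return f"entry {index} changed; existing entries are immutable"
    return None


base = ["entry-a", "entry-b", "entry-c"]

# Inserting mid-file shifts a later entry, so the check reports a change.
print(check_append_only(base, ["entry-a", "entry-b", "new", "entry-c"]))

# Appending at the end leaves the existing prefix untouched.
print(check_append_only(base, ["entry-a", "entry-b", "entry-c", "new"]))
```

A mid-file insertion fails even though no existing entry was edited, because the comparison is positional; that matches the "entry 511 changed" symptom quoted in the commit message.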
* [Klaud Cold] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test

MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3): 1 prefill (TP8) + 1 decode (TP8) at conc 1, validating the MoRI-IO KV-transfer disagg pipeline end-to-end for M3.

Layered on the MoRI-IO patch-removal infra (#1585): brings in that PR's amd_utils changes (setup_deps.sh / server_vllm.sh / submit.sh / models_vllm.yaml mori -> mori_low_latency) and the two job.slurm hunks (vllm-router image bump nightly-20260511 -> nightly-20260603, drop VLLM_MORIIO_CONNECTOR_READ_MODE env), while keeping main's atom-disagg support intact.

Per-worker serve flags (models_vllm.yaml MiniMax-M3-MXFP8): --block-size 128 (MSA), --language-model-only, --kv-cache-dtype fp8, --attention-backend TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts TP-sharded as in the single-node M3 TP8 recipe).

perf-changelog.yaml and amd-master.yaml contain only M3 changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* amd_utils/job.slurm: auto-download disagg checkpoint when not pre-staged

The first MI355X disagg sweep (run 27515119215) failed: the day-zero MiniMax-M3-MXFP8 checkpoint is not staged on the disagg cluster's shared FS, so job.slurm's model search hit a hard FATAL ("Model 'MiniMax-M3-MXFP8' not found. Searched: ...") before the engine ever started. The single-node recipes hf-download inside the serving container, but the disagg path historically required ops to pre-stage checkpoints.

Add an on-demand fallback to the vllm-disagg model-resolution block: when the checkpoint isn't found, derive the HF repo id from the hf_dir (models--org--name -> org/name) and download into MODEL_DIR in HF cache layout, then resolve the snapshot as MODEL_PATH.
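The `models--org--name -> org/name` derivation can be illustrated with a short sketch. The commit implements this in job.slurm shell; the Python helper below, including its name `hf_dir_to_repo_id` and its error handling, is a hypothetical restatement of the mapping rather than the code that was committed.

```python
def hf_dir_to_repo_id(hf_dir: str) -> str:
    """Map an HF cache directory name to a repo id.

    Example: "models--org--name" becomes "org/name". Only the first "--"
    after the prefix separates org from name, so a name that itself
    contains "--" is preserved.
    """
    prefix = "models--"
    if not hf_dir.startswith(prefix):
        raise ValueError(f"not an HF model cache dir: {hf_dir!r}")
    org, sep, name = hf_dir[len(prefix):].partition("--")
    if not sep or not org or not name:
        raise ValueError(f"cannot split org/name from: {hf_dir!r}")
    return f"{org}/{name}"


print(hf_dir_to_repo_id("models--MiniMaxAI--MiniMax-M3-MXFP8"))
```

Rejecting malformed names up front keeps a bad directory name from turning into a download attempt against the wrong repository.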
Staging into MODEL_DIR keeps MODEL_PATH under the dir that is bind-mounted into the serving container as /models, so the existing -v ${MODEL_DIR}:/models mount and DOCKER_MODEL_PATH (/models) remap both resolve.

Implementation notes:
- The host has no hf CLI, so the download runs in a one-shot container of the serving image (DOCKER_IMAGE_NAME), which ships huggingface_hub.
- flock on a lockfile in MODEL_DIR serializes the prefill/decode nodes; a re-check of snapshots/ under the lock makes it idempotent (resumable).
- hf download with a huggingface-cli fallback; 3 retries; HF_TOKEN passed through for gated repos.
- Scoped to the vllm-disagg branch only; pre-staged models never reach this path (the search finds them first), so sglang/atom and existing vLLM disagg models (M2.5/Kimi) are unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* job.slurm: --entrypoint "" for the auto-download container

The disagg auto-download reached hf download but failed all 3 attempts: the one-shot `docker run "$DOCKER_IMAGE_NAME" bash -lc "hf download ..."` did not override the image ENTRYPOINT, so the vllm-openai API server ran with the bash command as its args and died with "Failed to infer device type" (no GPU mounted in the download container). Add --entrypoint "" (as the serving container does) so bash actually runs hf download.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* M3 disagg: use shared HF cache (/it-share/hf-hub-cache); drop auto-download

Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's shared HF cache where the ~414 GB MXFP8 checkpoint is already staged (/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the launcher default /it-share/data.
Scoped to M3 only via the M3 disagg script:

    export MODEL_PATH=/it-share/hf-hub-cache

submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode serving containers. Other disagg models keep /it-share/data.

This supersedes the earlier job.slurm auto-download approach, which is reverted: job.slurm now differs from main only by the #1585 mori-removal hunks (router image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* disagg #1762: add 8k1k conc-16 row to run an lm-eval (validate correctness)

The conc-1 1k1k smoke test never triggered an eval: the multi-node eval policy only marks 8k1k entries with conc >= MIN_EVAL_CONC (16). Add an 8k1k conc-16 row (same 1P TP8 + 1D TP8 layout) so mark_eval_entries marks it run-eval=true (eval-conc=16), running lm-eval through the MoRI-IO disagg pipeline to validate correctness. The conc-1 1k1k row stays the latency smoke test.

Run with non-canary-full-sweep-enabled so the (non-min-conc) eval entry runs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* disagg #1762: sweep conc 1,2,4,8,16 (not just conc 1)

Widen the 1k1k disagg latency/throughput sweep from conc 1 to conc 1,2,4,8,16 (1P TP8 + 1D TP8). The 8k1k conc-16 eval row is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* disagg #1762: sweep conc 1,2,4,8,16 at both 1k1k and 8k1k

Widen the disagg sweep from conc 1 to conc 1,2,4,8,16 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Update the vLLM external router container

vllm/vllm-router only retains ~16 recent nightlies on Docker Hub; older dated tags are garbage-collected (manifest unknown), which makes `docker run` fail with exit 125 on any node that has not already cached the image.

* M3 disagg: per-layer MoRIIO KV transfer for hybrid sparse-attn (partial)

MiniMax-M3 (MiniMaxM3SparseForCausalLM) is a hybrid sparse-attention model: sparse layers register a separate lightning-indexer cache (MLAAttentionSpec, rank-3, bf16, key-only) alongside the main cache (FullAttentionSpec, rank-5, fp8, K+V). The MoRIIO connector assumes one uniform KV layout -- it derives block geometry from the first cache and reuses first_layer's offsets for every layer (see its own "hybrid attn" TODO) -- so the bf16 key-only index cache is transferred with fp8 K+V sizing and gets corrupted on the decode worker, producing garbage output (disagg gsm8k ~= 0 while single-node M3 is correct). This is the vLLM analogue of the SGLang MoRI DSA-state bug in patches/mori_conn.py.

- patches/moriio_heterogeneous_kv.py: compute the READ-path transfer geometry per layer (own shape/stride/dtype/rank) instead of from the first cache. Idempotent; no-op for homogeneous models.
- setup_deps.sh: apply it on the vllm-disagg path.

NOTE: partial fix -- necessary but not yet sufficient. The index cache is also a separate KV-cache group whose block-table/num_blocks the single-namespace MoRIIO connector cannot map, so M3 disagg accuracy is still broken pending a larger multi-group / index-state transfer change. (Disabling sparse attention is not a viable workaround: M3's fused QKV carries index_k weights, so dropping the indexer breaks weight load.)
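To make the per-layer geometry point concrete, here is a minimal sketch of why sizing every cache from the first one goes wrong. It is not the MoRIIO connector code. The `CacheSpec` type, the concrete shapes, and the assumption that axis 0 is `num_blocks` are all hypothetical; only the rank and dtype contrast (rank-5 fp8 K+V versus rank-3 bf16 key-only) comes from the commit message.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class CacheSpec:
    """Hypothetical description of one layer's KV cache tensor."""
    shape: tuple[int, ...]  # assumption: axis 0 is num_blocks
    itemsize: int           # bytes per element (fp8 = 1, bf16 = 2)


def block_bytes(spec: CacheSpec) -> int:
    """Bytes occupied by one block, derived from this layer's own geometry."""
    return prod(spec.shape[1:]) * spec.itemsize


def block_offset(spec: CacheSpec, block_id: int) -> int:
    """Byte offset of a block within this layer's cache."""
    return block_id * block_bytes(spec)


# Illustrative shapes only; the real M3 dimensions are not given in the source.
main_cache = CacheSpec(shape=(1024, 2, 128, 8, 128), itemsize=1)  # rank-5, fp8, K+V
index_cache = CacheSpec(shape=(1024, 128, 128), itemsize=2)       # rank-3, bf16, key-only

# Per-layer geometry: each cache gets its own block size and offsets.
print(block_bytes(main_cache), block_bytes(index_cache))

# Reusing the first cache's geometry for the index cache addresses the wrong bytes.
wrong = block_offset(main_cache, 3)
right = block_offset(index_cache, 3)
print(wrong, right)
```

With these made-up shapes the two caches differ by a factor of eight in block size, so an offset computed from the first cache lands far outside the intended block of the index cache.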
Refs #1762

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(amd-disagg): add vLLM MoRIIO KV-layout patch to reuse stock minimax-m3 image

The vLLM MoRIIOConnector in vllm/vllm-openai-rocm:minimax-m3 assumes the FlashAttention KV layout [2, num_blocks, ...] (K/V axis outer) but this vLLM's backends allocate [num_blocks, 2, ...] (K/V axis inner), so every disagg block transfer reads the wrong region. Invisible to throughput, but corrupts GQA/non-MLA accuracy (MiniMax-M3 gsm8k 0.0008 -> 0.957).

Instead of baking a fix into a rebuilt image (-hetkv) or carrying full vendored copies of the patched files in-tree, carry just the 218-line unified diff (patches/moriio/moriio-kv-layout-fix.diff) and apply it with `patch -p1` against the vLLM package dir inside the container at startup, ahead of the server launch. The repo is already bind-mounted into the container, so no EXTRA_DOCKER_MOUNTS wiring is needed -- job.slurm auto-applies the diff when DOCKER_IMAGE_NAME contains "minimax-m3" (skippable with MORIIO_KV_PATCH=skip), mirroring the existing mori_conn.py sglang hook. A failed apply aborts the container instead of silently running unpatched.

Validated on a manual 2-node run (n06-21 prefill+router / n09-21 decode) using the STOCK image: gsm8k strict-match 0.9568 / flexible-extract 0.9560 (matches the baked image within noise), decode probe healthy.

- patches/moriio/moriio-kv-layout-fix.diff: unified diff vs stock
- job.slurm: in-container `patch` step, MORIIO_KV_PATCH=skip opt-out
- patches/README.md: document the moriio/ diff-apply mechanism

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* disagg #1762: extend conc sweep to 32,64,128,256,512,1024 at 1k1k and 8k1k

Widen the disagg sweep from conc 1,2,4,8,16 to 1,2,4,8,16,32,64,128,256,512,1024 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* disagg #1762: add TP4-prefill P/D layouts (TP4+TP8, TP4+TP4) at 1k1k and 8k1k

Add two asymmetric prefill/decode layouts alongside the existing TP8+TP8 sweep, for both seq-len scenarios:
- 1P TP4 + 1D TP8 (smaller prefill, full-node decode) at conc 1..256
- 1P TP4 + 1D TP4 (balanced half-node) at conc 64..1024

Per-worker TP is driven by the master-config prefill/decode tp: server_vllm.sh sed-rewrites the models_vllm.yaml --tensor-parallel-size 8 placeholder to the computed PREFILL_TP_SIZE/DECODE_TP_SIZE, so no models_vllm.yaml flag change is needed (comment updated to say so). The multinode eval policy still marks exactly one lm-eval (groups by dp-attn, not TP) on the TP8+TP8 8k1k layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(amd-disagg): bundle heterogeneous-TP + dup-ack fixes into unified MoRIIO diff

Replaces moriio-kv-layout-fix.diff with moriio-minimax-m3-disagg.diff, which bundles three layered fixes for the stock minimax-m3 vLLM image:

1. KV-layout: axis-aware per-layer block offsets (the gsm8k 0.0008→0.958 fix, required for homogeneous TP too).
2. heterogeneous-TP addressing + guard: maps each decode rank to the correct prefill rank (tp_rank // ratio) for PREFILL_TP_SIZE != DECODE_TP_SIZE, and raises NotImplementedError for unsupported cases (prefill-TP > decode-TP, KV-head splitting) instead of silently corrupting KV.
3. dup-ack fan-in: with DECODE_TP_SIZE > PREFILL_TP_SIZE, producer counts ACKs per transfer_id and only frees KV blocks once all expected consumers ACK, preventing both the late-ACK EngineCore crash and KV reuse before slower decode ranks finish reading.

job.slurm and patches/README.md updated to reference the new diff name.
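Fixes 2 and 3 in the list above can be sketched in a few lines. This is a minimal illustration of the behavior as described in this commit message, not the patched connector: the names `remote_prefill_rank` and `AckFanIn` are hypothetical, and the assumption that the expected ACK count equals the decode/prefill TP ratio is mine rather than stated in the source.

```python
from collections import defaultdict


def remote_prefill_rank(decode_rank: int, decode_tp: int, prefill_tp: int) -> int:
    """Map a decode TP rank to the prefill TP rank it should read KV from.

    Mirrors the described guard: prefill-TP > decode-TP (and non-divisible
    sizes) raise instead of silently addressing the wrong rank.
    """
    if prefill_tp > decode_tp or decode_tp % prefill_tp != 0:
        raise NotImplementedError(
            f"unsupported TP layout: prefill={prefill_tp}, decode={decode_tp}"
        )
    ratio = decode_tp // prefill_tp
    return decode_rank // ratio


class AckFanIn:
    """Count ACKs per transfer and signal when KV blocks may be freed."""

    def __init__(self, expected_acks: int) -> None:
        self.expected_acks = expected_acks
        self.counts: dict[str, int] = defaultdict(int)

    def on_ack(self, transfer_id: str) -> bool:
        """Record one ACK; return True only on the ACK that completes the set."""
        self.counts[transfer_id] += 1
        return self.counts[transfer_id] == self.expected_acks


# 1P TP4 + 1D TP8: two decode ranks read from each prefill rank.
print([remote_prefill_rank(r, decode_tp=8, prefill_tp=4) for r in range(8)])

fan_in = AckFanIn(expected_acks=2)  # assumption: one ACK per reading decode rank
print(fan_in.on_ack("t1"), fan_in.on_ack("t1"))
```

Returning True only on the exact completing ACK means a late or duplicate ACK is counted but never triggers a second free, which is the failure mode the commit message attributes to the original single-ACK behavior.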
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(moriio): correct _remote_tp_rank for prefill-TP > decode-TP (P8/D4)

With P8/D4 and 4 KV heads, vLLM distributes heads across prefill ranks in consecutive pairs: (rank0,rank1)→head0, (rank2,rank3)→head1, etc. The previous patch used `return self.tp_rank` for the P>D branch, which made decode rank 1 connect to prefill rank 1 (holds head0) instead of prefill rank 2 (holds head1), corrupting KV for all decode ranks except 0.

Fix: use `self.tp_rank * ratio` (ratio = remote_tp_size // local_tp_size), the symmetric counterpart to the D>P case's `tp_rank // ratio`. This maps each decode rank to the *first* prefill rank of its head group, which holds the correct KV content via vLLM's replication scheme.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(moriio-diff): correct hunk header count after _remote_tp_rank expansion

The P>D fix added 4 lines to _remote_tp_rank but the hunk header still said +1100,40; patch aborted with "malformed patch at line 79". Update to +1100,44 to match the actual 6 context + 38 added lines.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(amd-disagg): keep MoRIIO patch cmd inside container bash -lc quotes

The MoRIIO KV-layout patch was injected into the per-node container launch via '"${_MORIIO_PATCH_CMD:-}"', which breaks out of the outer srun bash -c "..." double-quoted string. Because the patch command value contains spaces and the shell operators '<' and '||', the unquoted expansion word-split the generated container script, truncating it right after the word `patch` and silently dropping the patch arguments AND the server.sh launch. The container then exited 0:0 within seconds, producing no benchmark/eval output -> collect_latest_results found "No logs directory" -> the launch step failed with exit 1 (all minimax-m3 disagg jobs affected).
Fix: expand ${_MORIIO_PATCH_CMD:-} directly inside the inner bash -lc single quotes (no quote toggling), so the patch command stays intact and its operators are parsed by the container shell.

Validated end-to-end: gsm8k recovers from ~0 (garbage) to 0.94-0.98 across P8D8/P4D8/P8D4.

Co-authored-by: Cursor <cursoragent@cursor.com>

* disagg #1762: add 2P TP4 + 1D TP8 layout at conc 256,512,768,1024 (1k1k & 8k1k)

Two TP4 prefill workers (num-worker 2, PREFILL_NODES=2, each TP4 on half an 8-GPU node) feeding one TP8 decode (DECODE_NODES=1), 3 nodes total. Added to both seq-len scenarios at conc 256,512,768,1024. Eval marking unchanged (still one lm-eval on the 8k1k TP8+TP8 layout).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(amd-disagg): remove redundant moriio_heterogeneous_kv.py patcher

The per-layer READ-offset fix this Python patcher applied to moriio_connector.py is fully subsumed by the unified overlay patches/moriio/moriio-minimax-m3-disagg.diff, which job.slurm applies with `patch -p1` BEFORE server.sh sources setup_deps.sh. The diff rewrites the exact lines the patcher searches for (the `first_layer` single-offset block and the `is_mla = len(self.kv_cache_shape)` sizing), with a stronger geometry-memoized + heterogeneous-TP-aware version, so the patcher's OLD1/OLD2 patterns no longer match and it already no-ops ("pattern not found; skipping") in the real flow. It's also the same fix now upstreamed in vLLM #46039 (READ mixed KV layouts).

Drop the dead patcher and its setup_deps.sh hook so the diff is the single source of truth. patches/README.md only documents the diff (no reference to this patcher), so no README change is needed.
Co-authored-by: Cursor <cursoragent@cursor.com>

* Use upstream nightly image for MiniMax-M3 disagg, drop MoRIIO overlay - Co-work with Gupta, Ravi

All three MoRIIO fixes the in-tree overlay carried have merged upstream and now ship in the ROCm nightly image:
- vLLM #46039 READ-mode mixed KV-layout (axis-aware per-layer offsets)
- vLLM #46290 WRITE-mode per-geometry offset caching
- vLLM #46332 heterogeneous-TP rank mapping + ACK fan-in

Point minimaxm3-fp8-mi355x-vllm-disagg at vllm/vllm-openai-rocm:nightly-556bc4e3a089378e9df2482659898192da18db15 (vLLM 0.23.1rc1.dev363+g556bc4e3a, which contains all three merges) and remove the stop-gap overlay:
- delete patches/moriio/moriio-minimax-m3-disagg.diff
- drop the job.slurm in-container auto-apply block (+ MORIIO_KV_PATCH gate)
- trim the moriio/ section from patches/README.md

Verified on the nightly image with NO patch across all four P/D layouts x conc {1,4,8}, gsm8k strict/flexible 0.95-0.97 (1P8+1D8, 1P4+1D8, 1P4+1D4, 2P4+1D8) -- matching the previously-patched results. Refs #1762.

* fix: append M3 MI355X disagg changelog entry at end of file

The minimaxm3-fp8-mi355x-vllm-disagg entry was inserted mid-file (after the #1862 entry), which violates the append-only changelog gate ("entry 511 changed; existing entries are immutable"). Move it to the end of perf-changelog.yaml so existing entries stay byte-identical to main and the new entry is a clean append.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Chun Fang <chun.fang@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: TianDi101 <ditian12@amd.com>


Note
Low Risk
Changes are limited to benchmark YAML, perf changelog, and CI launch scripting; no application runtime or security-sensitive logic.
Overview
Refreshes the kimik2.5-fp4-gb200-dynamo-vllm sweep in nvidia-master.yaml: bumps the container to vllm/vllm-openai:v0.21.0, retargets CONFIG_FILE entries to new kimi-k2.5-fp4 recipes, and reshapes search spaces (concurrency lists, prefill/decode tp/ep, worker counts) for 1k/1k and 8k/1k disagg topologies.

Adds eight checked-in srt-slurm recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/vllm/kimi-k2.5-fp4/ (Dynamo 1.2.1, Nixl KV transfer, GB200 resource layouts) and documents the change in perf-changelog.yaml.

Updates runners/launch_gb200-nv.sh so kimik2.5 FP4 dynamo-vllm uses the same watchtower/shared-FS staging as minimax, clones srt-slurm on main and overlays the in-repo recipes, instead of the older sa-submission-q2-2026 recipe paths.

Reviewed by Cursor Bugbot for commit a196937.