
[NV] Kimi-K2.5 NVFP4 GB200 dynamo-vllm disagg benchmark refresh #1862

Merged
functionstackx merged 3 commits into main from kimik2.5-fp4-gb200-dynamo-vllm
Jun 21, 2026

@xinli-sw

@xinli-sw xinli-sw commented Jun 19, 2026

Collaborator

Note

Low Risk
Changes are limited to benchmark YAML, perf changelog, and CI launch scripting; no application runtime or security-sensitive logic.

Overview
Refreshes the kimik2.5-fp4-gb200-dynamo-vllm sweep in nvidia-master.yaml: bumps the container to vllm/vllm-openai:v0.21.0, retargets CONFIG_FILE entries to new kimi-k2.5-fp4 recipes, and reshapes search spaces (concurrency lists, prefill/decode tp/ep, worker counts) for 1k/1k and 8k/1k disagg topologies.

Adds eight checked-in srt-slurm recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/vllm/kimi-k2.5-fp4/ (Dynamo 1.2.1, Nixl KV transfer, GB200 resource layouts) and documents the change in perf-changelog.yaml.

Updates runners/launch_gb200-nv.sh so that kimik2.5 FP4 dynamo-vllm uses the same watchtower/shared-FS staging as minimax: it clones srt-slurm on main and overlays the in-repo recipes instead of using the older sa-submission-q2-2026 recipe paths.

Reviewed by Cursor Bugbot for commit a196937. Bugbot is set up for automated code reviews on this repo. Configure here.

@xinli-sw xinli-sw force-pushed the kimik2.5-fp4-gb200-dynamo-vllm branch from 16c2d7e to d57f4a2 Compare June 19, 2026 21:11

Resolve perf-changelog.yaml conflict keeping all three new entries.
Extend squash-dir probe, shared-base probe, python venv pin,
SRTCTL_ROOT override, and INFMAX_WORKSPACE rsync to also apply
when MODEL_PREFIX == kimik2.5.
@functionstackx functionstackx force-pushed the kimik2.5-fp4-gb200-dynamo-vllm branch from eb4cb43 to d54e9e5 Compare June 20, 2026 21:54

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit d54e9e5. Configure here.

compilation-config: '{"cudagraph_mode":"FULL_DECODE_ONLY","custom_ops":["+quant_fp8","+rms_norm","+rotary_embedding"],"pass_config":{"fuse_attn_quant":true,"fuse_allreduce_rms":true}}'
gpu-memory-utilization: 0.9
stream-interval: 50
max-cudagraph-capture-size: 16


Missing decode MoE all2all backend

Medium Severity

In disagg-gb200-1p4d-dep4-tep4.yaml, the decode vllm_config sets enable-expert-parallel: true for Kimi-K2.5 MoE but omits all2all-backend, while the prefill block in the same file and every other new GB200 Kimi FP4 recipe with expert-parallel decode include flashinfer_nvlink_one_sided.
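
The inconsistency flagged here could be caught mechanically. The following is a minimal sketch, assuming the recipe's `vllm_config` block has already been loaded into a plain dict; the function name and the dict layout are hypothetical, and only the two keys `enable-expert-parallel` and `all2all-backend` come from the review comment.

```python
def missing_all2all_backend(vllm_config: dict) -> bool:
    """Return True when expert parallelism is enabled but no all2all backend is named.

    Hypothetical helper: the real recipes are YAML files; this assumes the
    relevant block was already parsed into a dict.
    """
    ep_enabled = bool(vllm_config.get("enable-expert-parallel", False))
    return ep_enabled and not vllm_config.get("all2all-backend")


# Illustrative stand-ins for the decode and prefill blocks described above.
decode_cfg = {"enable-expert-parallel": True, "gpu-memory-utilization": 0.9}
prefill_cfg = {
    "enable-expert-parallel": True,
    "all2all-backend": "flashinfer_nvlink_one_sided",
}

print(missing_all2all_backend(decode_cfg))   # decode block lacks the backend
print(missing_all2all_backend(prefill_cfg))  # prefill block names one
```

Whether an omitted `all2all-backend` is actually harmful depends on the engine's default, which the comment does not state; the check only reports the asymmetry.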


@xinli-sw

Collaborator Author

Hi @kedarpotdar-nv @functionstackx, can you please review and merge?

@xinli-sw

Collaborator Author

/reuse-sweep-run

@functionstackx

Collaborator

/reuse-sweep-run

@functionstackx functionstackx merged commit 6a07901 into main Jun 21, 2026
27 checks passed
@functionstackx functionstackx deleted the kimik2.5-fp4-gb200-dynamo-vllm branch June 21, 2026 19:53
functionstackx added a commit that referenced this pull request Jun 24, 2026
The minimaxm3-fp8-mi355x-vllm-disagg entry was inserted mid-file (after
the #1862 entry), which violates the append-only changelog gate
("entry 511 changed; existing entries are immutable"). Move it to the
end of perf-changelog.yaml so existing entries stay byte-identical to
main and the new entry is a clean append.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
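
The append-only rule this commit message refers to can be illustrated with a short sketch. It assumes the changelog has been parsed into a list of entries; the function name is hypothetical and the repository's actual gate may be implemented differently. Only the error wording is taken from the message quoted above.

```python
def check_append_only(old_entries, new_entries):
    """Return a list of violation messages; an empty list means a clean append.

    Hypothetical sketch of an append-only gate: every entry already on main
    must appear unchanged, in the same position, in the proposed file.
    """
    problems = []
    if len(new_entries) < len(old_entries):
        removed = len(old_entries) - len(new_entries)
        problems.append(f"{removed} existing entries were removed")
    for index, (old, new) in enumerate(zip(old_entries, new_entries)):
        if old != new:
            problems.append(f"entry {index} changed; existing entries are immutable")
            break  # everything after a mid-file insert shifts, so report once
    return problems


main = ["a", "b", "c"]
print(check_append_only(main, ["a", "b", "x", "c"]))  # mid-file insert is flagged
print(check_append_only(main, ["a", "b", "c", "x"]))  # append at the end passes
```

A mid-file insertion shifts every later entry, which is why moving the new entry to the end of the file is enough to satisfy this kind of check.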
functionstackx added a commit that referenced this pull request Jun 24, 2026
* [Klaud Cold] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test

MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the
day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3): 1 prefill (TP8) +
1 decode (TP8) at conc 1, validating the MoRI-IO KV-transfer disagg pipeline
end-to-end for M3.

Layered on the MoRI-IO patch-removal infra (#1585): brings in that PR's
amd_utils changes (setup_deps.sh / server_vllm.sh / submit.sh / models_vllm.yaml
mori -> mori_low_latency) and the two job.slurm hunks (vllm-router image bump
nightly-20260511 -> nightly-20260603, drop VLLM_MORIIO_CONNECTOR_READ_MODE env),
while keeping main's atom-disagg support intact.

Per-worker serve flags (models_vllm.yaml MiniMax-M3-MXFP8): --block-size 128
(MSA), --language-model-only, --kv-cache-dtype fp8, --attention-backend
TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts
TP-sharded as in the single-node M3 TP8 recipe).

perf-changelog.yaml and amd-master.yaml contain only M3 changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* amd_utils/job.slurm: auto-download disagg checkpoint when not pre-staged

The first MI355X disagg sweep (run 27515119215) failed: the day-zero
MiniMax-M3-MXFP8 checkpoint is not staged on the disagg cluster's shared FS, so
job.slurm's model search hit a hard FATAL ("Model 'MiniMax-M3-MXFP8' not found.
Searched: ...") before the engine ever started. The single-node recipes
hf-download inside the serving container, but the disagg path historically
required ops to pre-stage checkpoints.

Add an on-demand fallback to the vllm-disagg model-resolution block: when the
checkpoint isn't found, derive the HF repo id from the hf_dir (models--org--name
-> org/name) and download into MODEL_DIR in HF cache layout, then resolve the
snapshot as MODEL_PATH. Staging into MODEL_DIR keeps MODEL_PATH under the dir
that is bind-mounted into the serving container as /models, so the existing
-v ${MODEL_DIR}:/models mount and DOCKER_MODEL_PATH (/models) remap both resolve.

Implementation notes:
  - The host has no hf CLI, so the download runs in a one-shot container of the
    serving image (DOCKER_IMAGE_NAME), which ships huggingface_hub.
  - flock on a lockfile in MODEL_DIR serializes the prefill/decode nodes; a
    re-check of snapshots/ under the lock makes it idempotent (resumable).
  - hf download with a huggingface-cli fallback; 3 retries; HF_TOKEN passed
    through for gated repos.
  - Scoped to the vllm-disagg branch only; pre-staged models never reach this
    path (the search finds them first), so sglang/atom and existing vLLM disagg
    models (M2.5/Kimi) are unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
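
The repo-id derivation mentioned in this commit (models--org--name -> org/name) can be sketched as follows. This is an illustration only: the function name is hypothetical, and the actual job.slurm logic was shell and was later reverted.

```python
def hf_repo_id_from_cache_dir(hf_dir: str) -> str:
    """Map an HF cache directory name such as 'models--org--name' to 'org/name'.

    Hypothetical helper; accepts either the bare directory name or a full path.
    """
    prefix = "models--"
    name = hf_dir.rstrip("/").split("/")[-1]
    if not name.startswith(prefix):
        raise ValueError(f"not an HF model cache directory: {hf_dir!r}")
    # Split on the first '--' only, so single hyphens inside the repo name survive.
    org, sep, repo = name[len(prefix):].partition("--")
    if not sep or not org or not repo:
        raise ValueError(f"cannot derive org/name from: {hf_dir!r}")
    return f"{org}/{repo}"


print(hf_repo_id_from_cache_dir("models--MiniMaxAI--MiniMax-M3-MXFP8"))
```

Splitting on the first `--` keeps names like `MiniMax-M3-MXFP8` intact, since they contain only single hyphens.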

* job.slurm: --entrypoint "" for the auto-download container

The disagg auto-download reached hf download but failed all 3 attempts: the
one-shot `docker run "$DOCKER_IMAGE_NAME" bash -lc "hf download ..."` did not
override the image ENTRYPOINT, so the vllm-openai API server ran with the bash
command as its args and died with "Failed to infer device type" (no GPU mounted
in the download container). Add --entrypoint "" (as the serving container does)
so bash actually runs hf download.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* M3 disagg: use shared HF cache (/it-share/hf-hub-cache); drop auto-download

Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's
shared HF cache where the ~414 GB MXFP8 checkpoint is already staged
(/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the
launcher default /it-share/data. Scoped to M3 only via the M3 disagg script:

    export MODEL_PATH=/it-share/hf-hub-cache

submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot
under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode
serving containers. Other disagg models keep /it-share/data.

This supersedes the earlier job.slurm auto-download approach, which is reverted:
job.slurm now differs from main only by the #1585 mori-removal hunks (router
image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* disagg #1762: add 8k1k conc-16 row to run an lm-eval (validate correctness)

The conc-1 1k1k smoke test never triggered an eval — the multi-node eval policy
only marks 8k1k entries with conc >= MIN_EVAL_CONC (16). Add an 8k1k conc-16 row
(same 1P TP8 + 1D TP8 layout) so mark_eval_entries marks it run-eval=true
(eval-conc=16), running lm-eval through the MoRI-IO disagg pipeline to validate
correctness. The conc-1 1k1k row stays the latency smoke test.

Run with non-canary-full-sweep-enabled so the (non-min-conc) eval entry runs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
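
The eval-marking rule stated in this commit can be sketched as below. It models only the rule as worded here (8k1k entries with conc >= MIN_EVAL_CONC); the entry layout, the `scenario` key, and the function name are assumptions, and any grouping or one-eval-per-group selection the real mark_eval_entries performs is not modeled.

```python
MIN_EVAL_CONC = 16  # threshold quoted in the commit message


def sketch_mark_eval(entries):
    """Return copies of the entries with a 'run-eval' flag set per the stated rule.

    Hypothetical sketch, not the repository's mark_eval_entries.
    """
    marked = []
    for entry in entries:
        run_eval = entry["scenario"] == "8k1k" and entry["conc"] >= MIN_EVAL_CONC
        marked.append({**entry, "run-eval": run_eval})
    return marked


rows = [
    {"scenario": "1k1k", "conc": 1},   # latency smoke test, never evaluated
    {"scenario": "8k1k", "conc": 16},  # meets both conditions
]
for row in sketch_mark_eval(rows):
    print(row)
```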

* disagg #1762: sweep conc 1,2,4,8,16 (not just conc 1)

Widen the 1k1k disagg latency/throughput sweep from conc 1 to conc 1,2,4,8,16
(1P TP8 + 1D TP8). The 8k1k conc-16 eval row is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* disagg #1762: sweep conc 1,2,4,8,16 at both 1k1k and 8k1k

Widen the disagg sweep from conc 1 to conc 1,2,4,8,16 for both seq-len scenarios
(1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked
(eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Update the vLLM external router container

vllm/vllm-router only retains ~16 recent nightlies on Docker Hub; older
dated tags are garbage-collected (manifest unknown), which makes `docker run`
fail with exit 125 on any node that has not already cached the image.

* M3 disagg: per-layer MoRIIO KV transfer for hybrid sparse-attn (partial)

MiniMax-M3 (MiniMaxM3SparseForCausalLM) is a hybrid sparse-attention model:
sparse layers register a separate lightning-indexer cache (MLAAttentionSpec,
rank-3, bf16, key-only) alongside the main cache (FullAttentionSpec, rank-5,
fp8, K+V). The MoRIIO connector assumes one uniform KV layout -- it derives
block geometry from the first cache and reuses first_layer's offsets for every
layer (see its own "hybrid attn" TODO) -- so the bf16 key-only index cache is
transferred with fp8 K+V sizing and gets corrupted on the decode worker,
producing garbage output (disagg gsm8k ~= 0 while single-node M3 is correct).
This is the vLLM analogue of the SGLang MoRI DSA-state bug in patches/mori_conn.py.

- patches/moriio_heterogeneous_kv.py: compute the READ-path transfer geometry
  per layer (own shape/stride/dtype/rank) instead of from the first cache.
  Idempotent; no-op for homogeneous models.
- setup_deps.sh: apply it on the vllm-disagg path.

NOTE: partial fix -- necessary but not yet sufficient. The index cache is also a
separate KV-cache group whose block-table/num_blocks the single-namespace MoRIIO
connector cannot map, so M3 disagg accuracy is still broken pending a larger
multi-group / index-state transfer change. (Disabling sparse attention is not a
viable workaround: M3's fused QKV carries index_k weights, so dropping the
indexer breaks weight load.)

Refs #1762

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(amd-disagg): add vLLM MoRIIO KV-layout patch to reuse stock minimax-m3 image

The vLLM MoRIIOConnector in vllm/vllm-openai-rocm:minimax-m3 assumes the
FlashAttention KV layout [2, num_blocks, ...] (K/V axis outer) but this
vLLM's backends allocate [num_blocks, 2, ...] (K/V axis inner), so every
disagg block transfer reads the wrong region. Invisible to throughput,
but corrupts GQA/non-MLA accuracy (MiniMax-M3 gsm8k 0.0008 -> 0.957).

Instead of baking a fix into a rebuilt image (-hetkv) or carrying full
vendored copies of the patched files in-tree, carry just the 218-line
unified diff (patches/moriio/moriio-kv-layout-fix.diff) and apply it with
`patch -p1` against the vLLM package dir inside the container at startup,
ahead of the server launch. The repo is already bind-mounted into the
container, so no EXTRA_DOCKER_MOUNTS wiring is needed -- job.slurm
auto-applies the diff when DOCKER_IMAGE_NAME contains "minimax-m3"
(skippable with MORIIO_KV_PATCH=skip), mirroring the existing
mori_conn.py sglang hook. A failed apply aborts the container instead of
silently running unpatched.

Validated on a manual 2-node run (n06-21 prefill+router / n09-21 decode)
using the STOCK image: gsm8k strict-match 0.9568 / flexible-extract
0.9560 (matches the baked image within noise), decode probe healthy.

- patches/moriio/moriio-kv-layout-fix.diff: unified diff vs stock
- job.slurm: in-container `patch` step, MORIIO_KV_PATCH=skip opt-out
- patches/README.md: document the moriio/ diff-apply mechanism

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
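
To see why the two KV layouts in this commit are incompatible, the element offset of a block can be computed under each one. This is a simplified sketch that flattens everything after the first two axes into a single `block_elems` size; the function name and layout labels are hypothetical and it is not the connector's code.

```python
def block_offset(layout, kv_index, block_id, num_blocks, block_elems):
    """Element offset of one K (kv_index=0) or V (kv_index=1) block.

    'kv_outer' models shape [2, num_blocks, ...]; 'kv_inner' models
    shape [num_blocks, 2, ...]. Both labels are illustrative.
    """
    if layout == "kv_outer":
        return (kv_index * num_blocks + block_id) * block_elems
    if layout == "kv_inner":
        return (block_id * 2 + kv_index) * block_elems
    raise ValueError(f"unknown layout: {layout!r}")


# Same logical block (K of block 1), two different memory locations.
print(block_offset("kv_outer", 0, 1, num_blocks=4, block_elems=10))
print(block_offset("kv_inner", 0, 1, num_blocks=4, block_elems=10))
```

Only block 0's K data lands at the same offset in both layouts, so a transfer that assumes the wrong layout still moves the expected number of bytes. That is consistent with the commit's note that the bug is invisible to throughput.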

* disagg #1762: extend conc sweep to 32,64,128,256,512,1024 at 1k1k and 8k1k

Widen the disagg sweep from conc 1,2,4,8,16 to
1,2,4,8,16,32,64,128,256,512,1024 for both seq-len scenarios (1P TP8 + 1D
TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16)
so lm-eval still validates the MoRI-IO disagg pipeline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* disagg #1762: add TP4-prefill P/D layouts (TP4+TP8, TP4+TP4) at 1k1k and 8k1k

Add two asymmetric prefill/decode layouts alongside the existing TP8+TP8 sweep,
for both seq-len scenarios:
  - 1P TP4 + 1D TP8 (smaller prefill, full-node decode) at conc 1..256
  - 1P TP4 + 1D TP4 (balanced half-node) at conc 64..1024

Per-worker TP is driven by the master-config prefill/decode tp: server_vllm.sh
sed-rewrites the models_vllm.yaml --tensor-parallel-size 8 placeholder to the
computed PREFILL_TP_SIZE/DECODE_TP_SIZE, so no models_vllm.yaml flag change is
needed (comment updated to say so). The multinode eval policy still marks exactly
one lm-eval (groups by dp-attn, not TP) on the TP8+TP8 8k1k layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
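
The placeholder rewrite described in this commit is done with sed in server_vllm.sh. An equivalent substitution is sketched here in Python for illustration; the function name and the sample flag string are assumptions.

```python
import re


def rewrite_tp(serve_flags: str, tp_size: int) -> str:
    """Replace the value of --tensor-parallel-size in a flag string.

    Hypothetical stand-in for the sed rewrite; every other flag is untouched.
    """
    return re.sub(
        r"--tensor-parallel-size\s+\d+",
        f"--tensor-parallel-size {tp_size}",
        serve_flags,
    )


flags = "--block-size 128 --tensor-parallel-size 8 --kv-cache-dtype fp8"
print(rewrite_tp(flags, 4))  # value a TP4 prefill worker would receive
```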

* feat(amd-disagg): bundle heterogeneous-TP + dup-ack fixes into unified MoRIIO diff

Replaces moriio-kv-layout-fix.diff with moriio-minimax-m3-disagg.diff, which
bundles three layered fixes for the stock minimax-m3 vLLM image:
1. KV-layout: axis-aware per-layer block offsets (the gsm8k 0.0008→0.958 fix,
   required for homogeneous TP too).
2. heterogeneous-TP addressing + guard: maps each decode rank to the correct
   prefill rank (tp_rank // ratio) for PREFILL_TP_SIZE != DECODE_TP_SIZE, and
   raises NotImplementedError for unsupported cases (prefill-TP > decode-TP,
   KV-head splitting) instead of silently corrupting KV.
3. dup-ack fan-in: with DECODE_TP_SIZE > PREFILL_TP_SIZE, producer counts ACKs
   per transfer_id and only frees KV blocks once all expected consumers ACK,
   preventing both the late-ACK EngineCore crash and KV reuse before slower
   decode ranks finish reading.

job.slurm and patches/README.md updated to reference the new diff name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
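
The ACK fan-in in item 3 can be pictured as a per-transfer counter. The sketch below is not the diff's implementation: the class, its method names, and the choice to track completed transfers so that late ACKs are ignored are all assumptions made for illustration.

```python
class AckFanIn:
    """Count ACKs per transfer_id and report when KV blocks may be freed.

    Hypothetical sketch: expected_consumers would be the number of decode
    ranks that read from one prefill rank.
    """

    def __init__(self, expected_consumers: int):
        self.expected = expected_consumers
        self._pending = {}   # transfer_id -> set of consumer ranks seen so far
        self._done = set()   # transfer_ids whose blocks were already freed

    def on_ack(self, transfer_id: str, consumer_rank: int) -> bool:
        """Return True exactly once, when the last expected consumer ACKs."""
        if transfer_id in self._done:
            return False  # late or duplicate ACK after the free: ignore it
        ranks = self._pending.setdefault(transfer_id, set())
        ranks.add(consumer_rank)  # a set makes duplicate ACKs harmless
        if len(ranks) == self.expected:
            del self._pending[transfer_id]
            self._done.add(transfer_id)
            return True
        return False


fan_in = AckFanIn(expected_consumers=2)
print(fan_in.on_ack("t1", 0))  # first consumer: blocks must stay allocated
print(fan_in.on_ack("t1", 1))  # second consumer: safe to free
```

A production version would also need to bound the completed-transfer set, which this sketch leaves to grow.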

* fix(moriio): correct _remote_tp_rank for prefill-TP > decode-TP (P8/D4)

With P8/D4 and 4 KV heads, vLLM distributes heads across prefill ranks
in consecutive pairs: (rank0,rank1)→head0, (rank2,rank3)→head1, etc.
The previous patch used `return self.tp_rank` for the P>D branch, which
made decode rank 1 connect to prefill rank 1 (holds head0) instead of
prefill rank 2 (holds head1) — corrupting KV for all decode ranks except 0.

Fix: use `self.tp_rank * ratio` (ratio = remote_tp_size // local_tp_size),
the symmetric counterpart to the D>P case's `tp_rank // ratio`. This maps
each decode rank to the *first* prefill rank of its head group, which holds
the correct KV content via vLLM's replication scheme.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
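
The two rank-mapping rules named in this fix (`tp_rank // ratio` for D>P and `tp_rank * ratio` for P>D) can be put side by side in a standalone sketch. The free-function form, the argument names, and the divisibility check are assumptions; the patched method lives on the connector and reads these values from `self`.

```python
def remote_tp_rank(tp_rank: int, local_tp_size: int, remote_tp_size: int) -> int:
    """Pick the prefill (remote) rank a decode (local) rank should read from.

    Hypothetical standalone version of the mapping described in the commit.
    """
    big, small = max(local_tp_size, remote_tp_size), min(local_tp_size, remote_tp_size)
    if big % small != 0:
        raise ValueError("TP sizes must divide evenly")
    if local_tp_size >= remote_tp_size:
        # D >= P: several decode ranks share one prefill rank.
        return tp_rank // (local_tp_size // remote_tp_size)
    # P > D: read from the first prefill rank of the matching head group.
    return tp_rank * (remote_tp_size // local_tp_size)


# P8/D4: decode rank 1 must read prefill rank 2, not prefill rank 1.
print(remote_tp_rank(1, local_tp_size=4, remote_tp_size=8))
```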

* fix(moriio-diff): correct hunk header count after _remote_tp_rank expansion

The P>D fix added 4 lines to _remote_tp_rank but the hunk header still
said +1100,40; patch aborted with "malformed patch at line 79". Update
to +1100,44 to match the actual 6 context + 38 added lines.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
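
The count in a unified-diff hunk header can be recomputed from the hunk body, which is one way to catch this class of error before `patch` does. A minimal sketch with a hypothetical function name:

```python
def hunk_lengths(body_lines):
    """Return (old_len, new_len) for the body of one unified-diff hunk.

    Context lines count toward both sides, '-' lines toward the old side,
    '+' lines toward the new side.
    """
    old_len = new_len = 0
    for line in body_lines:
        if line.startswith("\\"):
            continue  # "\ No newline at end of file" marker, not content
        if line.startswith("+"):
            new_len += 1
        elif line.startswith("-"):
            old_len += 1
        else:
            old_len += 1
            new_len += 1
    return old_len, new_len


# The case from the commit: 6 context lines plus 38 added lines.
body = [" context"] * 6 + ["+added"] * 38
print(hunk_lengths(body))
```

The second value is what belongs after the comma on the `+` side of the `@@` header.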

* fix(amd-disagg): keep MoRIIO patch cmd inside container bash -lc quotes

The MoRIIO KV-layout patch was injected into the per-node container launch
via '"${_MORIIO_PATCH_CMD:-}"', which breaks out of the outer
srun bash -c "..." double-quoted string. Because the patch command value
contains spaces and the shell operators '<' and '||', the unquoted
expansion word-split the generated container script, truncating it right
after the word `patch` and silently dropping the patch arguments AND the
server.sh launch. The container then exited 0:0 within seconds, producing
no benchmark/eval output -> collect_latest_results found "No logs
directory" -> the launch step failed with exit 1 (all minimax-m3 disagg
jobs affected).

Fix: expand ${_MORIIO_PATCH_CMD:-} directly inside the inner bash -lc
single quotes (no quote toggling), so the patch command stays intact and
its operators are parsed by the container shell. Validated end-to-end:
gsm8k recovers from ~0 (garbage) to 0.94-0.98 across P8D8/P4D8/P8D4.

Co-authored-by: Cursor <cursoragent@cursor.com>

* disagg #1762: add 2P TP4 + 1D TP8 layout at conc 256,512,768,1024 (1k1k & 8k1k)

Two TP4 prefill workers (num-worker 2, PREFILL_NODES=2, each TP4 on half an
8-GPU node) feeding one TP8 decode (DECODE_NODES=1) — 3 nodes total. Added to
both seq-len scenarios at conc 256,512,768,1024. Eval marking unchanged (still
one lm-eval on the 8k1k TP8+TP8 layout).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(amd-disagg): remove redundant moriio_heterogeneous_kv.py patcher

The per-layer READ-offset fix this Python patcher applied to
moriio_connector.py is fully subsumed by the unified overlay
patches/moriio/moriio-minimax-m3-disagg.diff, which job.slurm applies
with `patch -p1` BEFORE server.sh sources setup_deps.sh. The diff
rewrites the exact lines the patcher searches for (the `first_layer`
single-offset block and the `is_mla = len(self.kv_cache_shape)` sizing),
with a stronger geometry-memoized + heterogeneous-TP-aware version, so
the patcher's OLD1/OLD2 patterns no longer match and it already no-ops
("pattern not found; skipping") in the real flow. It's also the same
fix now upstreamed in vLLM #46039 (READ mixed KV layouts).

Drop the dead patcher and its setup_deps.sh hook so the diff is the
single source of truth. patches/README.md only documents the diff (no
reference to this patcher), so no README change is needed.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Use upstream nightly image for MiniMax-M3 disagg, drop MoRIIO overlay

- Co-work with Gupta, Ravi

All three MoRIIO fixes the in-tree overlay carried have merged upstream and now
ship in the ROCm nightly image:
  - vLLM #46039  READ-mode mixed KV-layout (axis-aware per-layer offsets)
  - vLLM #46290  WRITE-mode per-geometry offset caching
  - vLLM #46332  heterogeneous-TP rank mapping + ACK fan-in

Point minimaxm3-fp8-mi355x-vllm-disagg at
vllm/vllm-openai-rocm:nightly-556bc4e3a089378e9df2482659898192da18db15
(vLLM 0.23.1rc1.dev363+g556bc4e3a, which contains all three merges) and remove
the stop-gap overlay:
  - delete patches/moriio/moriio-minimax-m3-disagg.diff
  - drop the job.slurm in-container auto-apply block (+ MORIIO_KV_PATCH gate)
  - trim the moriio/ section from patches/README.md

Verified on the nightly image with NO patch across all four P/D layouts x
conc {1,4,8}, gsm8k strict/flexible 0.95-0.97 (1P8+1D8, 1P4+1D8, 1P4+1D4,
2P4+1D8) -- matching the previously-patched results.

Refs #1762.

* fix: append M3 MI355X disagg changelog entry at end of file

The minimaxm3-fp8-mi355x-vllm-disagg entry was inserted mid-file (after
the #1862 entry), which violates the append-only changelog gate
("entry 511 changed; existing entries are immutable"). Move it to the
end of perf-changelog.yaml so existing entries stay byte-identical to
main and the new entry is a clean append.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Chun Fang <chun.fang@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: TianDi101 <ditian12@amd.com>