[Fix] Remove MoRI-IO patches from vLLM Disagg benchmarks #1585
simondanielsson wants to merge 11 commits into
…m image Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work! Thank you
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
908939c to 89b9243
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
image: vllm/vllm-openai-rocm:nightly-bf610c2f56764e1b30bc6065f4ceace3d6e59036
# TODO(simondanielsson): change to pinned version once https://github.com/vllm-project/vllm/pull/40344
# is part of official release, likely 0.22.0.
image: vllm/vllm-openai-rocm:nightly
Can you pin a specific nightly hash here instead of just the generic "nightly"?
We can! The caveat (and why I didn't pin it here) is that pinned nightlies are pruned from Docker Hub after 14 days, so the config would stop working within two weeks. WDYT?
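The trade-off in this thread (a floating `nightly` tag drifts over time, while a hash-pinned nightly is pruned after about 14 days) could be surfaced by a small config lint. The following is a minimal sketch, assuming image tags look like the ones quoted in this PR; `classify_image_tag` and its category names are hypothetical and not part of the repository.

```python
import re

# Hypothetical helper (not from the repo): classify a container image
# reference by how reproducible its tag is. The tag shapes are assumptions
# based on the examples quoted in this PR.
_PINNED_NIGHTLY = re.compile(r"^nightly-[0-9a-f]{7,40}$")


def classify_image_tag(image: str) -> str:
    """Return 'floating', 'pinned-nightly', or 'other' for an image reference."""
    _, sep, tag = image.rpartition(":")
    # No tag at all (or the colon belonged to a registry port) means the
    # reference implicitly resolves to a moving default tag.
    if not sep or "/" in tag:
        return "floating"
    if tag in ("nightly", "latest"):
        return "floating"
    if _PINNED_NIGHTLY.match(tag):
        return "pinned-nightly"
    return "other"


if __name__ == "__main__":
    for ref in (
        "vllm/vllm-openai-rocm:nightly",
        "vllm/vllm-openai-rocm:nightly-bf610c2f56764e1b30bc6065f4ceace3d6e59036",
    ):
        print(ref, "->", classify_image_tag(ref))
```

A lint like this only flags the tag style; it cannot tell whether a pinned nightly has already been pruned from the registry.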
minimaxm2.5-fp8-mi355x-vllm-disagg:
  image: vllm/vllm-openai-rocm:nightly-a6682d1d259cca69a9ae737ea5608fbbe7520031
  # TODO(simondanielsson): change to pinned version once https://github.com/vllm-project/vllm/pull/40344
Can you pin a specific nightly hash here instead of just the generic "nightly"?
Claude finished @simondanielsson's task in 2m 25s
PR Review: Remove MoRI-IO patches from vLLM Disagg benchmarks
Summary
Clean removal of ~555 lines of MoRI-IO monkey patches now that they've been upstreamed in vllm#40344. The three changes (backend rename …
One blocking issue: The unpinned …
Run is succeeding, and perf is very similar to the existing Kimi vllm-disagg results (with some noise, of course).
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 3c94a6f.
functionstackx left a comment:
@simondanielsson is there a validated run? If so, LGTM.
…ch-free nightly
Brings the vLLM-disagg infra onto the upstream-MoRIIO nightly so the large setup_deps.sh runtime patches are dropped (vllm#40344), and migrates the new dsv4-fp4-mi355x-vllm-disagg recipe to match:
- image -> vllm/vllm-openai-rocm:nightly-3f0a91bb (carries #40344 + DeepseekV4); not available in v0.22.0/v0.22.1 release tags
- drop VLLM_MORIIO_CONNECTOR_READ_MODE env setting (read_mode now set via kv_connector_extra_config in server_vllm.sh)
- dsv4 is TP8/EP1 so no all2all backend / mori_low_latency rename needed
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
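The migration steps listed in this commit message (rename the all2all backend where one is set, drop the read-mode environment variable) can be sketched as a pure function over a flag list and an environment mapping. This is an illustration only: `migrate_serve_config` is a hypothetical name, and the real change edits models_vllm.yaml and the shell scripts directly rather than going through a helper like this.

```python
from typing import Dict, List, Tuple

# Names taken from the PR text; the helper itself is a hypothetical sketch.
OLD_BACKEND = "mori"
NEW_BACKEND = "mori_low_latency"
READ_MODE_ENV = "VLLM_MORIIO_CONNECTOR_READ_MODE"


def migrate_serve_config(
    args: List[str], env: Dict[str, str]
) -> Tuple[List[str], Dict[str, str]]:
    """Return copies of args/env with the backend renamed and the env var dropped."""
    new_args = list(args)
    # Rename only the value that directly follows --all2all-backend.
    for i in range(len(new_args) - 1):
        if new_args[i] == "--all2all-backend" and new_args[i + 1] == OLD_BACKEND:
            new_args[i + 1] = NEW_BACKEND
    # Read mode is no longer passed through the environment.
    new_env = {k: v for k, v in env.items() if k != READ_MODE_ENV}
    return new_args, new_env


if __name__ == "__main__":
    args, env = migrate_serve_config(
        ["--tensor-parallel-size", "8", "--all2all-backend", "mori"],
        {READ_MODE_ENV: "1", "HF_TOKEN": "example"},
    )
    print(args, env)
```

A recipe without an all2all backend flag, such as the TP8/EP1 case mentioned above, passes through with its arguments unchanged.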
Before merge I will
EDIT: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27334107848/job/80753697924
Thanks @simondanielsson
Appreciate your help on getting the first-class MoRI collective working @simondanielsson! <3 With the high-conc hanging fixed, the experience for end users will be much better!
…agg-patches Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Remove the kimik2.5/minimaxm2.5 vllm-disagg changelog entry (that change is documented in #1585) and scrub kimi/minimax references from the dsv4-fp4-mi355x-vllm-disagg entry descriptions. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…g smoke test
MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3): 1 prefill (TP8) + 1 decode (TP8) at conc 1, validating the MoRI-IO KV-transfer disagg pipeline end-to-end for M3.
Layered on the MoRI-IO patch-removal infra (#1585): brings in that PR's amd_utils changes (setup_deps.sh / server_vllm.sh / submit.sh / models_vllm.yaml mori -> mori_low_latency) and the two job.slurm hunks (vllm-router image bump nightly-20260511 -> nightly-20260603, drop VLLM_MORIIO_CONNECTOR_READ_MODE env), while keeping main's atom-disagg support intact.
Per-worker serve flags (models_vllm.yaml MiniMax-M3-MXFP8): --block-size 128 (MSA), --language-model-only, --kv-cache-dtype fp8, --attention-backend TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts TP-sharded as in the single-node M3 TP8 recipe).
perf-changelog.yaml and amd-master.yaml contain only M3 changes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…wnload
Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's
shared HF cache where the ~414 GB MXFP8 checkpoint is already staged
(/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the
launcher default /it-share/data. Scoped to M3 only via the M3 disagg script:
export MODEL_PATH=/it-share/hf-hub-cache
submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot
under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode
serving containers. Other disagg models keep /it-share/data.
This supersedes the earlier job.slurm auto-download approach, which is reverted:
job.slurm now differs from main only by the #1585 mori-removal hunks (router
image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
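The snapshot resolution this commit message refers to relies on the Hugging Face cache layout (`models--<org>--<name>/snapshots/<revision>`). The sketch below shows one way such a lookup could work. It is not the logic in job.slurm, which is not shown here; the function names and the choice of the lexicographically first snapshot are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def repo_id_from_hf_dir(hf_dir: str) -> str:
    """Map an HF cache directory name (models--org--name) to a repo id (org/name)."""
    prefix = "models--"
    if not hf_dir.startswith(prefix):
        raise ValueError(f"not an HF cache model directory: {hf_dir!r}")
    # Split once so a '--' inside the model name itself is left untouched.
    org, sep, name = hf_dir[len(prefix):].partition("--")
    if not sep or not org or not name:
        raise ValueError(f"cannot derive a repo id from {hf_dir!r}")
    return f"{org}/{name}"


def resolve_snapshot(model_dir: Path, hf_dir: str) -> Optional[Path]:
    """Return a snapshot directory under model_dir/hf_dir/snapshots, if one exists."""
    snapshots = model_dir / hf_dir / "snapshots"
    if not snapshots.is_dir():
        return None
    # Assumption: any snapshot is acceptable; pick the first by name so the
    # result is deterministic.
    candidates = sorted(p for p in snapshots.iterdir() if p.is_dir())
    return candidates[0] if candidates else None


if __name__ == "__main__":
    hf_dir = "models--MiniMaxAI--MiniMax-M3-MXFP8"
    print(repo_id_from_hf_dir(hf_dir))
    print(resolve_snapshot(Path("/it-share/hf-hub-cache"), hf_dir))
```

Returning `None` instead of raising leaves the caller free to decide whether a missing checkpoint is fatal, which matches the "pre-staged or fail" behaviour the message describes for the disagg path.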
Run succeeded: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27334107848
cc: @chunfangamd, this should be ready for merge.
…agg-patches Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* [Klaud Cold] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test
MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3): 1 prefill (TP8) + 1 decode (TP8) at conc 1, validating the MoRI-IO KV-transfer disagg pipeline end-to-end for M3.
Layered on the MoRI-IO patch-removal infra (#1585): brings in that PR's amd_utils changes (setup_deps.sh / server_vllm.sh / submit.sh / models_vllm.yaml mori -> mori_low_latency) and the two job.slurm hunks (vllm-router image bump nightly-20260511 -> nightly-20260603, drop VLLM_MORIIO_CONNECTOR_READ_MODE env), while keeping main's atom-disagg support intact.
Per-worker serve flags (models_vllm.yaml MiniMax-M3-MXFP8): --block-size 128 (MSA), --language-model-only, --kv-cache-dtype fp8, --attention-backend TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts TP-sharded as in the single-node M3 TP8 recipe).
perf-changelog.yaml and amd-master.yaml contain only M3 changes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* amd_utils/job.slurm: auto-download disagg checkpoint when not pre-staged
The first MI355X disagg sweep (run 27515119215) failed: the day-zero MiniMax-M3-MXFP8 checkpoint is not staged on the disagg cluster's shared FS, so job.slurm's model search hit a hard FATAL ("Model 'MiniMax-M3-MXFP8' not found. Searched: ...") before the engine ever started. The single-node recipes hf-download inside the serving container, but the disagg path historically required ops to pre-stage checkpoints.
Add an on-demand fallback to the vllm-disagg model-resolution block: when the checkpoint isn't found, derive the HF repo id from the hf_dir (models--org--name -> org/name) and download into MODEL_DIR in HF cache layout, then resolve the snapshot as MODEL_PATH.
Staging into MODEL_DIR keeps MODEL_PATH under the dir that is bind-mounted into the serving container as /models, so the existing -v ${MODEL_DIR}:/models mount and DOCKER_MODEL_PATH (/models) remap both resolve.
Implementation notes:
- The host has no hf CLI, so the download runs in a one-shot container of the serving image (DOCKER_IMAGE_NAME), which ships huggingface_hub.
- flock on a lockfile in MODEL_DIR serializes the prefill/decode nodes; a re-check of snapshots/ under the lock makes it idempotent (resumable).
- hf download with a huggingface-cli fallback; 3 retries; HF_TOKEN passed through for gated repos.
- Scoped to the vllm-disagg branch only; pre-staged models never reach this path (the search finds them first), so sglang/atom and existing vLLM disagg models (M2.5/Kimi) are unaffected.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* job.slurm: --entrypoint "" for the auto-download container
The disagg auto-download reached hf download but failed all 3 attempts: the one-shot `docker run "$DOCKER_IMAGE_NAME" bash -lc "hf download ..."` did not override the image ENTRYPOINT, so the vllm-openai API server ran with the bash command as its args and died with "Failed to infer device type" (no GPU mounted in the download container). Add --entrypoint "" (as the serving container does) so bash actually runs hf download.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* M3 disagg: use shared HF cache (/it-share/hf-hub-cache); drop auto-download
Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's shared HF cache where the ~414 GB MXFP8 checkpoint is already staged (/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the launcher default /it-share/data.
Scoped to M3 only via the M3 disagg script:
export MODEL_PATH=/it-share/hf-hub-cache
submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode serving containers. Other disagg models keep /it-share/data.
This supersedes the earlier job.slurm auto-download approach, which is reverted: job.slurm now differs from main only by the #1585 mori-removal hunks (router image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* disagg #1762: add 8k1k conc-16 row to run an lm-eval (validate correctness)
The conc-1 1k1k smoke test never triggered an eval: the multi-node eval policy only marks 8k1k entries with conc >= MIN_EVAL_CONC (16). Add an 8k1k conc-16 row (same 1P TP8 + 1D TP8 layout) so mark_eval_entries marks it run-eval=true (eval-conc=16), running lm-eval through the MoRI-IO disagg pipeline to validate correctness. The conc-1 1k1k row stays the latency smoke test. Run with non-canary-full-sweep-enabled so the (non-min-conc) eval entry runs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* disagg #1762: sweep conc 1,2,4,8,16 (not just conc 1)
Widen the 1k1k disagg latency/throughput sweep from conc 1 to conc 1,2,4,8,16 (1P TP8 + 1D TP8). The 8k1k conc-16 eval row is unchanged.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* disagg #1762: sweep conc 1,2,4,8,16 at both 1k1k and 8k1k
Widen the disagg sweep from conc 1 to conc 1,2,4,8,16 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* Update the vLLM external router container
vllm/vllm-router only retains ~16 recent nightlies on Docker Hub; older dated tags are garbage-collected (manifest unknown), which makes `docker run` fail with exit 125 on any node that has not already cached the image.
* M3 disagg: per-layer MoRIIO KV transfer for hybrid sparse-attn (partial)
MiniMax-M3 (MiniMaxM3SparseForCausalLM) is a hybrid sparse-attention model: sparse layers register a separate lightning-indexer cache (MLAAttentionSpec, rank-3, bf16, key-only) alongside the main cache (FullAttentionSpec, rank-5, fp8, K+V). The MoRIIO connector assumes one uniform KV layout -- it derives block geometry from the first cache and reuses first_layer's offsets for every layer (see its own "hybrid attn" TODO) -- so the bf16 key-only index cache is transferred with fp8 K+V sizing and gets corrupted on the decode worker, producing garbage output (disagg gsm8k ~= 0 while single-node M3 is correct). This is the vLLM analogue of the SGLang MoRI DSA-state bug in patches/mori_conn.py.
- patches/moriio_heterogeneous_kv.py: compute the READ-path transfer geometry per layer (own shape/stride/dtype/rank) instead of from the first cache. Idempotent; no-op for homogeneous models.
- setup_deps.sh: apply it on the vllm-disagg path.
NOTE: partial fix -- necessary but not yet sufficient. The index cache is also a separate KV-cache group whose block-table/num_blocks the single-namespace MoRIIO connector cannot map, so M3 disagg accuracy is still broken pending a larger multi-group / index-state transfer change. (Disabling sparse attention is not a viable workaround: M3's fused QKV carries index_k weights, so dropping the indexer breaks weight load.)
Refs #1762
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(amd-disagg): add vLLM MoRIIO KV-layout patch to reuse stock minimax-m3 image
The vLLM MoRIIOConnector in vllm/vllm-openai-rocm:minimax-m3 assumes the FlashAttention KV layout [2, num_blocks, ...] (K/V axis outer) but this vLLM's backends allocate [num_blocks, 2, ...] (K/V axis inner), so every disagg block transfer reads the wrong region. Invisible to throughput, but corrupts GQA/non-MLA accuracy (MiniMax-M3 gsm8k 0.0008 -> 0.957).
Instead of baking a fix into a rebuilt image (-hetkv) or carrying full vendored copies of the patched files in-tree, carry just the 218-line unified diff (patches/moriio/moriio-kv-layout-fix.diff) and apply it with `patch -p1` against the vLLM package dir inside the container at startup, ahead of the server launch. The repo is already bind-mounted into the container, so no EXTRA_DOCKER_MOUNTS wiring is needed -- job.slurm auto-applies the diff when DOCKER_IMAGE_NAME contains "minimax-m3" (skippable with MORIIO_KV_PATCH=skip), mirroring the existing mori_conn.py sglang hook. A failed apply aborts the container instead of silently running unpatched.
Validated on a manual 2-node run (n06-21 prefill+router / n09-21 decode) using the STOCK image: gsm8k strict-match 0.9568 / flexible-extract 0.9560 (matches the baked image within noise), decode probe healthy.
- patches/moriio/moriio-kv-layout-fix.diff: unified diff vs stock
- job.slurm: in-container `patch` step, MORIIO_KV_PATCH=skip opt-out
- patches/README.md: document the moriio/ diff-apply mechanism
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* disagg #1762: extend conc sweep to 32,64,128,256,512,1024 at 1k1k and 8k1k
Widen the disagg sweep from conc 1,2,4,8,16 to 1,2,4,8,16,32,64,128,256,512,1024 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* disagg #1762: add TP4-prefill P/D layouts (TP4+TP8, TP4+TP4) at 1k1k and 8k1k
Add two asymmetric prefill/decode layouts alongside the existing TP8+TP8 sweep, for both seq-len scenarios:
- 1P TP4 + 1D TP8 (smaller prefill, full-node decode) at conc 1..256
- 1P TP4 + 1D TP4 (balanced half-node) at conc 64..1024
Per-worker TP is driven by the master-config prefill/decode tp: server_vllm.sh sed-rewrites the models_vllm.yaml --tensor-parallel-size 8 placeholder to the computed PREFILL_TP_SIZE/DECODE_TP_SIZE, so no models_vllm.yaml flag change is needed (comment updated to say so). The multinode eval policy still marks exactly one lm-eval (groups by dp-attn, not TP) on the TP8+TP8 8k1k layout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(amd-disagg): bundle heterogeneous-TP + dup-ack fixes into unified MoRIIO diff
Replaces moriio-kv-layout-fix.diff with moriio-minimax-m3-disagg.diff, which bundles three layered fixes for the stock minimax-m3 vLLM image:
1. KV-layout: axis-aware per-layer block offsets (the gsm8k 0.0008→0.958 fix, required for homogeneous TP too).
2. heterogeneous-TP addressing + guard: maps each decode rank to the correct prefill rank (tp_rank // ratio) for PREFILL_TP_SIZE != DECODE_TP_SIZE, and raises NotImplementedError for unsupported cases (prefill-TP > decode-TP, KV-head splitting) instead of silently corrupting KV.
3. dup-ack fan-in: with DECODE_TP_SIZE > PREFILL_TP_SIZE, producer counts ACKs per transfer_id and only frees KV blocks once all expected consumers ACK, preventing both the late-ACK EngineCore crash and KV reuse before slower decode ranks finish reading.
job.slurm and patches/README.md updated to reference the new diff name.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(moriio): correct _remote_tp_rank for prefill-TP > decode-TP (P8/D4)
With P8/D4 and 4 KV heads, vLLM distributes heads across prefill ranks in consecutive pairs: (rank0,rank1)→head0, (rank2,rank3)→head1, etc. The previous patch used `return self.tp_rank` for the P>D branch, which made decode rank 1 connect to prefill rank 1 (holds head0) instead of prefill rank 2 (holds head1), corrupting KV for all decode ranks except 0.
Fix: use `self.tp_rank * ratio` (ratio = remote_tp_size // local_tp_size), the symmetric counterpart to the D>P case's `tp_rank // ratio`. This maps each decode rank to the *first* prefill rank of its head group, which holds the correct KV content via vLLM's replication scheme.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(moriio-diff): correct hunk header count after _remote_tp_rank expansion
The P>D fix added 4 lines to _remote_tp_rank but the hunk header still said +1100,40; patch aborted with "malformed patch at line 79". Update to +1100,44 to match the actual 6 context + 38 added lines.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(amd-disagg): keep MoRIIO patch cmd inside container bash -lc quotes
The MoRIIO KV-layout patch was injected into the per-node container launch via '"${_MORIIO_PATCH_CMD:-}"', which breaks out of the outer srun bash -c "..." double-quoted string. Because the patch command value contains spaces and the shell operators '<' and '||', the unquoted expansion word-split the generated container script, truncating it right after the word `patch` and silently dropping the patch arguments AND the server.sh launch. The container then exited 0:0 within seconds, producing no benchmark/eval output -> collect_latest_results found "No logs directory" -> the launch step failed with exit 1 (all minimax-m3 disagg jobs affected).
Fix: expand ${_MORIIO_PATCH_CMD:-} directly inside the inner bash -lc single quotes (no quote toggling), so the patch command stays intact and its operators are parsed by the container shell. Validated end-to-end: gsm8k recovers from ~0 (garbage) to 0.94-0.98 across P8D8/P4D8/P8D4.
Co-authored-by: Cursor <cursoragent@cursor.com>
* disagg #1762: add 2P TP4 + 1D TP8 layout at conc 256,512,768,1024 (1k1k & 8k1k)
Two TP4 prefill workers (num-worker 2, PREFILL_NODES=2, each TP4 on half an 8-GPU node) feeding one TP8 decode (DECODE_NODES=1); 3 nodes total. Added to both seq-len scenarios at conc 256,512,768,1024. Eval marking unchanged (still one lm-eval on the 8k1k TP8+TP8 layout).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(amd-disagg): remove redundant moriio_heterogeneous_kv.py patcher
The per-layer READ-offset fix this Python patcher applied to moriio_connector.py is fully subsumed by the unified overlay patches/moriio/moriio-minimax-m3-disagg.diff, which job.slurm applies with `patch -p1` BEFORE server.sh sources setup_deps.sh. The diff rewrites the exact lines the patcher searches for (the `first_layer` single-offset block and the `is_mla = len(self.kv_cache_shape)` sizing), with a stronger geometry-memoized + heterogeneous-TP-aware version, so the patcher's OLD1/OLD2 patterns no longer match and it already no-ops ("pattern not found; skipping") in the real flow. It's also the same fix now upstreamed in vLLM #46039 (READ mixed KV layouts).
Drop the dead patcher and its setup_deps.sh hook so the diff is the single source of truth. patches/README.md only documents the diff (no reference to this patcher), so no README change is needed.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Use upstream nightly image for MiniMax-M3 disagg, drop MoRIIO overlay - Co-work with Gupta, Ravi
All three MoRIIO fixes the in-tree overlay carried have merged upstream and now ship in the ROCm nightly image:
- vLLM #46039 READ-mode mixed KV-layout (axis-aware per-layer offsets)
- vLLM #46290 WRITE-mode per-geometry offset caching
- vLLM #46332 heterogeneous-TP rank mapping + ACK fan-in
Point minimaxm3-fp8-mi355x-vllm-disagg at vllm/vllm-openai-rocm:nightly-556bc4e3a089378e9df2482659898192da18db15 (vLLM 0.23.1rc1.dev363+g556bc4e3a, which contains all three merges) and remove the stop-gap overlay:
- delete patches/moriio/moriio-minimax-m3-disagg.diff
- drop the job.slurm in-container auto-apply block (+ MORIIO_KV_PATCH gate)
- trim the moriio/ section from patches/README.md
Verified on the nightly image with NO patch across all four P/D layouts x conc {1,4,8}, gsm8k strict/flexible 0.95-0.97 (1P8+1D8, 1P4+1D8, 1P4+1D4, 2P4+1D8) -- matching the previously-patched results. Refs #1762.
* fix: append M3 MI355X disagg changelog entry at end of file
The minimaxm3-fp8-mi355x-vllm-disagg entry was inserted mid-file (after the #1862 entry), which violates the append-only changelog gate ("entry 511 changed; existing entries are immutable"). Move it to the end of perf-changelog.yaml so existing entries stay byte-identical to main and the new entry is a clean append.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Chun Fang <chun.fang@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: TianDi101 <ditian12@amd.com>
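The heterogeneous-TP rank mapping described in the commit messages above (`tp_rank // ratio` when decode TP exceeds prefill TP, `tp_rank * ratio` in the opposite case) can be sketched as a stand-alone function. This is an illustration of the stated arithmetic, not vLLM's `_remote_tp_rank`; the function name, the signature, and the divisibility check are assumptions.

```python
def remote_tp_rank(local_rank: int, local_tp: int, remote_tp: int) -> int:
    """Pick the prefill (remote) TP rank a decode (local) TP rank reads from.

    Sketch of the mapping described in the commit messages:
      * equal TP sizes: ranks map one to one;
      * decode TP > prefill TP: several decode ranks share one prefill rank,
        so use local_rank // ratio;
      * prefill TP > decode TP: read from the first prefill rank of the
        matching head group, so use local_rank * ratio.
    """
    if not 0 <= local_rank < local_tp:
        raise ValueError("local_rank must be in [0, local_tp)")
    big, small = max(local_tp, remote_tp), min(local_tp, remote_tp)
    # Assumption: only integer TP ratios are meaningful for this mapping.
    if big % small != 0:
        raise ValueError("TP sizes must divide evenly")
    ratio = big // small
    if local_tp >= remote_tp:
        return local_rank // ratio
    return local_rank * ratio


if __name__ == "__main__":
    # P8/D4 case from the commit message: decode rank 1 reads prefill rank 2.
    print(remote_tp_rank(1, local_tp=4, remote_tp=8))
    # P4/D8 case: decode ranks 4 and 5 both read prefill rank 2.
    print(remote_tp_rank(4, local_tp=8, remote_tp=4), remote_tp_rank(5, local_tp=8, remote_tp=4))
```

In the decode-TP-greater case several decode ranks resolve to the same prefill rank, which is why the producer side needs the ACK fan-in counting mentioned in the same commit series.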

These patches were upstreamed in vllm-project/vllm#40344 so we can use the nightly image instead.
Switching to nightly also requires us to:
- rename the all2all backend from mori to mori_low_latency;
- set read mode via the read_mode=1 flag.
Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27334107848
Results from this run are very similar to the existing Kimi vllm disagg results, as expected.
Note
Medium Risk
Changes affect multi-node prefill/decode KV transfer and MoE all2all for production benchmark paths; behavior now depends on a nightly image rather than battle-tested in-tree patches, though validation noted similar results to prior runs.
Overview
MI355X vLLM disaggregated Kimi K2.5 and MiniMax M2.5 configs now use a pinned ROCm nightly image that includes upstream MoRI-IO fixes (vllm#40344), with a TODO to move to a stable release later. Scenario additional-settings no longer pass VLLM_MORIIO_CONNECTOR_READ_MODE.
setup_deps.sh drops the large set of runtime vLLM/MoRI-IO and scheduler patches (save/load KV timeouts, transfer timeouts, read-mode scheduler fixes, idle KV reaper); vLLM-disagg startup is now recipe deps, amd-quark, and UCX/RIXL paths only.
Serving wiring changes for the new stack: --all2all-backend mori → mori_low_latency in models_vllm.yaml; read_mode: true in MoRIIO kv_connector_extra_config in server_vllm.sh instead of the env var; job.slurm stops injecting that env and bumps the default vllm-router image. submit.sh no longer exports the read-mode env. perf-changelog.yaml documents the config-key updates.
Reviewed by Cursor Bugbot for commit 556d926.
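To make the read-mode change concrete, the sketch below builds a JSON value of the kind vLLM accepts for its KV-transfer configuration, with `read_mode` carried in `kv_connector_extra_config` rather than in an environment variable. It is a hedged illustration: `build_kv_transfer_config` is a hypothetical helper, the connector name is taken from the commit messages, and the exact keys and value type that server_vllm.sh emits are assumptions.

```python
import json

# Roles vLLM uses for disaggregated serving.
_VALID_ROLES = ("kv_producer", "kv_consumer", "kv_both")


def build_kv_transfer_config(role: str, read_mode: bool = True) -> str:
    """Return a JSON string suitable for a KV-transfer config argument.

    Assumptions: the connector is registered as "MoRIIOConnector" and reads
    a boolean "read_mode" key from its extra config.
    """
    if role not in _VALID_ROLES:
        raise ValueError(f"unknown kv_role: {role!r}")
    config = {
        "kv_connector": "MoRIIOConnector",
        "kv_role": role,
        "kv_connector_extra_config": {"read_mode": read_mode},
    }
    return json.dumps(config, sort_keys=True)


if __name__ == "__main__":
    # One config per worker role: prefill produces KV, decode consumes it.
    print(build_kv_transfer_config("kv_producer"))
    print(build_kv_transfer_config("kv_consumer"))
```

Keeping the setting inside the connector config means it travels with the serve command, so launch scripts such as job.slurm and submit.sh no longer need to export it.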