[Fix] Remove MoRI-IO patches from vLLM Disagg benchmarks #1585

Open
simondanielsson wants to merge 11 commits into main from fix/remove-vllm-disagg-patches

@simondanielsson

@simondanielsson simondanielsson commented May 29, 2026

Collaborator

These patches were upstreamed in vllm-project/vllm#40344 so we can use the nightly image instead.

Switching to nightly also requires us to:

  • rename a2a backend from mori to mori_low_latency
  • change MORI_READ_MODE=1 envvar to a read_mode=1 flag.
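
A rough sketch of what the two wiring changes above amount to on the serve command line. The helper name `build_disagg_args`, the `kv_role` values, and the exact JSON shape passed by `server_vllm.sh` are assumptions for illustration; only the `mori_low_latency` backend name, the `read_mode` option, and the `kv_connector_extra_config` key come from this PR.

```python
import json


def build_disagg_args(kv_role: str, use_expert_parallel: bool) -> list[str]:
    """Assemble the serve flags touched by this PR (illustrative, not the repo's script)."""
    kv_transfer_config = {
        "kv_connector": "MoRIIOConnector",
        # Assumed role names: one value for prefill workers, another for decode workers.
        "kv_role": kv_role,
        # Read mode is now a connector option instead of an environment variable.
        "kv_connector_extra_config": {"read_mode": True},
    }
    args = ["--kv-transfer-config", json.dumps(kv_transfer_config)]
    if use_expert_parallel:
        # The all2all backend was renamed from "mori" to "mori_low_latency".
        args += ["--all2all-backend", "mori_low_latency"]
    return args


if __name__ == "__main__":
    print(" ".join(build_disagg_args("kv_producer", use_expert_parallel=True)))
```

Keeping the read-mode switch inside the connector config means the setting travels with the serve command rather than with the job environment.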

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27334107848

Results from this run are very similar to the existing Kimi vllm disagg results, as expected


Note

Medium Risk
Changes affect multi-node prefill/decode KV transfer and MoE all2all for production benchmark paths; behavior now depends on a nightly image rather than battle-tested in-tree patches, though validation noted similar results to prior runs.

Overview
MI355X vLLM disaggregated Kimi K2.5 and MiniMax M2.5 configs now use a pinned ROCm nightly image that includes upstream MoRI-IO fixes (vllm#40344), with a TODO to move to a stable release later. Scenario additional-settings no longer pass VLLM_MORIIO_CONNECTOR_READ_MODE.

setup_deps.sh drops the large set of runtime vLLM/MoRI-IO and scheduler patches (save/load KV timeouts, transfer timeouts, read-mode scheduler fixes, idle KV reaper); vLLM-disagg startup is now recipe deps, amd-quark, and UCX/RIXL paths only.

Serving wiring changes for the new stack: --all2all-backend mori -> mori_low_latency in models_vllm.yaml; read_mode: true in MoRIIO kv_connector_extra_config in server_vllm.sh instead of the env var; job.slurm stops injecting that env and bumps the default vllm-router image. submit.sh no longer exports the read-mode env. perf-changelog.yaml documents the config-key updates.

Reviewed by Cursor Bugbot for commit 556d926.

…m image

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Comment thread .github/configs/amd-master.yaml Outdated
image: vllm/vllm-openai-rocm:nightly-bf610c2f56764e1b30bc6065f4ceace3d6e59036
# TODO(simondanielsson): change to pinned version once https://github.com/vllm-project/vllm/pull/40344
# is part of official release, likely 0.22.0.
image: vllm/vllm-openai-rocm:nightly

Collaborator

can u have a specific nightly hash pinned here instead of just the generic "nightly"

@simondanielsson simondanielsson Jun 3, 2026

Collaborator Author

We can! The caveat (and why I didn't pin it here) was that the pinned nightlies are pruned from docker hub after 14 days so the config will then cease working in 2 weeks. WDYT?
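
To make the trade-off concrete, here is a small sketch that tells a floating tag from a pinned one and estimates when a pinned nightly would disappear. The 14-day retention comes from the comment above; the function names, the tag pattern, and the idea of tracking a push date are assumptions for illustration.

```python
import re
from datetime import date, timedelta

# "nightly-" followed by a hex suffix such as a commit hash (assumed tag convention).
PINNED_NIGHTLY = re.compile(r"nightly-[0-9a-f]{7,40}")


def is_pinned_nightly(tag: str) -> bool:
    """True for tags like nightly-<hash>, False for the floating 'nightly' tag."""
    return PINNED_NIGHTLY.fullmatch(tag) is not None


def prune_date(pushed_on: date, retention_days: int = 14) -> date:
    """Estimated date after which a pinned nightly can no longer be pulled."""
    return pushed_on + timedelta(days=retention_days)


if __name__ == "__main__":
    tag = "nightly-bf610c2f56764e1b30bc6065f4ceace3d6e59036"
    print(is_pinned_nightly(tag), prune_date(date(2026, 6, 3)))
```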


minimaxm2.5-fp8-mi355x-vllm-disagg:
image: vllm/vllm-openai-rocm:nightly-a6682d1d259cca69a9ae737ea5608fbbe7520031
# TODO(simondanielsson): change to pinned version once https://github.com/vllm-project/vllm/pull/40344

Collaborator

can u have a specific nightly hash pinned here instead of just the generic "nightly"

Comment thread benchmarks/multi_node/amd_utils/setup_deps.sh
@simondanielsson simondanielsson marked this pull request as ready for review June 3, 2026 08:23
@simondanielsson simondanielsson requested a review from a team June 3, 2026 08:23
@claude

claude Bot commented Jun 3, 2026

Contributor

Claude finished @simondanielsson's task in 2m 25s


PR Review: Remove MoRI-IO patches from vLLM Disagg benchmarks

  • Gather context and read changed files
  • Validate master config changes and perf-changelog
  • Check for bugs, logic errors, and breaking changes
  • Post review findings

Summary

Clean removal of ~555 lines of MoRI-IO monkey patches now that they've been upstreamed in vllm#40344. The three changes (backend rename mori -> mori_low_latency, env var -> read_mode config flag, patch removal) are consistent and the CI run confirms they work.

One blocking issue: perf-changelog.yaml must be updated since amd-master.yaml was modified — see inline comment.

The unpinned nightly image tags are already under discussion with the reviewer.

@simondanielsson

Collaborator Author

Run is succeeding, and perf is very similar to the existing Kimi vllm-disagg results (with some noise of course)

Comment thread .github/configs/amd-master.yaml
Comment thread benchmarks/multi_node/amd_utils/setup_deps.sh
Comment thread .github/configs/amd-master.yaml
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 3c94a6f.

Comment thread benchmarks/multi_node/amd_utils/job.slurm Outdated
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

@functionstackx functionstackx left a comment

Collaborator

@simondanielsson is there a validated run? if so, lgtm

functionstackx added a commit that referenced this pull request Jun 11, 2026
…ch-free nightly

Brings the vLLM-disagg infra onto the upstream-MoRIIO nightly so the large
setup_deps.sh runtime patches are dropped (vllm#40344), and migrates the new
dsv4-fp4-mi355x-vllm-disagg recipe to match:
- image -> vllm/vllm-openai-rocm:nightly-3f0a91bb (carries #40344 + DeepseekV4);
  not available in v0.22.0/v0.22.1 release tags
- drop VLLM_MORIIO_CONNECTOR_READ_MODE env setting (read_mode now set via
  kv_connector_extra_config in server_vllm.sh)
- dsv4 is TP8/EP1 so no all2all backend / mori_low_latency rename needed

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@simondanielsson

simondanielsson commented Jun 11, 2026

Collaborator Author

Before merge I will

  1. update the pinned vllm nightlies again (since vllm images are pruned from Docker Hub every 2 weeks)
  2. Re-run CI
  3. Post link here

EDIT: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27334107848/job/80753697924

@functionstackx

Collaborator

thanks @simondanielsson

@functionstackx

Collaborator

appreciate your help on getting first class MoRI collective working @simondanielsson ! <3 with the high conc hanging fixed, the experience for end users will be much better now!

…agg-patches

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
functionstackx added a commit that referenced this pull request Jun 12, 2026
Remove the kimik2.5/minimaxm2.5 vllm-disagg changelog entry (that
change is documented in #1585) and scrub kimi/minimax references from
the dsv4-fp4-mi355x-vllm-disagg entry descriptions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
functionstackx added a commit that referenced this pull request Jun 15, 2026
…g smoke test

MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the
day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3): 1 prefill (TP8) +
1 decode (TP8) at conc 1, validating the MoRI-IO KV-transfer disagg pipeline
end-to-end for M3.

Layered on the MoRI-IO patch-removal infra (#1585): brings in that PR's
amd_utils changes (setup_deps.sh / server_vllm.sh / submit.sh / models_vllm.yaml
mori -> mori_low_latency) and the two job.slurm hunks (vllm-router image bump
nightly-20260511 -> nightly-20260603, drop VLLM_MORIIO_CONNECTOR_READ_MODE env),
while keeping main's atom-disagg support intact.

Per-worker serve flags (models_vllm.yaml MiniMax-M3-MXFP8): --block-size 128
(MSA), --language-model-only, --kv-cache-dtype fp8, --attention-backend
TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts
TP-sharded as in the single-node M3 TP8 recipe).

perf-changelog.yaml and amd-master.yaml contain only M3 changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
functionstackx added a commit that referenced this pull request Jun 15, 2026
…wnload

Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's
shared HF cache where the ~414 GB MXFP8 checkpoint is already staged
(/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the
launcher default /it-share/data. Scoped to M3 only via the M3 disagg script:

    export MODEL_PATH=/it-share/hf-hub-cache

submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot
under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode
serving containers. Other disagg models keep /it-share/data.

This supersedes the earlier job.slurm auto-download approach, which is reverted:
job.slurm now differs from main only by the #1585 mori-removal hunks (router
image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@simondanielsson

simondanielsson commented Jun 15, 2026

Collaborator Author

Run succeeded: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27334107848

cc: @chunfangamd should be ready for merge

…agg-patches

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
functionstackx added a commit that referenced this pull request Jun 17, 2026
…g smoke test

functionstackx added a commit that referenced this pull request Jun 17, 2026
…wnload

functionstackx added a commit that referenced this pull request Jun 18, 2026
…g smoke test

functionstackx added a commit that referenced this pull request Jun 18, 2026
…wnload

functionstackx added a commit that referenced this pull request Jun 19, 2026
…g smoke test

functionstackx added a commit that referenced this pull request Jun 19, 2026
…wnload

functionstackx added a commit that referenced this pull request Jun 21, 2026
…g smoke test

functionstackx added a commit that referenced this pull request Jun 21, 2026
…wnload

functionstackx added a commit that referenced this pull request Jun 24, 2026
* [Klaud Cold] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test

MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the
day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3): 1 prefill (TP8) +
1 decode (TP8) at conc 1, validating the MoRI-IO KV-transfer disagg pipeline
end-to-end for M3.

Layered on the MoRI-IO patch-removal infra (#1585): brings in that PR's
amd_utils changes (setup_deps.sh / server_vllm.sh / submit.sh / models_vllm.yaml
mori -> mori_low_latency) and the two job.slurm hunks (vllm-router image bump
nightly-20260511 -> nightly-20260603, drop VLLM_MORIIO_CONNECTOR_READ_MODE env),
while keeping main's atom-disagg support intact.

Per-worker serve flags (models_vllm.yaml MiniMax-M3-MXFP8): --block-size 128
(MSA), --language-model-only, --kv-cache-dtype fp8, --attention-backend
TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts
TP-sharded as in the single-node M3 TP8 recipe).

perf-changelog.yaml and amd-master.yaml contain only M3 changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* amd_utils/job.slurm: auto-download disagg checkpoint when not pre-staged

The first MI355X disagg sweep (run 27515119215) failed: the day-zero
MiniMax-M3-MXFP8 checkpoint is not staged on the disagg cluster's shared FS, so
job.slurm's model search hit a hard FATAL ("Model 'MiniMax-M3-MXFP8' not found.
Searched: ...") before the engine ever started. The single-node recipes
hf-download inside the serving container, but the disagg path historically
required ops to pre-stage checkpoints.

Add an on-demand fallback to the vllm-disagg model-resolution block: when the
checkpoint isn't found, derive the HF repo id from the hf_dir (models--org--name
-> org/name) and download into MODEL_DIR in HF cache layout, then resolve the
snapshot as MODEL_PATH. Staging into MODEL_DIR keeps MODEL_PATH under the dir
that is bind-mounted into the serving container as /models, so the existing
-v ${MODEL_DIR}:/models mount and DOCKER_MODEL_PATH (/models) remap both resolve.

Implementation notes:
  - The host has no hf CLI, so the download runs in a one-shot container of the
    serving image (DOCKER_IMAGE_NAME), which ships huggingface_hub.
  - flock on a lockfile in MODEL_DIR serializes the prefill/decode nodes; a
    re-check of snapshots/ under the lock makes it idempotent (resumable).
  - hf download with a huggingface-cli fallback; 3 retries; HF_TOKEN passed
    through for gated repos.
  - Scoped to the vllm-disagg branch only; pre-staged models never reach this
    path (the search finds them first), so sglang/atom and existing vLLM disagg
    models (M2.5/Kimi) are unaffected.
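
The repo-id derivation and the idempotent snapshot check described above could look roughly like this. It is a sketch, not the job.slurm logic: the function names are hypothetical, and picking the first snapshot directory in sorted order is an assumption about how a revision would be chosen.

```python
from pathlib import Path
from typing import Optional


def hf_dir_to_repo_id(hf_dir: str) -> str:
    """Map an HF cache directory name (models--org--name) to a repo id (org/name)."""
    prefix, sep, rest = hf_dir.partition("--")
    if prefix != "models" or not sep:
        raise ValueError(f"not an HF model cache dir: {hf_dir!r}")
    org, sep, name = rest.partition("--")
    if not sep or not org or not name:
        raise ValueError(f"cannot split org/name from: {hf_dir!r}")
    return f"{org}/{name}"


def resolve_snapshot(model_dir: Path, hf_dir: str) -> Optional[Path]:
    """Return a snapshot directory if the checkpoint is already staged, else None."""
    snapshots = model_dir / hf_dir / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = sorted(p for p in snapshots.iterdir() if p.is_dir())
    return candidates[0] if candidates else None


if __name__ == "__main__":
    print(hf_dir_to_repo_id("models--MiniMaxAI--MiniMax-M3-MXFP8"))
```

Re-running `resolve_snapshot` after taking the lock is what would make the download step safe to repeat: the second node to acquire the lock sees the staged snapshot and skips the download.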

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* job.slurm: --entrypoint "" for the auto-download container

The disagg auto-download reached hf download but failed all 3 attempts: the
one-shot `docker run "$DOCKER_IMAGE_NAME" bash -lc "hf download ..."` did not
override the image ENTRYPOINT, so the vllm-openai API server ran with the bash
command as its args and died with "Failed to infer device type" (no GPU mounted
in the download container). Add --entrypoint "" (as the serving container does)
so bash actually runs hf download.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* M3 disagg: use shared HF cache (/it-share/hf-hub-cache); drop auto-download

Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's
shared HF cache where the ~414 GB MXFP8 checkpoint is already staged
(/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the
launcher default /it-share/data. Scoped to M3 only via the M3 disagg script:

    export MODEL_PATH=/it-share/hf-hub-cache

submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot
under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode
serving containers. Other disagg models keep /it-share/data.

This supersedes the earlier job.slurm auto-download approach, which is reverted:
job.slurm now differs from main only by the #1585 mori-removal hunks (router
image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* disagg #1762: add 8k1k conc-16 row to run an lm-eval (validate correctness)

The conc-1 1k1k smoke test never triggered an eval — the multi-node eval policy
only marks 8k1k entries with conc >= MIN_EVAL_CONC (16). Add an 8k1k conc-16 row
(same 1P TP8 + 1D TP8 layout) so mark_eval_entries marks it run-eval=true
(eval-conc=16), running lm-eval through the MoRI-IO disagg pipeline to validate
correctness. The conc-1 1k1k row stays the latency smoke test.
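
A simplified sketch of the eval-marking rule described above, assuming each sweep entry is a dict with a `scenario` string and a `conc` list. Those field names, and choosing the smallest qualifying concurrency as `eval-conc`, are assumptions; the real `mark_eval_entries` may also group entries (a later commit mentions grouping by dp-attn).

```python
MIN_EVAL_CONC = 16        # threshold named in the commit message
EVAL_SCENARIO = "8k1k"    # only this seq-len scenario is eligible


def mark_eval_entries(entries: list[dict]) -> list[dict]:
    """Return copies of the entries with run-eval (and eval-conc) filled in."""
    marked = []
    for entry in entries:
        out = dict(entry)
        eligible = [c for c in entry["conc"] if c >= MIN_EVAL_CONC]
        if entry["scenario"] == EVAL_SCENARIO and eligible:
            out["run-eval"] = True
            out["eval-conc"] = min(eligible)  # assumed: lowest qualifying concurrency
        else:
            out["run-eval"] = False
        marked.append(out)
    return marked


if __name__ == "__main__":
    sweep = [
        {"scenario": "1k1k", "conc": [1]},
        {"scenario": "8k1k", "conc": [16]},
    ]
    for row in mark_eval_entries(sweep):
        print(row)
```

Under this rule the conc-1 1k1k row is never marked, which matches why the original smoke test did not trigger an eval.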

Run with non-canary-full-sweep-enabled so the (non-min-conc) eval entry runs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* disagg #1762: sweep conc 1,2,4,8,16 (not just conc 1)

Widen the 1k1k disagg latency/throughput sweep from conc 1 to conc 1,2,4,8,16
(1P TP8 + 1D TP8). The 8k1k conc-16 eval row is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* disagg #1762: sweep conc 1,2,4,8,16 at both 1k1k and 8k1k

Widen the disagg sweep from conc 1 to conc 1,2,4,8,16 for both seq-len scenarios
(1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked
(eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Update the vLLM external router container

vllm/vllm-router only retains ~16 recent nightlies on Docker Hub; older
dated tags are garbage-collected (manifest unknown), which makes `docker run`
fail with exit 125 on any node that has not already cached the image.

* M3 disagg: per-layer MoRIIO KV transfer for hybrid sparse-attn (partial)

MiniMax-M3 (MiniMaxM3SparseForCausalLM) is a hybrid sparse-attention model:
sparse layers register a separate lightning-indexer cache (MLAAttentionSpec,
rank-3, bf16, key-only) alongside the main cache (FullAttentionSpec, rank-5,
fp8, K+V). The MoRIIO connector assumes one uniform KV layout -- it derives
block geometry from the first cache and reuses first_layer's offsets for every
layer (see its own "hybrid attn" TODO) -- so the bf16 key-only index cache is
transferred with fp8 K+V sizing and gets corrupted on the decode worker,
producing garbage output (disagg gsm8k ~= 0 while single-node M3 is correct).
This is the vLLM analogue of the SGLang MoRI DSA-state bug in patches/mori_conn.py.

- patches/moriio_heterogeneous_kv.py: compute the READ-path transfer geometry
  per layer (own shape/stride/dtype/rank) instead of from the first cache.
  Idempotent; no-op for homogeneous models.
- setup_deps.sh: apply it on the vllm-disagg path.

NOTE: partial fix -- necessary but not yet sufficient. The index cache is also a
separate KV-cache group whose block-table/num_blocks the single-namespace MoRIIO
connector cannot map, so M3 disagg accuracy is still broken pending a larger
multi-group / index-state transfer change. (Disabling sparse attention is not a
viable workaround: M3's fused QKV carries index_k weights, so dropping the
indexer breaks weight load.)

Refs #1762

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(amd-disagg): add vLLM MoRIIO KV-layout patch to reuse stock minimax-m3 image

The vLLM MoRIIOConnector in vllm/vllm-openai-rocm:minimax-m3 assumes the
FlashAttention KV layout [2, num_blocks, ...] (K/V axis outer) but this
vLLM's backends allocate [num_blocks, 2, ...] (K/V axis inner), so every
disagg block transfer reads the wrong region. Invisible to throughput,
but corrupts GQA/non-MLA accuracy (MiniMax-M3 gsm8k 0.0008 -> 0.957).
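
To illustrate why the axis order matters, the sketch below computes where block `b` starts for K and V in a contiguous cache under each layout. It flattens everything after the first two axes into `block_elems` and is not the connector's code; the layout labels and function name are made up for this example.

```python
def block_offsets(layout: str, block: int, num_blocks: int, block_elems: int) -> tuple[int, int]:
    """Return (K offset, V offset) in elements for one block of a contiguous KV cache."""
    if layout == "kv_outer":  # shape [2, num_blocks, block_elems]
        return block * block_elems, (num_blocks + block) * block_elems
    if layout == "kv_inner":  # shape [num_blocks, 2, block_elems]
        return (2 * block) * block_elems, (2 * block + 1) * block_elems
    raise ValueError(f"unknown layout: {layout!r}")


if __name__ == "__main__":
    # Same block index, same cache size, different regions of memory.
    print(block_offsets("kv_outer", 1, num_blocks=4, block_elems=10))
    print(block_offsets("kv_inner", 1, num_blocks=4, block_elems=10))
```

A transfer that assumes one layout while the cache is allocated in the other still moves the right number of bytes, which is consistent with the bug being invisible to throughput while corrupting accuracy.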

Instead of baking a fix into a rebuilt image (-hetkv) or carrying full
vendored copies of the patched files in-tree, carry just the 218-line
unified diff (patches/moriio/moriio-kv-layout-fix.diff) and apply it with
`patch -p1` against the vLLM package dir inside the container at startup,
ahead of the server launch. The repo is already bind-mounted into the
container, so no EXTRA_DOCKER_MOUNTS wiring is needed -- job.slurm
auto-applies the diff when DOCKER_IMAGE_NAME contains "minimax-m3"
(skippable with MORIIO_KV_PATCH=skip), mirroring the existing
mori_conn.py sglang hook. A failed apply aborts the container instead of
silently running unpatched.

Validated on a manual 2-node run (n06-21 prefill+router / n09-21 decode)
using the STOCK image: gsm8k strict-match 0.9568 / flexible-extract
0.9560 (matches the baked image within noise), decode probe healthy.

- patches/moriio/moriio-kv-layout-fix.diff: unified diff vs stock
- job.slurm: in-container `patch` step, MORIIO_KV_PATCH=skip opt-out
- patches/README.md: document the moriio/ diff-apply mechanism
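
The apply-or-skip decision described above can be sketched as follows. The function name and argument names are hypothetical; the `minimax-m3` image check, the `MORIIO_KV_PATCH=skip` opt-out, and `patch -p1` come from the commit message, while passing the target directory with `-d` and the diff with `-i` is one assumed way to invoke `patch`.

```python
import os
from typing import Mapping, Optional


def patch_command(
    image: str,
    vllm_pkg_dir: str,
    diff_path: str,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[list[str]]:
    """Return the patch command to run, or None when the patch should not be applied."""
    env = os.environ if env is None else env
    if "minimax-m3" not in image:
        return None  # only the stock minimax-m3 image needs the fix
    if env.get("MORIIO_KV_PATCH") == "skip":
        return None  # explicit opt-out
    return ["patch", "-p1", "-d", vllm_pkg_dir, "-i", diff_path]


if __name__ == "__main__":
    cmd = patch_command(
        "vllm/vllm-openai-rocm:minimax-m3",
        "/path/to/site-packages/vllm",  # placeholder path
        "patches/moriio/moriio-kv-layout-fix.diff",
        env={},
    )
    print(cmd)
```

A caller would run the returned command with something like `subprocess.run(cmd, check=True)` so that a failed apply raises and stops startup rather than continuing unpatched.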

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* disagg #1762: extend conc sweep to 32,64,128,256,512,1024 at 1k1k and 8k1k

Widen the disagg sweep from conc 1,2,4,8,16 to
1,2,4,8,16,32,64,128,256,512,1024 for both seq-len scenarios (1P TP8 + 1D
TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16)
so lm-eval still validates the MoRI-IO disagg pipeline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* disagg #1762: add TP4-prefill P/D layouts (TP4+TP8, TP4+TP4) at 1k1k and 8k1k

Add two asymmetric prefill/decode layouts alongside the existing TP8+TP8 sweep,
for both seq-len scenarios:
  - 1P TP4 + 1D TP8 (smaller prefill, full-node decode) at conc 1..256
  - 1P TP4 + 1D TP4 (balanced half-node) at conc 64..1024

Per-worker TP is driven by the master-config prefill/decode tp: server_vllm.sh
sed-rewrites the models_vllm.yaml --tensor-parallel-size 8 placeholder to the
computed PREFILL_TP_SIZE/DECODE_TP_SIZE, so no models_vllm.yaml flag change is
needed (comment updated to say so). The multinode eval policy still marks exactly
one lm-eval (groups by dp-attn, not TP) on the TP8+TP8 8k1k layout.
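
As a rough Python analogue of the sed rewrite described above (the real step lives in server_vllm.sh and is not shown here), the placeholder substitution could look like this. The helper name and the sample flag string are hypothetical.

```python
import re


def rewrite_tp(flags: str, tp_size: int) -> str:
    """Replace the `--tensor-parallel-size 8` placeholder with the computed TP size."""
    return re.sub(
        r"(--tensor-parallel-size)\s+8\b",
        lambda m: f"{m.group(1)} {tp_size}",
        flags,
    )


# Hypothetical flag line; only the TP placeholder is rewritten.
prefill_flags = rewrite_tp("--tensor-parallel-size 8 --all2all-backend mori_low_latency", 4)
```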

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(amd-disagg): bundle heterogeneous-TP + dup-ack fixes into unified MoRIIO diff

Replaces moriio-kv-layout-fix.diff with moriio-minimax-m3-disagg.diff, which
bundles three layered fixes for the stock minimax-m3 vLLM image:
1. KV-layout: axis-aware per-layer block offsets (the gsm8k 0.0008→0.958 fix,
   required for homogeneous TP too).
2. heterogeneous-TP addressing + guard: maps each decode rank to the correct
   prefill rank (tp_rank // ratio) for PREFILL_TP_SIZE != DECODE_TP_SIZE, and
   raises NotImplementedError for unsupported cases (prefill-TP > decode-TP,
   KV-head splitting) instead of silently corrupting KV.
3. dup-ack fan-in: with DECODE_TP_SIZE > PREFILL_TP_SIZE, producer counts ACKs
   per transfer_id and only frees KV blocks once all expected consumers ACK,
   preventing both the late-ACK EngineCore crash and KV reuse before slower
   decode ranks finish reading.
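
The fan-in in item 3 can be sketched as a per-transfer counter. This is an illustration only: the class name is hypothetical, and the expected ACK count (assumed here to be DECODE_TP_SIZE // PREFILL_TP_SIZE) is an assumption, not taken from the diff.

```python
from collections import defaultdict


class AckFanIn:
    """Release a transfer's KV blocks only after every expected consumer has ACKed."""

    def __init__(self, expected_acks: int):
        self.expected_acks = expected_acks
        self._counts = defaultdict(int)

    def on_ack(self, transfer_id: str) -> bool:
        """Record one ACK; return True only when the final expected ACK arrives."""
        self._counts[transfer_id] += 1
        if self._counts[transfer_id] < self.expected_acks:
            return False  # slower decode ranks may still be reading
        del self._counts[transfer_id]
        return True  # safe to free the KV blocks now


fan_in = AckFanIn(expected_acks=2)  # e.g. decode TP8 reading from prefill TP4
first = fan_in.on_ack("req-0")
second = fan_in.on_ack("req-0")
```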

job.slurm and patches/README.md updated to reference the new diff name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(moriio): correct _remote_tp_rank for prefill-TP > decode-TP (P8/D4)

With P8/D4 and 4 KV heads, vLLM distributes heads across prefill ranks
in consecutive pairs: (rank0,rank1)→head0, (rank2,rank3)→head1, etc.
The previous patch used `return self.tp_rank` for the P>D branch, which
made decode rank 1 connect to prefill rank 1 (holds head0) instead of
prefill rank 2 (holds head1) — corrupting KV for all decode ranks except 0.

Fix: use `self.tp_rank * ratio` (ratio = remote_tp_size // local_tp_size),
the symmetric counterpart to the D>P case's `tp_rank // ratio`. This maps
each decode rank to the *first* prefill rank of its head group, which holds
the correct KV content via vLLM's replication scheme.
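
A minimal sketch of the rank mapping described above, written as a free function rather than the connector method. It assumes the decode side is "local", the prefill side is "remote", and that the TP sizes divide evenly; the function name is hypothetical.

```python
def remote_tp_rank(tp_rank: int, local_tp_size: int, remote_tp_size: int) -> int:
    """Map a decode (local) TP rank to the prefill (remote) rank it should read from."""
    if local_tp_size == remote_tp_size:
        return tp_rank
    if local_tp_size > remote_tp_size:
        # D > P: several decode ranks share one prefill rank.
        ratio = local_tp_size // remote_tp_size
        return tp_rank // ratio
    # P > D: pick the first prefill rank of this decode rank's head group.
    ratio = remote_tp_size // local_tp_size
    return tp_rank * ratio


# P8/D4: decode ranks 0..3 map to prefill ranks 0, 2, 4, 6.
p8d4 = [remote_tp_rank(r, local_tp_size=4, remote_tp_size=8) for r in range(4)]
```

With the earlier `return self.tp_rank` behaviour, decode rank 1 would have mapped to prefill rank 1 instead of 2, which is the mismatch this commit fixes.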

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(moriio-diff): correct hunk header count after _remote_tp_rank expansion

The P>D fix added 4 lines to _remote_tp_rank but the hunk header still
said +1100,40; patch aborted with "malformed patch at line 79". Update
to +1100,44 to match the actual 6 context + 38 added lines.
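
The count in a unified-diff hunk header can be recomputed from the hunk body, which is a cheap check before committing a hand-edited diff. A minimal sketch, with a hypothetical helper name:

```python
def hunk_counts(body_lines):
    """Return (old_count, new_count) for the body lines of one unified-diff hunk."""
    old = new = 0
    for line in body_lines:
        if line.startswith("\\"):
            continue  # "\ No newline at end of file" marker, not a content line
        if line.startswith("-"):
            old += 1  # removed line: counts toward the old side only
        elif line.startswith("+"):
            new += 1  # added line: counts toward the new side only
        else:
            old += 1  # context line: counts toward both sides
            new += 1
    return old, new


# 6 context lines plus 38 added lines, as in the hunk described above.
counts = hunk_counts([" context"] * 6 + ["+added"] * 38)
```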

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(amd-disagg): keep MoRIIO patch cmd inside container bash -lc quotes

The MoRIIO KV-layout patch was injected into the per-node container launch
via '"${_MORIIO_PATCH_CMD:-}"', which breaks out of the outer
srun bash -c "..." double-quoted string. Because the patch command value
contains spaces and the shell operators '<' and '||', the unquoted
expansion word-split the generated container script, truncating it right
after the word `patch` and silently dropping the patch arguments AND the
server.sh launch. The container then exited 0:0 within seconds, producing
no benchmark/eval output -> collect_latest_results found "No logs
directory" -> the launch step failed with exit 1 (all minimax-m3 disagg
jobs affected).

Fix: expand ${_MORIIO_PATCH_CMD:-} directly inside the inner bash -lc
single quotes (no quote toggling), so the patch command stays intact and
its operators are parsed by the container shell. Validated end-to-end:
gsm8k recovers from ~0 (garbage) to 0.94-0.98 across P8D8/P4D8/P8D4.
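
The truncation can be approximated outside the cluster with Python's `shlex`, which splits a command line much as the outer shell does after the unquoted expansion. This is a sketch under assumptions: the command strings are simplified stand-ins for the real job.slurm launch line, and the variable's value is substituted textually.

```python
import shlex

# Hypothetical stand-in for the value of _MORIIO_PATCH_CMD.
PATCH_CMD = "patch -p1 < fix.diff || exit 1"

# Broken form: '"${VAR}"' closes the outer double quotes, so the value is
# expanded unquoted and split on whitespace.
broken = "bash -c \"bash -lc 'setup; '\"" + PATCH_CMD + "\"' ; server.sh'\""

# Fixed form: the value stays inside the outer double quotes.
fixed = "bash -c \"bash -lc 'setup; " + PATCH_CMD + " ; server.sh'\""

broken_argv = shlex.split(broken)
fixed_argv = shlex.split(fixed)

# The script that `bash -c` actually receives is its first argument.
broken_script = broken_argv[2]
fixed_script = fixed_argv[2]
```

In the broken form the script ends right after the word `patch`, and everything else, including the server launch, becomes positional arguments that are never executed.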

Co-authored-by: Cursor <cursoragent@cursor.com>

* disagg #1762: add 2P TP4 + 1D TP8 layout at conc 256,512,768,1024 (1k1k & 8k1k)

Two TP4 prefill workers (num-worker 2, PREFILL_NODES=2, each TP4 on half an
8-GPU node) feeding one TP8 decode (DECODE_NODES=1) — 3 nodes total. Added to
both seq-len scenarios at conc 256,512,768,1024. Eval marking unchanged (still
one lm-eval on the 8k1k TP8+TP8 layout).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(amd-disagg): remove redundant moriio_heterogeneous_kv.py patcher

The per-layer READ-offset fix this Python patcher applied to
moriio_connector.py is fully subsumed by the unified overlay
patches/moriio/moriio-minimax-m3-disagg.diff, which job.slurm applies
with `patch -p1` BEFORE server.sh sources setup_deps.sh. The diff
rewrites the exact lines the patcher searches for (the `first_layer`
single-offset block and the `is_mla = len(self.kv_cache_shape)` sizing),
with a stronger geometry-memoized + heterogeneous-TP-aware version, so
the patcher's OLD1/OLD2 patterns no longer match and it already no-ops
("pattern not found; skipping") in the real flow. It's also the same
fix now upstreamed in vLLM #46039 (READ mixed KV layouts).

Drop the dead patcher and its setup_deps.sh hook so the diff is the
single source of truth. patches/README.md only documents the diff (no
reference to this patcher), so no README change is needed.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Use upstream nightly image for MiniMax-M3 disagg, drop MoRIIO overlay

- Joint work with Gupta, Ravi

All three MoRIIO fixes the in-tree overlay carried have merged upstream and now
ship in the ROCm nightly image:
  - vLLM #46039  READ-mode mixed KV-layout (axis-aware per-layer offsets)
  - vLLM #46290  WRITE-mode per-geometry offset caching
  - vLLM #46332  heterogeneous-TP rank mapping + ACK fan-in

Point minimaxm3-fp8-mi355x-vllm-disagg at
vllm/vllm-openai-rocm:nightly-556bc4e3a089378e9df2482659898192da18db15
(vLLM 0.23.1rc1.dev363+g556bc4e3a, which contains all three merges) and remove
the stop-gap overlay:
  - delete patches/moriio/moriio-minimax-m3-disagg.diff
  - drop the job.slurm in-container auto-apply block (+ MORIIO_KV_PATCH gate)
  - trim the moriio/ section from patches/README.md

Verified on the nightly image with NO patch across all four P/D layouts x
conc {1,4,8}, gsm8k strict/flexible 0.95-0.97 (1P8+1D8, 1P4+1D8, 1P4+1D4,
2P4+1D8) -- matching the previously-patched results.

Refs #1762.

* fix: append M3 MI355X disagg changelog entry at end of file

The minimaxm3-fp8-mi355x-vllm-disagg entry was inserted mid-file (after
the #1862 entry), which violates the append-only changelog gate
("entry 511 changed; existing entries are immutable"). Move it to the
end of perf-changelog.yaml so existing entries stay byte-identical to
main and the new entry is a clean append.
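
The append-only rule can be sketched as a prefix check between the base and the proposed entry lists. The actual gate's implementation is not shown in this PR, so the function below is a hypothetical illustration of the rule, not the CI code.

```python
def first_changed_entry(base_entries, new_entries):
    """Return the index of the first base entry that changed, or None if append-only."""
    for i, entry in enumerate(base_entries):
        if i >= len(new_entries) or new_entries[i] != entry:
            return i  # an existing entry was modified, moved, or removed
    return None  # every base entry is intact; anything extra is a clean append


base = ["entry-a", "entry-b", "entry-c"]
mid_file_insert = first_changed_entry(base, ["entry-a", "entry-b", "new", "entry-c"])
clean_append = first_changed_entry(base, base + ["new"])
```

Inserting mid-file shifts every later entry, so the first shifted position is reported as changed even though its content still exists further down.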

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Chun Fang <chun.fang@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: TianDi101 <ditian12@amd.com>