[Fix] Remove MoRI-IO patches from vLLM Disagg benchmarks by simondanielsson · Pull Request #1585 · SemiAnalysisAI/InferenceX

simondanielsson · 2026-05-29T07:38:07Z

These patches were upstreamed in vllm-project/vllm#40344 so we can use the nightly image instead.

Switching to nightly also requires us to:

rename a2a backend from mori to mori_low_latency
change MORI_READ_MODE=1 envvar to a read_mode=1 flag.
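
As a rough illustration of the two changes listed above, the migration could be sketched as a small helper. This is not code from the PR: `migrate_launch_settings` is a hypothetical name, the env var name is taken from the Overview section of this PR, and treating the value `"1"` as "read mode on" is an assumption.

```python
from typing import Dict, List, Tuple

# Names taken from the PR text; the helper itself is hypothetical.
OLD_BACKEND = "mori"
NEW_BACKEND = "mori_low_latency"
READ_MODE_ENV = "VLLM_MORIIO_CONNECTOR_READ_MODE"


def migrate_launch_settings(
    args: List[str], env: Dict[str, str]
) -> Tuple[List[str], Dict[str, str], Dict[str, bool]]:
    """Return (new_args, new_env, extra_config) for the nightly image."""
    new_args = list(args)
    for i, arg in enumerate(new_args[:-1]):
        # Only rewrite the value that directly follows --all2all-backend.
        if arg == "--all2all-backend" and new_args[i + 1] == OLD_BACKEND:
            new_args[i + 1] = NEW_BACKEND

    new_env = dict(env)
    # Drop the env var and carry its meaning over as a connector option.
    # Assumption: "1" means enabled, anything else means disabled.
    read_mode = new_env.pop(READ_MODE_ENV, "0") == "1"
    extra_config = {"read_mode": read_mode}
    return new_args, new_env, extra_config
```

The inputs are copied rather than mutated, so the old and new settings can be compared side by side before a run.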

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27334107848

Results from this run are very similar to the existing Kimi vllm disagg results, as expected.

Note

Medium Risk
Changes affect multi-node prefill/decode KV transfer and MoE all2all for production benchmark paths; behavior now depends on a nightly image rather than battle-tested in-tree patches, though validation noted similar results to prior runs.

Overview
MI355X vLLM disaggregated Kimi K2.5 and MiniMax M2.5 configs now use a pinned ROCm nightly image that includes upstream MoRI-IO fixes (vllm#40344), with a TODO to move to a stable release later. Scenario additional-settings no longer pass VLLM_MORIIO_CONNECTOR_READ_MODE.

setup_deps.sh drops the large set of runtime vLLM/MoRI-IO and scheduler patches (save/load KV timeouts, transfer timeouts, read-mode scheduler fixes, idle KV reaper); vLLM-disagg startup is now recipe deps, amd-quark, and UCX/RIXL paths only.

Serving wiring changes for the new stack: --all2all-backend mori → mori_low_latency in models_vllm.yaml; read_mode: true in MoRIIO kv_connector_extra_config in server_vllm.sh instead of the env var; job.slurm stops injecting that env and bumps the default vllm-router image. submit.sh no longer exports the read-mode env. perf-changelog.yaml documents the config-key updates.
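
To make the serving wiring above concrete, here is a minimal sketch of how the launch arguments might be assembled once read mode lives in `kv_connector_extra_config`. It is not the contents of server_vllm.sh. The connector name `MoRIIOConnector`, the `kv_role` values, and the model name are assumptions used for illustration; only `--all2all-backend mori_low_latency` and the `read_mode` key come from the summary above.

```python
import json
import shlex
from typing import List


def build_serve_args(model: str, kv_role: str, read_mode: bool = True) -> List[str]:
    """Assemble an illustrative `vllm serve` command line for one worker."""
    kv_transfer_config = {
        "kv_connector": "MoRIIOConnector",  # assumption: connector class name
        "kv_role": kv_role,                 # assumption: e.g. "kv_producer"
        # read_mode is passed as a connector option instead of an env var.
        "kv_connector_extra_config": {"read_mode": read_mode},
    }
    return [
        "vllm", "serve", model,
        "--all2all-backend", "mori_low_latency",
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]


if __name__ == "__main__":
    # "example/model" is a placeholder, not a model from this PR.
    print(shlex.join(build_serve_args("example/model", "kv_producer")))
```

Keeping the option in the JSON config means the prefill and decode workers receive it through the same channel as the rest of the connector settings, rather than through the job environment.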

Reviewed by Cursor Bugbot for commit 556d926.

…m image Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

github-actions · 2026-05-29T07:38:18Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

functionstackx · 2026-06-02T23:06:50Z

-  image: vllm/vllm-openai-rocm:nightly-bf610c2f56764e1b30bc6065f4ceace3d6e59036
+  # TODO(simondanielsson): change to pinned version once https://github.com/vllm-project/vllm/pull/40344
+  # is part of official release, likely 0.22.0.
+  image: vllm/vllm-openai-rocm:nightly


can u have a specific nightly hash pinned here instead of just the generic "nightly"?

We can! The caveat (and why I didn't pin it here) is that pinned nightlies are pruned from Docker Hub after 14 days, so the config would stop working after two weeks. WDYT?
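
The trade-off described above can be made concrete with a small date calculation. This is only a sketch: the 14-day window is the figure quoted in this thread, and the publish date used in the usage note is illustrative rather than taken from a real tag.

```python
from datetime import date, timedelta

RETENTION_DAYS = 14  # retention window quoted in this thread


def pin_expiry(published: date, retention_days: int = RETENTION_DAYS) -> date:
    """Date on which a pinned nightly tag is expected to be pruned."""
    return published + timedelta(days=retention_days)


def days_left(
    published: date, today: date, retention_days: int = RETENTION_DAYS
) -> int:
    """Days remaining before the pinned tag is pruned (negative if past)."""
    return (pin_expiry(published, retention_days) - today).days
```

For example, a nightly published on 2026-06-03 would be expected to disappear on 2026-06-17, so a config pinned to it on 2026-06-11 would have 6 days left.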

functionstackx · 2026-06-02T23:06:56Z


 minimaxm2.5-fp8-mi355x-vllm-disagg:
-  image: vllm/vllm-openai-rocm:nightly-a6682d1d259cca69a9ae737ea5608fbbe7520031
+  # TODO(simondanielsson): change to pinned version once https://github.com/vllm-project/vllm/pull/40344


can u have a specific nightly hash pinned here instead of just the generic "nightly"?

claude · 2026-06-03T08:24:01Z

Claude finished @simondanielsson's task in 2m 25s

PR Review: Remove MoRI-IO patches from vLLM Disagg benchmarks

Gather context and read changed files
Validate master config changes and perf-changelog
Check for bugs, logic errors, and breaking changes
Post review findings

Summary

Clean removal of ~555 lines of MoRI-IO monkey patches now that they've been upstreamed in vllm#40344. The three changes (backend rename mori → mori_low_latency, env var → read_mode config flag, patch removal) are consistent and the CI run confirms they work.

One blocking issue: perf-changelog.yaml must be updated since amd-master.yaml was modified — see inline comment.
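
The rule behind this blocking issue (a change to amd-master.yaml requires a matching perf-changelog.yaml update) can be sketched as a simple check over the changed files. This is an illustration, not the repository's actual CI logic; `changelog_missing` is a hypothetical name and the check matches on file names only.

```python
from pathlib import PurePosixPath
from typing import Iterable


def changelog_missing(changed_files: Iterable[str]) -> bool:
    """True if amd-master.yaml changed without a perf-changelog.yaml update."""
    # Compare on base names so the directory layout does not matter.
    names = {PurePosixPath(path).name for path in changed_files}
    return "amd-master.yaml" in names and "perf-changelog.yaml" not in names
```

A pre-merge hook could feed this the file list from the diff and fail when it returns `True`.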

The unpinned nightly image tags are already under discussion with the reviewer.

simondanielsson · 2026-06-03T08:24:14Z

Run is succeeding, and perf is very similar to the existing Kimi vllm-disagg results (with some noise of course)

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

…atches

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 3c94a6f.

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

functionstackx

@simondanielsson is there a validated run? if so, lgtm

…ch-free nightly

Brings the vLLM-disagg infra onto the upstream-MoRIIO nightly so the large setup_deps.sh runtime patches are dropped (vllm#40344), and migrates the new dsv4-fp4-mi355x-vllm-disagg recipe to match:

- image -> vllm/vllm-openai-rocm:nightly-3f0a91bb (carries #40344 + DeepseekV4); not available in v0.22.0/v0.22.1 release tags
- drop VLLM_MORIIO_CONNECTOR_READ_MODE env setting (read_mode now set via kv_connector_extra_config in server_vllm.sh)
- dsv4 is TP8/EP1 so no all2all backend / mori_low_latency rename needed

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

simondanielsson · 2026-06-11T08:03:59Z

Before merge I will

Update the pinned vLLM nightlies again (since vLLM images are pruned from Docker Hub every 2 weeks)
Re-run CI
Post link here

EDIT: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27334107848/job/80753697924

functionstackx · 2026-06-11T08:07:01Z

thanks @simondanielsson

functionstackx · 2026-06-11T08:08:44Z

appreciate your help on getting first class MoRI collective working @simondanielsson ! <3 with the high conc hanging fixed, the experience for end users will be much better now!

…agg-patches Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

Remove the kimik2.5/minimaxm2.5 vllm-disagg changelog entry (that change is documented in #1585) and scrub kimi/minimax references from the dsv4-fp4-mi355x-vllm-disagg entry descriptions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…g smoke test

MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3): 1 prefill (TP8) + 1 decode (TP8) at conc 1, validating the MoRI-IO KV-transfer disagg pipeline end-to-end for M3.

Layered on the MoRI-IO patch-removal infra (#1585): brings in that PR's amd_utils changes (setup_deps.sh / server_vllm.sh / submit.sh / models_vllm.yaml mori -> mori_low_latency) and the two job.slurm hunks (vllm-router image bump nightly-20260511 -> nightly-20260603, drop VLLM_MORIIO_CONNECTOR_READ_MODE env), while keeping main's atom-disagg support intact.

Per-worker serve flags (models_vllm.yaml MiniMax-M3-MXFP8): --block-size 128 (MSA), --language-model-only, --kv-cache-dtype fp8, --attention-backend TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts TP-sharded as in the single-node M3 TP8 recipe).

perf-changelog.yaml and amd-master.yaml contain only M3 changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…wnload

Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's shared HF cache where the ~414 GB MXFP8 checkpoint is already staged (/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the launcher default /it-share/data.

Scoped to M3 only via the M3 disagg script:

    export MODEL_PATH=/it-share/hf-hub-cache

submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode serving containers. Other disagg models keep /it-share/data.

This supersedes the earlier job.slurm auto-download approach, which is reverted: job.slurm now differs from main only by the #1585 mori-removal hunks (router image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
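
For readers unfamiliar with the shared-cache layout mentioned in this commit message, the snapshot lookup could look roughly like the sketch below. It follows the standard Hugging Face hub cache layout (`models--ORG--NAME/refs/main` and `snapshots/<revision>/`) and is not the logic in job.slurm; `resolve_snapshot` is a hypothetical name, and preferring `refs/main` before falling back to a single snapshot is an assumption.

```python
from pathlib import Path
from typing import Optional


def resolve_snapshot(cache_root: Path, repo_id: str) -> Optional[Path]:
    """Locate the snapshot directory for repo_id inside an HF hub cache."""
    repo_dir = cache_root / f"models--{repo_id.replace('/', '--')}"
    snapshots = repo_dir / "snapshots"

    # Prefer the revision recorded for the main branch, if present.
    ref_file = repo_dir / "refs" / "main"
    if ref_file.is_file():
        candidate = snapshots / ref_file.read_text().strip()
        if candidate.is_dir():
            return candidate

    # Assumption: with no usable ref, a single snapshot is unambiguous.
    if snapshots.is_dir():
        dirs = [p for p in snapshots.iterdir() if p.is_dir()]
        if len(dirs) == 1:
            return dirs[0]
    return None
```

With `MODEL_PATH=/it-share/hf-hub-cache`, a call such as `resolve_snapshot(Path("/it-share/hf-hub-cache"), "MiniMaxAI/MiniMax-M3-MXFP8")` would return the staged checkpoint directory if it exists, and `None` otherwise.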

simondanielsson · 2026-06-15T14:17:15Z

Run succeeded: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27334107848

cc: @chunfangamd should be ready for merge