[NVIDIA] Add TRT-LLM 70B FP8 via slurm #1

Closed
kedarpotdar-nv wants to merge 28 commits into main from kepotdar-trt-70b

@kedarpotdar-nv

Collaborator

Added B200 TRT-LLM runner configuration and consolidated runner logic

Changes Made:

  1. Added new B200 TRT-LLM job (bmk-b200-trt) in 70b-tmpl.yml
  • Uses nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 container
  • Runs nvidia/Llama-3.3-70B-Instruct-FP8 model
  • Same experimental parameters as other 70B configs
  2. Consolidated B200 runner logic
  • Updated launch_b200-nv.sh to use dynamic ${MODEL_CODE}_${RUNNER_LABEL}_slurm.sh pattern
  • Added RUNNER_LABEL environment variable in benchmark-tmpl.yml
  • Deleted redundant launch_b200-trt.sh
  3. Created TRT-LLM benchmark script (70b_b200-trt_slurm.sh)
  • Uses trtllm-serve with proper configuration
  • Inline llama-config.yml generation
  • Same client script (kimbochen/bench_serving)
  4. Temporarily disabled standard B200 vLLM for testing
  • Commented out bmk-b200 job
  • Updated collect-results dependencies

@kimbochen

kimbochen commented Aug 28, 2025

Collaborator

Thank you for the PR.
I think we should keep B200 vLLM because it's an important comparison.
By injecting the ${{ inputs.runner }} info at the "Launch job script" step, you can keep the default behavior:

- name: Launch job script
  run: |
    RUNNER_NAME=${{ runner.name }}
    RUNNER_LABEL=${{ inputs.runner }}
    bash ./runners/launch_${RUNNER_NAME%%_*}.sh ${{ inputs.exp-name }}

and in launch_b200-nv.sh:

bash benchmarks/${MODEL_CODE}_${RUNNER_LABEL}_slurm.sh
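
To make the two name expansions concrete, here is a minimal Python sketch of how they would resolve. It only mirrors the string handling; the runner name `b200-nv_0` is a hypothetical example, and the helper names are illustrative rather than part of the repository.

```python
def launcher_script(runner_name: str) -> str:
    """Mirror bash ${RUNNER_NAME%%_*}: drop everything from the first underscore on."""
    prefix = runner_name.split("_", 1)[0]
    return f"./runners/launch_{prefix}.sh"


def benchmark_script(model_code: str, runner_label: str) -> str:
    """Mirror benchmarks/${MODEL_CODE}_${RUNNER_LABEL}_slurm.sh."""
    return f"benchmarks/{model_code}_{runner_label}_slurm.sh"


# Hypothetical runner name; the model code and label match the TRT-LLM job in this PR.
launcher = launcher_script("b200-nv_0")
benchmark = benchmark_script("70b", "b200-trt")
```

With these inputs the launcher stays `launch_b200-nv.sh` for both vLLM and TRT-LLM runs, and only the runner label selects the benchmark script.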

@kedarpotdar-nv

Collaborator Author

Thanks for the review, @kimbochen!

Made these changes:

✅ uncommented vLLM
✅ use targeted variable injection (not global) for runner label
✅ dynamically select benchmark scripts based on runner labels

@kimbochen

Collaborator

Testing shows the script doesn't pick up RUNNER_LABEL.
Can you add RUNNER_LABEL back to env and remove it from the step?
Sorry, my bad.

@kedarpotdar-nv

Collaborator Author

No worries, reverted!

@kedarpotdar-nv

Collaborator Author

@kimbochen The B200 TRT jobs are failing because the TRT sqsh file shares its name with the vLLM one. I made a fix and also temporarily removed vLLM and the other configs, just to test whether B200 TRT is working. Can you please cancel the current job and re-run with these fixes?

salloc: Granted job allocation 1919
salloc: Waiting for resource configuration
salloc: Nodes dgx05-b200 are ready for job
+ srun --jobid=1919 bash -c 'enroot import -o /raid/image_70b_b200.sqsh docker://nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0'
Error:  File already exists: /raid/image_70b_b200.sqsh
srun: error: dgx05-b200: task 0: Exited with exit code 1
+ srun --jobid=1919 --container-image=/raid/image_70b_b200.sqsh --container-mounts=/home/gharunnerb1/actions-runner/_work/InferenceMAX/InferenceMAX:/workspace/,/raid/hf_hub_cache/:/mnt/hf_hub_cache/ --container-mount-home --container-workdir=/workspace/ --no-container-entrypoint --export=ALL bash benchmarks/70b_b200-trt_slurm.sh
JOB 1919 running on dgx05-b200
+ hf download nvidia/Llama-3.3-70B-Instruct-FP8
Fetching 25 files:   0%|          | 0/25 [00:00
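
The log shows the collision: the import refuses to overwrite `/raid/image_70b_b200.sqsh`, so the TRT-LLM script then runs inside whatever image is already there. The actual fix is not shown in this thread. As one possible approach, and purely as an assumption, the squash file name could be derived from the container reference so that different images never share a path. The function name and naming scheme below are hypothetical.

```python
import re


def squash_file_for(image: str, root: str = "/raid") -> str:
    """Build a squash-file path that is unique per container image reference."""
    # Replace every run of characters that is unsafe in a file name with "_".
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", image)
    return f"{root}/{safe}.sqsh"


trt_path = squash_file_for("nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0")
# Hypothetical second image reference, used only to show the paths differ.
other_path = squash_file_for("example.org/some/other-image:latest")
```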

jthomson04 pushed a commit to jthomson04/InferenceMAX that referenced this pull request Jan 21, 2026
cquil11 added the NVIDIA label Apr 8, 2026
cquil11 changed the title from "Add TRT-LLM 70B FP8 via slurm" to "[NVIDIA] Add TRT-LLM 70B FP8 via slurm" Apr 8, 2026
chunfangamd added a commit that referenced this pull request May 24, 2026
…DSA state-index path

amd-master.yaml
  - Image: rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0402
        -> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523
    (matches qwen3.5-fp8-mi355x-sglang-disagg; the older 0.5.9 image is
    no longer the reference build for hybrid-attention disagg models on
    MI355X.)
  - Scenarios: collapse the four legacy "top/middle/bottom/small-scale"
    search-spaces per ISL into a single 1P+1D TP=8 EP=1 dp-attn=false
    entry with the standard conc-list [8, 16, 32, 64, 128, 256, 512]
    for both 1k1k and 8k1k. dp-attn=false avoids the
    fused_moe_triton/layer.py:209 shared-slot assertion that
    --enable-dp-attention + --moe-a2a-backend mori triggers for GLM-5
    (256 routed + 1 shared expert; (256-1) % 8 = 7 != 0). The collapsed
    layout mirrors the qwen3.5-fp8-mi355x-sglang-disagg shape so the
    same CI matrix-expansion logic applies to both.

patches/mori_conn.py
  - Add patch #4: rank + length normalization in
    MoriKVReceiver._send_swa_dsa_state, immediately before the
    group_concurrent_contiguous call. For GLM-5 (single DSA component),
    upstream hands dst_state_indices as a 2-D (1, N) array while
    src_state_indices is 1-D length 1; the existing [:common_len]
    slice operates only on the outer axis, leaving the rank mismatched.
    np.diff then produces (1, N-1) vs (0,), which can't broadcast and
    crashes with "operands could not be broadcast together with shapes
    (1,12) (0,)". The fix ravels both indices to 1-D and re-truncates
    to common length so np.diff outputs compatible 1-D arrays. One-shot
    log gates the warning to once per receiver class.

  - Verified end-to-end:
      glm5-fp8-mi355x-sglang-disagg gsm8k flexible-extract = 0.9704 +/- 0.0047
      glm5-fp8-mi355x-sglang-disagg gsm8k strict-match     = 0.9712 +/- 0.0046
      qwen3.5-fp8-mi355x-sglang-disagg gsm8k (regression)  = 0.9780 +/- 0.004
    Patch #4 fires zero times on the Qwen3.5 Mamba path (it lives
    inside _send_swa_dsa_state, never called for Mamba); patches #1-#3
    behavior is unchanged.
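
The shape mismatch described for patch #4 can be reproduced in a few lines of NumPy. This is a sketch under assumptions: the helper name is hypothetical, N is taken as 13 to match the `(1,12)` in the quoted error, and the boolean expression stands in for whatever `group_concurrent_contiguous` does with the two `np.diff` results.

```python
import numpy as np


def normalize_state_indices(src, dst):
    """Flatten both index arrays to 1-D, then cut them to a common length."""
    src = np.asarray(src).ravel()
    dst = np.asarray(dst).ravel()
    common_len = min(len(src), len(dst))
    return src[:common_len], dst[:common_len]


# Shapes from the commit message: src is 1-D of length 1, dst is 2-D (1, N).
src = np.array([7])
dst = np.arange(13).reshape(1, 13)

# Before: slicing only the outer axis keeps dst 2-D, so the diffs have
# shapes (0,) and (1, 12) and cannot be combined.
try:
    (np.diff(src[:1]) != 1) | (np.diff(dst[:1]) != 1)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# After: both arrays are 1-D with equal length, so the diffs are compatible.
src_n, dst_n = normalize_state_indices(src, dst)
mask = (np.diff(src_n) != 1) | (np.diff(dst_n) != 1)
```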

patches/README.md
  - Document patch #4 alongside the existing three. Cross-link the full
    bug analysis at scripts/sglang_disagg/docs_glm5/01-bug-analysis.md
    and the gsm8k verification at
    scripts/sglang_disagg/docs_glm5/02-fix-and-verification.md.
functionstackx added a commit that referenced this pull request Jun 15, 2026
…wnload

Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's
shared HF cache where the ~414 GB MXFP8 checkpoint is already staged
(/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the
launcher default /it-share/data. Scoped to M3 only via the M3 disagg script:

    export MODEL_PATH=/it-share/hf-hub-cache

submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot
under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode
serving containers. Other disagg models keep /it-share/data.

This supersedes the earlier job.slurm auto-download approach, which is reverted:
job.slurm now differs from main only by the #1585 mori-removal hunks (router
image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
functionstackx added a commit that referenced this pull request Jun 15, 2026
functionstackx added a commit that referenced this pull request Jun 17, 2026
functionstackx added a commit that referenced this pull request Jun 18, 2026
functionstackx added a commit that referenced this pull request Jun 19, 2026
functionstackx added a commit that referenced this pull request Jun 21, 2026
Oseltamivir added a commit that referenced this pull request Jun 23, 2026
Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.
functionstackx added a commit that referenced this pull request Jun 24, 2026
* [Klaud Cold] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test

MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the
day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3): 1 prefill (TP8) +
1 decode (TP8) at conc 1, validating the MoRI-IO KV-transfer disagg pipeline
end-to-end for M3.

Layered on the MoRI-IO patch-removal infra (#1585): brings in that PR's
amd_utils changes (setup_deps.sh / server_vllm.sh / submit.sh / models_vllm.yaml
mori -> mori_low_latency) and the two job.slurm hunks (vllm-router image bump
nightly-20260511 -> nightly-20260603, drop VLLM_MORIIO_CONNECTOR_READ_MODE env),
while keeping main's atom-disagg support intact.

Per-worker serve flags (models_vllm.yaml MiniMax-M3-MXFP8): --block-size 128
(MSA), --language-model-only, --kv-cache-dtype fp8, --attention-backend
TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts
TP-sharded as in the single-node M3 TP8 recipe).

perf-changelog.yaml and amd-master.yaml contain only M3 changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* amd_utils/job.slurm: auto-download disagg checkpoint when not pre-staged

The first MI355X disagg sweep (run 27515119215) failed: the day-zero
MiniMax-M3-MXFP8 checkpoint is not staged on the disagg cluster's shared FS, so
job.slurm's model search hit a hard FATAL ("Model 'MiniMax-M3-MXFP8' not found.
Searched: ...") before the engine ever started. The single-node recipes
hf-download inside the serving container, but the disagg path historically
required ops to pre-stage checkpoints.

Add an on-demand fallback to the vllm-disagg model-resolution block: when the
checkpoint isn't found, derive the HF repo id from the hf_dir (models--org--name
-> org/name) and download into MODEL_DIR in HF cache layout, then resolve the
snapshot as MODEL_PATH. Staging into MODEL_DIR keeps MODEL_PATH under the dir
that is bind-mounted into the serving container as /models, so the existing
-v ${MODEL_DIR}:/models mount and DOCKER_MODEL_PATH (/models) remap both resolve.
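
The `models--org--name -> org/name` mapping can be sketched as follows. The function name is hypothetical and this is not the code in job.slurm; it only illustrates the string transformation on the Hugging Face cache directory name.

```python
def repo_id_from_hf_dir(hf_dir: str) -> str:
    """Turn an HF cache directory name (models--org--name) into a repo id (org/name)."""
    prefix = "models--"
    if not hf_dir.startswith(prefix):
        raise ValueError(f"not an HF model cache directory: {hf_dir!r}")
    # Split on the first "--" only, so hyphens inside the model name are preserved.
    org, sep, name = hf_dir[len(prefix):].partition("--")
    if not sep or not org or not name:
        raise ValueError(f"expected models--<org>--<name>, got {hf_dir!r}")
    return f"{org}/{name}"


repo_id = repo_id_from_hf_dir("models--MiniMaxAI--MiniMax-M3-MXFP8")
```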

Implementation notes:
  - The host has no hf CLI, so the download runs in a one-shot container of the
    serving image (DOCKER_IMAGE_NAME), which ships huggingface_hub.
  - flock on a lockfile in MODEL_DIR serializes the prefill/decode nodes; a
    re-check of snapshots/ under the lock makes it idempotent (resumable).
  - hf download with a huggingface-cli fallback; 3 retries; HF_TOKEN passed
    through for gated repos.
  - Scoped to the vllm-disagg branch only; pre-staged models never reach this
    path (the search finds them first), so sglang/atom and existing vLLM disagg
    models (M2.5/Kimi) are unaffected.
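
The lock-then-re-check pattern from the notes above can be sketched in Python with `fcntl.flock`. This is an illustration under assumptions, not the job.slurm implementation: the lock file name and the `download` callback are hypothetical, and the real flow runs the download inside a container.

```python
import fcntl
from pathlib import Path


def ensure_snapshot(model_dir: Path, hf_dir: str, download) -> Path:
    """Download at most once per shared directory; later callers reuse the result."""
    snapshots = model_dir / hf_dir / "snapshots"
    lock_path = model_dir / f".{hf_dir}.lock"  # hypothetical lock file name
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks while another node holds the lock
        try:
            # Re-check under the lock: another node may have finished already.
            if not (snapshots.is_dir() and any(snapshots.iterdir())):
                download(snapshots)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return snapshots
```

Because the existence check happens while the lock is held, the prefill and decode nodes can both call this and only the first one performs the download.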

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* job.slurm: --entrypoint "" for the auto-download container

The disagg auto-download reached hf download but failed all 3 attempts: the
one-shot `docker run "$DOCKER_IMAGE_NAME" bash -lc "hf download ..."` did not
override the image ENTRYPOINT, so the vllm-openai API server ran with the bash
command as its args and died with "Failed to infer device type" (no GPU mounted
in the download container). Add --entrypoint "" (as the serving container does)
so bash actually runs hf download.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* M3 disagg: use shared HF cache (/it-share/hf-hub-cache); drop auto-download

Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's
shared HF cache where the ~414 GB MXFP8 checkpoint is already staged
(/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the
launcher default /it-share/data. Scoped to M3 only via the M3 disagg script:

    export MODEL_PATH=/it-share/hf-hub-cache

submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot
under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode
serving containers. Other disagg models keep /it-share/data.

This supersedes the earlier job.slurm auto-download approach, which is reverted:
job.slurm now differs from main only by the #1585 mori-removal hunks (router
image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* disagg #1762: add 8k1k conc-16 row to run an lm-eval (validate correctness)

The conc-1 1k1k smoke test never triggered an eval — the multi-node eval policy
only marks 8k1k entries with conc >= MIN_EVAL_CONC (16). Add an 8k1k conc-16 row
(same 1P TP8 + 1D TP8 layout) so mark_eval_entries marks it run-eval=true
(eval-conc=16), running lm-eval through the MoRI-IO disagg pipeline to validate
correctness. The conc-1 1k1k row stays the latency smoke test.

Run with non-canary-full-sweep-enabled so the (non-min-conc) eval entry runs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* disagg #1762: sweep conc 1,2,4,8,16 (not just conc 1)

Widen the 1k1k disagg latency/throughput sweep from conc 1 to conc 1,2,4,8,16
(1P TP8 + 1D TP8). The 8k1k conc-16 eval row is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* disagg #1762: sweep conc 1,2,4,8,16 at both 1k1k and 8k1k

Widen the disagg sweep from conc 1 to conc 1,2,4,8,16 for both seq-len scenarios
(1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked
(eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Update the vLLM external router container

vllm/vllm-router only retains ~16 recent nightlies on Docker Hub; older
dated tags are garbage-collected (manifest unknown), which makes `docker run`
fail with exit 125 on any node that has not already cached the image.

* M3 disagg: per-layer MoRIIO KV transfer for hybrid sparse-attn (partial)

MiniMax-M3 (MiniMaxM3SparseForCausalLM) is a hybrid sparse-attention model:
sparse layers register a separate lightning-indexer cache (MLAAttentionSpec,
rank-3, bf16, key-only) alongside the main cache (FullAttentionSpec, rank-5,
fp8, K+V). The MoRIIO connector assumes one uniform KV layout -- it derives
block geometry from the first cache and reuses first_layer's offsets for every
layer (see its own "hybrid attn" TODO) -- so the bf16 key-only index cache is
transferred with fp8 K+V sizing and gets corrupted on the decode worker,
producing garbage output (disagg gsm8k ~= 0 while single-node M3 is correct).
This is the vLLM analogue of the SGLang MoRI DSA-state bug in patches/mori_conn.py.

- patches/moriio_heterogeneous_kv.py: compute the READ-path transfer geometry
  per layer (own shape/stride/dtype/rank) instead of from the first cache.
  Idempotent; no-op for homogeneous models.
- setup_deps.sh: apply it on the vllm-disagg path.

NOTE: partial fix -- necessary but not yet sufficient. The index cache is also a
separate KV-cache group whose block-table/num_blocks the single-namespace MoRIIO
connector cannot map, so M3 disagg accuracy is still broken pending a larger
multi-group / index-state transfer change. (Disabling sparse attention is not a
viable workaround: M3's fused QKV carries index_k weights, so dropping the
indexer breaks weight load.)

Refs #1762

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(amd-disagg): add vLLM MoRIIO KV-layout patch to reuse stock minimax-m3 image

The vLLM MoRIIOConnector in vllm/vllm-openai-rocm:minimax-m3 assumes the
FlashAttention KV layout [2, num_blocks, ...] (K/V axis outer) but this
vLLM's backends allocate [num_blocks, 2, ...] (K/V axis inner), so every
disagg block transfer reads the wrong region. Invisible to throughput,
but corrupts GQA/non-MLA accuracy (MiniMax-M3 gsm8k 0.0008 -> 0.957).
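
To see why the axis order matters, consider the flat element offset of one block under each layout. The sketch below assumes a contiguous row-major buffer; the layout labels, function name, and sizes are illustrative and not taken from the vLLM source.

```python
def block_offset(layout: str, kv: int, block: int, num_blocks: int, block_elems: int) -> int:
    """Element offset of one K (kv=0) or V (kv=1) block in a flat cache buffer."""
    if layout == "kv_outer":  # shape [2, num_blocks, ...]
        return (kv * num_blocks + block) * block_elems
    if layout == "kv_inner":  # shape [num_blocks, 2, ...]
        return (block * 2 + kv) * block_elems
    raise ValueError(f"unknown layout: {layout}")


NUM_BLOCKS, BLOCK_ELEMS = 8, 1024  # illustrative sizes, not taken from the PR

# Where the connector would look for K of block 3 if it assumes the K/V axis is outer.
assumed = block_offset("kv_outer", kv=0, block=3, num_blocks=NUM_BLOCKS, block_elems=BLOCK_ELEMS)
# Where K of block 3 actually lives when the backend allocates the K/V axis inner.
actual = block_offset("kv_inner", kv=0, block=3, num_blocks=NUM_BLOCKS, block_elems=BLOCK_ELEMS)
```

In this example the assumed offset lands on the V half of block 1 in the real layout, so the transfer completes at full speed but moves the wrong data, which fits the observation that throughput looked normal while accuracy collapsed.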

Instead of baking a fix into a rebuilt image (-hetkv) or carrying full
vendored copies of the patched files in-tree, carry just the 218-line
unified diff (patches/moriio/moriio-kv-layout-fix.diff) and apply it with
`patch -p1` against the vLLM package dir inside the container at startup,
ahead of the server launch. The repo is already bind-mounted into the
container, so no EXTRA_DOCKER_MOUNTS wiring is needed -- job.slurm
auto-applies the diff when DOCKER_IMAGE_NAME contains "minimax-m3"
(skippable with MORIIO_KV_PATCH=skip), mirroring the existing
mori_conn.py sglang hook. A failed apply aborts the container instead of
silently running unpatched.

Validated on a manual 2-node run (n06-21 prefill+router / n09-21 decode)
using the STOCK image: gsm8k strict-match 0.9568 / flexible-extract
0.9560 (matches the baked image within noise), decode probe healthy.

- patches/moriio/moriio-kv-layout-fix.diff: unified diff vs stock
- job.slurm: in-container `patch` step, MORIIO_KV_PATCH=skip opt-out
- patches/README.md: document the moriio/ diff-apply mechanism

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* disagg #1762: extend conc sweep to 32,64,128,256,512,1024 at 1k1k and 8k1k

Widen the disagg sweep from conc 1,2,4,8,16 to
1,2,4,8,16,32,64,128,256,512,1024 for both seq-len scenarios (1P TP8 + 1D
TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16)
so lm-eval still validates the MoRI-IO disagg pipeline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* disagg #1762: add TP4-prefill P/D layouts (TP4+TP8, TP4+TP4) at 1k1k and 8k1k

Add two asymmetric prefill/decode layouts alongside the existing TP8+TP8 sweep,
for both seq-len scenarios:
  - 1P TP4 + 1D TP8 (smaller prefill, full-node decode) at conc 1..256
  - 1P TP4 + 1D TP4 (balanced half-node) at conc 64..1024

Per-worker TP is driven by the master-config prefill/decode tp: server_vllm.sh
sed-rewrites the models_vllm.yaml --tensor-parallel-size 8 placeholder to the
computed PREFILL_TP_SIZE/DECODE_TP_SIZE, so no models_vllm.yaml flag change is
needed (comment updated to say so). The multinode eval policy still marks exactly
one lm-eval (groups by dp-attn, not TP) on the TP8+TP8 8k1k layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(amd-disagg): bundle heterogeneous-TP + dup-ack fixes into unified MoRIIO diff

Replaces moriio-kv-layout-fix.diff with moriio-minimax-m3-disagg.diff, which
bundles three layered fixes for the stock minimax-m3 vLLM image:
1. KV-layout: axis-aware per-layer block offsets (the gsm8k 0.0008→0.958 fix,
   required for homogeneous TP too).
2. heterogeneous-TP addressing + guard: maps each decode rank to the correct
   prefill rank (tp_rank // ratio) for PREFILL_TP_SIZE != DECODE_TP_SIZE, and
   raises NotImplementedError for unsupported cases (prefill-TP > decode-TP,
   KV-head splitting) instead of silently corrupting KV.
3. dup-ack fan-in: with DECODE_TP_SIZE > PREFILL_TP_SIZE, producer counts ACKs
   per transfer_id and only frees KV blocks once all expected consumers ACK,
   preventing both the late-ACK EngineCore crash and KV reuse before slower
   decode ranks finish reading.
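
The ACK fan-in in item 3 can be sketched as a per-transfer counter. The class and method names are hypothetical, and the expected count is assumed to be `DECODE_TP_SIZE // PREFILL_TP_SIZE`; the real connector may derive it differently.

```python
from collections import defaultdict


class AckTracker:
    """Count consumer ACKs per transfer and report when KV blocks may be freed."""

    def __init__(self, prefill_tp: int, decode_tp: int):
        # Assumption: each prefill rank is read by decode_tp // prefill_tp decode ranks.
        self.expected = max(1, decode_tp // prefill_tp)
        self.counts = defaultdict(int)

    def on_ack(self, transfer_id: str) -> bool:
        """Return True only for the ACK that completes the expected set."""
        self.counts[transfer_id] += 1
        if self.counts[transfer_id] == self.expected:
            del self.counts[transfer_id]
            return True
        return False


tracker = AckTracker(prefill_tp=4, decode_tp=8)
first_ack = tracker.on_ack("req-0")   # one decode rank done, blocks must stay
second_ack = tracker.on_ack("req-0")  # both decode ranks done, blocks can be freed
```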

job.slurm and patches/README.md updated to reference the new diff name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(moriio): correct _remote_tp_rank for prefill-TP > decode-TP (P8/D4)

With P8/D4 and 4 KV heads, vLLM distributes heads across prefill ranks
in consecutive pairs: (rank0,rank1)→head0, (rank2,rank3)→head1, etc.
The previous patch used `return self.tp_rank` for the P>D branch, which
made decode rank 1 connect to prefill rank 1 (holds head0) instead of
prefill rank 2 (holds head1) — corrupting KV for all decode ranks except 0.

Fix: use `self.tp_rank * ratio` (ratio = remote_tp_size // local_tp_size),
the symmetric counterpart to the D>P case's `tp_rank // ratio`. This maps
each decode rank to the *first* prefill rank of its head group, which holds
the correct KV content via vLLM's replication scheme.
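
Both branches of the rank mapping described here fit in a short function. This is a sketch that follows the formulas in the commit messages; the standalone function signature is an assumption, since the real `_remote_tp_rank` is a method on the connector.

```python
def remote_tp_rank(tp_rank: int, local_tp_size: int, remote_tp_size: int) -> int:
    """Map a decode (local) rank to the prefill (remote) rank it should read from."""
    if local_tp_size == remote_tp_size:
        return tp_rank
    if local_tp_size > remote_tp_size:
        # D > P: several decode ranks share one prefill rank.
        ratio = local_tp_size // remote_tp_size
        return tp_rank // ratio
    # P > D: pick the first prefill rank of the group holding this rank's heads.
    ratio = remote_tp_size // local_tp_size
    return tp_rank * ratio


p8_d4 = [remote_tp_rank(r, local_tp_size=4, remote_tp_size=8) for r in range(4)]
p4_d8 = [remote_tp_rank(r, local_tp_size=8, remote_tp_size=4) for r in range(8)]
```

For P8/D4 this sends decode rank 1 to prefill rank 2, as the commit message requires, instead of prefill rank 1.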

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(moriio-diff): correct hunk header count after _remote_tp_rank expansion

The P>D fix added 4 lines to _remote_tp_rank but the hunk header still
said +1100,40; patch aborted with "malformed patch at line 79". Update
to +1100,44 to match the actual 6 context + 38 added lines.
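
The counts in a unified-diff hunk header can be recomputed from the hunk body, which is a cheap check before committing a hand-edited diff. The helper below is a hypothetical sketch; it assumes every body line starts with a space, `+`, or `-`.

```python
def hunk_line_counts(body):
    """Return (old_count, new_count) for the lines of one unified-diff hunk body."""
    # Context lines (" ") belong to both sides; "-" only to old, "+" only to new.
    old = sum(1 for line in body if line[:1] in (" ", "-"))
    new = sum(1 for line in body if line[:1] in (" ", "+"))
    return old, new


# The case from the commit message: 6 context lines plus 38 added lines.
body = [" context"] * 6 + ["+added"] * 38
old_count, new_count = hunk_line_counts(body)
```

Here the new-side count is 44, matching the corrected `+1100,44` header.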

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(amd-disagg): keep MoRIIO patch cmd inside container bash -lc quotes

The MoRIIO KV-layout patch was injected into the per-node container launch
via '"${_MORIIO_PATCH_CMD:-}"', which breaks out of the outer
srun bash -c "..." double-quoted string. Because the patch command value
contains spaces and the shell operators '<' and '||', the unquoted
expansion word-split the generated container script, truncating it right
after the word `patch` and silently dropping the patch arguments AND the
server.sh launch. The container then exited 0:0 within seconds, producing
no benchmark/eval output -> collect_latest_results found "No logs
directory" -> the launch step failed with exit 1 (all minimax-m3 disagg
jobs affected).

Fix: expand ${_MORIIO_PATCH_CMD:-} directly inside the inner bash -lc
single quotes (no quote toggling), so the patch command stays intact and
its operators are parsed by the container shell. Validated end-to-end:
gsm8k recovers from ~0 (garbage) to 0.94-0.98 across P8D8/P4D8/P8D4.

Co-authored-by: Cursor <cursoragent@cursor.com>

* disagg #1762: add 2P TP4 + 1D TP8 layout at conc 256,512,768,1024 (1k1k & 8k1k)

Two TP4 prefill workers (num-worker 2, PREFILL_NODES=2, each TP4 on half an
8-GPU node) feeding one TP8 decode (DECODE_NODES=1) — 3 nodes total. Added to
both seq-len scenarios at conc 256,512,768,1024. Eval marking unchanged (still
one lm-eval on the 8k1k TP8+TP8 layout).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(amd-disagg): remove redundant moriio_heterogeneous_kv.py patcher

The per-layer READ-offset fix this Python patcher applied to
moriio_connector.py is fully subsumed by the unified overlay
patches/moriio/moriio-minimax-m3-disagg.diff, which job.slurm applies
with `patch -p1` BEFORE server.sh sources setup_deps.sh. The diff
rewrites the exact lines the patcher searches for (the `first_layer`
single-offset block and the `is_mla = len(self.kv_cache_shape)` sizing),
with a stronger geometry-memoized + heterogeneous-TP-aware version, so
the patcher's OLD1/OLD2 patterns no longer match and it already no-ops
("pattern not found; skipping") in the real flow. It's also the same
fix now upstreamed in vLLM #46039 (READ mixed KV layouts).

Drop the dead patcher and its setup_deps.sh hook so the diff is the
single source of truth. patches/README.md only documents the diff (no
reference to this patcher), so no README change is needed.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Use upstream nightly image for MiniMax-M3 disagg, drop MoRIIO overlay

- Co-work with Gupta, Ravi

All three MoRIIO fixes the in-tree overlay carried have merged upstream and now
ship in the ROCm nightly image:
  - vLLM #46039  READ-mode mixed KV-layout (axis-aware per-layer offsets)
  - vLLM #46290  WRITE-mode per-geometry offset caching
  - vLLM #46332  heterogeneous-TP rank mapping + ACK fan-in

Point minimaxm3-fp8-mi355x-vllm-disagg at
vllm/vllm-openai-rocm:nightly-556bc4e3a089378e9df2482659898192da18db15
(vLLM 0.23.1rc1.dev363+g556bc4e3a, which contains all three merges) and remove
the stop-gap overlay:
  - delete patches/moriio/moriio-minimax-m3-disagg.diff
  - drop the job.slurm in-container auto-apply block (+ MORIIO_KV_PATCH gate)
  - trim the moriio/ section from patches/README.md

Verified on the nightly image with NO patch across all four P/D layouts x
conc {1,4,8}, gsm8k strict/flexible 0.95-0.97 (1P8+1D8, 1P4+1D8, 1P4+1D4,
2P4+1D8) -- matching the previously-patched results.

Refs #1762.

* fix: append M3 MI355X disagg changelog entry at end of file

The minimaxm3-fp8-mi355x-vllm-disagg entry was inserted mid-file (after
the #1862 entry), which violates the append-only changelog gate
("entry 511 changed; existing entries are immutable"). Move it to the
end of perf-changelog.yaml so existing entries stay byte-identical to
main and the new entry is a clean append.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Chun Fang <chun.fang@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: TianDi101 <ditian12@amd.com>
Oseltamivir added a commit that referenced this pull request Jun 25, 2026
…p99, routing identity

Addresses review #3 methodology critiques (schema_version 3):

- Explicit measurement contracts (#4): adapters declare SUPPORTED_CONTRACTS and conform,
  rather than each choosing its own timing boundary. layout-and-dispatch-v1 times
  get_dispatch_layout INSIDE dispatch (the only contract MoRI can honor — its layout is
  computed in-kernel); cached-layout-comm-only-v1 hoists layout out (DeepEP normal) so
  dispatch is pure comm. run_ep.py rejects unsupported contract / ll+cached-layout. The
  misleading "comm-only-v1" label is gone.

- Pooled-trial percentiles (#9, #2): N trials (default 3) x iters, token-order randomized
  per trial (seeded => identical across ranks; MoRI keeps ascending to avoid cold-jump
  wedge), per-iteration cross-rank-MAX samples POOLED, then p50/p90/p99 (p99 headline).
  p99 from ~50 samples was just the max. (#2 aggregation was already Q_p(max_r); verified.)

- Routing identity proof (#3): routing_hash now SHA-256 of topk_idx AND gate weights;
  cross-rank trace-signature MIN==MAX check proves every rank (NVIDIA + AMD) built the
  identical trace, else status=invalid. Added per-dest-rank send histogram.

- Separated logical bytes (#6): dispatch_logical_bytes + combine_logical_bytes recorded at
  their real dtypes with byte_contract; serial bandwidth removed. serial relabeled "sum of
  isolated medians". Correctness scope tagged roundtrip-reconstruction-smoke-v1 (#8 honesty).

- Run linkage (#1): artifacts record GHA run_id/attempt/source SHA when present.
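
The pooled-trial percentile scheme from the second bullet can be sketched as follows. The nearest-rank percentile definition is an assumption (the commit does not say which definition is used), and the function names and data layout are hypothetical.

```python
import math


def nearest_rank(samples, p: int):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]


def pooled_percentiles(trials):
    """trials[t][i][r] is the latency of rank r at iteration i of trial t."""
    # Take the slowest rank per iteration, then pool the samples across all trials.
    pooled = [max(per_rank) for trial in trials for per_rank in trial]
    return {p: nearest_rank(pooled, p) for p in (50, 90, 99)}


single_trial_p99 = nearest_rank(range(1, 51), 99)    # 50 samples
pooled_p99 = nearest_rank(range(1, 151), 99)         # 3 trials x 50 samples
```

Under this definition, p99 over 50 samples is simply the maximum, while pooling 150 samples makes p99 the second-largest value, so one outlier no longer decides the headline number.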
Oseltamivir added a commit that referenced this pull request Jun 25, 2026
…tract/run metadata

- capability.py (stdlib): static table mirroring adapter SUPPORTED_* sets; resolves
  (sku->vendor, backend, mode, dtype, contract) -> valid/why. Workflow runs it as a
  fail-fast "Validate capability" gate BEFORE consuming a runner (review #3 #2).
- NCCL/RCCL phase-dedup: matrix collapses to a single 'na' job for collective backends
  (phase is meaningless for nccl/rccl — was running identical work twice).
- contract input + CX_MEASUREMENT_CONTRACT threaded through run_in_container -> run_ep;
  CX_TRIALS too. COLLECTIVEX_SOURCE_SHA + GHA run id/attempt reach the artifact (run
  linkage, review #3 #1). run_ep reads GITHUB_SHA as the source-sha fallback.
Oseltamivir added a commit that referenced this pull request Jun 25, 2026
… rate, run links

Addresses review #3 frontend critiques (backward-compatible with v2 docs):
- Percentile selector p50/p90/p99 (p99 default); reads pooled-trial percentiles.
- Suite selector backend-default vs resource-constrained — kept distinct, never read as
  one fair contest (#5). dtype/mode/resource/contract are all in the per-line label +
  hover; lines are uniquely colored (SKU family) + dashed-fp8 (#10).
- Bandwidth axis renamed "Logical routed payload rate" using SEPARATE dispatch/combine
  bytes; serial bandwidth removed; serial relabeled "Σ isolated medians" (#6,#7).
- Hover shows p50/p90/p99, contract, suite, and the WORKFLOW RUN (run id + sha) that
  produced the point (#1). Provenance text no longer claims a single dtype (the
  "bf16 while fp8 shown" bug); states routing-identity-proven, pooled-sample count,
  logical-rate caveat, suite-separation, and correctness-is-smoke (#9 fix).
3 participants