[NVIDIA] Add TRT-LLM 70B FP8 via slurm#1
Thank you for the PR.
    - name: Launch job script
      run: |
        RUNNER_NAME=${{ runner.name }}
        RUNNER_LABEL=${{ inputs.runner }}
        bash ./runners/launch_${RUNNER_NAME%%_*}.sh ${{ inputs.exp-name }}
and in
    bash benchmarks/${MODEL_CODE}_${RUNNER_LABEL}_slurm.sh

Thanks for the review, @kimbochen! Made these changes: ✅ uncommented vLLM

Testing shows the script doesn't pick up

No worries, reverted!

@kimbochen B200 TRT jobs are failing because the TRT sqsh file shares its name with the vLLM one. I made a fix and also temporarily removed the vLLM and other configs, just to test whether B200 TRT is working. Can you please cancel the current job and re-run with these fixes?
salloc: Granted job allocation 1919
salloc: Waiting for resource configuration
salloc: Nodes dgx05-b200 are ready for job
+ srun --jobid=1919 bash -c 'enroot import -o /raid/image_70b_b200.sqsh docker://nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0'
Error: File already exists: /raid/image_70b_b200.sqsh
srun: error: dgx05-b200: task 0: Exited with exit code 1
+ srun --jobid=1919 --container-image=/raid/image_70b_b200.sqsh --container-mounts=/home/gharunnerb1/actions-runner/_work/InferenceMAX/InferenceMAX:/workspace/,/raid/hf_hub_cache/:/mnt/hf_hub_cache/ --container-mount-home --container-workdir=/workspace/ --no-container-entrypoint --export=ALL bash benchmarks/70b_b200-trt_slurm.sh
JOB 1919 running on dgx05-b200
+ hf download nvidia/Llama-3.3-70B-Instruct-FP8
Fetching 25 files: 0%| | 0/25 [00:00
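The collision in the log above happens because two frameworks resolve to the same squash-file path. One way to rule that out, sketched here in Python under stated assumptions, is to derive the file name from the image URI itself. The helper name `sqsh_path_for` and the vLLM image tag are hypothetical and used only for illustration; this is not necessarily the fix applied in the PR.

```python
import re
from pathlib import Path


def sqsh_path_for(image_uri: str, cache_dir: str = "/raid") -> Path:
    """Map a container image URI to a squash-file path unique to that image.

    Hypothetical helper: the naming scheme is an assumption, not the PR's.
    """
    # Drop the scheme used for registry imports, if present.
    name = image_uri.removeprefix("docker://")
    # Replace every run of characters that is awkward in a file name with '_'.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", name)
    return Path(cache_dir) / f"{safe}.sqsh"


# The TRT-LLM image is the one from the log; the vLLM tag is illustrative only.
trt = sqsh_path_for("docker://nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0")
vllm = sqsh_path_for("docker://vllm/vllm-openai:latest")
print(trt.name)
print(vllm.name)
```

With a name derived from the image, an "already exists" result can be treated as a cache hit for that exact image instead of a clash with a different one.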
…summarize.py to reflect backend, fix issue with result filename
Add upstream sync workflow
…DSA state-index path
amd-master.yaml
- Image: rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0402
-> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523
(matches qwen3.5-fp8-mi355x-sglang-disagg; the older 0.5.9 image is
no longer the reference build for hybrid-attention disagg models on
MI355X.)
- Scenarios: collapse the four legacy "top/middle/bottom/small-scale"
search-spaces per ISL into a single 1P+1D TP=8 EP=1 dp-attn=false
entry with the standard conc-list [8, 16, 32, 64, 128, 256, 512]
for both 1k1k and 8k1k. dp-attn=false avoids the
fused_moe_triton/layer.py:209 shared-slot assertion that
--enable-dp-attention + --moe-a2a-backend mori triggers for GLM-5
(256 routed + 1 shared expert; (256-1) % 8 = 7 != 0). The collapsed
layout mirrors the qwen3.5-fp8-mi355x-sglang-disagg shape so the
same CI matrix-expansion logic applies to both.
patches/mori_conn.py
- Add patch #4: rank + length normalization in
MoriKVReceiver._send_swa_dsa_state, immediately before the
group_concurrent_contiguous call. For GLM-5 (single DSA component),
upstream hands dst_state_indices as a 2-D (1, N) array while
src_state_indices is 1-D length 1; the existing [:common_len]
slice operates only on the outer axis, leaving the rank mismatched.
np.diff then produces (1, N-1) vs (0,), which can't broadcast and
crashes with "operands could not be broadcast together with shapes
(1,12) (0,)". The fix ravels both indices to 1-D and re-truncates
to common length so np.diff outputs compatible 1-D arrays. One-shot
log gates the warning to once per receiver class.
- Verified end-to-end:
glm5-fp8-mi355x-sglang-disagg gsm8k flexible-extract = 0.9704 +/- 0.0047
glm5-fp8-mi355x-sglang-disagg gsm8k strict-match = 0.9712 +/- 0.0046
qwen3.5-fp8-mi355x-sglang-disagg gsm8k (regression) = 0.9780 +/- 0.004
Patch #4 fires zero times on the Qwen3.5 Mamba path (it lives
inside _send_swa_dsa_state, never called for Mamba); patches #1-#3
behavior is unchanged.
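The shape problem behind patch #4 can be reproduced in a few lines of NumPy. This is a minimal sketch, not the patch itself: `normalize_state_indices` and `break_points` are hypothetical names, and the `np.diff`-based grouping expression is an assumption about what the grouping step does with the two index arrays.

```python
import numpy as np


def normalize_state_indices(src, dst):
    """Flatten both index arrays to 1-D and truncate them to a common length."""
    src = np.asarray(src).ravel()
    dst = np.asarray(dst).ravel()
    common_len = min(src.size, dst.size)
    return src[:common_len], dst[:common_len]


def break_points(src, dst):
    """Positions where either index run stops being contiguous.

    Assumed form of the grouping step; it combines np.diff of both arrays,
    so the two diffs must broadcast against each other.
    """
    return np.where((np.diff(src) != 1) | (np.diff(dst) != 1))[0] + 1


src = np.array([7])                  # 1-D, length 1
dst = np.arange(13).reshape(1, 13)   # 2-D, shape (1, 13)

# Slicing only the outer axis leaves the rank mismatch in place:
# np.diff yields shapes (0,) and (1, 12), which cannot broadcast.
try:
    break_points(src[:1], dst[:1])
    failed = False
except ValueError:
    failed = True

# After ravel + re-truncation both arrays are 1-D with the same length.
src_n, dst_n = normalize_state_indices(src, dst)
groups = break_points(src_n, dst_n)
print(failed, src_n.shape, dst_n.shape, groups.size)
```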
patches/README.md
- Document patch #4 alongside the existing three. Cross-link the full
bug analysis at scripts/sglang_disagg/docs_glm5/01-bug-analysis.md
and the gsm8k verification at
scripts/sglang_disagg/docs_glm5/02-fix-and-verification.md.
…wnload
Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's
shared HF cache where the ~414 GB MXFP8 checkpoint is already staged
(/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the
launcher default /it-share/data. Scoped to M3 only via the M3 disagg script:
export MODEL_PATH=/it-share/hf-hub-cache
submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot
under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode
serving containers. Other disagg models keep /it-share/data.
This supersedes the earlier job.slurm auto-download approach, which is reverted:
job.slurm now differs from main only by the #1585 mori-removal hunks (router
image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
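For readers unfamiliar with the HF hub cache layout referenced above (`models--org--name/snapshots/<revision>/`), here is a rough Python sketch of resolving a snapshot under MODEL_DIR. Both function names are hypothetical, and the rule for choosing among several snapshots (lexicographically last) is an assumption; job.slurm's actual search logic is not shown in this thread.

```python
import tempfile
from pathlib import Path
from typing import Optional


def repo_id_from_cache_dir(dirname: str) -> str:
    """'models--org--name' -> 'org/name', following the HF hub cache naming."""
    prefix = "models--"
    if not dirname.startswith(prefix):
        raise ValueError(f"not a model cache dir: {dirname}")
    org, sep, name = dirname[len(prefix):].partition("--")
    return f"{org}/{name}" if sep else org


def resolve_snapshot(model_dir: Path, cache_dirname: str) -> Optional[Path]:
    """Return one snapshot directory under model_dir, or None if none is staged."""
    snapshots = model_dir / cache_dirname / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = sorted(p for p in snapshots.iterdir() if p.is_dir())
    # Assumption: take the lexicographically last snapshot for determinism.
    return candidates[-1] if candidates else None


cache_dirname = "models--MiniMaxAI--MiniMax-M3-MXFP8"
repo_id = repo_id_from_cache_dir(cache_dirname)

# Build a throwaway cache layout to exercise the lookup ("abc123" is a fake revision).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    missing = resolve_snapshot(root, cache_dirname)
    (root / cache_dirname / "snapshots" / "abc123").mkdir(parents=True)
    found_name = resolve_snapshot(root, cache_dirname).name

print(repo_id, missing, found_name)
```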
Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.
* [Klaud Cold] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test
MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3): 1 prefill (TP8) + 1 decode (TP8) at conc 1, validating the MoRI-IO KV-transfer disagg pipeline end-to-end for M3.
Layered on the MoRI-IO patch-removal infra (#1585): brings in that PR's amd_utils changes (setup_deps.sh / server_vllm.sh / submit.sh / models_vllm.yaml mori -> mori_low_latency) and the two job.slurm hunks (vllm-router image bump nightly-20260511 -> nightly-20260603, drop VLLM_MORIIO_CONNECTOR_READ_MODE env), while keeping main's atom-disagg support intact.
Per-worker serve flags (models_vllm.yaml MiniMax-M3-MXFP8): --block-size 128 (MSA), --language-model-only, --kv-cache-dtype fp8, --attention-backend TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts TP-sharded as in the single-node M3 TP8 recipe).
perf-changelog.yaml and amd-master.yaml contain only M3 changes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* amd_utils/job.slurm: auto-download disagg checkpoint when not pre-staged
The first MI355X disagg sweep (run 27515119215) failed: the day-zero MiniMax-M3-MXFP8 checkpoint is not staged on the disagg cluster's shared FS, so job.slurm's model search hit a hard FATAL ("Model 'MiniMax-M3-MXFP8' not found. Searched: ...") before the engine ever started. The single-node recipes hf-download inside the serving container, but the disagg path historically required ops to pre-stage checkpoints.
Add an on-demand fallback to the vllm-disagg model-resolution block: when the checkpoint isn't found, derive the HF repo id from the hf_dir (models--org--name -> org/name) and download into MODEL_DIR in HF cache layout, then resolve the snapshot as MODEL_PATH.
Staging into MODEL_DIR keeps MODEL_PATH under the dir that is bind-mounted into the serving container as /models, so the existing -v ${MODEL_DIR}:/models mount and DOCKER_MODEL_PATH (/models) remap both resolve.
Implementation notes:
- The host has no hf CLI, so the download runs in a one-shot container of the serving image (DOCKER_IMAGE_NAME), which ships huggingface_hub.
- flock on a lockfile in MODEL_DIR serializes the prefill/decode nodes; a re-check of snapshots/ under the lock makes it idempotent (resumable).
- hf download with a huggingface-cli fallback; 3 retries; HF_TOKEN passed through for gated repos.
- Scoped to the vllm-disagg branch only; pre-staged models never reach this path (the search finds them first), so sglang/atom and existing vLLM disagg models (M2.5/Kimi) are unaffected.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* job.slurm: --entrypoint "" for the auto-download container
The disagg auto-download reached hf download but failed all 3 attempts: the one-shot `docker run "$DOCKER_IMAGE_NAME" bash -lc "hf download ..."` did not override the image ENTRYPOINT, so the vllm-openai API server ran with the bash command as its args and died with "Failed to infer device type" (no GPU mounted in the download container). Add --entrypoint "" (as the serving container does) so bash actually runs hf download.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* M3 disagg: use shared HF cache (/it-share/hf-hub-cache); drop auto-download
Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's shared HF cache where the ~414 GB MXFP8 checkpoint is already staged (/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the launcher default /it-share/data.
Scoped to M3 only via the M3 disagg script:
export MODEL_PATH=/it-share/hf-hub-cache
submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode serving containers. Other disagg models keep /it-share/data.
This supersedes the earlier job.slurm auto-download approach, which is reverted: job.slurm now differs from main only by the #1585 mori-removal hunks (router image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* disagg #1762: add 8k1k conc-16 row to run an lm-eval (validate correctness)
The conc-1 1k1k smoke test never triggered an eval: the multi-node eval policy only marks 8k1k entries with conc >= MIN_EVAL_CONC (16). Add an 8k1k conc-16 row (same 1P TP8 + 1D TP8 layout) so mark_eval_entries marks it run-eval=true (eval-conc=16), running lm-eval through the MoRI-IO disagg pipeline to validate correctness. The conc-1 1k1k row stays the latency smoke test. Run with non-canary-full-sweep-enabled so the (non-min-conc) eval entry runs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* disagg #1762: sweep conc 1,2,4,8,16 (not just conc 1)
Widen the 1k1k disagg latency/throughput sweep from conc 1 to conc 1,2,4,8,16 (1P TP8 + 1D TP8). The 8k1k conc-16 eval row is unchanged.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* disagg #1762: sweep conc 1,2,4,8,16 at both 1k1k and 8k1k
Widen the disagg sweep from conc 1 to conc 1,2,4,8,16 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline.
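The eval-marking rule described in the conc-16 commit above (only 8k1k entries with conc >= MIN_EVAL_CONC get an eval) can be sketched as follows. This is an illustration only: the real mark_eval_entries is not shown in this thread, so the function name, the `seq` and `conc` field names, and the choice of the smallest eligible concurrency as eval-conc are all assumptions.

```python
MIN_EVAL_CONC = 16  # threshold quoted in the commit message


def mark_eval_entries_sketch(entries):
    """Return copies of the entries with run-eval / eval-conc filled in.

    Assumed policy: an entry is evaluated only if it is an 8k1k scenario
    and sweeps at least one concurrency >= MIN_EVAL_CONC.
    """
    marked = []
    for entry in entries:
        entry = dict(entry)  # do not mutate the caller's data
        eligible = [c for c in entry["conc"] if c >= MIN_EVAL_CONC]
        if entry["seq"] == "8k1k" and eligible:
            entry["run-eval"] = True
            entry["eval-conc"] = min(eligible)  # assumption: lowest eligible point
        else:
            entry["run-eval"] = False
        marked.append(entry)
    return marked


sweep = [
    {"seq": "1k1k", "conc": [1]},                # latency smoke test, no eval
    {"seq": "8k1k", "conc": [1, 2, 4, 8, 16]},   # reaches the threshold
]
result = mark_eval_entries_sketch(sweep)
print(result)
```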
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* Update the vLLM external router container
vllm/vllm-router only retains ~16 recent nightlies on Docker Hub; older dated tags are garbage-collected (manifest unknown), which makes `docker run` fail with exit 125 on any node that has not already cached the image.
* M3 disagg: per-layer MoRIIO KV transfer for hybrid sparse-attn (partial)
MiniMax-M3 (MiniMaxM3SparseForCausalLM) is a hybrid sparse-attention model: sparse layers register a separate lightning-indexer cache (MLAAttentionSpec, rank-3, bf16, key-only) alongside the main cache (FullAttentionSpec, rank-5, fp8, K+V). The MoRIIO connector assumes one uniform KV layout -- it derives block geometry from the first cache and reuses first_layer's offsets for every layer (see its own "hybrid attn" TODO) -- so the bf16 key-only index cache is transferred with fp8 K+V sizing and gets corrupted on the decode worker, producing garbage output (disagg gsm8k ~= 0 while single-node M3 is correct). This is the vLLM analogue of the SGLang MoRI DSA-state bug in patches/mori_conn.py.
- patches/moriio_heterogeneous_kv.py: compute the READ-path transfer geometry per layer (own shape/stride/dtype/rank) instead of from the first cache. Idempotent; no-op for homogeneous models.
- setup_deps.sh: apply it on the vllm-disagg path.
NOTE: partial fix -- necessary but not yet sufficient. The index cache is also a separate KV-cache group whose block-table/num_blocks the single-namespace MoRIIO connector cannot map, so M3 disagg accuracy is still broken pending a larger multi-group / index-state transfer change. (Disabling sparse attention is not a viable workaround: M3's fused QKV carries index_k weights, so dropping the indexer breaks weight load.)
Refs #1762
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(amd-disagg): add vLLM MoRIIO KV-layout patch to reuse stock minimax-m3 image
The vLLM MoRIIOConnector in vllm/vllm-openai-rocm:minimax-m3 assumes the FlashAttention KV layout [2, num_blocks, ...] (K/V axis outer) but this vLLM's backends allocate [num_blocks, 2, ...] (K/V axis inner), so every disagg block transfer reads the wrong region. Invisible to throughput, but corrupts GQA/non-MLA accuracy (MiniMax-M3 gsm8k 0.0008 -> 0.957).
Instead of baking a fix into a rebuilt image (-hetkv) or carrying full vendored copies of the patched files in-tree, carry just the 218-line unified diff (patches/moriio/moriio-kv-layout-fix.diff) and apply it with `patch -p1` against the vLLM package dir inside the container at startup, ahead of the server launch. The repo is already bind-mounted into the container, so no EXTRA_DOCKER_MOUNTS wiring is needed -- job.slurm auto-applies the diff when DOCKER_IMAGE_NAME contains "minimax-m3" (skippable with MORIIO_KV_PATCH=skip), mirroring the existing mori_conn.py sglang hook. A failed apply aborts the container instead of silently running unpatched.
Validated on a manual 2-node run (n06-21 prefill+router / n09-21 decode) using the STOCK image: gsm8k strict-match 0.9568 / flexible-extract 0.9560 (matches the baked image within noise), decode probe healthy.
- patches/moriio/moriio-kv-layout-fix.diff: unified diff vs stock
- job.slurm: in-container `patch` step, MORIIO_KV_PATCH=skip opt-out
- patches/README.md: document the moriio/ diff-apply mechanism
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* disagg #1762: extend conc sweep to 32,64,128,256,512,1024 at 1k1k and 8k1k
Widen the disagg sweep from conc 1,2,4,8,16 to 1,2,4,8,16,32,64,128,256,512,1024 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline.
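The KV-layout mismatch described in the feat(amd-disagg) commit above is easy to see on a toy buffer. The NumPy sketch below is not connector code: the sizes and helper names are made up for illustration. It allocates a cache with the K/V axis inner and then computes a block offset as if the K/V axis were outer.

```python
import numpy as np

num_blocks, block_elems = 4, 3  # toy sizes, chosen for readability

# Actual allocation: K/V axis is inner -> shape [num_blocks, 2, block_elems].
cache = np.zeros((num_blocks, 2, block_elems), dtype=np.int32)
for b in range(num_blocks):
    cache[b, 0] = 100 + b  # K data for block b
    cache[b, 1] = 200 + b  # V data for block b
flat = cache.ravel()       # the contiguous memory a transfer engine would see


def offset_kv_outer(kv: int, block: int) -> int:
    """Element offset if the layout were [2, num_blocks, block_elems]."""
    return (kv * num_blocks + block) * block_elems


def offset_kv_inner(kv: int, block: int) -> int:
    """Element offset for the real [num_blocks, 2, block_elems] layout."""
    return (block * 2 + kv) * block_elems


def read(offset: int) -> np.ndarray:
    return flat[offset:offset + block_elems]


# Ask for K of block 2 under each assumption.
wrong_k2 = read(offset_kv_outer(0, 2))  # lands on K of block 1
right_k2 = read(offset_kv_inner(0, 2))  # lands on K of block 2
print(wrong_k2, right_k2)
```

Both reads succeed and return plausible-looking data, which is consistent with the commit's observation that the bug is invisible to throughput and only shows up in accuracy.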
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* disagg #1762: add TP4-prefill P/D layouts (TP4+TP8, TP4+TP4) at 1k1k and 8k1k
Add two asymmetric prefill/decode layouts alongside the existing TP8+TP8 sweep, for both seq-len scenarios:
- 1P TP4 + 1D TP8 (smaller prefill, full-node decode) at conc 1..256
- 1P TP4 + 1D TP4 (balanced half-node) at conc 64..1024
Per-worker TP is driven by the master-config prefill/decode tp: server_vllm.sh sed-rewrites the models_vllm.yaml --tensor-parallel-size 8 placeholder to the computed PREFILL_TP_SIZE/DECODE_TP_SIZE, so no models_vllm.yaml flag change is needed (comment updated to say so). The multinode eval policy still marks exactly one lm-eval (groups by dp-attn, not TP) on the TP8+TP8 8k1k layout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(amd-disagg): bundle heterogeneous-TP + dup-ack fixes into unified MoRIIO diff
Replaces moriio-kv-layout-fix.diff with moriio-minimax-m3-disagg.diff, which bundles three layered fixes for the stock minimax-m3 vLLM image:
1. KV-layout: axis-aware per-layer block offsets (the gsm8k 0.0008→0.958 fix, required for homogeneous TP too).
2. heterogeneous-TP addressing + guard: maps each decode rank to the correct prefill rank (tp_rank // ratio) for PREFILL_TP_SIZE != DECODE_TP_SIZE, and raises NotImplementedError for unsupported cases (prefill-TP > decode-TP, KV-head splitting) instead of silently corrupting KV.
3. dup-ack fan-in: with DECODE_TP_SIZE > PREFILL_TP_SIZE, producer counts ACKs per transfer_id and only frees KV blocks once all expected consumers ACK, preventing both the late-ACK EngineCore crash and KV reuse before slower decode ranks finish reading.
job.slurm and patches/README.md updated to reference the new diff name.
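The dup-ack fan-in idea in item 3 above (free KV blocks only after every expected consumer has ACKed a transfer) can be illustrated with a small counter. This is a sketch under assumptions, not the code in the diff: the class name `AckFanIn`, tracking consumers by rank, and ignoring ACKs for already-completed transfers are all illustrative choices.

```python
from collections import defaultdict


class AckFanIn:
    """Track ACKs per transfer and report when the last expected one arrives."""

    def __init__(self, expected_consumers: int):
        if expected_consumers < 1:
            raise ValueError("expected_consumers must be >= 1")
        self.expected = expected_consumers
        self._seen = defaultdict(set)  # transfer_id -> consumer ranks seen
        self._done = set()             # transfer_ids whose blocks were freed

    def on_ack(self, transfer_id: str, consumer_rank: int) -> bool:
        """Record an ACK; return True exactly once, when blocks may be freed."""
        if transfer_id in self._done:
            return False               # late or duplicate ACK: ignore safely
        seen = self._seen[transfer_id]
        seen.add(consumer_rank)        # a set makes repeated ACKs harmless
        if len(seen) < self.expected:
            return False               # slower consumers are still reading
        self._done.add(transfer_id)
        del self._seen[transfer_id]
        return True


# Two decode ranks read from one prefill rank (e.g. decode TP8, prefill TP4).
fan_in = AckFanIn(expected_consumers=2)
events = [
    fan_in.on_ack("t1", 0),  # first consumer
    fan_in.on_ack("t1", 0),  # duplicate from the same consumer
    fan_in.on_ack("t1", 1),  # last consumer -> free now
    fan_in.on_ack("t1", 1),  # late ACK after the free
]
print(events)
```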
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(moriio): correct _remote_tp_rank for prefill-TP > decode-TP (P8/D4)
With P8/D4 and 4 KV heads, vLLM distributes heads across prefill ranks in consecutive pairs: (rank0,rank1)→head0, (rank2,rank3)→head1, etc. The previous patch used `return self.tp_rank` for the P>D branch, which made decode rank 1 connect to prefill rank 1 (holds head0) instead of prefill rank 2 (holds head1), corrupting KV for all decode ranks except 0.
Fix: use `self.tp_rank * ratio` (ratio = remote_tp_size // local_tp_size), the symmetric counterpart to the D>P case's `tp_rank // ratio`. This maps each decode rank to the *first* prefill rank of its head group, which holds the correct KV content via vLLM's replication scheme.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(moriio-diff): correct hunk header count after _remote_tp_rank expansion
The P>D fix added 4 lines to _remote_tp_rank but the hunk header still said +1100,40; patch aborted with "malformed patch at line 79". Update to +1100,44 to match the actual 6 context + 38 added lines.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(amd-disagg): keep MoRIIO patch cmd inside container bash -lc quotes
The MoRIIO KV-layout patch was injected into the per-node container launch via '"${_MORIIO_PATCH_CMD:-}"', which breaks out of the outer srun bash -c "..." double-quoted string. Because the patch command value contains spaces and the shell operators '<' and '||', the unquoted expansion word-split the generated container script, truncating it right after the word `patch` and silently dropping the patch arguments AND the server.sh launch. The container then exited 0:0 within seconds, producing no benchmark/eval output -> collect_latest_results found "No logs directory" -> the launch step failed with exit 1 (all minimax-m3 disagg jobs affected).
Fix: expand ${_MORIIO_PATCH_CMD:-} directly inside the inner bash -lc single quotes (no quote toggling), so the patch command stays intact and its operators are parsed by the container shell. Validated end-to-end: gsm8k recovers from ~0 (garbage) to 0.94-0.98 across P8D8/P4D8/P8D4.
Co-authored-by: Cursor <cursoragent@cursor.com>
* disagg #1762: add 2P TP4 + 1D TP8 layout at conc 256,512,768,1024 (1k1k & 8k1k)
Two TP4 prefill workers (num-worker 2, PREFILL_NODES=2, each TP4 on half an 8-GPU node) feeding one TP8 decode (DECODE_NODES=1), 3 nodes total. Added to both seq-len scenarios at conc 256,512,768,1024. Eval marking unchanged (still one lm-eval on the 8k1k TP8+TP8 layout).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(amd-disagg): remove redundant moriio_heterogeneous_kv.py patcher
The per-layer READ-offset fix this Python patcher applied to moriio_connector.py is fully subsumed by the unified overlay patches/moriio/moriio-minimax-m3-disagg.diff, which job.slurm applies with `patch -p1` BEFORE server.sh sources setup_deps.sh. The diff rewrites the exact lines the patcher searches for (the `first_layer` single-offset block and the `is_mla = len(self.kv_cache_shape)` sizing), with a stronger geometry-memoized + heterogeneous-TP-aware version, so the patcher's OLD1/OLD2 patterns no longer match and it already no-ops ("pattern not found; skipping") in the real flow. It's also the same fix now upstreamed in vLLM #46039 (READ mixed KV layouts). Drop the dead patcher and its setup_deps.sh hook so the diff is the single source of truth. patches/README.md only documents the diff (no reference to this patcher), so no README change is needed.
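The heterogeneous-TP rank mapping discussed in the `_remote_tp_rank` commits above reduces to two integer formulas: `tp_rank // ratio` when decode TP exceeds prefill TP, and `tp_rank * ratio` when prefill TP exceeds decode TP. The standalone function below is a sketch of that arithmetic only; its name and signature are hypothetical, and the divisibility guard is an added assumption rather than something the commits state.

```python
def remote_tp_rank(local_rank: int, local_tp: int, remote_tp: int) -> int:
    """Map a decode (local) TP rank to the prefill (remote) TP rank to read from."""
    if not 0 <= local_rank < local_tp:
        raise ValueError("local_rank out of range")
    if local_tp == remote_tp:
        return local_rank
    big, small = max(local_tp, remote_tp), min(local_tp, remote_tp)
    if big % small:
        # Assumption: non-divisible TP sizes are treated as unsupported.
        raise NotImplementedError("TP sizes must divide evenly")
    ratio = big // small
    if local_tp > remote_tp:
        # Decode TP > prefill TP: several decode ranks share one prefill rank.
        return local_rank // ratio
    # Prefill TP > decode TP: pick the first prefill rank of the head group.
    return local_rank * ratio


p8_d4 = [remote_tp_rank(r, local_tp=4, remote_tp=8) for r in range(4)]
p4_d8 = [remote_tp_rank(r, local_tp=8, remote_tp=4) for r in range(8)]
print(p8_d4)  # decode rank 1 reads from prefill rank 2, as in the commit
print(p4_d8)
```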
Co-authored-by: Cursor <cursoragent@cursor.com>
* Use upstream nightly image for MiniMax-M3 disagg, drop MoRIIO overlay - Co-work with Gupta, Ravi
All three MoRIIO fixes the in-tree overlay carried have merged upstream and now ship in the ROCm nightly image:
- vLLM #46039 READ-mode mixed KV-layout (axis-aware per-layer offsets)
- vLLM #46290 WRITE-mode per-geometry offset caching
- vLLM #46332 heterogeneous-TP rank mapping + ACK fan-in
Point minimaxm3-fp8-mi355x-vllm-disagg at vllm/vllm-openai-rocm:nightly-556bc4e3a089378e9df2482659898192da18db15 (vLLM 0.23.1rc1.dev363+g556bc4e3a, which contains all three merges) and remove the stop-gap overlay:
- delete patches/moriio/moriio-minimax-m3-disagg.diff
- drop the job.slurm in-container auto-apply block (+ MORIIO_KV_PATCH gate)
- trim the moriio/ section from patches/README.md
Verified on the nightly image with NO patch across all four P/D layouts x conc {1,4,8}, gsm8k strict/flexible 0.95-0.97 (1P8+1D8, 1P4+1D8, 1P4+1D4, 2P4+1D8) -- matching the previously-patched results. Refs #1762.
* fix: append M3 MI355X disagg changelog entry at end of file
The minimaxm3-fp8-mi355x-vllm-disagg entry was inserted mid-file (after the #1862 entry), which violates the append-only changelog gate ("entry 511 changed; existing entries are immutable"). Move it to the end of perf-changelog.yaml so existing entries stay byte-identical to main and the new entry is a clean append.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Chun Fang <chun.fang@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: TianDi101 <ditian12@amd.com>
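The append-only changelog gate mentioned in the last commit above can be approximated as a prefix check: every existing entry must be unchanged and new entries may only follow them. The sketch below works on already-parsed entry lists; the function name, the zero-based index in the message, and the sample entries are assumptions, and the real gate's implementation is not shown here.

```python
from typing import List, Optional


def check_append_only(base: List[dict], new: List[dict]) -> Optional[str]:
    """Return an error message if `new` is not `base` plus appended entries."""
    if len(new) < len(base):
        return f"{len(base) - len(new)} existing entries were removed"
    for index, (old_entry, new_entry) in enumerate(zip(base, new)):
        if old_entry != new_entry:
            # Message modeled on the one quoted in the commit; indexing is assumed.
            return f"entry {index} changed; existing entries are immutable"
    return None


# Placeholder entries, for illustration only.
base = [{"config": "a"}, {"config": "b"}, {"config": "c"}]
m3 = {"config": "minimaxm3-fp8-mi355x-vllm-disagg"}

inserted_mid_file = check_append_only(base, base[:1] + [m3] + base[1:])
appended_at_end = check_append_only(base, base + [m3])
print(inserted_mid_file)
print(appended_at_end)
```

Inserting mid-file shifts every later entry by one position, so the first shifted position is reported as changed even though no existing entry was edited.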
…p99, routing identity
Addresses review #3 methodology critiques (schema_version 3):
- Explicit measurement contracts (#4): adapters declare SUPPORTED_CONTRACTS and conform, rather than each choosing its own timing boundary. layout-and-dispatch-v1 times get_dispatch_layout INSIDE dispatch (the only contract MoRI can honor; its layout is computed in-kernel); cached-layout-comm-only-v1 hoists layout out (DeepEP normal) so dispatch is pure comm. run_ep.py rejects unsupported contract / ll+cached-layout. The misleading "comm-only-v1" label is gone.
- Pooled-trial percentiles (#9, #2): N trials (default 3) x iters, token-order randomized per trial (seeded => identical across ranks; MoRI keeps ascending to avoid cold-jump wedge), per-iteration cross-rank-MAX samples POOLED, then p50/p90/p99 (p99 headline). p99 from ~50 samples was just the max. (#2 aggregation was already Q_p(max_r); verified.)
- Routing identity proof (#3): routing_hash now SHA-256 of topk_idx AND gate weights; cross-rank trace-signature MIN==MAX check proves every rank (NVIDIA + AMD) built the identical trace, else status=invalid. Added per-dest-rank send histogram.
- Separated logical bytes (#6): dispatch_logical_bytes + combine_logical_bytes recorded at their real dtypes with byte_contract; serial bandwidth removed. serial relabeled "sum of isolated medians". Correctness scope tagged roundtrip-reconstruction-smoke-v1 (#8 honesty).
- Run linkage (#1): artifacts record GHA run_id/attempt/source SHA when present.
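The pooled-trial aggregation described above (take the cross-rank maximum per iteration, pool those samples across all trials, then read percentiles) can be sketched with NumPy. The function name, the `(trials, iters, ranks)` input layout, and the toy numbers are assumptions for illustration; `np.percentile` is used with its default linear interpolation.

```python
import numpy as np


def pooled_percentiles(latencies) -> dict:
    """Percentiles over pooled per-iteration cross-rank maxima.

    `latencies` has shape (trials, iters, ranks); units are whatever the
    caller measured (e.g. microseconds).
    """
    arr = np.asarray(latencies, dtype=float)
    per_iter_max = arr.max(axis=2)   # slowest rank defines each iteration
    pooled = per_iter_max.ravel()    # pool iterations from all trials
    p50, p90, p99 = np.percentile(pooled, [50, 90, 99])
    return {"p50": p50, "p90": p90, "p99": p99, "samples": int(pooled.size)}


# 3 trials x 2 iterations x 2 ranks; the cross-rank maxima are 1..6.
toy = [
    [[1, 0], [0, 2]],
    [[3, 1], [2, 4]],
    [[5, 5], [6, 0]],
]
stats = pooled_percentiles(toy)
print(stats)
```

Pooling raises the sample count behind each percentile, which is the point made above about p99 from roughly 50 samples collapsing to the maximum.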
…tract/run metadata
- capability.py (stdlib): static table mirroring adapter SUPPORTED_* sets; resolves (sku->vendor, backend, mode, dtype, contract) -> valid/why. Workflow runs it as a fail-fast "Validate capability" gate BEFORE consuming a runner (review #3 #2).
- NCCL/RCCL phase-dedup: matrix collapses to a single 'na' job for collective backends (phase is meaningless for nccl/rccl; it was running identical work twice).
- contract input + CX_MEASUREMENT_CONTRACT threaded through run_in_container -> run_ep; CX_TRIALS too. COLLECTIVEX_SOURCE_SHA + GHA run id/attempt reach the artifact (run linkage, review #3 #1). run_ep reads GITHUB_SHA as the source-sha fallback.
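A static capability table of the kind described above might look like the following stdlib-only sketch. Everything in the table is illustrative: the SKU-to-vendor map, which backend runs on which vendor, and the supported-contract sets are assumptions pieced together from this thread, not the contents of capability.py, and the dtype dimension is omitted for brevity.

```python
from typing import Tuple

# Illustrative tables; the real ones mirror the adapters' SUPPORTED_* sets.
SKU_VENDOR = {"b200": "nvidia", "mi355x": "amd"}
SUPPORTED = {
    "deepep": {
        "vendors": {"nvidia"},
        "contracts": {"layout-and-dispatch-v1", "cached-layout-comm-only-v1"},
    },
    "mori": {
        "vendors": {"amd"},
        "contracts": {"layout-and-dispatch-v1"},
    },
}


def resolve(sku: str, backend: str, mode: str, contract: str) -> Tuple[bool, str]:
    """Return (valid, why) for a requested benchmark combination."""
    vendor = SKU_VENDOR.get(sku)
    if vendor is None:
        return False, f"unknown sku {sku!r}"
    caps = SUPPORTED.get(backend)
    if caps is None:
        return False, f"unknown backend {backend!r}"
    if vendor not in caps["vendors"]:
        return False, f"{backend} does not run on {vendor}"
    if contract not in caps["contracts"]:
        return False, f"{backend} cannot honor contract {contract!r}"
    if mode == "ll" and contract == "cached-layout-comm-only-v1":
        return False, "low-latency mode has no cached-layout contract"
    return True, "ok"


checks = [
    resolve("b200", "deepep", "normal", "cached-layout-comm-only-v1"),
    resolve("mi355x", "mori", "normal", "cached-layout-comm-only-v1"),
    resolve("b200", "deepep", "ll", "cached-layout-comm-only-v1"),
]
for valid, why in checks:
    print(valid, why)
```

Running a check like this before a runner is allocated means an unsupported combination fails in seconds rather than after a job has been scheduled.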
… rate, run links
Addresses review #3 frontend critiques (backward-compatible with v2 docs):
- Percentile selector p50/p90/p99 (p99 default); reads pooled-trial percentiles.
- Suite selector backend-default vs resource-constrained, kept distinct, never read as one fair contest (#5). dtype/mode/resource/contract are all in the per-line label + hover; lines are uniquely colored (SKU family) + dashed-fp8 (#10).
- Bandwidth axis renamed "Logical routed payload rate" using SEPARATE dispatch/combine bytes; serial bandwidth removed; serial relabeled "Σ isolated medians" (#6,#7).
- Hover shows p50/p90/p99, contract, suite, and the WORKFLOW RUN (run id + sha) that produced the point (#1). Provenance text no longer claims a single dtype (the "bf16 while fp8 shown" bug); states routing-identity-proven, pooled-sample count, logical-rate caveat, suite-separation, and correctness-is-smoke (#9 fix).
Added B200 TRT-LLM runner configuration and consolidated runner logic
Changes Made: