[NVIDIA] Add TRT-LLM 70B FP8 via slurm by kedarpotdar-nv · Pull Request #1 · SemiAnalysisAI/InferenceX

kedarpotdar-nv · 2025-08-28T18:24:03Z

Added B200 TRT-LLM runner configuration and consolidated runner logic

Changes Made:

Added new B200 TRT-LLM job (bmk-b200-trt) in 70b-tmpl.yml

Uses nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0 container
Runs nvidia/Llama-3.3-70B-Instruct-FP8 model
Same experimental parameters as other 70B configs

Consolidated B200 runner logic

Updated launch_b200-nv.sh to use dynamic ${MODEL_CODE}_${RUNNER_LABEL}_slurm.sh pattern
Added RUNNER_LABEL environment variable in benchmark-tmpl.yml
Deleted redundant launch_b200-trt.sh

Created TRT-LLM benchmark script (70b_b200-trt_slurm.sh)

Uses trtllm-serve with proper configuration
Inline llama-config.yml generation
Same client script (kimbochen/bench_serving)

Temporarily disabled standard B200 vLLM for testing

Commented out bmk-b200 job
Updated collect-results dependencies

kimbochen · 2025-08-28T18:35:36Z

Thank you for the PR.
I think we should keep B200 vLLM because it's an important comparison.
By injecting the ${{ inputs.runner }} info at the "Launch job script" step, you can keep the default behavior:

- name: Launch job script
  run: |
    RUNNER_NAME=${{ runner.name }}
    RUNNER_LABEL=${{ inputs.runner }}
    bash ./runners/launch_${RUNNER_NAME%%_*}.sh ${{ inputs.exp-name }}

and in launch_b200-nv.sh:

bash benchmarks/${MODEL_CODE}_${RUNNER_LABEL}_slurm.sh
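
The two snippets above rely on shell parameter expansion and string composition. A minimal sketch of how the paths resolve, assuming a hypothetical runner name `b200-nv_0` (the real runner names are not shown in this thread) and the model code and runner label discussed in the PR:

```shell
#!/usr/bin/env bash
# Hypothetical values for illustration only; in CI these come from the workflow.
RUNNER_NAME="b200-nv_0"    # assumed runner name format: <launcher-id>_<index>
MODEL_CODE="70b"
RUNNER_LABEL="b200-trt"

# ${VAR%%_*} removes the longest suffix that begins with an underscore,
# i.e. everything from the first "_" onward.
launcher="./runners/launch_${RUNNER_NAME%%_*}.sh"

# The launcher then picks the benchmark script from model code and runner label.
benchmark="benchmarks/${MODEL_CODE}_${RUNNER_LABEL}_slurm.sh"

echo "$launcher"
echo "$benchmark"
```

With these assumed values, a single launcher (`launch_b200-nv.sh`) can dispatch to either the vLLM or the TRT-LLM benchmark script depending only on `RUNNER_LABEL`.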

kedarpotdar-nv · 2025-08-28T18:40:50Z

Thanks for the review, @kimbochen!

Made these changes:

✅ Uncommented vLLM
✅ Use targeted variable injection (not global) for the runner label
✅ Dynamically select benchmark scripts based on runner labels

kimbochen · 2025-08-28T18:58:17Z

Testing shows the script doesn't pick up RUNNER_LABEL.
Can you add RUNNER_LABEL back to env and remove it from the step?
Sorry, my bad.

kedarpotdar-nv · 2025-08-28T19:01:13Z

No worries, reverted!
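
The thread does not state why the step-local assignment failed. One plausible explanation (an assumption) is that a plain shell assignment inside a `run:` block is not exported, so the child `bash` that runs the launcher never sees it, whereas workflow `env:` entries are placed in the process environment. A minimal sketch of that shell behavior:

```shell
#!/usr/bin/env bash
# Start from a clean slate so an inherited value cannot mask the effect.
unset RUNNER_LABEL

# A plain assignment is visible only in the current shell.
RUNNER_LABEL="b200-trt"
unexported=$(bash -c 'echo "${RUNNER_LABEL:-unset}"')

# After export, child processes inherit the variable.
export RUNNER_LABEL
exported=$(bash -c 'echo "${RUNNER_LABEL:-unset}"')

echo "without export: $unexported"
echo "with export:    $exported"
```

If the assignment had to stay in the step, prefixing it with `export` would be the equivalent of moving it back to `env`.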

kedarpotdar-nv · 2025-08-28T20:07:38Z

@kimbochen B200 TRT jobs are failing because the TRT sqsh file shares its name with the vLLM one. I made a fix and also temporarily removed vLLM and the other configs, just to test whether B200 TRT is working. Can you please cancel the current job and re-run with these fixes?

salloc: Granted job allocation 1919
salloc: Waiting for resource configuration
salloc: Nodes dgx05-b200 are ready for job
+ srun --jobid=1919 bash -c 'enroot import -o /raid/image_70b_b200.sqsh docker://nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0'
Error:  File already exists: /raid/image_70b_b200.sqsh
srun: error: dgx05-b200: task 0: Exited with exit code 1
+ srun --jobid=1919 --container-image=/raid/image_70b_b200.sqsh --container-mounts=/home/gharunnerb1/actions-runner/_work/InferenceMAX/InferenceMAX:/workspace/,/raid/hf_hub_cache/:/mnt/hf_hub_cache/ --container-mount-home --container-workdir=/workspace/ --no-container-entrypoint --export=ALL bash benchmarks/70b_b200-trt_slurm.sh
JOB 1919 running on dgx05-b200
+ hf download nvidia/Llama-3.3-70B-Instruct-FP8
Fetching 25 files:   0%|          | 0/25 [00:00
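
The log above shows `enroot import` failing because `/raid/image_70b_b200.sqsh` already exists from the vLLM job, after which the TRT-LLM job starts inside the wrong image. A minimal sketch of the naming fix described in the comment, assuming a hypothetical helper `squash_path` and a hypothetical file name for the TRT-LLM image (the exact name used in the fix is not shown here):

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive one squash file per model code AND runner label,
# so images for different backends never share a path.
squash_path() {
    local model_code=$1 runner_label=$2
    echo "/raid/image_${model_code}_${runner_label}.sqsh"
}

vllm_sqsh=$(squash_path 70b b200)        # matches the path seen in the log
trt_sqsh=$(squash_path 70b b200-trt)     # assumed name for the TRT-LLM image

echo "$vllm_sqsh"
echo "$trt_sqsh"

# Dry-run guard: import only when the file is absent. In the real launcher this
# branch would run the `enroot import -o ... docker://...` command from the log.
if [ ! -e "$trt_sqsh" ]; then
    echo "would import into $trt_sqsh"
fi
```

Keying the file name on the runner label keeps cached images reusable across runs while preventing one backend from silently picking up another backend's container.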


…DSA state-index path amd-master.yaml - Image: rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0402 -> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523 (matches qwen3.5-fp8-mi355x-sglang-disagg; the older 0.5.9 image is no longer the reference build for hybrid-attention disagg models on MI355X.) - Scenarios: collapse the four legacy "top/middle/bottom/small-scale" search-spaces per ISL into a single 1P+1D TP=8 EP=1 dp-attn=false entry with the standard conc-list [8, 16, 32, 64, 128, 256, 512] for both 1k1k and 8k1k. dp-attn=false avoids the fused_moe_triton/layer.py:209 shared-slot assertion that --enable-dp-attention + --moe-a2a-backend mori triggers for GLM-5 (256 routed + 1 shared expert; (256-1) % 8 = 7 != 0). The collapsed layout mirrors the qwen3.5-fp8-mi355x-sglang-disagg shape so the same CI matrix-expansion logic applies to both. patches/mori_conn.py - Add patch #4: rank + length normalization in MoriKVReceiver._send_swa_dsa_state, immediately before the group_concurrent_contiguous call. For GLM-5 (single DSA component), upstream hands dst_state_indices as a 2-D (1, N) array while src_state_indices is 1-D length 1; the existing [:common_len] slice operates only on the outer axis, leaving the rank mismatched. np.diff then produces (1, N-1) vs (0,), which can't broadcast and crashes with "operands could not be broadcast together with shapes (1,12) (0,)". The fix ravels both indices to 1-D and re-truncates to common length so np.diff outputs compatible 1-D arrays. One-shot log gates the warning to once per receiver class. - Verified end-to-end: glm5-fp8-mi355x-sglang-disagg gsm8k flexible-extract = 0.9704 +/- 0.0047 glm5-fp8-mi355x-sglang-disagg gsm8k strict-match = 0.9712 +/- 0.0046 qwen3.5-fp8-mi355x-sglang-disagg gsm8k (regression) = 0.9780 +/- 0.004 Patch #4 fires zero times on the Qwen3.5 Mamba path (it lives inside _send_swa_dsa_state, never called for Mamba); patches #1-#3 behavior is unchanged. 
patches/README.md - Document patch #4 alongside the existing three. Cross-link the full bug analysis at scripts/sglang_disagg/docs_glm5/01-bug-analysis.md and the gsm8k verification at scripts/sglang_disagg/docs_glm5/02-fix-and-verification.md.


Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.

* [Klaud Cold] minimaxm3-fp8-mi355x-vllm-disagg: day-zero MoRI-IO disagg smoke test MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) smoke test on the day-zero ROCm image (vllm/vllm-openai-rocm:minimax-m3): 1 prefill (TP8) + 1 decode (TP8) at conc 1, validating the MoRI-IO KV-transfer disagg pipeline end-to-end for M3. Layered on the MoRI-IO patch-removal infra (#1585): brings in that PR's amd_utils changes (setup_deps.sh / server_vllm.sh / submit.sh / models_vllm.yaml mori -> mori_low_latency) and the two job.slurm hunks (vllm-router image bump nightly-20260511 -> nightly-20260603, drop VLLM_MORIIO_CONNECTOR_READ_MODE env), while keeping main's atom-disagg support intact. Per-worker serve flags (models_vllm.yaml MiniMax-M3-MXFP8): --block-size 128 (MSA), --language-model-only, --kv-cache-dtype fp8, --attention-backend TRITON_ATTN, minimax_m3 tool/reasoning parsers; no EP (TP8, MoE experts TP-sharded as in the single-node M3 TP8 recipe). perf-changelog.yaml and amd-master.yaml contain only M3 changes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * amd_utils/job.slurm: auto-download disagg checkpoint when not pre-staged The first MI355X disagg sweep (run 27515119215) failed: the day-zero MiniMax-M3-MXFP8 checkpoint is not staged on the disagg cluster's shared FS, so job.slurm's model search hit a hard FATAL ("Model 'MiniMax-M3-MXFP8' not found. Searched: ...") before the engine ever started. The single-node recipes hf-download inside the serving container, but the disagg path historically required ops to pre-stage checkpoints. Add an on-demand fallback to the vllm-disagg model-resolution block: when the checkpoint isn't found, derive the HF repo id from the hf_dir (models--org--name -> org/name) and download into MODEL_DIR in HF cache layout, then resolve the snapshot as MODEL_PATH. 
Staging into MODEL_DIR keeps MODEL_PATH under the dir that is bind-mounted into the serving container as /models, so the existing -v ${MODEL_DIR}:/models mount and DOCKER_MODEL_PATH (/models) remap both resolve. Implementation notes: - The host has no hf CLI, so the download runs in a one-shot container of the serving image (DOCKER_IMAGE_NAME), which ships huggingface_hub. - flock on a lockfile in MODEL_DIR serializes the prefill/decode nodes; a re-check of snapshots/ under the lock makes it idempotent (resumable). - hf download with a huggingface-cli fallback; 3 retries; HF_TOKEN passed through for gated repos. - Scoped to the vllm-disagg branch only; pre-staged models never reach this path (the search finds them first), so sglang/atom and existing vLLM disagg models (M2.5/Kimi) are unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * job.slurm: --entrypoint "" for the auto-download container The disagg auto-download reached hf download but failed all 3 attempts: the one-shot `docker run "$DOCKER_IMAGE_NAME" bash -lc "hf download ..."` did not override the image ENTRYPOINT, so the vllm-openai API server ran with the bash command as its args and died with "Failed to infer device type" (no GPU mounted in the download container). Add --entrypoint "" (as the serving container does) so bash actually runs hf download. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * M3 disagg: use shared HF cache (/it-share/hf-hub-cache); drop auto-download Per maintainer direction, point the MiniMax-M3 disagg model dir at the cluster's shared HF cache where the ~414 GB MXFP8 checkpoint is already staged (/it-share/hf-hub-cache/models--MiniMaxAI--MiniMax-M3-MXFP8), instead of the launcher default /it-share/data. 
Scoped to M3 only via the M3 disagg script: export MODEL_PATH=/it-share/hf-hub-cache submit.sh exports MODEL_DIR=$MODEL_PATH and job.slurm resolves the snapshot under it (search path #1) and bind-mounts MODEL_DIR into the prefill/decode serving containers. Other disagg models keep /it-share/data. This supersedes the earlier job.slurm auto-download approach, which is reverted: job.slurm now differs from main only by the #1585 mori-removal hunks (router image bump + dropping VLLM_MORIIO_CONNECTOR_READ_MODE). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * disagg #1762: add 8k1k conc-16 row to run an lm-eval (validate correctness) The conc-1 1k1k smoke test never triggered an eval — the multi-node eval policy only marks 8k1k entries with conc >= MIN_EVAL_CONC (16). Add an 8k1k conc-16 row (same 1P TP8 + 1D TP8 layout) so mark_eval_entries marks it run-eval=true (eval-conc=16), running lm-eval through the MoRI-IO disagg pipeline to validate correctness. The conc-1 1k1k row stays the latency smoke test. Run with non-canary-full-sweep-enabled so the (non-min-conc) eval entry runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * disagg #1762: sweep conc 1,2,4,8,16 (not just conc 1) Widen the 1k1k disagg latency/throughput sweep from conc 1 to conc 1,2,4,8,16 (1P TP8 + 1D TP8). The 8k1k conc-16 eval row is unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * disagg #1762: sweep conc 1,2,4,8,16 at both 1k1k and 8k1k Widen the disagg sweep from conc 1 to conc 1,2,4,8,16 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Update the vLLM external router container vllm/vllm-router only retains ~16 recent nightlies on Docker Hub; older dated tags are garbage-collected (manifest unknown), which makes `docker run` fail with exit 125 on any node that has not already cached the image. * M3 disagg: per-layer MoRIIO KV transfer for hybrid sparse-attn (partial) MiniMax-M3 (MiniMaxM3SparseForCausalLM) is a hybrid sparse-attention model: sparse layers register a separate lightning-indexer cache (MLAAttentionSpec, rank-3, bf16, key-only) alongside the main cache (FullAttentionSpec, rank-5, fp8, K+V). The MoRIIO connector assumes one uniform KV layout -- it derives block geometry from the first cache and reuses first_layer's offsets for every layer (see its own "hybrid attn" TODO) -- so the bf16 key-only index cache is transferred with fp8 K+V sizing and gets corrupted on the decode worker, producing garbage output (disagg gsm8k ~= 0 while single-node M3 is correct). This is the vLLM analogue of the SGLang MoRI DSA-state bug in patches/mori_conn.py. - patches/moriio_heterogeneous_kv.py: compute the READ-path transfer geometry per layer (own shape/stride/dtype/rank) instead of from the first cache. Idempotent; no-op for homogeneous models. - setup_deps.sh: apply it on the vllm-disagg path. NOTE: partial fix -- necessary but not yet sufficient. The index cache is also a separate KV-cache group whose block-table/num_blocks the single-namespace MoRIIO connector cannot map, so M3 disagg accuracy is still broken pending a larger multi-group / index-state transfer change. (Disabling sparse attention is not a viable workaround: M3's fused QKV carries index_k weights, so dropping the indexer breaks weight load.) 
Refs #1762 Co-authored-by: Cursor <cursoragent@cursor.com> * feat(amd-disagg): add vLLM MoRIIO KV-layout patch to reuse stock minimax-m3 image The vLLM MoRIIOConnector in vllm/vllm-openai-rocm:minimax-m3 assumes the FlashAttention KV layout [2, num_blocks, ...] (K/V axis outer) but this vLLM's backends allocate [num_blocks, 2, ...] (K/V axis inner), so every disagg block transfer reads the wrong region. Invisible to throughput, but corrupts GQA/non-MLA accuracy (MiniMax-M3 gsm8k 0.0008 -> 0.957). Instead of baking a fix into a rebuilt image (-hetkv) or carrying full vendored copies of the patched files in-tree, carry just the 218-line unified diff (patches/moriio/moriio-kv-layout-fix.diff) and apply it with `patch -p1` against the vLLM package dir inside the container at startup, ahead of the server launch. The repo is already bind-mounted into the container, so no EXTRA_DOCKER_MOUNTS wiring is needed -- job.slurm auto-applies the diff when DOCKER_IMAGE_NAME contains "minimax-m3" (skippable with MORIIO_KV_PATCH=skip), mirroring the existing mori_conn.py sglang hook. A failed apply aborts the container instead of silently running unpatched. Validated on a manual 2-node run (n06-21 prefill+router / n09-21 decode) using the STOCK image: gsm8k strict-match 0.9568 / flexible-extract 0.9560 (matches the baked image within noise), decode probe healthy. - patches/moriio/moriio-kv-layout-fix.diff: unified diff vs stock - job.slurm: in-container `patch` step, MORIIO_KV_PATCH=skip opt-out - patches/README.md: document the moriio/ diff-apply mechanism Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * disagg #1762: extend conc sweep to 32,64,128,256,512,1024 at 1k1k and 8k1k Widen the disagg sweep from conc 1,2,4,8,16 to 1,2,4,8,16,32,64,128,256,512,1024 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * disagg #1762: add TP4-prefill P/D layouts (TP4+TP8, TP4+TP4) at 1k1k and 8k1k Add two asymmetric prefill/decode layouts alongside the existing TP8+TP8 sweep, for both seq-len scenarios: - 1P TP4 + 1D TP8 (smaller prefill, full-node decode) at conc 1..256 - 1P TP4 + 1D TP4 (balanced half-node) at conc 64..1024 Per-worker TP is driven by the master-config prefill/decode tp: server_vllm.sh sed-rewrites the models_vllm.yaml --tensor-parallel-size 8 placeholder to the computed PREFILL_TP_SIZE/DECODE_TP_SIZE, so no models_vllm.yaml flag change is needed (comment updated to say so). The multinode eval policy still marks exactly one lm-eval (groups by dp-attn, not TP) on the TP8+TP8 8k1k layout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(amd-disagg): bundle heterogeneous-TP + dup-ack fixes into unified MoRIIO diff Replaces moriio-kv-layout-fix.diff with moriio-minimax-m3-disagg.diff, which bundles three layered fixes for the stock minimax-m3 vLLM image: 1. KV-layout: axis-aware per-layer block offsets (the gsm8k 0.0008→0.958 fix, required for homogeneous TP too). 2. heterogeneous-TP addressing + guard: maps each decode rank to the correct prefill rank (tp_rank // ratio) for PREFILL_TP_SIZE != DECODE_TP_SIZE, and raises NotImplementedError for unsupported cases (prefill-TP > decode-TP, KV-head splitting) instead of silently corrupting KV. 3. dup-ack fan-in: with DECODE_TP_SIZE > PREFILL_TP_SIZE, producer counts ACKs per transfer_id and only frees KV blocks once all expected consumers ACK, preventing both the late-ACK EngineCore crash and KV reuse before slower decode ranks finish reading. job.slurm and patches/README.md updated to reference the new diff name. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(moriio): correct _remote_tp_rank for prefill-TP > decode-TP (P8/D4) With P8/D4 and 4 KV heads, vLLM distributes heads across prefill ranks in consecutive pairs: (rank0,rank1)→head0, (rank2,rank3)→head1, etc. The previous patch used `return self.tp_rank` for the P>D branch, which made decode rank 1 connect to prefill rank 1 (holds head0) instead of prefill rank 2 (holds head1) — corrupting KV for all decode ranks except 0. Fix: use `self.tp_rank * ratio` (ratio = remote_tp_size // local_tp_size), the symmetric counterpart to the D>P case's `tp_rank // ratio`. This maps each decode rank to the *first* prefill rank of its head group, which holds the correct KV content via vLLM's replication scheme. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(moriio-diff): correct hunk header count after _remote_tp_rank expansion The P>D fix added 4 lines to _remote_tp_rank but the hunk header still said +1100,40; patch aborted with "malformed patch at line 79". Update to +1100,44 to match the actual 6 context + 38 added lines. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(amd-disagg): keep MoRIIO patch cmd inside container bash -lc quotes The MoRIIO KV-layout patch was injected into the per-node container launch via '"${_MORIIO_PATCH_CMD:-}"', which breaks out of the outer srun bash -c "..." double-quoted string. Because the patch command value contains spaces and the shell operators '<' and '||', the unquoted expansion word-split the generated container script, truncating it right after the word `patch` and silently dropping the patch arguments AND the server.sh launch. The container then exited 0:0 within seconds, producing no benchmark/eval output -> collect_latest_results found "No logs directory" -> the launch step failed with exit 1 (all minimax-m3 disagg jobs affected). 
Fix: expand ${_MORIIO_PATCH_CMD:-} directly inside the inner bash -lc single quotes (no quote toggling), so the patch command stays intact and its operators are parsed by the container shell. Validated end-to-end: gsm8k recovers from ~0 (garbage) to 0.94-0.98 across P8D8/P4D8/P8D4. Co-authored-by: Cursor <cursoragent@cursor.com> * disagg #1762: add 2P TP4 + 1D TP8 layout at conc 256,512,768,1024 (1k1k & 8k1k) Two TP4 prefill workers (num-worker 2, PREFILL_NODES=2, each TP4 on half an 8-GPU node) feeding one TP8 decode (DECODE_NODES=1) — 3 nodes total. Added to both seq-len scenarios at conc 256,512,768,1024. Eval marking unchanged (still one lm-eval on the 8k1k TP8+TP8 layout). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(amd-disagg): remove redundant moriio_heterogeneous_kv.py patcher The per-layer READ-offset fix this Python patcher applied to moriio_connector.py is fully subsumed by the unified overlay patches/moriio/moriio-minimax-m3-disagg.diff, which job.slurm applies with `patch -p1` BEFORE server.sh sources setup_deps.sh. The diff rewrites the exact lines the patcher searches for (the `first_layer` single-offset block and the `is_mla = len(self.kv_cache_shape)` sizing), with a stronger geometry-memoized + heterogeneous-TP-aware version, so the patcher's OLD1/OLD2 patterns no longer match and it already no-ops ("pattern not found; skipping") in the real flow. It's also the same fix now upstreamed in vLLM #46039 (READ mixed KV layouts). Drop the dead patcher and its setup_deps.sh hook so the diff is the single source of truth. patches/README.md only documents the diff (no reference to this patcher), so no README change is needed. 
Co-authored-by: Cursor <cursoragent@cursor.com> * Use upstream nightly image for MiniMax-M3 disagg, drop MoRIIO overlay - Co-work with Gupta, Ravi All three MoRIIO fixes the in-tree overlay carried have merged upstream and now ship in the ROCm nightly image: - vLLM #46039 READ-mode mixed KV-layout (axis-aware per-layer offsets) - vLLM #46290 WRITE-mode per-geometry offset caching - vLLM #46332 heterogeneous-TP rank mapping + ACK fan-in Point minimaxm3-fp8-mi355x-vllm-disagg at vllm/vllm-openai-rocm:nightly-556bc4e3a089378e9df2482659898192da18db15 (vLLM 0.23.1rc1.dev363+g556bc4e3a, which contains all three merges) and remove the stop-gap overlay: - delete patches/moriio/moriio-minimax-m3-disagg.diff - drop the job.slurm in-container auto-apply block (+ MORIIO_KV_PATCH gate) - trim the moriio/ section from patches/README.md Verified on the nightly image with NO patch across all four P/D layouts x conc {1,4,8}, gsm8k strict/flexible 0.95-0.97 (1P8+1D8, 1P4+1D8, 1P4+1D4, 2P4+1D8) -- matching the previously-patched results. Refs #1762. * fix: append M3 MI355X disagg changelog entry at end of file The minimaxm3-fp8-mi355x-vllm-disagg entry was inserted mid-file (after the #1862 entry), which violates the append-only changelog gate ("entry 511 changed; existing entries are immutable"). Move it to the end of perf-changelog.yaml so existing entries stay byte-identical to main and the new entry is a clean append. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com> Co-authored-by: Chun Fang <chun.fang@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: TianDi101 <ditian12@amd.com>

…p99, routing identity Addresses review #3 methodology critiques (schema_version 3): - Explicit measurement contracts (#4): adapters declare SUPPORTED_CONTRACTS and conform, rather than each choosing its own timing boundary. layout-and-dispatch-v1 times get_dispatch_layout INSIDE dispatch (the only contract MoRI can honor — its layout is computed in-kernel); cached-layout-comm-only-v1 hoists layout out (DeepEP normal) so dispatch is pure comm. run_ep.py rejects unsupported contract / ll+cached-layout. The misleading "comm-only-v1" label is gone. - Pooled-trial percentiles (#9, #2): N trials (default 3) x iters, token-order randomized per trial (seeded => identical across ranks; MoRI keeps ascending to avoid cold-jump wedge), per-iteration cross-rank-MAX samples POOLED, then p50/p90/p99 (p99 headline). p99 from ~50 samples was just the max. (#2 aggregation was already Q_p(max_r); verified.) - Routing identity proof (#3): routing_hash now SHA-256 of topk_idx AND gate weights; cross-rank trace-signature MIN==MAX check proves every rank (NVIDIA + AMD) built the identical trace, else status=invalid. Added per-dest-rank send histogram. - Separated logical bytes (#6): dispatch_logical_bytes + combine_logical_bytes recorded at their real dtypes with byte_contract; serial bandwidth removed. serial relabeled "sum of isolated medians". Correctness scope tagged roundtrip-reconstruction-smoke-v1 (#8 honesty). - Run linkage (#1): artifacts record GHA run_id/attempt/source SHA when present.

…tract/run metadata - capability.py (stdlib): static table mirroring adapter SUPPORTED_* sets; resolves (sku->vendor, backend, mode, dtype, contract) -> valid/why. Workflow runs it as a fail-fast "Validate capability" gate BEFORE consuming a runner (review #3 #2). - NCCL/RCCL phase-dedup: matrix collapses to a single 'na' job for collective backends (phase is meaningless for nccl/rccl — was running identical work twice). - contract input + CX_MEASUREMENT_CONTRACT threaded through run_in_container -> run_ep; CX_TRIALS too. COLLECTIVEX_SOURCE_SHA + GHA run id/attempt reach the artifact (run linkage, review #3 #1). run_ep reads GITHUB_SHA as the source-sha fallback.

… rate, run links Addresses review #3 frontend critiques (backward-compatible with v2 docs): - Percentile selector p50/p90/p99 (p99 default); reads pooled-trial percentiles. - Suite selector backend-default vs resource-constrained — kept distinct, never read as one fair contest (#5). dtype/mode/resource/contract are all in the per-line label + hover; lines are uniquely colored (SKU family) + dashed-fp8 (#10). - Bandwidth axis renamed "Logical routed payload rate" using SEPARATE dispatch/combine bytes; serial bandwidth removed; serial relabeled "Σ isolated medians" (#6,#7). - Hover shows p50/p90/p99, contract, suite, and the WORKFLOW RUN (run id + sha) that produced the point (#1). Provenance text no longer claims a single dtype (the "bf16 while fp8 shown" bug); states routing-identity-proven, pooled-sample count, logical-rate caveat, suite-separation, and correctness-is-smoke (#9 fix).

kedarpotdar-nv added 4 commits August 28, 2025 09:58

add trt init for 70b

d556b88

remove dsr1 and add $MAX_MODEL_LEN to launch configs

426f48e

remove b200 tg

12a7f6e

add RUNNER LABEL and temporarily remove bmk-b200?

0fc8ab4

kedarpotdar-nv requested a review from kimbochen August 28, 2025 18:24

fix per kimbo's suggestion

4b30c03

revert local runner var

aab2320

kedarpotdar-nv added 2 commits August 28, 2025 12:54

update sqsh file name to include runner name. i.e. trt

0c5ad16

temporarily remove other benchmarks. only keep bmk-b200-trt

7487baa

kedarpotdar-nv added 16 commits August 28, 2025 15:00

refactor scheduler to add trt tag, update ngc image address, update summarize.py to reflect backend, fix issue with result filename

1233b53

refactor trt into separate yml

7800006

fix file name

43057dd

comment vllm for now

a94fbd0

update port in trtllm-serve

0225b10

update artifact name to have runner name at end

1e594f3

update plot function with b200-trt

f63768c

add h200 trt

ed20d23

fix launch slurm script based on runner label

25566a9

better identify if result is vllm or trt

d33cda5

clarify runners for trt and vllm

de2d8de

fix runner names

80dc11d

remove trt runners

3cf357b

ensure trt runners are correctly tagged

9d7cbd3

rename launch scripts

a2ed19c

only get latest run id

fd1ff2e

functionstackx mentioned this pull request Jan 7, 2026

[Question] GPT-OSS-120B benchmark environment requirements - Driver/CUDA version clarification needed #393

Closed

claude-code-infmax Bot mentioned this pull request Jan 17, 2026

[NVIDIA] fix: update ep metadata in gb200 dynamo sglang configs to match comments #486

Merged

jthomson04 pushed a commit to jthomson04/InferenceMAX that referenced this pull request Jan 21, 2026

Merge pull request SemiAnalysisAI#1 from NVIDIA/add-upstream-sync

6b45b59

Add upstream sync workflow

claude Bot mentioned this pull request Feb 6, 2026

[NV] H100 FP8 Disagg DSR1 1k1k, 8k1k (STP + MTP) #651

Merged

This was referenced Feb 17, 2026

Add Qwen3.5-397B-A17B BF16 B200 SGLang benchmark (STP only) #704

Merged

feat: multinode first-class reorganization #666

Merged

Klaud-Cold mentioned this pull request Feb 26, 2026

test profile b200 decode #807

Open

cquil11 added the NVIDIA label Apr 8, 2026

cquil11 changed the title ~~Add TRT-LLM 70B FP8 via slurm~~ [NVIDIA] Add TRT-LLM 70B FP8 via slurm Apr 8, 2026

functionstackx mentioned this pull request May 5, 2026

[AMD] improve dsr1 fp4 disagg perf on mi355x - rapid follow up PR incoming to quant correction on fp8 combine #1236

Merged

functionstackx mentioned this pull request May 18, 2026

[Klaud Cold] Update glm5-fp8-b200-sglang (+mtp) SGLang image to v0.5.12-cu130 #1447

Merged

1 task

claude Bot mentioned this pull request May 19, 2026

[NV] update Minimax2.5 fp8 h100 vllm #1516

Merged

claude Bot mentioned this pull request Jun 25, 2026

Add MiniMax-M3 NVFP4 B300 single-node vLLM benchmark (EAGLE3 spec decode) #1929

Open

claude Bot mentioned this pull request Jun 25, 2026

Add MiniMax-M3 NVFP4 B200 single-node aggregated vLLM benchmark #1932

Open

kedarpotdar-nv wants to merge 28 commits into main from kepotdar-trt-70b

3 participants
