[NVIDIA] Fix vllm & sglang b200 updated containers#4

Merged
kimbochen merged 6 commits into main from fix-vllm-b200 on Sep 4, 2025

Conversation

@kedarpotdar-nv

Collaborator

No description provided.

@kimbochen kimbochen merged commit 75ec29c into main Sep 4, 2025
@kimbochen kimbochen deleted the fix-vllm-b200 branch September 4, 2025 00:42
jthomson04 pushed a commit to jthomson04/InferenceMAX that referenced this pull request Jan 21, 2026
Modify GB200 runs to use test partition
@cquil11 cquil11 added the NVIDIA label Apr 8, 2026
@cquil11 cquil11 changed the title Fix vllm & sglang b200 updated containers [NVIDIA] Fix vllm & sglang b200 updated containers Apr 8, 2026
chunfangamd added a commit that referenced this pull request May 24, 2026
…DSA state-index path

amd-master.yaml
  - Image: rocm/sgl-dev:sglang-0.5.9-rocm720-mi35x-mori-0402
        -> lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523
    (matches qwen3.5-fp8-mi355x-sglang-disagg; the older 0.5.9 image is
    no longer the reference build for hybrid-attention disagg models on
    MI355X.)
  - Scenarios: collapse the four legacy "top/middle/bottom/small-scale"
    search-spaces per ISL into a single 1P+1D TP=8 EP=1 dp-attn=false
    entry with the standard conc-list [8, 16, 32, 64, 128, 256, 512]
    for both 1k1k and 8k1k. dp-attn=false avoids the
    fused_moe_triton/layer.py:209 shared-slot assertion that
    --enable-dp-attention + --moe-a2a-backend mori triggers for GLM-5
    (256 routed + 1 shared expert; (256-1) % 8 = 7 != 0). The collapsed
    layout mirrors the qwen3.5-fp8-mi355x-sglang-disagg shape so the
    same CI matrix-expansion logic applies to both.
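As a rough illustration of the collapsed layout described above, the sketch below expands one scenario entry across the two sequence-length configurations and the standard conc-list, then repeats the arithmetic quoted for GLM-5. The dictionary keys and the `expand` helper are hypothetical and do not reflect the real amd-master.yaml schema or the CI expansion code; only the values (1P+1D, TP=8, EP=1, dp-attn=false, the conc-list, and the (256-1) % 8 figure) come from the commit message.

```python
from itertools import product

# Values taken from the commit message; the key names are illustrative only.
CONC_LIST = [8, 16, 32, 64, 128, 256, 512]
SEQ_CONFIGS = ["1k1k", "8k1k"]
SCENARIO = {"prefill": 1, "decode": 1, "tp": 8, "ep": 1, "dp_attn": False}


def expand(seq_configs, conc_list, scenario):
    """Hypothetical matrix expansion: one run per (sequence config, concurrency)."""
    return [
        {"seq": seq, "conc": conc, **scenario}
        for seq, conc in product(seq_configs, conc_list)
    ]


runs = expand(SEQ_CONFIGS, CONC_LIST, SCENARIO)

# Arithmetic quoted in the commit message for GLM-5 (256 routed experts).
ROUTED_EXPERTS = 256
remainder = (ROUTED_EXPERTS - 1) % 8

print(len(runs), remainder)
```

With a single entry per sequence configuration, the matrix is simply the product of the two lists, which is what lets one expansion routine serve both the GLM-5 and Qwen3.5 configs.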

patches/mori_conn.py
  - Add patch #4: rank + length normalization in
    MoriKVReceiver._send_swa_dsa_state, immediately before the
    group_concurrent_contiguous call. For GLM-5 (single DSA component),
    upstream hands dst_state_indices as a 2-D (1, N) array while
    src_state_indices is 1-D length 1; the existing [:common_len]
    slice operates only on the outer axis, leaving the rank mismatched.
    np.diff then produces (1, N-1) vs (0,), which can't broadcast and
    crashes with "operands could not be broadcast together with shapes
    (1,12) (0,)". The fix ravels both indices to 1-D and re-truncates
    to the common length so np.diff outputs compatible 1-D arrays. A
    one-shot log gate limits the warning to once per receiver class.

  - Verified end-to-end:
      glm5-fp8-mi355x-sglang-disagg gsm8k flexible-extract = 0.9704 +/- 0.0047
      glm5-fp8-mi355x-sglang-disagg gsm8k strict-match     = 0.9712 +/- 0.0046
      qwen3.5-fp8-mi355x-sglang-disagg gsm8k (regression)  = 0.9780 +/- 0.004
    Patch #4 fires zero times on the Qwen3.5 Mamba path (it lives
    inside _send_swa_dsa_state, never called for Mamba); patches #1-#3
    behavior is unchanged.
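The rank mismatch that patch #4 addresses can be reproduced with plain NumPy. The sketch below is not the code from patches/mori_conn.py: `normalize_state_indices` is a hypothetical helper, and the elementwise comparison merely stands in for whatever group_concurrent_contiguous does with the two np.diff results. The shapes (a 1-D length-1 source and a 2-D (1, 13) destination) are chosen to match the "(1,12) (0,)" error quoted above.

```python
import numpy as np


def normalize_state_indices(src_state_indices, dst_state_indices):
    """Hypothetical helper: flatten both index arrays, then cut to a common length."""
    src = np.asarray(src_state_indices).ravel()
    dst = np.asarray(dst_state_indices).ravel()
    common_len = min(len(src), len(dst))
    return src[:common_len], dst[:common_len]


src = np.array([3])                  # 1-D, length 1
dst = np.arange(13).reshape(1, 13)   # 2-D, shape (1, 13)

# Slicing alone only trims the outer axis, so dst stays 2-D.
common_len = min(len(src), len(dst))
src_cut, dst_cut = src[:common_len], dst[:common_len]
bad_shapes = (np.diff(dst_cut).shape, np.diff(src_cut).shape)

try:
    # Any elementwise combination of the two diffs fails to broadcast.
    (np.diff(dst_cut) != 1) | (np.diff(src_cut) != 1)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# After ravel + re-truncation both diffs are 1-D with the same length.
src_fix, dst_fix = normalize_state_indices(src, dst)
good_shapes = (np.diff(dst_fix).shape, np.diff(src_fix).shape)

print(bad_shapes, broadcast_failed, good_shapes)
```

Flattening before truncating matters here: `len()` on the 2-D array reports the outer axis (1), so truncating first would leave all N destination entries in place.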

patches/README.md
  - Document patch #4 alongside the existing three. Cross-link the full
    bug analysis at scripts/sglang_disagg/docs_glm5/01-bug-analysis.md
    and the gsm8k verification at
    scripts/sglang_disagg/docs_glm5/02-fix-and-verification.md.
Oseltamivir added a commit that referenced this pull request May 26, 2026
Oseltamivir added a commit that referenced this pull request Jun 23, 2026
Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate.

Fix review findings:
- benchmark failures/skipped-deepep now fail the job instead of reporting green (#1)
- DeepEP nodes from SLURM_NNODES not world_size//8 (#3)
- apply Buffer.set_num_sms so num_comm_sms is real (#8)
- nccl-tests -c 1 with a missing check footer is now invalid (#7)
- use context managers for file reads (#4,#5)
- launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9)
- trim workflow_dispatch sku options to launcher-backed pools (#2)

Artifact-path finding (#6) already fixed via cx_collect_results.
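A minimal sketch of the result-gate idea from this commit, under stated assumptions: the JSON-per-result layout, the `status` field, and the function names `load_results`, `gate_exit_code`, and `node_count` are all hypothetical and are not taken from summarize.py. Only SLURM_NNODES (the variable Slurm sets for the node count of the allocation) and the use of context managers for file reads come from the commit message.

```python
import json
import os
from pathlib import Path


def load_results(result_dir):
    """Read every *.json result file, closing each handle via a context manager."""
    results = []
    for path in sorted(Path(result_dir).glob("*.json")):
        with open(path, encoding="utf-8") as fh:
            results.append(json.load(fh))
    return results


def gate_exit_code(results):
    """Return 0 only if at least one result exists and every result is 'ok'.

    Failed and skipped entries both count as a job failure, so a job
    cannot report green when a benchmark never actually ran.
    """
    if not results:
        return 1
    return 0 if all(r.get("status") == "ok" for r in results) else 1


def node_count(env=os.environ):
    """Take the node count from the scheduler rather than deriving it from world size."""
    return int(env["SLURM_NNODES"])


# Example: one skipped benchmark is enough to fail the gate.
example = [{"name": "nccl", "status": "ok"}, {"name": "deepep", "status": "skipped"}]
print(gate_exit_code(example))
```

A job script would typically pass the returned code to `sys.exit` after printing the summary table, so the table is always visible even when the gate fails.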
Oseltamivir added a commit that referenced this pull request Jun 25, 2026
…p99, routing identity

Addresses review #3 methodology critiques (schema_version 3):

- Explicit measurement contracts (#4): adapters declare SUPPORTED_CONTRACTS and conform,
  rather than each choosing its own timing boundary. layout-and-dispatch-v1 times
  get_dispatch_layout INSIDE dispatch (the only contract MoRI can honor — its layout is
  computed in-kernel); cached-layout-comm-only-v1 hoists layout out (DeepEP normal) so
  dispatch is pure comm. run_ep.py rejects unsupported contract / ll+cached-layout. The
  misleading "comm-only-v1" label is gone.

- Pooled-trial percentiles (#9, #2): N trials (default 3) x iters, token-order randomized
  per trial (seeded => identical across ranks; MoRI keeps ascending to avoid cold-jump
  wedge), per-iteration cross-rank-MAX samples POOLED, then p50/p90/p99 (p99 headline).
  Previously, a p99 taken from ~50 samples was effectively just the max. (#2: aggregation was
  already Q_p(max_r); verified.)

- Routing identity proof (#3): routing_hash now SHA-256 of topk_idx AND gate weights;
  cross-rank trace-signature MIN==MAX check proves every rank (NVIDIA + AMD) built the
  identical trace, else status=invalid. Added per-dest-rank send histogram.

- Separated logical bytes (#6): dispatch_logical_bytes + combine_logical_bytes recorded at
  their real dtypes with byte_contract; serial bandwidth removed. serial relabeled "sum of
  isolated medians". Correctness scope tagged roundtrip-reconstruction-smoke-v1 (#8 honesty).

- Run linkage (#1): artifacts record GHA run_id/attempt/source SHA when present.
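The pooled-percentile and routing-identity items above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the benchmark's code: the function names, the (iters, ranks) array layout, and the way dtype and shape are folded into the hash are all hypothetical, and the cross-rank MIN==MAX check is simulated on a local list rather than with a collective.

```python
import hashlib

import numpy as np


def token_order(num_tokens, seed, trial):
    """Seeded permutation: the same (seed, trial) yields the same order on every rank."""
    return np.random.default_rng([seed, trial]).permutation(num_tokens)


def pooled_percentiles(trials):
    """Pool per-iteration cross-rank maxima over all trials, then take percentiles.

    Each trial is an (iters, ranks) array of latencies.
    """
    per_iter_max = [np.asarray(t, dtype=float).max(axis=1) for t in trials]
    pooled = np.concatenate(per_iter_max)
    p50, p90, p99 = np.percentile(pooled, [50, 90, 99])
    return {"n": int(pooled.size), "p50": float(p50), "p90": float(p90), "p99": float(p99)}


def routing_hash(topk_idx, gate_weights):
    """SHA-256 over both the expert indices and the gate weights."""
    h = hashlib.sha256()
    for arr in (topk_idx, gate_weights):
        a = np.ascontiguousarray(arr)
        h.update(str(a.dtype).encode())
        h.update(str(a.shape).encode())
        h.update(a.tobytes())
    return h.hexdigest()


def all_ranks_agree(signatures):
    """Stand-in for the cross-rank MIN==MAX check on trace signatures."""
    return min(signatures) == max(signatures)


rng = np.random.default_rng(0)
trials = [rng.uniform(10.0, 20.0, size=(50, 8)) for _ in range(3)]  # 3 trials x 50 iters
stats = pooled_percentiles(trials)

idx = np.array([[0, 5], [2, 7]], dtype=np.int64)
weights = np.array([[0.6, 0.4], [0.5, 0.5]], dtype=np.float32)
sig = routing_hash(idx, weights)
print(stats["n"], all_ranks_agree([sig, sig, sig]))
```

Pooling three trials of 50 iterations gives 150 samples, so the 99th percentile is interpolated near the top of the distribution rather than coinciding with the single largest sample.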
3 participants