Skip to content

Fix hardware label generation for runners with framework suffix in names#6

Closed
kedarpotdar-nv wants to merge 1 commit into
mainfrom
kepotdar-fix-trt-chart
Closed

Fix hardware label generation for runners with framework suffix in names#6
kedarpotdar-nv wants to merge 1 commit into
mainfrom
kepotdar-fix-trt-chart

Conversation

@kedarpotdar-nv

Copy link
Copy Markdown
Collaborator

Problem

When using runner names that already include the framework suffix (e.g., b200-trt), the plotting script was incorrectly creating double suffixes in hardware labels, resulting in b200-trt-trt instead of the expected b200-trt.

Root Cause

The plot_perf.py script was blindly appending -trt to any hardware name when the framework was 'trt', without checking if the hardware name already contained the framework suffix.

Solution

Updated utils/plot_perf.py to check if the hardware name already ends with -trt before appending it:

# Create combined hw+framework identifier for plotting
if result.get('framework') == 'trt':
    # Check if hardware name already contains framework suffix
    if result['hw'].endswith('-trt'):
        result['hw_label'] = result['hw']
    else:
        result['hw_label'] = f"{result['hw']}-trt"
else:
    result['hw_label'] = result['hw']

Impact

✅ Fixes plotting for runners with names like b200-trt
✅ Maintains backward compatibility for existing runners like b200 → b200-trt
✅ No changes needed to workflow templates or result processing
✅ Results now plot correctly with proper hardware labels and colors

@kimbochen

Copy link
Copy Markdown
Collaborator

Thank you for the PR. The issue is fixed in other commits.

@kimbochen kimbochen closed this Sep 15, 2025
@kimbochen kimbochen deleted the kepotdar-fix-trt-chart branch September 15, 2025 18:23
Oseltamivir added a commit that referenced this pull request Jun 23, 2026
Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.
Oseltamivir added a commit that referenced this pull request Jun 25, 2026
…p99, routing identity

Addresses review #3 methodology critiques (schema_version 3):

- Explicit measurement contracts (#4): adapters declare SUPPORTED_CONTRACTS and conform,
  rather than each choosing its own timing boundary. layout-and-dispatch-v1 times
  get_dispatch_layout INSIDE dispatch (the only contract MoRI can honor — its layout is
  computed in-kernel); cached-layout-comm-only-v1 hoists layout out (DeepEP normal) so
  dispatch is pure comm. run_ep.py rejects unsupported contract / ll+cached-layout. The
  misleading "comm-only-v1" label is gone.

- Pooled-trial percentiles (#9, #2): N trials (default 3) x iters, token-order randomized
  per trial (seeded => identical across ranks; MoRI keeps ascending to avoid cold-jump
  wedge), per-iteration cross-rank-MAX samples POOLED, then p50/p90/p99 (p99 headline).
  p99 from ~50 samples was just the max. (#2 aggregation was already Q_p(max_r); verified.)

- Routing identity proof (#3): routing_hash now SHA-256 of topk_idx AND gate weights;
  cross-rank trace-signature MIN==MAX check proves every rank (NVIDIA + AMD) built the
  identical trace, else status=invalid. Added per-dest-rank send histogram.

- Separated logical bytes (#6): dispatch_logical_bytes + combine_logical_bytes recorded at
  their real dtypes with byte_contract; serial bandwidth removed. serial relabeled "sum of
  isolated medians". Correctness scope tagged roundtrip-reconstruction-smoke-v1 (#8 honesty).

- Run linkage (#1): artifacts record GHA run_id/attempt/source SHA when present.
Oseltamivir added a commit that referenced this pull request Jun 25, 2026
… rate, run links

Addresses review #3 frontend critiques (backward-compatible with v2 docs):
- Percentile selector p50/p90/p99 (p99 default); reads pooled-trial percentiles.
- Suite selector backend-default vs resource-constrained — kept distinct, never read as
  one fair contest (#5). dtype/mode/resource/contract are all in the per-line label +
  hover; lines are uniquely colored (SKU family) + dashed-fp8 (#10).
- Bandwidth axis renamed "Logical routed payload rate" using SEPARATE dispatch/combine
  bytes; serial bandwidth removed; serial relabeled "Σ isolated medians" (#6,#7).
- Hover shows p50/p90/p99, contract, suite, and the WORKFLOW RUN (run id + sha) that
  produced the point (#1). Provenance text no longer claims a single dtype (the
  "bf16 while fp8 shown" bug); states routing-identity-proven, pooled-sample count,
  logical-rate caveat, suite-separation, and correctness-is-smoke (#9 fix).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants