Fix hardware label generation for runners with framework suffix in names by kedarpotdar-nv · Pull Request #6 · SemiAnalysisAI/InferenceX

kedarpotdar-nv · 2025-09-05T15:21:19Z

Problem

When using runner names that already include the framework suffix (e.g., b200-trt), the plotting script was incorrectly creating double suffixes in hardware labels, resulting in b200-trt-trt instead of the expected b200-trt.

Root Cause

The plot_perf.py script was blindly appending -trt to any hardware name when the framework was 'trt', without checking if the hardware name already contained the framework suffix.

Solution

Updated utils/plot_perf.py to check if the hardware name already ends with -trt before appending it:

# Create combined hw+framework identifier for plotting
if result.get('framework') == 'trt':
    # Check if hardware name already contains framework suffix
    if result['hw'].endswith('-trt'):
        result['hw_label'] = result['hw']
    else:
        result['hw_label'] = f"{result['hw']}-trt"
else:
    result['hw_label'] = result['hw']

Impact

✅ Fixes plotting for runners with names like b200-trt
✅ Maintains backward compatibility for existing runners like b200 → b200-trt
✅ No changes needed to workflow templates or result processing
✅ Results now plot correctly with proper hardware labels and colors

kimbochen · 2025-09-15T18:23:11Z

Thank you for the PR. The issue is fixed in other commits.

Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.

…p99, routing identity Addresses review #3 methodology critiques (schema_version 3): - Explicit measurement contracts (#4): adapters declare SUPPORTED_CONTRACTS and conform, rather than each choosing its own timing boundary. layout-and-dispatch-v1 times get_dispatch_layout INSIDE dispatch (the only contract MoRI can honor — its layout is computed in-kernel); cached-layout-comm-only-v1 hoists layout out (DeepEP normal) so dispatch is pure comm. run_ep.py rejects unsupported contract / ll+cached-layout. The misleading "comm-only-v1" label is gone. - Pooled-trial percentiles (#9, #2): N trials (default 3) x iters, token-order randomized per trial (seeded => identical across ranks; MoRI keeps ascending to avoid cold-jump wedge), per-iteration cross-rank-MAX samples POOLED, then p50/p90/p99 (p99 headline). p99 from ~50 samples was just the max. (#2 aggregation was already Q_p(max_r); verified.) - Routing identity proof (#3): routing_hash now SHA-256 of topk_idx AND gate weights; cross-rank trace-signature MIN==MAX check proves every rank (NVIDIA + AMD) built the identical trace, else status=invalid. Added per-dest-rank send histogram. - Separated logical bytes (#6): dispatch_logical_bytes + combine_logical_bytes recorded at their real dtypes with byte_contract; serial bandwidth removed. serial relabeled "sum of isolated medians". Correctness scope tagged roundtrip-reconstruction-smoke-v1 (#8 honesty). - Run linkage (#1): artifacts record GHA run_id/attempt/source SHA when present.

… rate, run links Addresses review #3 frontend critiques (backward-compatible with v2 docs): - Percentile selector p50/p90/p99 (p99 default); reads pooled-trial percentiles. - Suite selector backend-default vs resource-constrained — kept distinct, never read as one fair contest (#5). dtype/mode/resource/contract are all in the per-line label + hover; lines are uniquely colored (SKU family) + dashed-fp8 (#10). - Bandwidth axis renamed "Logical routed payload rate" using SEPARATE dispatch/combine bytes; serial bandwidth removed; serial relabeled "Σ isolated medians" (#6,#7). - Hover shows p50/p90/p99, contract, suite, and the WORKFLOW RUN (run id + sha) that produced the point (#1). Provenance text no longer claims a single dtype (the "bf16 while fp8 shown" bug); states routing-identity-proven, pooled-sample count, logical-rate caveat, suite-separation, and correctness-is-smoke (#9 fix).

fix trt chart

608a6a1

kedarpotdar-nv requested a review from kimbochen September 5, 2025 15:21

kimbochen closed this Sep 15, 2025

kimbochen deleted the kepotdar-fix-trt-chart branch September 15, 2025 18:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix hardware label generation for runners with framework suffix in names#6

Fix hardware label generation for runners with framework suffix in names#6
kedarpotdar-nv wants to merge 1 commit into
mainfrom
kepotdar-fix-trt-chart

kedarpotdar-nv commented Sep 5, 2025

Uh oh!

kimbochen commented Sep 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

kedarpotdar-nv commented Sep 5, 2025

Problem

Root Cause

Solution

Impact

Uh oh!

kimbochen commented Sep 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants