Fix hardware label generation for runners with framework suffix in names#6
Closed
kedarpotdar-nv wants to merge 1 commit into
Closed
Fix hardware label generation for runners with framework suffix in names#6kedarpotdar-nv wants to merge 1 commit into
kedarpotdar-nv wants to merge 1 commit into
Conversation
Collaborator
|
Thank you for the PR. The issue is fixed in other commits. |
Oseltamivir
added a commit
that referenced
this pull request
Jun 23, 2026
Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.
Oseltamivir
added a commit
that referenced
this pull request
Jun 25, 2026
…p99, routing identity Addresses review #3 methodology critiques (schema_version 3): - Explicit measurement contracts (#4): adapters declare SUPPORTED_CONTRACTS and conform, rather than each choosing its own timing boundary. layout-and-dispatch-v1 times get_dispatch_layout INSIDE dispatch (the only contract MoRI can honor — its layout is computed in-kernel); cached-layout-comm-only-v1 hoists layout out (DeepEP normal) so dispatch is pure comm. run_ep.py rejects unsupported contract / ll+cached-layout. The misleading "comm-only-v1" label is gone. - Pooled-trial percentiles (#9, #2): N trials (default 3) x iters, token-order randomized per trial (seeded => identical across ranks; MoRI keeps ascending to avoid cold-jump wedge), per-iteration cross-rank-MAX samples POOLED, then p50/p90/p99 (p99 headline). p99 from ~50 samples was just the max. (#2 aggregation was already Q_p(max_r); verified.) - Routing identity proof (#3): routing_hash now SHA-256 of topk_idx AND gate weights; cross-rank trace-signature MIN==MAX check proves every rank (NVIDIA + AMD) built the identical trace, else status=invalid. Added per-dest-rank send histogram. - Separated logical bytes (#6): dispatch_logical_bytes + combine_logical_bytes recorded at their real dtypes with byte_contract; serial bandwidth removed. serial relabeled "sum of isolated medians". Correctness scope tagged roundtrip-reconstruction-smoke-v1 (#8 honesty). - Run linkage (#1): artifacts record GHA run_id/attempt/source SHA when present.
Oseltamivir
added a commit
that referenced
this pull request
Jun 25, 2026
… rate, run links Addresses review #3 frontend critiques (backward-compatible with v2 docs): - Percentile selector p50/p90/p99 (p99 default); reads pooled-trial percentiles. - Suite selector backend-default vs resource-constrained — kept distinct, never read as one fair contest (#5). dtype/mode/resource/contract are all in the per-line label + hover; lines are uniquely colored (SKU family) + dashed-fp8 (#10). - Bandwidth axis renamed "Logical routed payload rate" using SEPARATE dispatch/combine bytes; serial bandwidth removed; serial relabeled "Σ isolated medians" (#6,#7). - Hover shows p50/p90/p99, contract, suite, and the WORKFLOW RUN (run id + sha) that produced the point (#1). Provenance text no longer claims a single dtype (the "bf16 while fp8 shown" bug); states routing-identity-proven, pooled-sample count, logical-rate caveat, suite-separation, and correctness-is-smoke (#9 fix).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When using runner names that already include the framework suffix (e.g., b200-trt), the plotting script was incorrectly creating double suffixes in hardware labels, resulting in b200-trt-trt instead of the expected b200-trt.
Root Cause
The plot_perf.py script was blindly appending -trt to any hardware name when the framework was 'trt', without checking if the hardware name already contained the framework suffix.
Solution
Updated utils/plot_perf.py to check if the hardware name already ends with -trt before appending it:
# Create combined hw+framework identifier for plotting if result.get('framework') == 'trt': # Check if hardware name already contains framework suffix if result['hw'].endswith('-trt'): result['hw_label'] = result['hw'] else: result['hw_label'] = f"{result['hw']}-trt" else: result['hw_label'] = result['hw']Impact
✅ Fixes plotting for runners with names like b200-trt
✅ Maintains backward compatibility for existing runners like b200 → b200-trt
✅ No changes needed to workflow templates or result processing
✅ Results now plot correctly with proper hardware labels and colors