Skip to content

New FP4 and Testing Workflows#9

Merged
kimbochen merged 36 commits into
mainfrom
amd-fp4
Sep 15, 2025
Merged

New FP4 and Testing Workflows#9
kimbochen merged 36 commits into
mainfrom
amd-fp4

Conversation

@kimbochen

Copy link
Copy Markdown
Collaborator

Merging to debug workflow scheduler testing workflow

  • Added FP4 and renamed all original ones to FP8
  • Added FP4 configs for AMD
  • Rewritten benchmark template input arguments
    • Cleaning up arguments by using env vars
  • Testing out new workflows for testing
    • Single Runner Test
    • Single GPU Sweep Test
    • Single Model Test
    • Workflow Scheduler Test

@kimbochen kimbochen merged commit 4e3c32e into main Sep 15, 2025
@kimbochen kimbochen deleted the amd-fp4 branch September 15, 2025 16:17
Oseltamivir added a commit that referenced this pull request Jun 23, 2026
Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate. Fix review findings: benchmark failures/skipped-deepep now fail the job instead of reporting green (#1); DeepEP nodes from SLURM_NNODES not world_size//8 (#3); apply Buffer.set_num_sms so num_comm_sms is real (#8); nccl-tests -c 1 with a missing check footer is now invalid (#7); use context managers for file reads (#4,#5); launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9); trim workflow_dispatch sku options to launcher-backed pools (#2). Artifact-path finding (#6) already fixed via cx_collect_results.
Oseltamivir added a commit that referenced this pull request Jun 25, 2026
…p99, routing identity

Addresses review #3 methodology critiques (schema_version 3):

- Explicit measurement contracts (#4): adapters declare SUPPORTED_CONTRACTS and conform,
  rather than each choosing its own timing boundary. layout-and-dispatch-v1 times
  get_dispatch_layout INSIDE dispatch (the only contract MoRI can honor — its layout is
  computed in-kernel); cached-layout-comm-only-v1 hoists layout out (DeepEP normal) so
  dispatch is pure comm. run_ep.py rejects unsupported contract / ll+cached-layout. The
  misleading "comm-only-v1" label is gone.

- Pooled-trial percentiles (#9, #2): N trials (default 3) x iters, token-order randomized
  per trial (seeded => identical across ranks; MoRI keeps ascending to avoid cold-jump
  wedge), per-iteration cross-rank-MAX samples POOLED, then p50/p90/p99 (p99 headline).
  p99 from ~50 samples was just the max. (#2 aggregation was already Q_p(max_r); verified.)

- Routing identity proof (#3): routing_hash now SHA-256 of topk_idx AND gate weights;
  cross-rank trace-signature MIN==MAX check proves every rank (NVIDIA + AMD) built the
  identical trace, else status=invalid. Added per-dest-rank send histogram.

- Separated logical bytes (#6): dispatch_logical_bytes + combine_logical_bytes recorded at
  their real dtypes with byte_contract; serial bandwidth removed. serial relabeled "sum of
  isolated medians". Correctness scope tagged roundtrip-reconstruction-smoke-v1 (#8 honesty).

- Run linkage (#1): artifacts record GHA run_id/attempt/source SHA when present.
Oseltamivir added a commit that referenced this pull request Jun 25, 2026
… rate, run links

Addresses review #3 frontend critiques (backward-compatible with v2 docs):
- Percentile selector p50/p90/p99 (p99 default); reads pooled-trial percentiles.
- Suite selector backend-default vs resource-constrained — kept distinct, never read as
  one fair contest (#5). dtype/mode/resource/contract are all in the per-line label +
  hover; lines are uniquely colored (SKU family) + dashed-fp8 (#10).
- Bandwidth axis renamed "Logical routed payload rate" using SEPARATE dispatch/combine
  bytes; serial bandwidth removed; serial relabeled "Σ isolated medians" (#6,#7).
- Hover shows p50/p90/p99, contract, suite, and the WORKFLOW RUN (run id + sha) that
  produced the point (#1). Provenance text no longer claims a single dtype (the
  "bf16 while fp8 shown" bug); states routing-identity-proven, pooled-sample count,
  logical-rate caveat, suite-separation, and correctness-is-smoke (#9 fix).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant