New FP4 and Testing Workflows by kimbochen · Pull Request #9 · SemiAnalysisAI/InferenceX

kimbochen · 2025-09-15T16:16:15Z

Merging to debug the workflow scheduler testing workflow.

Added FP4 and renamed all original ones to FP8
Added FP4 configs for AMD
Rewrote benchmark template input arguments
- Cleaned up arguments by using env vars
Trying out new testing workflows
- Single Runner Test
- Single GPU Sweep Test
- Single Model Test
- Workflow Scheduler Test
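A minimal sketch of what "cleaning up arguments by using env vars" could look like on the script side. The variable names `MODEL`, `PRECISION`, and `TP`, the `BenchmarkArgs` type, and the run-name layout are all hypothetical and chosen for illustration; the PR does not show the actual names.

```python
import os
from dataclasses import dataclass

# Hypothetical set of precisions; the PR only mentions FP4 and FP8.
SUPPORTED_PRECISIONS = {"fp4", "fp8"}


@dataclass(frozen=True)
class BenchmarkArgs:
    model: str
    precision: str
    tp: int


def load_benchmark_args(env=None):
    """Read benchmark settings from environment variables (names are assumed)."""
    env = os.environ if env is None else env
    precision = env.get("PRECISION", "fp8").lower()
    if precision not in SUPPORTED_PRECISIONS:
        raise ValueError(f"unsupported precision: {precision!r}")
    return BenchmarkArgs(
        model=env["MODEL"],           # required: a KeyError signals a missing var
        precision=precision,
        tp=int(env.get("TP", "1")),   # tensor-parallel size, defaults to 1
    )


def run_name(args):
    """Include the precision in the run name so FP4 and FP8 runs stay distinct."""
    return f"{args.model}-{args.precision}-tp{args.tp}"
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in a single-runner test without touching the real environment.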

Add summarize.py (compact NCCL/DeepEP results table, printed at end of every job) and make it the result gate.

Fix review findings:
- Benchmark failures/skipped-deepep now fail the job instead of reporting green (#1)
- DeepEP nodes from SLURM_NNODES, not world_size//8 (#3)
- Apply Buffer.set_num_sms so num_comm_sms is real (#8)
- nccl-tests -c 1 with a missing check footer is now invalid (#7)
- Use context managers for file reads (#4, #5)
- Launchers export COLLECTIVEX_IMAGE/_DIGEST for provenance (#9)
- Trim workflow_dispatch sku options to launcher-backed pools (#2)

Artifact-path finding (#6) already fixed via cx_collect_results.
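The result-gate idea described above can be sketched as follows. This is not the actual summarize.py: the record layout (`name`, `status`), the status strings, and the function names are assumptions. Only `SLURM_NNODES`, the standard Slurm job environment variable, comes from the commit message.

```python
import os


def node_count(env=None):
    """Take the node count from SLURM_NNODES rather than deriving it from world size."""
    env = os.environ if env is None else env
    return int(env["SLURM_NNODES"])


def summarize(results):
    """Render a compact text table, one row per benchmark result."""
    lines = [f"{'benchmark':<12} {'status':<8}"]
    for r in results:
        lines.append(f"{r['name']:<12} {r['status']:<8}")
    return "\n".join(lines)


def gate(results):
    """Return an exit code: 0 only if there are results and every one is 'ok'.

    Failed, skipped, and invalid entries (hypothetical status names) all
    fail the job, so a skipped benchmark can no longer report green.
    """
    if not results:
        return 1
    return 0 if all(r["status"] == "ok" for r in results) else 1
```

A job script would print `summarize(results)` and then exit with `gate(results)`, so the table is always visible even when the gate fails.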

…p99, routing identity

Addresses review #3 methodology critiques (schema_version 3):
- Explicit measurement contracts (#4): adapters declare SUPPORTED_CONTRACTS and conform, rather than each choosing its own timing boundary. layout-and-dispatch-v1 times get_dispatch_layout INSIDE dispatch (the only contract MoRI can honor, since its layout is computed in-kernel); cached-layout-comm-only-v1 hoists layout out (DeepEP normal) so dispatch is pure comm. run_ep.py rejects unsupported contract / ll+cached-layout. The misleading "comm-only-v1" label is gone.
- Pooled-trial percentiles (#9, #2): N trials (default 3) x iters, token-order randomized per trial (seeded => identical across ranks; MoRI keeps ascending to avoid cold-jump wedge), per-iteration cross-rank-MAX samples POOLED, then p50/p90/p99 (p99 headline). p99 from ~50 samples was just the max. (#2 aggregation was already Q_p(max_r); verified.)
- Routing identity proof (#3): routing_hash now SHA-256 of topk_idx AND gate weights; cross-rank trace-signature MIN==MAX check proves every rank (NVIDIA + AMD) built the identical trace, else status=invalid. Added per-dest-rank send histogram.
- Separated logical bytes (#6): dispatch_logical_bytes + combine_logical_bytes recorded at their real dtypes with byte_contract; serial bandwidth removed. serial relabeled "sum of isolated medians". Correctness scope tagged roundtrip-reconstruction-smoke-v1 (#8 honesty).
- Run linkage (#1): artifacts record GHA run_id/attempt/source SHA when present.
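The pooled-trial percentile and routing-identity ideas above can be illustrated with a small, standard-library-only sketch. It is not the code from run_ep.py: the function names, the nearest-rank percentile rule, the byte packing used for hashing, and comparing hex digests on one process (rather than reducing a signature across ranks) are all assumptions made for illustration.

```python
import hashlib
import math
import struct


def cross_rank_max(per_rank):
    """per_rank[r][i] is the latency of iteration i on rank r; return the max per iteration."""
    return [max(vals) for vals in zip(*per_rank)]


def percentile(samples, p):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]


def pooled_percentiles(trials, ps=(50, 90, 99)):
    """Pool the per-iteration cross-rank-max samples of all trials, then take percentiles."""
    pooled = [s for trial in trials for s in cross_rank_max(trial)]
    return {f"p{p}": percentile(pooled, p) for p in ps}


def routing_hash(topk_idx, gate_weights):
    """SHA-256 over expert indices AND gate weights (packing format is assumed)."""
    h = hashlib.sha256()
    for row in topk_idx:
        h.update(struct.pack(f"<{len(row)}q", *row))   # int64, little-endian
    for row in gate_weights:
        h.update(struct.pack(f"<{len(row)}f", *row))   # float32, little-endian
    return h.hexdigest()


def routing_identical(signatures):
    """MIN == MAX over all ranks' signatures means every rank built the same trace."""
    return min(signatures) == max(signatures)
```

With only a handful of pooled samples the nearest-rank p99 is simply the maximum, which is the reason the commit pools several trials before reporting percentiles.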

… rate, run links

Addresses review #3 frontend critiques (backward-compatible with v2 docs):
- Percentile selector p50/p90/p99 (p99 default); reads pooled-trial percentiles.
- Suite selector backend-default vs resource-constrained: kept distinct, never read as one fair contest (#5). dtype/mode/resource/contract are all in the per-line label + hover; lines are uniquely colored (SKU family) + dashed-fp8 (#10).
- Bandwidth axis renamed "Logical routed payload rate" using SEPARATE dispatch/combine bytes; serial bandwidth removed; serial relabeled "Σ isolated medians" (#6, #7).
- Hover shows p50/p90/p99, contract, suite, and the WORKFLOW RUN (run id + sha) that produced the point (#1). Provenance text no longer claims a single dtype (the "bf16 while fp8 shown" bug); states routing-identity-proven, pooled-sample count, logical-rate caveat, suite-separation, and correctness-is-smoke (#9 fix).

kimbochen added 30 commits September 12, 2025 21:32

Test MI300X FP4 logic.

425f74d

Comment out to focus MI300X.

7f7eb98

Fix collect results.

11ae264

Fix precision.

3162866

Removed lock.

2579553

Fixed CR script.

afb74de

Added 70B FP4 script.

09f1d92

Added FP4.

501d548

Added precision to run name.

2994e95

Reduced TP.

e9c77bc

Added DSR1 MI355X.

35b3c08

Added quark installation.

77cee75

Added 70B MI325X.

e8dbeb9

Updated MI325X runner script.

88969ad

Moved collect results.

e56eb3d

Fixed var.

c6cb9b1

Test refactoring.

a7f251e

Try vars.

bfe566a

Test env.

64268a7

Test env.

f84de1e

Give up.

1dc0612

Fix var eval.

8c1834b

Test using number.

7c770e5

Test using number.

89d0ec4

Test using number.

8452556

Try env again.

0105127

Try boolean.

96f8c29

Updated model and precision multiplexing logic.

2b76883

Added rm quark_logs.

d443eec

Updated precision multiplex logic.

d3922d0

kimbochen added 6 commits September 14, 2025 21:13

Fixed exp-name and precision multiplexing logic.

f4e0406

Fixed squash file path.

954d31e

Fixed squash path typo.

352bd9b

Refactored workflow scheduler.

06abff4

Removed new line.

1000b9c

Fixed gptoss missing args.

b2ac54a

kimbochen merged commit 4e3c32e into main Sep 15, 2025

kimbochen deleted the amd-fp4 branch September 15, 2025 16:17

claude-code-infmax Bot mentioned this pull request Jan 21, 2026

[NV] Update DSR1 GB200 FP4 Disagg Submission #510

Merged

functionstackx mentioned this pull request May 18, 2026

[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Test sgl-deep-gemm==0.0.1 pin for sgl#25551 (glm5-fp8-b300 DeepGemm regression) #1512

Merged

3 tasks

kimbochen merged 36 commits into main from amd-fp4
