feat(power): multinode measured-power aggregation by arygupt · Pull Request #1574 · SemiAnalysisAI/InferenceX

arygupt · 2026-05-27T19:23:25Z

Summary

Extends single-node measured-power aggregation (#1558) to multinode srt-slurm benchmarks. Wires per-node perf_samples_<host>.csv from srt-slurm's PR #35 perfmon through the launcher into process_result.py → aggregate_power.py, which now namespaces local GPU indices per source CSV stem so each node's local indices 0..N-1 don't collapse across nodes.

Backward compatible: aggregate_power() accepts both Path and Iterable[Path]; single-CSV callers (single-node start_gpu_monitor path) are unchanged. csv_path param name preserved.

Pipeline

srt-slurm perfmon (PR #35 on NVIDIA/srt-slurm, layered on
  SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon)
  → perf_samples_<host>.csv in outputs/<job>/logs/ on shared NFS
  → launch_gb300-cw.sh exports GPU_METRICS_CSV_GLOB to $GITHUB_ENV
  → process_result.py expands glob → aggregate_power.run() with list
  → aggregate_power.py emits cluster-wide avg_power_w + joules_per_*_token
  → InferenceX-app ETL auto-captures (no schema change)

Files

utils/aggregate_power.py — csv_path widened to Path | Iterable[Path]. Per-source GPU-id namespacing only kicks in for 2+ sources so single-node num_gpus is unchanged. CLI adds --csv-glob (mutually exclusive with --csv).
utils/process_result.py — bridge GPU_METRICS_CSV_GLOB env var. Glob takes precedence over single GPU_METRICS_CSV when both are set.
runners/launch_gb300-cw.sh — point dynamo-sglang at our srt-slurm fork, append monitoring: block to each recipe post-copy (idempotent), write GPU_METRICS_CSV_GLOB to $GITHUB_ENV after the job.
utils/test_aggregate_power.py — 8 new multinode cases: per-source namespacing, sub-second clock drift, asymmetric prefill/decode power, missing-CSV silent skip, backward-compat single-path-in-list, Iterable acceptance, E2E with list.
utils/test_process_result.py — 3 new cases: glob aggregation, precedence over single CSV, empty-match falls through.
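The per-source namespacing described for utils/aggregate_power.py can be illustrated with a minimal sketch. This is not the PR's code: `as_paths`, `count_gpus`, and the `(gpu, power_w)` row shape are hypothetical names chosen for illustration, and the real aggregator reads full CSV rows.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Set, Tuple, Union

# Hypothetical row shape for illustration: (local_gpu_index, power_w).
Row = Tuple[int, float]


def as_paths(csv_path: Union[str, Path, Iterable[Union[str, Path]]]) -> List[Path]:
    """Accept either one path or an iterable of paths and return a list."""
    if isinstance(csv_path, (str, Path)):
        return [Path(csv_path)]
    return [Path(p) for p in csv_path]


def count_gpus(sources: Dict[Path, List[Row]]) -> int:
    """Count distinct GPUs, keying by (CSV stem, local index) for 2+ sources."""
    namespaced = len(sources) >= 2
    seen: Set[object] = set()
    for path, rows in sources.items():
        for gpu, _power_w in rows:
            # With one source the bare index is kept, so single-node
            # num_gpus is unchanged; with several sources the stem keeps
            # node A's GPU 0 distinct from node B's GPU 0.
            seen.add((path.stem, gpu) if namespaced else gpu)
    return len(seen)
```

With two four-GPU nodes this counts 8 GPUs; without the stem in the key, the same input would collapse to 4, which is the failure mode the test plan's num_gpus check is meant to catch.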

Test plan

36/36 aggregator tests pass (28 existing + 8 new)
28/28 process_result tests pass (25 existing + 3 new)
nvidia-smi inside sglang container on real gb300-cw emits expected columns (timestamp, gpu, power_w) — verified manually with srun --container-image=...sglang...sqsh nvidia-smi --query-gpu=...
First E2E multinode sweep produces avg_power_w + joules_per_*_token in the agg JSON (pending perf-changelog.yaml entry + sweep-enabled label)
num_gpus in agg JSON matches prefill_gpus + decode_gpus from launcher (validates per-source namespacing — without the fix, num_gpus would equal a single node's gpus_per_node)
Chart at inferencex.semianalysis.com renders the new data via ?unofficialrun=<run_id>

Depends on

SemiAnalysisAI/srt-slurm:feat/inferencex-perfmon (pinned by the launcher). Tracks NVIDIA/srt-slurm PR #35 head; will rebase to upstream main once #35 merges.

Note

Medium Risk
Changes benchmark CI/launcher behavior and published agg JSON schema (new power/telemetry fields); failures are mostly best-effort for monitoring, but incorrect glob or GPU namespacing could silently skew power metrics until caught by smoke tests.

Overview
Extends measured-power from single-node to multinode by collecting per-node perf_samples_<role>_w<idx>_<host>.csv files and patching agg JSONs with cluster-wide and (when disagg) per-stage energy metrics.

NVIDIA (gb300 / srt-slurm): launch_gb300-cw.sh pins a fork with per-node perfmon, recursively injects monitoring: enabled into overlaid recipe YAMLs (fixes silent zero-power when recipes live in subdirs), exports GPU_METRICS_CSV_GLOB after the job. AMD (mi355x): each SGLang/vLLM node starts start_perf_monitor into NFS-shared perfmon/; launch_mi355x-amds.sh copies CSVs to the workspace before log cleanup and sets the same glob env var.

Aggregation pipeline: process_result.py prefers GPU_METRICS_CSV_GLOB over single GPU_METRICS_CSV (no stale single-node fallback when glob is set) and passes disagg into aggregate_power.run(). aggregate_power.py now accepts multiple CSVs with per-source GPU index namespacing, parses perfmon filenames for workers[], adds temp/util/mem/peak fields, disagg per-stage joules_per_input_token / decode-weighted joules_per_output_token, and bench-window fallbacks for srt-slurm date + duration when Unix timestamps are absent.

Ops hardening: Multinode workflow Slurm pre-cleanup uses timeouts, drain deadlines, and AMD-only sudo rm of stale benchmark_logs/; AMD launcher job_alive() tolerates squeue timeouts. AMD monitoring adds -m for VRAM and logs CSV headers for schema-debug.

Large test additions cover multinode/disagg/AMD CSV shapes and process_result glob behavior; perf-changelog.yaml documents smoke sweeps for validation.

^{Reviewed by Cursor Bugbot for commit 60b62ed. Bugbot is set up for automated code reviews on this repo. Configure here.}

…regation Appends entry for dsv4-fp4-gb300-dynamo-sglang so run-sweep.yml fires when the sweep-enabled label is added to PR #1574. The sweep produces the first multinode agg JSONs with avg_power_w + joules_per_*_token, validating the per-source GPU-id namespacing and GPU_METRICS_CSV_GLOB env-var bridge end-to-end on real GB300 hardware (gb300-cw cluster).

github-actions · 2026-05-27T19:35:42Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26534043069
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26534043069

claude · 2026-05-27T19:38:03Z

+    _csv_arg = None
+    _glob_pattern = os.environ.get('GPU_METRICS_CSV_GLOB')
+    if _glob_pattern:
+        _matched = sorted(Path(p) for p in _glob_module.glob(_glob_pattern))
+        if _matched:
+            _csv_arg = _matched
+        else:
+            print(
+                f'[process_result] GPU_METRICS_CSV_GLOB={_glob_pattern!r} matched no files',
+                file=sys.stderr,
+            )
+
+    if _csv_arg is None:
+        # Single-node path: gpu_metrics.csv written by start_gpu_monitor in the
+        # bench container.
+        _csv_candidates = [
+            os.environ.get('GPU_METRICS_CSV'),
+            'gpu_metrics.csv',
+            '/workspace/gpu_metrics.csv',
+        ]
+        _csv_arg = next(
+            (Path(p) for p in _csv_candidates if p and Path(p).is_file()),
+            None,
+        )


🔴 When GPU_METRICS_CSV_GLOB is set but matches no files, _csv_arg stays None and the code falls through to the single-CSV candidate list (GPU_METRICS_CSV, gpu_metrics.csv, /workspace/gpu_metrics.csv) — contradicting the comment at lines 145-148 that the glob 'Takes precedence over the single-CSV fallback'. On a persistent self-hosted runner with a stale /workspace/gpu_metrics.csv from a prior single-node run (or a leaked GPU_METRICS_CSV env var), a multinode run whose perfmon failed on every node would silently patch wrong single-node avg_power_w / joules_per_*_token values into the multinode agg JSON. Fix: when _glob_pattern is truthy, skip the single-CSV fallback regardless of whether the glob matched anything.

Extended reasoning...

The contract violation

The block at utils/process_result.py:142-159 documents the precedence contract clearly:

Takes precedence over the single-CSV fallback — if the launcher set the glob, the run was multinode and there is no single-CSV fallback to make.

But the implementation only honors that contract when the glob actually matches files. On empty match:

    _csv_arg = None
    _glob_pattern = os.environ.get('GPU_METRICS_CSV_GLOB')
    if _glob_pattern:
        _matched = sorted(Path(p) for p in _glob_module.glob(_glob_pattern))
        if _matched:
            _csv_arg = _matched
        else:
            print(..., file=sys.stderr)  # warns but doesn't prevent fallthrough

    if _csv_arg is None:  # still None: falls into single-CSV branch
        _csv_candidates = [
            os.environ.get('GPU_METRICS_CSV'),
            'gpu_metrics.csv',
            '/workspace/gpu_metrics.csv',
        ]
        ...

The else branch just logs; _csv_arg stays None, and the next if _csv_arg is None block consults the single-CSV candidates.

Step-by-step proof on a persistent self-hosted runner

Single-node run on gb300-cw_N completes successfully. benchmarks/benchmark_lib.sh exports GPU_METRICS_CSV=/workspace/gpu_metrics.csv (it lives in gpu_metrics.csv in cwd too). The file is left behind because the runner is persistent across jobs.

Next job is a multinode dynamo-sglang sweep. runners/launch_gb300-cw.sh (lines 297-318) writes GPU_METRICS_CSV_GLOB=$LOGS_DIR/perf_samples_*.csv to $GITHUB_ENV, but only when perf_csv_count > 0. Suppose perfmon failed to start on every node (srt-slurm PR #35 had startup issues, host driver mismatch, etc.): perf_csv_count would be 0 and the glob env var would not be written. Fine, that path is safe.

However, suppose perfmon CSVs were written at the end of the job (so the launcher writes the GLOB), but a downstream cleanup hook between launcher and process_result.py removed them, OR srt-slurm wrote the CSVs to a different path on a subsequent retry, OR a persistent env var (GPU_METRICS_CSV_GLOB from a prior job) leaks in. The glob expansion in process_result.py returns empty.

process_result.py enters the else branch on line 155, prints a warning, and falls through. os.environ.get('GPU_METRICS_CSV') from the prior single-node job returns /workspace/gpu_metrics.csv (or gpu_metrics.csv in cwd is still there). Path(p).is_file() is True. _csv_arg = Path('/workspace/gpu_metrics.csv').

_aggregate_power_run is called with the stale single-node CSV.

Why the bench-window timestamp filter doesn't always save us

One verifier argued the start_unix <= ts <= end_unix filter at aggregate_power.py:177-178 would reject stale samples. That's true if the window comes from explicit Unix timestamps. But this PR adds two new fallback tiers in _load_bench_window:

Tier 2: date field parsed as a UTC string (YYYYMMDD-HHMMSS).

Tier 3: bench_result_path.stat().st_mtime — the bench JSON's own mtime, which is the current run's mtime, used as bench-end with start = end - duration.

The mtime tier is exactly the danger zone: on a persistent runner the bench JSON is freshly written, so its mtime is now. If the stale gpu_metrics.csv was also written recently (within the derived [mtime - duration, mtime] window — possible if the prior single-node run finished a few minutes ago), its samples do fall inside the window. Result: silent wrong avg_power_w and joules_per_*_token patched into the multinode agg JSON, which InferenceX-app's ETL auto-captures into the dashboard.
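The three tiers can be sketched as below to make the mtime risk concrete. This is a sketch under stated assumptions, not the PR's `_load_bench_window`: the key names `start_unix`, `end_unix`, `date`, and `duration` are assumed, and treating `date` as the bench start is also an assumption.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional, Tuple


def load_bench_window(bench: dict, bench_path: Path) -> Optional[Tuple[float, float, str]]:
    """Return (start, end, tier) for the benchmark window, or None."""
    # Tier 1: explicit Unix timestamps (assumed key names).
    start, end = bench.get("start_unix"), bench.get("end_unix")
    if start is not None and end is not None:
        return float(start), float(end), "unix"

    duration = bench.get("duration")
    if duration is None:
        return None

    # Tier 2: date string parsed as UTC, YYYYMMDD-HHMMSS.
    # Assumption: the date marks the start of the benchmark.
    date = bench.get("date")
    if date:
        try:
            parsed = datetime.strptime(date, "%Y%m%d-%H%M%S")
            t0 = parsed.replace(tzinfo=timezone.utc).timestamp()
            return t0, t0 + float(duration), "date"
        except ValueError:
            pass  # fall through to the mtime tier

    # Tier 3: the bench JSON's own mtime is taken as the bench end.
    end_mtime = bench_path.stat().st_mtime
    return end_mtime - float(duration), end_mtime, "mtime"
```

In the mtime tier the window is anchored to when the bench JSON was written, so any CSV sample stamped within `duration` seconds before that moment passes the filter, whichever run produced it.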

What the test misses

The accompanying test test_multinode_csv_glob_empty_match_falls_through_silently only asserts the no-stale-file case (asserts 'avg_power_w' not in patched). It does not stage a stale fallback CSV, so it can't catch the precedence violation. test_multinode_csv_glob_takes_precedence_over_single_csv only tests precedence when the glob matches.

Fix

One-line change in the empty-match branch:

    if _glob_pattern:
        _matched = sorted(Path(p) for p in _glob_module.glob(_glob_pattern))
        if _matched:
            _csv_arg = _matched
        else:
            _csv_arg = []  # sentinel: glob attempted, fallback forbidden
            print(...)

    if not _csv_arg:  # treats [] same as None for the downstream check, but…
        if _glob_pattern:
            pass  # …skip single-CSV candidates when glob was attempted
        else:
            _csv_candidates = [...]
            _csv_arg = next(...)

Or more cleanly: guard the single-CSV block on not _glob_pattern instead of _csv_arg is None.
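The cleaner variant can be sketched as a standalone function. This is a hedged illustration rather than the code that landed: `resolve_csv_arg` and its `env` / `fallbacks` parameters are hypothetical, introduced so the logic can be exercised without touching the real process environment.

```python
import glob
from pathlib import Path
from typing import List, Mapping, Optional, Sequence, Union


def resolve_csv_arg(
    env: Mapping[str, str],
    fallbacks: Sequence[str],
) -> Optional[Union[Path, List[Path]]]:
    """Pick the CSV input: glob matches if the glob is set, else a single CSV."""
    pattern = env.get("GPU_METRICS_CSV_GLOB")
    if pattern:
        matched = sorted(Path(p) for p in glob.glob(pattern))
        # The glob being set means the run was multinode, so the
        # single-CSV candidates are never consulted, even on empty match.
        return matched or None

    candidates = [env.get("GPU_METRICS_CSV"), *fallbacks]
    return next((Path(p) for p in candidates if p and Path(p).is_file()), None)
```

A regression test for the precedence contract would stage a stale `gpu_metrics.csv`, set a glob that matches nothing, and assert that the result is `None` rather than the stale path.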

github-actions · 2026-05-28T00:57:20Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26547958720
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26547958720

github-actions · 2026-05-28T04:57:33Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26548110246
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26548110246

… joules

Layers per-worker breakdown on top of the cluster-wide multinode aggregation in the parent PR #1574.

New agg JSON fields (additive; all existing keys preserved bit-for-bit for backward compat):

- workers: [{role, worker_idx, num_gpus, avg_power_w}, ...] with role ∈ "prefill" / "decode" / "agg" / "frontend". Each (role, idx) aggregates across all CSVs for that worker: a multi-node TP=16 decode worker on 4 nodes produces one workers entry with num_gpus=16.
- prefill_avg_power_w, decode_avg_power_w (disagg only): weighted per-GPU averages within each role.
- joules_per_input_token = prefill_energy / total_input_tokens and joules_per_output_token_decode = decode_energy / total_output_tokens: disagg-only role-split metrics.

Existing joules_per_output_token and joules_per_total_token keep their cluster-wide semantics so the chart won't shift on existing data.

Worker → CSV mapping is by filename: srt-slurm's perfmon (companion change on SemiAnalysisAI/srt-slurm c4c86dc) writes `perf_samples_<role>_w<worker_idx>_<host>.csv`. Unlabeled filenames (old single-CSV format) silently emit an empty workers list and skip the role split; cluster-wide metrics are unchanged in that case.

77/77 tests pass (68 existing + 9 new: per-worker grouping, multi-node worker aggregation, mixed labeled/unlabeled inputs, disagg E2E with role split, agg E2E omitting disagg-only fields, bit-for-bit backward compat for old-format callers).
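The filename contract above can be sketched as a small parser. The regex and the names `WorkerLabel` and `parse_worker_label` are illustrative assumptions; the PR's own parsing code is not shown in this thread.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional

# Matches the stem of perf_samples_<role>_w<worker_idx>_<host>.csv.
_PERFMON_STEM_RE = re.compile(
    r"^perf_samples_(prefill|decode|agg|frontend)_w(\d+)_(.+)$"
)


class WorkerLabel(NamedTuple):
    role: str
    worker_idx: int
    host: str


def parse_worker_label(path: Path) -> Optional[WorkerLabel]:
    """Return the worker label encoded in a perfmon filename, or None."""
    match = _PERFMON_STEM_RE.match(path.stem)
    if match is None:
        # Unlabeled (old-format) filename: the caller skips the role split.
        return None
    return WorkerLabel(match.group(1), int(match.group(2)), match.group(3))
```

Grouping CSVs by `(role, worker_idx)` after parsing is what lets one worker spread over several hosts appear as a single workers entry.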

Squashed rebase of NVIDIA/srt-slurm PR NVIDIA#35 (kdhruv/gweperf_integration) onto current main (which now includes default_bash_preamble, added since PR NVIDIA#35 was opened on 2026-04-13). Original PR NVIDIA#35 had three commits; their net effect is collapsed here to one because the second commit replaced the first's gweperf integration with a built-in poller.

Adds:
- src/srtctl/monitor/perfmon.py (new) - nvidia-smi polling, per-node perf_samples_<node>.csv + perf_summary_<node>.json output.
- MonitoringConfig in src/srtctl/core/schema.py (new) - {enabled, sample_interval}, top-level SrtConfig field.
- _start_perf_monitor / _stop_perf_monitor in BenchmarkStageMixin (new) - one process per worker node, started before bench, stopped with SIGINT and a 30s grace period.
- tests/test_monitoring.py (new) - 19 tests, all passing upstream.

Consumed by SemiAnalysisAI/InferenceX#1574 via the pinned ref SemiAnalysisAI/srt-slurm@feat/inferencex-perfmon. Will revert this fork pin to NVIDIA/srt-slurm@main once PR NVIDIA#35 merges upstream.

Resolve perf-changelog.yaml append conflict by keeping all three new entries: main's #1579 (qwen3.5-fp4-mi355x-sglang-disagg) plus this branch's #1574 re-trigger and the AMD multinode measured-power entry. Append-only file (process_changelog rejects deletions); no lines removed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-05-28T20:39:45Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26593269421
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26593269421

github-actions · 2026-05-28T21:34:17Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26600823211
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26600823211

Workflow's paths: filter only fires on perf-changelog.yaml. This bumps the dsv4-fp4-gb300-dynamo-sglang entry so the sweep picks up the new per-worker power + per-stage J/token aggregation from 24f46ff. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…+ add temp/util/mem

Realigns the per-worker / per-stage schema introduced in 06558b9 to match the canonical METRIC_KEYS already declared in InferenceX-app (packages/app/src/lib/metric-keys.ts). Previously this PR overrode cluster-wide joules_per_output_token for disagg runs, which would silently shift the meaning of a shared field. New per-stage values are emitted as separate flat scalars so the cluster keys stay byte-stable.

Schema changes:
- Revert disagg override on joules_per_output_token and joules_per_total_token: both are now ALWAYS cluster-wide (total_system_energy / token_count), matching single-node math and the frontend's existing axis labels.
- Add new disagg-only flat scalars (already in frontend METRIC_KEYS):
  - prefill_avg_power_w: cluster mean across prefill workers
  - decode_avg_power_w: cluster mean across decode workers
  - joules_per_output_token_decode: decode_energy / output_tokens
  - joules_per_input_token: unchanged (prefill_energy / input_tokens)
- Rename power_by_worker[] -> workers[] to match InferenceX-app's BenchmarkRow.workers / WorkerPower interface.
- Each workers[] entry extended with per-worker telemetry: avg_temp_c, peak_temp_c, avg_util_pct, avg_mem_used_mb.
- Add matching cluster-wide telemetry scalars (per-GPU mean, omitted when CSV lacks the column).

Implementation:
- _read_samples + _aggregate_rows refactored to extract all metric columns in one pass (single-vendor regex per metric, gracefully degrades when a column is absent).
- aggregate_power() preserved as a thin compat wrapper returning the old (power, num_gpus) tuple so external callers don't break.
- Per-stage prefill_avg_power_w / decode_avg_power_w use weighted mean by num_gpus (matches how cluster avg_power_w is computed).
- Frontend-labeled CSVs still excluded from per-stage energy attribution; included in cluster totals.

Tests: 107/107 pass (88 existing baseline preserved, 14 new telemetry tests, 5 schema-renamed tests updated in place). New coverage: temp / util / mem extraction across NVIDIA + AMD + srt-slurm CSV schemas, peak vs avg distinction, missing-column graceful degradation, per-worker telemetry, per-stage weighted-mean scalars.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
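The weighted mean by num_gpus used for the per-stage scalars can be worked through in a few lines. This sketch assumes each worker's avg_power_w is already a per-GPU mean; `weighted_avg_power` is a hypothetical helper name, not one from the PR.

```python
from typing import Iterable, Optional, Tuple


def weighted_avg_power(workers: Iterable[Tuple[int, float]]) -> Optional[float]:
    """Mean per-GPU power across workers, weighted by each worker's num_gpus.

    Each item is (num_gpus, avg_power_w), where avg_power_w is assumed to
    be the worker's per-GPU average.
    """
    total_gpus = 0
    total_power = 0.0
    for num_gpus, avg_power_w in workers:
        total_gpus += num_gpus
        total_power += num_gpus * avg_power_w
    # No workers in this stage: return None so the caller can omit the field.
    return total_power / total_gpus if total_gpus else None
```

For a 4-GPU worker at 800 W and an 8-GPU worker at 500 W this gives (4·800 + 8·500) / 12 = 600 W, whereas an unweighted mean of the two workers would give 650 W.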

Mirror the NVIDIA gb300/srt-slurm measured-power path on the AMD multi-node disaggregated inference path. With no orchestrator perfmon, each SGLang/vLLM disagg node starts its own amd-smi monitor via start_perf_monitor (benchmark_lib.sh), writing perf_samples_<role>_w<idx>_<host>.csv into the NFS-shared /benchmark_logs/perfmon mount; launch_mi355x-amds.sh collects them and exports GPU_METRICS_CSV_GLOB so the existing vendor-agnostic utils/aggregate_power.py produces per-worker + per-stage power.

AMD perfmon wiring:
- benchmark_lib.sh: start_perf_monitor helper; case-insensitive amd-smi header filter; log captured CSV header for schema-mismatch visibility
- amd_utils/job.slurm: PERFMON_OUTPUT_DIR + interval into each container
- amd_utils/server_sglang.sh / server_vllm.sh: per-node role + worker-idx classification (matches each engine's own placement); monitor start + stop on every exit path
- runners/launch_mi355x-amds.sh: collect per-node CSVs immediately after job completion (before result-processing early-exits / EXIT-trap wipe), export GPU_METRICS_CSV_GLOB
- utils/aggregate_power.py: docstring documents the AMD source (logic already vendor-agnostic)
- utils/test_aggregate_power.py: AMD amd-smi multinode tests (per-worker, per-stage J/token, multi-node-per-worker collapse, vLLM topology)
- perf-changelog.yaml: trigger the 6 mi355x disagg sweeps (sglang+vllm)

Also lands the concurrent per-metric telemetry extension in aggregate_power.py / tests: temp/util/mem aggregation, workers[] schema, and flat per-stage scalars (prefill_avg_power_w, decode_avg_power_w, joules_per_input_token, joules_per_output_token_decode).

Verified locally: 107 utils tests pass; bash syntax + shellcheck clean; role mapping + filename contract + full amd-smi->agg pipeline validated; adversarial review findings addressed (CSV collection moved ahead of early exits; case-insensitive amd-smi header).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- _MEM_EXCLUDE_RE now excludes "clock" and "util" (not just "total"), so nvidia-smi's clocks.current.memory (a frequency) and utilization.memory (a percent) are no longer mislabeled as avg_mem_used_mb. (cursor[bot] Medium)
- Remove dead _disagg_stage_energies shim; no callers. (cursor[bot] Low)
- Add regression test: mem detection ignores clock/util memory columns.

108 utils tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Per reviewer: in disagg serving, attribute each token type to only its stage's GPUs — input tokens to prefill GPUs, output tokens to decode GPUs (symmetric). joules_per_output_token is now decode_energy / output_tokens for disagg (was cluster-wide); joules_per_input_token already used prefill energy / input_tokens. joules_per_total_token stays cluster-wide (overall efficiency). Single-node / non-disagg / single-stage keep the cluster-wide output ratio so the field is always populated. Removes the now-redundant joules_per_output_token_decode key (folded into joules_per_output_token). Docstring, CLI help, and tests updated; 108 pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
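The attribution rule in this commit can be sketched as follows. The function name, its parameters, and the choice of input + output tokens as the denominator for joules_per_total_token are assumptions for illustration; only the per-stage ratios are taken from the commit message.

```python
from typing import Dict


def joules_per_token(
    prefill_energy_j: float,
    decode_energy_j: float,
    other_energy_j: float,
    input_tokens: int,
    output_tokens: int,
    disagg: bool,
) -> Dict[str, float]:
    """Compute per-token energy ratios for one benchmark run."""
    total_energy_j = prefill_energy_j + decode_energy_j + other_energy_j
    out = {
        # Cluster-wide in every mode (assumed denominator: all tokens).
        "joules_per_total_token": total_energy_j / (input_tokens + output_tokens),
    }
    if disagg:
        # Each token type is charged only to its own stage's GPUs.
        out["joules_per_input_token"] = prefill_energy_j / input_tokens
        out["joules_per_output_token"] = decode_energy_j / output_tokens
    else:
        # Non-disagg keeps the cluster-wide output ratio so the field
        # is always populated.
        out["joules_per_output_token"] = total_energy_j / output_tokens
    return out
```

With 1000 J of prefill energy, 3000 J of decode energy, 500 input tokens, and 100 output tokens, the disagg output ratio is 30 J/token, while the cluster-wide ratio for the same energy would be 40 J/token.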

The AMD multinode container runs as root and writes benchmark_logs/. If a job is cancelled (e.g. concurrency supersede), its cleanup trap never runs, leaving root-owned dirs. actions/checkout (clean: true) then can't rmdir them (EACCES) and fails BEFORE the job starts — poison-failing every job scheduled onto that runner. Add `sudo rm -rf $GITHUB_WORKSPACE/benchmark_logs` to the shared Slurm-cleanup anchor (runs pre-checkout AND post-run) so a dirty runner self-heals. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A full sweep floods slurmctld, so `squeue` intermittently returns "slurm_load_jobs error: Socket timed out". The old liveness check (`! squeue ... | grep -q $JOB_ID`) treated that empty/failed output as "job died" and exit 1'd — a false failure on a healthy job (observed on dsr1-fp8-mi355x-sglang-disagg conc 1024x2048). Add job_alive(): a non-zero squeue exit is treated as "still alive" (don't false-fail on a scheduler blip); only a SUCCESSFUL squeue that omits the job — re-checked once to avoid a single-sample race — counts as gone. Used by both the wait-for-log loop and the completion poll. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…pot)

The amd-smi monitor only ran `metric -p -c -t -u`, so no VRAM column was emitted and avg_mem_used_mb never populated on AMD. It also used util/mem column matchers tuned for NVIDIA/srt-slurm names, which miss amd-smi's conventions, so avg_util_pct and avg_temp_c silently dropped too.

- benchmark_lib.sh: add `-m` (mem-usage) to the amd-smi command so a used_vram column is captured.
- aggregate_power.py column detection:
  - util: also match amd-smi `gfx_activity` (umc/mm_activity excluded).
  - mem: match positively on memory/vram + "used" instead of broad "mem" minus a growing exclude list. This picks memory.used / mem_used_mb / used_vram while rejecting mem_temperature, mem_voltage, total/free_vram, the memory clock, and utilization.memory.
  - temp: prefer hotspot/junction over the first temp column, since edge temperature reads N/A on data-center AMD parts (MI300/MI355).

NVIDIA and srt-slurm detection is unchanged (verified by existing tests). Adds AMD-header detection tests; full suite 111 passed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
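The positive match for memory columns can be sketched as a simple predicate. `is_mem_used_column` is a hypothetical name, and the PR's actual detection uses regexes whose exact patterns are not shown here; the header strings come from the commit message, with nvidia-smi unit suffixes added as examples.

```python
def is_mem_used_column(header: str) -> bool:
    """True when a CSV header names used memory (not clock, temp, or total)."""
    name = header.lower()
    # Require both a memory/VRAM token and the word "used", rather than
    # matching "mem" broadly and maintaining an exclude list.
    return ("mem" in name or "vram" in name) and "used" in name


headers = [
    "memory.used [MiB]",
    "mem_used_mb",
    "used_vram",
    "mem_temperature",
    "total_vram",
    "clocks.current.memory [MHz]",
    "utilization.memory [%]",
]
accepted = [h for h in headers if is_mem_used_column(h)]
```

Here `accepted` holds only the first three headers. The same headers under a broad "mem" match would also admit the temperature, clock, and utilization columns unless each were excluded by name.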

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Reviewed by Cursor Bugbot for commit d2cca8a. Configure here.}

cursor · 2026-06-05T20:10:15Z

+        sleep 5
+        out=$(squeue -u "$USER" --noheader --format='%i' 2>/dev/null) || return 0
+        grep -qw "$JOB_ID" <<<"$out"
+    }


Squeue failure stalls job wait

Medium Severity

The new job_alive helper treats any failed squeue as proof the job is still running. After the Slurm job has already finished, repeated controller timeouts leave the launcher in the background poll and log-file wait loops indefinitely instead of proceeding to perfmon collection and result processing.
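One possible mitigation, which is not part of this PR, is to cap consecutive squeue failures so that a persistently unreachable controller eventually ends the wait. The sketch below models the liveness logic in Python with an injected `query` callable standing in for squeue; `make_job_alive` and `max_failures` are hypothetical names.

```python
from typing import Callable, Optional


def make_job_alive(
    query: Callable[[], Optional[str]],
    job_id: str,
    max_failures: int = 3,
) -> Callable[[], bool]:
    """Build a liveness check; query() returns squeue output or None on failure."""
    failures = 0

    def listed() -> Optional[bool]:
        nonlocal failures
        out = query()
        if out is None:
            failures += 1  # scheduler blip: count it, decide later
            return None
        failures = 0
        return job_id in out.split()

    def job_alive() -> bool:
        first = listed()
        if first is None:
            # Tolerate a few blips, then stop assuming the job is alive.
            return failures < max_failures
        if first:
            return True
        # A successful query omitted the job: re-check once to avoid a race.
        second = listed()
        if second is None:
            return failures < max_failures
        return second

    return job_alive
```

The trade-off is that a long controller outage during a healthy job would now end the wait early, so the cap would need to be sized against the outages actually observed.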

github-actions · 2026-06-07T02:59:31Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27037497545
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27037497545

arygupt · 2026-06-08T18:58:37Z

@Oseltamivir (or anyone on the Core team, @SemiAnalysisAI/core) — could I get a review on this when you have a cycle? 🙏 It's the gate on the measured-power work landing.

What it is: measured-power + temperature aggregation for multinode runs — turns the disagg energy numbers from modeled into measured by reading per-node perfmon CSVs (srt-slurm on NVIDIA, amd-smi on AMD) and patching the agg JSON. Purely additive telemetry; no production config behavior changes.

Where to focus review:

utils/aggregate_power.py — per-source GPU-id namespacing (a TP16 worker over 2 nodes counts 16 GPUs, not 8), per-worker workers[] with role labels (prefill/decode/agg/frontend), per-stage prefill_avg_power_w/decode_avg_power_w + joules_per_*_token, and per-worker avg_temp_c/peak_temp_c/avg_util_pct/avg_mem_used_mb.
utils/process_result.py — the GPU_METRICS_CSV_GLOB bridge into the Process-result step.
Launchers — launch_mi355x-amds.sh, launch_gb300-cw.sh, amd_utils/*, benchmark_lib.sh: CSV collection + monitoring: wiring.
Tests — test_aggregate_power.py, test_process_result.py cover the new fields.

Validation: the changelog entries (dsv4-gb300 + 6 AMD disagg) are smoke runs that produce the first agg JSONs with avg_power_w + per-worker power/temp populated — that's what the in-flight sweep (run 27037497545) is exercising.

Happy to walk through any of it — thanks!

…nical NVIDIA) Path-B-style run-only branch off main to get CANONICAL NVIDIA dsr1 power+temp on the HEALTHY gb200 runner, sidestepping the wedged gb300-nv fleet. Carries #1574's self-contained consumer code (aggregate_power.py/process_result.py + tests) + a 1-job changelog entry (dsr1-fp4-gb200-dynamo-sglang-powercheck). gb200 launcher perfmon wiring is the prior commit. Matrix verified = exactly 1 gb200 job. DO NOT MERGE (duplicates #1574). Run-only to harvest measured data; close after it lands. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Extends the measured-power campaign (PR #1574) to the NVIDIA gb300-nv runner for DeepSeek-R1. The dsr1 branch in launch_gb300-nv.sh clones SemiAnalysisAI/srt-slurm@feat/inferencex-perfmon (NVIDIA PR #35), recursively injects monitoring: into every recipe (find -type f, not a flat glob), and stages per-node perf_samples_*.csv to $GITHUB_WORKSPACE so the Process-result step's aggregate_power.py patches the agg JSON with measured per-phase board power. Adds dsr1-fp4-gb300-dynamo-sglang-powercheck: a minimal single-job validation (1k/1k, conc 8, 1xP TP4 + 2xD TP4) of the plumbing before the full dsr1-disagg-NVIDIA sweep. Once this lands on a main that already carries #1574's aggregate_power consumer code, the changelog entry triggers exactly 1 gb300 job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ound cleanup timeouts Root cause of the dsr1 NVIDIA measured-power sweeps wedging ~1.5 weeks: the shared pre/post 'Slurm cleanup' anchor ran 'timeout 60 sudo rm -rf $GITHUB_WORKSPACE/benchmark_logs' on EVERY multinode runner. Only the AMD path (launch_mi355x-amds.sh) creates benchmark_logs; NVIDIA launchers never do. On the GB300 login host sudo hangs resolving SSSD policy (nsswitch 'sudoers: files sss') BEFORE exec'ing rm, and 'timeout 60' only sends SIGTERM — which the stuck root sudo ignores — so timeout waits in sigsuspend forever and the step never returns. Live process table (probe 27988714413) showed timeout/sudo pairs for gharunner0/1/2 stuck 4+ DAYS. The command exists only on feature branches (added 475ce8a, PR #1574); not on main — which is why only this campaign's branch wedged. Fix: (1) run the privileged benchmark_logs cleanup only on AMD (case mi*) runners, with 'timeout --kill-after=5s 60s sudo -n' so a hung sudo is force-KILLed and never prompts; (2) harden every Slurm call with 'timeout --kill-after=5s 30s' (TERM then KILL) and break the drain loop explicitly on squeue timeout/failure, so no cleanup command can block indefinitely. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
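The TERM-then-KILL bound that `timeout --kill-after` provides can be modelled in Python for illustration. This is a sketch of the escalation pattern only, with `run_bounded` as a hypothetical helper; the workflow itself uses the coreutils `timeout` command, not this code.

```python
import subprocess
from typing import Sequence


def run_bounded(cmd: Sequence[str], timeout_s: float, kill_after_s: float) -> int:
    """Run cmd; send SIGTERM at timeout_s, then SIGKILL kill_after_s later."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.terminate()  # polite request: the child may ignore SIGTERM
    try:
        return proc.wait(timeout=kill_after_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored by the child
        return proc.wait()
```

Without the second stage, a child that ignores SIGTERM keeps the caller waiting, which matches the hang described above for the bare `timeout 60` invocation.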

…bound Slurm calls

Prevents the NVIDIA wedge from re-entering main when this PR lands. The shared cleanup ran 'sudo rm -rf benchmark_logs' on every multinode runner, but only the AMD path (launch_mi355x-amds.sh) creates that directory. On GB300 login hosts, sudo hangs resolving SSSD policy before exec'ing rm, and the bare 'sudo rm' (no timeout) hangs forever. The fix scopes the cleanup to AMD (case mi*) with 'timeout --kill-after=5s 60s sudo -n', and hard-bounds every Slurm call (TERM-then-KILL) with an explicit loop break on squeue timeout/failure.

Mirrors the fix validated live on PR #1791 (run 27989491427 cleared cleanup on gb300-nv_1; probe 27988714413 showed 4-day-stuck timeout/sudo pairs).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
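The "explicit loop break on squeue timeout/failure" can be sketched as a bounded drain loop. This is an illustrative Python version under stated assumptions, not the workflow's shell loop: `drain_jobs` and `poll_queue` are hypothetical names, and `poll_queue` stands in for a bounded squeue call that returns the number of queued jobs or raises on timeout or error.

```python
import time


def drain_jobs(poll_queue, max_polls=20, interval_s=0.0, sleep=time.sleep):
    """Poll until the queue is empty, stopping early if the poll itself fails.

    poll_queue is a hypothetical callable returning the number of queued jobs,
    or raising an exception when the underlying squeue call times out or errors.
    """
    for _ in range(max_polls):
        try:
            remaining = poll_queue()
        except Exception as exc:
            # Explicit break: a failed poll must not be treated as "jobs still running".
            return f"poll-failed: {exc}"
        if remaining == 0:
            return "drained"
        sleep(interval_s)
    return "gave-up"
```

The loop has three exits (drained, poll failure, poll budget exhausted), so it cannot spin forever when the scheduler is unreachable.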



+# Back-compat shim — some external callers may have imported _parse_power.
+_parse_power = _parse_numeric_cell


arygupt requested a review from a team May 27, 2026 19:23

github-project-automation Bot added this to InferenceMAX Board May 27, 2026

arygupt added the sweep-enabled label May 27, 2026

functionstackx changed the title ~~feat(power): multinode measured-power aggregation~~ feat(power): multinode measured-montiroring aggregation May 27, 2026

claude Bot reviewed May 27, 2026

arygupt added sweep-enabled and removed sweep-enabled labels May 28, 2026

arygupt changed the title ~~feat(power): multinode measured-montiroring aggregation~~ feat(power): multinode measured-power aggregation May 28, 2026

arygupt force-pushed the feat/measured-power-multinode branch from 3caf593 to 8d30341 Compare May 28, 2026 01:01

cursor Bot reviewed May 28, 2026

Comment thread utils/aggregate_power.py

arygupt mentioned this pull request May 28, 2026

feat(power): per-worker prefill/decode power + role-split joules (stacked on #1574) #1577 (Open, 3 tasks)

arygupt added sweep-enabled and removed sweep-enabled labels May 28, 2026

arygupt force-pushed the feat/measured-power-multinode branch from f5b5c77 to 1af17ab Compare May 28, 2026 18:11

github-code-quality Bot found potential problems May 28, 2026

Comment thread utils/aggregate_power.py Fixed

cursor Bot reviewed May 28, 2026

Comment thread utils/aggregate_power.py Outdated

Comment thread utils/aggregate_power.py Outdated

arygupt enabled auto-merge (squash) May 28, 2026 21:33

github-code-quality Bot found potential problems May 28, 2026

Comment thread utils/aggregate_power.py Fixed

cursor Bot reviewed May 28, 2026

Comment thread utils/aggregate_power.py

arygupt and others added 8 commits June 5, 2026 13:06

arygupt force-pushed the feat/measured-power-multinode branch from 6849229 to d2cca8a Compare June 5, 2026 20:08

cursor Bot reviewed Jun 5, 2026

arygupt mentioned this pull request Jun 8, 2026

[DO NOT MERGE] Run-only: gb300 dsr1 measured power+temp validation #1686 (Closed)

arygupt requested a review from Oseltamivir June 8, 2026 18:58

arygupt mentioned this pull request Jun 15, 2026

[DO NOT MERGE] Run-only: gb200 dsr1 measured power+temp (canonical NVIDIA) #1791 (Open)

arygupt mentioned this pull request Jun 22, 2026

fix(ci): bound multinode pre-run Slurm cleanup drain loop (unblocks NVIDIA sweeps) #1820 (Open)

github-code-quality Bot found potential problems Jun 22, 2026

Comment thread utils/aggregate_power.py

# Back-compat shim: some external callers may have imported _parse_power.
_parse_power = _parse_numeric_cell




Conversation

arygupt commented May 27, 2026 • edited by cursor Bot

Summary

Pipeline

Files

Test plan

Depends on

github-actions Bot commented May 27, 2026

claude Bot May 27, 2026

The contract violation

Step-by-step proof on a persistent self-hosted runner

Why the bench-window timestamp filter doesn't always save us

What the test misses

Fix

github-actions Bot commented May 28, 2026

github-actions Bot commented May 28, 2026

github-actions Bot commented May 28, 2026

github-actions Bot commented May 28, 2026

cursor Bot left a comment

cursor Bot Jun 5, 2026

Squeue failure stalls job wait

github-actions Bot commented Jun 7, 2026

arygupt commented Jun 8, 2026
