chore: agentx v0.3 #1571
…loadingConnector
vLLM's --kv_offloading_backend native resolves to two different connectors
based on the VLLM_USE_SIMPLE_KV_OFFLOAD env var (see vllm/config/vllm.py:662):
VLLM_USE_SIMPLE_KV_OFFLOAD=1 -> SimpleCPUOffloadConnector (the path
we were using; carries the popleft_n
+ context-overflow + completion-barrier
bugs we hit on B200/B300/H200)
unset (default) -> OffloadingConnector (the regular
native path)
This commit drops the env var and the JSON form, switching MI355X to the
shortcut form which now routes to OffloadingConnector. We're trying the
regular path here to see if it sidesteps the SimpleCPUOffloadConnector-
specific issues that have been forcing lazy_offload + workarounds.
Also drops the --kv-transfer-config JSON since the shortcut form constructs
the KVTransferConfig itself at engine startup. Keeps
--disable-hybrid-kv-cache-manager since MI355X uses --block-size=1 + AITER
which doesn't play well with the hybrid manager.
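The routing described in this commit message can be summarized as a small lookup. This is only an illustrative sketch of the mapping stated above, not vLLM's own code; the function name `resolve_native_connector` is hypothetical.

```bash
# Hypothetical helper: mirrors the env-var-to-connector mapping described
# in the commit message for `--kv_offloading_backend native`.
resolve_native_connector() {
    if [ "${VLLM_USE_SIMPLE_KV_OFFLOAD:-}" = "1" ]; then
        echo "SimpleCPUOffloadConnector"
    else
        # Unset (default) case: the regular native path.
        echo "OffloadingConnector"
    fi
}
```

With the env var dropped from the launcher, the sketch returns `OffloadingConnector`, which is the path this commit switches MI355X to.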
Test SimpleCPUOffloadConnector lazy_offload behavior on a newer vLLM than
the default v0.20.0-cu130.
Image: cquil/vllm-openai:v0.21.0-8813c92.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the dsv4-fp4-b200-vllm-agentic CONC sweep (tp8 [16,32,64] + tp8
dp-attn [64,128,256]) so the two SKUs can be compared on the same trace
load. Uses the same SGLang image as the fixed-seq-len sibling
(rocm/sgl-dev:rocm720-mi35x-0363e6c-20260509-DSv4).
Offload sweep is none-only (SGLang has no equivalent of vLLM's
SimpleCPUOffloadConnector that we exercise on b200).
Launcher swaps the fixed-seq-len harness (run_benchmark_serving) for the
agentic harness (build_replay_cmd / write_agentic_result_json /
analyze_benchmark_distributions) but keeps all SGLang server flags and
SGLANG_* env vars identical to the fixed-seq-len sibling.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R2 dispatch failed on all 6 b200 shards with the same enroot error during
manifest fetch:
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[ERROR] Could not process JSON input
curl: (23) Failure writing output to destination
Docker Hub confirms the image exists with a clean Docker v2 manifest, but
enroot import was being invoked as
`docker://docker.io/cquil/vllm-openai:...` because the image field had
the docker.io/ prefix. Every other image entry in the repo uses the bare
`org/repo:tag` form (no docker.io/ prefix), so this entry was the
outlier. Dropping the prefix matches convention and should let enroot
resolve the registry host normally.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
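A minimal sketch of the normalization this commit applies by hand, assuming the only deviation from the repo convention is a leading `docker.io/`. The helper name `normalize_image` is hypothetical and is not part of the repo.

```bash
# Hypothetical helper: strip a leading "docker.io/" so the image reference
# matches the bare org/repo:tag form used by the other config entries.
normalize_image() {
    # ${1#docker.io/} removes the prefix only when it is present.
    printf '%s\n' "${1#docker.io/}"
}

normalize_image "docker.io/cquil/vllm-openai:v0.21.0-8813c92"
normalize_image "cquil/vllm-openai:v0.21.0-8813c92"
```

Both calls print the same bare reference, so the helper is safe to apply to entries that already follow the convention.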
First multi-node agentic config with the recipe local to this repo. Adds:
- Two new agentic recipes under benchmarks/multi_node/srt-slurm-recipes/
vllm/deepseek-v4/agentic/, adapted from the corresponding 8k1k fixed-
seq-len siblings:
* disagg-gb300-1p6d-dep4-tp4-agentic.yaml (low-lat conc=32, mid conc=192)
* disagg-gb300-4p1d-dep4-dep8-24-c4096-agentic.yaml (high-tput conc=4096)
Both drop max-model-len, drop no-enable-prefix-caching, add DSv4
tool/reasoning parsers, switch benchmark.type sa-bench -> custom (hands
off to benchmarks/multi_node/agentic_srt.sh which builds the aiperf
inferencex-agentx-mvp invocation).
- New IS_AGENTIC=1 branch at the top of runners/launch_gb300-nv.sh's
framework conditional. Clones the cquil11/srt-slurm-nv fork (the only
srt-slurm build that supports benchmark.type=custom) on the
cam/sa-submission-q2-2026 branch and overlays the local agentic
recipes into recipes/vllm/deepseek-v4/agentic/ so iteration stays in
this repo.
- New dsv4-fp4-gb300-dynamo-vllm-agentic config entry in
nvidia-master.yaml as a sibling of the byte-identical-to-origin/main
dsv4-fp4-gb300-dynamo-vllm base. Three-tier sweep:
* low-latency (conc=32, 1p6d shape, 28 GPUs / 8 nodes)
* mid (conc=192, 1p6d shape, same alloc as low-lat)
* high-tput (conc=4096, 4p1d shape, 24 GPUs / 7 nodes)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R1 of dsv4-fp4-gb300-dynamo-vllm-agentic failed at `srtctl apply` with
two schema errors against the cquil11/srt-slurm-nv fork:
Invalid config: {'dynamo': {'wheel': ['Unknown field.']},
'benchmark': {'env': {'PORT': {'value': ['Not a valid string.']}}}}
The first (dynamo.wheel) is fixed by cherry-picking commit 0060f857 from
NVIDIA upstream onto cquil11/srt-slurm-nv@cam/sa-submission-q2-2026
(adds wheel field + install scripts; pushed separately).
The second (PORT) is fixed here: env values must be strings, so
`PORT: 8000` -> `PORT: "8000"`. INFMAX_CONTAINER_WORKSPACE / RESULT_DIR
parse as strings due to their / chars, and IS_MULTINODE was already
quoted; PORT was the only bare int.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R2 of dsv4-fp4-gb300-dynamo-vllm-agentic landed all 3 shards on
gb300-cw_N runners (CoreWeave self-hosted runners advertise both
gb300-cw AND gb300-nv labels). `${RUNNER_NAME%%_*}` resolves to gb300-cw,
which routes to runners/launch_gb300-cw.sh — but that launcher had
no IS_AGENTIC handling, so it cloned upstream NVIDIA/srt-slurm
(which lacks benchmark.type=custom) instead of the cquil11 fork.
srtctl apply then failed:
Invalid config: {'benchmark': {'command': ['Unknown field.'],
'env': ['Unknown field.']}}
Mirrors the IS_AGENTIC=1 branch I added earlier to launch_gb300-nv.sh:
use cquil11/srt-slurm-nv@cam/sa-submission-q2-2026 (now patched with
dynamo.wheel support via cherry-picked NVIDIA commit 0060f857) and
overlay our local agentic recipes from
benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/agentic/.
Both gb300-nv and gb300-cw launchers now handle IS_AGENTIC identically,
so the workload runs correctly regardless of which runner picks it up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
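The runner-to-launcher routing that caused this can be sketched as below. It assumes the launcher path is derived purely from the runner-name prefix, as the commit message describes; the function name `launcher_for_runner` is hypothetical.

```bash
# Hypothetical helper: derive the launcher script from a runner name such
# as "gb300-cw_3". "%%_*" drops everything from the first underscore on.
launcher_for_runner() {
    runner_name="$1"
    prefix="${runner_name%%_*}"
    printf 'runners/launch_%s.sh\n' "$prefix"
}

launcher_for_runner "gb300-cw_3"
launcher_for_runner "gb300-nv_0"
```

Because the prefix comes from the runner that happens to pick up the job, a runner advertising both labels still selects exactly one launcher, which is why both launchers need the same IS_AGENTIC handling.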
Upstream NVIDIA/srt-slurm@main has caught up on every schema feature
the agentic path needs:
- BenchmarkType.CUSTOM + benchmark.command + benchmark.env (the
hook that hands off to benchmarks/multi_node/agentic_srt.sh)
- DynamoConfig.wheel (so our vllm recipes can pin the same
ai-dynamo wheel as the fixed-seq-len path)
- default_bash_preamble (no more "Unknown field" warning)
So we don't need the cquil11/srt-slurm-nv fork anymore. Pin to
upstream commit 127597c0e6d3 (current HEAD) for reproducibility;
bump as upstream evolves.
Also fix: `uv venv` defaults to no-pip. The upstream
prefetch-ai-dynamo-wheel.sh script (called by srtctl when a recipe
has `dynamo.wheel` set) does `python3 -m pip download`, which fails
with "No module named pip" without a seeded venv. Adding --seed
installs pip+setuptools+wheel into the venv so the prefetch path
works. R4 of dsv4-fp4-gb300-dynamo-vllm-agentic showed this error
on the gb300-cw runner immediately after the lockfile cleanup
unblocked the import_squash step.
Both gb300-cw and gb300-nv launchers updated identically.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R5 first-shard failure on gb300-nv runner:
fatal: reference is not a tree: 127597c0e6d3c1b3ffd7ac02dd0fea2d2fd62f74
I extrapolated the 40-char SHA from a 7-char short `127597c` shown in
git log output instead of resolving it. The real SHA is
127597c2926467db06e6707e0aa9227261c6c02a (NVIDIA/srt-slurm@main,
"Update GB300 FP8 GLM-5 recipe (#160)").
R5's gb300-cw shards didn't immediately fail on the same error: either
they hadn't reached the checkout step yet when I noticed, or their git
was more lenient about the prefix-then-garbage SHA. Either way, the
fixed SHA works for both.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… launcher
Two issues caught in R5:
1) dynamo-vllm worker rejects chat parser flags
The worker entrypoint (different argparser than `vllm serve`) errors:
__main__.py: error: unrecognized arguments: --enable-auto-tool-choice
--tool-call-parser deepseek_v4
These belong on the dynamo frontend, not the worker. In disagg, chat
parsing happens at the frontend; workers just take tokens. The 8k1k
sibling recipes (which work) don't set these either. I mistakenly
ported them from the single-node launchers, which run `vllm serve`
directly (the chat-serving entrypoint).
Drop --tool-call-parser, --enable-auto-tool-choice, --reasoning-parser
from both prefill and decode blocks in both agentic recipes. Keep
--tokenizer-mode deepseek_v4 (worker DOES accept that one).
2) launch_gb300-cw.sh was missing set -e
The fabricated SHA bug from the prior commit only surfaced on the nv
launcher (which has set -exo pipefail). The cw launcher silently
swallowed the failed `git checkout` and proceeded on origin/HEAD —
which happened to be the right commit, masking the bug. Add
`set -exo pipefail` to match the nv launcher; loud failures are
safer than silent ones.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
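The masking effect described in item 2 is easy to reproduce in isolation. In this sketch `false` stands in for the failed `git checkout`, and only `set -e` is shown because it is the portable part of the launcher's `set -exo pipefail`; the function names are hypothetical.

```bash
# Without errexit: the failing step is ignored and the script continues.
run_lenient() {
    sh -c 'false; echo "continued after failure"'
}

# With errexit: the script stops at the first failing command.
run_strict() {
    sh -c 'set -e; false; echo "continued after failure"'
}

run_lenient                       # prints the message and exits 0
run_strict || echo "strict run failed loudly"
```

The lenient variant reports success even though a step failed, which matches how the cw launcher proceeded on origin/HEAD after the bad checkout.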
R6 surfaced via srtctl preflight that /scratch/models/DeepSeek-V4-Pro is
not staged on the gb300-nv cluster:
Error: Preflight failed for ...disagg-gb300-1p6d-dep4-tp4-agentic.yaml:
- model.path: Model alias 'deepseek-v4-pro' resolved to
'/scratch/models/DeepSeek-V4-Pro', but that path is unavailable.
DSR1 weights ARE staged on /scratch (node-local SSD), but DSv4-Pro was
never staged there. The 806 GB DSv4-Pro checkpoint lives at
/home/sa-shared/models/DeepSeek-V4-Pro (NFS, shared across nodes).
This silently broke the existing 8k1k fixed-seq-len path for dsv4-vllm
on gb300-nv too (just hadn't been exercised against the stricter
upstream srtctl preflight). Fix is single-file: re-point the DSv4 leg
of the per-model conditional to the NFS path.
NFS is slower than /scratch but that's where the model actually lives.
Stage to /scratch and switch back if model load I/O becomes a bottleneck.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…S ELOOP
R7 of dsv4-fp4-gb300-dynamo-vllm-agentic:
Fatal error: Symlink loop from '/home/sa-shared/models/DeepSeek-V4-Pro'
OSError: [Errno 40] Too many levels of symbolic links
Same Vast NFS ELOOP bug we hit on the squash lockfiles in R3/R4: the
/home/sa-shared/ NFS mount returns ELOOP to workflow worker processes
(specifically those spawned through GHA runner pod -> sbatch ->
pyxis/enroot), even though the same path is a regular directory from
interactive sessions (verified via gb300-slurm + srun on c001; both
Path.resolve() and ls succeed cleanly).
Workaround: /data/ and /home/sa-shared/ are SEPARATE mount points backed
by the SAME storage (storage-vip.vast.p03.globalai.run, with /scratch
and /scratch/home/sa-shared as the server-side paths). Switching
MODEL_PATH to /data/home/sa-shared/models/DeepSeek-V4-Pro gives us
identical files with a separate NFS client cache, which isn't poisoned
in the workflow context.
Doesn't fix the underlying Vast NFS bug; it just routes around it.
Long-term: stage DSv4-Pro to /scratch/models/ (node-local SSD) like
DSR1, both for performance and to bypass this whole mount class.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R7 of dsv4-fp4-gb300-dynamo-vllm-agentic had 6/8 worker srun steps
OOM-killed within 30s, with
`torch.AcceleratorError: CUDA-capable device(s) is/are busy or unavailable`
(CUDA init aborts when SIGKILL races it). sacct showed each worker step
got AllocTRES mem=4G (empirically verified on CW: default sbatch w/
--gres=gpu:4 -> AllocTRES mem=4G; same sbatch w/ --mem=0 -> AllocTRES
mem=868G).
Root cause: srt-slurm's start_srun_process doesn't pass --mem on the
container srun, so it gets cpus_per_task × DefMemPerCPU = 4 GB by
default on clusters with positive DefMemPerCPU (CW gb300 has 4096).
4 GB is wildly insufficient for a vLLM worker mmap'ing multi-GB model
weights and pinning CUDA buffers.
Fix: re-point both gb300 launchers' IS_AGENTIC clone from upstream
NVIDIA/srt-slurm@127597c to cquil11/srt-slurm-nv@cam/agentic-mem-0
(96c443a), which is the same upstream commit + a single patch adding
`--mem 0` to start_srun_process when container_image is set.
Long-term: PR the --mem=0 change upstream so we can drop the fork
indirection for this feature class.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R9 hit the same Vast NFS ELOOP we fixed for the model path in R8, but
this time on the squash lockfile:
/usr/bin/bash: line 2: /home/sa-shared/gharunners/squash/<image>.sqsh.lock:
Too many levels of symbolic links
The /home/sa-shared/ NFS mount poisons lockfiles AND data files alike
under the workflow worker NFS session. We applied the /data/ workaround
for MODEL_PATH; now do the same for SQUASH_FILE + NGINX_SQUASH_FILE
which were still pointing at the bad mount. Both /home/sa-shared/
and /data/ are mounted from the same Vast backing storage; same files,
separate NFS client cache.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
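The path rewrite applied to MODEL_PATH, SQUASH_FILE and NGINX_SQUASH_FILE follows one rule: prefix `/data` onto anything under `/home/sa-shared/`. A minimal sketch of that rule, with the hypothetical helper name `to_data_mount`:

```bash
# Hypothetical helper: route /home/sa-shared/... paths through the /data
# mount, which exposes the same files via a separate NFS client cache.
to_data_mount() {
    case "$1" in
        /home/sa-shared/*) printf '/data%s\n' "$1" ;;
        *)                 printf '%s\n' "$1" ;;   # leave other paths alone
    esac
}

to_data_mount "/home/sa-shared/models/DeepSeek-V4-Pro"
```

Paths outside `/home/sa-shared/` (for example anything on `/scratch`) pass through unchanged, so the helper can be applied uniformly.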
Earlier I patched srt-slurm's start_srun_process to default --mem=0 on
container srun. That's the wrong layer — srtctl has a documented
top-level recipe field `srun_options:` (see docs/config-reference.md#srun_options)
that gets threaded straight through to the worker srun via
mixins/worker_stage.py:235 (`srun_options=self.runtime.srun_options`)
and start_srun_process line 248 (`for key, value in srun_options.items()`).
Switch to that mechanism:
- Add `srun_options: {mem: "0"}` to both agentic recipes
- Revert both launchers from the cquil11 fork pin back to upstream
NVIDIA/srt-slurm@127597c (the fork patch in cam/agentic-mem-0 is
now redundant; leaving the branch around as a fallback but not
pinned in the launcher)
R9/R10 confirmed sacct still showed mem=4G per worker step despite the
launcher cloning the patched fork — likely because srtctl's uv-sync
inside the sbatch rebuilds the venv from pyproject.toml and the
editable install from src/ doesn't include code modifications the way
uv pip install -e . would. The recipe-level mechanism doesn't depend
on patching srtctl at all so this whole class of "is the patch
loaded?" question goes away.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R11 verified that srun_options.mem=0 IS now in the worker srun
cmdline (confirmed via /proc/<pid>/cmdline on the head node).
BUT sacct still showed AllocTRES mem=4G per step.
Why: the sbatch only requested `--ntasks=8` with no `--mem`, so the
JOB allocation per node is bound to cpus_per_task × DefMemPerCPU =
1 × 4 GB = 4 GB. `--mem=0` on srun means "use ALL of what the JOB
has on this node" — and the job has 4 GB. There's nothing to grow
into.
The other half of the fix is `sbatch_directives.mem=0` which emits
`#SBATCH --mem=0` in the generated sbatch script (per
src/srtctl/templates/job_script_minimal.j2:26), making SLURM
allocate all available node memory (~868 GB on CW gb300) up front.
Both layers needed:
- sbatch_directives.mem=0 → JOB gets full node memory
- srun_options.mem=0 → each container srun step uses it
(without this, srun defaults back to
cpus_per_task × DefMemPerCPU = 4 GB)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
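The interaction of the two layers can be modelled with a few lines of arithmetic. This is a simplified model of the behaviour described above, not SLURM's actual allocation logic; the function names are hypothetical and all values are in MB (868 GB is taken as 868 × 1024 = 888832 MB).

```bash
# Default memory when no --mem is given: cpus_per_task x DefMemPerCPU.
default_mem_mb() {
    echo $(( $1 * $2 ))
}

# Simplified model of what a step ends up with.
#   $1 = memory held by the job on this node
#   $2 = step request: "0" means "all the job has", "" means "not set"
#   $3 = per-step default from default_mem_mb
effective_step_mem_mb() {
    case "$2" in
        0)  echo "$1" ;;   # srun --mem=0: use everything the job holds
        "") echo "$3" ;;   # no srun --mem: fall back to the default
        *)  echo "$2" ;;   # explicit request
    esac
}

def=$(default_mem_mb 1 4096)
effective_step_mem_mb "$def" 0 "$def"     # R11: job only has 4 GB
effective_step_mem_mb 888832 "" "$def"    # sbatch fixed, srun not
effective_step_mem_mb 888832 0 "$def"     # both layers set
```

Only the last combination gives the step the full node memory, which is the "both layers needed" conclusion of the commit.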
…ation)
R12 progressed past the memory layer (sbatch_directives.mem=0 from prior
commit worked; sacct showed AllocTRES mem=868G per worker), but failed
~10 min in with etcd lease-keepalive `deadline exceeded` errors followed
by every worker SIGKILL'd at 16:36:03.
Root cause from infra.out: etcd reported `max-cpu-set: 1` at startup.
SLURM's default cpus_per_task=1 starved single-CPU etcd under load from
24 concurrent dynamo DP rank lease keep-alives (16 prefill + 8 decode).
etcd's gRPC handler couldn't process RPCs fast enough → cascading lease
deadline exceeded → workers crashed → orchestrator cancelled job →
infra step itself SIGKILL'd at 16:35:49 ("STEP 4572.2 ON
slurm-gb300-138-249 CANCELLED ... DUE to SIGNAL Killed").
Fix: sbatch_directives.cpus-per-task=72 grants every task (including
the GPU-less infra step) one CW gb300 NUMA socket. etcd now has
plenty of compute; vLLM workers also get more aux CPU for tokenizer
threads etc.
Why cw needs this and nv doesn't: nv cluster's JobDefaults includes
DefCpuPerGPU=35 → any task with --gres=gpu:N auto-gets 35*N CPUs (=
140 on a 4-GPU task). cw has no per-GPU default → tasks get
cpus_per_task=1 by default. The infra step has no --gres flag at all
so it's the worst case on cw.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
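The difference between the two clusters comes down to which default applies. A rough sketch of the rule as described in this commit, where a `DefCpuPerGPU` of 0 stands for "no per-GPU default"; the function name `default_task_cpus` is hypothetical.

```bash
# Simplified model: a task gets gpus x DefCpuPerGPU CPUs when both are
# positive, otherwise it falls back to cpus_per_task=1.
default_task_cpus() {
    gpus="$1"
    def_cpu_per_gpu="$2"
    if [ "$gpus" -gt 0 ] && [ "$def_cpu_per_gpu" -gt 0 ]; then
        echo $(( gpus * def_cpu_per_gpu ))
    else
        echo 1
    fi
}

default_task_cpus 4 35   # nv, 4-GPU task
default_task_cpus 4 0    # cw, 4-GPU task, no per-GPU default
default_task_cpus 0 0    # cw infra step, no --gres at all
```

The last case is the etcd infra step, which is why an explicit `cpus-per-task` directive is needed on cw.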
Two changes:
1) Pin to NVIDIA cluster (drop CW)
The dsv4-fp4-gb300-dynamo-vllm-agentic runner field was `gb300`, which
is the generic label both NV and CW runner pools advertise (per gh api
runners). So shards landed on either cluster, which meant we kept
debugging the same recipe path against two different cluster configs
(NV's DefCpuPerGPU=35 vs CW's DefMemPerCPU=4096 with no per-GPU
defaults). Switch to `runner: gb300-nv`, a label only the NV pool
advertises. This matches just gb300-nv_0/1/2 going forward.
2) MODEL_PATH switched to /scratch/models/DeepSeek-V4-Pro
The node-local SSD on NV compute nodes. Faster than the
/data/home/sa-shared NFS path (where DSv4-Pro currently lives).
Caveat: /scratch doesn't exist on the GHA runner pod, so srtctl
preflight may fail with "Model alias resolved to ..., but that path is
unavailable." We're trying this anyway to see whether the runner pod has
/scratch mounted; if it errors, next step is to either (a) patch
srt-slurm to add a `skip_model_preflight` recipe field or (b) stub a
symlink on the runner pod.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The agentic recipe pins MODEL_PATH=/scratch/models/DeepSeek-V4-Pro
(node-local NVMe on compute nodes). srtctl's _preflight_model runs
in-process on whatever node invokes srtctl (the GHA runner pod, which
doesn't have /scratch mounted), so it bails before sbatch with "Model
alias 'deepseek-v4-pro' resolved to '/scratch/...', but that path is
unavailable" (R14 hit this).
Switch the IS_AGENTIC=1 clone target from NVIDIA/srt-slurm@127597c to
cquil11/srt-slurm-nv@cam/no-preflight-flag (854b3fd), which adds one CLI
flag, `srtctl apply --no-preflight`, that skips just the optional
Python-level FS precheck. vLLM still fails loudly at runtime if the path
is genuinely missing on the compute node.
The flag is only passed when IS_AGENTIC=1. Fixed-seq-len recipes resolve
model.path to an NFS path visible from the runner pod, where the
precheck is a useful sanity guard, so leave enforcement on for them.
Fork commit: cquil11/srt-slurm-nv@854b3fd
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Aiperf's content-addressed mmap dataset cache (~65 GB per dataset)
needs to be persisted across runs so the first run of the day doesn't
re-tokenize + re-write it on every shard. Same pattern as
launch_h200-dgxc-slurm.sh, launch_b200-dgxc.sh, launch_mi355x-amds.sh.
Three layers wired:
1) Host paths (cluster-specific, created with 0777 so all gharunner_X
SLURM users can write):
gb300-nv /data/home/sa-shared/gharunners/ai-perf-cache
gb300-cw /mnt/vast/ai-perf-cache
2) Both launchers export AIPERF_MMAP_CACHE_HOST_PATH and add a line to
the generated srtslurm.yaml's default_mounts block — srt-slurm's
runtime.py reads default_mounts via get_srtslurm_setting() and
bind-mounts each entry into every worker container. cw already had
a default_mounts block (for dynamo-wheels-cache); nv had none.
3) Both agentic recipes set AIPERF_DATASET_MMAP_CACHE_DIR=/aiperf_mmap_cache
in benchmark.env so the aiperf process inside the container reads
from the persistent mount instead of ~/.cache/aiperf/dataset_mmap.
Single-node launchers don't need updating — they have their own srun
--container-mounts line that already bind-mounts the cache.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Brings in 45 commits from upstream/ajc/inferencex-agentx-mvp (PR #875):
- InferenceX AgentX-MVP scenario (default corpus switched to 051226
no-subagents 949-trace variant)
- semianalysis_cc_traces_weka_no_subagents HF loader
- Wrap-fill trajectory recycling + correlation-id double-recycle guard
- DAG benchmarks, reproducible payload replay, agentic_replay E2E test
- assorted dataset/timing fixes
Local commits preserved (no rebase). One docstring-only conflict in
src/aiperf/dataset/loader/semianalysis_cc_traces_weka.py resolved by
taking upstream's text (more comprehensive; it documents both 042026 and
051226 variants).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
vllm/vllm-openai:v0.21.0-ubuntu2404 ships without git, but pip's
editable install (-e) of utils/aiperf invokes `git version` to record
direct_url.json provenance. Without git, every R16 shard on both
gb300-nv and gb300-cw failed at:
+ python3 -m pip install --break-system-packages -q --ignore-installed -e /infmax-workspace/utils/aiperf
ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git version
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?
This happens AFTER server boot is healthy and "Server is healthy - starting
benchmark" has fired, so all the upstream cluster/recipe work (preflight,
mem=0 x2 layers, etcd cpus-per-task=72, --no-preflight, /scratch model
path, NixlConnector P<->D, model load) is working end-to-end. Only the
pip install step is blocked.
Fix: prepend a `command -v git || apt-get update && apt-get install -y git`
to install_agentic_deps. Cheap no-op on images that already ship git
(AMD images, custom containers). The vLLM image's apt is functional from
inside the container so this works without container rebuild.
The -e install was introduced yesterday in e92a9bf (aiperf v0.2
migration); previously the agentic flow used kv-cache-tester which
didn't need git.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
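One detail worth noting about the one-liner quoted above: in POSIX shells `&&` and `||` have equal precedence and associate left to right, so `command -v git || apt-get update && apt-get install -y git` groups as `(command -v git || apt-get update) && apt-get install -y git`, and the install step still runs when git is already present. A sketch of an explicitly guarded variant follows; the helper name `ensure_tool` is hypothetical and this is not the code in install_agentic_deps.

```bash
# Hypothetical helper: run the installer command only when the tool is
# missing. An explicit `if` avoids the `a || b && c` grouping pitfall.
ensure_tool() {
    tool="$1"
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        return 0        # already present: installer is not run
    fi
    "$@"                # caller-supplied installer command
}

# Intended use inside the container (not executed here):
#   ensure_tool git sh -c 'apt-get update && apt-get install -y git'
```

On images that already ship git the helper returns immediately without touching apt.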
…t containers
R17 surfaced two distinct failures, one per cluster:
1) gb300-cw (all 3 shards): aiperf rejected --public-dataset
semianalysis_cc_traces_weka with "Scenario invariants violated ...
required loader=any of ['semianalysis_cc_traces_weka_no_subagents',
'weka_trace']". Yesterday's aiperf merge (PR #875 commit fef78a96)
switched the inferencex-agentx-mvp scenario's default corpus to the
051226 no-subagents 949-trace variant and tightened the loader contract.
The old name is no longer accepted.
Fix: resolve_trace_source emits --public-dataset
semianalysis_cc_traces_weka_no_subagents.
2) gb300-nv (all 3 shards): "dpkg: error: requested operation requires
superuser privilege" from yesterday's install_agentic_deps git install
path. The gb300-nv pyxis/enroot setup maps the calling user (sa-shared)
into the container as non-root, while gb300-cw runs as root. The git
install needs sudo on nv; cw is fine without.
Fix: branch on `id -u`: apt-get directly when root, sudo apt-get
otherwise. The vllm-base layer installs `sudo` so the binary is
available, and the typical enroot config grants the calling user
passwordless sudo.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R17/R18 made it clear that there's no clean way to install git into the
vllm/vllm-openai container at run-time on gb300-nv:
- R16/R17: container ships without git -> pip's editable install of
aiperf fails with "Cannot find command 'git'"
- R18: tried `sudo apt-get install git`. gb300-nv pyxis/enroot remaps
the calling user to uid=345200007 inside the container, and sudo
refuses to run with "/usr/bin/sudo must be owned by uid 0 and have
the setuid bit set" -- the setuid bit can't carry across user
namespaces. cw container runs as root so sudo wasn't tripped there,
but the right answer is one that works on both clusters.
The actual fix is upstream from this entirely: drop `-e`. pip's editable
install needs git only to record direct_url.json provenance; the
non-editable install just builds a wheel via hatchling and copies into
site-packages. aiperf's pyproject.toml pins version="0.8.0" rather than
deriving it from git tags, so non-editable install works without git in
any environment. We don't edit aiperf source mid-benchmark anyway --
loss of -e ergonomics is zero.
`--ignore-installed` is still needed (handles the apt-managed-blinker
distutils-uninstall pile-up) and is orthogonal to -e.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the sudo/root-detection complexity from R18 and restore -e on the
aiperf pip install. Per user direction.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The vllm/vllm-openai container ships without git; agentic_srt.sh needs
to apt-get install it because pip's install of utils/aiperf calls
`git version`. R17/R18/R19/R20 chased this on gb300-nv with various
combinations of sudo / no-sudo / drop-e / etc., all failing because
pyxis maps the calling user to uid 345200007 inside the container and
dpkg's hardcoded geteuid()!=0 check rejects every attempt regardless of
filesystem permissions.
The cleanest fix is to ask pyxis to remap us to uid 0 inside the
container, matching the gb300-cw behavior (where the container already
runs as root and apt-get install works directly). pyxis exposes this as
a per-srun flag: --container-remap-root. srt-slurm renders empty-string
srun_options as flag-only srun args (see core/slurm.py:250 in
NVIDIA/srt-slurm@127597c).
No-op on gb300-cw (cw is already remapped to root by default).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Picks up cquil11/srt-slurm-nv@6e34b8b which propagates srun_options
through the benchmark_stage srun (previously only
worker/frontend/telemetry stages honored them). Required for the
recipe-level srun_options.container-remap-root: "" to apply to the
benchmark.command container, the one that runs agentic_srt.sh + apt
install git.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Picks up cquil11/aiperf@9b858ae which fixes PhaseRunner.cancel() to set
all_credits_sent_event / all_credits_returned_event so the outer runner
awaits wake immediately. Previously cancelled runs (e.g. via
--failed-request-threshold) blocked for the full phase timeout (~1800s
default) before reaching the graceful exit path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ncel)
When a workflow run is cancelled mid-flight (gh run cancel, or UI cancel
button), the launcher gets SIGTERM during its `tail -F` wait and exits
before reaching the `tar czf .../multinode_server_logs.tar.gz` line in
the main flow. The Upload server logs workflow step runs (it has
if: always()) but finds no file (if-no-files-found: ignore silently
skips), so the artifact never gets uploaded.
Fix: install an EXIT trap right after JOB_ID extraction that produces
the tarball on any exit path: normal completion, error, SIGTERM, SIGKILL
of our parent. The main-flow tar block is now an idempotent no-op (kept
for log narrative).
Applied identically to both gb300-nv and gb300-cw launchers. The
b200-dgxc launcher has the same pattern but its multi-node flow is
currently only used by other configs; leaving it alone for now to avoid
mixing unrelated changes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
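The trap pattern can be sketched as follows. This is an illustration under stated assumptions, not the launcher's code: `run_with_log_trap`, `archive_logs`, `LOG_DIR` and `ARCHIVE` are hypothetical names, and the explicit TERM trap is added here so the sketch behaves the same in shells that do not run EXIT traps on an untrapped SIGTERM.

```bash
# Hypothetical wrapper: run a command in a subshell that always leaves a
# log tarball behind, whatever the exit path.
run_with_log_trap() (
    LOG_DIR="$1"
    ARCHIVE="$2"
    shift 2

    archive_logs() {
        # Idempotent: skip if an earlier exit path already wrote it.
        [ -f "$ARCHIVE" ] || tar czf "$ARCHIVE" -C "$LOG_DIR" .
    }
    trap archive_logs EXIT
    trap 'exit 143' TERM    # convert SIGTERM into a normal exit

    status=0
    "$@" || status=$?       # stand-in for the launcher's main flow
    exit "$status"
)
```

A later `tar` step guarded by the same existence check becomes a no-op once the trap has run, which matches the "idempotent no-op" note in the commit.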
gb300-nv 1p6d agentic runs hit ~15% errors at conc=32 from Dynamo NATS
RPC deadline timeouts when the single prefill worker is saturated by 32
concurrent 50-100k token prefills. Each timeout returns HTTP 500 "Failed
to generate completions: Prefill execution failed: ... NATS request to
dynamo_prefill.generate-... failed: ... deadline has elapsed". This is a
real failure, but it is driven by the single-prefill-worker capacity
limit, not a regression.
At the previous 0.05 threshold the run tripped its ProfileCancel
mechanism early and produced no usable numbers. At 0.20 the run
completes and we get steady-state metrics for the ~85% of requests that
succeed; the underlying NATS saturation is a separate work item (Dynamo
deadline tuning, or more prefill workers in the recipe, or both).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Agentic replay traces have a theoretical prefix-cache hit rate above 95%
on every workload we benchmark; the realtime srv row only reads 0.0%
because the launch script turns the SGLang RadixAttention cache off.
Every server recipe in this directory set the flag that disables it,
either as the only branch of an OFFLOADING=none case or as an
unconditional launch-line flag, so the hit-rate number was never
meaningful and the run was paying full prefill cost on every turn.
The flag is removed unconditionally from: dsv4_fp4_mi355x_sglang,
glm5.1_fp4_mi355x, glm5_fp8_b200, qwen3.5_bf16_b200, qwen3.5_fp8_b200,
qwen3.5_fp8_mi355x.
It is removed from the OFFLOADING=none branch of: qwen3.5_fp8_h100,
qwen3.5_fp8_b300_sglang, qwen3.5_fp8_mi355x_sglang. There it is replaced
with a short comment so the next person editing the `case` doesn't put
it back.
OFFLOADING=none still means "no CPU/host offload"; the GPU
RadixAttention cache stays on, which is the only sensible default for an
agentic workload.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Pulls in cjq/agentx-v0.3-subagents @ b2d047dd, which switches the
realtime srv-row prefix_cache_hit_rate fallback from SGLang's per-batch
`cache_hit_rate` gauge (reads 0 between requests) to the cumulative
`cached_tokens_total` / `prompt_tokens_total` counter pair, matching
vLLM's `hits/queries` shape. Also unlocks unique_input_tokens_srv on
SGLang.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
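The counter-pair calculation amounts to a ratio of two cumulative totals. A minimal sketch, assuming the two counter values have already been scraped; the function name `prefix_cache_hit_rate` is hypothetical and the zero-denominator handling is a choice made for this example rather than something the commit specifies.

```bash
# Hypothetical helper: hit rate in percent from cumulative counters.
#   $1 = cached_tokens_total, $2 = prompt_tokens_total
prefix_cache_hit_rate() {
    LC_ALL=C awk -v c="$1" -v p="$2" \
        'BEGIN { if (p > 0) printf "%.1f\n", 100 * c / p; else print "0.0" }'
}

prefix_cache_hit_rate 950 1000
```

Unlike a per-batch gauge, the ratio of cumulative counters does not drop to zero when the server is idle between requests.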
aiperf cea3b7e7 replaces _TraceIdleTiming.child_by_request_id's
id(req)-based keying with a stable (session_id, idx) key, so the
parallel reconstruction path's ProcessPoolExecutor pickle round-trip no
longer breaks the lookup with KeyError. Unblocks every recipe that trips
into the parallel reconstruction path -- most reliably the 256k-capped
corpus (470 traces, around WEKA_PARALLEL_THRESHOLD) which caused 15/15
failures in InferenceX run 26554741458.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
    kill "$tail_pid" 2>/dev/null || true
    wait "$tail_pid" 2>/dev/null || true
    cat "$LMCACHE_LOG" >&2 || true
    exit 1
Duplicated LMCache helper functions across three scripts
Low Severity
cleanup_lmcache_server and wait_for_lmcache_ready are copy-pasted identically across kimik2.5_fp4_mi355x.sh, kimik2.5_fp4_b200.sh, and dsv4_fp4_b200_vllm.sh. These are non-trivial functions (~45 lines each instance) that belong in benchmark_lib.sh alongside the other shared agentic helpers like run_agentic_replay_and_write_outputs. Triplicating them increases the risk of inconsistent bug fixes.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit c00454e.
…agent skip
aiperf 666887ff makes _build_parallel_reconstruction_tasks skip
child_plans whose subagent_index is in the dropped set, matching the
serial path's existing filter at line ~1172. Pairs with cea3b7e7's
id()->(session_id, idx) keying fix: that one made the parallel-path
lookup correct for active subagents, this one prevents the lookup from
running at all for dropped subagents (which were never in the timing
dict).
Without this, the qwen3.5-fp8-h100-sglang-agentic recipe (and any other
recipe that crosses WEKA_PARALLEL_THRESHOLD) crashed with KeyError on
the first dropped subagent -- see InferenceX run 26583416531.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
aiperf 89d67bb4 makes acquire_cache_lock fast-path when the cache is
already populated (entry/manifest.json exists). Prevents stale .lock
files from a SIGKILLed populator from wedging every subsequent waiter on
shared NFS -- see InferenceX run 26585006455 where 10+ jobs sat 14+
minutes printing 'Still waiting on mmap-cache populate lock' next to a
complete 32 GB cache.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
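The fast path can be illustrated in shell. This sketch is not aiperf's implementation (which is Python): `cache_is_populated` and `populate_cache_once` are hypothetical names, and a `mkdir`-based lock directory is swapped in here purely to keep the example self-contained.

```bash
# A cache entry counts as populated once its manifest exists.
cache_is_populated() {
    [ -f "$1/manifest.json" ]
}

# Hypothetical helper: $1 = cache entry dir, rest = populate command.
populate_cache_once() {
    entry="$1"
    shift
    if cache_is_populated "$entry"; then
        return 0            # fast path: a stale lock is simply ignored
    fi
    # Slow path: take the lock (a directory in this sketch) and populate.
    if mkdir "$entry.lock" 2>/dev/null; then
        status=0
        "$@" || status=$?
        rmdir "$entry.lock"
        return "$status"
    fi
    return 1                # lock held by someone else; caller may retry
}
```

The key point is the order of checks: the manifest test comes before any lock handling, so a lock left behind by a killed populator cannot block readers of a complete cache.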
    # --cuda-graph-max-bs "$CONC"
    # --max-running-requests "$CONC"
    # --max-prefill-tokens 8192
    # --chunked-prefill-size 8192
Commented-out server parameters look like debug leftovers
Medium Severity
Four critical SGLang server parameters (--cuda-graph-max-bs, --max-running-requests, --max-prefill-tokens, --chunked-prefill-size) are commented out inside the SGLANG_CMD array. All sibling scripts (B200, B300, MI355X) set these explicitly. Without them, SGLang uses internal defaults that likely don't match the intended H100 tuning, potentially causing OOM or poor performance during benchmark runs.
Reviewed by Cursor Bugbot for commit 0e8ac92.
Conflict resolution policy: preserve main's behavior for non-agentic
scenarios; PR has free rein over agentic scenarios.
YAML configs (.github/configs/{amd,nvidia}-master.yaml):
- Reverted to main verbatim for every recipe that exists on main
(29 recipes total — 19 nvidia + 10 amd — that PR-branch had also
modified). Main's image versions, search-spaces, and comments stay.
- Appended 9 net-new agentic recipes from PR-branch:
nvidia: qwen3.5-fp8-h100-sglang-agentic,
qwen3.5-fp8-b300-sglang-agentic-hicache,
kimik2.5-fp4-b200-vllm-agentic-lmcache,
dsv4-fp4-gb300-dynamo-vllm-agentic,
dsv4-fp4-gb300-cw-dynamo-vllm-agentic
amd: qwen3.5-fp8-mi355x-sglang-agentic-hicache,
dsv4-fp4-mi355x-vllm-agentic,
dsv4-fp4-mi355x-sglang-agentic,
dsr1-fp4-mi355x-sglang-disagg-mtp
Auto-merged everywhere else. Notable shared-infra changes are plumbing-
only (path reorg to single_node/fixed_seq_len/, launcher refactors,
new mount paths for aiperf mmap cache); no main recipe perf path
changes.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
Reviewed by Cursor Bugbot for commit fd91e1a. Configure here.
# cascade-cancel the other (see prior R20–R23 outages). The two sibling
# configs share recipe files via the same launch_gb300-cw.sh IS_AGENTIC
# overlay (recipes/vllm/deepseek-v4/agentic/), so a change to the recipe
# applies to both clusters with no duplication.
Duplicate comment block in nvidia config file
Low Severity
The multi-line comment block explaining the CoreWeave sibling of dsv4-fp4-gb300-dynamo-vllm-agentic is duplicated verbatim. Lines 9806–9813 and 9816–9823 are identical, both describing why CW is a separate config. One copy is sufficient.
Reviewed by Cursor Bugbot for commit fd91e1a. Configure here.
Adds the four scripts used to produce the semianalysisai/cc-traces-weka-with-subagents-*
HuggingFace datasets (060226 + the older 052726 series):
utils/sample_proxy_traces.py
Postgres -> per-session JSONLs. Applies the canonical filter stack
(--min/--max-trace-version, --min-main-turns, --require-cli-min,
--max-parallel-subagents, hardcoded image-content and
classifier-call exclusions).
utils/build_weka_hf_dataset.py
End-to-end orchestrator: sample -> proxy_to_weka.py -> traces.jsonl
-> plots -> README -> HF upload. Optional 256k variant with
per-request cap + timeline reshift (matches the existing
-052726-256k README semantics). Idempotent via work-dir caching +
--skip-* flags.
utils/plot_weka_distributions.py
Main-agent stream distribution plots (ISL/OSL/think-time histograms,
log + linear x).
utils/plot_subagent_distributions.py
Sub-agent fan-out plots (groups/trace, inners/group, intra-group
cache-hit rate, etc.).
proxy_to_weka.py was already tracked (committed earlier under eab58e9).
Signed-off-by: Cam Quilici <cameron@semianalysis.com>
… aiperf
The five scripts used to build the semianalysisai/cc-traces-weka-with-subagents-*
HuggingFace datasets live in their own subdir now so they can grow without crowding utils/:
utils/agentic/sample_proxy_traces.py (Postgres -> per-session JSONLs)
utils/agentic/proxy_to_weka.py (proxy rows -> weka trace JSON, with dedup)
utils/agentic/build_weka_hf_dataset.py (orchestrator: sample -> convert -> plots -> README -> upload)
utils/agentic/plot_weka_distributions.py (main-agent histograms)
utils/agentic/plot_subagent_distributions.py (sub-agent fan-out plots)
Internal docstring/path references inside the moved scripts updated to point at the new utils/agentic/ paths so README footers + --help examples stay accurate.
Also bumps the aiperf submodule to the tip of cquil11/cjq/agentx-v0.3 (062a5de9), which includes the snapshot warmup planned-credit fix + pre-commit autofixes that landed since the previous pin (8473e154).
Signed-off-by: Cam Quilici <cameron@semianalysis.com>
seed = str(args.seed)
rows = sorted(
    rows,
    key=lambda r: hashlib.md5((r["session_id"] + seed).encode("utf-8")).hexdigest(),
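The excerpt above orders rows by a hash of `session_id` plus the seed, which gives a shuffle that is reproducible and independent of input order. A self-contained sketch of that idea, where `seeded_order` and the sample rows are hypothetical names introduced for illustration:

```python
import hashlib


def seeded_order(rows, seed):
    """Order rows by md5(session_id + seed); the result does not depend on input order."""
    seed = str(seed)
    return sorted(
        rows,
        key=lambda r: hashlib.md5((r["session_id"] + seed).encode("utf-8")).hexdigest(),
    )


rows = [{"session_id": f"sess-{i}"} for i in range(5)]
order_a = seeded_order(rows, 0)
# Same rows presented in reverse: the hashed key restores the same order.
order_b = seeded_order(list(reversed(rows)), 0)
```

Because the key depends only on the row and the seed, taking the first N rows after sorting yields the same sample on every run with the same seed.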
The aiperf submodule's 175 in-flight commits (agentic / observability / hicache / counter-pair / scenario work) were previously hosted on cquil11/aiperf as a personal fork branch (cjq/agentx-v0.3-subagents). Pushed the same SHA chain to SemiAnalysisAI/aiperf as the cjq/agentx-v0.3 branch and switched the submodule's source-of-truth URL accordingly. The pinned commit SHA (062a5de9) is unchanged: same content, same blob, just reachable via the org-owned remote now. Run `git submodule sync utils/aiperf` after pulling to refresh cached remote URLs in existing checkouts. Signed-off-by: Cam Quilici <cameron@semianalysis.com>
The previous pin 062a5de9 (set by #1571 "chore: agentx v0.3") was the cjq/agentx-v0.3 tip on 2026-06-02, but that branch was later rebased/ force-pushed (now at ff2b646c) which orphaned 062a5de9; GitHub has since garbage-collected it. It is now unfetchable ("upload-pack: not our ref") and absent from every CI runner cache, so actions/checkout fails on any cold runner with "Unable to find current revision in submodule path utils/aiperf" (e.g. the newly-added gb300-cw runner-4, run 27453693856). Re-pin to the current cjq/agentx-v0.3 tip — the branch .gitmodules already declares, which is live/fetchable and contains the prior aiperf history as an ancestor. This makes the pin and the declared branch consistent again.
* feat: MiniMax-M3 MXFP8 full sweep config for GB300
Add minimaxm3-fp8-gb300-dynamo-vllm to nvidia-master.yaml with 7
topologies covering the full concurrency range:
- TP4/TP8 (low latency, conc 4-64)
- TP4+EP4 agg + 1P+1D disagg 2-node + 1P+1D collocated (mid, conc 64-512)
- DEP4/DEP8 (high throughput, conc 256-2048)
All recipe YAMLs included under minimax-m3-gb300-fp8/{1k1k,8k1k}/.
GB300 recipes include srun_options mem=0 (CW DefMemPerCPU cgroup fix)
and omit safetensors-load-strategy prefetch (host-memory limit).
* chore: update perf-changelog pr-link to #1735
* Update runner name in nvidia-master.yaml
* fix: add sbatch_directives mem=0 + cpus-per-task=72 to M3 GB300 recipes
srun_options.mem=0 only grants a step the job's existing allocation; on
gb300-cw (DefMemPerCPU=4096, no DefCpuPerGPU) the job itself was only
allocated 4 GB/node and workers were cgroup-OOM-killed during engine
init (run 27452273567: oom_kill in StepId=7409.7 on slurm-gb300-133-193,
worker RLIMIT showed 4194304 KB). The canary passed because it landed on
gb300-nv, which doesn't enforce the cap. Mirrors the sbatch_directives
block of the DSV4 agentic recipes.
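A quick unit check on the numbers in this message. Slurm's DefMemPerCPU is expressed in megabytes; reading the observed worker RLIMIT as "one CPU's worth of default memory" is an inference from the figures above, not something the log states directly.

```python
# Values quoted in the commit message.
DEF_MEM_PER_CPU_MB = 4096        # gb300-cw DefMemPerCPU (Slurm reports this in MB)
observed_rlimit_kb = 4194304     # worker RLIMIT from run 27452273567

observed_rlimit_mb = observed_rlimit_kb / 1024
observed_rlimit_gb = observed_rlimit_mb / 1024

# How many CPUs' worth of default memory the observed cap corresponds to.
implied_cpus = observed_rlimit_mb / DEF_MEM_PER_CPU_MB

print(f"cap = {observed_rlimit_gb:.0f} GB, implied CPUs = {implied_cpus:.0f}")
```

The cap works out to exactly DefMemPerCPU for a single CPU, which matches the "4 GB/node" allocation described above.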
* fix: run M3 GB300 workers cache-only (HF_HUB_OFFLINE=1) to avoid fetch_model lock race
With the mem fix in place, run 27452976271 cleared the OOM but hit a new
failure: both nodes of the TP8-2n job called dynamo fetch_model within
200ms (191 @ :23.637, 193 @ :23.833), 191 took the per-blob .lock on the
shared /mnt/vast/hf-home cache and held it verifying the 444 GB snapshot,
193 retried ~6.4s and died 'Lock acquisition failed' (dynamo's rust hub
doesn't wait like Python hf_hub). The launcher already pre-stages and
verifies the snapshot offline before submit, so the workers never need to
fetch. Setting HF_HUB_OFFLINE=1 in every worker env block makes dynamo
serve cache-only and skip the download lock entirely, so co-fetching
workers no longer collide. Applied to all agg + disagg (prefill/decode)
env blocks across the 11 recipes.
* fix: re-pin utils/aiperf to live cjq/agentx-v0.3 tip (ff2b646c)
The previous pin 062a5de9 (set by #1571 "chore: agentx v0.3") was the
cjq/agentx-v0.3 tip on 2026-06-02, but that branch was later rebased/
force-pushed (now at ff2b646c) which orphaned 062a5de9; GitHub has since
garbage-collected it. It is now unfetchable ("upload-pack: not our ref")
and absent from every CI runner cache, so actions/checkout fails on any
cold runner with "Unable to find current revision in submodule path
utils/aiperf" (e.g. the newly-added gb300-cw runner-4, run 27453693856).
Re-pin to the current cjq/agentx-v0.3 tip — the branch .gitmodules already
declares, which is live/fetchable and contains the prior aiperf history as
an ancestor. This makes the pin and the declared branch consistent again.
* MiniMax-M3 GB300: disagg-only sweep + multi-node-NVLink KV transfer
Replace the aggregated M3 GB300 topologies with disaggregated-only, and
enable NixlConnector KV transfer over multi-node NVLink on every disagg
recipe. On gb300-cw the cross-node prefill->decode KV handoff was silently
falling back to RDMA/TCP (~268 MB/s, ~1400 tiny descriptors for M3 MSA
cache) — the disagg ceiling. Setting UCX_CUDA_IPC_ENABLE_MNNVL=y plus
--enable-cumem-allocator (VMM-registers KV so NIXL uses cuda_ipc across the
NVL fabric) lifts it to ~1.4-1.7 GB/s and gives +17% / +23% / +49%
out tok/s/gpu at conc 64 / 128 / 256 (jobs 7490 base vs 7493 MNNVL, 1P1D
TP4EP4). This is a GB300-only win: B300 8-GPU IB islands cannot move KV
over multi-node NVLink.
Sweep (1k1k), all MNNVL:
- 1P1D TP4+EP4 collocated 1n (8 GPU), conc 8-256 - low/mid latency
- 1P1D TP4+EP4 split 2n (8 GPU), conc 64-512 - mid throughput
- 1P + DP16+EP wide decode 5n (20 GPU), conc 512-2048 - max throughput
(decode keeps scaling on NVL where 1P1D saturates: ~1213 vs ~810
out tok/s/gpu @ conc 1024)
Removes all agg-gb300 recipes (1k1k + 8k1k); applies MNNVL to the 8k1k
disagg recipe too for consistency.
* M3 GB300: add 8k1k disagg sweep; drop unschedulable collocated-1n
The collocated-1n topology (disagg-gb300-1p1d-tp4ep4-1n) declared
gpus_per_node: 8, but gb300-cw nodes have 4 GPUs — sbatch rejects it with
"Requested node configuration is not available" even on a fully idle
cluster (confirmed: fails standalone with 28 nodes free; the split-2n and
wide-decode at gpus_per_node 4 schedule fine). It was an 8-GPU-node
template artifact that never reached sbatch before. Remove it (1k1k + 8k1k)
and let the split-2n cover the low-latency end (conc extended down to 8).
Add the 8k1k (isl 8192) scenario mirroring 1k1k with the two valid disagg
shapes (split-2n + wide DP16 decode), MNNVL KV transfer on both, seq params
retuned for long context (max-model-len 9472) and lower concurrency.
* M3 GB300: add rack-saturating balanced-ratio TP-ep1 max-throughput disagg config
Adds a 17-node (full-rack) disagg topology to the M3 GB300 sweep (1k1k +
8k1k) from on-cluster tuning (gb300-cw):
- PREFILL is the binding bottleneck, not decode width or KV transfer:
a single prefill worker left ~3967 reqs queued and starved 64 decode
GPUs. Balancing to 5 prefill : 12 decode (TP4) cleared the backlog and
lifted throughput +57% (535 -> 843 out tok/s/gpu @ conc 2048).
- TP-only decode (ep1, no expert parallelism) per the Qwen3.5-397B-A17B
recipes (closest M3 analog); M3 wide-EP/DP-attention all-to-all was
slower and DP32 < DP16 per-GPU.
- Kept the existing 1p1d (low/mid latency) and dep16dec (wide-decode)
topologies so CI measures the full Pareto rather than replacing them.
NixlConnector KV transfer stays on multi-node NVLink (MNNVL + cumem);
note KV transfer was verified NOT to bottleneck throughput (doubling its
bandwidth via num_threads changed end-to-end tok/s/gpu by ~0). Recipe
YAMLs line up 1:1 with the nvidia-master.yaml CONFIG_FILE references.
* M3 GB300: replace dep16dec with 1P4D TP4-ep1; add prefill-heavy 10P7D for 8k1k
DSR1 GB300 patterns show wide-EP decode hurts M3's MoE all-to-all;
independent TP4 decode workers are strictly better. Also, 8k1k is
prefill-bound (616-req backlog at 5P:12D) — rebalance to 10P:7D
per DSR1/DSV4's prefill-heavy long-context ratios.
Changes:
- Replace dep16dec (EP16 single decode) with 1P+4D (4x TP4 ep1 decode)
for both 1k1k and 8k1k, same 5 nodes
- Add 10P+7D TP4 ep1 (17 nodes) for 8k1k max throughput
- Tighten concurrency ranges: 1P1D [4-32], 1P4D [64-512], 5P12D/10P7D [1024+]
* [Klaud Cold]minimaxm3-fp8-mi300x-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) MI300X recipe (#1749)
* minimaxm3-fp8-mi300x-vllm-mtp: day-zero MiniMax-M3 EAGLE3 MI300X recipe
Adds the spec-decoding=mtp sibling of minimaxm3-fp8-mi300x-vllm, based
on the MI300X non-MTP recipe + the MI355X MTP recipe. Keeps the MI300X
serve shape (BF16 KV cache — gfx942 lacks calibrated ROCm FP8 attention
scales — plus --no-enable-prefix-caching, TRITON_ATTN, --enforce-eager,
minimax_m3 parsers) and adds the Inferact/MiniMax-M3-EAGLE3 draft via
--speculative-config (method eagle3, 3 spec tokens) + chat-template
prompts.
Carries the same in-place EAGLE3 patch as the MI355X MTP recipe: the
shipped ROCm image's AMD MiniMax-M3 model lacks SupportsEagle3, so the
recipe patches the installed amd/model.py before serving
(functionstackx/vllm#1, upstream vllm-project/vllm#45546; validated
green on MI355X). Idempotent; hard-fails on base drift.
TP8-only search space (gfx942 192 GB is memory-tight, like H100), TP8
latency rows started at conc 1, matching the H100/MI355X MTP recipes.
Also adds SPEC_SUFFIX to launch_mi300x-amds.sh so spec-decoding=mtp
routes to the _mtp script (the launcher hardcoded _mi300x.sh).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* perf-changelog: fill in PR link for minimaxm3-fp8-mi300x-vllm-mtp (#1749)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
* [AMD] perf: enable MiniMax M3 CUDA graphs on MI300X (#1750)
* feat: add MiniMax M3 MI300X day-zero benchmark
* chore: link MiniMax M3 MI300X changelog
* fix: mount ROCm devices on MI300X
* fix: disable prefix caching for MI300X MiniMax M3
* fix: use bf16 kv cache for MI300X MiniMax M3
* perf: enable MI300X MiniMax M3 CUDA graphs
* chore: link MI300X CUDA graph changelog
* [Klaud Cold] minimaxm3-fp8-mi300x-vllm-mtp: run with CUDA graphs (drop --enforce-eager, VLLM_USE_BREAKABLE_CUDAGRAPH=0) (#1756)
* minimaxm3-fp8-mi300x-vllm-mtp: run with CUDA graphs (drop --enforce-eager)
Remove --enforce-eager from the MI300X EAGLE3 MTP recipe and set
VLLM_USE_BREAKABLE_CUDAGRAPH=0, matching the non-MTP MI300X recipe
(#1750). Avoids the M3-decode breakable-cudagraph path that previously
forced eager execution. Re-sweeps minimaxm3-fp8-mi300x-vllm-mtp.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* perf-changelog: fill in PR link for minimaxm3-fp8-mi300x-vllm-mtp cudagraphs
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
* M3 GB300: drop dominated configs, restore 1P1D full range
Data from run 27489709722 showed:
- 1P4D (20 GPU) strictly dominated by 1P1D (8 GPU): 320 vs 974
out/s/gpu @ conc 128 (1k1k). Single prefill can't feed 4 decode
workers — 1P:4D ratio is too decode-heavy.
- 8k1k 5P12D (68 GPU) dominated by 10P7D: 567 vs 874 out/s/gpu
@ conc 1024. Prefill-heavy ratio is correct for long context.
Changes:
- Remove 1P4D recipes (both 1k1k and 8k1k)
- Remove 8k1k 5P12D recipe (dominated by 10P7D)
- Restore 1P1D to full concurrency range [8-512] 1k1k, [8-256] 8k1k
(was truncated to [4-32] to avoid 1P4D overlap)
Final GB300 configs: 1P1D (latency-to-mid) + rack-saturating (max tput)
1k1k: 1P1D [8-512] + 5P12D [2048-8192]
8k1k: 1P1D [8-256] + 10P7D [1024-4096]
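The 1P4D versus 1P1D comparison can be checked with the figures quoted above. This sketch assumes that "out/s/gpu" is output throughput per GPU, so multiplying by GPU count gives an aggregate; the dictionary layout is illustrative only.

```python
# Figures from run 27489709722, 1k1k @ conc 128, as quoted in the commit message.
configs = {
    "1P1D": {"gpus": 8, "out_per_gpu": 974},
    "1P4D": {"gpus": 20, "out_per_gpu": 320},
}

# Aggregate output rate, assuming per-GPU rate times GPU count.
totals = {name: c["gpus"] * c["out_per_gpu"] for name, c in configs.items()}

# 1P4D is dominated if it uses more GPUs yet is no better per GPU or in aggregate.
dominated = (
    configs["1P4D"]["gpus"] > configs["1P1D"]["gpus"]
    and configs["1P4D"]["out_per_gpu"] < configs["1P1D"]["out_per_gpu"]
    and totals["1P4D"] < totals["1P1D"]
)
```

Under that assumption, 1P4D delivers a lower aggregate rate on 2.5x the GPUs, which is why it drops out of the sweep.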
* M3 GB300 disagg: add DSV4-level decode optimizations
Port decode optimizations from DSV4 GB300 disagg reference configs to
all 4 M3 GB300 recipe files:
- fp8 KV cache (2x decode slot capacity vs bf16)
- max-num-seqs/max-num-batched-tokens 256→512
- CUDA graph compilation (FULL_DECODE_ONLY mode)
- NCCL MNNVL env vars (CUMEM_ENABLE, MNNVL_ENABLE, NVLS_ENABLE)
- enable-ep-weight-filter + no-disable-hybrid-kv-cache-manager
- stream-interval 32→50 on decode
* Switch GB300 M3 recipes to nightly-aarch64 + add Marlin MoE for TP-only workers
- All 4 recipes: container vllm/vllm-openai:minimax-m3 → nightly-aarch64
(contains upstream head_ratio fix vllm#45879, avoids gemm1_alpha crash)
- TP-only recipes (5p12d-tp4ep1, 10p7d-tp4ep1): add moe-backend: marlin
for both prefill and decode workers per PR #1809 pattern
- EP recipes (1p1d-tp4ep4): no Marlin (EP enabled)
- nvidia-master.yaml: update image, comment out 1k1k (run 8k1k only)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: switch GB300 M3 runner from gb300-cw to gb300-nv
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: add minimaxm3-fp8 to gb300-nv launcher + switch recipes to alias-based model path
- Add minimaxm3 fp8 case to launch_gb300-nv.sh (MODEL_PATH, srt-slurm clone)
- Switch recipe model.path from hf:MiniMaxAI/MiniMax-M3-MXFP8 to minimax-m3-mxfp8
(alias resolved via srtslurm.yaml model_paths, matching GB200 pattern)
- Remove __M3_HF_HOME__ placeholder (extra_mount, HF_HOME, HF_HUB_OFFLINE)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: redesign GB300 M3 recipes — DEP8 prefill, TEP8/TP8/DEP8 decode
All prefill workers switched to DEP8 (TP1 DP8 EP, 8 GPU, 2 nodes).
Low conc (<128): two decode variants — TEP8 (TP8+EP8) and TP8+Marlin.
High conc (128+): DEP8 decode, 2P+7D = 18 nodes.
TP8 decode (not TP4) to avoid Marlin OOM seen on previous run.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: TEP4 prefill + B300-optimal decode for GB300 M3 disagg
Switch all prefill from DEP8 (TP1 DP8 EP, 2 nodes) to TEP4
(TP4+EP4, 1 node), halving per-worker node footprint. Decode
configs follow B300 run 27630519240 optimal points (spec=none):
- conc 8-32: TP4+Marlin (no EP)
- conc 64-256: TEP4 (TP4+EP4)
- conc 512/1024: TEP8 (8k1k) or DEP8 (1k1k), max 2 workers × 6n
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: adapt NV B300 PR #1863 disagg configs for GB300 M3 sweep
Replace TEP4 prefill + B300-optimal decode recipes with NV's PR #1863
B300 dynamo-vllm disagg search matrix, adapted for GB300 NVL72
(4 GPU/node):
- All prefill switched to DEP2 (TP1 DP2 EP, 2 GPU/worker) — lighter
per-worker footprint allows more prefill workers
- Decode types: TP4+Marlin, TEP8, DEP8, DEP4
- 4p3d (3 decode workers) skipped
- 15 recipe files: 8 for 8k1k, 7 for 1k1k (both ISLs active)
- PR 1863 vllm_config values (max-num-seqs up to 4096,
max-cudagraph-capture-size up to 8192, max-num-batched-tokens 16384)
- Prefill uses cudagraph (max-cudagraph-capture-size: 2048) instead
of enforce-eager
- kv-cache-dtype: fp8, req_rate: inf for all benchmarks
- GB300 MNNVL/NVLS env vars + sbatch mem=0 preserved
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: reduce GB300 DEP CUDA graph capture sizes
---------
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Cameron Quilici <cjquilici@gmail.com>


Summary
Move the AgentX benchmarks onto aiperf and add the v0.3 config set. This replaces the old trace-replay path, refreshes the Weka replay data and metrics, and adds the offload configs needed for the current AMD and NVIDIA runs.
This branch includes iterative cluster bring-up commits. The useful review points are grouped below.
Commit groups
AIPerf migration and repo layout
- trace-replay submodule and standardize agentic artifacts under aiperf_artifacts/: a70f1ba6, 65582282
- benchmarks/single_node/fixed_seq_len/: 8eec0d4e, f89cdfe2, 1b41cd0b
- SemiAnalysisAI/aiperf: f06e80ce, fcaa38a4, ebe6d0e5
Replay data, failure handling, and metrics
- 9ea73705, 4fec279f, 4aeb1640
- 4a512375, 36cb5241
- 9e41c1a2, 7ad7dd4c, acc2c731, 4933cf34, 6d884b9b, b27295c5
- 5c15fa9d, 81fd6bf0, 380dcd78, bcf338cd
Single-node offload coverage
- aae82c07
- b07bd58e, e29fb3b1, bb64d3e2, 49416974
- 5a3cd6a6, 0103241d, 5db26686, 69cdbc25, 03a85abe
- 327c4d9a, afaec729, 72cf856f, 77e648db
- 907ad2e9, 99cd0350
GB300 multinode agentic runs
- 5e1ca4ea
- fa28004c
- 329d1683, 52af9d4b, 3274dea8, 92d2738b
- e5759810, b2ffd9b3
Runners and matrix handling
- 83fa8ec3, f999fef9
- a98fcaa8, 34063558
- 967c50ca
Dataset variants
- e1e4d448, 4e62c597
- eab58e95
Validation
Tested through targeted workflow dispatches while bringing up the new configs.