chore: agentx v0.3 by cquil11 · Pull Request #1571 · SemiAnalysisAI/InferenceX

cquil11 · 2026-05-27T13:58:31Z

Summary

Move the AgentX benchmarks onto aiperf and add the v0.3 config set. This replaces the old trace-replay path, refreshes the Weka replay data and metrics, and adds the offload configs needed for the current AMD and NVIDIA runs.

This branch includes iterative cluster bring-up commits. The useful review points are grouped below.

Commit groups

AIPerf migration and repo layout

Remove the trace-replay submodule and standardize agentic artifacts under aiperf_artifacts/: a70f1ba6, 65582282
Move fixed-sequence launchers into benchmarks/single_node/fixed_seq_len/: 8eec0d4e, f89cdfe2, 1b41cd0b
Track the AgentX dataset tooling in-tree and switch the submodule remote to SemiAnalysisAI/aiperf: f06e80ce, fcaa38a4, ebe6d0e5

Replay data, failure handling, and metrics

Switch to the with-subagents Weka corpus and align the InferenceX-side dataset constants: 9ea73705, 4fec279f, 4aeb1640
Propagate replay failures and fail runs with excessive request errors: 4a512375, 36cb5241
Refresh prefix-cache and realtime metric handling for vLLM and SGLang: 9e41c1a2, 7ad7dd4c, acc2c731, 4933cf34, 6d884b9b, b27295c5
Update mmap-cache metadata and stale-cache handling: 5c15fa9d, 81fd6bf0, 380dcd78, bcf338cd

Single-node offload coverage

Add DSv4 SGLang agentic coverage on MI355X: aae82c07
Add Kimi LMCache coverage, including LMCache MP on B200 and MI355X: b07bd58e, e29fb3b1, bb64d3e2, 49416974
Fix the ROCm LMCache path used by the MI355X runs: 5a3cd6a6, 0103241d, 5db26686, 69cdbc25, 03a85abe
Add and tune Qwen SGLang HiCache coverage across B300, H100, and MI355X: 327c4d9a, afaec729, 72cf856f, 77e648db
Size the DSv4 native CPU-offload configs for the agentic workload: 907ad2e9, 99cd0350

GB300 multinode agentic runs

Add DSv4 disaggregated vLLM agentic recipes and the local recipe overlay: 5e1ca4ea
Add the CoreWeave sibling config: fa28004c
Fix recipe resources and transport limits found during cluster bring-up: 329d1683, 52af9d4b, 3274dea8, 92d2738b
Preserve server logs when a multinode run exits or is cancelled: e5759810, b2ffd9b3

Runners and matrix handling

Apply runner filtering to agentic configs and use the registered MI355X labels: 83fa8ec3, f999fef9
Wire AIPerf mmap-cache mounts into the H200 and H100 Slurm launchers: a98fcaa8, 34063558
Match the H200 DGXC container UID behavior to the B200 runner: 967c50ca

Dataset variants

Add the Weka loader override and the 256k Minimax corpus path: e1e4d448, 4e62c597
Drop exact duplicate rows while building Weka sessions: eab58e95

Validation

Tested through targeted workflow dispatches while bringing up the new configs.

…loadingConnector

vLLM's --kv_offloading_backend native resolves to two different connectors based on the VLLM_USE_SIMPLE_KV_OFFLOAD env var (see vllm/config/vllm.py:662):

    VLLM_USE_SIMPLE_KV_OFFLOAD=1 -> SimpleCPUOffloadConnector (the path we were using; carries the popleft_n + context-overflow + completion-barrier bugs we hit on B200/B300/H200)
    unset (default) -> OffloadingConnector (the regular native path)

This commit drops the env var and the JSON form, switching MI355X to the shortcut form which now routes to OffloadingConnector. We're trying the regular path here to see if it sidesteps the SimpleCPUOffloadConnector-specific issues that have been forcing lazy_offload + workarounds.

Also drops the --kv-transfer-config JSON since the shortcut form constructs the KVTransferConfig itself at engine startup. Keeps --disable-hybrid-kv-cache-manager since MI355X uses --block-size=1 + AITER which doesn't play with the hybrid manager.
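The selection rule described in this commit message can be modeled in a few lines. This is a sketch of the behavior as the message states it, not vLLM's code: `resolve_native_connector` is a hypothetical name, and only the two cases the message mentions (`1` and unset) are modeled.

```python
def resolve_native_connector(env: dict) -> str:
    """Model which connector the `native` backend picks, per the commit message."""
    # Assumption: only the literal value "1" selects the simple connector.
    # How vLLM treats other values is not stated in the message.
    if env.get("VLLM_USE_SIMPLE_KV_OFFLOAD") == "1":
        return "SimpleCPUOffloadConnector"
    return "OffloadingConnector"


# Dropping the env var is what moves a launcher onto the regular native path.
before = resolve_native_connector({"VLLM_USE_SIMPLE_KV_OFFLOAD": "1"})
after = resolve_native_connector({})
```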

Test SimpleCPUOffloadConnector lazy_offload behavior on a newer vLLM than the default v0.20.0-cu130. Image: cquil/vllm-openai:v0.21.0-8813c92. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mirrors the dsv4-fp4-b200-vllm-agentic CONC sweep (tp8 [16,32,64] + tp8 dp-attn [64,128,256]) so the two SKUs can be compared on the same trace load. Uses the same SGLang image as the fixed-seq-len sibling (rocm/sgl-dev:rocm720-mi35x-0363e6c-20260509-DSv4). Offload sweep is none-only (SGLang has no equivalent of vLLM's SimpleCPUOffloadConnector that we exercise on b200). Launcher swaps the fixed-seq-len harness (run_benchmark_serving) for the agentic harness (build_replay_cmd / write_agentic_result_json / analyze_benchmark_distributions) but keeps all SGLang server flags and SGLANG_* env vars identical to the fixed-seq-len sibling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R2 dispatch failed on all 6 b200 shards with the same enroot error during manifest fetch:

    [INFO] Fetching image manifest list
    [INFO] Fetching image manifest
    [ERROR] Could not process JSON input
    curl: (23) Failure writing output to destination

Docker Hub confirms the image exists with a clean Docker v2 manifest, but enroot import was being invoked as `docker://docker.io/cquil/vllm-openai:...` because the image field had the docker.io/ prefix. Every other image entry in the repo uses the bare `org/repo:tag` form (no docker.io/ prefix), so this entry was the outlier. Dropping the prefix matches convention and should let enroot resolve the registry host normally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
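A config lint along the following lines could catch the outlier before dispatch. It is a minimal sketch, assuming image fields are plain strings; `to_bare_image` is a hypothetical helper, not something the repo is known to contain.

```python
DOCKER_HUB_PREFIX = "docker.io/"


def to_bare_image(image: str) -> str:
    """Strip a leading docker.io/ so the entry matches the bare org/repo:tag form."""
    if image.startswith(DOCKER_HUB_PREFIX):
        return image[len(DOCKER_HUB_PREFIX):]
    return image


# Placeholder image name, used only for illustration.
normalized = to_bare_image("docker.io/org/repo:tag")
```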

First multi-node agentic config with the recipe local to this repo. Adds:

- Two new agentic recipes under benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/agentic/, adapted from the corresponding 8k1k fixed-seq-len siblings:
  * disagg-gb300-1p6d-dep4-tp4-agentic.yaml (low-lat conc=32, mid conc=192)
  * disagg-gb300-4p1d-dep4-dep8-24-c4096-agentic.yaml (high-tput conc=4096)
  Both drop max-model-len, drop no-enable-prefix-caching, add DSv4 tool/reasoning parsers, switch benchmark.type sa-bench -> custom (hands off to benchmarks/multi_node/agentic_srt.sh which builds the aiperf inferencex-agentx-mvp invocation).
- New IS_AGENTIC=1 branch at the top of runners/launch_gb300-nv.sh's framework conditional. Clones the cquil11/srt-slurm-nv fork (the only srt-slurm build that supports benchmark.type=custom) on the cam/sa-submission-q2-2026 branch and overlays the local agentic recipes into recipes/vllm/deepseek-v4/agentic/ so iteration stays in this repo.
- New dsv4-fp4-gb300-dynamo-vllm-agentic config entry in nvidia-master.yaml as a sibling of the byte-identical-to-origin/main dsv4-fp4-gb300-dynamo-vllm base. Three-tier sweep:
  * low-latency (conc=32, 1p6d shape, 28 GPUs / 8 nodes)
  * mid (conc=192, 1p6d shape, same alloc as low-lat)
  * high-tput (conc=4096, 4p1d shape, 24 GPUs / 7 nodes)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R1 of dsv4-fp4-gb300-dynamo-vllm-agentic failed at `srtctl apply` with two schema errors against the cquil11/srt-slurm-nv fork:

    Invalid config: {'dynamo': {'wheel': ['Unknown field.']}, 'benchmark': {'env': {'PORT': {'value': ['Not a valid string.']}}}}

The first (dynamo.wheel) is fixed by cherry-picking commit 0060f857 from NVIDIA upstream onto cquil11/srt-slurm-nv@cam/sa-submission-q2-2026 (adds wheel field + install scripts; pushed separately).

The second (PORT) is fixed here: env values must be strings, so `PORT: 8000` -> `PORT: "8000"`. INFMAX_CONTAINER_WORKSPACE / RESULT_DIR parse as strings due to their / chars, and IS_MULTINODE was already quoted; PORT was the only bare int.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
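The string-only rule for env values is easy to check before calling `srtctl apply`. The sketch below works on an already-parsed mapping; `non_string_env_values` is a hypothetical helper, and the values other than `PORT` are placeholders chosen for illustration.

```python
def non_string_env_values(env: dict) -> dict:
    """Return the env entries whose values are not strings."""
    return {key: value for key, value in env.items() if not isinstance(value, str)}


# A bare `PORT: 8000` in YAML parses to an int; quoted values parse to str.
env = {"PORT": 8000, "IS_MULTINODE": "1", "RESULT_DIR": "/results"}

bad = non_string_env_values(env)  # {"PORT": 8000}
fixed = {key: str(value) for key, value in env.items()}
```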

R2 of dsv4-fp4-gb300-dynamo-vllm-agentic landed all 3 shards on gb300-cw_N runners (CoreWeave self-hosted runners advertise both gb300-cw AND gb300-nv labels). RUNNER_NAME%%_* resolves to gb300-cw, which routes to runners/launch_gb300-cw.sh — but that launcher had no IS_AGENTIC handling, so it cloned upstream NVIDIA/srt-slurm (which lacks benchmark.type=custom) instead of the cquil11 fork. srtctl apply then failed: Invalid config: {'benchmark': {'command': ['Unknown field.'], 'env': ['Unknown field.']}} Mirrors the IS_AGENTIC=1 branch I added earlier to launch_gb300-nv.sh: use cquil11/srt-slurm-nv@cam/sa-submission-q2-2026 (now patched with dynamo.wheel support via cherry-picked NVIDIA commit 0060f857) and overlay our local agentic recipes from benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/agentic/. Both gb300-nv and gb300-cw launchers now handle IS_AGENTIC identically, so the workload runs correctly regardless of which runner picks it up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
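The routing step behind this failure is the shell expansion `RUNNER_NAME%%_*`, which drops everything from the first underscore onward. A Python equivalent is sketched below; `launcher_for` is a hypothetical name, and the `runners/launch_<prefix>.sh` pattern is inferred from the two launcher paths named in this PR.

```python
def launcher_for(runner_name: str) -> str:
    """Mirror `${RUNNER_NAME%%_*}`: keep the text before the first underscore."""
    prefix = runner_name.split("_", 1)[0]
    # Assumption: launchers follow the runners/launch_<prefix>.sh naming pattern.
    return f"runners/launch_{prefix}.sh"


cw_launcher = launcher_for("gb300-cw_0")
nv_launcher = launcher_for("gb300-nv_2")
```

Because the runner name alone decides the launcher, a job scheduled on a shared label can end up in either script, which is why both launchers need the same IS_AGENTIC handling.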

Upstream NVIDIA/srt-slurm@main has caught up on every schema feature the agentic path needs:

- BenchmarkType.CUSTOM + benchmark.command + benchmark.env (the hook that hands off to benchmarks/multi_node/agentic_srt.sh)
- DynamoConfig.wheel (so our vllm recipes can pin the same ai-dynamo wheel as the fixed-seq-len path)
- default_bash_preamble (no more "Unknown field" warning)

So we don't need the cquil11/srt-slurm-nv fork anymore. Pin to upstream commit 127597c0e6d3 (current HEAD) for reproducibility; bump as upstream evolves.

Also fix: `uv venv` defaults to no-pip. The upstream prefetch-ai-dynamo-wheel.sh script (called by srtctl when a recipe has `dynamo.wheel` set) does `python3 -m pip download`, which fails with "No module named pip" without a seeded venv. Adding --seed installs pip+setuptools+wheel into the venv so the prefetch path works. R4 of dsv4-fp4-gb300-dynamo-vllm-agentic showed this error on the gb300-cw runner immediately after the lockfile cleanup unblocked the import_squash step.

Both gb300-cw and gb300-nv launchers updated identically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R5 first-shard failure on gb300-nv runner: fatal: reference is not a tree: 127597c0e6d3c1b3ffd7ac02dd0fea2d2fd62f74 I extrapolated the 40-char SHA from a 7-char short `127597c` shown in git log output instead of resolving it. The real SHA is 127597c2926467db06e6707e0aa9227261c6c02a (NVIDIA/srt-slurm@main, "Update GB300 FP8 GLM-5 recipe (#160)"). R5's gb300-cw shards didn't immediately fail on the same error — either they hadn't reached the checkout step yet when I noticed, or their git was more lenient about the prefix-then-garbage SHA. Either way, the fixed SHA works for both. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… launcher

Two issues caught in R5:

1) dynamo-vllm worker rejects chat parser flags

The worker entrypoint (different argparser than `vllm serve`) errors:

    __main__.py: error: unrecognized arguments: --enable-auto-tool-choice --tool-call-parser deepseek_v4

These belong on the dynamo frontend, not the worker. In disagg, chat parsing happens at the frontend; workers just take tokens. The 8k1k sibling recipes (which work) don't set these either. I mistakenly ported them from the single-node launchers, which run `vllm serve` directly (the chat-serving entrypoint).

Drop --tool-call-parser, --enable-auto-tool-choice, --reasoning-parser from both prefill and decode blocks in both agentic recipes. Keep --tokenizer-mode deepseek_v4 (worker DOES accept that one).

2) launch_gb300-cw.sh was missing set -e

The fabricated SHA bug from the prior commit only surfaced on the nv launcher (which has set -exo pipefail). The cw launcher silently swallowed the failed `git checkout` and proceeded on origin/HEAD, which happened to be the right commit, masking the bug. Add `set -exo pipefail` to match the nv launcher; loud failures are safer than silent ones.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R6 surfaced via srtctl preflight that /scratch/models/DeepSeek-V4-Pro is not staged on the gb300-nv cluster: Error: Preflight failed for ...disagg-gb300-1p6d-dep4-tp4-agentic.yaml: - model.path: Model alias 'deepseek-v4-pro' resolved to '/scratch/models/DeepSeek-V4-Pro', but that path is unavailable. DSR1 weights ARE staged on /scratch (node-local SSD), but DSv4-Pro was never staged there. The 806 GB DSv4-Pro checkpoint lives at /home/sa-shared/models/DeepSeek-V4-Pro (NFS, shared across nodes). This silently broke the existing 8k1k fixed-seq-len path for dsv4-vllm on gb300-nv too (just hadn't been exercised against the stricter upstream srtctl preflight). Fix is single-file: re-point the DSv4 leg of the per-model conditional to the NFS path. NFS is slower than /scratch but that's where the model actually lives. Stage to /scratch and switch back if model load I/O becomes a bottleneck. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…S ELOOP R7 of dsv4-fp4-gb300-dynamo-vllm-agentic: Fatal error: Symlink loop from '/home/sa-shared/models/DeepSeek-V4-Pro' OSError: [Errno 40] Too many levels of symbolic links Same Vast NFS ELOOP bug we hit on the squash lockfiles in R3/R4: the /home/sa-shared/ NFS mount returns ELOOP to workflow worker processes (specifically those spawned through GHA runner pod -> sbatch -> pyxis/enroot), even though the same path is a regular directory from interactive sessions (verified via gb300-slurm + srun on c001 — both Path.resolve() and ls succeed cleanly). Workaround: /data/ and /home/sa-shared/ are SEPARATE mount points backed by the SAME storage (storage-vip.vast.p03.globalai.run, with /scratch and /scratch/home/sa-shared as the server-side paths). Switching MODEL_PATH to /data/home/sa-shared/models/DeepSeek-V4-Pro gives us identical files with a separate NFS client cache, which isn't poisoned in the workflow context. Doesn't fix the underlying Vast NFS bug — just routes around it. Long-term: stage DSv4-Pro to /scratch/models/ (node-local SSD) like DSR1, both for performance and to bypass this whole mount class. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R7 of dsv4-fp4-gb300-dynamo-vllm-agentic had 6/8 worker srun steps OOM-killed within 30s, with `torch.AcceleratorError: CUDA-capable device(s) is/are busy or unavailable` (CUDA init aborts when SIGKILL races it). sacct showed each worker step got AllocTRES mem=4G (empirically verified on CW: default sbatch w/ --gres=gpu:4 -> AllocTRES mem=4G; same sbatch w/ --mem=0 -> AllocTRES mem=868G). Root cause: srt-slurm's start_srun_process doesn't pass --mem on the container srun, so it gets cpus_per_task × DefMemPerCPU = 4 GB by default on clusters with positive DefMemPerCPU (CW gb300 has 4096). 4 GB is wildly insufficient for a vLLM worker mmap'ing multi-GB model weights and pinning CUDA buffers. Fix: re-point both gb300 launchers' IS_AGENTIC clone from upstream NVIDIA/srt-slurm@127597c to cquil11/srt-slurm-nv@cam/agentic-mem-0 (96c443a), which is the same upstream commit + a single patch adding `--mem 0` to start_srun_process when container_image is set. Long-term: PR the --mem=0 change upstream so we can drop the fork indirection for this feature class. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R9 hit the same Vast NFS ELOOP we fixed for the model path in R8, but this time on the squash lockfile: /usr/bin/bash: line 2: /home/sa-shared/gharunners/squash/<image>.sqsh.lock: Too many levels of symbolic links The /home/sa-shared/ NFS mount poisons lockfiles AND data files alike under the workflow worker NFS session. We applied the /data/ workaround for MODEL_PATH; now do the same for SQUASH_FILE + NGINX_SQUASH_FILE which were still pointing at the bad mount. Both /home/sa-shared/ and /data/ are mounted from the same Vast backing storage; same files, separate NFS client cache. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Earlier I patched srt-slurm's start_srun_process to default --mem=0 on container srun. That's the wrong layer — srtctl has a documented top-level recipe field `srun_options:` (see docs/config-reference.md#srun_options) that gets threaded straight through to the worker srun via mixins/worker_stage.py:235 (`srun_options=self.runtime.srun_options`) and start_srun_process line 248 (`for key, value in srun_options.items()`). Switch to that mechanism: - Add `srun_options: {mem: "0"}` to both agentic recipes - Revert both launchers from the cquil11 fork pin back to upstream NVIDIA/srt-slurm@127597c (the fork patch in cam/agentic-mem-0 is now redundant; leaving the branch around as a fallback but not pinned in the launcher) R9/R10 confirmed sacct still showed mem=4G per worker step despite the launcher cloning the patched fork — likely because srtctl's uv-sync inside the sbatch rebuilds the venv from pyproject.toml and the editable install from src/ doesn't include code modifications the way uv pip install -e . would. The recipe-level mechanism doesn't depend on patching srtctl at all so this whole class of "is the patch loaded?" question goes away. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R11 verified that srun_options.mem=0 IS now in the worker srun cmdline (confirmed via /proc/<pid>/cmdline on the head node). BUT sacct still showed AllocTRES mem=4G per step.

Why: the sbatch only requested `--ntasks=8` with no `--mem`, so the JOB allocation per node is bound to cpus_per_task × DefMemPerCPU = 1 × 4 GB = 4 GB. `--mem=0` on srun means "use ALL of what the JOB has on this node", and the job has 4 GB. There's nothing to grow into.

The other half of the fix is `sbatch_directives.mem=0` which emits `#SBATCH --mem=0` in the generated sbatch script (per src/srtctl/templates/job_script_minimal.j2:26), making SLURM allocate all available node memory (~868 GB on CW gb300) up front.

Both layers needed:

- sbatch_directives.mem=0 → JOB gets full node memory
- srun_options.mem=0 → each container srun step uses it (without this, srun defaults back to cpus_per_task × DefMemPerCPU = 4 GB)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
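The two-layer memory behavior can be written out as a small model. This is a deliberate simplification of what the commit messages report from sacct, not Slurm's allocation algorithm; `step_mem_gb` is a hypothetical name, and the defaults (868 GB node, 1 CPU per task, 4 GB per CPU) are the CW gb300 figures quoted above.

```python
def step_mem_gb(
    sbatch_mem0: bool,
    srun_mem0: bool,
    *,
    node_mem_gb: int = 868,
    cpus_per_task: int = 1,
    def_mem_per_cpu_gb: int = 4,
) -> int:
    """Estimate memory visible to one srun step under the described defaults."""
    default = cpus_per_task * def_mem_per_cpu_gb
    # Layer 1: the job allocation is the whole node only with `#SBATCH --mem=0`.
    job = node_mem_gb if sbatch_mem0 else default
    # Layer 2: a step takes the whole job allocation only with `srun --mem=0`.
    return job if srun_mem0 else min(job, default)


only_srun = step_mem_gb(sbatch_mem0=False, srun_mem0=True)
only_sbatch = step_mem_gb(sbatch_mem0=True, srun_mem0=False)
both = step_mem_gb(sbatch_mem0=True, srun_mem0=True)
```

Under this model, setting either layer alone leaves the step at 4 GB, which matches the R9 through R11 observations; only the combination reaches the full node allocation.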

…ation) R12 progressed past the memory layer (sbatch_directives.mem=0 from prior commit worked; sacct showed AllocTRES mem=868G per worker), but failed ~10 min in with etcd lease-keepalive `deadline exceeded` errors followed by every worker SIGKILL'd at 16:36:03. Root cause from infra.out: etcd reported `max-cpu-set: 1` at startup. SLURM's default cpus_per_task=1 starved single-CPU etcd under load from 24 concurrent dynamo DP rank lease keep-alives (16 prefill + 8 decode). etcd's gRPC handler couldn't process RPCs fast enough → cascading lease deadline exceeded → workers crashed → orchestrator cancelled job → infra step itself SIGKILL'd at 16:35:49 ("STEP 4572.2 ON slurm-gb300-138-249 CANCELLED ... DUE to SIGNAL Killed"). Fix: sbatch_directives.cpus-per-task=72 grants every task (including the GPU-less infra step) one CW gb300 NUMA socket. etcd now has plenty of compute; vLLM workers also get more aux CPU for tokenizer threads etc. Why cw needs this and nv doesn't: nv cluster's JobDefaults includes DefCpuPerGPU=35 → any task with --gres=gpu:N auto-gets 35*N CPUs (= 140 on a 4-GPU task). cw has no per-GPU default → tasks get cpus_per_task=1 by default. The infra step has no --gres flag at all so it's the worst case on cw. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
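The difference between the two clusters' CPU defaults can be sketched the same way. This models only what the commit message states (a per-GPU default on nv, none on cw) and ignores every other Slurm setting; `default_cpus` is a hypothetical name.

```python
from typing import Optional


def default_cpus(
    gpus: int,
    def_cpu_per_gpu: Optional[int] = None,
    cpus_per_task: int = 1,
) -> int:
    """Estimate the CPUs a task receives under the described cluster defaults."""
    # With a per-GPU default, a GPU task gets def_cpu_per_gpu CPUs per GPU.
    if def_cpu_per_gpu is not None and gpus > 0:
        return def_cpu_per_gpu * gpus
    # Otherwise the task falls back to cpus_per_task (1 unless overridden).
    return cpus_per_task


nv_worker = default_cpus(gpus=4, def_cpu_per_gpu=35)  # nv GPU task
cw_infra = default_cpus(gpus=0)                       # cw infra step, no --gres
cw_infra_fixed = default_cpus(gpus=0, cpus_per_task=72)
```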

Two changes:

1) Pin to NVIDIA cluster (drop CW)

The dsv4-fp4-gb300-dynamo-vllm-agentic runner field was `gb300`, which is the generic label both NV and CW runner pools advertise (per gh api runners). So shards landed on either cluster, which meant we kept debugging the same recipe path against two different cluster configs (NV's DefCpuPerGPU=35 vs CW's DefMemPerCPU=4096 with no per-GPU defaults). Switch to `runner: gb300-nv`, a label only the NV pool advertises. This matches just gb300-nv_0/1/2 going forward.

2) MODEL_PATH switched to /scratch/models/DeepSeek-V4-Pro

The node-local SSD on NV compute nodes. Faster than the /data/home/sa-shared NFS path (where DSv4-Pro currently lives).

Caveat: /scratch doesn't exist on the GHA runner pod, so srtctl preflight may fail with "Model alias resolved to ..., but that path is unavailable." We're trying this anyway to see whether the runner pod has /scratch mounted; if it errors, next step is to either (a) patch srt-slurm to add a `skip_model_preflight` recipe field or (b) stub a symlink on the runner pod.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The agentic recipe pins MODEL_PATH=/scratch/models/DeepSeek-V4-Pro (node-local NVMe on compute nodes). srtctl's _preflight_model runs in-process on whatever node invokes srtctl — the GHA runner pod, which doesn't have /scratch mounted — so it bails before sbatch with "Model alias 'deepseek-v4-pro' resolved to '/scratch/...', but that path is unavailable" (R14 hit this). Switch the IS_AGENTIC=1 clone target from NVIDIA/srt-slurm@127597c to cquil11/srt-slurm-nv@cam/no-preflight-flag (854b3fd), which adds one CLI flag — `srtctl apply --no-preflight` — that skips just the optional Python-level FS precheck. vLLM still fails loudly at runtime if the path is genuinely missing on the compute node. The flag is only passed when IS_AGENTIC=1. Fixed-seq-len recipes resolve model.path to an NFS path visible from the runner pod, where the precheck is a useful sanity guard, so leave enforcement on for them. Fork commit: cquil11/srt-slurm-nv@854b3fd Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Aiperf's content-addressed mmap dataset cache (~65 GB per dataset) needs to be persisted across runs so the first run of the day doesn't re-tokenize + re-write it on every shard. Same pattern as launch_h200-dgxc-slurm.sh, launch_b200-dgxc.sh, launch_mi355x-amds.sh.

Three layers wired:

1) Host paths (cluster-specific, created with 0777 so all gharunner_X SLURM users can write):
     gb300-nv  /data/home/sa-shared/gharunners/ai-perf-cache
     gb300-cw  /mnt/vast/ai-perf-cache
2) Both launchers export AIPERF_MMAP_CACHE_HOST_PATH and add a line to the generated srtslurm.yaml's default_mounts block. srt-slurm's runtime.py reads default_mounts via get_srtslurm_setting() and bind-mounts each entry into every worker container. cw already had a default_mounts block (for dynamo-wheels-cache); nv had none.
3) Both agentic recipes set AIPERF_DATASET_MMAP_CACHE_DIR=/aiperf_mmap_cache in benchmark.env so the aiperf process inside the container reads from the persistent mount instead of ~/.cache/aiperf/dataset_mmap.

Single-node launchers don't need updating: they have their own srun --container-mounts line that already bind-mounts the cache.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Brings in 45 commits from upstream/ajc/inferencex-agentx-mvp (PR #875):

- InferenceX AgentX-MVP scenario (default corpus switched to 051226 no-subagents 949-trace variant)
- semianalysis_cc_traces_weka_no_subagents HF loader
- Wrap-fill trajectory recycling + correlation-id double-recycle guard
- DAG benchmarks, reproducible payload replay, agentic_replay E2E test
- assorted dataset/timing fixes

Local commits preserved (no rebase). One docstring-only conflict in src/aiperf/dataset/loader/semianalysis_cc_traces_weka.py resolved by taking upstream's text (more comprehensive: documents both 042026 and 051226 variants).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

vllm/vllm-openai:v0.21.0-ubuntu2404 ships without git, but pip's editable install (-e) of utils/aiperf invokes `git version` to record direct_url.json provenance. Without git, every R16 shard on both gb300-nv and gb300-cw failed at:

    + python3 -m pip install --break-system-packages -q --ignore-installed -e /infmax-workspace/utils/aiperf
    ERROR: Error [Errno 2] No such file or directory: 'git' while executing command git version
    ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?

This happens AFTER server boot is healthy and "Server is healthy - starting benchmark" has fired, so all the upstream cluster/recipe work (preflight, mem=0 x2 layers, etcd cpus-per-task=72, --no-preflight, /scratch model path, NixlConnector P<->D, model load) is working end-to-end. Only the pip install step is blocked.

Fix: prepend a `command -v git || apt-get update && apt-get install -y git` to install_agentic_deps. Cheap no-op on images that already ship git (AMD images, custom containers). The vLLM image's apt is functional from inside the container so this works without container rebuild.

The -e install was introduced yesterday in e92a9bf (aiperf v0.2 migration); previously the agentic flow used kv-cache-tester which didn't need git.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…t containers

R17 surfaced two distinct failures, one per cluster:

1) gb300-cw (all 3 shards): aiperf rejected --public-dataset semianalysis_cc_traces_weka with "Scenario invariants violated ... required loader=any of ['semianalysis_cc_traces_weka_no_subagents', 'weka_trace']". Yesterday's aiperf merge (PR #875 commit fef78a96) switched the inferencex-agentx-mvp scenario's default corpus to the 051226 no-subagents 949-trace variant and tightened the loader contract. The old name is no longer accepted.

   Fix: resolve_trace_source emits --public-dataset semianalysis_cc_traces_weka_no_subagents.

2) gb300-nv (all 3 shards): "dpkg: error: requested operation requires superuser privilege" from yesterday's install_agentic_deps git install path. The gb300-nv pyxis/enroot setup maps the calling user (sa-shared) into the container as non-root, while gb300-cw runs as root. The git install needs sudo on nv; cw is fine without.

   Fix: branch on `id -u`: apt-get directly when root, sudo apt-get otherwise. The vllm-base layer installs `sudo` so the binary is available, and the typical enroot config grants the calling user passwordless sudo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

R17/R18 made it clear that there's no clean way to install git into the vllm/vllm-openai container at run-time on gb300-nv: - R16/R17: container ships without git -> pip's editable install of aiperf fails with "Cannot find command 'git'" - R18: tried `sudo apt-get install git`. gb300-nv pyxis/enroot remaps the calling user to uid=345200007 inside the container, and sudo refuses to run with "/usr/bin/sudo must be owned by uid 0 and have the setuid bit set" -- the setuid bit can't carry across user namespaces. cw container runs as root so sudo wasn't tripped there, but the right answer is one that works on both clusters. The actual fix is upstream from this entirely: drop `-e`. pip's editable install needs git only to record direct_url.json provenance; the non-editable install just builds a wheel via hatchling and copies into site-packages. aiperf's pyproject.toml pins version="0.8.0" rather than deriving it from git tags, so non-editable install works without git in any environment. We don't edit aiperf source mid-benchmark anyway -- loss of -e ergonomics is zero. `--ignore-installed` is still needed (handles the apt-managed-blinker distutils-uninstall pile-up) and is orthogonal to -e. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Drop the sudo/root-detection complexity from R18 and restore -e on the aiperf pip install. Per user direction. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The vllm/vllm-openai container ships without git; agentic_srt.sh needs to apt-get install it because pip's install of utils/aiperf calls `git version`. R17/R18/R19/R20 chased this on gb300-nv with various combinations of sudo / no-sudo / drop-e / etc., all failing because pyxis maps the calling user to uid 345200007 inside the container and dpkg's hardcoded geteuid()!=0 check rejects every attempt regardless of filesystem permissions. The cleanest fix is to ask pyxis to remap us to uid 0 inside the container, matching the gb300-cw behavior (where the container already runs as root and apt-get install works directly). pyxis exposes this as a per-srun flag: --container-remap-root. srt-slurm renders empty-string srun_options as flag-only srun args (see core/slurm.py:250 in NVIDIA/srt-slurm@127597c). No-op on gb300-cw (cw is already remapped to root by default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Picks up cquil11/srt-slurm-nv@6e34b8b which propagates srun_options through the benchmark_stage srun (previously only worker/frontend/telemetry stages honored them). Required for the recipe-level srun_options.container-remap-root: "" to apply to the benchmark.command container, the one that runs agentic_srt.sh + apt install git. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Picks up cquil11/aiperf@9b858ae which fixes PhaseRunner.cancel() to set all_credits_sent_event / all_credits_returned_event so the outer runner awaits wake immediately. Previously cancelled runs (e.g. via --failed-request-threshold) blocked for the full phase timeout (~1800s default) before reaching the graceful exit path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ncel) When a workflow run is cancelled mid-flight (gh run cancel, or UI cancel button), the launcher gets SIGTERM during its `tail -F` wait and exits before reaching the `tar czf .../multinode_server_logs.tar.gz` line in the main flow. The Upload server logs workflow step runs (it has if: always()) but finds no file (if-no-files-found: ignore silently skips), so the artifact never gets uploaded. Fix: install an EXIT trap right after JOB_ID extraction that produces the tarball on any exit path — normal completion, error, SIGTERM, SIGKILL of our parent. The main-flow tar block is now an idempotent no-op (kept for log narrative). Applied identically to both gb300-nv and gb300-cw launchers. The b200-dgxc launcher has the same pattern but its multi-node flow is currently only used by other configs; leaving it alone for now to avoid mixing unrelated changes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
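The launchers implement this with a bash EXIT trap. The same idea, an archiver that runs on every exit path and is safe to call twice, is sketched here in Python as an analog rather than a copy of the launcher code. `archive_logs_once` and `install_exit_archiver` are hypothetical names, and the tarball is assumed to live outside the log directory.

```python
import atexit
import signal
import sys
import tarfile
from pathlib import Path


def archive_logs_once(log_dir: Path, tarball: Path) -> bool:
    """Create the tarball unless it already exists; return True if it was created."""
    if tarball.exists():
        return False  # a later call (e.g. the main-flow block) becomes a no-op
    with tarfile.open(str(tarball), "w:gz") as tar:
        tar.add(str(log_dir), arcname=log_dir.name)
    return True


def install_exit_archiver(log_dir: Path, tarball: Path) -> None:
    """Run the archiver at interpreter exit, including after SIGTERM."""
    atexit.register(archive_logs_once, log_dir, tarball)
    # atexit handlers are skipped when the default SIGTERM action kills the
    # process, so convert SIGTERM into a normal exit with status 128 + 15.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(143))
```

As with a shell trap, nothing here can run if the process itself receives SIGKILL.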

gb300-nv 1p6d agentic runs hit ~15% errors at conc=32 from Dynamo NATS RPC deadline timeouts when the single prefill worker is saturated by 32 concurrent 50-100k token prefills. Each timeout returns HTTP 500 "Failed to generate completions: Prefill execution failed: ... NATS request to dynamo_prefill.generate-... failed: ... deadline has elapsed" — a real failure but driven by the single-prefill-worker capacity limit, not a regression. At the previous 0.05 threshold the run tripped its ProfileCancel mechanism early and produced no usable numbers. At 0.20 the run completes and we get steady-state metrics for the ~85% of requests that succeed; the underlying NATS saturation is a separate work item (Dynamo deadline tuning, or more prefill workers in the recipe, or both). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
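The effect of the threshold change can be illustrated with a minimal check. This is not aiperf's implementation: `should_cancel` is a hypothetical function, and comparing with a strict `>` is an assumption.

```python
def should_cancel(failed: int, total: int, threshold: float) -> bool:
    """Return True when the failed-request ratio exceeds the threshold."""
    if total == 0:
        return False
    return failed / total > threshold


# Roughly 15% errors, as reported for the conc=32 1p6d run.
trips_at_old_threshold = should_cancel(15, 100, 0.05)
trips_at_new_threshold = should_cancel(15, 100, 0.20)
```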

Agentic replay traces have a theoretical prefix-cache hit rate above 95% on every workload we benchmark; the realtime srv row only reads 0.0% because the launch script turns the SGLang RadixAttention cache off. Every server recipe in this directory set the cache-disabling flag, either as the only branch of an OFFLOADING=none case or as an unconditional launch-line flag, so the hit-rate number was never meaningful and the run was paying full prefill cost on every turn.

Removed unconditionally from: dsv4_fp4_mi355x_sglang, glm5.1_fp4_mi355x, glm5_fp8_b200, qwen3.5_bf16_b200, qwen3.5_fp8_b200, qwen3.5_fp8_mi355x.

Removed from the OFFLOADING=none branch of: qwen3.5_fp8_h100, qwen3.5_fp8_b300_sglang, qwen3.5_fp8_mi355x_sglang. Replaced with a short comment so the next person editing the `case` doesn't put it back.

OFFLOADING=none still means "no CPU/host offload"; the GPU RadixAttention cache stays on, which is the only sensible default for an agentic workload.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

Pulls in cjq/agentx-v0.3-subagents @ b2d047dd, which switches the realtime srv-row prefix_cache_hit_rate fallback from SGLang's per-batch `cache_hit_rate` gauge (reads 0 between requests) to the cumulative `cached_tokens_total` / `prompt_tokens_total` counter pair, matching vLLM's `hits/queries` shape. Also unlocks unique_input_tokens_srv on SGLang. Signed-off-by: Cam Quilici <cjquilici@gmail.com>
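The counter-pair fallback amounts to a ratio of two cumulative totals. The sketch below shows only that arithmetic under the counter names given above; `prefix_cache_hit_rate` is a hypothetical function, and returning 0.0 before any prompt tokens have been counted is an assumption.

```python
def prefix_cache_hit_rate(cached_tokens_total: float, prompt_tokens_total: float) -> float:
    """Cumulative hit rate: cached prompt tokens over all prompt tokens."""
    if prompt_tokens_total <= 0:
        return 0.0  # nothing served yet; avoid dividing by zero
    return cached_tokens_total / prompt_tokens_total


rate = prefix_cache_hit_rate(cached_tokens_total=950, prompt_tokens_total=1000)
```

Unlike a per-batch gauge, this ratio does not fall back to zero between requests, because both counters only grow.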

aiperf cea3b7e7 replaces _TraceIdleTiming.child_by_request_id's id(req)-based keying with a stable (session_id, idx) key, so the parallel reconstruction path's ProcessPoolExecutor pickle round-trip no longer breaks the lookup with KeyError. Unblocks every recipe that trips into the parallel reconstruction path -- most reliably the 256k-capped corpus (470 traces, around WEKA_PARALLEL_THRESHOLD) which caused 15/15 failures in InferenceX run 26554741458. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>
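The failure mode is easy to reproduce in isolation: `id()` is tied to one object in one process, so a pickle round-trip yields objects whose ids were never in the dictionary. The sketch below uses plain dicts as stand-ins for request objects; the field names and the `child-N` values are illustrative, not aiperf's data model.

```python
import pickle

# Stand-ins for request objects; the shapes are assumptions for illustration.
reqs = [{"session_id": "s1", "idx": 0}, {"session_id": "s1", "idx": 1}]

# Identity-based keys only work for these exact objects in this process.
by_identity = {id(r): f"child-{r['idx']}" for r in reqs}
# Value-based keys survive serialization.
by_stable_key = {(r["session_id"], r["idx"]): f"child-{r['idx']}" for r in reqs}

# A process pool pickles arguments and results, producing new objects.
round_tripped = pickle.loads(pickle.dumps(reqs))

missing = [r for r in round_tripped if id(r) not in by_identity]
found = [by_stable_key[(r["session_id"], r["idx"])] for r in round_tripped]
```

Both round-tripped requests miss the identity-keyed lookup, while the `(session_id, idx)` lookup finds both.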

cursor · 2026-05-28T15:13:36Z

+    kill "$tail_pid" 2>/dev/null || true
+    wait "$tail_pid" 2>/dev/null || true
+    cat "$LMCACHE_LOG" >&2 || true
+    exit 1


Duplicated LMCache helper functions across three scripts

Low Severity

cleanup_lmcache_server and wait_for_lmcache_ready are copy-pasted identically across kimik2.5_fp4_mi355x.sh, kimik2.5_fp4_b200.sh, and dsv4_fp4_b200_vllm.sh. These are non-trivial functions (~45 lines each instance) that belong in benchmark_lib.sh alongside the other shared agentic helpers like run_agentic_replay_and_write_outputs. Triplicating them increases the risk of inconsistent bug fixes.

Additional Locations (2)

benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh#L39-L85

benchmarks/single_node/agentic/dsv4_fp4_b200_vllm.sh#L59-L105

Reviewed by Cursor Bugbot for commit c00454e.

…agent skip

aiperf 666887ff makes _build_parallel_reconstruction_tasks skip child_plans whose subagent_index is in the dropped set, matching the serial path's existing filter at line ~1172. Pairs with cea3b7e7's id()->(session_id, idx) keying fix: that one made the parallel-path lookup correct for active subagents, this one prevents the lookup from running at all for dropped subagents (which were never in the timing dict). Without this, the qwen3.5-fp8-h100-sglang-agentic recipe (and any other recipe that crosses WEKA_PARALLEL_THRESHOLD) crashed with KeyError on the first dropped subagent -- see InferenceX run 26583416531.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
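A sketch of the filter described above, assuming plans are dicts and the timing table is keyed by a (session_id, subagent_index) tuple. `build_tasks` and both data shapes are hypothetical stand-ins for _build_parallel_reconstruction_tasks and its inputs.

```python
def build_tasks(child_plans, dropped, timing):
    """Build (key, timing) tasks, skipping subagents that were dropped."""
    tasks = []
    for plan in child_plans:
        if plan["subagent_index"] in dropped:
            continue  # dropped subagents were never added to the timing dict
        key = (plan["session_id"], plan["subagent_index"])
        tasks.append((key, timing[key]))
    return tasks


# Subagent 1 was dropped, so the timing dict has no entry for it.
timing = {("s1", 0): 1.5, ("s1", 2): 0.7}
plans = [{"session_id": "s1", "subagent_index": i} for i in range(3)]

tasks = build_tasks(plans, dropped={1}, timing=timing)
print(tasks)
```

Without the `continue`, the lookup for subagent 1 would raise KeyError, which matches the crash the message describes.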

aiperf 89d67bb4 makes acquire_cache_lock fast-path when the cache is already populated (entry/manifest.json exists). Prevents stale .lock files from a SIGKILLed populator from wedging every subsequent waiter on shared NFS -- see InferenceX run 26585006455 where 10+ jobs sat 14+ minutes printing 'Still waiting on mmap-cache populate lock' next to a complete 32 GB cache.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
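The fast path amounts to checking for the manifest before looking at the lock. The sketch below assumes a cache entry is a directory holding `manifest.json` and a `.lock` file; the helper names and the exact layout are hypothetical and not taken from aiperf.

```python
import tempfile
from pathlib import Path


def cache_is_populated(entry: Path) -> bool:
    """Treat the cache entry as complete once its manifest.json exists."""
    return (entry / "manifest.json").exists()


def should_wait_for_lock(entry: Path) -> bool:
    """Fast path: a populated cache never needs the populate lock."""
    if cache_is_populated(entry):
        return False
    return (entry / ".lock").exists()


with tempfile.TemporaryDirectory() as tmp:
    entry = Path(tmp)
    (entry / ".lock").touch()  # stale lock left behind by a killed populator
    waits_before = should_wait_for_lock(entry)
    (entry / "manifest.json").write_text("{}")
    waits_after = should_wait_for_lock(entry)

print(waits_before, waits_after)
```

Once the manifest is present the stale lock is ignored, so later jobs proceed straight to reading the cache.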

cursor · 2026-05-28T16:15:48Z

+    # --cuda-graph-max-bs "$CONC"
+    # --max-running-requests "$CONC"
+    # --max-prefill-tokens 8192
+    # --chunked-prefill-size 8192


Commented-out server parameters look like debug leftovers

Medium Severity

Four critical SGLang server parameters (--cuda-graph-max-bs, --max-running-requests, --max-prefill-tokens, --chunked-prefill-size) are commented out inside the SGLANG_CMD array. All sibling scripts (B200, B300, MI355X) set these explicitly. Without them, SGLang uses internal defaults that likely don't match the intended H100 tuning, potentially causing OOM or poor performance during benchmark runs.

Reviewed by Cursor Bugbot for commit 0e8ac92.

Conflict resolution policy: preserve main's behavior for non-agentic scenarios; PR has free rein over agentic scenarios.

YAML configs (.github/configs/{amd,nvidia}-master.yaml):
- Reverted to main verbatim for every recipe that exists on main (29 recipes total, 19 nvidia + 10 amd, that PR-branch had also modified). Main's image versions, search-spaces, and comments stay.
- Appended 9 net-new agentic recipes from PR-branch:
  nvidia: qwen3.5-fp8-h100-sglang-agentic, qwen3.5-fp8-b300-sglang-agentic-hicache, kimik2.5-fp4-b200-vllm-agentic-lmcache, dsv4-fp4-gb300-dynamo-vllm-agentic, dsv4-fp4-gb300-cw-dynamo-vllm-agentic
  amd: qwen3.5-fp8-mi355x-sglang-agentic-hicache, dsv4-fp4-mi355x-vllm-agentic, dsv4-fp4-mi355x-sglang-agentic, dsr1-fp4-mi355x-sglang-disagg-mtp

Auto-merged everywhere else. Notable shared-infra changes are plumbing-only (path reorg to single_node/fixed_seq_len/, launcher refactors, new mount paths for aiperf mmap cache); no main recipe perf path changes.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit fd91e1a.

cursor · 2026-06-02T16:49:18Z

+# cascade-cancel the other (see prior R20–R23 outages). The two sibling
+# configs share recipe files via the same launch_gb300-cw.sh IS_AGENTIC
+# overlay (recipes/vllm/deepseek-v4/agentic/), so a change to the recipe
+# applies to both clusters with no duplication.


Duplicate comment block in nvidia config file

Low Severity

The multi-line comment block explaining the CoreWeave sibling of dsv4-fp4-gb300-dynamo-vllm-agentic is duplicated verbatim. Lines 9806–9813 and 9816–9823 are identical, both describing why CW is a separate config. One copy is sufficient.

Reviewed by Cursor Bugbot for commit fd91e1a.

Adds the four scripts used to produce the semianalysisai/cc-traces-weka-with-subagents-* HuggingFace datasets (060226 + the older 052726 series):

utils/sample_proxy_traces.py
  Postgres -> per-session JSONLs. Applies the canonical filter stack (--min/--max-trace-version, --min-main-turns, --require-cli-min, --max-parallel-subagents, hardcoded image-content and classifier-call exclusions).

utils/build_weka_hf_dataset.py
  End-to-end orchestrator: sample -> proxy_to_weka.py -> traces.jsonl -> plots -> README -> HF upload. Optional 256k variant with per-request cap + timeline reshift (matches the existing -052726-256k README semantics). Idempotent via work-dir caching + --skip-* flags.

utils/plot_weka_distributions.py
  Main-agent stream distribution plots (ISL/OSL/think-time histograms, log + linear x).

utils/plot_subagent_distributions.py
  Sub-agent fan-out plots (groups/trace, inners/group, intra-group cache-hit rate, etc.).

proxy_to_weka.py was already tracked (committed earlier under eab58e9).

Signed-off-by: Cam Quilici <cameron@semianalysis.com>

… aiperf

The five scripts used to build the semianalysisai/cc-traces-weka-with-subagents-* HuggingFace datasets live in their own subdir now so they can grow without crowding utils/:

utils/agentic/sample_proxy_traces.py (Postgres -> per-session JSONLs)
utils/agentic/proxy_to_weka.py (proxy rows -> weka trace JSON, with dedup)
utils/agentic/build_weka_hf_dataset.py (orchestrator: sample -> convert -> plots -> README -> upload)
utils/agentic/plot_weka_distributions.py (main-agent histograms)
utils/agentic/plot_subagent_distributions.py (sub-agent fan-out plots)

Internal docstring/path references inside the moved scripts updated to point at the new utils/agentic/ paths so README footers + --help examples stay accurate.

Also bumps the aiperf submodule to the tip of cquil11/cjq/agentx-v0.3 (062a5de9), which includes the snapshot warmup planned-credit fix + pre-commit autofixes that landed since the previous pin (8473e154).

Signed-off-by: Cam Quilici <cameron@semianalysis.com>

+        seed = str(args.seed)
+        rows = sorted(
+            rows,
+            key=lambda r: hashlib.md5((r["session_id"] + seed).encode("utf-8")).hexdigest(),
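The sort key in this diff gives a reproducible pseudo-shuffle: rows are ordered by the MD5 hex digest of `session_id + seed`. A self-contained sketch follows, with a hypothetical `seeded_order` wrapper and made-up session IDs standing in for the surrounding code in sample_proxy_traces.py.

```python
import hashlib


def seeded_order(rows, seed):
    """Order rows by the MD5 hex digest of session_id + seed."""
    s = str(seed)
    return sorted(
        rows,
        key=lambda r: hashlib.md5((r["session_id"] + s).encode("utf-8")).hexdigest(),
    )


# Made-up session IDs for illustration only.
rows = [{"session_id": f"sess-{i}"} for i in range(5)]

first = seeded_order(rows, seed=0)
second = seeded_order(list(reversed(rows)), seed=0)
print(first == second)
```

Because the key depends only on the row contents and the seed, the result does not depend on the order in which the database returned the rows. MD5 serves here as a mixing function, not as a security primitive.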


The aiperf submodule's 175 in-flight commits (agentic / observability / hicache / counter-pair / scenario work) were previously hosted on cquil11/aiperf as a personal fork branch (cjq/agentx-v0.3-subagents). Pushed the same SHA chain to SemiAnalysisAI/aiperf as the cjq/agentx-v0.3 branch and switched the submodule's source-of-truth URL accordingly. The pinned commit SHA (062a5de9) is unchanged: same content, same blob, just reachable via the org-owned remote now.

Run `git submodule sync utils/aiperf` after pulling to refresh cached remote URLs in existing checkouts.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>

The previous pin 062a5de9 (set by #1571 "chore: agentx v0.3") was the cjq/agentx-v0.3 tip on 2026-06-02, but that branch was later rebased/force-pushed (now at ff2b646c), which orphaned 062a5de9; GitHub has since garbage-collected it. It is now unfetchable ("upload-pack: not our ref") and absent from every CI runner cache, so actions/checkout fails on any cold runner with "Unable to find current revision in submodule path utils/aiperf" (e.g. the newly-added gb300-cw runner-4, run 27453693856).

Re-pin to the current cjq/agentx-v0.3 tip, the branch .gitmodules already declares, which is live/fetchable and contains the prior aiperf history as an ancestor. This makes the pin and the declared branch consistent again.

* feat: MiniMax-M3 MXFP8 full sweep config for GB300

Add minimaxm3-fp8-gb300-dynamo-vllm to nvidia-master.yaml with 7 topologies covering the full concurrency range:
- TP4/TP8 (low latency, conc 4-64)
- TP4+EP4 agg + 1P+1D disagg 2-node + 1P+1D collocated (mid, conc 64-512)
- DEP4/DEP8 (high throughput, conc 256-2048)

All recipe YAMLs included under minimax-m3-gb300-fp8/{1k1k,8k1k}/. GB300 recipes include srun_options mem=0 (CW DefMemPerCPU cgroup fix) and omit safetensors-load-strategy prefetch (host-memory limit).

* chore: update perf-changelog pr-link to #1735

* Update runner name in nvidia-master.yaml

* fix: add sbatch_directives mem=0 + cpus-per-task=72 to M3 GB300 recipes

srun_options.mem=0 only grants a step the job's existing allocation; on gb300-cw (DefMemPerCPU=4096, no DefCpuPerGPU) the job itself was only allocated 4 GB/node and workers were cgroup-OOM-killed during engine init (run 27452273567: oom_kill in StepId=7409.7 on slurm-gb300-133-193, worker RLIMIT showed 4194304 KB). The canary passed because it landed on gb300-nv, which doesn't enforce the cap. Mirrors the sbatch_directives block of the DSV4 agentic recipes.

* fix: run M3 GB300 workers cache-only (HF_HUB_OFFLINE=1) to avoid fetch_model lock race

With the mem fix in place, run 27452976271 cleared the OOM but hit a new failure: both nodes of the TP8-2n job called dynamo fetch_model within 200ms (191 @ :23.637, 193 @ :23.833). Node 191 took the per-blob .lock on the shared /mnt/vast/hf-home cache and held it while verifying the 444 GB snapshot; node 193 retried for ~6.4s and died with 'Lock acquisition failed' (dynamo's rust hub doesn't wait like Python hf_hub). The launcher already pre-stages and verifies the snapshot offline before submit, so the workers never need to fetch. Setting HF_HUB_OFFLINE=1 in every worker env block makes dynamo serve cache-only and skip the download lock entirely, so co-fetching workers no longer collide. Applied to all agg + disagg (prefill/decode) env blocks across the 11 recipes.
* fix: re-pin utils/aiperf to live cjq/agentx-v0.3 tip (ff2b646c)

The previous pin 062a5de9 (set by #1571 "chore: agentx v0.3") was the cjq/agentx-v0.3 tip on 2026-06-02, but that branch was later rebased/force-pushed (now at ff2b646c), which orphaned 062a5de9; GitHub has since garbage-collected it. It is now unfetchable ("upload-pack: not our ref") and absent from every CI runner cache, so actions/checkout fails on any cold runner with "Unable to find current revision in submodule path utils/aiperf" (e.g. the newly-added gb300-cw runner-4, run 27453693856). Re-pin to the current cjq/agentx-v0.3 tip, the branch .gitmodules already declares, which is live/fetchable and contains the prior aiperf history as an ancestor. This makes the pin and the declared branch consistent again.

* MiniMax-M3 GB300: disagg-only sweep + multi-node-NVLink KV transfer

Replace the aggregated M3 GB300 topologies with disaggregated-only, and enable NixlConnector KV transfer over multi-node NVLink on every disagg recipe. On gb300-cw the cross-node prefill->decode KV handoff was silently falling back to RDMA/TCP (~268 MB/s, ~1400 tiny descriptors for M3 MSA cache), which was the disagg ceiling. Setting UCX_CUDA_IPC_ENABLE_MNNVL=y plus --enable-cumem-allocator (VMM-registers KV so NIXL uses cuda_ipc across the NVL fabric) lifts it to ~1.4-1.7 GB/s and gives +17% / +23% / +49% out tok/s/gpu at conc 64 / 128 / 256 (jobs 7490 base vs 7493 MNNVL, 1P1D TP4EP4). This is a GB300-only win: B300 8-GPU IB islands cannot move KV over multi-node NVLink.

Sweep (1k1k), all MNNVL:
- 1P1D TP4+EP4 collocated 1n (8 GPU), conc 8-256 - low/mid latency
- 1P1D TP4+EP4 split 2n (8 GPU), conc 64-512 - mid throughput
- 1P + DP16+EP wide decode 5n (20 GPU), conc 512-2048 - max throughput (decode keeps scaling on NVL where 1P1D saturates: ~1213 vs ~810 out tok/s/gpu @ conc 1024)

Removes all agg-gb300 recipes (1k1k + 8k1k); applies MNNVL to the 8k1k disagg recipe too for consistency.
* M3 GB300: add 8k1k disagg sweep; drop unschedulable collocated-1n

The collocated-1n topology (disagg-gb300-1p1d-tp4ep4-1n) declared gpus_per_node: 8, but gb300-cw nodes have 4 GPUs, so sbatch rejects it with "Requested node configuration is not available" even on a fully idle cluster (confirmed: fails standalone with 28 nodes free; the split-2n and wide-decode at gpus_per_node 4 schedule fine). It was an 8-GPU-node template artifact that never reached sbatch before. Remove it (1k1k + 8k1k) and let the split-2n cover the low-latency end (conc extended down to 8).

Add the 8k1k (isl 8192) scenario mirroring 1k1k with the two valid disagg shapes (split-2n + wide DP16 decode), MNNVL KV transfer on both, seq params retuned for long context (max-model-len 9472) and lower concurrency.

* M3 GB300: add rack-saturating balanced-ratio TP-ep1 max-throughput disagg config

Adds a 17-node (full-rack) disagg topology to the M3 GB300 sweep (1k1k + 8k1k) from on-cluster tuning (gb300-cw):
- PREFILL is the binding bottleneck, not decode width or KV transfer: a single prefill worker left ~3967 reqs queued and starved 64 decode GPUs. Balancing to 5 prefill : 12 decode (TP4) cleared the backlog and lifted throughput +57% (535 -> 843 out tok/s/gpu @ conc 2048).
- TP-only decode (ep1, no expert parallelism) per the Qwen3.5-397B-A17B recipes (closest M3 analog); M3 wide-EP/DP-attention all-to-all was slower and DP32 < DP16 per-GPU.
- Kept the existing 1p1d (low/mid latency) and dep16dec (wide-decode) topologies so CI measures the full Pareto rather than replacing them.

NixlConnector KV transfer stays on multi-node NVLink (MNNVL + cumem); note KV transfer was verified NOT to bottleneck throughput (doubling its bandwidth via num_threads changed end-to-end tok/s/gpu by ~0).

recipe yamls line up 1:1 with the nvidia-master.yaml CONFIG_FILE references.
* M3 GB300: replace dep16dec with 1P4D TP4-ep1; add prefill-heavy 10P7D for 8k1k

DSR1 GB300 patterns show wide-EP decode hurts M3's MoE all-to-all; independent TP4 decode workers are strictly better. Also, 8k1k is prefill-bound (616-req backlog at 5P:12D), so rebalance to 10P:7D per DSR1/DSV4's prefill-heavy long-context ratios.

Changes:
- Replace dep16dec (EP16 single decode) with 1P+4D (4x TP4 ep1 decode) for both 1k1k and 8k1k, same 5 nodes
- Add 10P+7D TP4 ep1 (17 nodes) for 8k1k max throughput
- Tighten concurrency ranges: 1P1D [4-32], 1P4D [64-512], 5P12D/10P7D [1024+]

* [Klaud Cold] minimaxm3-fp8-mi300x-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) MI300X recipe (#1749)

* minimaxm3-fp8-mi300x-vllm-mtp: day-zero MiniMax-M3 EAGLE3 MI300X recipe

Adds the spec-decoding=mtp sibling of minimaxm3-fp8-mi300x-vllm, based on the MI300X non-MTP recipe + the MI355X MTP recipe. Keeps the MI300X serve shape (BF16 KV cache, since gfx942 lacks calibrated ROCm FP8 attention scales, plus --no-enable-prefix-caching, TRITON_ATTN, --enforce-eager, minimax_m3 parsers) and adds the Inferact/MiniMax-M3-EAGLE3 draft via --speculative-config (method eagle3, 3 spec tokens) + chat-template prompts.

Carries the same in-place EAGLE3 patch as the MI355X MTP recipe: the shipped ROCm image's AMD MiniMax-M3 model lacks SupportsEagle3, so the recipe patches the installed amd/model.py before serving (functionstackx/vllm#1, upstream vllm-project/vllm#45546; validated green on MI355X). Idempotent; hard-fails on base drift.

TP8-only search space (gfx942 192 GB is memory-tight, like H100), TP8 latency rows started at conc 1, matching the H100/MI355X MTP recipes. Also adds SPEC_SUFFIX to launch_mi300x-amds.sh so spec-decoding=mtp routes to the _mtp script (the launcher hardcoded _mi300x.sh).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* perf-changelog: fill in PR link for minimaxm3-fp8-mi300x-vllm-mtp (#1749)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

* [AMD] perf: enable MiniMax M3 CUDA graphs on MI300X (#1750)

* feat: add MiniMax M3 MI300X day-zero benchmark
* chore: link MiniMax M3 MI300X changelog
* fix: mount ROCm devices on MI300X
* fix: disable prefix caching for MI300X MiniMax M3
* fix: use bf16 kv cache for MI300X MiniMax M3
* perf: enable MI300X MiniMax M3 CUDA graphs
* chore: link MI300X CUDA graph changelog

* [Klaud Cold] minimaxm3-fp8-mi300x-vllm-mtp: run with CUDA graphs (drop --enforce-eager, VLLM_USE_BREAKABLE_CUDAGRAPH=0) (#1756)

* minimaxm3-fp8-mi300x-vllm-mtp: run with CUDA graphs (drop --enforce-eager)

Remove --enforce-eager from the MI300X EAGLE3 MTP recipe and set VLLM_USE_BREAKABLE_CUDAGRAPH=0, matching the non-MTP MI300X recipe (#1750). Avoids the M3-decode breakable-cudagraph path that previously forced eager execution. Re-sweeps minimaxm3-fp8-mi300x-vllm-mtp.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* perf-changelog: fill in PR link for minimaxm3-fp8-mi300x-vllm-mtp cudagraphs

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

* M3 GB300: drop dominated configs, restore 1P1D full range

Data from run 27489709722 showed:
- 1P4D (20 GPU) strictly dominated by 1P1D (8 GPU): 320 vs 974 out/s/gpu @ conc 128 (1k1k). Single prefill can't feed 4 decode workers; the 1P:4D ratio is too decode-heavy.
- 8k1k 5P12D (68 GPU) dominated by 10P7D: 567 vs 874 out/s/gpu @ conc 1024. Prefill-heavy ratio is correct for long context.
Changes:
- Remove 1P4D recipes (both 1k1k and 8k1k)
- Remove 8k1k 5P12D recipe (dominated by 10P7D)
- Restore 1P1D to full concurrency range [8-512] 1k1k, [8-256] 8k1k (was truncated to [4-32] to avoid 1P4D overlap)

Final GB300 configs: 1P1D (latency-to-mid) + rack-saturating (max tput)
1k1k: 1P1D [8-512] + 5P12D [2048-8192]
8k1k: 1P1D [8-256] + 10P7D [1024-4096]

* M3 GB300 disagg: add DSV4-level decode optimizations

Port decode optimizations from DSV4 GB300 disagg reference configs to all 4 M3 GB300 recipe files:
- fp8 KV cache (2x decode slot capacity vs bf16)
- max-num-seqs/max-num-batched-tokens 256→512
- CUDA graph compilation (FULL_DECODE_ONLY mode)
- NCCL MNNVL env vars (CUMEM_ENABLE, MNNVL_ENABLE, NVLS_ENABLE)
- enable-ep-weight-filter + no-disable-hybrid-kv-cache-manager
- stream-interval 32→50 on decode

* Switch GB300 M3 recipes to nightly-aarch64 + add Marlin MoE for TP-only workers

- All 4 recipes: container vllm/vllm-openai:minimax-m3 → nightly-aarch64 (contains upstream head_ratio fix vllm#45879, avoids gemm1_alpha crash)
- TP-only recipes (5p12d-tp4ep1, 10p7d-tp4ep1): add moe-backend: marlin for both prefill and decode workers per PR #1809 pattern
- EP recipes (1p1d-tp4ep4): no Marlin (EP enabled)
- nvidia-master.yaml: update image, comment out 1k1k (run 8k1k only)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: switch GB300 M3 runner from gb300-cw to gb300-nv

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add minimaxm3-fp8 to gb300-nv launcher + switch recipes to alias-based model path

- Add minimaxm3 fp8 case to launch_gb300-nv.sh (MODEL_PATH, srt-slurm clone)
- Switch recipe model.path from hf:MiniMaxAI/MiniMax-M3-MXFP8 to minimax-m3-mxfp8 (alias resolved via srtslurm.yaml model_paths, matching GB200 pattern)
- Remove __M3_HF_HOME__ placeholder (extra_mount, HF_HOME, HF_HUB_OFFLINE)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: redesign GB300 M3 recipes - DEP8 prefill, TEP8/TP8/DEP8 decode

All prefill workers switched to DEP8 (TP1 DP8 EP, 8 GPU, 2 nodes). Low conc (<128): two decode variants, TEP8 (TP8+EP8) and TP8+Marlin. High conc (128+): DEP8 decode, 2P+7D = 18 nodes. TP8 decode (not TP4) to avoid Marlin OOM seen on previous run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: TEP4 prefill + B300-optimal decode for GB300 M3 disagg

Switch all prefill from DEP8 (TP1 DP8 EP, 2 nodes) to TEP4 (TP4+EP4, 1 node), halving per-worker node footprint. Decode configs follow B300 run 27630519240 optimal points (spec=none):
- conc 8-32: TP4+Marlin (no EP)
- conc 64-256: TEP4 (TP4+EP4)
- conc 512/1024: TEP8 (8k1k) or DEP8 (1k1k), max 2 workers × 6n

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: adapt NV B300 PR #1863 disagg configs for GB300 M3 sweep

Replace TEP4 prefill + B300-optimal decode recipes with NV's PR #1863 B300 dynamo-vllm disagg search matrix, adapted for GB300 NVL72 (4 GPU/node):
- All prefill switched to DEP2 (TP1 DP2 EP, 2 GPU/worker); the lighter per-worker footprint allows more prefill workers
- Decode types: TP4+Marlin, TEP8, DEP8, DEP4
- 4p3d (3 decode workers) skipped
- 15 recipe files: 8 for 8k1k, 7 for 1k1k (both ISLs active)
- PR 1863 vllm_config values (max-num-seqs up to 4096, max-cudagraph-capture-size up to 8192, max-num-batched-tokens 16384)
- Prefill uses cudagraph (max-cudagraph-capture-size: 2048) instead of enforce-eager
- kv-cache-dtype: fp8, req_rate: inf for all benchmarks
- GB300 MNNVL/NVLS env vars + sbatch mem=0 preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: reduce GB300 DEP CUDA graph capture sizes

---------

Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Cameron Quilici <cjquilici@gmail.com>

cquil11 and others added 30 commits May 17, 2026 15:50

dsv4-fp4-b200-vllm-agentic: bump image to cquil v0.21.0 custom build

9996180

Test SimpleCPUOffloadConnector lazy_offload behavior on a newer vLLM than the default v0.20.0-cu130. Image: cquil/vllm-openai:v0.21.0-8813c92. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

agentic: simplify git install to bare apt-get update && install; keep -e

ea13e41

Drop the sudo/root-detection complexity from R18 and restore -e on the aiperf pip install. Per user direction. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cursor Bot reviewed May 27, 2026

Comment thread .github/configs/nvidia-master.yaml Outdated

cquil11 added 5 commits May 27, 2026 18:11

testing qwen

842a0cf

testing qwen

5d10625

testing qwen

717385a

testing qwen

6a77acb

cursor Bot reviewed May 28, 2026

Comment thread benchmarks/single_node/agentic/qwen3.5_fp8_h100.sh

cursor Bot reviewed May 28, 2026

Comment thread .github/configs/amd-master.yaml Outdated

cquil11 and others added 2 commits May 28, 2026 11:06

testing qwen

0e8ac92

cursor Bot reviewed May 28, 2026

cquil11 and others added 2 commits May 28, 2026 12:37

chore(aiperf): bump submodule for snapshot warmup fix

57fdef7

cursor Bot reviewed Jun 2, 2026

github-advanced-security AI found potential problems Jun 2, 2026

Comment thread utils/agentic/sample_proxy_traces.py Fixed

github-advanced-security AI found potential problems Jun 2, 2026

Comment thread utils/agentic/sample_proxy_traces.py

seed = str(args.seed)
rows = sorted(
    rows,
    key=lambda r: hashlib.md5((r["session_id"] + seed).encode("utf-8")).hexdigest(),

cquil11 changed the title ~~[WIP] Chore/agentx v0.3~~ chore: agentx v0.3 Jun 2, 2026

cquil11 merged commit 1b23499 into main Jun 2, 2026
5 of 6 checks passed

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 2, 2026

cquil11 deleted the chore/agentx-v0.3 branch June 2, 2026 17:16

functionstackx restored the chore/agentx-v0.3 branch June 2, 2026 22:53

cquil11 merged 152 commits into main from chore/agentx-v0.3

4 participants
