MiniMax-M3 MXFP8 full sweep config for GB300 by Oseltamivir · Pull Request #1735 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-06-13T01:06:19Z

Summary

Add minimaxm3-fp8-gb300-dynamo-vllm to nvidia-master.yaml with 7 topologies: TP4, TP8, TP4+EP4, 1P+1D disagg (2-node), 1P+1D collocated (1-node), DEP4, DEP8
All GB300 recipes include sbatch_directives: mem: "0" / cpus-per-task: "72" plus srun_options: mem: "0" (CW DefMemPerCPU=4096 cgroup fix — step-level mem=0 alone only grants what the job allocation already holds) and omit safetensors prefetch (host-memory limit)
All recipe YAMLs included under minimax-m3-gb300-fp8/{1k1k,8k1k}/
Concurrency sweep: TP 4-64, TEP 128-512, disagg 64-512, DEP4 256-1024, DEP8 512-2048

Test plan

GB300 disagg canary passed: run 27449360195 (landed on gb300-nv, so it did not exercise the CW cgroup cap)
CW cgroup fix (sbatch_directives mem=0): full-sweep run 27452273567 OOMed on gb300-cw before the fix; needs a re-run that lands on gb300-cw
CW cgroup fix verified: run 27452976271 cleared the OOM on gb300-cw (generated sbatch shows --mem=0 --cpus-per-task=72)
HF cache lock-race fix (HF_HUB_OFFLINE=1 in worker env): run 27452976271 then failed because both TP8-2n nodes raced on the shared HF blob .lock (Lock acquisition failed); workers now run cache-only against the launcher-pre-staged snapshot — needs a re-run on gb300-cw to confirm
Full sweep dispatched: TBD after merge

Note

Medium Risk
Large benchmark/CI surface area and multinode Slurm recipes with real GPU cost; workflow now injects an HF token into job env (scoped secret, but still sensitive operational change).

Overview
Adds MiniMax-M3 MXFP8 multinode disaggregated benchmarking on GB300 via a new minimaxm3-fp8-gb300-dynamo-vllm block in nvidia-master.yaml, with prefill DEP2 and decode variants (TP4+Marlin, TEP8, DEP4, DEP8) at 1k/1k and 8k/1k, each pointing at new Slurm recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3-gb300-fp8/. Recipes use fp8 KV cache, Nixl disagg, GB300 MNNVL/NCCL env, mem: "0" sbatch/srun cgroup settings, and vllm/vllm-openai:nightly-aarch64.

Wires the runner: launch_gb300-nv.sh resolves minimaxm3 + fp8 model paths and copies the new recipe tree into srt-slurm for dynamo-vllm. benchmark-multinode-tmpl.yml exports HF_TOKEN from a repo secret so Slurm workers can pull large Hub snapshots without anonymous rate limits.

Documents DEP CUDA-graph capture OOM tuning in KLAUD_DEBUG.md and records the config in perf-changelog.yaml.

Reviewed by Cursor Bugbot for commit cad3e01. Bugbot is set up for automated code reviews on this repo.

Add minimaxm3-fp8-gb300-dynamo-vllm to nvidia-master.yaml with 7 topologies covering the full concurrency range:
- TP4/TP8 (low latency, conc 4-64)
- TP4+EP4 agg + 1P+1D disagg 2-node + 1P+1D collocated (mid, conc 64-512)
- DEP4/DEP8 (high throughput, conc 256-2048)

All recipe YAMLs included under minimax-m3-gb300-fp8/{1k1k,8k1k}/. GB300 recipes include srun_options mem=0 (CW DefMemPerCPU cgroup fix) and omit safetensors-load-strategy prefetch (host-memory limit).

github-actions · 2026-06-13T01:06:27Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-06-13T01:19:41Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27452223695
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27452223695

github-actions · 2026-06-13T01:35:06Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27452273567
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27452273567

srun_options.mem=0 only grants a step the job's existing allocation; on gb300-cw (DefMemPerCPU=4096, no DefCpuPerGPU) the job itself was only allocated 4 GB/node and workers were cgroup-OOM-killed during engine init (run 27452273567: oom_kill in StepId=7409.7 on slurm-gb300-133-193, worker RLIMIT showed 4194304 KB). The canary passed because it landed on gb300-nv, which doesn't enforce the cap. Mirrors the sbatch_directives block of the DSV4 agentic recipes.
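The 4 GB cap matches Slurm's default-memory arithmetic. The following is a minimal sketch, assuming the job was granted a single CPU per node when it requested none (the source states only DefMemPerCPU=4096 and the resulting 4 GB/node). The helper name is hypothetical.

```python
# Hypothetical helper: per-node memory a job receives when it does not
# request memory explicitly and the partition's DefMemPerCPU applies.
def default_mem_kb(def_mem_per_cpu_mb: int, cpus_allocated: int) -> int:
    """Return the default per-node memory grant in KB."""
    return def_mem_per_cpu_mb * cpus_allocated * 1024


# Assumption: one CPU allocated per node because no CPU count was requested.
cap_kb = default_mem_kb(def_mem_per_cpu_mb=4096, cpus_allocated=1)
print(cap_kb)  # 4194304, the RLIMIT value the worker reported
```

Requesting mem=0 at the sbatch level asks Slurm for all memory on each node, so the job allocation no longer derives from this product.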

github-actions · 2026-06-13T01:54:37Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27452976271
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27452976271

…h_model lock race

With the mem fix in place, run 27452976271 cleared the OOM but hit a new failure: both nodes of the TP8-2n job called dynamo fetch_model within 200 ms of each other (191 @ :23.637, 193 @ :23.833). Node 191 took the per-blob .lock on the shared /mnt/vast/hf-home cache and held it while verifying the 444 GB snapshot; node 193 retried for ~6.4 s and died with 'Lock acquisition failed' (dynamo's Rust hub client doesn't wait like Python hf_hub). The launcher already pre-stages and verifies the snapshot offline before submit, so the workers never need to fetch. Setting HF_HUB_OFFLINE=1 in every worker env block makes dynamo serve cache-only and skip the download lock entirely, so co-fetching workers no longer collide. Applied to all agg + disagg (prefill/decode) env blocks across the 11 recipes.
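The change itself lives in the recipe YAMLs. As an illustration only, the "set it in every worker env block" step could be sketched as below, where each env block is modelled as a plain dict. The function name and the dict layout are hypothetical and do not reflect the real srt-slurm recipe schema.

```python
# Hypothetical model of worker env blocks; not the actual recipe schema.
def force_offline(env_blocks):
    """Return copies of each worker env block with HF_HUB_OFFLINE=1 set."""
    return [{**env, "HF_HUB_OFFLINE": "1"} for env in env_blocks]


prefill_env = {"HF_HOME": "/mnt/vast/hf-home"}
decode_env = {"HF_HOME": "/mnt/vast/hf-home", "HF_HUB_OFFLINE": "0"}

patched = force_offline([prefill_env, decode_env])
print(all(env["HF_HUB_OFFLINE"] == "1" for env in patched))  # True
```

Building new dicts rather than mutating the inputs keeps the original blocks available for comparison.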

github-actions · 2026-06-13T02:07:56Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27453434847
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27453434847

github-actions · 2026-06-13T02:29:31Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27453693856
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27453693856

github-actions · 2026-06-13T04:01:42Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27453693856
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27453693856

The previous pin 062a5de9 (set by #1571 "chore: agentx v0.3") was the cjq/agentx-v0.3 tip on 2026-06-02, but that branch was later rebased/force-pushed (now at ff2b646c), which orphaned 062a5de9; GitHub has since garbage-collected it. It is now unfetchable ("upload-pack: not our ref") and absent from every CI runner cache, so actions/checkout fails on any cold runner with "Unable to find current revision in submodule path utils/aiperf" (e.g. the newly-added gb300-cw runner-4, run 27453693856). Re-pin to the current tip of cjq/agentx-v0.3, the branch that .gitmodules already declares. That tip is live/fetchable and contains the prior aiperf history as an ancestor, so the pin and the declared branch are consistent again.

github-actions · 2026-06-13T04:56:47Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27453693856
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27453693856

github-actions · 2026-06-13T05:05:39Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27457134583
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27457134583

github-actions · 2026-06-13T07:33:16Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27457134583
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27457134583

Replace the aggregated M3 GB300 topologies with disaggregated-only, and enable NixlConnector KV transfer over multi-node NVLink on every disagg recipe.

On gb300-cw the cross-node prefill->decode KV handoff was silently falling back to RDMA/TCP (~268 MB/s, ~1400 tiny descriptors for M3 MSA cache), which was the disagg ceiling. Setting UCX_CUDA_IPC_ENABLE_MNNVL=y plus --enable-cumem-allocator (VMM-registers KV so NIXL uses cuda_ipc across the NVL fabric) lifts it to ~1.4-1.7 GB/s and gives +17% / +23% / +49% out tok/s/gpu at conc 64 / 128 / 256 (jobs 7490 base vs 7493 MNNVL, 1P1D TP4EP4). This is a GB300-only win: B300 8-GPU IB islands cannot move KV over multi-node NVLink.

Sweep (1k1k), all MNNVL:
- 1P1D TP4+EP4 collocated 1n (8 GPU), conc 8-256 - low/mid latency
- 1P1D TP4+EP4 split 2n (8 GPU), conc 64-512 - mid throughput
- 1P + DP16+EP wide decode 5n (20 GPU), conc 512-2048 - max throughput (decode keeps scaling on NVL where 1P1D saturates: ~1213 vs ~810 out tok/s/gpu @ conc 1024)

Removes all agg-gb300 recipes (1k1k + 8k1k); applies MNNVL to the 8k1k disagg recipe too for consistency.
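For scale, the quoted figures can be turned into ratios. This is a rough sketch using only the approximate numbers given above, assuming 1 GB/s = 1000 MB/s; the function names are hypothetical.

```python
def speedup(before: float, after: float) -> float:
    """Ratio of the new rate to the old rate."""
    return after / before


def percent_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


# KV-transfer bandwidth: RDMA/TCP fallback vs MNNVL (MB/s, approximate).
print(round(speedup(268, 1400), 1))  # 5.2
print(round(speedup(268, 1700), 1))  # 6.3

# Wide decode vs 1P1D at conc 1024 (out tok/s/gpu, approximate).
print(round(percent_gain(810, 1213)))  # 50
```

The KV handoff thus gets roughly 5-6x faster, and wide decode delivers about 50% more output tokens per GPU than 1P1D at conc 1024.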

github-actions · 2026-06-13T21:15:20Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27479316691
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27479316691

Oseltamivir · 2026-06-14T08:56:26Z

/run_sweep

github-actions · 2026-06-14T10:42:02Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27493886226
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27493886226

…-sweep

Port decode optimizations from DSV4 GB300 disagg reference configs to all 4 M3 GB300 recipe files:
- fp8 KV cache (2x decode slot capacity vs bf16)
- max-num-seqs/max-num-batched-tokens 256→512
- CUDA graph compilation (FULL_DECODE_ONLY mode)
- NCCL MNNVL env vars (CUMEM_ENABLE, MNNVL_ENABLE, NVLS_ENABLE)
- enable-ep-weight-filter + no-disable-hybrid-kv-cache-manager
- stream-interval 32→50 on decode

github-actions · 2026-06-14T18:05:39Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27507155862
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27507155862

…ly workers

- All 4 recipes: container vllm/vllm-openai:minimax-m3 → nightly-aarch64 (contains upstream head_ratio fix vllm#45879, avoids gemm1_alpha crash)
- TP-only recipes (5p12d-tp4ep1, 10p7d-tp4ep1): add moe-backend: marlin for both prefill and decode workers per PR #1809 pattern
- EP recipes (1p1d-tp4ep4): no Marlin (EP enabled)
- nvidia-master.yaml: update image, comment out 1k1k (run 8k1k only)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…-sweep

# Conflicts:
#	.github/configs/nvidia-master.yaml
#	runners/launch_gb300-cw.sh

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-06-19T07:04:04Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27809896613
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27809896613

github-actions · 2026-06-19T07:08:34Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27810876051
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27810876051

…-based model path

- Add minimaxm3 fp8 case to launch_gb300-nv.sh (MODEL_PATH, srt-slurm clone)
- Switch recipe model.path from hf:MiniMaxAI/MiniMax-M3-MXFP8 to minimax-m3-mxfp8 (alias resolved via srtslurm.yaml model_paths, matching GB200 pattern)
- Remove __M3_HF_HOME__ placeholder (extra_mount, HF_HOME, HF_HUB_OFFLINE)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

All prefill workers switched to DEP8 (TP1 DP8 EP, 8 GPU, 2 nodes).
- Low conc (<128): two decode variants, TEP8 (TP8+EP8) and TP8+Marlin.
- High conc (128+): DEP8 decode, 2P+7D = 18 nodes.

TP8 decode (not TP4) to avoid the Marlin OOM seen on the previous run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-06-19T07:58:50Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27811406440
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27811406440

github-actions · 2026-06-19T09:53:54Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27813204331
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27813204331

Switch all prefill from DEP8 (TP1 DP8 EP, 2 nodes) to TEP4 (TP4+EP4, 1 node), halving per-worker node footprint. Decode configs follow B300 run 27630519240 optimal points (spec=none):
- conc 8-32: TP4+Marlin (no EP)
- conc 64-256: TEP4 (TP4+EP4)
- conc 512/1024: TEP8 (8k1k) or DEP8 (1k1k), max 2 workers × 6n

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-06-19T12:07:53Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27822002618
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27822002618

Replace TEP4 prefill + B300-optimal decode recipes with NV's PR #1863 B300 dynamo-vllm disagg search matrix, adapted for GB300 NVL72 (4 GPU/node):
- All prefill switched to DEP2 (TP1 DP2 EP, 2 GPU/worker); the lighter per-worker footprint allows more prefill workers
- Decode types: TP4+Marlin, TEP8, DEP8, DEP4
- 4p3d (3 decode workers) skipped
- 15 recipe files: 8 for 8k1k, 7 for 1k1k (both ISLs active)
- PR 1863 vllm_config values (max-num-seqs up to 4096, max-cudagraph-capture-size up to 8192, max-num-batched-tokens 16384)
- Prefill uses cudagraph (max-cudagraph-capture-size: 2048) instead of enforce-eager
- kv-cache-dtype: fp8, req_rate: inf for all benchmarks
- GB300 MNNVL/NVLS env vars + sbatch mem=0 preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
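The node counts quoted for these topologies follow from the 4 GPU/node figure. This is a minimal sketch, assuming workers pack onto nodes with no idle GPUs and that the "-3n" style suffix in recipe file names is the node count (both are assumptions, not stated in the PR). The function name is hypothetical.

```python
import math

GPUS_PER_NODE = 4  # GB300 NVL72 figure quoted in the commit message


def nodes_needed(prefill_workers, gpus_per_prefill,
                 decode_workers, gpus_per_decode,
                 gpus_per_node=GPUS_PER_NODE):
    """Nodes required if every GPU on each allocated node is used."""
    total_gpus = (prefill_workers * gpus_per_prefill
                  + decode_workers * gpus_per_decode)
    return math.ceil(total_gpus / gpus_per_node)


# Earlier layout: 2 DEP8 prefill + 7 DEP8 decode workers.
print(nodes_needed(2, 8, 7, 8))  # 18

# Current layout: 2 DEP2 prefill + 1 TEP8 decode (a "2p1d-dep2-tep8" recipe).
print(nodes_needed(2, 2, 1, 8))  # 3
```

Dropping prefill from 8 GPUs to 2 GPUs per worker is what lets the same node budget hold more prefill workers.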

github-actions · 2026-06-20T08:01:57Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27861755465
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27861755465

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-06-20T08:10:18Z

+      stream-interval: 32
+      max-num-seqs: 4096
+      max-num-batched-tokens: 16384
+      max-cudagraph-capture-size: 8192


TEP8 cudagraph limits too high

High Severity

Four 1k1k TEP8 decode recipes still set max-num-seqs: 4096 and max-cudagraph-capture-size: 8192, while the same change documents GB300 MiniMax-M3 graph capture OOM from those magnitudes and caps DEP decoders at 512/2048. Decode startup can hit CUDA OOM during graph capture before benchmarks run.

Additional Locations (2)

benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3-gb300-fp8/1k1k/disagg-gb300-2p1d-dep2-tep8-3n.yaml#L93-L96

benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3-gb300-fp8/1k1k/disagg-gb300-2p2d-dep2-tep8-5n.yaml#L93-L96
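A check of this kind can be sketched as below. The caps come from the review comment, but reading "512/2048" as (max-num-seqs, max-cudagraph-capture-size) is an assumption, and the function name is hypothetical.

```python
# Assumed mapping of the "512/2048" caps mentioned in the review comment.
CAPS = {"max-num-seqs": 512, "max-cudagraph-capture-size": 2048}


def over_cap(vllm_config, caps=CAPS):
    """Return {key: value} for every setting that exceeds its cap."""
    return {key: vllm_config[key]
            for key, cap in caps.items()
            if vllm_config.get(key, 0) > cap}


# Values from the flagged decode block.
flagged_decode = {
    "stream-interval": 32,
    "max-num-seqs": 4096,
    "max-num-batched-tokens": 16384,
    "max-cudagraph-capture-size": 8192,
}

print(over_cap(flagged_decode))
# {'max-num-seqs': 4096, 'max-cudagraph-capture-size': 8192}
```

Running such a check over every decode block before submitting a sweep would surface oversized capture settings without spending GPU time on a failed startup.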

github-actions · 2026-06-20T11:32:17Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27865193510
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27865193510

Oseltamivir · 2026-06-20T18:00:56Z

/reuse-sweep-run

Oseltamivir requested a review from a team June 13, 2026 01:06

Oseltamivir requested review from jgangani and kedarpotdar-nv as code owners June 13, 2026 01:06

github-project-automation Bot added this to InferenceMAX Board Jun 13, 2026

chore: update perf-changelog pr-link to #1735

e3fa89f

cursor Bot reviewed Jun 13, 2026

Comment thread .github/configs/nvidia-master.yaml Outdated

Comment thread ...multi_node/srt-slurm-recipes/vllm/minimax-m3-gb300-fp8/8k1k/disagg-gb300-1p1d-tp4ep4-2n.yaml

Oseltamivir added the full-sweep-enabled label Jun 13, 2026

Merge branch 'main' into feat/minimax-m3-gb300-sweep

b915c89

Update runner name in nvidia-master.yaml

afc3f92

cursor Bot reviewed Jun 13, 2026

Comment thread .github/configs/nvidia-master.yaml Outdated

Comment thread runners/launch_gb300-cw.sh Outdated

cursor Bot reviewed Jun 13, 2026

Comment thread benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3-gb300-fp8/8k1k/agg-gb300-dep8-2n.yaml Outdated

Merge branch 'main' into feat/minimax-m3-gb300-sweep

ce76bd7

cursor Bot reviewed Jun 13, 2026

Comment thread runners/launch_gb300-cw.sh Outdated

Oseltamivir added 2 commits June 12, 2026 21:54

Merge branch 'main' into feat/minimax-m3-gb300-sweep

7ea8b0b

Oseltamivir requested a review from 1am9trash as a code owner June 14, 2026 08:56

Oseltamivir added 2 commits June 14, 2026 10:51

Merge remote-tracking branch 'origin/main' into feat/minimax-m3-gb300…

a8d3eb5

…-sweep

Oseltamivir and others added 2 commits June 19, 2026 14:36

Merge remote-tracking branch 'origin/main' into feat/minimax-m3-gb300…

2a97ca2

…-sweep

# Conflicts:
#	.github/configs/nvidia-master.yaml
#	runners/launch_gb300-cw.sh

cursor Bot reviewed Jun 19, 2026

Comment thread ...multi_node/srt-slurm-recipes/vllm/minimax-m3-gb300-fp8/8k1k/disagg-gb300-1p1d-tp4ep4-2n.yaml Outdated

Comment thread ...multi_node/srt-slurm-recipes/vllm/minimax-m3-gb300-fp8/8k1k/disagg-gb300-1p1d-tp4ep4-2n.yaml Outdated

fix: switch GB300 M3 runner from gb300-cw to gb300-nv

d4deb1e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Oseltamivir and others added 2 commits June 19, 2026 15:15

Oseltamivir and others added 2 commits June 19, 2026 19:04

Merge branch 'main' into feat/minimax-m3-gb300-sweep

8e49bf3

cursor Bot reviewed Jun 20, 2026

Comment thread ...ti_node/srt-slurm-recipes/vllm/minimax-m3-gb300-fp8/8k1k/disagg-gb300-4p2d-dep2-dep8-6n.yaml

Merge branch 'main' into feat/minimax-m3-gb300-sweep

9707d9a

fix: reduce GB300 DEP CUDA graph capture sizes

cad3e01

cursor Bot reviewed Jun 20, 2026

Oseltamivir merged commit cba01a9 into main Jun 20, 2026
48 checks passed

Oseltamivir deleted the feat/minimax-m3-gb300-sweep branch June 20, 2026 18:01

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 20, 2026
