[Klaud Cold] minimaxm3-fp8-mi355x-vllm-mtp: day-zero MiniMax-M3 EAGLE3 (MTP) MI355X recipe by functionstackx · Pull Request #1742 · SemiAnalysisAI/InferenceX

functionstackx · 2026-06-13T16:19:49Z

Summary

Adds the EAGLE3 speculative-decoding (spec-decoding: mtp) sibling of minimaxm3-fp8-mi355x-vllm (#1725): MiniMax-M3 MXFP8 on MI355X (gfx950) single-node vLLM (ROCm), pairing MiniMaxAI/MiniMax-M3-MXFP8 with the Inferact/MiniMax-M3-EAGLE3 draft head. Based on the MI355X non-MTP recipe.

New benchmark script

benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x_mtp.sh, based on minimaxm3_fp8_mi355x.sh (mandatory --block-size 128 for MSA, --language-model-only, --kv-cache-dtype fp8, --attention-backend TRITON_ATTN, --enforce-eager, minimax_m3 tool/reasoning parsers, ROCR→HIP device mapping). Additions for MTP:

--speculative-config '{"method": "eagle3", "model": "Inferact/MiniMax-M3-EAGLE3", "num_speculative_tokens": 3}'.
No attention_backend override on the drafter — unlike the CUDA recipes. The FlashInfer "page size 128 requires GQA/MQA" limitation that forced FLASH_ATTN for the MHA EAGLE3 head on Blackwell is FlashInfer/CUDA-specific; here the whole server runs on TRITON_ATTN, which serves the MHA draft fine.
Draft downloaded into the same NFS-mounted HF cache as the (pre-staged) target.
--use-chat-template on the benchmark client so draft acceptance reflects real text rather than random tokens.
--enforce-eager is kept from the non-MTP base, so there's no cudagraph-capture sizing to scale.
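Here is a minimal Python sketch of how the MTP additions compose with the base MI355X serve flags. It assumes the flags exactly as listed above. The helper names are hypothetical, the real recipe is a bash script, and the tool/reasoning parsers and ROCR→HIP device mapping are left out. Serializing the speculative config with `json.dumps` reproduces the same JSON string that is passed on the command line. That string deliberately has no drafter `attention_backend` key.

```python
import json
import shlex

# Values copied from the recipe description above.
TARGET_MODEL = "MiniMaxAI/MiniMax-M3-MXFP8"
DRAFT_MODEL = "Inferact/MiniMax-M3-EAGLE3"
NUM_SPECULATIVE_TOKENS = 3


def speculative_config(draft: str = DRAFT_MODEL, k: int = NUM_SPECULATIVE_TOKENS) -> str:
    """Serialize the EAGLE3 draft settings for --speculative-config.

    Deliberately no drafter "attention_backend" key: on ROCm the global
    TRITON_ATTN backend also serves the MHA draft head.
    """
    return json.dumps({"method": "eagle3", "model": draft, "num_speculative_tokens": k})


def serve_argv() -> list[str]:
    """Hypothetical argv: base MI355X serve shape plus the MTP addition.

    Tool/reasoning parsers and the ROCR->HIP device mapping are omitted.
    """
    return [
        "vllm", "serve", TARGET_MODEL,
        "--block-size", "128",            # mandatory for MSA
        "--language-model-only",
        "--kv-cache-dtype", "fp8",
        "--attention-backend", "TRITON_ATTN",
        "--enforce-eager",                # no graph capture, so nothing to resize for MTP
        "--speculative-config", speculative_config(),
    ]


if __name__ == "__main__":
    # shlex.join quotes the JSON so the line can be pasted into a shell.
    print(shlex.join(serve_argv()))
```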

Config (`amd-master.yaml`)

minimaxm3-fp8-mi355x-vllm-mtp, same vllm/vllm-openai-rocm:minimax-m3 image and mi355x runner. Search space mirrors the non-MTP entry trimmed at the extreme-concurrency end (identical to minimaxm3-fp8-b300-vllm-mtp / b200-vllm-mtp), with tp2-ep2 dropped:

1k1k: TP8 (1–64), TP8+EP8 (1–256), TP4 (1–64), TP4+EP4 (64–256), TP8+EP8 dp-attn (256–512)
8k1k: TP8 (1–64), TP8+EP8 (1–256), TP4 (1–64), TP8+EP8 dp-attn (128–256)
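The search space above can be sanity-checked with a short count. This sketch rests on one assumption: each concurrency range is swept in powers of two. The source does not state this, but it reproduces the 53 configs reported under Validation (28 for 1k1k, 25 for 8k1k). The table and function names are hypothetical.

```python
# Search space transcribed from the list above: (parallelism, min conc, max conc).
SEARCH_SPACE = {
    "1k1k": [("TP8", 1, 64), ("TP8+EP8", 1, 256), ("TP4", 1, 64),
             ("TP4+EP4", 64, 256), ("TP8+EP8 dp-attn", 256, 512)],
    "8k1k": [("TP8", 1, 64), ("TP8+EP8", 1, 256), ("TP4", 1, 64),
             ("TP8+EP8 dp-attn", 128, 256)],
}


def pow2_range(lo: int, hi: int) -> list[int]:
    """Concurrency points lo, 2*lo, 4*lo, ... up to and including hi (assumed sweep)."""
    points = []
    c = lo
    while c <= hi:
        points.append(c)
        c *= 2
    return points


def count_configs(space: dict) -> dict[str, int]:
    """Number of (parallelism, concurrency) points per scenario."""
    return {
        scenario: sum(len(pow2_range(lo, hi)) for _, lo, hi in entries)
        for scenario, entries in space.items()
    }


if __name__ == "__main__":
    counts = count_configs(SEARCH_SPACE)
    print(counts, sum(counts.values()))
```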

No launcher change needed — launch_mi355x-amds.sh already resolves minimaxm3_fp8_mi355x_mtp.sh via SPEC_SUFFIX.
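The launcher itself is not shown in this PR, so the sketch below is only a hypothetical Python rendition of the fallback described under Validation. It assumes the launcher first tries a framework-suffixed name (`..._vllm_mtp.sh`) and then falls back to the plain `..._mtp.sh` script. The function, its parameters, and the candidate order are illustrative assumptions, not the contents of launch_mi355x-amds.sh.

```python
def resolve_script(existing: set[str], model_prefix: str, gpu: str,
                   framework: str, spec_suffix: str) -> str:
    """Hypothetical script-name resolution with a SPEC_SUFFIX-style suffix.

    `existing` stands in for the benchmark directory listing.
    """
    candidates = [
        f"{model_prefix}_{gpu}_{framework}{spec_suffix}.sh",  # e.g. ..._vllm_mtp.sh
        f"{model_prefix}_{gpu}{spec_suffix}.sh",              # fallback: ..._mtp.sh
    ]
    for name in candidates:
        if name in existing:
            return name
    raise FileNotFoundError(f"none of {candidates} exist")


if __name__ == "__main__":
    scripts = {"minimaxm3_fp8_mi355x.sh", "minimaxm3_fp8_mi355x_mtp.sh"}
    print(resolve_script(scripts, "minimaxm3_fp8", "mi355x", "vllm", "_mtp"))
```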

perf-changelog

Entry for the new config key.

Validation

generate_sweep_configs.py test-config --config-keys minimaxm3-fp8-mi355x-vllm-mtp generates 53 configs cleanly (scenario-trimmed max-model-len 2304 / 9472, all spec-decoding=mtp on mi355x).
bash -n passes on the new script.
Launcher script-name resolution simulated: falls back from _vllm_mtp to minimaxm3_fp8_mi355x_mtp.sh (exists).
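The reported max-model-len values are consistent with input length plus output length plus a fixed 256-token headroom, taking 1k = 1024 and 8k = 8192. That formula is inferred from the two numbers above and is not stated in the PR. The sketch below only checks the arithmetic.

```python
# Scenario lengths in tokens (assumed: 1k = 1024, 8k = 8192).
SCENARIOS = {"1k1k": (1024, 1024), "8k1k": (8192, 1024)}
HEADROOM = 256  # inferred from the reported 2304 / 9472, not taken from the repo


def max_model_len(isl: int, osl: int, headroom: int = HEADROOM) -> int:
    """Input + output sequence length plus a fixed safety margin."""
    return isl + osl + headroom


if __name__ == "__main__":
    for name, (isl, osl) in SCENARIOS.items():
        print(name, max_model_len(isl, osl))
```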

🤖 Generated with Claude Code

Note

Low Risk
Benchmark-only additions (YAML, shell script, changelog); no production runtime or auth changes. Sweeps may fail until the ROCm vLLM image gains EAGLE3 support.

Overview
Adds the EAGLE3 speculative-decoding (spec-decoding: mtp) variant of the existing MiniMax-M3 MXFP8 MI355X vLLM benchmark, pairing MiniMaxAI/MiniMax-M3-MXFP8 with draft Inferact/MiniMax-M3-EAGLE3 (3 speculative tokens).

Registers minimaxm3-fp8-mi355x-vllm-mtp in amd-master.yaml with the same ROCm image as the non-MTP entry and a trimmed search space (lower max concurrency, tp2-ep2 dropped), aligned with the B300/B200 MTP configs. Documents a perf-changelog entry for the new config key.

Introduces minimaxm3_fp8_mi355x_mtp.sh, extending the base MI355X recipe with --speculative-config (eagle3), draft model download, and --use-chat-template on the serving benchmark. Unlike CUDA MTP recipes, no drafter attention_backend override — global TRITON_ATTN is sufficient on ROCm. The script notes a known blocker: the current vllm-openai-rocm:minimax-m3 image lacks EAGLE3 aux-hidden-state support until the ROCm image is rebuilt.

Reviewed by Cursor Bugbot for commit f8ec1d0. Bugbot is set up for automated code reviews on this repo.

Adds the spec-decoding=mtp sibling of minimaxm3-fp8-mi355x-vllm: same MXFP8 target and ROCm serve shape (--block-size 128, FP8 KV cache, --attention-backend TRITON_ATTN, --enforce-eager, minimax_m3 parsers), plus the Inferact/MiniMax-M3-EAGLE3 draft head via --speculative-config (method eagle3, 3 speculative tokens). Unlike the CUDA recipes the drafter needs no attention_backend override — the FlashInfer page-128/MHA limitation that forced FLASH_ATTN on Blackwell is FlashInfer-specific; the whole server runs on TRITON_ATTN here, which serves the MHA draft fine. Benchmark prompts run through the chat template so acceptance reflects real text. Search space mirrors the non-MTP entry trimmed at the extreme-concurrency end (tp2-ep2 dropped), matching the b300/b200 MTP precedent. Launcher needs no change — launch_mi355x-amds.sh already resolves the _mtp script via SPEC_SUFFIX. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>


github-actions · 2026-06-13T16:19:56Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review from, and get approval from, the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-06-13T16:20:34Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27472217054
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27472217054

github-actions · 2026-06-13T16:28:55Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27472217773
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27472217773

First MI355X EAGLE3 sweep failed engine init with 'Model does not support EAGLE3 interface but aux_hidden_state_outputs was requested'. The proven CUDA EAGLE3 recipes all serve with --trust-remote-code, which the ROCm non-MTP base omits; add it here to match the CUDA serve exactly (the one remaining serve-level difference). If this fails identically, the ROCm minimax-m3 image lacks MiniMax-M3 EAGLE3 target support and needs a rebuild. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-13T16:44:54Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27472704212
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27472704212

github-actions · 2026-06-13T16:59:04Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27472704212
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27472704212

…ge blocker The --trust-remote-code experiment failed identically (sweep 27472704212, trust_remote_code=True confirmed in engine config): same "Model does not support EAGLE3 interface but aux_hidden_state_outputs was requested" across all TP workers. Revert it back to the non-MTP base shape and document the confirmed blocker: the ROCm vllm/vllm-openai-rocm:minimax-m3 image's MiniMaxM3SparseForConditionalGeneration class does not implement vLLM's SupportsEagle3 interface (the CUDA build does). Recipe is correct; held pending a ROCm image rebuild with MiniMax-M3 EAGLE3 target support. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

functionstackx · 2026-06-13T18:21:31Z

⛔ Held — blocked on ROCm image (EAGLE3 target support missing)

CI cannot pass on the current vllm/vllm-openai-rocm:minimax-m3 image. Both MI355X sweeps failed engine init at the canary with the same error across all TP workers:

RuntimeError: Model does not support EAGLE3 interface but aux_hidden_state_outputs was requested

Root cause

The draft (Inferact/MiniMax-M3-EAGLE3 → LlamaForCausalLMEagle3) downloads and loads fine. The failure is the target: the ROCm build's MiniMaxM3SparseForConditionalGeneration class does not implement vLLM's SupportsEagle3 aux-hidden-state interface that EAGLE3 consumes. The CUDA vllm/vllm-openai:minimax-m3 image (a newer vLLM commit — g454b47db8 vs the ROCm g4a560dd8d) does implement it, which is why the B300/B200/H100/H200 EAGLE3 MTP recipes (#1733, #1739, #1741) all pass with the identical config.

Ruled out

--trust-remote-code on serve (matched the CUDA recipe exactly): re-run 27472704212 confirmed trust_remote_code=True in the engine config and failed identically. The SupportsEagle3 interface comes from the image's model class, not remote code — no serve flag can add it.
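The following toy Python model shows why no serve flag can help. It is not vLLM's code. The stub classes and the single `supports_eagle3` marker are simplifications of the real SupportsEagle3 interface. The point it illustrates is that the engine-init check inspects the target model class shipped in the image, so `trust_remote_code` has no effect on the outcome.

```python
class RocmMiniMaxM3Stub:
    """Stand-in for the ROCm image's MiniMaxM3SparseForConditionalGeneration:
    it does not opt in to the EAGLE3 aux-hidden-state interface."""


class CudaMiniMaxM3Stub:
    """Stand-in for the CUDA image's class, which does opt in."""
    supports_eagle3 = True  # simplified marker; the real interface has more to it


def check_eagle3_target(model: object, aux_hidden_state_outputs: bool,
                        trust_remote_code: bool) -> None:
    """Mimic the engine-init check that failed on MI355X (simplified)."""
    # trust_remote_code is intentionally ignored: the capability lives on the
    # model class baked into the image, so a serve flag cannot change it.
    if aux_hidden_state_outputs and not getattr(model, "supports_eagle3", False):
        raise RuntimeError(
            "Model does not support EAGLE3 interface but "
            "aux_hidden_state_outputs was requested"
        )


def engine_init_succeeds(model: object, trust_remote_code: bool) -> bool:
    """EAGLE3 always requests aux hidden states from the target."""
    try:
        check_eagle3_target(model, aux_hidden_state_outputs=True,
                            trust_remote_code=trust_remote_code)
    except RuntimeError:
        return False
    return True
```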

Status

The recipe itself is correct (config, search space, draft wiring, TRITON_ATTN serve all validated). It is held pending a rebuild of the ROCm minimax-m3 image from a commit that includes MiniMax-M3 EAGLE3 target support. Once that image ships, re-running the sweep on this branch should pass with no recipe changes.

Failed sweeps for reference: 27472217773 (initial), 27472704212 (with --trust-remote-code).

github-actions · 2026-06-13T18:29:17Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27475145738
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27475145738

functionstackx requested a review from a team June 13, 2026 16:19

functionstackx requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners June 13, 2026 16:19

github-project-automation Bot added this to InferenceMAX Board Jun 13, 2026

perf-changelog: fill in PR link for minimaxm3-fp8-mi355x-vllm-mtp

4018fba

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

functionstackx added the full-sweep-enabled label Jun 13, 2026

This was referenced Jun 13, 2026

[Klaud Cold][AI draft test] minimaxm3-fp8-mi355x-vllm-mtp: runtime-patch EAGLE3 to validate on MI355X #1744 (Closed)

Add minimax M3 MXFP8 MI355X vLLM EAGLE3 (related PR for upstreaming patch https://github.com/vllm-project/vllm/pull/45546) #1745 (Merged)

functionstackx closed this Jun 13, 2026

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 13, 2026

functionstackx wants to merge 4 commits into main from feat/minimax-m3-mi355-mtp-dayzero
