fix UCX_ settings to fix nixl handshake failure #1823

Merged
xinli-sw merged 2 commits into SemiAnalysisAI:nv/jasonli/minimaxm3-fp8-b300-dynamo-vllm from biswapanda:nv/jasonli/minimaxm3-fp8-b300-dynamo-vllm-fix-ucx
Jun 18, 2026

biswapanda (Contributor) commented Jun 18, 2026

Note

Low Risk
Changes are limited to benchmark YAML env vars and an srt-slurm git branch for one model/framework combo; no application or auth logic is touched.

Overview
Updates the MiniMax-M3 B300 FP8 disaggregated vLLM Slurm recipes so that Nixl KV transfer can complete its UCX handshake instead of failing under the previous, broader UCX_TLS and UCX_NET_DEVICES settings.

On every affected 1k1k and 8k1k recipe, prefill and decode backend environments drop UCX_NET_DEVICES: "all" and replace UCX_TLS: "rc,cuda_ipc,cuda_copy,sm,self,tcp" with UCX_TLS: "cuda_copy,rc", matching the cluster’s mounted UCX stack used for these jobs.
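Expressed as shell environment settings, the change amounts to roughly the following. This is only an illustrative sketch: the recipes set these values in YAML env blocks whose exact layout is not shown here, so the shell form is an assumption made for readability.

```shell
# Previous settings (removed by this PR):
#   UCX_NET_DEVICES=all
#   UCX_TLS=rc,cuda_ipc,cuda_copy,sm,self,tcp

# New settings: the recipes no longer set UCX_NET_DEVICES at all,
# and the transport list is narrowed to two entries.
unset UCX_NET_DEVICES
export UCX_TLS="cuda_copy,rc"
```

The same two values apply to both the prefill and the decode backend environments.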

For dynamo-vllm + minimaxm3 + fp8, launch_b300-nv.sh now checks out sa-submission-q2-2026 on srt-slurm instead of main before copying the in-repo MiniMax-M3 recipes.
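The branch selection described above could look something like the sketch below. The function name `select_srt_slurm_branch` and its argument order are hypothetical, and the fallback to `main` for every other combination is an assumption; the actual logic in launch_b300-nv.sh may be structured differently.

```shell
# Hypothetical helper: choose the srt-slurm branch for a given
# framework / model / precision combination.
select_srt_slurm_branch() {
  local framework="$1" model="$2" precision="$3"
  if [ "$framework" = "dynamo-vllm" ] && [ "$model" = "minimaxm3" ] && [ "$precision" = "fp8" ]; then
    # Combination targeted by this PR.
    echo "sa-submission-q2-2026"
  else
    # Assumed default for all other combinations.
    echo "main"
  fi
}
```

A launch script could then run `git checkout "$(select_srt_slurm_branch dynamo-vllm minimaxm3 fp8)"` inside its srt-slurm clone before copying the in-repo recipes.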

Reviewed by Cursor Bugbot for commit 63214b5. Bugbot is set up for automated code reviews on this repo.

@xinli-sw xinli-sw merged commit 8abe295 into SemiAnalysisAI:nv/jasonli/minimaxm3-fp8-b300-dynamo-vllm Jun 18, 2026
1 check passed

2 participants