[Klaud Cold] Update dsv4-fp8-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517 #1470
functionstackx wants to merge 2 commits into
…35x-20260517 Update SGLang ROCm image from custom rocm/sgl-dev:deepseek-v4-mi35x (23d old) to v0.5.12-rocm720-mi35x-20260517 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26009964668 |
LGTM — single image tag bump with matching changelog entry.
Overview
This PR updates one line in .github/configs/amd-master.yaml to bump the dsv4-fp8-mi355x-sglang recipe's image from rocm/sgl-dev:deepseek-v4-mi35x to lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517, plus a matching entry appended to perf-changelog.yaml. No other recipes are touched.
Security risks
None. Both files are config YAML consumed by the perf sweep pipeline; no code paths or auth/permission boundaries are modified. The new image moves from a custom dev tag to an official lmsysorg/sglang-rocm tag, which is the same registry/pattern used by sibling recipes in this file.
Level of scrutiny
Low. This is a mechanical image bump matching the established pattern across the file (compare qwen3.5-bf16-mi355x-sglang, glm5-fp8-mi355x-sglang, etc., which all use lmsysorg/sglang-rocm:vX.Y.Z-rocm720-mi35x-DATE style tags). The full-sweep-enabled label ensures the sweep will validate the new image before merge.
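The tag pattern mentioned above can be checked mechanically. The following is a minimal sketch, not part of this repository: the regex, the `parse_image` helper, and the field names are assumptions made for illustration, derived only from the `lmsysorg/sglang-rocm:vX.Y.Z-rocm720-mi35x-DATE` shape quoted in the review.

```python
import re
from datetime import datetime

# Hypothetical pattern for the tag style quoted in the review:
#   lmsysorg/sglang-rocm:vX.Y.Z-rocm720-mi35x-DATE
# The named groups are illustrative, not an official schema.
TAG_RE = re.compile(
    r"lmsysorg/sglang-rocm:"
    r"v(?P<version>\d+\.\d+\.\d+)"
    r"-rocm(?P<rocm>\d+)"
    r"-(?P<gpu>mi\d+x)"
    r"-(?P<date>\d{8})"
)


def parse_image(image: str):
    """Return the tag fields as a dict, or None if the image does not fit the pattern."""
    m = TAG_RE.fullmatch(image)
    if m is None:
        return None
    fields = m.groupdict()
    try:
        # Reject eight-digit strings that are not real calendar dates.
        datetime.strptime(fields["date"], "%Y%m%d")
    except ValueError:
        return None
    return fields


new_image = parse_image("lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517")
old_image = parse_image("rocm/sgl-dev:deepseek-v4-mi35x")
```

Here `new_image` carries the parsed version, ROCm, GPU, and date fields, while `old_image` is `None` because the custom dev tag does not follow the pattern. Note that a well-formed tag says nothing about whether the image can load a given checkpoint.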
Other factors
The bug hunting system found no issues. The PR is labeled full-sweep-enabled so the sweep job will exercise the new image end-to-end. The changelog entry uses the correct PR number (1470) and config-keys matches the recipe name.
|
@functionstackx According to your comment here: #1255 (comment), we haven't focused on the DSv4 FP8 version for a while. If you want to add it back, some configurations from that closed PR can be reconsidered. |
|
@chunfangamd this was a Claude /loop, thanks for correcting my Claude |
The v0.5.12-rocm720-mi35x-20260517 image rejects bf16 weights in fp8 slots (ValueError: Downcasting not allowed at deepseek_v4.py:1544 → parameter.py:73 copy_with_check). DSV4-Pro-FP8 has bf16 shared-expert/embedding layers; the old rocm/sgl-dev:deepseek-v4-mi35x image tolerated this, but the generic v0.5.12 ROCm image doesn't.
The failure in bump PR #1470 was root-caused via the failing sweep; the recipe isn't viable on any tag in the current sglang ROCm MI355X lineage until upstream relaxes the downcast check or ships a converter. Closing that PR; if the recipe is needed again later, it can be re-added with whichever image works at that point.
Also removes the orphan launch script benchmarks/single_node/dsv4_fp8_mi355x.sh (not referenced by any remaining recipe).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
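To illustrate the failure mode described in this commit message, here is a toy sketch of a downcast guard. It is not SGLang's `copy_with_check`: the function names, the rule of comparing byte widths, and the layer names are all assumptions, and plain strings stand in for real tensor dtypes.

```python
# Byte widths of the dtypes named in the commit message; plain strings
# stand in for real tensor dtypes so the sketch needs no GPU libraries.
DTYPE_BYTES = {"fp8": 1, "bf16": 2, "fp32": 4}


def copy_with_check_sketch(slot_dtype: str, weight_dtype: str) -> str:
    """Hypothetical stand-in for a loader guard (NOT SGLang's implementation).

    Assumes a load is rejected whenever the checkpoint weight is wider
    than the parameter slot it would be copied into.
    """
    if DTYPE_BYTES[weight_dtype] > DTYPE_BYTES[slot_dtype]:
        raise ValueError(
            f"Downcasting not allowed: {weight_dtype} -> {slot_dtype}"
        )
    return slot_dtype


def find_rejected(slots: dict, checkpoint: dict) -> list:
    """List the layers whose checkpoint dtype the guard would reject."""
    rejected = []
    for name, weight_dtype in checkpoint.items():
        try:
            copy_with_check_sketch(slots[name], weight_dtype)
        except ValueError:
            rejected.append(name)
    return rejected


# Illustrative layer names only; the real parameter names are not given here.
slots = {"routed_expert": "fp8", "shared_expert": "fp8", "embedding": "fp8"}
checkpoint = {"routed_expert": "fp8", "shared_expert": "bf16", "embedding": "bf16"}
rejected_layers = find_rejected(slots, checkpoint)
```

Under these assumptions, `rejected_layers` contains the two bf16 layers, which mirrors the situation described: a loader that tolerates the mismatch would accept the checkpoint, while a strict one fails on the first such layer.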
|
Recipe removed entirely in #1501 (admin-merged as The |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26009966705 |
Summary
Update SGLang ROCm image from custom rocm/sgl-dev:deepseek-v4-mi35x (23d old) to v0.5.12-rocm720-mi35x-20260517
Recipes touched: `dsv4-fp8-mi355x-sglang`
Test plan
🤖 Generated with Claude Code