[Klaud Cold] Update dsv4-fp8-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517 #1470

Closed
functionstackx wants to merge 2 commits into main from update-dsv4-fp8-mi355x-sglang-v0.5.12

@functionstackx

Collaborator

Summary

Update SGLang ROCm image from custom rocm/sgl-dev:deepseek-v4-mi35x (23d old) to v0.5.12-rocm720-mi35x-20260517

Recipes touched: `dsv4-fp8-mi355x-sglang`

Test plan

  • full-sweep-enabled sweep passes.

🤖 Generated with Claude Code

…35x-20260517

Update SGLang ROCm image from custom rocm/sgl-dev:deepseek-v4-mi35x (23d old) to v0.5.12-rocm720-mi35x-20260517

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@claude claude Bot left a comment

Contributor

LGTM — single image tag bump with matching changelog entry.

Overview

This PR updates one line in .github/configs/amd-master.yaml to bump the dsv4-fp8-mi355x-sglang recipe's image from rocm/sgl-dev:deepseek-v4-mi35x to lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517, plus a matching entry appended to perf-changelog.yaml. No other recipes are touched.

Security risks

None. Both files are config YAML consumed by the perf sweep pipeline; no code paths or auth/permission boundaries are modified. The new image moves from a custom dev tag to an official lmsysorg/sglang-rocm tag, which is the same registry/pattern used by sibling recipes in this file.

Level of scrutiny

Low. This is a mechanical image bump matching the established pattern across the file (compare qwen3.5-bf16-mi355x-sglang, glm5-fp8-mi355x-sglang, etc., which all use lmsysorg/sglang-rocm:vX.Y.Z-rocm720-mi35x-DATE style tags). The full-sweep-enabled label ensures the sweep will validate the new image before merge.
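The tag pattern referred to above can be sketched as a small parser. This is only an illustration: the regular expression and the `parse_tag` helper are hypothetical, inferred from the tags quoted in this thread, and are not an official naming specification for `lmsysorg/sglang-rocm`.

```python
import re
from datetime import date

# Hypothetical pattern inferred from tags quoted in this thread
# (vX.Y.Z-rocm720-mi35x-DATE); not an official naming specification.
TAG_RE = re.compile(
    r"^v(?P<version>\d+\.\d+\.\d+)"
    r"-rocm(?P<rocm>\d+)"
    r"-(?P<arch>[a-z0-9]+)"
    r"-(?P<date>\d{8})$"
)


def parse_tag(tag):
    """Split a release-style tag into its parts, or return None if it does not fit."""
    m = TAG_RE.match(tag)
    if m is None:
        return None
    d = m.group("date")
    try:
        build_date = date(int(d[:4]), int(d[4:6]), int(d[6:]))
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None
    return {
        "version": m.group("version"),
        "rocm": m.group("rocm"),
        "arch": m.group("arch"),
        "date": build_date,
    }
```

Under this assumed pattern, `v0.5.12-rocm720-mi35x-20260517` parses as a release-style tag, while the old custom tag `deepseek-v4-mi35x` does not match and yields `None`.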

Other factors

The bug hunting system found no issues. The PR is labeled full-sweep-enabled so the sweep job will exercise the new image end-to-end. The changelog entry uses the correct PR number (1470) and config-keys matches the recipe name.

@chunfangamd

Collaborator

@functionstackx According to your comment here: #1255 (comment), we haven't focused on the DSv4 FP8 version for a while.

If you want to add it back, some configurations from that closed PR can be reconsidered.

@functionstackx

Collaborator Author

@chunfangamd this was a Claude /loop, thanks for correcting my Claude

functionstackx added a commit that referenced this pull request May 18, 2026
The v0.5.12-rocm720-mi35x-20260517 image rejects bf16 weights in fp8
slots (ValueError: Downcasting not allowed at deepseek_v4.py:1544 →
parameter.py:73 copy_with_check). DSV4-Pro-FP8 has bf16 shared-expert/
embedding layers; the old rocm/sgl-dev:deepseek-v4-mi35x image
tolerated this, but the generic v0.5.12 ROCm image doesn't.

Bump PR #1470 root-caused via the failing sweep; the recipe isn't
viable on any tag in the current sglang ROCm MI355X lineage until
upstream relaxes the downcast check or ships a converter. Closing
that PR; if the recipe is needed again later it can be re-added with
whichever image works at that point.

Also removes the orphan launch script benchmarks/single_node/
dsv4_fp8_mi355x.sh (not referenced by any remaining recipe).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx

Collaborator Author

Recipe removed entirely in #1501 (admin-merged as adbaae52).

The v0.5.12-rocm720-mi35x-20260517 image rejects the DSV4-Pro-FP8 checkpoint's bf16 shared-expert / embedding layers with ValueError: Downcasting not allowed: target.dtype=torch.float8_e4m3fn, loaded_weight.dtype=torch.bfloat16 at deepseek_v4.py:1544 → parameter.py:73 copy_with_check. The old custom rocm/sgl-dev:deepseek-v4-mi35x image tolerated this; no current sglang ROCm MI355X release does, so the recipe isn't viable on any bumpable image. Re-add later if the upstream landscape changes.
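For readers unfamiliar with this class of failure, the following is a minimal sketch of a loader-side downcast guard. It is not SGLang's `copy_with_check`: the `check_no_downcast` helper, the `DTYPE_BITS` table, and the assumption that the guard compares bit widths are all hypothetical. Only the error message format is taken from the comment above.

```python
# Bit widths for the dtype names that appear in the quoted error message.
# Illustrative table only; a real loader would inspect tensor dtypes directly.
DTYPE_BITS = {
    "torch.float8_e4m3fn": 8,
    "torch.bfloat16": 16,
    "torch.float32": 32,
}


def check_no_downcast(target_dtype, loaded_dtype):
    """Raise if the checkpoint weight is wider than the parameter slot.

    Hypothetical stand-in for a strict weight-loading check; the real
    implementation in SGLang may use different criteria.
    """
    if DTYPE_BITS[loaded_dtype] > DTYPE_BITS[target_dtype]:
        raise ValueError(
            f"Downcasting not allowed: target.dtype={target_dtype}, "
            f"loaded_weight.dtype={loaded_dtype}"
        )


# A bf16 checkpoint tensor headed for an fp8 slot trips the guard,
# which mirrors the situation described for the shared-expert/embedding layers.
try:
    check_no_downcast("torch.float8_e4m3fn", "torch.bfloat16")
except ValueError as exc:
    message = str(exc)
```

An image that tolerates the mismatch would, under this sketch, either skip such a guard or convert the weight before copying; which of those the old custom image did is not stated in this thread.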
