Merged
8 changes: 4 additions & 4 deletions KLAUD_DEBUG.md
@@ -66,7 +66,7 @@

## 4. Upstream sglang v0.5.12 B300 regressions

Two distinct upstream regressions on NVIDIA B300 (Blackwell, `sm_120`) shipped in `lmsysorg/sglang:v0.5.12-cu130`:
Three distinct upstream regressions on NVIDIA B300 (Blackwell Ultra, `sm_103` — compute capability 10.3) shipped in `lmsysorg/sglang:v0.5.12-cu130`. (sm_120 is for *consumer* Blackwell / RTX 50 series, not B300 — don't propagate that.)

Check warning on line 69 in KLAUD_DEBUG.md

Claude / Claude Code Review

Stale sm_120 reference in klaud-pr-status-html.md dashboard template

🟡 Pre-existing stale reference missed by this PR's cleanup pass: `.claude/commands/klaud-pr-status-html.md:171` still contains a dashboard Reason-cell example keyed to PR #1422 reading "Upstream sglang v0.5.12 `flash_attn` SM-arch regression on B300 (`sm_120`)." The PR description explicitly identifies "dashboard Reason cells" as one of the propagation targets for the bad `sm_120` assumption, yet this template, which agents copy into `/tmp/klaud_pr_diag.json` on each `/klaud-pr-status-html` invocation, was missed. Suggest updating it to `sm_103` in the same PR so the stale value is not re-seeded on every run.

Extended reasoning...

What the bug is

The PR corrects KLAUD_DEBUG.md §4 so that B300 (Blackwell Ultra) is described as sm_103 (compute capability 10.3) rather than the consumer-Blackwell sm_120. The PR description states explicitly that this bad assumption "had propagated through agent diagnoses, through several dashboard Reason cells, and into the upstream issue body" and that those have been corrected. However, one such propagated dashboard Reason cell — directly keyed to the same PR #1422 fixed in §4c — was not updated.

Where it lives

.claude/commands/klaud-pr-status-html.md:171 contains the example JSON template that agents copy verbatim into /tmp/klaud_pr_diag.json when generating dashboard HTML (per the surrounding instructions on lines 161–176). That template's Reason cell for PR #1422 still reads:

"Upstream sglang v0.5.12 flash_attn SM-arch regression on B300 (sm_120)."

A repo-wide grep for sm_120 returns exactly two hits: the corrective line in KLAUD_DEBUG.md:69 (introduced by this PR, and which deliberately disambiguates sm_120 as consumer Blackwell) and this stale line in the slash-command template.
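
The repo-wide search described above can be reproduced with a small script. This is a minimal sketch, not the tooling the reviewer used: the function name `find_stale_arch_refs` and its parameters are hypothetical, and it simply reports every text line under a directory that contains the search string.

```python
from pathlib import Path


def find_stale_arch_refs(root, needle="sm_120"):
    """Return (relative_path, line_number, line) for each line containing `needle`.

    Hypothetical helper for illustration; it walks `root` recursively,
    skips anything under a .git directory, and ignores unreadable or
    non-UTF-8 (likely binary) files.
    """
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        if not path.is_file() or ".git" in relative.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((relative.as_posix(), number, line.strip()))
    return hits


if __name__ == "__main__":
    for rel_path, number, line in find_stale_arch_refs("."):
        print(f"{rel_path}:{number}: {line}")
```

Run from the repository root, a result with more hits than the deliberate disambiguation line in KLAUD_DEBUG.md would indicate that a stale reference remains.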

Why this matters for this PR specifically

The PR is explicitly doc-only and scoped to scrubbing the bad sm_120 → B300 association from agent-facing materials. The template at klaud-pr-status-html.md:171 is exactly the class of artifact the PR description calls out ("dashboard Reason cells"). Because the template is the seed agents copy when generating fresh status JSON, leaving it as sm_120 causes each future /klaud-pr-status-html run to re-inject the wrong arch string into newly produced dashboards — defeating the purpose of §4c's correction.

Step-by-step propagation proof

  1. A user (or cron) invokes /klaud-pr-status-html.
  2. Per the playbook in .claude/commands/klaud-pr-status-html.md (lines 161–176), the agent copies the example JSON block — including the Reason for PR 1422 — into /tmp/klaud_pr_diag.json as its starting scaffold.
  3. The agent renders the dashboard HTML from that JSON; the Reason cell for PR #1422 (and structurally similar new B300 flash_attn rows) carries the literal sm_120 string.
  4. A future debug session reads the dashboard, sees sm_120 attributed to B300, and re-seeds the same wrong assumption that this PR is specifically trying to eradicate.

Impact

Doc/template only — no runtime effect — but it directly undercuts the stated cleanup goal of the PR and will silently reintroduce the wrong arch on each dashboard regeneration until corrected.

Fix

Single-line edit in .claude/commands/klaud-pr-status-html.md:171: change (`sm_120`) to (`sm_103`) so the template matches the corrected KLAUD_DEBUG.md §4c. Severity is nit since it is a doc-template fix, but it is in scope for this PR (which is explicitly a doc-only sm_120 → sm_103 cleanup pass).


### 4a. DeepGemm TMA-descriptor crash (GLM-5-FP8)
**Symptom:** CUDA graph capture aborts with `CUDA_ERROR_ILLEGAL_ADDRESS (700)` at `/deepgemm/csrc/.../runtime_utils.hpp:143` on the **first batch size** for **every TP rank**. Server never serves a prompt.
@@ -86,17 +86,17 @@
2. Comment out the MTP/EAGLE scenarios on B300 in the recipe.
3. Pin to v0.5.11-cu130.

Seen on #1420.
Filed upstream: sgl-project/sglang#25563. Seen on #1420.

### 4c. flash_attn SM-arch assertion (qwen3.5-bf16)
**Symptom:** All 4 TP workers AssertionError on first forward pass:
```
File "/opt/venv/.../sglang/srt/layers/attention/flashattention_backend.py:..."
assert sm_100 <= arch <= sm_110f
```
B300 is `sm_120`, outside the asserted range. Server never becomes healthy; warmup times out at 600s.
B300 is `sm_103` (compute capability 10.3, Blackwell Ultra) — which is *nominally inside* the asserted `sm_100..sm_110f` range, yet the assertion still fires. Best guess is the cute kernel's `Arch.sm_110f` set only matches the architecture-specific feature-flag variants it was compiled for (e.g. `sm_100`, `sm_100f`, `sm_110`, `sm_110f`) and `sm_103` / `sm_103a` isn't in that explicit list. Server never becomes healthy; warmup times out at 600s.
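
The "best guess" above can be illustrated with a toy model. This is only a sketch of the hypothesis and does not reproduce the actual `flash_attn` or sglang code: `SUPPORTED_TARGETS`, `in_numeric_range`, `in_explicit_set`, and `capability_to_target` are hypothetical names, and the set contents are taken from the example list in the paragraph above. It shows how an arch can sit inside a numeric range while still failing an explicit membership check.

```python
# Hypothetical stand-in for an explicit list of compiled-for targets.
# The entries mirror the examples named in the text; the real list is unknown.
SUPPORTED_TARGETS = frozenset({"sm_100", "sm_100f", "sm_110", "sm_110f"})


def capability_to_target(major, minor):
    """Format a (major, minor) compute capability as an sm_XY string."""
    return f"sm_{major}{minor}"


def in_numeric_range(capability, low=100, high=110):
    """Range-style check: capability is major * 10 + minor, e.g. 103 for 10.3."""
    return low <= capability <= high


def in_explicit_set(target):
    """Membership-style check: only exact, listed targets pass."""
    return target in SUPPORTED_TARGETS


if __name__ == "__main__":
    target = capability_to_target(10, 3)   # B300, compute capability 10.3
    print(target, in_numeric_range(103), in_explicit_set(target))
```

Under these assumptions a range comparison would accept B300 while a membership test would reject it, which is consistent with the observed assertion failure but does not confirm the cause.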

**Fix:** Needs sglang image with flash_attn supporting `sm_120` — no local workaround. Pin to v0.5.11-cu130 in the meantime.
**Fix:** Needs an sglang image with `flash_attn` that recognises `sm_103` / `sm_103a` — no local workaround. Pin to `v0.5.11-cu130` in the meantime.

Seen on #1422.
