fix(train): NF4 QLoRA forward self-deadlock — wave of 4 in fused_residual_rmsnorm_forward (unblocks GPU finetune) by noahgift · Pull Request #2249 · paiml/aprender

noahgift · 2026-07-01T20:29:59Z

Summary

Every apr finetune -m qlora run froze on the first transformer block forward (GPU 0% util, cputime frozen, all threads in futex_wait). gdb live capture of the deadlocked process:

#1 std::sys::sync::mutex::futex::Mutex::lock_contended
#2 entrenar::…::elementwise::residual_add_forward
#3 entrenar::…::normalization::fused_residual_rmsnorm_forward
#4 CudaNf4TransformerBlock::forward

fused_residual_rmsnorm_forward held the FORWARD_KERNEL_CACHE mutex guard for its whole body, then called the public residual_add_forward, which re-locks the same non-reentrant std::sync::Mutex on the same thread → permanent self-deadlock.
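The hazard described above can be reproduced in a few lines. This is a minimal sketch, not the crate's code: `CACHE`, `inner_public_op`, `outer_buggy`, and `outer_scoped` are hypothetical stand-ins, and the nested call uses `try_lock` only so that the demo reports the conflict instead of hanging the way `lock()` would.

```rust
use std::sync::{Mutex, TryLockError};

// Hypothetical stand-in for FORWARD_KERNEL_CACHE (the real cache type is not shown in the PR).
static CACHE: Mutex<Vec<&'static str>> = Mutex::new(Vec::new());

/// Public entry point that takes the cache lock itself, like residual_add_forward.
fn inner_public_op() -> Result<(), &'static str> {
    // With lock() this call would block forever (or panic) when the caller
    // already holds the guard; try_lock makes the conflict observable.
    match CACHE.try_lock() {
        Ok(mut cache) => {
            cache.push("residual_add");
            Ok(())
        }
        Err(TryLockError::WouldBlock) => Err("would self-deadlock"),
        Err(TryLockError::Poisoned(_)) => Err("poisoned"),
    }
}

/// Deadlocked shape: the guard lives for the whole body, then a public
/// function that locks the same mutex is called on the same thread.
fn outer_buggy() -> Result<(), &'static str> {
    let _guard = CACHE.lock().unwrap();
    inner_public_op()
}

/// Lock-scope repair: the guard is dropped before the nested call.
fn outer_scoped() -> Result<(), &'static str> {
    {
        let mut cache = CACHE.lock().unwrap();
        cache.push("fused_rmsnorm");
    } // guard released here
    inner_public_op()
}
```

Calling `outer_buggy()` returns the conflict error, while `outer_scoped()` succeeds. Note that the PR does not take the scoped route: it removes the nested call entirely by switching kernels, which avoids the hazard by construction.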

Wave of 4 (the path had never executed once)

1. Self-deadlock: nested same-thread lock via residual_add_forward (above).
2. Single-row kernel launched as batched: FusedResidualRmsNormKernel has no ctaid indexing but was launched with grid.y = batch_size; every block computed row 0, rows 1.. were never written (measured max_diff=2.36 vs CPU reference).
3. eps not threaded: kernel default 1e-5 (Llama) silently used for Qwen2 (1e-6); same class as FALSIFY-CUDA-RMSNORM-EPS-PARITY-001, one function down.
4. No pre-warm entry: kernel JIT-compiled mid-training (PMAT-698 Blackwell stream-poisoning class).
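The row-indexing defect is easy to see in a CPU simulation. The sketch below assumes the fused op computes `residual_out = residual + input` followed by RMSNorm scaled by `gamma`; that definition, the helper names, and the `use_block_index` switch are illustrative assumptions, not the kernel's source.

```rust
/// CPU reference for one row (assumed semantics): residual add, then RMSNorm * gamma.
fn fused_row(
    residual: &[f32],
    input: &[f32],
    gamma: &[f32],
    eps: f32,
    residual_out: &mut [f32],
    output: &mut [f32],
) {
    let n = residual.len();
    let mut sum_sq = 0.0f32;
    for i in 0..n {
        residual_out[i] = residual[i] + input[i];
        sum_sq += residual_out[i] * residual_out[i];
    }
    let inv_rms = 1.0 / (sum_sq / n as f32 + eps).sqrt();
    for i in 0..n {
        output[i] = residual_out[i] * inv_rms * gamma[i];
    }
}

/// Simulates a launch with grid.y = batch. With `use_block_index = false` the
/// "kernel" ignores its block index, so every block recomputes row 0.
fn launch(
    batch: usize,
    hidden: usize,
    residual: &[f32],
    input: &[f32],
    gamma: &[f32],
    eps: f32,
    use_block_index: bool,
) -> Vec<f32> {
    let mut output = vec![0.0f32; batch * hidden];
    let mut residual_out = vec![0.0f32; batch * hidden];
    for ctaid_y in 0..batch {
        let row = if use_block_index { ctaid_y } else { 0 };
        let r = row * hidden..(row + 1) * hidden;
        fused_row(
            &residual[r.clone()],
            &input[r.clone()],
            gamma,
            eps,
            &mut residual_out[r.clone()],
            &mut output[r],
        );
    }
    output
}

/// Largest element-wise absolute difference, the metric an all-rows oracle would report.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}
```

Comparing `launch(.., true)` against `launch(.., false)` with `batch > 1` shows row 0 matching and every later row left at its initial zeros, which is why an oracle that checks only the first row would miss this defect.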

Fix

Switch to BatchedFusedResidualRmsNormKernel (PMAT-092): indexes rows via ctaid.y and writes residual_out itself, so the nested public call is gone structurally (not just lock-scope reordering). Thread eps from config.rms_norm_eps with eps bits in the cache key; pre-warm at both Qwen2 (1e-6) and Llama (1e-5) eps.
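Putting eps into the cache key could look like the following sketch. `KernelKey`, `CompiledKernel`, and `KernelCache` are hypothetical names, and the "compile" step is a stub; the only point illustrated is that `f32` is neither `Eq` nor `Hash`, so the key stores `eps.to_bits()` and the two eps values get separate entries.

```rust
use std::collections::HashMap;

/// Hypothetical cache key: eps is stored as raw bits because f32 is not Eq/Hash.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct KernelKey {
    hidden_size: u32,
    eps_bits: u32,
}

impl KernelKey {
    fn new(hidden_size: u32, eps: f32) -> Self {
        Self { hidden_size, eps_bits: eps.to_bits() }
    }
}

/// Stand-in for a compiled kernel; it only records the eps it was built with.
struct CompiledKernel {
    eps: f32,
}

struct KernelCache {
    kernels: HashMap<KernelKey, CompiledKernel>,
    compiles: usize, // counts stubbed "JIT compiles" so pre-warming is observable
}

impl KernelCache {
    fn new() -> Self {
        Self { kernels: HashMap::new(), compiles: 0 }
    }

    /// Returns the cached kernel for (hidden_size, eps), compiling on a miss.
    fn get_or_compile(&mut self, hidden_size: u32, eps: f32) -> &CompiledKernel {
        let key = KernelKey::new(hidden_size, eps);
        if !self.kernels.contains_key(&key) {
            self.compiles += 1;
            self.kernels.insert(key, CompiledKernel { eps });
        }
        &self.kernels[&key]
    }

    /// Pre-warm both eps values named in the PR: Qwen2 (1e-6) and Llama (1e-5).
    fn prewarm(&mut self, hidden_size: u32) {
        for eps in [1e-6f32, 1e-5f32] {
            self.get_or_compile(hidden_size, eps);
        }
    }
}
```

After `prewarm`, a later `get_or_compile` with either eps is a cache hit, so nothing is compiled mid-training. A key without the eps bits would hand a Qwen2 model whichever kernel was cached first.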

Verification (RTX 4090, sm_89)

Falsifier FALSIFY-CUDA-FUSED-RMSNORM-DEADLOCK-001 (watchdog thread + CPU-reference oracle over all rows):
- RED on bug: 120s watchdog fires with the exact production signature
- After fix #1 alone: oracle still RED at max_diff=2.36 (caught defect #2 independently)
- GREEN post-fix: 0.17s, residual_out == residual + input exact, output < 1e-4 vs CPU reference on all rows
Contract: contracts/cuda-fused-residual-rmsnorm-v1.yaml — pv validate + pv lint contracts/ PASS
E2E: apr finetune qwen2.5-coder-1.5b-q4k.apr -m qlora --gpu-backend cuda now trains to completion (previously deadlocked 100% of runs; verified 3× today pre-fix): Epoch 1 complete … Training complete … time=13s, checkpoints saved
Full crate: cargo test -p aprender-train --lib --features cuda → 7695 pass; the 6 failures are pre-existing on the base commit (bf16 ptxas env issue, insta snapshot drift) — verified by running them on base
Pre-push: fmt ✓, contracts lib (1412) ✓, deny advisories ✓, non-cuda cargo check ✓
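A watchdog-style liveness check like the one the falsifier describes can be sketched with a worker thread and a channel timeout. `run_with_watchdog` and `Liveness` are hypothetical names; the actual falsifier's structure is not shown in this PR.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Liveness<T> {
    Finished(T),
    TimedOut,
    Panicked,
}

/// Runs `work` on a separate thread and waits at most `timeout` for its result.
/// A deadlocked worker cannot be killed from safe Rust, so on timeout the
/// thread is simply left detached and the caller reports the failure.
fn run_with_watchdog<T, F>(timeout: Duration, work: F) -> Liveness<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver already gave up, the send error is irrelevant.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => Liveness::Finished(value),
        Err(RecvTimeoutError::Timeout) => Liveness::TimedOut,
        // The sender was dropped without sending: the worker panicked.
        Err(RecvTimeoutError::Disconnected) => Liveness::Panicked,
    }
}
```

In a test, the closure would call the forward under test and return its outputs, so that a `Finished` result can then be compared against the CPU reference on every row; `TimedOut` corresponds to the RED liveness signature.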

Known follow-on (separate beat): one training step logs NaN/Inf loss detected — skipping backward on the toy dataset; liveness and kernel parity are discharged here, the NaN loss is the next falsifier in the cascade.

Unblocks GPU QLoRA finetune on RTX 4090 (Pillar 3) and the apr-code tool_call flip.

🤖 Generated with Claude Code

…dual_rmsnorm_forward

Every `apr finetune -m qlora` run froze on the FIRST transformer block forward (GPU 0%, cputime frozen, all threads futex_wait). gdb live capture: main thread stuck in Mutex::lock_contended <- residual_add_forward <- fused_residual_rmsnorm_forward <- CudaNf4TransformerBlock::forward.

Wave of 4 (path had never executed once; defect 1 fired on first use):

1. Self-deadlock: fn held the FORWARD_KERNEL_CACHE guard for its whole body, then called public residual_add_forward which re-locks the SAME non-reentrant std::sync::Mutex on the same thread.
2. Single-row kernel launched as batched: FusedResidualRmsNormKernel has no ctaid indexing but was launched with grid.y=batch_size, so every block computed row 0 and rows 1.. were never written (max_diff=2.36 vs CPU).
3. eps not threaded: kernel default 1e-5 (Llama) silently used for Qwen2 (1e-6); same class as FALSIFY-CUDA-RMSNORM-EPS-PARITY-001.
4. No pre-warm entry: kernel JIT-compiled mid-training (PMAT-698 Blackwell stream-poisoning class).

Fix: switch to BatchedFusedResidualRmsNormKernel (PMAT-092) which indexes rows via ctaid.y AND writes residual_out itself, so the nested call is gone structurally; thread eps from config.rms_norm_eps with eps bits in the cache key; pre-warm both Qwen2 (1e-6) and Llama (1e-5).

Falsifier FALSIFY-CUDA-FUSED-RMSNORM-DEADLOCK-001 (watchdog thread + CPU-reference oracle over ALL rows) verified RED on bug (120s timeout; after fix 1 alone, oracle still RED at max_diff=2.36) -> GREEN 0.17s on RTX 4090.

Contract: contracts/cuda-fused-residual-rmsnorm-v1.yaml (pv validate + pv lint PASS).

Full crate suite: 7695 pass; 6 fails are pre-existing on base (bf16 ptxas env, insta snapshot drift), verified by running them on the base commit.

Unblocks GPU QLoRA finetune on RTX 4090 (Pillar 3) and the apr-code tool_call flip.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…pe pairing, Q/K/V biases, full-warp softmax, causal CPU oracle (FALSIFY-CUDA-NF4-TRAIN-LOSS-PARITY-001) (#2252)

Cascade defect #4: after the stream-ordering fix (#2249/#2250/stream-bind) apr finetune -m qlora trained end-to-end but the loss sat FLAT at CE 13-14 (> ln(151936)=11.93, i.e. worse than uniform) on data the base model emits correctly in inference, then adapters blew into permanent NaN at ~step 125. The forward was finite but WRONG.

Oracle bisection (pure-CPU CE vs GPU-forward+CPU-CE vs GPU-forward+fused-GPU-CE, then per-op layer-0 bisection vs a manual CPU replay) found FOUR stacked defects:

1. WRONG ROPE PAIRING (dominant): entrenar's batched_rope_neox_forward/_backward wrappers instantiated BatchedRopeKernel: ADJACENT-pair (GPT-J) rotation, the convention realizar reserves for non-NeoX rope types. Qwen2/LLaMA need NEOX split-half pairs (i, i+d/2). Every layer's Q/K rotated in the wrong basis (post-rope relL2 0.42/0.65 vs oracle while un-roped V matched at quant noise 0.09). Fix: new BatchedRopeNeoxKernel + BatchedRopeNeoxBackwardKernel (precise trig, CORRECTNESS-013) in aprender-gpu; wrappers + pre-warm keys rewired. BatchedRopeKernel semantics preserved for realizar's non-NeoX users.
2. DROPPED Q/K/V BIASES: CudaNf4TransformerBlock had no bias support at all (Qwen2 use_bias=true; CPU path applies them). Bias drop alone shifts toy causal CE 2.13 -> 4.49. Fix: replicated bias buffers + cuda_add_inplace after each projection GEMM (before QK-norm/RoPE), threaded from all three NF4 construction sites + the instruct FP32 site (which passed None despite FALSIFY-CUDA-FORWARD-PARITY-002).
3. PARTIAL-WARP SHFL UB: batched softmax forward/backward launched block=(32.min(row_size)) while the kernels' reductions use shfl.sync membermask 0xFFFFFFFF, which is undefined with inactive named lanes (PTX ISA). For seq<32 row max/sum picked up garbage data-dependently -> 0/0=NaN rows. Fix: always launch a FULL 32-lane warp (guarded loops carry reduction identities on idle lanes).
4. NON-CAUSAL CPU ORACLE (label leakage): autograd::ops::attention had NO causal mask: the CPU train/eval path for decoder models attended bidirectionally, leaking label tokens backwards (toy causal CE 2.13 reported as 0.17) and corrupting CPU training/eval. Fix: attention_causal (masked scores; shared softmax backward is exact since masked weights are 0) dispatched for ModelArchitecture::Decoder; encoders stay bidirectional.

Falsifier FALSIFY-CUDA-NF4-TRAIN-LOSS-PARITY-001 (parity_probe.rs): GPU fused loss vs an NF4-quantization-MATCHED causal CPU oracle (|dCE|<0.5, full-logits relL2<0.10, toy CE<6, fused-vs-CPU CE on identical logits <0.05). Mutation-verified RED per reverted fix: rope -> relL2 0.183; biases -> CE 5.17/relL2 0.97; warp -> NaN. GREEN: CE 0.6820 vs 0.6557 (|d|=0.026), logits relL2 0.047.

E2E (RTX 4090, apr_code_sft_balanced @ --max-seq-len 2048): first-step CE now 1.58 (was 13-14), 0 NaN across the epoch; at lr 2e-5 the loss DECREASES 4.31 -> 0.29 by step 50. (The auto-selected lr 2e-4 @ rank 256 still diverges after ~25 steps; this is a separate training-dynamics defect, gradients verified descending.)

Contract: contracts/cuda-nf4-train-loss-parity-v1.yaml (pv lint PASS).

Probes: instruct_pipeline/parity_probe.rs (loss/logits falsifier + layer bisect), transformer/cuda_block_parity_probe.rs (per-op layer-0).

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
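The two RoPE pairing conventions named in defect 1 differ only in which elements of a head vector are rotated together. The function below is a CPU illustration with a hypothetical signature, not the GPU kernels: `neox = true` rotates split-half pairs (i, i + d/2), `neox = false` rotates adjacent pairs (2i, 2i + 1).

```rust
/// Applies rotary position embedding to one head vector in place.
/// `theta_base` is the RoPE frequency base (10000.0 is a common choice).
/// Assumes an even head dimension.
fn rope(x: &mut [f32], pos: usize, theta_base: f32, neox: bool) {
    let d = x.len();
    let half = d / 2;
    for i in 0..half {
        // Frequency for pair i: theta_base^(-2i / d).
        let freq = theta_base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        // The only difference between the conventions is the index pairing.
        let (a, b) = if neox { (i, i + half) } else { (2 * i, 2 * i + 1) };
        let (xa, xb) = (x[a], x[b]);
        x[a] = xa * cos - xb * sin;
        x[b] = xa * sin + xb * cos;
    }
}
```

Both variants are norm-preserving rotations and agree at position 0, so a wrong pairing produces finite, plausible-looking activations. That is consistent with the "finite but WRONG" forward described above, which only an oracle comparison exposes.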

noahgift enabled auto-merge July 1, 2026 20:30

noahgift mentioned this pull request Jul 1, 2026

fix(finetune): CUDA loss window honors --max-seq-len — kill hardcoded 512 clamp that silently trained nothing #2250

Merged

Merge branch 'main' into fix/qlora-fused-rmsnorm-self-deadlock

2bb7d5b

noahgift added this pull request to the merge queue Jul 1, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 1, 2026

noahgift added this pull request to the merge queue Jul 1, 2026

Merged via the queue into main with commit 05045a1 Jul 1, 2026
10 checks passed

noahgift deleted the fix/qlora-fused-rmsnorm-self-deadlock branch July 1, 2026 22:56

This was referenced Jul 1, 2026

fix(train): NF4 QLoRA NaN loss — per-call cuBLAS stream binding kills cross-stream data race #2251

Merged

fix(train): NF4 QLoRA CUDA forward computed the WRONG MODEL — NeoX rope pairing + Q/K/V biases + full-warp softmax + causal CPU oracle #2252

Merged

noahgift merged 2 commits into main from fix/qlora-fused-rmsnorm-self-deadlock
