fix(finetune): CUDA loss window honors --max-seq-len — kill hardcoded 512 clamp that silently trained nothing by noahgift · Pull Request #2250 · paiml/aprender

noahgift · 2026-07-01T20:41:54Z

Summary

cuda_train_step had a second, older hardcoded clamp (entrenar#318) — max_position_embeddings.min(512) — independent of #2247's --max-seq-len threading and of the scratch capacity the forward actually uses. Any sample whose prompt exceeded 512 tokens had its entire response clamped out of the loss window: num_loss_tokens == 0 → silent loss=0.0, zero gradient.

Observed live on the apr-code SFT corpus (system prompt ≫ 512 tokens): a full qlora epoch "trained" with no learning — loss=0.0000 floods, avg_loss=93.87 (dominated by NaN sentinels), 559 loss tokens across 160 samples (~3.5/sample for 30–60-token responses).

Fix

Window math extracted to pure fn cuda_loss_window (single correctness surface): effective capacity = GPU scratch capacity when present (sized from InstructConfig::max_seq_len at init), min(max_position_embeddings, 512) only as the no-scratch fallback — mirroring forward_cuda_training exactly.
Prompt-overflow samples now skip loudly (per-sample stderr warning naming the --max-seq-len remedy) instead of averaging silent zeros. Token-weighted epoch aggregation already excludes them.
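
The two points above can be sketched as a GPU-free pure function plus a token-weighted aggregate. This is a minimal sketch under stated assumptions, not the code in this PR: the real `cuda_loss_window` signature is not shown here, so `loss_window_sketch`, `window_or_warn`, `epoch_avg_loss`, the `LossWindow` struct, the warning wording, and the `(loss_sum, num_loss_tokens)` sample representation are all hypothetical. Only the capacity rule (scratch capacity when present, otherwise `min(max_position_embeddings, 512)`) is taken from the description.

```rust
/// Fallback cap used only when no GPU scratch buffer exists (value from the PR text).
const FALLBACK_CAP: usize = 512;

/// Illustrative window (hypothetical type): response tokens occupy `loss_start..loss_end`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LossWindow {
    pub seq_len: usize,
    pub loss_start: usize,
    pub loss_end: usize,
}

impl LossWindow {
    pub fn num_loss_tokens(&self) -> usize {
        self.loss_end - self.loss_start
    }
}

/// Scratch capacity when present, else min(max_position_embeddings, 512).
pub fn effective_capacity(scratch_capacity: Option<usize>, max_position_embeddings: usize) -> usize {
    match scratch_capacity {
        Some(cap) => cap,
        None => max_position_embeddings.min(FALLBACK_CAP),
    }
}

/// Returns None when the prompt alone fills the capacity, i.e. no response
/// token would fall inside the loss window (prompt overflow).
pub fn loss_window_sketch(
    prompt_len: usize,
    response_len: usize,
    scratch_capacity: Option<usize>,
    max_position_embeddings: usize,
) -> Option<LossWindow> {
    let cap = effective_capacity(scratch_capacity, max_position_embeddings);
    let seq_len = (prompt_len + response_len).min(cap);
    if prompt_len >= seq_len {
        return None;
    }
    Some(LossWindow { seq_len, loss_start: prompt_len, loss_end: seq_len })
}

/// Skip loudly: warn on stderr instead of returning a silent zero loss.
/// The message text is an assumption, not the PR's actual warning.
pub fn window_or_warn(
    sample_idx: usize,
    prompt_len: usize,
    response_len: usize,
    scratch_capacity: Option<usize>,
    max_position_embeddings: usize,
) -> Option<LossWindow> {
    let window = loss_window_sketch(prompt_len, response_len, scratch_capacity, max_position_embeddings);
    if window.is_none() {
        eprintln!(
            "warning: sample {}: prompt of {} tokens leaves no response tokens in the loss window; consider raising --max-seq-len",
            sample_idx, prompt_len
        );
    }
    window
}

/// Token-weighted epoch mean. Each entry is Some((loss_sum, num_loss_tokens))
/// for a trained sample or None for a skipped one; skipped samples add nothing.
pub fn epoch_avg_loss(samples: &[Option<(f32, usize)>]) -> Option<f32> {
    let (sum, tokens) = samples
        .iter()
        .flatten()
        .fold((0.0f32, 0usize), |(s, t), &(loss_sum, n)| (s + loss_sum, t + n));
    if tokens == 0 {
        None
    } else {
        Some(sum / tokens as f32)
    }
}
```

With these definitions, a 600-token prompt and a 40-token response yield no window under the 512 fallback, and a 40-token window once a 2048-token scratch capacity is supplied.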

Verification

Falsifier FALSIFY-CUDA-LOSS-WINDOW-512-001 mutation-verified: restoring the hardcoded clamp → falsify_cuda_loss_window_honors_scratch_capacity goes RED (seq must not be clamped below scratch capacity); GREEN on fix. 3 companion tests pin the fallback and overflow semantics. Pure-fn tests — run in CI without a GPU.
Contract contracts/finetune-cuda-loss-window-v1.yaml — pv lint contracts/ PASS.
fmt ✓, non-cuda cargo check ✓.
E2E (RTX 4090): apr finetune -m qlora --max-seq-len 2048 on datasets/apr_code_sft_balanced.jsonl — in flight, evidence in PR comment.
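
The mutation check described above can be illustrated with a small GPU-free sketch. It assumes a simplified capacity function rather than the real `cuda_loss_window`; `capacity_fixed`, `capacity_mutant`, and `honors_scratch_capacity` are hypothetical names, and the 2048 and 1500 values are illustrative only.

```rust
/// Fallback cap used only when no GPU scratch buffer exists (value from the PR text).
const FALLBACK_CAP: usize = 512;

/// Fixed behavior as described in the PR: scratch capacity wins when present.
pub fn capacity_fixed(scratch: Option<usize>, max_position_embeddings: usize) -> usize {
    scratch.unwrap_or_else(|| max_position_embeddings.min(FALLBACK_CAP))
}

/// Mutant: the old hardcoded clamp restored, ignoring scratch capacity.
pub fn capacity_mutant(_scratch: Option<usize>, max_position_embeddings: usize) -> usize {
    max_position_embeddings.min(FALLBACK_CAP)
}

/// Falsifier predicate: a sequence that fits in scratch must not be clamped
/// below scratch capacity. Returns true (GREEN) when that holds.
pub fn honors_scratch_capacity(capacity: fn(Option<usize>, usize) -> usize) -> bool {
    let scratch: usize = 2048;
    let seq_len: usize = 1500; // longer than 512, shorter than scratch
    let clamped = seq_len.min(capacity(Some(scratch), 32_768));
    clamped == seq_len
}
```

Calling `honors_scratch_capacity(capacity_fixed)` holds, while `honors_scratch_capacity(capacity_mutant)` fails, which is the RED/GREEN split a mutation-verified falsifier is meant to show.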

Cascade position: 2nd falsifier after the NF4 QLoRA deadlock fix (#2249). Together they take apr finetune -m qlora from "freezes on first forward" to "trains long-system-prompt corpora on the 4090". Directly unblocks the apr-code tool_call flip (5th pillar).

🤖 Generated with Claude Code

…oded 512 clamp that silently trained nothing

cuda_train_step had a second, older hardcoded clamp (entrenar#318): max_position_embeddings.min(512), independent of the #2247 --max-seq-len threading and of the scratch capacity the forward actually uses. Any sample whose prompt exceeded 512 tokens had its entire response clamped out of the loss window: num_loss_tokens == 0 -> silent loss=0.0, zero gradient. On the apr-code SFT corpus (system prompt >> 512 tokens) that was ~every sample: a full qlora epoch "trained" with no learning (observed live: loss=0.0000 floods, avg_loss=93.87 from NaN sentinels, 559 loss tokens across 160 samples ~ 3.5/sample for 30-60-token responses).

Fix: window math extracted to pure fn cuda_loss_window. Effective capacity = GPU scratch capacity when present (sized from InstructConfig::max_seq_len at init), min(max_position_embeddings, 512) only as the no-scratch fallback, mirroring forward_cuda_training exactly. Prompt-overflow samples now skip LOUDLY (per-sample stderr warning naming the --max-seq-len remedy) instead of averaging silent zeros.

Falsifier FALSIFY-CUDA-LOSS-WINDOW-512-001 mutation-verified: restoring the hardcoded clamp makes falsify_cuda_loss_window_honors_scratch_capacity go RED; GREEN on fix. Contract finetune-cuda-loss-window-v1.yaml (pv lint PASS).

Next falsifier in the cascade after the NF4 QLoRA deadlock fix (#2249); together they make apr finetune -m qlora actually train long-system-prompt corpora on the 4090.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…pe pairing, Q/K/V biases, full-warp softmax, causal CPU oracle (FALSIFY-CUDA-NF4-TRAIN-LOSS-PARITY-001) (#2252)

Cascade defect #4: after the stream-ordering fix (#2249/#2250/stream-bind) apr finetune -m qlora trained end-to-end but the loss sat FLAT at CE 13-14 (> ln(151936)=11.93, worse than uniform) on data the base model emits correctly in inference, then adapters blew into permanent NaN at ~step 125. The forward was finite but WRONG.

Oracle bisection (pure-CPU CE vs GPU-forward+CPU-CE vs GPU-forward+fused-GPU-CE, then per-op layer-0 bisection vs a manual CPU replay) found FOUR stacked defects:

1. WRONG ROPE PAIRING (dominant): entrenar's batched_rope_neox_forward/_backward wrappers instantiated BatchedRopeKernel, the ADJACENT-pair (GPT-J) rotation, the convention realizar reserves for non-NeoX rope types. Qwen2/LLaMA need NEOX split-half pairs (i, i+d/2). Every layer's Q/K rotated in the wrong basis (post-rope relL2 0.42/0.65 vs oracle while un-roped V matched at quant noise 0.09). Fix: new BatchedRopeNeoxKernel + BatchedRopeNeoxBackwardKernel (precise trig, CORRECTNESS-013) in aprender-gpu; wrappers + pre-warm keys rewired. BatchedRopeKernel semantics preserved for realizar's non-NeoX users.

2. DROPPED Q/K/V BIASES: CudaNf4TransformerBlock had no bias support at all (Qwen2 use_bias=true; CPU path applies them). Bias drop alone shifts toy causal CE 2.13 -> 4.49. Fix: replicated bias buffers + cuda_add_inplace after each projection GEMM (before QK-norm/RoPE), threaded from all three NF4 construction sites + the instruct FP32 site (which passed None despite FALSIFY-CUDA-FORWARD-PARITY-002).

3. PARTIAL-WARP SHFL UB: batched softmax forward/backward launched block=(32.min(row_size)) while the kernels' reductions use shfl.sync membermask 0xFFFFFFFF, which is undefined with inactive named lanes (PTX ISA). For seq<32 row max/sum picked up garbage data-dependently -> 0/0=NaN rows. Fix: always launch a FULL 32-lane warp (guarded loops carry reduction identities on idle lanes).

4. NON-CAUSAL CPU ORACLE (label leakage): autograd::ops::attention had NO causal mask: the CPU train/eval path for decoder models attended bidirectionally, leaking label tokens backwards (toy causal CE 2.13 reported as 0.17) and corrupting CPU training/eval. Fix: attention_causal (masked scores; shared softmax backward is exact since masked weights are 0) dispatched for ModelArchitecture::Decoder; encoders stay bidirectional.

Falsifier FALSIFY-CUDA-NF4-TRAIN-LOSS-PARITY-001 (parity_probe.rs): GPU fused loss vs an NF4-quantization-MATCHED causal CPU oracle (|dCE|<0.5, full-logits relL2<0.10, toy CE<6, fused-vs-CPU CE on identical logits <0.05). Mutation-verified RED per reverted fix: rope -> relL2 0.183; biases -> CE 5.17/relL2 0.97; warp -> NaN. GREEN: CE 0.6820 vs 0.6557 (|d|=0.026), logits relL2 0.047.

E2E (RTX 4090, apr_code_sft_balanced @ --max-seq-len 2048): first-step CE now 1.58 (was 13-14), 0 NaN across the epoch; at lr 2e-5 the loss DECREASES 4.31 -> 0.29 by step 50. (The auto-selected lr 2e-4 @ rank 256 still diverges after ~25 steps; separate training-dynamics defect, gradients verified descending.)

Contract: contracts/cuda-nf4-train-loss-parity-v1.yaml (pv lint PASS). Probes: instruct_pipeline/parity_probe.rs (loss/logits falsifier + layer bisect), transformer/cuda_block_parity_probe.rs (per-op layer-0).

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
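
The rope-pairing defect (item 1 in the commit message above) comes down to which two elements of a head vector share one rotation. The following CPU sketch contrasts the two conventions. It is not the aprender-gpu kernel: the function names are hypothetical, and the base^(-2i/d) frequency schedule with a caller-supplied base is the standard RoPE formulation, assumed here rather than taken from the repository.

```rust
/// Rotation angle for pair index `i` at position `pos`, using the standard
/// RoPE schedule base^(-2i/d) (assumed; not read from the repository).
fn rope_angle(i: usize, d: usize, pos: usize, base: f32) -> f32 {
    pos as f32 * base.powf(-2.0 * i as f32 / d as f32)
}

/// Adjacent-pair (GPT-J style) rotation: element 2i is paired with 2i+1.
pub fn rope_adjacent(x: &[f32], pos: usize, base: f32) -> Vec<f32> {
    let d = x.len();
    let mut out = x.to_vec();
    for i in 0..d / 2 {
        let (s, c) = rope_angle(i, d, pos, base).sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        out[2 * i] = a * c - b * s;
        out[2 * i + 1] = a * s + b * c;
    }
    out
}

/// NeoX split-half rotation: element i is paired with i + d/2.
pub fn rope_neox(x: &[f32], pos: usize, base: f32) -> Vec<f32> {
    let d = x.len();
    let half = d / 2;
    let mut out = x.to_vec();
    for i in 0..half {
        let (s, c) = rope_angle(i, d, pos, base).sin_cos();
        let (a, b) = (x[i], x[i + half]);
        out[i] = a * c - b * s;
        out[i + half] = a * s + b * c;
    }
    out
}
```

For a 4-element vector `[1, 0, 0, 0]` at position 1, the adjacent-pair version moves energy into index 1 while the split-half version moves it into index 2, so a model trained with one convention sees rotated Q/K in the wrong basis under the other.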

noahgift enabled auto-merge July 1, 2026 20:41

Merge branch 'main' into fix/finetune-cuda-loss-window-512-clamp

69dc3d7

noahgift added this pull request to the merge queue Jul 1, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jul 1, 2026

noahgift added this pull request to the merge queue Jul 1, 2026

Merged via the queue into main with commit c19f7bb Jul 1, 2026
10 checks passed

noahgift deleted the fix/finetune-cuda-loss-window-512-clamp branch July 1, 2026 23:18

This was referenced Jul 1, 2026

fix(train): NF4 QLoRA NaN loss — per-call cuBLAS stream binding kills cross-stream data race #2251

Merged

fix(train): NF4 QLoRA CUDA forward computed the WRONG MODEL — NeoX rope pairing + Q/K/V biases + full-warp softmax + causal CPU oracle #2252

Merged

noahgift mentioned this pull request Jul 2, 2026

chore(release): 0.57.0 — GPU QLoRA fine-tuning works + runnable merge + 3 enforced beats #2256

Merged

noahgift merged 2 commits into main from fix/finetune-cuda-loss-window-512-clamp
