Skip to content

fix(finetune): truthful gpu-backend banner + honor --max-seq-len + per-step progress#2247

Merged
noahgift merged 1 commit into
mainfrom
fix/finetune-gpu-banner-seqlen-progress
Jul 1, 2026
Merged

fix(finetune): truthful gpu-backend banner + honor --max-seq-len + per-step progress#2247
noahgift merged 1 commit into
mainfrom
fix/finetune-gpu-banner-seqlen-progress

Conversation

@noahgift

@noahgift noahgift commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Fixes three real apr finetune UX/correctness defects, each with a falsifiable, RED→GREEN-verified test.

Defect 1 — false GPU banner (misleading; cost real debugging time)

crates/apr-cli/src/commands/finetune.rs printed [gpu-backend] CUDA selected — using cuBLAS backward path purely off the --gpu-backend flag, independent of whether CUDA was actually initialized. InstructPipeline::from_apr only calls init_cuda when quantize_nf4 == true (crates/aprender-train/src/finetune/instruct_pipeline/constructors.rs:293), and quantize_nf4 is set only for QLoRA (matches!(config.method, Method::QLoRA)). So for plain -m lora, the banner was a false claim — training silently ran on CPU.

Fix: extracted a pure, testable gpu_backend_notice(gpu_backend, quantize_nf4, wgpu_available) -> GpuBackendPlan. It only claims cuBLAS/GPU when NF4 (QLoRA) is active; for plain LoRA it prints a clear WARNING that training runs on the CPU F32 path and to use -m qlora for GPU. Correct for wgpu and auto too.

Defect 2 — --max-seq-len silently ignored on the instruct/LoRA path

The CLI exposes --max-seq-len and plumbs it clap → dispatch.rsfinetune::run, but execute_training hardcoded InstructConfig { max_seq_len: 512, .. } (old finetune.rs:309), so apr finetune ... --max-seq-len 384 was silently dropped for LoRA/QLoRA (only the classify/multi-adapter paths honored it).

Fix: new build_instruct_config(config, lr, epochs, max_seq_len) helper threads the flag through (falls back to entrenar's default 512 when absent); plumbed max_seq_len through runrun_finetune_trainingexecute_training.

Defect 3 — no per-step training progress (a slow CPU epoch looks like a hang)

crates/aprender-train/src/finetune/instruct_trainer.rs (InstructTrainer::train) logged only per-EPOCH, so a multi-minute first epoch looked frozen.

Fix: low-noise per-step stderr progress — emitted on every 10th step, on the final step, or at least every ~10s — with step index/total, loss, and lr. Pure logging; the training math (loss/token accumulation, scheduler) is unchanged.

Tests (RED→GREEN verified by temporarily reintroducing each bug)

  • build_instruct_config_threads_max_seq_len — 384 must reach InstructConfig.max_seq_len (RED = 512 when bug present)
  • build_instruct_config_defaults_max_seq_len_to_512_when_absent
  • gpu_backend_notice_cuda_plain_lora_warns_cpu — RED = false cuBLAS claim
  • gpu_backend_notice_cuda_qlora_claims_cublas, _wgpu_selects_wgpu, _auto_plain_lora_is_cpu, _auto_qlora_prefers_wgpu_when_available

All 62 finetune + 5 instruct_trainer lib tests pass; cargo fmt + clippy clean on touched files. No contract was bumped — these are UX/correctness fixes (not kernel math), so a test-backed fix is used per repo guidance.

🤖 Generated with Claude Code

…r-step progress

Three real `apr finetune` UX/correctness defects, root-caused and fixed with
falsifiable tests.

Defect 1 — false GPU banner (misleading, cost real debug time).
`execute_training` printed `[gpu-backend] CUDA selected — using cuBLAS backward
path` purely off the `--gpu-backend` flag, independent of whether CUDA is
actually initialized. But `InstructPipeline::from_apr` only calls `init_cuda`
when `quantize_nf4 == true`, and `quantize_nf4` is set only for QLoRA
(finetune.rs: `matches!(config.method, Method::QLoRA)`; see
aprender-train/src/finetune/instruct_pipeline/constructors.rs:293). So for
plain `-m lora` the banner was a false claim — training silently ran on CPU.
Fix: extract a pure, testable `gpu_backend_notice(gpu_backend, quantize_nf4,
wgpu_available)` that only claims cuBLAS/GPU when NF4 (QLoRA) is active, and
otherwise WARNS that plain LoRA runs on the CPU F32 path (use `-m qlora` for
GPU). Accurate for wgpu and auto too.

Defect 2 — `--max-seq-len` silently ignored on the instruct/LoRA path.
The CLI exposes `--max-seq-len` and threads it clap→dispatch→run, but
`execute_training` hardcoded `InstructConfig { max_seq_len: 512, .. }`
(finetune.rs:309), so `apr finetune ... --max-seq-len 384` was dropped for
LoRA/QLoRA (only classify/multi-adapter honored it). Fix: new
`build_instruct_config(config, lr, epochs, max_seq_len)` helper threads the
flag (fallback 512 when absent); plumb `max_seq_len` run → run_finetune_training
→ execute_training.

Defect 3 — no per-step training progress (a slow CPU epoch looks like a hang).
`InstructTrainer::train` (instruct_trainer.rs:197) logged only per-EPOCH, so a
multi-minute first epoch looked frozen. Fix: low-noise per-step stderr progress
(every 10th step, the final step, or every ~10s) with step index/total, loss,
and lr. Pure logging — training math unchanged.

Tests (RED→GREEN verified by reintroducing each bug):
- build_instruct_config_threads_max_seq_len (384 must survive; RED=512)
- build_instruct_config_defaults_max_seq_len_to_512_when_absent
- gpu_backend_notice_cuda_plain_lora_warns_cpu (RED=false cuBLAS claim)
- gpu_backend_notice_{cuda_qlora,wgpu,auto_plain_lora,auto_qlora}
All 62 finetune + 5 instruct_trainer lib tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge July 1, 2026 11:33
@noahgift noahgift added this pull request to the merge queue Jul 1, 2026
Merged via the queue into main with commit 2a31541 Jul 1, 2026
11 checks passed
@noahgift noahgift deleted the fix/finetune-gpu-banner-seqlen-progress branch July 1, 2026 12:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant