fix(finetune): truthful gpu-backend banner + honor --max-seq-len + per-step progress by noahgift · Pull Request #2247 · paiml/aprender

noahgift · 2026-07-01T11:33:27Z

Fixes three real apr finetune UX/correctness defects, each with a falsifiable, RED→GREEN-verified test.

Defect 1 — false GPU banner (misleading; cost real debugging time)

crates/apr-cli/src/commands/finetune.rs printed [gpu-backend] CUDA selected — using cuBLAS backward path purely off the --gpu-backend flag, independent of whether CUDA was actually initialized. InstructPipeline::from_apr only calls init_cuda when quantize_nf4 == true (crates/aprender-train/src/finetune/instruct_pipeline/constructors.rs:293), and quantize_nf4 is set only for QLoRA (matches!(config.method, Method::QLoRA)). So for plain -m lora, the banner was a false claim — training silently ran on CPU.

Fix: extracted a pure, testable gpu_backend_notice(gpu_backend, quantize_nf4, wgpu_available) -> GpuBackendPlan. It only claims cuBLAS/GPU when NF4 (QLoRA) is active; for plain LoRA it prints a clear WARNING that training runs on the CPU F32 path and to use -m qlora for GPU. Correct for wgpu and auto too.
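
For reviewers who want the decision table in one place, here is a minimal sketch of what a pure notice function of this shape could look like. Only the function name, its three parameters, and the `GpuBackendPlan` return type come from this PR. The enum variants, struct fields, and message strings are assumptions for illustration, as is the choice to fall back to CUDA when `auto` is used with QLoRA and wgpu is unavailable.

```rust
/// Hypothetical mirror of the `--gpu-backend` flag values named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuBackend {
    Auto,
    Cuda,
    Wgpu,
}

/// Where training will actually run (assumed shape, not the real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Device {
    CpuF32,
    CudaCublas,
    Wgpu,
}

/// Assumed fields: the real `GpuBackendPlan` may carry different data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GpuBackendPlan {
    pub device: Device,
    pub is_warning: bool,
    pub message: &'static str,
}

/// Pure function: no CUDA or wgpu calls, so every branch is unit-testable.
pub fn gpu_backend_notice(
    gpu_backend: GpuBackend,
    quantize_nf4: bool,
    wgpu_available: bool,
) -> GpuBackendPlan {
    // Without NF4 (plain LoRA) no GPU path is initialized, so never claim one.
    if !quantize_nf4 {
        return GpuBackendPlan {
            device: Device::CpuF32,
            is_warning: true,
            message: "[gpu-backend] WARNING: plain LoRA runs on the CPU F32 path; use -m qlora for GPU",
        };
    }
    // With NF4 (QLoRA): honor an explicit choice, otherwise prefer wgpu if present.
    let use_wgpu = match gpu_backend {
        GpuBackend::Wgpu => true,
        GpuBackend::Auto => wgpu_available,
        GpuBackend::Cuda => false,
    };
    if use_wgpu {
        GpuBackendPlan {
            device: Device::Wgpu,
            is_warning: false,
            message: "[gpu-backend] wgpu selected",
        }
    } else {
        // Assumption: `auto` without wgpu falls back to the CUDA path.
        GpuBackendPlan {
            device: Device::CudaCublas,
            is_warning: false,
            message: "[gpu-backend] CUDA selected, using cuBLAS backward path",
        }
    }
}
```

Keeping the decision free of side effects is what lets the `gpu_backend_notice_*` tests assert on the plan directly instead of scraping stdout.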

Defect 2 — `--max-seq-len` silently ignored on the instruct/LoRA path

The CLI exposes --max-seq-len and plumbs it clap → dispatch.rs → finetune::run, but execute_training hardcoded InstructConfig { max_seq_len: 512, .. } (old finetune.rs:309), so apr finetune ... --max-seq-len 384 was silently dropped for LoRA/QLoRA (only the classify/multi-adapter paths honored it).

Fix: new build_instruct_config(config, lr, epochs, max_seq_len) helper threads the flag through (falls back to entrenar's default 512 when absent); plumbed max_seq_len through run → run_finetune_training → execute_training.
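
The shape of the fix can be sketched as follows. This is a simplified stand-in, not the real code: the actual helper also takes `config`, and the real `InstructConfig` has more fields than shown. The field names `learning_rate` and `epochs`, the `Option<usize>` parameter type, and the `DEFAULT_MAX_SEQ_LEN` constant are assumptions.

```rust
/// Trimmed-down stand-in for the real `InstructConfig` (assumed fields).
#[derive(Debug, Clone, PartialEq)]
pub struct InstructConfig {
    pub learning_rate: f32,
    pub epochs: usize,
    pub max_seq_len: usize,
}

/// Fallback used when `--max-seq-len` is not passed (512 per this PR).
pub const DEFAULT_MAX_SEQ_LEN: usize = 512;

/// Threads the CLI value into the config instead of hardcoding 512.
/// Simplified: the real helper also receives the finetune `config`.
pub fn build_instruct_config(
    lr: f32,
    epochs: usize,
    max_seq_len: Option<usize>,
) -> InstructConfig {
    InstructConfig {
        learning_rate: lr,
        epochs,
        // The bug was equivalent to writing `max_seq_len: 512` here.
        max_seq_len: max_seq_len.unwrap_or(DEFAULT_MAX_SEQ_LEN),
    }
}
```

With this shape, the RED case is easy to reproduce: replacing the `unwrap_or` line with a literal `512` makes a `Some(384)` input come back as 512.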

Defect 3 — no per-step training progress (a slow CPU epoch looks like a hang)

crates/aprender-train/src/finetune/instruct_trainer.rs (InstructTrainer::train) logged only per-EPOCH, so a multi-minute first epoch looked frozen.

Fix: low-noise per-step stderr progress — emitted on every 10th step, on the final step, or at least every ~10s — with step index/total, loss, and lr. Pure logging; the training math (loss/token accumulation, scheduler) is unchanged.
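
The throttling rule above can be expressed as a small predicate. This is a sketch under stated assumptions rather than the code in `InstructTrainer::train`: the function names, the 1-based step index, the exact 10-second threshold, and the log line format are all hypothetical.

```rust
use std::time::Duration;

/// Returns true when a progress line should be emitted for this step.
/// Assumes `step` is 1-based, so the final step satisfies `step == total_steps`.
pub fn should_log_step(step: usize, total_steps: usize, since_last_log: Duration) -> bool {
    let is_tenth = step % 10 == 0;
    let is_final = step == total_steps;
    // Time-based floor so slow CPU steps still show signs of life.
    let is_stale = since_last_log >= Duration::from_secs(10);
    is_tenth || is_final || is_stale
}

/// Formats one progress line with step index/total, loss, and lr.
/// The `[train]` prefix and field layout are illustrative only.
pub fn format_progress(step: usize, total_steps: usize, loss: f32, lr: f32) -> String {
    format!("[train] step {step}/{total_steps} loss={loss:.4} lr={lr:.2e}")
}
```

In a training loop the caller would track the `Instant` of the last emitted line, pass its `elapsed()` to `should_log_step`, and write the result of `format_progress` with `eprintln!` so progress stays on stderr and out of any piped stdout.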

Tests (RED→GREEN verified by temporarily reintroducing each bug)

build_instruct_config_threads_max_seq_len — 384 must reach InstructConfig.max_seq_len (RED = 512 when bug present)
build_instruct_config_defaults_max_seq_len_to_512_when_absent
gpu_backend_notice_cuda_plain_lora_warns_cpu — RED = false cuBLAS claim
gpu_backend_notice_cuda_qlora_claims_cublas
gpu_backend_notice_wgpu_selects_wgpu
gpu_backend_notice_auto_plain_lora_is_cpu
gpu_backend_notice_auto_qlora_prefers_wgpu_when_available

All 62 finetune + 5 instruct_trainer lib tests pass; cargo fmt + clippy clean on touched files. No contract was bumped — these are UX/correctness fixes (not kernel math), so a test-backed fix is used per repo guidance.

🤖 Generated with Claude Code

…r-step progress

Three real `apr finetune` UX/correctness defects, root-caused and fixed with falsifiable tests.

Defect 1 — false GPU banner (misleading, cost real debug time). `execute_training` printed `[gpu-backend] CUDA selected — using cuBLAS backward path` purely off the `--gpu-backend` flag, independent of whether CUDA is actually initialized. But `InstructPipeline::from_apr` only calls `init_cuda` when `quantize_nf4 == true`, and `quantize_nf4` is set only for QLoRA (finetune.rs: `matches!(config.method, Method::QLoRA)`; see aprender-train/src/finetune/instruct_pipeline/constructors.rs:293). So for plain `-m lora` the banner was a false claim — training silently ran on CPU.

Fix: extract a pure, testable `gpu_backend_notice(gpu_backend, quantize_nf4, wgpu_available)` that only claims cuBLAS/GPU when NF4 (QLoRA) is active, and otherwise WARNS that plain LoRA runs on the CPU F32 path (use `-m qlora` for GPU). Accurate for wgpu and auto too.

Defect 2 — `--max-seq-len` silently ignored on the instruct/LoRA path. The CLI exposes `--max-seq-len` and threads it clap→dispatch→run, but `execute_training` hardcoded `InstructConfig { max_seq_len: 512, .. }` (finetune.rs:309), so `apr finetune ... --max-seq-len 384` was dropped for LoRA/QLoRA (only classify/multi-adapter honored it).

Fix: new `build_instruct_config(config, lr, epochs, max_seq_len)` helper threads the flag (fallback 512 when absent); plumb `max_seq_len` run → run_finetune_training → execute_training.

Defect 3 — no per-step training progress (a slow CPU epoch looks like a hang). `InstructTrainer::train` (instruct_trainer.rs:197) logged only per-EPOCH, so a multi-minute first epoch looked frozen.

Fix: low-noise per-step stderr progress (every 10th step, the final step, or every ~10s) with step index/total, loss, and lr. Pure logging — training math unchanged.

Tests (RED→GREEN verified by reintroducing each bug):
- build_instruct_config_threads_max_seq_len (384 must survive; RED=512)
- build_instruct_config_defaults_max_seq_len_to_512_when_absent
- gpu_backend_notice_cuda_plain_lora_warns_cpu (RED=false cuBLAS claim)
- gpu_backend_notice_{cuda_qlora,wgpu,auto_plain_lora,auto_qlora}

All 62 finetune + 5 instruct_trainer lib tests pass; clippy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

noahgift enabled auto-merge July 1, 2026 11:33

noahgift added this pull request to the merge queue Jul 1, 2026

Merged via the queue into main with commit 2a31541 Jul 1, 2026
11 checks passed

noahgift deleted the fix/finetune-gpu-banner-seqlen-progress branch July 1, 2026 12:13

This was referenced Jul 1, 2026

chore(release): 0.56.0 — apr-format sovereign leaf + f16-RNE sweep + honest finetune GPU + Ollama /api streaming #2248

Merged

fix(finetune): CUDA loss window honors --max-seq-len — kill hardcoded 512 clamp that silently trained nothing #2250

Merged

noahgift merged 1 commit into main from fix/finetune-gpu-banner-seqlen-progress

noahgift commented Jul 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

noahgift commented Jul 1, 2026

Defect 1 — false GPU banner (misleading; cost real debugging time)

Defect 2 — --max-seq-len silently ignored on the instruct/LoRA path

Defect 3 — no per-step training progress (a slow CPU epoch looks like a hang)

Tests (RED→GREEN verified by temporarily reintroducing each bug)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Defect 2 — `--max-seq-len` silently ignored on the instruct/LoRA path