fix(finetune): truthful gpu-backend banner + honor --max-seq-len + per-step progress#2247
Merged
Merged
Conversation
…r-step progress
Three real `apr finetune` UX/correctness defects, root-caused and fixed with
falsifiable tests.
Defect 1 — false GPU banner (misleading, cost real debug time).
`execute_training` printed `[gpu-backend] CUDA selected — using cuBLAS backward
path` purely off the `--gpu-backend` flag, independent of whether CUDA is
actually initialized. But `InstructPipeline::from_apr` only calls `init_cuda`
when `quantize_nf4 == true`, and `quantize_nf4` is set only for QLoRA
(finetune.rs: `matches!(config.method, Method::QLoRA)`; see
aprender-train/src/finetune/instruct_pipeline/constructors.rs:293). So for
plain `-m lora` the banner was a false claim — training silently ran on CPU.
Fix: extract a pure, testable `gpu_backend_notice(gpu_backend, quantize_nf4,
wgpu_available)` that only claims cuBLAS/GPU when NF4 (QLoRA) is active, and
otherwise WARNS that plain LoRA runs on the CPU F32 path (use `-m qlora` for
GPU). Accurate for wgpu and auto too.
Defect 2 — `--max-seq-len` silently ignored on the instruct/LoRA path.
The CLI exposes `--max-seq-len` and threads it clap→dispatch→run, but
`execute_training` hardcoded `InstructConfig { max_seq_len: 512, .. }`
(finetune.rs:309), so `apr finetune ... --max-seq-len 384` was dropped for
LoRA/QLoRA (only classify/multi-adapter honored it). Fix: new
`build_instruct_config(config, lr, epochs, max_seq_len)` helper threads the
flag (fallback 512 when absent); plumb `max_seq_len` run → run_finetune_training
→ execute_training.
Defect 3 — no per-step training progress (a slow CPU epoch looks like a hang).
`InstructTrainer::train` (instruct_trainer.rs:197) logged only per-EPOCH, so a
multi-minute first epoch looked frozen. Fix: low-noise per-step stderr progress
(every 10th step, the final step, or every ~10s) with step index/total, loss,
and lr. Pure logging — training math unchanged.
Tests (RED→GREEN verified by reintroducing each bug):
- build_instruct_config_threads_max_seq_len (384 must survive; RED=512)
- build_instruct_config_defaults_max_seq_len_to_512_when_absent
- gpu_backend_notice_cuda_plain_lora_warns_cpu (RED=false cuBLAS claim)
- gpu_backend_notice_{cuda_qlora,wgpu,auto_plain_lora,auto_qlora}
All 62 finetune + 5 instruct_trainer lib tests pass; clippy clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes three real
apr finetuneUX/correctness defects, each with a falsifiable, RED→GREEN-verified test.Defect 1 — false GPU banner (misleading; cost real debugging time)
crates/apr-cli/src/commands/finetune.rsprinted[gpu-backend] CUDA selected — using cuBLAS backward pathpurely off the--gpu-backendflag, independent of whether CUDA was actually initialized.InstructPipeline::from_apronly callsinit_cudawhenquantize_nf4 == true(crates/aprender-train/src/finetune/instruct_pipeline/constructors.rs:293), andquantize_nf4is set only for QLoRA (matches!(config.method, Method::QLoRA)). So for plain-m lora, the banner was a false claim — training silently ran on CPU.Fix: extracted a pure, testable
gpu_backend_notice(gpu_backend, quantize_nf4, wgpu_available) -> GpuBackendPlan. It only claims cuBLAS/GPU when NF4 (QLoRA) is active; for plain LoRA it prints a clear WARNING that training runs on the CPU F32 path and to use-m qlorafor GPU. Correct forwgpuandautotoo.Defect 2 —
--max-seq-lensilently ignored on the instruct/LoRA pathThe CLI exposes
--max-seq-lenand plumbs it clap →dispatch.rs→finetune::run, butexecute_traininghardcodedInstructConfig { max_seq_len: 512, .. }(oldfinetune.rs:309), soapr finetune ... --max-seq-len 384was silently dropped for LoRA/QLoRA (only theclassify/multi-adapter paths honored it).Fix: new
build_instruct_config(config, lr, epochs, max_seq_len)helper threads the flag through (falls back to entrenar's default 512 when absent); plumbedmax_seq_lenthroughrun→run_finetune_training→execute_training.Defect 3 — no per-step training progress (a slow CPU epoch looks like a hang)
crates/aprender-train/src/finetune/instruct_trainer.rs(InstructTrainer::train) logged only per-EPOCH, so a multi-minute first epoch looked frozen.Fix: low-noise per-step stderr progress — emitted on every 10th step, on the final step, or at least every ~10s — with step index/total, loss, and lr. Pure logging; the training math (loss/token accumulation, scheduler) is unchanged.
Tests (RED→GREEN verified by temporarily reintroducing each bug)
build_instruct_config_threads_max_seq_len— 384 must reachInstructConfig.max_seq_len(RED = 512 when bug present)build_instruct_config_defaults_max_seq_len_to_512_when_absentgpu_backend_notice_cuda_plain_lora_warns_cpu— RED = false cuBLAS claimgpu_backend_notice_cuda_qlora_claims_cublas,_wgpu_selects_wgpu,_auto_plain_lora_is_cpu,_auto_qlora_prefers_wgpu_when_availableAll 62 finetune + 5 instruct_trainer lib tests pass;
cargo fmt+ clippy clean on touched files. No contract was bumped — these are UX/correctness fixes (not kernel math), so a test-backed fix is used per repo guidance.🤖 Generated with Claude Code