fix(finetune): apr finetune --merge produces a directly-runnable .apr (fail-closed gate + duplicate-field poison + GGUF↔HF adapter naming) by noahgift · Pull Request #2254 · paiml/aprender

noahgift · 2026-07-02T10:37:32Z

Summary

Merging a LoRA adapter into an import-produced base (apr finetune --merge) produced a file that apr run rejected (C-01 missing architecture; tokenizer reported missing) even though both were physically present in the container. More insidiously, the merge applied zero layers while reporting success. There were two stacked root causes:

1. Duplicate-field metadata poison → silent total metadata loss

Import-produced bases stamp HF-alias dim keys (num_hidden_layers, num_attention_heads, num_key_value_heads) into AprV2Metadata.custom. run_merge re-serialized the cloned metadata with both "num_layers": null (a typed Option field with no skip_serializing_if) and the alias key. Realizar's AprMetadata aliases the two spellings (PMAT-111), so serde fails with a duplicate-field error, and MappedAprModel::from_mmap swallows that error via unwrap_or_default(), silently dropping all metadata including the embedded tokenizer. Fix: skip_serializing_if on the typed dim fields, plus a backfill of architecture/dims; a fail-closed post-write gate re-opens the output and rejects (deletes) anything that is not loadable.
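The failure chain above can be reproduced in miniature. The sketch below is stdlib-only and does not use serde or the real AprMetadata; `Meta`, `canonical`, and `parse` are hypothetical stand-ins that only mimic the behaviour described here: a parser that rejects a field appearing twice once aliases are resolved, and a caller that hides the error with unwrap_or_default().

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a metadata struct (not the real AprMetadata).
#[derive(Debug, Default, PartialEq)]
struct Meta {
    num_layers: Option<u64>,
    tokenizer: Option<String>,
}

/// Resolves an alias spelling to its canonical field name (illustrative subset).
fn canonical(key: &str) -> &str {
    match key {
        "num_hidden_layers" => "num_layers",
        other => other,
    }
}

/// Parses (key, value) pairs; `None` models a JSON null.
/// A field that appears twice after alias resolution is an error.
fn parse(pairs: &[(&str, Option<&str>)]) -> Result<Meta, String> {
    let mut seen: BTreeMap<&str, Option<&str>> = BTreeMap::new();
    for &(key, value) in pairs {
        let name = canonical(key);
        if seen.insert(name, value).is_some() {
            return Err(format!("duplicate field `{name}`"));
        }
    }
    let num_layers = match seen.get("num_layers").copied().flatten() {
        Some(v) => Some(v.parse::<u64>().map_err(|e| e.to_string())?),
        None => None,
    };
    let tokenizer = seen.get("tokenizer").copied().flatten().map(str::to_owned);
    Ok(Meta { num_layers, tokenizer })
}

fn main() {
    // The merge wrote the typed field as null AND kept the alias key from the base.
    let poisoned = [
        ("num_layers", None),
        ("num_hidden_layers", Some("28")),
        ("tokenizer", Some("bpe")),
    ];
    // Fail-open: the error is swallowed and every field, tokenizer included, is lost.
    assert_eq!(parse(&poisoned).unwrap_or_default(), Meta::default());
    // Fail-closed: the same input surfaces the error to the caller.
    println!("{:?}", parse(&poisoned));

    // Skipping the null field on write leaves a single spelling, which parses cleanly.
    let clean = [("num_hidden_layers", Some("28")), ("tokenizer", Some("bpe"))];
    println!("{:?}", parse(&clean));
}
```

The point of the sketch is that the duplicate is harmless on its own; the damage comes from the caller defaulting on error, which is why the fix pairs the serialization change with a gate that refuses to keep an unloadable file.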

2. GGUF↔HF adapter-name mismatch → 0 layers merged

Import bases are GGUF-named (blk.N.attn_q.weight); entrenar adapters are HF-named (lora.N.q_proj.lora_a). adapter_pair_names only generated lora.N.attn_q.* candidates → zero matches → the base was copied through unchanged (a "merged" model that is really the untrained base). Fix: gguf_proj_to_hf() adds the HF-named candidate per GGUF base tensor.
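A minimal sketch of the candidate-name idea, assuming the projection mapping listed in this PR's commit message (attn_{q,k,v}/attn_output/ffn_{gate,up,down} to {q,k,v,o}_proj/{gate,up,down}_proj). The signature of `gguf_proj_to_hf` and the helper `lora_a_candidates` are assumptions made for illustration; the real adapter_pair_names may be shaped differently.

```rust
/// Maps a GGUF projection name to its HF spelling.
/// Hypothetical signature; only the name pairs come from the PR text.
fn gguf_proj_to_hf(proj: &str) -> Option<&'static str> {
    match proj {
        "attn_q" => Some("q_proj"),
        "attn_k" => Some("k_proj"),
        "attn_v" => Some("v_proj"),
        "attn_output" => Some("o_proj"),
        "ffn_gate" => Some("gate_proj"),
        "ffn_up" => Some("up_proj"),
        "ffn_down" => Some("down_proj"),
        _ => None,
    }
}

/// Returns the lora_a names to look up for a GGUF-named base tensor such as
/// `blk.3.attn_q.weight`: the GGUF-named candidate first, then the HF-named one.
/// (Illustrative helper, not the real adapter_pair_names.)
fn lora_a_candidates(base: &str) -> Vec<String> {
    let parts: Vec<&str> = base.split('.').collect();
    let (layer, proj) = match parts.as_slice() {
        ["blk", layer, proj, "weight"] => (*layer, *proj),
        _ => return Vec::new(), // not a per-layer GGUF weight tensor
    };
    let mut out = vec![format!("lora.{layer}.{proj}.lora_a")];
    if let Some(hf) = gguf_proj_to_hf(proj) {
        out.push(format!("lora.{layer}.{hf}.lora_a"));
    }
    out
}

fn main() {
    for name in lora_a_candidates("blk.3.attn_q.weight") {
        println!("{name}");
    }
}
```

Keeping the GGUF-named candidate alongside the HF-named one means adapters in either naming scheme can still match the same base tensor.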

Verification

6 falsifiers FALSIFY-APR-MERGE-RUNNABLE-001..005 (+004b) pass; 004b mutation-verified (neuter the GGUF→HF map → RED: 0 merged, gate rejects; restore → GREEN).
Contract contracts/apr-merge-runnable-v1.yaml — pv validate + pv lint contracts/ PASS.
E2E (RTX 4090): merging the real flip adapter (rank 256, q+v) into qwen2.5-coder-1.5b-instruct-q4k.apr → "Layers merged 56/339" (28×q+v), tokenizer embedded (PMAT-171), fail-closed gate passes, and apr run on the output works — emits {"input":{"command":"cargo test --lib"},"name":"shell"} (coherent, correct-shaped tool-call payload).
Full cargo test -p apr-cli --lib: 6605 pass, 0 fail.

Closes the release-blocking model-surgery slice for v0.57.0; unblocks the apr-code flip smoke (the trained model now merges + runs end-to-end).

Remaining wave items (separate, filed): apr export --format safetensors garbage weights, apr import vocab/hidden transpose, apr convert --quantize q4k drops runnable tokenizer.

🤖 Generated with Claude Code

….apr: fail-closed post-write gate + duplicate-field metadata poison fix + entrenar adapter naming (FALSIFY-APR-MERGE-RUNNABLE-001..005)

Merging the flip adapter into qwen2.5-coder-1.5b-instruct-q4k.apr produced a file apr run rejected (C-01 missing architecture; tokenizer 'missing') even though BOTH were physically in the container.

ROOT CAUSE (duplicate-field poison): import-produced bases stamp HF-alias dim keys (num_hidden_layers/num_attention_heads/num_key_value_heads) that land in AprV2Metadata.custom (no serde aliases aprender-side). run_merge re-serialized the cloned metadata with BOTH "num_layers": null (typed Option field, no skip_serializing_if) AND the alias key. Realizar's AprMetadata aliases them (PMAT-111), so serde fails with "duplicate field", and MappedAprModel::from_mmap swallows it via unwrap_or_default(), silently dropping ALL metadata incl. the embedded tokenizer.

FIX (fail-closed at every layer):
- apr-format: AprV2Metadata skips serializing None fields (except the three C-APR-PROVENANCE keys FALSIFY-SHIP-022 pins to explicit null); new canonicalize_hf_aliases() promotes alias keys into typed fields and strips the alias spellings.
- run_merge: canonicalize + backfill arch/C-03 dims from tensor shapes and GH-376-style presets; clear stale quantization markers (output is F32); reject -o *.safetensors (no more APR-in-disguise); post-write gate re-opens the output and validates through realizar's OWN loader (GGUFConfig::from_apr C-01/C-03 + embedded BPE/SP tokenizer). On failure the output is DELETED and the merge errors loudly.
- Adapter resolution: entrenar trainer naming lora.{layer}.{proj}.lora_{a,b} now merges against HF-named bases (live repro: 0/339 silently merged); rank derived per-tensor from the adapter's [rank, d_in] lora_a shape; alpha read from safetensors header OR the trainer's sidecar metadata.json; merged_count==0 with LoRA pairs present is a hard error.

Falsifiers (all mutation-verified RED on pre-fix main, GREEN post-fix):
- 001 merged output loads through realizar with full metadata + tokenizer
- 002 .safetensors output extension rejected, nothing written
- 003 tokenizer-less base fails the gate, output deleted
- 004 entrenar-style adapter actually merges (shape-derived rank)
- 005 zero-match merge errors instead of shipping a base copy

Contract: contracts/apr-merge-runnable-v1.yaml (pv validate + pv lint PASS).
Suites: apr-cli 6604, apr-format 88, aprender-core 13970, aprender-serve 15454, aprender-contracts 1412; all green.
E2E: real 1.5B flip merge passes the gate ([PMAT-171] tokenizer oracle fires) and apr run loads the merged output directly.

Closes the release-blocking model-surgery wave slice for v0.57.0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
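The post-write gate in this commit (write, re-open, validate, delete on failure, and refuse a .safetensors output before anything is written) follows a general pattern that can be sketched with the standard library alone. `write_with_gate` and the toy validator are hypothetical; the real gate validates through realizar's loader, which is not modelled here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Writes `bytes` to `path`, re-reads the file from disk, and keeps it only if
/// `validate` accepts what was actually written. Hypothetical helper.
fn write_with_gate<F>(path: &Path, bytes: &[u8], validate: F) -> io::Result<()>
where
    F: Fn(&[u8]) -> Result<(), String>,
{
    // Refuse a misleading extension before anything touches the disk.
    if path.extension().and_then(|e| e.to_str()) == Some("safetensors") {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "refusing to write this format under a .safetensors name",
        ));
    }
    fs::write(path, bytes)?;
    // Validate the bytes on disk, not the in-memory buffer.
    let reread = fs::read(path)?;
    if let Err(reason) = validate(&reread) {
        // Fail closed: best-effort delete so no unloadable artifact is left behind.
        let _ = fs::remove_file(path);
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("post-write gate rejected output: {reason}"),
        ));
    }
    Ok(())
}

/// Toy validator (an assumption for the example): the payload must mention a tokenizer.
fn has_tokenizer(bytes: &[u8]) -> Result<(), String> {
    if bytes.windows(9).any(|w| w == &b"tokenizer"[..]) {
        Ok(())
    } else {
        Err("no embedded tokenizer".to_string())
    }
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir();
    let good = dir.join("gate_sketch_good.apr");
    write_with_gate(&good, b"arch=qwen2;tokenizer=bpe", has_tokenizer)?;
    println!("kept: {}", good.exists());

    let bad = dir.join("gate_sketch_bad.apr");
    let result = write_with_gate(&bad, b"arch=qwen2", has_tokenizer);
    println!("rejected: {}, file left behind: {}", result.is_err(), bad.exists());

    fs::remove_file(&good)?;
    Ok(())
}
```

Re-reading from disk rather than checking the in-memory buffer is the design choice that matters: it exercises the same path a later consumer will take.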

…pters (else 0 layers merged)

An import-produced base names tensors GGUF-style (blk.N.attn_q.weight) while entrenar trainer adapters name them HF-style (lora.N.q_proj.lora_a). adapter_pair_names generated only lora.N.attn_q.lora_a candidates → ZERO matches → run_merge silently copied the base through unchanged (merged_count=0): a "merged" model that is really the untrained base.

Fix: gguf_proj_to_hf() maps attn_{q,k,v}/attn_output/ffn_{gate,up,down} → {q,k,v,o}_proj/{gate,up,down}_proj, adding the HF-named candidate for each GGUF-named base tensor.

FALSIFY-APR-MERGE-RUNNABLE-004b mutation-verified: neuter the mapping → RED (0 merged, fail-closed gate rejects); restore → GREEN.

E2E: merging the flip adapter (lora.N.{q,v}_proj) into qwen2.5-coder-1.5b-instruct-q4k.apr (blk.N.attn_q) now reports "Layers merged 56/339" (28 layers x q+v), embeds the tokenizer (PMAT-171), passes the fail-closed gate, and the output runs, emitting a coherent {"input":{"command":...},"name":"shell"} tool-call payload.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
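To make the difference between a copied base and a merged one concrete, here is a small sketch of folding one LoRA pair into a weight, with the rank taken from the [rank, d_in] shape of lora_a and a hard error when nothing merged. It assumes the conventional LoRA update W + (alpha / rank) · B·A; the PR does not spell out the scaling run_merge uses, and `Mat`, `merge_lora`, and `check_merged` are hypothetical names.

```rust
/// Minimal row-major f32 matrix (illustrative type, not from the codebase).
struct Mat {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

/// Folds one LoRA pair into `w` in place: W += (alpha / rank) * B·A.
/// `a` is lora_a with shape [rank, d_in]; `b` is lora_b with shape [d_out, rank].
/// The rank is read from `a`'s shape rather than from a config value.
fn merge_lora(w: &mut Mat, a: &Mat, b: &Mat, alpha: f32) -> Result<(), String> {
    let rank = a.rows;
    if rank == 0 || b.cols != rank || b.rows != w.rows || a.cols != w.cols {
        return Err(format!(
            "shape mismatch: W [{}, {}], A [{}, {}], B [{}, {}]",
            w.rows, w.cols, a.rows, a.cols, b.rows, b.cols
        ));
    }
    let scale = alpha / rank as f32;
    for i in 0..w.rows {
        for j in 0..w.cols {
            let mut delta = 0.0f32;
            for r in 0..rank {
                delta += b.data[i * rank + r] * a.data[r * a.cols + j];
            }
            w.data[i * w.cols + j] += scale * delta;
        }
    }
    Ok(())
}

/// A merge that matched nothing while adapter pairs exist is an error, not a success.
fn check_merged(merged_count: usize, pairs_present: usize) -> Result<usize, String> {
    if pairs_present > 0 && merged_count == 0 {
        Err(format!("0 of {pairs_present} LoRA pairs matched a base tensor"))
    } else {
        Ok(merged_count)
    }
}

fn main() {
    let mut w = Mat { rows: 2, cols: 2, data: vec![0.0; 4] };
    let a = Mat { rows: 1, cols: 2, data: vec![1.0, 2.0] }; // rank 1, d_in 2
    let b = Mat { rows: 2, cols: 1, data: vec![3.0, 4.0] }; // d_out 2, rank 1
    merge_lora(&mut w, &a, &b, 2.0).expect("shapes agree");
    println!("{:?}", w.data);
    println!("{:?}", check_merged(0, 56));
}
```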

noahgift and others added 2 commits July 2, 2026 11:24

noahgift enabled auto-merge July 2, 2026 10:37

Merge branch 'main' into fix/apr-merge-runnable

9224714

noahgift added this pull request to the merge queue Jul 2, 2026

Merged via the queue into main with commit 625ef84 Jul 2, 2026
10 checks passed

noahgift deleted the fix/apr-merge-runnable branch July 2, 2026 12:03

noahgift mentioned this pull request Jul 2, 2026

chore(release): 0.57.0 — GPU QLoRA fine-tuning works + runnable merge + 3 enforced beats #2256

Merged

noahgift merged 3 commits into main from fix/apr-merge-runnable
