fix(finetune): apr finetune --merge produces a directly-runnable .apr (fail-closed gate + duplicate-field poison + GGUF↔HF adapter naming) #2254
Merged
….apr: fail-closed post-write gate + duplicate-field metadata poison fix + entrenar adapter naming (FALSIFY-APR-MERGE-RUNNABLE-001..005)

Merging the flip adapter into qwen2.5-coder-1.5b-instruct-q4k.apr produced a file apr run rejected (C-01 missing architecture; tokenizer 'missing') even though BOTH were physically in the container.

ROOT CAUSE (duplicate-field poison): import-produced bases stamp HF-alias dim keys (num_hidden_layers/num_attention_heads/num_key_value_heads) that land in AprV2Metadata.custom (no serde aliases aprender-side). run_merge re-serialized the cloned metadata with BOTH "num_layers": null (typed Option field, no skip_serializing_if) AND the alias key. Realizar's AprMetadata aliases them (PMAT-111), so serde fails with "duplicate field", and MappedAprModel::from_mmap swallows it via unwrap_or_default(), silently dropping ALL metadata incl. the embedded tokenizer.

FIX (fail-closed at every layer):
- apr-format: AprV2Metadata skips serializing None fields (except the three C-APR-PROVENANCE keys FALSIFY-SHIP-022 pins to explicit null); new canonicalize_hf_aliases() promotes alias keys into typed fields and strips the alias spellings.
- run_merge: canonicalize + backfill arch/C-03 dims from tensor shapes and GH-376-style presets; clear stale quantization markers (output is F32); reject -o *.safetensors (no more APR-in-disguise); post-write gate re-opens the output and validates through realizar's OWN loader (GGUFConfig::from_apr C-01/C-03 + embedded BPE/SP tokenizer). On failure the output is DELETED and the merge errors loudly.
- Adapter resolution: entrenar trainer naming lora.{layer}.{proj}.lora_{a,b} now merges against HF-named bases (live repro: 0/339 silently merged); rank derived per-tensor from the adapter's [rank, d_in] lora_a shape; alpha read from safetensors header OR the trainer's sidecar metadata.json; merged_count==0 with LoRA pairs present is a hard error.

Falsifiers (all mutation-verified RED on pre-fix main, GREEN post-fix):
- 001 merged output loads through realizar with full metadata + tokenizer
- 002 .safetensors output extension rejected, nothing written
- 003 tokenizer-less base fails the gate, output deleted
- 004 entrenar-style adapter actually merges (shape-derived rank)
- 005 zero-match merge errors instead of shipping a base copy

Contract: contracts/apr-merge-runnable-v1.yaml (pv validate + pv lint PASS).
Suites: apr-cli 6604, apr-format 88, aprender-core 13970, aprender-serve 15454, aprender-contracts 1412; all green.
E2E: real 1.5B flip merge passes the gate ([PMAT-171] tokenizer oracle fires) and apr run loads the merged output directly.

Closes the release-blocking model-surgery wave slice for v0.57.0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
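The alias canonicalization named in the commit message can be pictured with a small std-only sketch. It is not the real canonicalize_hf_aliases(): the Meta struct, the promote() helper, the emitted_keys() helper, and the typed field names num_heads and num_kv_heads are assumptions made for illustration (only num_layers and the three HF alias spellings come from the message above). The policy that an already-set typed value wins over the alias is also an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the typed dimension fields plus the untyped
/// `custom` map. This is NOT the real AprV2Metadata layout.
#[derive(Debug, Default)]
struct Meta {
    num_layers: Option<u64>,
    num_heads: Option<u64>,    // assumed field name
    num_kv_heads: Option<u64>, // assumed field name
    custom: BTreeMap<String, u64>,
}

/// Remove `alias` from `custom`; if the typed slot is still empty, fill it.
/// The alias is stripped either way, so it can never be emitted next to
/// the typed key (the "duplicate field" condition).
fn promote(custom: &mut BTreeMap<String, u64>, alias: &str, slot: &mut Option<u64>) {
    if let Some(value) = custom.remove(alias) {
        if slot.is_none() {
            *slot = Some(value);
        }
    }
}

/// Sketch of alias canonicalization: promote the HF spellings into typed fields.
fn canonicalize_hf_aliases(meta: &mut Meta) {
    promote(&mut meta.custom, "num_hidden_layers", &mut meta.num_layers);
    promote(&mut meta.custom, "num_attention_heads", &mut meta.num_heads);
    promote(&mut meta.custom, "num_key_value_heads", &mut meta.num_kv_heads);
}

/// Keys a serializer would emit if None-valued typed fields are skipped.
fn emitted_keys(meta: &Meta) -> Vec<String> {
    let typed = [
        ("num_layers", meta.num_layers),
        ("num_heads", meta.num_heads),
        ("num_kv_heads", meta.num_kv_heads),
    ];
    let mut keys: Vec<String> = typed
        .iter()
        .filter(|(_, value)| value.is_some())
        .map(|(name, _)| name.to_string())
        .collect();
    keys.extend(meta.custom.keys().cloned());
    keys
}
```

After canonicalization each dimension is emitted under exactly one spelling, so a reader that aliases the HF names onto the typed names no longer sees the same logical field twice.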
…pters (else 0 layers merged)
An import-produced base names tensors GGUF-style (blk.N.attn_q.weight)
while entrenar trainer adapters name them HF-style (lora.N.q_proj.lora_a).
adapter_pair_names generated only lora.N.attn_q.lora_a candidates → ZERO
matches → run_merge silently copied the base through unchanged
(merged_count=0): a "merged" model that is really the untrained base.
Fix: gguf_proj_to_hf() maps attn_{q,k,v}/attn_output/ffn_{gate,up,down}
→ {q,k,v,o}_proj/{gate,up,down}_proj, adding the HF-named candidate for
each GGUF-named base tensor.
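A minimal sketch of the mapping described above, assuming a string-to-string signature for gguf_proj_to_hf() (the real signature is not shown in this message). The adapter_pair_candidates() helper is hypothetical and only illustrates how an HF-named candidate can be added next to the GGUF-named one for a base tensor of the form blk.N.<proj>.weight.

```rust
/// Map a GGUF projection name to its HF spelling.
/// Returns None for tensors that are not one of the seven projections.
fn gguf_proj_to_hf(gguf: &str) -> Option<&'static str> {
    match gguf {
        "attn_q" => Some("q_proj"),
        "attn_k" => Some("k_proj"),
        "attn_v" => Some("v_proj"),
        "attn_output" => Some("o_proj"),
        "ffn_gate" => Some("gate_proj"),
        "ffn_up" => Some("up_proj"),
        "ffn_down" => Some("down_proj"),
        _ => None,
    }
}

/// Hypothetical helper: candidate (lora_a, lora_b) adapter names for a base
/// tensor such as `blk.3.attn_q.weight`. The GGUF-named candidate comes first,
/// followed by the HF-named one when a mapping exists.
fn adapter_pair_candidates(base: &str) -> Vec<(String, String)> {
    let parts: Vec<&str> = base.split('.').collect();
    let (layer, proj) = match parts.as_slice() {
        ["blk", layer, proj, "weight"] => (*layer, *proj),
        _ => return Vec::new(), // not a per-layer projection weight
    };

    let mut names = vec![proj];
    if let Some(hf) = gguf_proj_to_hf(proj) {
        names.push(hf);
    }

    names
        .into_iter()
        .map(|p| {
            (
                format!("lora.{layer}.{p}.lora_a"),
                format!("lora.{layer}.{p}.lora_b"),
            )
        })
        .collect()
}
```

With this shape, blk.3.attn_q.weight yields both lora.3.attn_q.* and lora.3.q_proj.* candidates, so an entrenar-style adapter can match a GGUF-named base.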
FALSIFY-APR-MERGE-RUNNABLE-004b mutation-verified: neuter the mapping →
RED (0 merged, fail-closed gate rejects); restore → GREEN. E2E: merging
the flip adapter (lora.N.{q,v}_proj) into
qwen2.5-coder-1.5b-instruct-q4k.apr (blk.N.attn_q) now reports
"Layers merged 56/339" (28 layers x q+v), embeds the tokenizer
(PMAT-171), passes the fail-closed gate, and the output runs — emitting
a coherent {"input":{"command":...},"name":"shell"} tool-call payload.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
Merging a LoRA adapter into an import-produced base (`apr finetune --merge`) produced a file `apr run` rejected (C-01 missing architecture; tokenizer reported missing) even though both were physically in the container and, more insidiously, merged zero layers while reporting success. Two stacked root causes:

1. Duplicate-field metadata poison → silent total metadata loss

Import-produced bases stamp HF-alias dim keys (`num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`) into `AprV2Metadata.custom`. `run_merge` re-serialized the cloned metadata with both `"num_layers": null` (typed `Option`, no `skip_serializing_if`) and the alias key. Realizar's `AprMetadata` aliases them (PMAT-111) → serde `duplicate field` → `MappedAprModel::from_mmap` swallows it via `unwrap_or_default()`, silently dropping all metadata including the embedded tokenizer. Fix: `skip_serializing_if` on the typed dim fields + backfill architecture/dims; fail-closed post-write gate re-opens the output and rejects (deletes) anything not loadable.

2. GGUF↔HF adapter-name mismatch → 0 layers merged

Import bases are GGUF-named (`blk.N.attn_q.weight`); entrenar adapters are HF-named (`lora.N.q_proj.lora_a`). `adapter_pair_names` only generated `lora.N.attn_q.*` candidates → zero matches → the base was copied through unchanged (a "merged" model that is really the untrained base). Fix: `gguf_proj_to_hf()` adds the HF-named candidate per GGUF base tensor.

Verification

- `FALSIFY-APR-MERGE-RUNNABLE-001..005` (+004b) pass; 004b mutation-verified (neuter the GGUF→HF map → RED: 0 merged, gate rejects; restore → GREEN).
- `contracts/apr-merge-runnable-v1.yaml`: `pv validate` + `pv lint contracts/` PASS.
- `qwen2.5-coder-1.5b-instruct-q4k.apr` → "Layers merged 56/339" (28×q+v), tokenizer embedded (PMAT-171), fail-closed gate passes, and `apr run` on the output works: it emits `{"input":{"command":"cargo test --lib"},"name":"shell"}` (coherent, correct-shaped tool-call payload).
- `cargo test -p apr-cli --lib`: 6605 pass, 0 fail.

Closes the release-blocking model-surgery slice for v0.57.0; unblocks the apr-code flip smoke (the trained model now merges + runs end-to-end).
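The fail-closed post-write gate from the summary can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in this PR: write_fail_closed() and its validate callback are hypothetical names, and the callback stands in for validation through realizar's own loader.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write `bytes` to `out`, re-open the file from disk, and keep it only if
/// `validate` accepts what was actually written. On rejection the output is
/// deleted and an error is returned, so no unloadable artifact is left behind.
fn write_fail_closed<V>(out: &Path, bytes: &[u8], validate: V) -> io::Result<()>
where
    V: Fn(&[u8]) -> Result<(), String>,
{
    fs::write(out, bytes)?;

    // Validate the on-disk bytes, not the in-memory buffer.
    let verdict = fs::read(out)
        .map_err(|e| e.to_string())
        .and_then(|on_disk| validate(on_disk.as_slice()));

    if let Err(reason) = verdict {
        // Fail closed: best-effort delete, then report loudly.
        let _ = fs::remove_file(out);
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("post-write gate rejected {}: {reason}", out.display()),
        ));
    }
    Ok(())
}
```

Reading the file back before validating is the point of the design: the check then covers the serialized form a downstream loader will see, which is where the duplicate-field failure surfaced.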
Remaining wave items (separate, filed):
- `apr export --format safetensors` garbage weights
- `apr import` vocab/hidden transpose
- `apr convert --quantize q4k` drops runnable tokenizer

🤖 Generated with Claude Code