
fix(finetune): apr finetune --merge produces a directly-runnable .apr (fail-closed gate + duplicate-field poison + GGUF↔HF adapter naming)#2254

Merged: noahgift merged 3 commits into main from fix/apr-merge-runnable on Jul 2, 2026
Conversation

@noahgift

@noahgift noahgift commented Jul 2, 2026

Summary

Merging a LoRA adapter into an import-produced base (apr finetune --merge) produced a file apr run rejected (C-01 missing architecture; tokenizer reported missing) even though both were physically in the container — and, more insidiously, merged zero layers while reporting success. Two stacked root causes:

1. Duplicate-field metadata poison → silent total metadata loss

Import-produced bases stamp HF-alias dim keys (num_hidden_layers, num_attention_heads, num_key_value_heads) into AprV2Metadata.custom. run_merge re-serialized the cloned metadata with both "num_layers": null (typed Option, no skip_serializing_if) and the alias key. Realizar's AprMetadata aliases them (PMAT-111) → serde "duplicate field" error → MappedAprModel::from_mmap swallows it via unwrap_or_default(), silently dropping all metadata including the embedded tokenizer. Fix: skip_serializing_if on the typed dim fields + backfill architecture/dims; a fail-closed post-write gate re-opens the output and rejects (deletes) anything not loadable.
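The fail-closed post-write gate can be pictured with a minimal, stdlib-only sketch. Everything here is hypothetical: `write_with_gate` and `validate_output` are illustrative names, and the magic-prefix check merely stands in for re-opening the file through the runtime's own loader, which is what the PR describes.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical validator: a stand-in for "re-open the output through the
/// runtime's loader". It only checks a made-up magic prefix.
fn validate_output(bytes: &[u8]) -> Result<(), String> {
    if bytes.starts_with(b"APR") {
        Ok(())
    } else {
        Err("output is not loadable".to_string())
    }
}

/// Write `bytes` to `path`, then re-read and validate what actually landed
/// on disk. If validation fails, delete the file and return an error, so a
/// broken artifact is never left behind (fail-closed).
fn write_with_gate(path: &Path, bytes: &[u8]) -> io::Result<()> {
    fs::write(path, bytes)?;
    let reread = fs::read(path)?;
    if let Err(reason) = validate_output(&reread) {
        // Best-effort delete; the validation error is what gets reported.
        let _ = fs::remove_file(path);
        return Err(io::Error::new(io::ErrorKind::InvalidData, reason));
    }
    Ok(())
}
```

The point of validating the re-read bytes rather than the in-memory buffer is that the gate exercises the same path a later consumer would take.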

2. GGUF↔HF adapter-name mismatch → 0 layers merged

Import bases are GGUF-named (blk.N.attn_q.weight); entrenar adapters are HF-named (lora.N.q_proj.lora_a). adapter_pair_names only generated lora.N.attn_q.* candidates → zero matches → the base was copied through unchanged (a "merged" model that is really the untrained base). Fix: gguf_proj_to_hf() adds the HF-named candidate per GGUF base tensor.
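A minimal sketch of the name translation, assuming the GGUF→HF table listed in the second commit message (attn_q/k/v, attn_output, ffn_gate/up/down). The function bodies are illustrative guesses, not the merged code, and `lora_a_candidates` is a hypothetical helper standing in for whatever adapter_pair_names actually does.

```rust
/// Map a GGUF projection name to its HF spelling. The table follows the
/// commit message; the implementation is an illustrative guess.
fn gguf_proj_to_hf(proj: &str) -> Option<&'static str> {
    match proj {
        "attn_q" => Some("q_proj"),
        "attn_k" => Some("k_proj"),
        "attn_v" => Some("v_proj"),
        "attn_output" => Some("o_proj"),
        "ffn_gate" => Some("gate_proj"),
        "ffn_up" => Some("up_proj"),
        "ffn_down" => Some("down_proj"),
        _ => None,
    }
}

/// Hypothetical helper: candidate `lora_a` tensor names for a GGUF-named
/// base tensor of the form "blk.<layer>.<proj>.weight".
fn lora_a_candidates(base_tensor: &str) -> Vec<String> {
    let parts: Vec<&str> = base_tensor.split('.').collect();
    let (layer, proj) = match parts.as_slice() {
        ["blk", layer, proj, "weight"] => (*layer, *proj),
        _ => return Vec::new(), // not a per-layer projection tensor
    };
    // GGUF-named candidate first (the only one generated before the fix).
    let mut names = vec![format!("lora.{}.{}.lora_a", layer, proj)];
    // HF-named candidate, so entrenar-style adapters can match as well.
    if let Some(hf) = gguf_proj_to_hf(proj) {
        names.push(format!("lora.{}.{}.lora_a", layer, hf));
    }
    names
}
```

With this shape, `lora_a_candidates("blk.3.attn_q.weight")` yields both `lora.3.attn_q.lora_a` and `lora.3.q_proj.lora_a`, so an adapter in either naming scheme can be resolved.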

Verification

  • 6 falsifiers FALSIFY-APR-MERGE-RUNNABLE-001..005 (+004b) pass; 004b mutation-verified (neuter the GGUF→HF map → RED: 0 merged, gate rejects; restore → GREEN).
  • Contract contracts/apr-merge-runnable-v1.yaml: pv validate + pv lint contracts/ PASS.
  • E2E (RTX 4090): merging the real flip adapter (rank 256, q+v) into qwen2.5-coder-1.5b-instruct-q4k.apr → "Layers merged 56/339" (28×q+v), tokenizer embedded (PMAT-171), fail-closed gate passes, and apr run on the output works, emitting {"input":{"command":"cargo test --lib"},"name":"shell"} (coherent, correct-shaped tool-call payload).
  • Full cargo test -p apr-cli --lib: 6605 pass, 0 fail.

Closes the release-blocking model-surgery slice for v0.57.0; unblocks the apr-code flip smoke (the trained model now merges + runs end-to-end).

Remaining wave items (separate, filed): apr export --format safetensors garbage weights, apr import vocab/hidden transpose, apr convert --quantize q4k drops runnable tokenizer.

🤖 Generated with Claude Code

noahgift and others added 2 commits July 2, 2026 11:24
….apr — fail-closed post-write gate + duplicate-field metadata poison fix + entrenar adapter naming (FALSIFY-APR-MERGE-RUNNABLE-001..005)

Merging the flip adapter into qwen2.5-coder-1.5b-instruct-q4k.apr produced
a file apr run rejected (C-01 missing architecture; tokenizer 'missing')
even though BOTH were physically in the container.

ROOT CAUSE (duplicate-field poison): import-produced bases stamp HF-alias
dim keys (num_hidden_layers/num_attention_heads/num_key_value_heads) that
land in AprV2Metadata.custom (no serde aliases aprender-side). run_merge
re-serialized the cloned metadata with BOTH "num_layers": null (typed
Option field, no skip_serializing_if) AND the alias key. Realizar's
AprMetadata aliases them (PMAT-111), so serde fails with "duplicate
field" — and MappedAprModel::from_mmap swallows it via
unwrap_or_default(), silently dropping ALL metadata incl. the embedded
tokenizer.
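The failure mode above can be reproduced in miniature without serde. The sketch below is a toy: `Meta` and `parse_meta` are hypothetical, and a "key=value" line format stands in for the real JSON metadata. It shows only the mechanism: two spellings that resolve to one field make the parser fail, and `unwrap_or_default()` turns that failure into an empty struct.

```rust
#[derive(Debug, Default, PartialEq)]
struct Meta {
    num_layers: Option<u32>,
    tokenizer: Option<String>,
}

/// Toy parser over "key=value" lines. `num_hidden_layers` is treated as an
/// alias of `num_layers`, so seeing both is a duplicate-field error.
fn parse_meta(text: &str) -> Result<Meta, String> {
    let mut meta = Meta::default();
    let mut seen_layers = false;
    for line in text.lines() {
        let (key, value) = line.split_once('=').ok_or("malformed line")?;
        match key {
            // Both spellings resolve to the same typed field.
            "num_layers" | "num_hidden_layers" => {
                if seen_layers {
                    return Err("duplicate field `num_layers`".to_string());
                }
                seen_layers = true;
                // "null" does not parse as a number and becomes None.
                meta.num_layers = value.parse().ok();
            }
            "tokenizer" => meta.tokenizer = Some(value.to_string()),
            _ => {} // unknown keys are ignored in this toy
        }
    }
    Ok(meta)
}
```

Calling `parse_meta(poisoned).unwrap_or_default()` on an input containing both `num_layers=null` and `num_hidden_layers=28` returns `Meta::default()`: the tokenizer entry is gone even though it was present in the input. Propagating the error with `?` instead would surface the duplicate-field failure at load time.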

FIX (fail-closed at every layer):
- apr-format: AprV2Metadata skips serializing None fields (except the
  three C-APR-PROVENANCE keys FALSIFY-SHIP-022 pins to explicit null);
  new canonicalize_hf_aliases() promotes alias keys into typed fields
  and strips the alias spellings.
- run_merge: canonicalize + backfill arch/C-03 dims from tensor shapes
  and GH-376-style presets; clear stale quantization markers (output is
  F32); reject -o *.safetensors (no more APR-in-disguise); post-write
  gate re-opens the output and validates through realizar's OWN loader
  (GGUFConfig::from_apr C-01/C-03 + embedded BPE/SP tokenizer) — on
  failure the output is DELETED and the merge errors loudly.
- Adapter resolution: entrenar trainer naming lora.{layer}.{proj}.lora_{a,b}
  now merges against HF-named bases (live repro: 0/339 silently merged);
  rank derived per-tensor from the adapter's [rank, d_in] lora_a shape;
  alpha read from safetensors header OR the trainer's sidecar
  metadata.json; merged_count==0 with LoRA pairs present is a hard error.
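As a rough illustration of the adapter-resolution bullet, the sketch below derives the rank from a [rank, d_in] lora_a buffer and applies the delta to a base weight. It assumes the conventional LoRA scaling of alpha / rank and row-major f32 storage; neither is confirmed by this PR, and `merge_lora` and `check_merged_count` are hypothetical names.

```rust
/// Merge one LoRA pair into a row-major weight `w` of shape [d_out, d_in].
/// `a` is [rank, d_in] and `b` is [d_out, rank]; the rank is derived from
/// the length of `a`. Assumes the conventional alpha / rank scaling.
fn merge_lora(
    w: &mut [f32],
    d_out: usize,
    d_in: usize,
    a: &[f32],
    b: &[f32],
    alpha: f32,
) -> Result<usize, String> {
    if d_in == 0 || a.is_empty() || a.len() % d_in != 0 {
        return Err("lora_a shape is not [rank, d_in]".to_string());
    }
    let rank = a.len() / d_in;
    if w.len() != d_out * d_in || b.len() != d_out * rank {
        return Err("shape mismatch between base weight and adapter".to_string());
    }
    let scale = alpha / rank as f32;
    for i in 0..d_out {
        for j in 0..d_in {
            // (B x A)[i][j] = sum over r of B[i][r] * A[r][j]
            let mut delta = 0.0f32;
            for r in 0..rank {
                delta += b[i * rank + r] * a[r * d_in + j];
            }
            w[i * d_in + j] += scale * delta;
        }
    }
    Ok(rank)
}

/// Hard error when adapter pairs exist but none matched a base tensor,
/// instead of silently shipping an unchanged copy of the base.
fn check_merged_count(merged: usize, lora_pairs: usize) -> Result<(), String> {
    if lora_pairs > 0 && merged == 0 {
        return Err(format!("0 of {} LoRA pairs matched a base tensor", lora_pairs));
    }
    Ok(())
}
```

Deriving the rank from the tensor shape means the merge does not depend on a rank value recorded elsewhere.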

Falsifiers (all mutation-verified RED on pre-fix main, GREEN post-fix):
- 001 merged output loads through realizar with full metadata + tokenizer
- 002 .safetensors output extension rejected, nothing written
- 003 tokenizer-less base fails the gate, output deleted
- 004 entrenar-style adapter actually merges (shape-derived rank)
- 005 zero-match merge errors instead of shipping a base copy

Contract: contracts/apr-merge-runnable-v1.yaml (pv validate + pv lint PASS).
Suites: apr-cli 6604, apr-format 88, aprender-core 13970, aprender-serve
15454, aprender-contracts 1412 — all green.

E2E: real 1.5B flip merge passes the gate ([PMAT-171] tokenizer oracle
fires) and apr run loads the merged output directly.

Closes the release-blocking model-surgery wave slice for v0.57.0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…pters (else 0 layers merged)

An import-produced base names tensors GGUF-style (blk.N.attn_q.weight)
while entrenar trainer adapters name them HF-style (lora.N.q_proj.lora_a).
adapter_pair_names generated only lora.N.attn_q.lora_a candidates → ZERO
matches → run_merge silently copied the base through unchanged
(merged_count=0): a "merged" model that is really the untrained base.

Fix: gguf_proj_to_hf() maps attn_{q,k,v}/attn_output/ffn_{gate,up,down}
→ {q,k,v,o}_proj/{gate,up,down}_proj, adding the HF-named candidate for
each GGUF-named base tensor.

FALSIFY-APR-MERGE-RUNNABLE-004b mutation-verified: neuter the mapping →
RED (0 merged, fail-closed gate rejects); restore → GREEN. E2E: merging
the flip adapter (lora.N.{q,v}_proj) into
qwen2.5-coder-1.5b-instruct-q4k.apr (blk.N.attn_q) now reports
"Layers merged 56/339" (28 layers x q+v), embeds the tokenizer
(PMAT-171), passes the fail-closed gate, and the output runs — emitting
a coherent {"input":{"command":...},"name":"shell"} tool-call payload.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge July 2, 2026 10:37
@noahgift noahgift added this pull request to the merge queue Jul 2, 2026
Merged via the queue into main with commit 625ef84 Jul 2, 2026
10 checks passed
@noahgift noahgift deleted the fix/apr-merge-runnable branch July 2, 2026 12:03