[trainer,hparams,examples,docs] feat: add Flow-DPPO algorithm #181

Merged: Jayce-Ping merged 8 commits into main from feat/dppo-algorithm on Jun 14, 2026

@Jayce-Ping

Collaborator

Summary

Flow-DPPO: a strict Flow-GRPO variant that replaces the PPO ratio-clip with a KL trust-region mask. A sample's gradient is zeroed when its per-step KL(current ‖ rollout-old) >= kl_mask_threshold and the update would push the policy further away from the rollout-old policy (ratio>1 & adv>0, or ratio<1 & adv<0). Group advantages and the optional KL-vs-reference penalty are inherited from GRPO.
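
The masking rule above can be sketched in plain Python. This is a minimal illustration, not the trainer's code: the function names `dppo_keep_mask` and `masked_policy_loss`, the list-based inputs, and the mean-over-all-samples reduction are assumptions made for the example. Only the mask condition itself comes from the description.

```python
def dppo_keep_mask(ratio, adv, kl, kl_mask_threshold):
    """Return 1.0 where a sample's gradient is kept and 0.0 where it is masked.

    A sample is masked only when BOTH conditions hold:
      * its per-step KL(current || rollout-old) is at or above the threshold
      * the update would move the ratio further from 1
        (ratio > 1 with adv > 0, or ratio < 1 with adv < 0)
    """
    keep = []
    for r, a, k in zip(ratio, adv, kl):
        outside_region = k >= kl_mask_threshold
        pushes_away = (r > 1.0 and a > 0.0) or (r < 1.0 and a < 0.0)
        keep.append(0.0 if (outside_region and pushes_away) else 1.0)
    return keep


def masked_policy_loss(ratio, adv, keep):
    """Unclipped surrogate loss with masked terms zeroed (assumed reduction: mean)."""
    terms = [-(r * a * m) for r, a, m in zip(ratio, adv, keep)]
    return sum(terms) / len(terms)


# Hypothetical per-sample values for one timestep.
ratio = [1.2, 0.8, 1.2, 0.8, 1.2]
adv = [1.0, -1.0, -1.0, 1.0, 1.0]
kl = [1e-3, 1e-3, 1e-3, 1e-3, 1e-9]

keep = dppo_keep_mask(ratio, adv, kl, kl_mask_threshold=1e-6)
print(keep)  # [0.0, 0.0, 1.0, 1.0, 1.0]
print(masked_policy_loss(ratio, adv, keep))
```

The first two samples are masked because they sit outside the trust region and the update would move them further out. Samples 3 and 4 are outside the region but the update pulls them back, so they are kept. Sample 5 is inside the region, so it is kept regardless of direction.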

  • DPPOTrainingArguments(GRPOTrainingArguments) — adds kl_mask_threshold and kl_guidance_scale; overrides get_preprocess_guidance_scale() so negative prompts are encoded when the KL-ref CFG > 1.0 (mirrors DGPO kl_cfg).
  • DPPOTrainer(GRPOTrainer): sample() stores rollout-old noise_pred/next_latents_mean; optimize() applies the kl_adv mask and runs the KL-vs-ref forward at kl_guidance_scale. Registered as dppo (a second sanctioned exception that subclasses GRPOTrainer, like GRPO-Guard).
  • 4 example configs, examples/dppo/lora/{flux2_klein_base,sd3_5}/{single,multi}.yaml, aligned to the reference dppo configs on the three requested groups: scheduler params, kl_guidance_scale (flux2-klein 4.0 / sd3-5 4.5), kl_mask_threshold (1e-6). single = geneval2 drives the advantage (other rewards have weight 0); multi = PickScore+CLIP+GenEval2+HPSv2.
  • Docs: constraint #11 + base-class rule add DPPOTrainer → GRPOTrainer as a sanctioned exception; architecture/AGENTS/algorithms/README/examples updated.
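
To make the two new knobs and the guidance-scale override concrete, here is a rough standalone sketch. It is not the repository's `DPPOTrainingArguments`: the class name `DPPOArgsSketch`, the `guidance_scale` field, the `max(...)` rule, and the helper `needs_negative_prompts` are all assumptions for illustration. The default values for `kl_mask_threshold` (1e-6) and `kl_guidance_scale` (4.0, the flux2-klein value) are taken from the config notes above.

```python
from dataclasses import dataclass


@dataclass
class DPPOArgsSketch:
    """Hypothetical stand-in for the DPPO training arguments."""

    guidance_scale: float = 1.0      # assumed name for the sampling CFG scale
    kl_mask_threshold: float = 1e-6  # trust-region threshold on per-step KL
    kl_guidance_scale: float = 4.0   # CFG scale used for the KL-vs-ref forward

    def get_preprocess_guidance_scale(self) -> float:
        # Assumption: preprocessing looks at a single scale to decide whether
        # negative prompts must be encoded, so report the larger of the two.
        return max(self.guidance_scale, self.kl_guidance_scale)


def needs_negative_prompts(args: DPPOArgsSketch) -> bool:
    """Assumed preprocessing rule: encode negative prompts only when CFG > 1.0."""
    return args.get_preprocess_guidance_scale() > 1.0


sd3_like = DPPOArgsSketch(guidance_scale=1.0, kl_guidance_scale=4.5)
no_cfg = DPPOArgsSketch(guidance_scale=1.0, kl_guidance_scale=1.0)
print(needs_negative_prompts(sd3_like))  # True
print(needs_negative_prompts(no_cfg))    # False
```

The point of the override is that sampling may run without CFG while the KL-vs-reference forward still needs negative-prompt embeddings, so the preprocessing step has to be told about the KL-ref scale as well.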

Stacked on feat/geneval2-hpsv2-rewards (PR #180) because the multi configs reference geneval2/hpsv2. Retarget to main after #180 merges.

Test plan

  • black/isort clean on new modules; byte-compile OK
  • All 4 YAMLs parse; sampler geometry (#9a) satisfied; aligned params verified
  • Trainer/args registry entries + re-exports present
  • Runtime get_trainer_class('dppo') / get_training_args_class('dppo') + end-to-end sampling→optimize in a full training env (torch/transformers not on dev box)

Made with Cursor

@Jayce-Ping Jayce-Ping marked this pull request as draft June 11, 2026 02:21
@Jayce-Ping Jayce-Ping force-pushed the feat/dppo-algorithm branch 3 times, most recently from a34f32f to 2a60447 June 11, 2026 15:06
Jayce-Ping and others added 6 commits June 13, 2026 11:20
Flow-DPPO is a strict Flow-GRPO variant: it keeps GRPO's group advantages and
the optional KL-vs-reference penalty, but replaces the PPO ratio-clip with a KL
trust-region mask. A sample's gradient is zeroed when its per-step
KL(current || rollout-old) >= kl_mask_threshold and the update pushes the wrong
way (ratio>1 & adv>0, or ratio<1 & adv<0).

- DPPOTrainingArguments(GRPOTrainingArguments): kl_mask_threshold +
  kl_guidance_scale; get_preprocess_guidance_scale() override so negative
  prompts are encoded when the KL-ref CFG > 1.0 (DGPO kl_cfg pattern).
- DPPOTrainer(GRPOTrainer): sample() stores rollout-old noise_pred/
  next_latents_mean; optimize() applies the kl_adv mask and runs the KL-vs-ref
  forward at kl_guidance_scale. Registered as 'dppo'.
- 4 example configs (flux2_klein_base / sd3_5 x single / multi) aligned to the
  reference dppo configs on scheduler params, kl_guidance_scale, kl_mask_threshold.
- Docs: constraint #11 + base-class rule add DPPOTrainer->GRPOTrainer as a
  sanctioned exception; architecture/AGENTS/algorithms/README/examples updated.

Co-authored-by: Cursor <cursoragent@cursor.com>
…uple KL spaces

- DPPOTrainingArguments no longer inherits GRPOTrainingArguments (drops the unused
  PPO clip_range); the field set is minimal (advantage/adv_clip/ref + KL knobs).
- Decouple the two KL computations: kl_type controls the KL-vs-reference penalty
  space; new kl_mask_type controls the DPPO trust-region mask space. Trainer
  requests both forward outputs when the spaces differ.
- Rename example configs single/multi -> geneval2_single/geneval2_multi (the
  training set is GenEval2); add kl_mask_type, drop clip_range from configs.
- README: fix DPPO paper name to Flow-DPPO.
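
A small sketch of the decoupling described in this commit: the mask and the reference penalty each pick a KL space, and the forward pass is asked for one output per distinct space. The space name `"v-based"`, the mapping from space to tensor, and the function `requested_forward_outputs` are assumptions for illustration. `"x-based"`, `noise_pred`, and `next_latents_mean` are names that appear in this PR.

```python
# Assumed mapping from a KL space to the forward output it is computed on.
OUTPUT_FOR_SPACE = {
    "v-based": "noise_pred",         # hypothetical name for the velocity/noise space
    "x-based": "next_latents_mean",  # latent-mean space
}


def requested_forward_outputs(kl_type, kl_mask_type, use_ref_kl=True):
    """Return the forward outputs needed for the mask and, optionally, the ref penalty."""
    outputs = {OUTPUT_FOR_SPACE[kl_mask_type]}  # trust-region mask always needs its space
    if use_ref_kl:
        outputs.add(OUTPUT_FOR_SPACE[kl_type])  # KL-vs-reference penalty space
    return sorted(outputs)


print(requested_forward_outputs("x-based", "x-based"))
# ['next_latents_mean']
print(requested_forward_outputs("v-based", "x-based"))
# ['next_latents_mean', 'noise_pred']
```

When both settings name the same space, one output serves both computations; both outputs are requested only when the spaces differ.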

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Value-level alignment with Jayce-Ping/Mask Flow-Factory configs
(flux2-klein-base_config/dppo_{single,multi}.yaml, sd3-5_config ditto):

- log.verbose: false in all four configs (local default is true)
- sd3-5: config_file -> config/accelerate_configs/multi_gpu.yaml
  (flux2 keeps deepspeed_zero2, matching Mask)
- sd3-5 multi: preprocessing_batch_size 8 -> 32

Kept intentionally divergent from Mask: dataset/GenEval2 paths (local
flat train/test.jsonl corresponds to Mask's synthetic split), unified
eval source name `geneval2`, project name "Flow-Factory", and keys the
refactored DPPO trainer no longer has (mask_type, clip_range, etc.).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
In Mask's old schema, a test_set without eval_reward_names runs ALL
eval rewards, so the geneval2 test set was scored by all four rewards
while pickscore was explicitly restricted to the three general ones.
The schema translation pinned pick_score / clip_text_align / hpsv2 to
applicable_datasets: [pickscore], silently dropping them from geneval2
eval.

Drop the applicable_datasets restriction on the three general eval
rewards (omitted = applies to every eval dataset, mirroring the old
default-all rule); keep [geneval2] on the geneval2 Soft-TIFA reward
since pickscore prompts carry no vqa_list. Resulting routing matches
Mask: geneval2 -> 4 rewards, pickscore -> 3.
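
The routing rule can be checked with a few lines of Python. This is an illustrative sketch, not the project's config loader: the function `route_eval_rewards` and the dict layout are assumptions, while the reward names, the `applicable_datasets` key, and the "omitted means every dataset" rule come from the description above.

```python
def route_eval_rewards(rewards, datasets):
    """Map each eval dataset to the rewards that apply to it.

    A reward with no `applicable_datasets` entry applies to every dataset.
    """
    routing = {}
    for dataset in datasets:
        routing[dataset] = [
            name
            for name, cfg in rewards.items()
            if cfg.get("applicable_datasets") is None
            or dataset in cfg["applicable_datasets"]
        ]
    return routing


rewards = {
    "pick_score": {},       # general reward: no restriction
    "clip_text_align": {},  # general reward: no restriction
    "hpsv2": {},            # general reward: no restriction
    "geneval2": {"applicable_datasets": ["geneval2"]},  # needs vqa_list
}

routing = route_eval_rewards(rewards, ["geneval2", "pickscore"])
print(len(routing["geneval2"]), len(routing["pickscore"]))  # 4 3
```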

Audited the remaining config+code semantic pairs against the Mask fork
(training reward routing, weight-0 logging rewards, kl_adv mask vs DPPO
trust-region mask incl. the inert clip_range, adv clamp, KL-vs-ref
penalty, eval split defaults) - no other mismatches.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Jayce-Ping Jayce-Ping force-pushed the feat/dppo-algorithm branch from 6f54142 to aedad2a June 13, 2026 03:20
@Jayce-Ping Jayce-Ping changed the base branch from feat/geneval2-hpsv2-rewards to main June 13, 2026 03:20
@Jayce-Ping Jayce-Ping marked this pull request as ready for review June 13, 2026 21:08
Jayce-Ping and others added 2 commits June 14, 2026 05:20
Link the README DPPO row to arXiv:2606.11025, and add reference [15] in
guidance/algorithms.md (paper + code link) with a brief intro framing DPPO as a
divergence proximal constraint (exact per-step Gaussian KL, asymmetric mask).
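
For readers unfamiliar with the "exact per-step Gaussian KL" mentioned here: when two Gaussians share the same isotropic standard deviation, the KL divergence reduces to the squared distance between the means divided by twice the variance. The sketch below shows only that textbook identity. How the trainer derives the per-step standard deviation from `std_dev_t` and `dt`, and whether it sums or averages over latent dimensions, is not stated in this PR, so the function name and the summed reduction are assumptions.

```python
def gaussian_kl_shared_std(mean_p, mean_q, std):
    """KL( N(mean_p, std^2 I) || N(mean_q, std^2 I) ), summed over dimensions.

    With equal covariances the log-determinant and trace terms cancel,
    leaving ||mean_p - mean_q||^2 / (2 * std^2).
    """
    squared_distance = sum((p - q) ** 2 for p, q in zip(mean_p, mean_q))
    return squared_distance / (2.0 * std ** 2)


# Hypothetical two-dimensional means for the current and rollout-old policies.
current_mean = [1.0, 2.0]
rollout_old_mean = [0.0, 0.0]
print(gaussian_kl_shared_std(current_mean, rollout_old_mean, std=1.0))  # 2.5
print(gaussian_kl_shared_std(current_mean, current_mean, std=0.5))      # 0.0
```

Because the two step distributions share the same noise scale, this KL is symmetric in the two means, and a tiny threshold such as 1e-6 corresponds to a very small allowed shift in the predicted mean relative to the step noise.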

Co-authored-by: Cursor <cursoragent@cursor.com>
- sample()/optimize() store and gather only the kl_mask_type-space rollout
  tensor (the ref-KL penalty compares current vs reference, never the old
  policy), removing a latent-sized unused tensor from rollout storage/H2D.
- Request std_dev_t/dt only for the x-based mask; keep return_kwargs minimal.
- Drop the redundant per-timestep negative_* .to(device) loop (sample.to(device)
  already moves negative embeds into batch; the policy forward relies on this).
- Micro: dedup keep_mask.mean(), drop unused enumerate index and redundant
  float cast, trim low-information comments, note the asymmetric x-based mask
  (exact Gaussian KL) vs ref KL (GRPO unscaled) convention.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Jayce-Ping Jayce-Ping merged commit 7912095 into main Jun 14, 2026