[trainer,hparams,examples,docs] feat: add Flow-DPPO algorithm by Jayce-Ping · Pull Request #181 · X-GenGroup/Flow-Factory

Jayce-Ping · 2026-06-11T02:17:02Z

Summary

Flow-DPPO: a strict Flow-GRPO variant that replaces the PPO ratio-clip with a KL trust-region mask. A sample's gradient is zeroed when its per-step KL(current ‖ rollout-old) >= kl_mask_threshold and the update would push the wrong way (ratio>1 & adv>0, or ratio<1 & adv<0). Group advantages and the optional KL-vs-reference penalty are inherited from GRPO.
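The masking rule described above can be sketched in plain Python. This is only an illustration of the stated condition, not the trainer's code: the function names are hypothetical, the inputs are per-sample scalars rather than tensors, and the unclipped surrogate `-ratio * adv` is an assumption about the loss form. Under autograd, multiplying a sample's loss term by 0 is what zeroes its gradient.

```python
from typing import List, Sequence


def dppo_keep_mask(
    ratio: Sequence[float],
    adv: Sequence[float],
    kl: Sequence[float],
    kl_mask_threshold: float = 1e-6,
) -> List[float]:
    """Return 1.0 for samples that keep their gradient and 0.0 for masked ones."""
    keep = []
    for r, a, k in zip(ratio, adv, kl):
        # Condition 1: the per-step KL(current || rollout-old) reached the threshold.
        outside_region = k >= kl_mask_threshold
        # Condition 2: the directions named in the summary
        # (ratio>1 & adv>0, or ratio<1 & adv<0).
        masked_direction = (r > 1.0 and a > 0.0) or (r < 1.0 and a < 0.0)
        keep.append(0.0 if (outside_region and masked_direction) else 1.0)
    return keep


def masked_policy_loss(
    ratio: Sequence[float],
    adv: Sequence[float],
    kl: Sequence[float],
    kl_mask_threshold: float = 1e-6,
) -> float:
    """Mean of the (assumed) unclipped surrogate, with masked samples contributing 0."""
    keep = dppo_keep_mask(ratio, adv, kl, kl_mask_threshold)
    terms = [-r * a * m for r, a, m in zip(ratio, adv, keep)]
    return sum(terms) / len(terms)
```

Unlike a ratio clip, the mask depends on the KL value rather than on how far the ratio is from 1, so a sample with ratio > 1 and positive advantage still trains as long as its KL stays below `kl_mask_threshold`.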

- DPPOTrainingArguments(GRPOTrainingArguments): adds kl_mask_threshold and kl_guidance_scale; overrides get_preprocess_guidance_scale() so negative prompts are encoded when the KL-ref CFG > 1.0 (mirrors DGPO kl_cfg).
- DPPOTrainer(GRPOTrainer): sample() stores rollout-old noise_pred/next_latents_mean; optimize() applies the kl_adv mask and runs the KL-vs-ref forward at kl_guidance_scale. Registered as dppo (a second sanctioned exception that subclasses GRPOTrainer, like GRPO-Guard).
- 4 example configs, examples/dppo/lora/{flux2_klein_base,sd3_5}/{single,multi}.yaml, aligned to the reference dppo configs on the three requested groups: scheduler params, kl_guidance_scale (flux2-klein 4.0 / sd3-5 4.5), and kl_mask_threshold (1e-6). single: only geneval2 drives the advantage (the other rewards have weight 0); multi: PickScore + CLIP + GenEval2 + HPSv2.
- Docs: constraint #11 + base-class rule, architecture/AGENTS/algorithms/README/examples.
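As a rough illustration of the get_preprocess_guidance_scale() override described in the list above, the sketch below uses a standalone dataclass. Everything except the names kl_mask_threshold, kl_guidance_scale, and get_preprocess_guidance_scale is an assumption: the class name, the `guidance_scale` field, the default values, and the choice of `max(...)` as the combining rule are hypothetical and may differ from the actual implementation.

```python
from dataclasses import dataclass


@dataclass
class SketchDPPOArgs:
    """Hypothetical stand-in for the DPPO training arguments."""

    guidance_scale: float = 1.0      # assumed sampling CFG field (not named in the PR)
    kl_mask_threshold: float = 1e-6  # value used by the example configs
    kl_guidance_scale: float = 1.0   # CFG used for the KL-vs-reference forward

    def get_preprocess_guidance_scale(self) -> float:
        # Assumed rule: preprocessing must satisfy whichever forward pass
        # uses the larger CFG scale.
        return max(self.guidance_scale, self.kl_guidance_scale)


def needs_negative_prompts(args: SketchDPPOArgs) -> bool:
    """Negative prompts are only needed when some forward pass uses CFG > 1.0."""
    return args.get_preprocess_guidance_scale() > 1.0
```

With `kl_guidance_scale=4.5` (the sd3-5 value), `needs_negative_prompts` returns `True` even when sampling itself runs without CFG, which is the behavior the override is meant to guarantee.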

Stacked on feat/geneval2-hpsv2-rewards (PR #180) because the multi configs reference geneval2/hpsv2. Retarget to main after #180 merges.

Test plan

- black/isort clean on new modules; byte-compile OK
- All 4 YAMLs parse; sampler geometry (#9a) satisfied; aligned params verified
- Trainer/args registry entries + re-exports present
- Not yet run: runtime get_trainer_class('dppo') / get_training_args_class('dppo') checks and an end-to-end sampling→optimize pass, which need a full training env (torch/transformers are not installed on the dev box)

Made with Cursor

Flow-DPPO is a strict Flow-GRPO variant: it keeps GRPO's group advantages and the optional KL-vs-reference penalty, but replaces the PPO ratio-clip with a KL trust-region mask. A sample's gradient is zeroed when its per-step KL(current || rollout-old) >= kl_mask_threshold and the update pushes the wrong way (ratio>1 & adv>0, or ratio<1 & adv<0).

- DPPOTrainingArguments(GRPOTrainingArguments): kl_mask_threshold + kl_guidance_scale; get_preprocess_guidance_scale() override so negative prompts are encoded when the KL-ref CFG > 1.0 (DGPO kl_cfg pattern).
- DPPOTrainer(GRPOTrainer): sample() stores rollout-old noise_pred/next_latents_mean; optimize() applies the kl_adv mask and runs the KL-vs-ref forward at kl_guidance_scale. Registered as 'dppo'.
- 4 example configs (flux2_klein_base / sd3_5 x single / multi) aligned to the reference dppo configs on scheduler params, kl_guidance_scale, kl_mask_threshold.
- Docs: constraint #11 + base-class rule add DPPOTrainer->GRPOTrainer as a sanctioned exception; architecture/AGENTS/algorithms/README/examples updated.

Co-authored-by: Cursor <cursoragent@cursor.com>

…uple KL spaces

- DPPOTrainingArguments no longer inherits GRPOTrainingArguments (drops the unused PPO clip_range); the field set is minimal (advantage/adv_clip/ref + KL knobs).
- Decouple the two KL computations: kl_type controls the KL-vs-reference penalty space; new kl_mask_type controls the DPPO trust-region mask space. Trainer requests both forward outputs when the spaces differ.
- Rename example configs single/multi -> geneval2_single/geneval2_multi (the training set is GenEval2); add kl_mask_type, drop clip_range from configs.
- README: fix DPPO paper name to Flow-DPPO.

Co-authored-by: Cursor <cursoragent@cursor.com>
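The decoupling of kl_type and kl_mask_type can be pictured with a small helper that decides which forward outputs to request. This is a hedged sketch: the function name, the `use_ref_kl` flag, the space label "v-based", and the mapping from space to output tensor are assumptions for illustration (the PR only names an x-based mask and the noise_pred / next_latents_mean tensors).

```python
from typing import List

# Assumed mapping from KL space to the forward output it compares.
SPACE_TO_OUTPUT = {
    "v-based": "noise_pred",         # hypothetical label for the velocity/noise space
    "x-based": "next_latents_mean",  # latent-mean space
}


def required_forward_outputs(
    kl_type: str, kl_mask_type: str, use_ref_kl: bool = True
) -> List[str]:
    """List the outputs the policy forward must return for one optimize step."""
    # The trust-region mask always needs its own space.
    outputs = [SPACE_TO_OUTPUT[kl_mask_type]]
    if use_ref_kl:
        # The KL-vs-reference penalty adds a second output only when its
        # space differs from the mask space.
        ref_output = SPACE_TO_OUTPUT[kl_type]
        if ref_output not in outputs:
            outputs.append(ref_output)
    return outputs
```

When both knobs select the same space, a single output serves both computations, so the extra tensor is only requested in the mixed case.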

Co-authored-by: Cursor <cursoragent@cursor.com>

Value-level alignment with Jayce-Ping/Mask Flow-Factory configs (flux2-klein-base_config/dppo_{single,multi}.yaml, sd3-5_config ditto):

- log.verbose: false in all four configs (local default is true)
- sd3-5: config_file -> config/accelerate_configs/multi_gpu.yaml (flux2 keeps deepspeed_zero2, matching Mask)
- sd3-5 multi: preprocessing_batch_size 8 -> 32

Kept intentionally divergent from Mask: dataset/GenEval2 paths (local flat train/test.jsonl corresponds to Mask's synthetic split), unified eval source name `geneval2`, project name "Flow-Factory", and keys the refactored DPPO trainer no longer has (mask_type, clip_range, etc.).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

In Mask's old schema, a test_set without eval_reward_names runs ALL eval rewards, so the geneval2 test set was scored by all four rewards while pickscore was explicitly restricted to the three general ones. The schema translation pinned pick_score / clip_text_align / hpsv2 to applicable_datasets: [pickscore], silently dropping them from geneval2 eval.

Drop the applicable_datasets restriction on the three general eval rewards (omitted = applies to every eval dataset, mirroring the old default-all rule); keep [geneval2] on the geneval2 Soft-TIFA reward since pickscore prompts carry no vqa_list. Resulting routing matches Mask: geneval2 -> 4 rewards, pickscore -> 3.

Audited the remaining config+code semantic pairs against the Mask fork (training reward routing, weight-0 logging rewards, kl_adv mask vs DPPO trust-region mask incl. the inert clip_range, adv clamp, KL-vs-ref penalty, eval split defaults) - no other mismatches.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
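The routing rule in this commit (an omitted applicable_datasets means the reward applies to every eval dataset) can be checked with a few lines of Python. The dictionary layout and the function name are hypothetical; only the reward names, dataset names, and the resulting counts come from the commit message.

```python
from typing import Dict, List, Optional

# None stands for "applicable_datasets omitted", i.e. applies to every eval dataset.
EVAL_REWARDS: Dict[str, Optional[List[str]]] = {
    "pick_score": None,
    "clip_text_align": None,
    "hpsv2": None,
    "geneval2": ["geneval2"],  # Soft-TIFA reward needs vqa_list, so it stays pinned
}


def rewards_for(
    dataset: str, rewards: Dict[str, Optional[List[str]]] = EVAL_REWARDS
) -> List[str]:
    """Return the eval rewards that score the given dataset."""
    return [
        name
        for name, datasets in rewards.items()
        if datasets is None or dataset in datasets
    ]
```

Calling `rewards_for("geneval2")` yields all four rewards and `rewards_for("pickscore")` yields the three general ones, matching the routing stated above.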

Link the README DPPO row to arXiv:2606.11025, and add reference [15] in guidance/algorithms.md (paper + code link) with a brief intro framing DPPO as a divergence proximal constraint (exact per-step Gaussian KL, asymmetric mask). Co-authored-by: Cursor <cursoragent@cursor.com>

- sample()/optimize() store and gather only the kl_mask_type-space rollout tensor (the ref-KL penalty compares current vs reference, never the old policy), removing a latent-sized unused tensor from rollout storage/H2D.
- Request std_dev_t/dt only for the x-based mask; keep return_kwargs minimal.
- Drop the redundant per-timestep negative_* .to(device) loop (sample.to(device) already moves negative embeds into batch; the policy forward relies on this).
- Micro: dedup keep_mask.mean(), drop unused enumerate index and redundant float cast, trim low-information comments, note the asymmetric x-based mask (exact Gaussian KL) vs ref KL (GRPO unscaled) convention.

Co-authored-by: Cursor <cursoragent@cursor.com>
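For context on the "exact Gaussian KL" mentioned for the x-based mask: the KL divergence between two isotropic Gaussians that share a standard deviation σ is ‖μ_p − μ_q‖² / (2σ²). The sketch below computes that quantity on plain lists. It assumes the per-step transition is Gaussian with a shared standard deviation (presumably derived from std_dev_t and dt, which is why those are requested for this mask); how the trainer actually builds σ, and whether it sums or averages over latent dimensions, is not stated in the PR. The ref-KL convention is not modeled here.

```python
from typing import Sequence


def gaussian_kl_same_std(
    mean_p: Sequence[float], mean_q: Sequence[float], std: float
) -> float:
    """KL(N(mean_p, std^2 I) || N(mean_q, std^2 I)), summed over dimensions."""
    if std <= 0.0:
        raise ValueError("std must be positive")
    # With equal covariances the log-determinant and trace terms cancel,
    # leaving only the scaled squared distance between the means.
    squared_distance = sum((p - q) ** 2 for p, q in zip(mean_p, mean_q))
    return squared_distance / (2.0 * std**2)
```

Because the value scales with 1/σ², a fixed mean shift counts as a larger divergence at low-noise steps, so a single kl_mask_threshold does not correspond to a fixed distance between means.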

Jayce-Ping marked this pull request as draft June 11, 2026 02:21

Jayce-Ping force-pushed the feat/dppo-algorithm branch 3 times, most recently from a34f32f to 2a60447 Compare June 11, 2026 15:06

Jayce-Ping and others added 6 commits June 13, 2026 11:20

[examples] chore: set dppo sd3_5 scheduler dynamics_type to Flow-SDE

6d089f0

Co-authored-by: Cursor <cursoragent@cursor.com>

update sde

aedad2a

Jayce-Ping force-pushed the feat/dppo-algorithm branch from 6f54142 to aedad2a Compare June 13, 2026 03:20

Jayce-Ping changed the base branch from feat/geneval2-hpsv2-rewards to main June 13, 2026 03:20

Jayce-Ping marked this pull request as ready for review June 13, 2026 21:08

Jayce-Ping and others added 2 commits June 14, 2026 05:20

Jayce-Ping merged commit 7912095 into main Jun 14, 2026

Jayce-Ping merged 8 commits into main from feat/dppo-algorithm
