[trainer,hparams,examples,docs] feat: add Flow-DPPO algorithm #181
Merged
a34f32f to 2a60447
Flow-DPPO is a strict Flow-GRPO variant: it keeps GRPO's group advantages and the optional KL-vs-reference penalty, but replaces the PPO ratio-clip with a KL trust-region mask. A sample's gradient is zeroed when its per-step KL(current || rollout-old) >= kl_mask_threshold and the update pushes the wrong way (ratio>1 & adv>0, or ratio<1 & adv<0).

- DPPOTrainingArguments(GRPOTrainingArguments): kl_mask_threshold + kl_guidance_scale; get_preprocess_guidance_scale() override so negative prompts are encoded when the KL-ref CFG > 1.0 (DGPO kl_cfg pattern).
- DPPOTrainer(GRPOTrainer): sample() stores rollout-old noise_pred/next_latents_mean; optimize() applies the kl_adv mask and runs the KL-vs-ref forward at kl_guidance_scale. Registered as 'dppo'.
- 4 example configs (flux2_klein_base / sd3_5 x single / multi) aligned to the reference dppo configs on scheduler params, kl_guidance_scale, kl_mask_threshold.
- Docs: constraint #11 + base-class rule add DPPOTrainer->GRPOTrainer as a sanctioned exception; architecture/AGENTS/algorithms/README/examples updated.

Co-authored-by: Cursor <cursoragent@cursor.com>
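The masking rule described in this commit can be illustrated with a small pure-Python sketch. It is not the trainer's code: the function name `dppo_keep_mask` and the list-based inputs are hypothetical, and the real implementation presumably operates on batched tensors. The condition itself (KL at or above `kl_mask_threshold`, combined with `ratio>1 & adv>0` or `ratio<1 & adv<0`) is taken directly from the commit message.

```python
def dppo_keep_mask(ratio, adv, kl, kl_mask_threshold=1e-6):
    """Return 1.0 where a sample's gradient is kept and 0.0 where it is zeroed.

    ratio: per-sample probability ratios (current / rollout-old)
    adv:   per-sample group advantages
    kl:    per-sample, per-step KL(current || rollout-old)
    """
    keep = []
    for r, a, k in zip(ratio, adv, kl):
        # The sample has left the trust region around the rollout policy.
        outside_region = k >= kl_mask_threshold
        # The direction test quoted in the commit message.
        wrong_way = (r > 1.0 and a > 0.0) or (r < 1.0 and a < 0.0)
        keep.append(0.0 if (outside_region and wrong_way) else 1.0)
    return keep


mask = dppo_keep_mask(
    ratio=[1.2, 0.8, 1.2, 0.8, 1.2],
    adv=[1.0, -1.0, -1.0, 1.0, 1.0],
    kl=[1e-3, 1e-3, 1e-3, 1e-3, 1e-9],
)
print(mask)  # [0.0, 0.0, 1.0, 1.0, 1.0]
```

The last sample satisfies the direction test but stays inside the trust region, so it is kept. Unlike a ratio clip, the mask is asymmetric: samples outside the region are still kept when the direction test fails.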
…uple KL spaces

- DPPOTrainingArguments no longer inherits GRPOTrainingArguments (drops the unused PPO clip_range); the field set is minimal (advantage/adv_clip/ref + KL knobs).
- Decouple the two KL computations: kl_type controls the KL-vs-reference penalty space; new kl_mask_type controls the DPPO trust-region mask space. Trainer requests both forward outputs when the spaces differ.
- Rename example configs single/multi -> geneval2_single/geneval2_multi (the training set is GenEval2); add kl_mask_type, drop clip_range from configs.
- README: fix DPPO paper name to Flow-DPPO.

Co-authored-by: Cursor <cursoragent@cursor.com>
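One way to picture the decoupling is a helper that decides which forward outputs to request from the two settings. This is a sketch under assumptions: the space names `"x"` and `"v"`, the mapping from space to output, and the function `required_outputs` are all hypothetical. Only `kl_type`, `kl_mask_type`, `noise_pred`, and `next_latents_mean` appear in the commits.

```python
# Hypothetical mapping from a KL "space" to the forward output it needs.
OUTPUT_FOR_SPACE = {
    "x": "next_latents_mean",  # assumed: latent-space (x-based) KL
    "v": "noise_pred",         # assumed: prediction-space KL
}


def required_outputs(kl_type, kl_mask_type):
    """Return the sorted, de-duplicated forward outputs needed for both KLs."""
    return sorted({OUTPUT_FOR_SPACE[kl_type], OUTPUT_FOR_SPACE[kl_mask_type]})


print(required_outputs("x", "x"))  # ['next_latents_mean']
print(required_outputs("v", "x"))  # ['next_latents_mean', 'noise_pred']
```

When both settings name the same space, a single output suffices. Both outputs are requested only when the spaces differ, which matches the behaviour the commit describes.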
Co-authored-by: Cursor <cursoragent@cursor.com>
Value-level alignment with Jayce-Ping/Mask Flow-Factory configs
(flux2-klein-base_config/dppo_{single,multi}.yaml, sd3-5_config ditto):
- log.verbose: false in all four configs (local default is true)
- sd3-5: config_file -> config/accelerate_configs/multi_gpu.yaml
(flux2 keeps deepspeed_zero2, matching Mask)
- sd3-5 multi: preprocessing_batch_size 8 -> 32
Kept intentionally divergent from Mask: dataset/GenEval2 paths (local
flat train/test.jsonl corresponds to Mask's synthetic split), unified
eval source name `geneval2`, project name "Flow-Factory", and keys the
refactored DPPO trainer no longer has (mask_type, clip_range, etc.).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
In Mask's old schema, a test_set without eval_reward_names runs ALL eval rewards, so the geneval2 test set was scored by all four rewards while pickscore was explicitly restricted to the three general ones. The schema translation pinned pick_score / clip_text_align / hpsv2 to applicable_datasets: [pickscore], silently dropping them from geneval2 eval.

Drop the applicable_datasets restriction on the three general eval rewards (omitted = applies to every eval dataset, mirroring the old default-all rule); keep [geneval2] on the geneval2 Soft-TIFA reward since pickscore prompts carry no vqa_list. Resulting routing matches Mask: geneval2 -> 4 rewards, pickscore -> 3.

Audited the remaining config+code semantic pairs against the Mask fork (training reward routing, weight-0 logging rewards, kl_adv mask vs DPPO trust-region mask incl. the inert clip_range, adv clamp, KL-vs-ref penalty, eval split defaults) - no other mismatches.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
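The routing rule in this commit (an omitted `applicable_datasets` means the reward applies to every eval dataset) can be sketched as follows. The dict layout, the function `rewards_for_dataset`, and the reward key `geneval2_soft_tifa` are illustrative assumptions rather than the project's config schema. The other three reward names and the expected counts come from the commit message.

```python
def rewards_for_dataset(eval_rewards, dataset):
    """Select the eval rewards that apply to `dataset`.

    A reward with no `applicable_datasets` entry applies to every dataset.
    """
    selected = []
    for name, cfg in eval_rewards.items():
        allowed = cfg.get("applicable_datasets")
        if allowed is None or dataset in allowed:
            selected.append(name)
    return selected


# Hypothetical in-memory view of the fixed eval reward config.
eval_rewards = {
    "pick_score": {},
    "clip_text_align": {},
    "hpsv2": {},
    "geneval2_soft_tifa": {"applicable_datasets": ["geneval2"]},
}

print(len(rewards_for_dataset(eval_rewards, "geneval2")))   # 4
print(len(rewards_for_dataset(eval_rewards, "pickscore")))  # 3
```

Pinning the three general rewards to `["pickscore"]`, as the earlier schema translation did, would leave only one reward on the geneval2 set, which is the regression this commit fixes.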
6f54142 to aedad2a
Link the README DPPO row to arXiv:2606.11025, and add reference [15] in guidance/algorithms.md (paper + code link) with a brief intro framing DPPO as a divergence proximal constraint (exact per-step Gaussian KL, asymmetric mask). Co-authored-by: Cursor <cursoragent@cursor.com>
- sample()/optimize() store and gather only the kl_mask_type-space rollout tensor (the ref-KL penalty compares current vs reference, never the old policy), removing a latent-sized unused tensor from rollout storage/H2D.
- Request std_dev_t/dt only for the x-based mask; keep return_kwargs minimal.
- Drop the redundant per-timestep negative_* .to(device) loop (sample.to(device) already moves negative embeds into batch; the policy forward relies on this).
- Micro: dedup keep_mask.mean(), drop unused enumerate index and redundant float cast, trim low-information comments, note the asymmetric x-based mask (exact Gaussian KL) vs ref KL (GRPO unscaled) convention.

Co-authored-by: Cursor <cursoragent@cursor.com>
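For readers unfamiliar with the "exact Gaussian KL" mentioned for the x-based mask, here is a minimal sketch of the closed form for two isotropic Gaussians that share a standard deviation: KL = ||mean_cur - mean_old||^2 / (2 * std^2). The assumption that the current and rollout-old step distributions share the same std, the sum over dimensions, and the function name `gaussian_kl_shared_std` are all hypothetical; the trainer's actual reduction and scaling are not shown in this PR page.

```python
def gaussian_kl_shared_std(mean_cur, mean_old, std):
    """KL( N(mean_cur, std^2 I) || N(mean_old, std^2 I) ), summed over dimensions."""
    if std <= 0:
        raise ValueError("std must be positive")
    # With equal covariances the log-det and trace terms cancel,
    # leaving only the squared distance between the means.
    squared_distance = sum((c - o) ** 2 for c, o in zip(mean_cur, mean_old))
    return squared_distance / (2.0 * std ** 2)


print(gaussian_kl_shared_std([1.0, 2.0], [0.0, 0.0], std=1.0))  # 2.5
print(gaussian_kl_shared_std([0.5, 0.5], [0.5, 0.5], std=0.1))  # 0.0
```

Because the KL divides by `std ** 2`, the same mean shift counts for more at low-noise steps, which is one reason a per-step threshold behaves differently from a fixed ratio clip.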
Summary
Flow-DPPO: a strict Flow-GRPO variant that replaces the PPO ratio-clip with a KL trust-region mask. A sample's gradient is zeroed when its per-step KL(current ‖ rollout-old) >= `kl_mask_threshold` and the update would push the wrong way (`ratio>1 & adv>0`, or `ratio<1 & adv<0`). Group advantages and the optional KL-vs-reference penalty are inherited from GRPO.

- `DPPOTrainingArguments(GRPOTrainingArguments)`: adds `kl_mask_threshold` and `kl_guidance_scale`; overrides `get_preprocess_guidance_scale()` so negative prompts are encoded when the KL-ref CFG > 1.0 (mirrors DGPO `kl_cfg`).
- `DPPOTrainer(GRPOTrainer)`: `sample()` stores rollout-old `noise_pred`/`next_latents_mean`; `optimize()` applies the kl_adv mask and runs the KL-vs-ref forward at `kl_guidance_scale`. Registered as `dppo` (a second sanctioned `→ GRPOTrainer` exception, like GRPO-Guard).
- `examples/dppo/lora/{flux2_klein_base,sd3_5}/{single,multi}.yaml`, aligned to the reference dppo configs on the three requested groups: scheduler params, `kl_guidance_scale` (flux2-klein 4.0 / sd3-5 4.5), `kl_mask_threshold` (1e-6). `single` = geneval2 drives advantage (others weight 0); `multi` = PickScore+CLIP+GenEval2+HPSv2.
- Docs: constraint #11 + base-class rule, architecture/AGENTS/algorithms/README/examples.

Test plan

- `black`/`isort` clean on new modules; byte-compile OK
- `get_trainer_class('dppo')` / `get_training_args_class('dppo')` + end-to-end sampling→optimize in a full training env (torch/transformers not on dev box)

Made with Cursor