Skip to content

feat: GenEval reward support + unified trainer sampling pipeline#165

Merged
Jayce-Ping merged 13 commits into
mainfrom
feat/geneval-support
May 26, 2026
Merged

feat: GenEval reward support + unified trainer sampling pipeline#165
Jayce-Ping merged 13 commits into
mainfrom
feat/geneval-support

Conversation

@Jayce-Ping

Copy link
Copy Markdown
Collaborator

Summary

  • GenEval reward model: Mask2Former object detection + CLIP color classification for evaluating compositional T2I generation (object count, color, spatial position). Heavy deps (mmcv/mmdetection) are fully optional with lazy import guards.
  • Unified sampling pipeline: Extract sample_batch() / generate_samples() / evaluate() into BaseTrainer, eliminating ~460 lines of duplicated code across all 6 trainers.
  • Metadata passthrough: _inject_batch_metadata() automatically passes dataset-level metadata (e.g. geneval_metadata) from dataloader batches into samples' extra_kwargs, enabling metadata-dependent rewards without modifying adapters.
  • Full GenEval dataset: 33,199 train / 553 test prompts (deduplicated from DiffusionNFT's 50K), stored as JSON strings to avoid Arrow heterogeneous struct serialization issues.

Key Design Decisions

Decision Rationale
sample_batch() is public Subclasses can override for custom per-batch pipeline
evaluate() removed from @abstractmethod Concrete default in BaseTrainer; subclasses override only if needed
Metadata as JSON string in JSONL Arrow cannot serialize heterogeneous nested structs (learned from DiffusionNFT experience)
GenEval deps fully optional mmcv 1.x requires Python 3.10 + CUDA compilation; main framework stays Python 3.12 compatible
reward_type: "score" for training Continuous 0-1 reward for policy gradient (aligned with DiffusionNFT); strict accuracy logged in extra_info

Files Changed

Area Files
Core architecture trainers/abc.py (+200), trainers/{grpo,nft,dpo,awm,crd,dgpo}.py (-461 total)
GenEval reward rewards/geneval.py, rewards/registry.py
Dataset dataset/geneval/{train,test}.jsonl
Config/Scripts examples/grpo/lora/sd3_5/geneval.yaml, scripts/install_geneval_deps.sh, scripts/convert_geneval_dataset.py, pyproject.toml

Test plan

  • Run existing GRPO training (e.g. examples/grpo/lora/sd3_5/default.yaml) — verify no regression from sampling pipeline refactor
  • Run NFT training — verify generate_samples() delegation works
  • Verify GenEval reward loads correctly with mmcv/mmdetection installed (Python 3.10 env)
  • Verify graceful error message when GenEval deps are missing
  • Run examples/grpo/lora/sd3_5/geneval.yaml end-to-end with GenEval dataset

🤖 Generated with Claude Code

Jayce-Ping and others added 9 commits May 25, 2026 16:08
- Add GenEval reward model (Mask2Former object detection + CLIP color
  classification) with lazy optional dependencies (mmcv/mmdetection)
- Extract `sample_batch()`, `generate_samples()`, and `evaluate()` into
  BaseTrainer, eliminating ~460 lines of duplicated sampling/eval loops
  across all 6 trainers (GRPO, NFT, DPO, AWM, CRD, DGPO)
- Add `_inject_batch_metadata()` to automatically pass dataset metadata
  (e.g. geneval_metadata) from dataloader batches into sample.extra_kwargs
- Include full deduplicated GenEval dataset (33,199 train / 553 test)
  with metadata stored as JSON strings to avoid Arrow serialization issues
- Add install script and example training config for SD3.5 + GenEval

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Upgrade from mmdet 2.x to mmdet 3.x (DetDataSample API, auto config
  resolution via .mim/configs/, checkpoint auto-download from model zoo)
- Replace clip_benchmark dependency with direct open_clip encode_image/
  encode_text — eliminates one heavy optional dependency
- Add torch.amp.autocast("cuda", enabled=False) guard to prevent bf16
  crashes in mmdet's ms_deform_attn CUDA kernel during eval
- Flatten dataset format: separate include/exclude/tag fields instead of
  single geneval_metadata blob (matches reward model's required_fields)
- Add dataset/geneval/object_names.txt (80 COCO classes)
- Simplify install script: mmdet 3.x installs via pip (no CUDA compile)
- Now supports Python 3.10-3.12 (no longer limited to 3.10)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add "Dataset Metadata Convention" section to guidance/rewards.md and
update docstrings in abc.py, dataset.py, my_reward.py to document the
implicit contract: complex metadata stored as JSON strings in JSONL,
reward models responsible for json.loads() parsing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
openmim relies on pkg_resources which removed pkgutil.ImpImporter in
Python 3.12, causing AttributeError. Replace mim-based install with
direct pip/uv using OpenMMLab prebuilt wheel index URL (auto-detected
from torch+CUDA version). Also remove openmim from pyproject.toml
optional deps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mmcv's setup.py depends on pkg_resources (removed in Python 3.12+) and
prebuilt wheels don't exist for cutting-edge torch+CUDA combos. Simplify
the script to: warn if not Python 3.10, then source-compile mmcv/mmdet
with --no-build-isolation. Remove openmim from pyproject.toml (broken
on 3.12).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mmdet 3.3.0 requires mmcv>=2.0.0rc4,<2.2.0. The main branch of mmcv
is already 2.2.0 which exceeds the upper bound. Pin to v2.1.0 tag.
Also pin mmdetection to v3.3.0 tag for reproducibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update geneval.py docstring to recommend install_geneval_deps.sh
- Add GenEval to built-in reward models table in README and guidance
- Add install note for GenEval dependencies (Python 3.10 recommended)
- List GenEval in AGENTS.md reward models

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Jayce-Ping Jayce-Ping marked this pull request as ready for review May 25, 2026 14:35
Copilot AI review requested due to automatic review settings May 25, 2026 14:35

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a new GenEval reward model (Mask2Former detection + CLIP color classification) and refactors the trainer stack to use a unified sampling / evaluation pipeline in BaseTrainer, reducing duplicated code across trainers and enabling dataset metadata passthrough into reward-model kwargs.

Changes:

  • Introduces GenEvalRewardModel and registers it in the reward registry, plus docs/config/scripts for installing optional deps and using the dataset.
  • Extracts sample_batch(), generate_samples(), and a default evaluate() into BaseTrainer, and updates trainers to delegate sampling to the shared implementation.
  • Adds a dataset metadata convention (metadata column → sample.extra_kwargs → reward __call__ kwargs) and ships GenEval JSONL + class list.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/flow_factory/trainers/abc.py Adds unified sampling (sample_batch/generate_samples) and default evaluation loop with metadata injection.
src/flow_factory/trainers/grpo.py Delegates sampling to generate_samples() (keeps GRPO trajectory/logprob settings).
src/flow_factory/trainers/nft.py Delegates sampling to generate_samples() (final-latent-only rollouts).
src/flow_factory/trainers/dpo.py Delegates sampling to generate_samples() (final-latent-only rollouts).
src/flow_factory/trainers/dgpo.py Delegates sampling to generate_samples() (final-latent-only rollouts).
src/flow_factory/trainers/crd.py Delegates sampling to generate_samples() (final-latent-only rollouts).
src/flow_factory/trainers/awm.py Delegates sampling to generate_samples() (final-latent-only rollouts).
src/flow_factory/rewards/geneval.py Implements GenEval reward model with optional mmdet/mmcv + open_clip deps.
src/flow_factory/rewards/registry.py Registers the geneval reward model name → implementation path.
src/flow_factory/rewards/abc.py Clarifies that **kwargs comes from sample.extra_kwargs / dataset metadata.
src/flow_factory/rewards/my_reward.py Documents how to receive metadata fields in custom reward signatures.
src/flow_factory/data_utils/dataset.py Packs non-preprocess JSONL fields into a metadata column for passthrough.
guidance/rewards.md Documents the dataset metadata convention and adds GenEval to the reward list.
README.md Adds GenEval to the reward table and notes the optional dependency install path.
examples/grpo/lora/sd3_5/geneval.yaml Provides an end-to-end example config for training/eval with GenEval.
scripts/install_geneval_deps.sh Adds a helper script to install/compile mmcv + install mmdet/open_clip.
scripts/convert_geneval_dataset.py Adds a conversion/dedup script for GenEval JSONL (store metadata as JSON strings).
pyproject.toml Adds a geneval optional-dependency extra.
dataset/geneval/test.jsonl Adds the GenEval test split in JSONL (include/exclude stored as JSON strings).
dataset/geneval/object_names.txt Adds the COCO-ish class-name list used for label mapping.
AGENTS.md Updates the documented reward list to include GenEval.
Comments suppressed due to low confidence (3)

src/flow_factory/rewards/geneval.py:219

  • GenEvalRewardModel hard-codes CUDA device selection via cuda:{accelerator.local_process_index}. This ignores the reward config’s device (and BaseRewardModel’s self.device) and will break if the reward is configured for CPU or non-CUDA backends, and can also diverge from Accelerate’s chosen device in more complex setups. Prefer initializing both mmdet and open_clip using self.device (or self.device.type/index) rather than constructing a CUDA string manually.
        device_str = f"cuda:{self.accelerator.local_process_index}"
        self._detector = init_detector(
            detector_config,
            detector_checkpoint,
            device=device_str,
        )
        logger.info(f"Mask2Former loaded on {device_str}.")

src/flow_factory/rewards/geneval.py:185

  • The ImportError guidance for missing GenEval deps is incomplete: Mask2Former inference requires mmcv ops in addition to mmdet/mmengine, and this repo provides a dedicated installer script / optional extra. Consider updating the error message to point users to pip install flow-factory[geneval] and/or bash scripts/install_geneval_deps.sh, so failures are actionable.
        try:
            from mmdet.apis import init_detector, inference_detector
        except ImportError:
            raise ImportError(
                "mmdet is required for GenEval reward. "
                "Install with: pip install mmdet mmengine"
            )

src/flow_factory/rewards/geneval.py:498

  • torch.amp.autocast("cuda", enabled=False) is always entered, even if the reward model is configured to run on CPU. If you switch GenEval to use self.device, consider also keying this autocast guard off self.device.type (or using a no-op context on non-CUDA) to avoid backend mismatches.
        # Disable AMP: mmdet's ms_deform_attn CUDA kernel does not support bf16,
        # and CLIP weights are loaded as fp32. Without this guard, bf16 autocast
        # from the trainer's eval loop would cause runtime errors.
        with torch.amp.autocast("cuda", enabled=False):
            for i in range(batch_size):

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/flow_factory/rewards/geneval.py Outdated
self._counting_threshold = getattr(
config, "counting_threshold", DEFAULT_COUNTING_THRESHOLD
)
self._nms_threshold = getattr(config, "nms_threshold", DEFAULT_NMS_THRESHOLD)
Comment thread pyproject.toml Outdated
Comment on lines +77 to +79
"mmcv",
"mmengine",
"mmdet>=3.3.0",
Jayce-Ping and others added 2 commits May 25, 2026 22:44
- Remove unused nms_threshold parameter (read but never applied)
- Pin mmcv>=2.0.0,<2.2.0 and mmdet>=3.3.0,<4.0.0 in pyproject.toml
  for reproducible `pip install .[geneval]`
- Update ImportError messages to point to install_geneval_deps.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

Comment thread src/flow_factory/trainers/abc.py Outdated
Comment on lines +506 to +512
self.adapter.rollout()
if reward_buffer is not None:
reward_buffer.clear()

if trajectory_indices is None:
trajectory_indices = [-1]

Comment on lines +145 to +148
if self.device.type != "cuda":
logger.warning(
"GenEval is configured on CPU but requires CUDA for Mask2Former inference. "
"Set `device: cuda` in your reward config."
# Disable AMP: mmdet's ms_deform_attn CUDA kernel does not support bf16,
# and CLIP weights are loaded as fp32. Without this guard, bf16 autocast
# from the trainer's eval loop would cause runtime errors.
with torch.amp.autocast("cuda", enabled=False):
Jayce-Ping and others added 2 commits May 25, 2026 23:08
- generate_samples: preserve None semantics for trajectory_indices
  (None = no recording, used in evaluate; no longer coerced to [-1])
- geneval: hard-error on non-CUDA device instead of warning
  (Mask2Former CUDA kernels will fail anyway, fail fast with clear msg)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Jayce-Ping Jayce-Ping merged commit 8a97410 into main May 26, 2026
87003697 pushed a commit to 87003697/Flow-Factory that referenced this pull request Jun 2, 2026
Brings trellis2 branch up-to-date with origin/main, integrating:
- PR X-GenGroup#170: DiffusionOPD on-policy distillation trainer
- PR X-GenGroup#168: Multi-dataset training with per-source reward routing
- PR X-GenGroup#165: GenEval reward + unified trainer sampling pipeline
- PR X-GenGroup#163: hparams/training_args.py → package split
- PR X-GenGroup#162: Reconcile config with runtime distributed state
- PRs X-GenGroup#121,X-GenGroup#146-X-GenGroup#161: CRD algorithm, Docker CUDA 12.9, HF resume, etc.

Conflict resolutions:
- trainers/registry.py: merged both sides (trellis2_grpo/nft + crd/opd)
- rewards/registry.py: merged both sides (trellis2 rewards + geneval)
- hparams/training_args.py: deleted (accept main's package split), added
  trellis2_grpo/nft entries to _registry.py
- trainers/grpo.py: removed duplicate evaluate() (unified in BaseTrainer)
- trainers/nft.py: adopted generate_samples(), removed duplicate evaluate()
- trainers/abc.py: added _extra_eval_inference_kwargs() hook to BaseTrainer
  evaluate() so Trellis2TrainerMixin can inject stages/render_kwargs
- samples/samples.py: merged source/source_id fields with trellis2 additions
- rewards/reward_processor.py: merged source-aware gating with trellis2's
  _store_reward_extra_info
- data_utils/loader.py: merged multi-source pipeline with image_3d dataset
- data_utils/sampler_loader.py: merged refactored parameters
- hparams/args.py: merged multi-source alignment with trellis2's group-aligned
- guidance/rewards.md: merged both (PickScore_TextImage_Sum + GenEval)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants