feat: GenEval reward support + unified trainer sampling pipeline by Jayce-Ping · Pull Request #165 · X-GenGroup/Flow-Factory

Jayce-Ping · 2026-05-25T08:08:54Z

Summary

- GenEval reward model: Mask2Former object detection + CLIP color classification for evaluating compositional T2I generation (object count, color, spatial position). Heavy deps (mmcv/mmdetection) are fully optional with lazy import guards.
- Unified sampling pipeline: Extract sample_batch() / generate_samples() / evaluate() into BaseTrainer, eliminating ~460 lines of duplicated code across all 6 trainers.
- Metadata passthrough: _inject_batch_metadata() automatically passes dataset-level metadata (e.g. geneval_metadata) from dataloader batches into samples' extra_kwargs, enabling metadata-dependent rewards without modifying adapters.
- Full GenEval dataset: 33,199 train / 553 test prompts (deduplicated from DiffusionNFT's 50K), stored as JSON strings to avoid Arrow heterogeneous struct serialization issues.
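The metadata passthrough in the summary can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code from `trainers/abc.py`: the `Sample` dataclass, the free function `inject_batch_metadata`, and the batch layout (a `metadata` key holding one dict per prompt) are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Sample:
    """Hypothetical stand-in for the framework's sample object."""
    prompt: str
    extra_kwargs: dict[str, Any] = field(default_factory=dict)


def inject_batch_metadata(batch: dict[str, Any], samples: list[Sample]) -> None:
    """Copy per-item dataset metadata into each sample's extra_kwargs.

    Assumes batch["metadata"] is a list with one dict per sample, in order.
    """
    metadata = batch.get("metadata")
    if metadata is None:
        return  # the dataset carries no metadata, so there is nothing to inject
    if len(metadata) != len(samples):
        raise ValueError("metadata and samples must have the same length")
    for sample, meta in zip(samples, metadata):
        for key, value in meta.items():
            # Keep any value the adapter already set on the sample.
            sample.extra_kwargs.setdefault(key, value)


batch = {
    "prompt": ["a red apple", "two dogs"],
    "metadata": [{"tag": "colors"}, {"tag": "counting"}],
}
samples = [Sample(p) for p in batch["prompt"]]
inject_batch_metadata(batch, samples)
```

Because the injection happens on the trainer side, an adapter that builds `Sample` objects does not need to know which metadata fields a given reward expects.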

Key Design Decisions

| Decision | Rationale |
| --- | --- |
| `sample_batch()` is public | Subclasses can override for custom per-batch pipeline |
| `evaluate()` removed from `@abstractmethod` | Concrete default in BaseTrainer; subclasses override only if needed |
| Metadata as JSON string in JSONL | Arrow cannot serialize heterogeneous nested structs (learned from DiffusionNFT experience) |
| GenEval deps fully optional | mmcv 1.x requires Python 3.10 + CUDA compilation; main framework stays Python 3.12 compatible |
| `reward_type: "score"` for training | Continuous 0-1 reward for policy gradient (aligned with DiffusionNFT); strict accuracy logged in extra_info |
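The first two decisions amount to a template-method layout: the base class owns the loop, and subclasses override only the per-batch hook. The sketch below illustrates that shape only. The method names `sample_batch`, `generate_samples`, and `evaluate` come from this PR; the signatures, the string "samples", and the abstract `optimize` method are hypothetical.

```python
from abc import ABC, abstractmethod


class BaseTrainer(ABC):
    """Hypothetical skeleton; method bodies are illustrative only."""

    def sample_batch(self, batch: list[str]) -> list[str]:
        # Public hook: subclasses may override for a custom per-batch pipeline.
        return [f"sample({prompt})" for prompt in batch]

    def generate_samples(self, batches: list[list[str]]) -> list[str]:
        # Shared loop, written once instead of once per trainer.
        samples: list[str] = []
        for batch in batches:
            samples.extend(self.sample_batch(batch))
        return samples

    def evaluate(self, batches: list[list[str]]) -> dict[str, int]:
        # Concrete default: subclasses override only if they need to.
        return {"num_samples": len(self.generate_samples(batches))}

    @abstractmethod
    def optimize(self, samples: list[str]) -> None:
        """Algorithm-specific update step (hypothetical name)."""


class ToyTrainer(BaseTrainer):
    def optimize(self, samples: list[str]) -> None:
        pass  # a real trainer would compute a loss here


class CustomBatchTrainer(ToyTrainer):
    def sample_batch(self, batch: list[str]) -> list[str]:
        # Override only the per-batch step; the shared loop is inherited.
        return [f"custom({prompt})" for prompt in batch]
```

With this layout a trainer that needs nothing special inherits sampling and evaluation unchanged, which is where the removed duplication comes from.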

Files Changed

| Area | Files |
| --- | --- |
| Core architecture | `trainers/abc.py` (+200), `trainers/{grpo,nft,dpo,awm,crd,dgpo}.py` (-461 total) |
| GenEval reward | `rewards/geneval.py`, `rewards/registry.py` |
| Dataset | `dataset/geneval/{train,test}.jsonl` |
| Config/Scripts | `examples/grpo/lora/sd3_5/geneval.yaml`, `scripts/install_geneval_deps.sh`, `scripts/convert_geneval_dataset.py`, `pyproject.toml` |

Test plan

- Run existing GRPO training (e.g. examples/grpo/lora/sd3_5/default.yaml) and verify no regression from the sampling pipeline refactor
- Run NFT training and verify that generate_samples() delegation works
- Verify GenEval reward loads correctly with mmcv/mmdetection installed (Python 3.10 env)
- Verify graceful error message when GenEval deps are missing
- Run examples/grpo/lora/sd3_5/geneval.yaml end-to-end with GenEval dataset

🤖 Generated with Claude Code

- Add GenEval reward model (Mask2Former object detection + CLIP color classification) with lazy optional dependencies (mmcv/mmdetection)
- Extract `sample_batch()`, `generate_samples()`, and `evaluate()` into BaseTrainer, eliminating ~460 lines of duplicated sampling/eval loops across all 6 trainers (GRPO, NFT, DPO, AWM, CRD, DGPO)
- Add `_inject_batch_metadata()` to automatically pass dataset metadata (e.g. geneval_metadata) from dataloader batches into sample.extra_kwargs
- Include full deduplicated GenEval dataset (33,199 train / 553 test) with metadata stored as JSON strings to avoid Arrow serialization issues
- Add install script and example training config for SD3.5 + GenEval

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Upgrade from mmdet 2.x to mmdet 3.x (DetDataSample API, auto config resolution via .mim/configs/, checkpoint auto-download from model zoo)
- Replace clip_benchmark dependency with direct open_clip encode_image/encode_text, which eliminates one heavy optional dependency
- Add torch.amp.autocast("cuda", enabled=False) guard to prevent bf16 crashes in mmdet's ms_deform_attn CUDA kernel during eval
- Flatten dataset format: separate include/exclude/tag fields instead of single geneval_metadata blob (matches reward model's required_fields)
- Add dataset/geneval/object_names.txt (80 COCO classes)
- Simplify install script: mmdet 3.x installs via pip (no CUDA compile)
- Now supports Python 3.10-3.12 (no longer limited to 3.10)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add "Dataset Metadata Convention" section to guidance/rewards.md and update docstrings in abc.py, dataset.py, my_reward.py to document the implicit contract: complex metadata stored as JSON strings in JSONL, reward models responsible for json.loads() parsing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
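A small sketch of that contract, assuming the flattened `include`/`tag` field names from the GenEval dataset: the writer stores nested metadata as a JSON string, and the reward parses it. The `ToyReward` class, its scoring rule, and the example row contents are hypothetical and exist only to show where `json.loads()` sits.

```python
import json


def to_jsonl_row(prompt: str, include: list[dict], tag: str) -> str:
    """Serialize one dataset row; nested metadata is stored as a JSON string."""
    row = {"prompt": prompt, "tag": tag, "include": json.dumps(include)}
    return json.dumps(row)


class ToyReward:
    """Hypothetical reward; only the json.loads() step mirrors the convention."""

    def __call__(self, prompt: list[str], include: list[str], **kwargs) -> list[float]:
        scores = []
        for text, raw in zip(prompt, include):
            required = json.loads(raw)  # the reward, not the loader, parses
            # Toy rule: fraction of required class names mentioned in the prompt.
            hits = sum(item["class"] in text for item in required)
            scores.append(hits / max(len(required), 1))
        return scores


line = to_jsonl_row(
    "a photo of a red apple",
    [{"class": "apple", "count": 1, "color": "red"}],
    "colors",
)
row = json.loads(line)
scores = ToyReward()(prompt=[row["prompt"]], include=[row["include"]], tag=[row["tag"]])
```

Keeping the column a plain string means every row has the same Arrow type, regardless of how the nested objects differ between prompts.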

openmim relies on pkg_resources, which references pkgutil.ImpImporter; Python 3.12 removed pkgutil.ImpImporter, causing an AttributeError. Replace mim-based install with direct pip/uv using the OpenMMLab prebuilt wheel index URL (auto-detected from torch+CUDA version). Also remove openmim from pyproject.toml optional deps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mmcv's setup.py depends on pkg_resources (removed in Python 3.12+) and prebuilt wheels don't exist for cutting-edge torch+CUDA combos. Simplify the script to: warn if not Python 3.10, then source-compile mmcv/mmdet with --no-build-isolation. Remove openmim from pyproject.toml (broken on 3.12).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mmdet 3.3.0 requires mmcv>=2.0.0rc4,<2.2.0. The main branch of mmcv is already 2.2.0 which exceeds the upper bound. Pin to v2.1.0 tag. Also pin mmdetection to v3.3.0 tag for reproducibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
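The version constraint can be checked with a few lines of standard-library Python. This is a simplified sketch: both function names are hypothetical, and it treats the lower bound as 2.0.0 because it does not parse pre-release tags such as `rc4`.

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Parse 'X.Y.Z' into a tuple of ints; pre-release suffixes are not handled."""
    return tuple(int(part) for part in version.split("."))


def mmcv_ok_for_mmdet_330(mmcv_version: str) -> bool:
    """True if mmcv_version lies in [2.0.0, 2.2.0), the range quoted above.

    Simplification: the stated lower bound is 2.0.0rc4; rc tags are ignored here.
    """
    return (2, 0, 0) <= parse_release(mmcv_version) < (2, 2, 0)
```

Under this check the pinned v2.1.0 tag passes, while 2.2.0 from the main branch fails on the exclusive upper bound.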

- Update geneval.py docstring to recommend install_geneval_deps.sh
- Add GenEval to built-in reward models table in README and guidance
- Add install note for GenEval dependencies (Python 3.10 recommended)
- List GenEval in AGENTS.md reward models

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR adds a new GenEval reward model (Mask2Former detection + CLIP color classification) and refactors the trainer stack to use a unified sampling / evaluation pipeline in BaseTrainer, reducing duplicated code across trainers and enabling dataset metadata passthrough into reward-model kwargs.

Changes:

- Introduces GenEvalRewardModel and registers it in the reward registry, plus docs/config/scripts for installing optional deps and using the dataset.
- Extracts sample_batch(), generate_samples(), and a default evaluate() into BaseTrainer, and updates trainers to delegate sampling to the shared implementation.
- Adds a dataset metadata convention (metadata column → sample.extra_kwargs → reward __call__ kwargs) and ships GenEval JSONL + class list.

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| src/flow_factory/trainers/abc.py | Adds unified sampling (`sample_batch`/`generate_samples`) and default evaluation loop with metadata injection. |
| src/flow_factory/trainers/grpo.py | Delegates sampling to `generate_samples()` (keeps GRPO trajectory/logprob settings). |
| src/flow_factory/trainers/nft.py | Delegates sampling to `generate_samples()` (final-latent-only rollouts). |
| src/flow_factory/trainers/dpo.py | Delegates sampling to `generate_samples()` (final-latent-only rollouts). |
| src/flow_factory/trainers/dgpo.py | Delegates sampling to `generate_samples()` (final-latent-only rollouts). |
| src/flow_factory/trainers/crd.py | Delegates sampling to `generate_samples()` (final-latent-only rollouts). |
| src/flow_factory/trainers/awm.py | Delegates sampling to `generate_samples()` (final-latent-only rollouts). |
| src/flow_factory/rewards/geneval.py | Implements GenEval reward model with optional mmdet/mmcv + open_clip deps. |
| src/flow_factory/rewards/registry.py | Registers the `geneval` reward model name → implementation path. |
| src/flow_factory/rewards/abc.py | Clarifies that `**kwargs` comes from `sample.extra_kwargs` / dataset metadata. |
| src/flow_factory/rewards/my_reward.py | Documents how to receive metadata fields in custom reward signatures. |
| src/flow_factory/data_utils/dataset.py | Packs non-preprocess JSONL fields into a `metadata` column for passthrough. |
| guidance/rewards.md | Documents the dataset metadata convention and adds GenEval to the reward list. |
| README.md | Adds GenEval to the reward table and notes the optional dependency install path. |
| examples/grpo/lora/sd3_5/geneval.yaml | Provides an end-to-end example config for training/eval with GenEval. |
| scripts/install_geneval_deps.sh | Adds a helper script to install/compile mmcv + install mmdet/open_clip. |
| scripts/convert_geneval_dataset.py | Adds a conversion/dedup script for GenEval JSONL (store metadata as JSON strings). |
| pyproject.toml | Adds a `geneval` optional-dependency extra. |
| dataset/geneval/test.jsonl | Adds the GenEval test split in JSONL (include/exclude stored as JSON strings). |
| dataset/geneval/object_names.txt | Adds the COCO-ish class-name list used for label mapping. |
| AGENTS.md | Updates the documented reward list to include GenEval. |

Comments suppressed due to low confidence (3)

src/flow_factory/rewards/geneval.py:219

GenEvalRewardModel hard-codes CUDA device selection via cuda:{accelerator.local_process_index}. This ignores the reward config’s device (and BaseRewardModel’s self.device) and will break if the reward is configured for CPU or non-CUDA backends, and can also diverge from Accelerate’s chosen device in more complex setups. Prefer initializing both mmdet and open_clip using self.device (or self.device.type/index) rather than constructing a CUDA string manually.

        device_str = f"cuda:{self.accelerator.local_process_index}"
        self._detector = init_detector(
            detector_config,
            detector_checkpoint,
            device=device_str,
        )
        logger.info(f"Mask2Former loaded on {device_str}.")

src/flow_factory/rewards/geneval.py:185

The ImportError guidance for missing GenEval deps is incomplete: Mask2Former inference requires mmcv ops in addition to mmdet/mmengine, and this repo provides a dedicated installer script / optional extra. Consider updating the error message to point users to pip install flow-factory[geneval] and/or bash scripts/install_geneval_deps.sh, so failures are actionable.

        try:
            from mmdet.apis import init_detector, inference_detector
        except ImportError:
            raise ImportError(
                "mmdet is required for GenEval reward. "
                "Install with: pip install mmdet mmengine"
            )
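One way to make the failure actionable, as the comment suggests, is a small lazy-import helper that attaches the install hint and chains the original exception. This is a sketch only: `require_optional` and `INSTALL_HINT` are hypothetical names, and the two install commands are the ones quoted in the review comment.

```python
import importlib
from types import ModuleType

INSTALL_HINT = (
    "Install the optional GenEval dependencies with "
    "`pip install flow-factory[geneval]` or `bash scripts/install_geneval_deps.sh`."
)


def require_optional(module_name: str, hint: str = INSTALL_HINT) -> ModuleType:
    """Import an optional dependency on first use, or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the real cause stays in the traceback.
        raise ImportError(
            f"{module_name} is required for the GenEval reward. {hint}"
        ) from exc
```

Calling such a helper inside the reward's loading code, rather than at module import time, keeps the main framework importable when the heavy dependencies are absent.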

src/flow_factory/rewards/geneval.py:498

torch.amp.autocast("cuda", enabled=False) is always entered, even if the reward model is configured to run on CPU. If you switch GenEval to use self.device, consider also keying this autocast guard off self.device.type (or using a no-op context on non-CUDA) to avoid backend mismatches.

        # Disable AMP: mmdet's ms_deform_attn CUDA kernel does not support bf16,
        # and CLIP weights are loaded as fp32. Without this guard, bf16 autocast
        # from the trainer's eval loop would cause runtime errors.
        with torch.amp.autocast("cuda", enabled=False):
            for i in range(batch_size):
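A sketch of the device-keyed guard this comment describes, assuming the reward exposes its device type as a string. The helper name `autocast_disabled` is hypothetical; `torch.amp.autocast("cuda", enabled=False)` is the call already used in the PR, and torch is imported only on the CUDA branch.

```python
import contextlib


def autocast_disabled(device_type: str):
    """Return a context manager that turns AMP off on CUDA and is a no-op elsewhere."""
    if device_type != "cuda":
        # Nothing to disable on non-CUDA backends in this sketch.
        return contextlib.nullcontext()
    import torch  # imported lazily so the non-CUDA path needs no torch

    return torch.amp.autocast("cuda", enabled=False)


# Usage on a CPU-configured reward: the body runs under a no-op context.
with autocast_disabled("cpu"):
    result = sum(range(4))
```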


+        self._counting_threshold = getattr(
+            config, "counting_threshold", DEFAULT_COUNTING_THRESHOLD
+        )
+        self._nms_threshold = getattr(config, "nms_threshold", DEFAULT_NMS_THRESHOLD)


+    "mmcv",
+    "mmengine",
+    "mmdet>=3.3.0",


- Remove unused nms_threshold parameter (read but never applied)
- Pin mmcv>=2.0.0,<2.2.0 and mmdet>=3.3.0,<4.0.0 in pyproject.toml for reproducible `pip install .[geneval]`
- Update ImportError messages to point to install_geneval_deps.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

+        self.adapter.rollout()
+        if reward_buffer is not None:
+            reward_buffer.clear()
+
+        if trajectory_indices is None:
+            trajectory_indices = [-1]
+


+        if self.device.type != "cuda":
+            logger.warning(
+                "GenEval is configured on CPU but requires CUDA for Mask2Former inference. "
+                "Set `device: cuda` in your reward config."


+        # Disable AMP: mmdet's ms_deform_attn CUDA kernel does not support bf16,
+        # and CLIP weights are loaded as fp32. Without this guard, bf16 autocast
+        # from the trainer's eval loop would cause runtime errors.
+        with torch.amp.autocast("cuda", enabled=False):


- generate_samples: preserve None semantics for trajectory_indices (None = no recording, used in evaluate; no longer coerced to [-1])
- geneval: hard-error on non-CUDA device instead of warning (Mask2Former CUDA kernels will fail anyway, fail fast with clear msg)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
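Both fixes in this commit can be illustrated with two small functions. This is a sketch under stated assumptions, not the repository code: `select_recorded_steps` and `require_cuda` are hypothetical names, and latents are represented as plain strings.

```python
from typing import Optional, Sequence


def select_recorded_steps(
    latents: Sequence[str], trajectory_indices: Optional[Sequence[int]]
) -> list[str]:
    """Return the latents to keep for one rollout.

    None means "record nothing" (the evaluate() case) and is not coerced to [-1].
    """
    if trajectory_indices is None:
        return []
    return [latents[i] for i in trajectory_indices]


def require_cuda(device_type: str) -> None:
    """Fail fast with a clear message instead of warning and crashing later."""
    if device_type != "cuda":
        raise RuntimeError(
            "GenEval requires CUDA for Mask2Former inference; "
            "set `device: cuda` in your reward config."
        )
```

A caller that wants only the final latent passes `[-1]` explicitly, so the two cases can no longer be confused.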

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Brings trellis2 branch up-to-date with origin/main, integrating:

- PR X-GenGroup#170: DiffusionOPD on-policy distillation trainer
- PR X-GenGroup#168: Multi-dataset training with per-source reward routing
- PR X-GenGroup#165: GenEval reward + unified trainer sampling pipeline
- PR X-GenGroup#163: hparams/training_args.py → package split
- PR X-GenGroup#162: Reconcile config with runtime distributed state
- PRs X-GenGroup#121,X-GenGroup#146-X-GenGroup#161: CRD algorithm, Docker CUDA 12.9, HF resume, etc.

Conflict resolutions:

- trainers/registry.py: merged both sides (trellis2_grpo/nft + crd/opd)
- rewards/registry.py: merged both sides (trellis2 rewards + geneval)
- hparams/training_args.py: deleted (accept main's package split), added trellis2_grpo/nft entries to _registry.py
- trainers/grpo.py: removed duplicate evaluate() (unified in BaseTrainer)
- trainers/nft.py: adopted generate_samples(), removed duplicate evaluate()
- trainers/abc.py: added _extra_eval_inference_kwargs() hook to BaseTrainer evaluate() so Trellis2TrainerMixin can inject stages/render_kwargs
- samples/samples.py: merged source/source_id fields with trellis2 additions
- rewards/reward_processor.py: merged source-aware gating with trellis2's _store_reward_extra_info
- data_utils/loader.py: merged multi-source pipeline with image_3d dataset
- data_utils/sampler_loader.py: merged refactored parameters
- hparams/args.py: merged multi-source alignment with trellis2's group-aligned
- guidance/rewards.md: merged both (PickScore_TextImage_Sum + GenEval)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Jayce-Ping and others added 9 commits May 25, 2026 16:08

update

a8a6695

docs: update GenEval reference to upstream repo

c6ca1f5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Jayce-Ping marked this pull request as ready for review May 25, 2026 14:35

Copilot AI review requested due to automatic review settings May 25, 2026 14:35

Copilot started reviewing on behalf of Jayce-Ping May 25, 2026 14:36

Copilot AI reviewed May 25, 2026


Jayce-Ping and others added 2 commits May 25, 2026 22:44

fix(geneval): warn if device is not CUDA during init

bab6b76

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Jayce-Ping requested a review from Copilot May 25, 2026 14:56

Copilot started reviewing on behalf of Jayce-Ping May 25, 2026 14:56

Copilot AI reviewed May 25, 2026


Jayce-Ping and others added 2 commits May 25, 2026 23:08

docs(geneval): update install script to note Python 3.12 is tested

f87c36a

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Jayce-Ping merged commit 8a97410 into main May 26, 2026

Jayce-Ping merged 13 commits into main from feat/geneval-support
