feat: GenEval reward support + unified trainer sampling pipeline#165
Merged
Conversation
- Add GenEval reward model (Mask2Former object detection + CLIP color classification) with lazy optional dependencies (mmcv/mmdetection) - Extract `sample_batch()`, `generate_samples()`, and `evaluate()` into BaseTrainer, eliminating ~460 lines of duplicated sampling/eval loops across all 6 trainers (GRPO, NFT, DPO, AWM, CRD, DGPO) - Add `_inject_batch_metadata()` to automatically pass dataset metadata (e.g. geneval_metadata) from dataloader batches into sample.extra_kwargs - Include full deduplicated GenEval dataset (33,199 train / 553 test) with metadata stored as JSON strings to avoid Arrow serialization issues - Add install script and example training config for SD3.5 + GenEval Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Upgrade from mmdet 2.x to mmdet 3.x (DetDataSample API, auto config
resolution via .mim/configs/, checkpoint auto-download from model zoo)
- Replace clip_benchmark dependency with direct open_clip encode_image/
encode_text — eliminates one heavy optional dependency
- Add torch.amp.autocast("cuda", enabled=False) guard to prevent bf16
crashes in mmdet's ms_deform_attn CUDA kernel during eval
- Flatten dataset format: separate include/exclude/tag fields instead of
single geneval_metadata blob (matches reward model's required_fields)
- Add dataset/geneval/object_names.txt (80 COCO classes)
- Simplify install script: mmdet 3.x installs via pip (no CUDA compile)
- Now supports Python 3.10-3.12 (no longer limited to 3.10)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add "Dataset Metadata Convention" section to guidance/rewards.md and update docstrings in abc.py, dataset.py, my_reward.py to document the implicit contract: complex metadata stored as JSON strings in JSONL, reward models responsible for json.loads() parsing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
openmim relies on pkg_resources which removed pkgutil.ImpImporter in Python 3.12, causing AttributeError. Replace mim-based install with direct pip/uv using OpenMMLab prebuilt wheel index URL (auto-detected from torch+CUDA version). Also remove openmim from pyproject.toml optional deps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mmcv's setup.py depends on pkg_resources (removed in Python 3.12+) and prebuilt wheels don't exist for cutting-edge torch+CUDA combos. Simplify the script to: warn if not Python 3.10, then source-compile mmcv/mmdet with --no-build-isolation. Remove openmim from pyproject.toml (broken on 3.12). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mmdet 3.3.0 requires mmcv>=2.0.0rc4,<2.2.0. The main branch of mmcv is already 2.2.0 which exceeds the upper bound. Pin to v2.1.0 tag. Also pin mmdetection to v3.3.0 tag for reproducibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Update geneval.py docstring to recommend install_geneval_deps.sh - Add GenEval to built-in reward models table in README and guidance - Add install note for GenEval dependencies (Python 3.10 recommended) - List GenEval in AGENTS.md reward models Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds a new GenEval reward model (Mask2Former detection + CLIP color classification) and refactors the trainer stack to use a unified sampling / evaluation pipeline in BaseTrainer, reducing duplicated code across trainers and enabling dataset metadata passthrough into reward-model kwargs.
Changes:
- Introduces
GenEvalRewardModeland registers it in the reward registry, plus docs/config/scripts for installing optional deps and using the dataset. - Extracts
sample_batch(),generate_samples(), and a defaultevaluate()intoBaseTrainer, and updates trainers to delegate sampling to the shared implementation. - Adds a dataset metadata convention (
metadatacolumn →sample.extra_kwargs→ reward__call__kwargs) and ships GenEval JSONL + class list.
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/flow_factory/trainers/abc.py | Adds unified sampling (sample_batch/generate_samples) and default evaluation loop with metadata injection. |
| src/flow_factory/trainers/grpo.py | Delegates sampling to generate_samples() (keeps GRPO trajectory/logprob settings). |
| src/flow_factory/trainers/nft.py | Delegates sampling to generate_samples() (final-latent-only rollouts). |
| src/flow_factory/trainers/dpo.py | Delegates sampling to generate_samples() (final-latent-only rollouts). |
| src/flow_factory/trainers/dgpo.py | Delegates sampling to generate_samples() (final-latent-only rollouts). |
| src/flow_factory/trainers/crd.py | Delegates sampling to generate_samples() (final-latent-only rollouts). |
| src/flow_factory/trainers/awm.py | Delegates sampling to generate_samples() (final-latent-only rollouts). |
| src/flow_factory/rewards/geneval.py | Implements GenEval reward model with optional mmdet/mmcv + open_clip deps. |
| src/flow_factory/rewards/registry.py | Registers the geneval reward model name → implementation path. |
| src/flow_factory/rewards/abc.py | Clarifies that **kwargs comes from sample.extra_kwargs / dataset metadata. |
| src/flow_factory/rewards/my_reward.py | Documents how to receive metadata fields in custom reward signatures. |
| src/flow_factory/data_utils/dataset.py | Packs non-preprocess JSONL fields into a metadata column for passthrough. |
| guidance/rewards.md | Documents the dataset metadata convention and adds GenEval to the reward list. |
| README.md | Adds GenEval to the reward table and notes the optional dependency install path. |
| examples/grpo/lora/sd3_5/geneval.yaml | Provides an end-to-end example config for training/eval with GenEval. |
| scripts/install_geneval_deps.sh | Adds a helper script to install/compile mmcv + install mmdet/open_clip. |
| scripts/convert_geneval_dataset.py | Adds a conversion/dedup script for GenEval JSONL (store metadata as JSON strings). |
| pyproject.toml | Adds a geneval optional-dependency extra. |
| dataset/geneval/test.jsonl | Adds the GenEval test split in JSONL (include/exclude stored as JSON strings). |
| dataset/geneval/object_names.txt | Adds the COCO-ish class-name list used for label mapping. |
| AGENTS.md | Updates the documented reward list to include GenEval. |
Comments suppressed due to low confidence (3)
src/flow_factory/rewards/geneval.py:219
- GenEvalRewardModel hard-codes CUDA device selection via
cuda:{accelerator.local_process_index}. This ignores the reward config’sdevice(and BaseRewardModel’sself.device) and will break if the reward is configured for CPU or non-CUDA backends, and can also diverge from Accelerate’s chosen device in more complex setups. Prefer initializing both mmdet and open_clip usingself.device(orself.device.type/index) rather than constructing a CUDA string manually.
device_str = f"cuda:{self.accelerator.local_process_index}"
self._detector = init_detector(
detector_config,
detector_checkpoint,
device=device_str,
)
logger.info(f"Mask2Former loaded on {device_str}.")
src/flow_factory/rewards/geneval.py:185
- The ImportError guidance for missing GenEval deps is incomplete: Mask2Former inference requires mmcv ops in addition to mmdet/mmengine, and this repo provides a dedicated installer script / optional extra. Consider updating the error message to point users to
pip install flow-factory[geneval]and/orbash scripts/install_geneval_deps.sh, so failures are actionable.
try:
from mmdet.apis import init_detector, inference_detector
except ImportError:
raise ImportError(
"mmdet is required for GenEval reward. "
"Install with: pip install mmdet mmengine"
)
src/flow_factory/rewards/geneval.py:498
torch.amp.autocast("cuda", enabled=False)is always entered, even if the reward model is configured to run on CPU. If you switch GenEval to useself.device, consider also keying this autocast guard offself.device.type(or using a no-op context on non-CUDA) to avoid backend mismatches.
# Disable AMP: mmdet's ms_deform_attn CUDA kernel does not support bf16,
# and CLIP weights are loaded as fp32. Without this guard, bf16 autocast
# from the trainer's eval loop would cause runtime errors.
with torch.amp.autocast("cuda", enabled=False):
for i in range(batch_size):
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| self._counting_threshold = getattr( | ||
| config, "counting_threshold", DEFAULT_COUNTING_THRESHOLD | ||
| ) | ||
| self._nms_threshold = getattr(config, "nms_threshold", DEFAULT_NMS_THRESHOLD) |
Comment on lines
+77
to
+79
| "mmcv", | ||
| "mmengine", | ||
| "mmdet>=3.3.0", |
- Remove unused nms_threshold parameter (read but never applied) - Pin mmcv>=2.0.0,<2.2.0 and mmdet>=3.3.0,<4.0.0 in pyproject.toml for reproducible `pip install .[geneval]` - Update ImportError messages to point to install_geneval_deps.sh Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines
+506
to
+512
| self.adapter.rollout() | ||
| if reward_buffer is not None: | ||
| reward_buffer.clear() | ||
|
|
||
| if trajectory_indices is None: | ||
| trajectory_indices = [-1] | ||
|
|
Comment on lines
+145
to
+148
| if self.device.type != "cuda": | ||
| logger.warning( | ||
| "GenEval is configured on CPU but requires CUDA for Mask2Former inference. " | ||
| "Set `device: cuda` in your reward config." |
| # Disable AMP: mmdet's ms_deform_attn CUDA kernel does not support bf16, | ||
| # and CLIP weights are loaded as fp32. Without this guard, bf16 autocast | ||
| # from the trainer's eval loop would cause runtime errors. | ||
| with torch.amp.autocast("cuda", enabled=False): |
- generate_samples: preserve None semantics for trajectory_indices (None = no recording, used in evaluate; no longer coerced to [-1]) - geneval: hard-error on non-CUDA device instead of warning (Mask2Former CUDA kernels will fail anyway, fail fast with clear msg) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
87003697
pushed a commit
to 87003697/Flow-Factory
that referenced
this pull request
Jun 2, 2026
Brings trellis2 branch up-to-date with origin/main, integrating: - PR X-GenGroup#170: DiffusionOPD on-policy distillation trainer - PR X-GenGroup#168: Multi-dataset training with per-source reward routing - PR X-GenGroup#165: GenEval reward + unified trainer sampling pipeline - PR X-GenGroup#163: hparams/training_args.py → package split - PR X-GenGroup#162: Reconcile config with runtime distributed state - PRs X-GenGroup#121,X-GenGroup#146-X-GenGroup#161: CRD algorithm, Docker CUDA 12.9, HF resume, etc. Conflict resolutions: - trainers/registry.py: merged both sides (trellis2_grpo/nft + crd/opd) - rewards/registry.py: merged both sides (trellis2 rewards + geneval) - hparams/training_args.py: deleted (accept main's package split), added trellis2_grpo/nft entries to _registry.py - trainers/grpo.py: removed duplicate evaluate() (unified in BaseTrainer) - trainers/nft.py: adopted generate_samples(), removed duplicate evaluate() - trainers/abc.py: added _extra_eval_inference_kwargs() hook to BaseTrainer evaluate() so Trellis2TrainerMixin can inject stages/render_kwargs - samples/samples.py: merged source/source_id fields with trellis2 additions - rewards/reward_processor.py: merged source-aware gating with trellis2's _store_reward_extra_info - data_utils/loader.py: merged multi-source pipeline with image_3d dataset - data_utils/sampler_loader.py: merged refactored parameters - hparams/args.py: merged multi-source alignment with trellis2's group-aligned - guidance/rewards.md: merged both (PickScore_TextImage_Sum + GenEval) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
sample_batch()/generate_samples()/evaluate()into BaseTrainer, eliminating ~460 lines of duplicated code across all 6 trainers._inject_batch_metadata()automatically passes dataset-level metadata (e.g.geneval_metadata) from dataloader batches into samples'extra_kwargs, enabling metadata-dependent rewards without modifying adapters.Key Design Decisions
sample_batch()is publicevaluate()removed from@abstractmethodreward_type: "score"for trainingFiles Changed
trainers/abc.py(+200),trainers/{grpo,nft,dpo,awm,crd,dgpo}.py(-461 total)rewards/geneval.py,rewards/registry.pydataset/geneval/{train,test}.jsonlexamples/grpo/lora/sd3_5/geneval.yaml,scripts/install_geneval_deps.sh,scripts/convert_geneval_dataset.py,pyproject.tomlTest plan
examples/grpo/lora/sd3_5/default.yaml) — verify no regression from sampling pipeline refactorgenerate_samples()delegation worksexamples/grpo/lora/sd3_5/geneval.yamlend-to-end with GenEval dataset🤖 Generated with Claude Code