[Optimization] Add selective trajectory storage #28

Merged: Jayce-Ping merged 6 commits into main from opt_trajectory on Feb 9, 2026

@Jayce-Ping

No description provided.

Jayce-Ping merged commit b9e47bc into main Feb 9, 2026
Jayce-Ping deleted the opt_trajectory branch February 9, 2026 12:03
Jayce-Ping added a commit that referenced this pull request Apr 25, 2026
…, sync agent knowledge

- Restructure examples/ to algorithm/ft/model/variant.yaml with examples/README.md
- Add LTX-2/2.3 to README (News, model table, install note)
- Add .scratch/ constraint for agent temp files (#28), examples convention (#29)
- Sync agent knowledge: GroupDistributedSampler in samplers.md, LTX2 + RationalRewards in architecture.md
- Clean up .docs/ltx2-research/ dev artifacts
- Update LTX2 configs: guidance_scale=1.0, comment out attn_backend

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Jayce-Ping added a commit that referenced this pull request Apr 25, 2026
… support (#118)

* docs: comprehensive analysis of samples, rewards, and video adapter architecture

Add detailed documentation covering:
- Complete BaseSample dataclass fields and sample type hierarchy (T2ISample, T2VSample, etc.)
- RewardModelOutput and reward model abstract classes (PointwiseRewardModel, GroupwiseRewardModel)
- WAN2 text-to-video adapter implementation and video generation pipeline
- Sample canonicalization and unique_id computation via SHA256 hashing
- Video format specifications (Tensor(T, C, H, W) with frame/dimension constraints)
- Reward model input/output modality handling and tensor input configuration
- Data flow examples from generation through sampling to reward computation
- Quick reference cards and integration patterns for implementing new adapters

Files created:
- START_HERE.md: Quick navigation guide
- ANALYSIS_REPORT.md: Comprehensive technical analysis
- QUICK_REFERENCE.md: Copy-paste templates and patterns
- MODALITY_FLOW.md: Complete data flow walkthrough
- CODEBASE_EXPLORATION.md: File structure and discovery process
- DOCUMENTATION_INDEX.md: Structured index of all findings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: move research docs to .docs/ltx2-research/

Move auto-generated analysis documents out of project root
into a dedicated temporary folder for LTX2 integration research.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [utils] feat: add audio utility module with type aliases, standardization, and loading

Add `utils/audio.py` following the exact pattern of `utils/image.py` and
`utils/video.py`. Provides a complete audio waveform toolkit:

- Type aliases: `AudioSingle` (C,T), `AudioBatch` (B,C,T)
- Validation: `is_audio()`, `is_audio_batch()`
- Loading/saving: `load_audio()`, `save_audio()` with 3-tier backend
  fallback (torchaudio → soundfile → stdlib wave)
- Conversion: `audio_to_tensor()`, `audio_to_numpy()`, `convert_audio()`
  for resampling and mono/stereo conversion
- Standardization: `standardize_audio_batch()` with output_type='pt'|'np'
- Hashing: `hash_audio()`, `hash_audio_list()` with int16 quantization

Design conventions follow diffusers/audiocraft:
- Channel-first (C, T) tensor layout
- [-1.0, 1.0] float32 value range
- Channel conversion: downmix=mean, upmix=repeat
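
To illustrate why hashing goes through int16 quantization, here is a minimal sketch using only the standard library. The function name `hash_audio_sketch`, the use of SHA-256, and the truncated hex digest are assumptions for illustration; the actual `hash_audio()` in `utils/audio.py` may differ.

```python
import hashlib
from array import array


def hash_audio_sketch(samples, num_bytes=16):
    """Hash a mono waveform given as floats in [-1.0, 1.0] (hypothetical helper)."""
    # Quantize to int16 so tiny float differences map to the same bytes.
    quantized = array(
        "h",
        (max(-32768, min(32767, round(s * 32767))) for s in samples),
    )
    digest = hashlib.sha256(quantized.tobytes()).digest()
    return digest[:num_bytes].hex()


base = hash_audio_sketch([0.0, 0.5, -0.5, 1.0])
jittered = hash_audio_sketch([0.0, 0.5 + 1e-7, -0.5, 1.0])
different = hash_audio_sketch([0.0, 0.6, -0.5, 1.0])
```

Float noise below the int16 step leaves the hash unchanged (`base == jittered`), while an audible change yields a different hash.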

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [samples] feat: add audio field to BaseSample and T2AVSample class

- Add `audio: Optional[torch.Tensor]` field to BaseSample with
  automatic promotion from 1D (T,) to 2D (C, T) via audio_to_tensor()
- Add T2AVSample(BaseSample) for text-to-audio-video generation tasks
- Update __init__.py exports

The audio field follows the same pattern as image/video: stored without
batch dimension, standardized in __post_init__, supports stack/to_dict.
Fully backward compatible — defaults to None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [data] feat: support audio loading and preprocessing in dataset pipeline

- Add `audio_dir` field to DataArguments (defaults to 'audios' subfolder)
- Add audio loading block in GeneralDataset._preprocess_batch() that
  mirrors the existing video loading pattern: detect 'audio' column in
  JSONL, load via load_audio(), pass to preprocess_func(audios=...)
- Add 'audios' to PREPROCESS_KEYS for metadata exclusion
- Pass audio_dir through fn_kwargs to .map() call

The audio_dir parameter flows automatically through loader.py via
filter_kwargs — no changes needed in loader.py. Fully backward
compatible: datasets without audio columns are unaffected.

JSONL format: {"prompt": "...", "audio": "file.wav"}
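
As a rough sketch of how a row in this JSONL format could be resolved against `audio_dir`, assuming a hypothetical helper `resolve_audio_records` (not part of the repository) and that rows without an `audio` column are passed through untouched:

```python
import json
import os


def resolve_audio_records(jsonl_text, dataset_dir, audio_dir="audios"):
    """Parse JSONL rows and attach a resolved audio path when present."""
    records = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        audio_name = row.get("audio")
        # Rows without an "audio" column stay valid; the path is simply None.
        row["audio_path"] = (
            os.path.join(dataset_dir, audio_dir, audio_name) if audio_name else None
        )
        records.append(row)
    return records


sample = (
    '{"prompt": "rain on a tin roof", "audio": "rain.wav"}\n'
    '{"prompt": "silent scene"}\n'
)
rows = resolve_audio_records(sample, "data")
```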

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models] feat: add encode_audio interface and audio_vae support to BaseAdapter

- Add `encode_audio()` default method that returns None — non-abstract,
  so existing adapters need no changes
- Add `audio_vae` property with getter and setter (mirrors vae pattern)
- Update `preprocess_func()` to accept `audios` parameter and route it
  through `encode_audio()` in the same loop as prompt/image/video
- Update `_freeze_vae()` to also freeze audio_vae when present, keeping
  `_freeze_components()` clean

Fully backward compatible: encode_audio returns None by default,
audio_vae returns None when pipeline has no audio_vae component.
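
The no-op default pattern described above can be sketched as follows. The class names and the simplified `preprocess_func` signature are hypothetical stand-ins, not the real `BaseAdapter` API; the point is only that a non-abstract default returning `None` lets existing adapters ignore the new modality.

```python
from typing import Any, Dict, List, Optional


class SketchBaseAdapter:
    def encode_prompt(self, prompts: List[str]) -> Optional[Dict[str, Any]]:
        return {"prompt_lengths": [len(p) for p in prompts]}

    def encode_audio(self, audios: List[Any]) -> Optional[Dict[str, Any]]:
        # Non-abstract default: adapters without an audio encoder inherit this.
        return None

    def preprocess_func(self, prompts=None, audios=None) -> Dict[str, Any]:
        batch: Dict[str, Any] = {}
        pairs = ((prompts, self.encode_prompt), (audios, self.encode_audio))
        for inputs, encoder in pairs:
            if inputs is None:
                continue
            encoded = encoder(inputs)
            if encoded is not None:  # skip modalities the adapter does not consume
                batch.update(encoded)
        return batch


class SketchAudioAdapter(SketchBaseAdapter):
    def encode_audio(self, audios):
        return {"audio_lengths": [len(a) for a in audios]}


plain = SketchBaseAdapter().preprocess_func(prompts=["hi"], audios=[[0.0, 0.1]])
with_audio = SketchAudioAdapter().preprocess_func(prompts=["hi"], audios=[[0.0, 0.1]])
```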

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [rewards] feat: add audio parameter to reward model interfaces

- Add `audio: Optional[List[torch.Tensor]]` parameter to both
  PointwiseRewardModel.__call__ and GroupwiseRewardModel.__call__,
  positioned after `video` and before `condition_images`
- Add 'audio' to RewardProcessor.MEDIA_FIELDS
- Add audio branch in _convert_media_format(): tensor passthrough
  when use_tensor_inputs=True, numpy conversion otherwise

Audio is always tensor-based (no PIL equivalent). Existing reward
models (PickScore, CLIP, etc.) are unaffected — their concrete
__call__ signatures don't include `audio`, so filter_kwargs strips
it automatically with zero overhead.
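
A minimal sketch of signature-based filtering, assuming `filter_kwargs` works roughly like the hypothetical `filter_kwargs_sketch` below (the real helper's signature is not shown in this commit message):

```python
import inspect


def filter_kwargs_sketch(func, **kwargs):
    """Keep only the keyword arguments that `func` can accept."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter accepts everything, so nothing is stripped.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}


class ImageOnlyReward:
    """Stand-in for a reward model whose __call__ has no `audio` parameter."""

    def __call__(self, prompt, image):
        return [float(len(p)) for p in prompt]


model = ImageOnlyReward()
batch = {"prompt": ["a cat"], "image": ["<img>"], "audio": ["<wave>"]}
accepted = filter_kwargs_sketch(model.__call__, **batch)
scores = model(**accepted)
```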

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update Step 6 sub-plan to v4 with verified diffusers 0.38.0.dev0 API

Key corrections from runtime verification:
- Connectors use additive mask, not padding_side
- Transformer forward has no sigma/audio_timestep/STG/modality params
- CFG is velocity-space with [uncond, cond] chunk order
- Compression ratios are instance attributes, not class properties
- Audio params from pipeline attributes, not transformer config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: clean up obsolete research docs, consolidate plan

Remove 8 early exploration documents that have been fully superseded
by actual code implementation and the two remaining plan files:
- COMMIT_PLAN.md: overall progress tracker (updated with final status)
- STEP6_SUBPLAN.md: detailed LTX2 adapter plan (v4, API-verified)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] feat: scaffold LTX2 adapter with sample dataclass and pipeline loading

Add LTX2 text-to-audio-video adapter scaffold:

- LTX2Sample(T2AVSample): dataclass with audio trajectory fields
  (audio_all_latents, audio_latent_index_map), connector embedding
  fields for both video/audio streams, and negative prompt fields
  for CFG during training

- LTX2_T2AV_Adapter(BaseAdapter): skeleton with:
  - load_pipeline(): LTX2Pipeline.from_pretrained with low_cpu_mem_usage=False
  - _create_audio_scheduler(): ODE-only FlowMatchEulerDiscreteSDEScheduler
    (separate instance to avoid step_index collision with video)
  - default_target_modules: 28 Linear layers per block, verified against
    LTX2VideoTransformerBlock.named_modules()
  - preprocessing_modules: ['text_encoders', 'connectors']
  - inference_modules: ['transformer', 'vae', 'audio_vae', 'connectors', 'vocoder']
  - Stub methods for encode/decode/forward/inference (NotImplementedError)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] feat: implement encode_prompt and decode_latents for LTX2

encode_prompt():
- Delegates to pipeline.encode_prompt() for Gemma3 all-layer encoding
  + _pack_text_embeds normalization
- Passes through connectors with additive attention mask
  (1 - binary_mask) * -1e6, additive_mask=True
- Splits [negative, positive] connector outputs for CFG
- Returns prompt_ids + video/audio connector embeddings + masks
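
The additive mask formula `(1 - binary_mask) * -1e6` can be checked on a toy example. This is a plain-Python sketch with hypothetical helper names; the adapter itself operates on tensors.

```python
import math


def to_additive_mask(binary_mask, neg_value=-1e6):
    """Map 1 (keep) to 0.0 and 0 (padding) to a large negative bias."""
    return [(1 - m) * neg_value for m in binary_mask]


def masked_softmax(scores, additive_mask):
    biased = [s + b for s, b in zip(scores, additive_mask)]
    peak = max(biased)
    exps = [math.exp(x - peak) for x in biased]
    total = sum(exps)
    return [e / total for e in exps]


mask = to_additive_mask([1, 1, 0])
weights = masked_softmax([0.3, 0.3, 5.0], mask)
```

The padded position receives effectively zero attention weight even though its raw score is the largest.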

decode_latents():
- Video: unpack → denormalize → optional timestep conditioning with
  decode noise injection → VAE decode → postprocess
- Audio: denormalize → unpack → audio_vae decode → vocoder
  (denormalize BEFORE unpack — order differs from video!)
- All operations match pipeline source L1172-1218 exactly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] feat: implement forward() with CFG and dual scheduler steps

Single denoising step matching pipeline L1097-1154:
1. Prepare CFG inputs: duplicate latents + concat [neg, pos] embeddings
2. Joint transformer forward with cache_context("cond_uncond")
3. CFG in velocity-space: uncond + gs * (cond - uncond), with
   optional guidance_rescale
4. Video: SDE scheduler step (stochastic, with log_prob for RL)
5. Audio: ODE scheduler step (deterministic, no log_prob)
6. Attach audio_next_latents on video output for trajectory tracking
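
Step 3 can be sketched on plain lists. The helper names are hypothetical, `guidance_rescale` is omitted, and the `[uncond, cond]` chunk order follows the earlier sub-plan notes.

```python
def split_uncond_cond(batched):
    """Split a CFG-doubled batch laid out as [uncond, cond]."""
    half = len(batched) // 2
    return batched[:half], batched[half:]


def cfg_combine(uncond, cond, guidance_scale):
    """Classifier-free guidance: uncond + gs * (cond - uncond), elementwise."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


velocity = [0.0, 1.0, 1.0, 3.0]  # first half uncond, second half cond
uncond, cond = split_uncond_cond(velocity)
no_guidance = cfg_combine(uncond, cond, 1.0)
guided = cfg_combine(uncond, cond, 2.0)
```

With `guidance_scale=1.0` the result equals the conditional prediction, which is why a scale of 1.0 is commonly treated as "CFG off".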

RoPE coords are computed on demand if not cached from inference loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] feat: implement inference loop and register ltx2_t2av adapter

inference():
- Full denoising loop following pipeline L908-1226
- Encode prompts (with fallback to pre-encoded inputs)
- Compute latent dimensions: video (32x spatial, 8x temporal), audio
  (4x mel, 4x temporal, sr=16kHz, hop=160)
- Prepare video + audio latents via pipeline.prepare_latents/audio_latents
- Timestep shift with LTX2-specific mu (base_seq=1024, shift=0.95-2.05)
- Positional coords via transformer.rope / audio_rope
- Dual trajectory collection: video (for RL) + audio (for reconstruction)
- Decode both modalities and construct LTX2Sample per batch element
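
A back-of-the-envelope sketch of the latent dimension arithmetic, using the compression ratios listed above. The exact formulas (the `(num_frames - 1) // 8 + 1` frame count and the rounding of the audio length) are assumptions modeled on common causal video VAE conventions, not copied from the adapter.

```python
def ltx2_latent_shapes_sketch(
    height, width, num_frames, frame_rate,
    spatial=32, temporal=8,
    sample_rate=16000, hop_length=160, audio_temporal=4,
):
    """Estimate video and audio latent sizes (hypothetical helper)."""
    duration_s = num_frames / frame_rate
    # Mel frames per second, then compressed by the audio VAE temporal ratio.
    audio_latents_per_second = sample_rate / hop_length / audio_temporal
    return {
        "latent_height": height // spatial,
        "latent_width": width // spatial,
        # Assumed: first frame kept, remaining frames compressed 8x.
        "latent_frames": (num_frames - 1) // temporal + 1,
        "audio_latent_frames": round(duration_s * audio_latents_per_second),
    }


shapes = ltx2_latent_shapes_sketch(height=512, width=768, num_frames=121, frame_rate=24)
```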

Registry:
- Add 'ltx2_t2av' entry mapping to LTX2_T2AV_Adapter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] refactor: improve type safety and code cleanup for LTX2 adapter

- Add self.scheduler type declaration (consistent with all other adapters)
- Replace SDESchedulerOutput with FlowMatchEulerDiscreteSDESchedulerOutput
- Change forward() return type to Tuple instead of dynamic attribute
- Fix Optional parameter annotations for forward() signature
- Remove unused imports (Any, DISTILLED_SIGMA_VALUES) and variables
- Replace unicode arrows with ASCII in comments for encoding safety

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update COMMIT_PLAN with completed steps summary and next steps (7-9)

Steps 1-6 + type cleanup are all committed. Remaining work:
- Step 7: Design multi-modal forward() return pattern for optimize()
- Step 8: Example YAML configs (GRPO, NFT, AWM)
- Step 9: Audio-video dataset integration for testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] refactor: promote num_frames and frame_rate to explicit LTX2Sample fields

Move num_frames and frame_rate from extra_kwargs to explicit dataclass
fields on LTX2Sample, consistent with height/width on BaseSample.
Add num_frames to _shared_fields (shared across batch, not stacked).
Keep duration_s in extra_kwargs as a derived value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] refactor: unified latent interface for forward()

Redesign forward() to accept concatenated video+audio latents as a single
tensor (B, video_seq + audio_seq, C). Internally splits by video_seq_len,
runs the joint transformer, steps video SDE and audio ODE schedulers
separately, then concatenates next_latents back into a unified output.

This makes the trainer interface identical to single-modality adapters:
- forward() returns a single FlowMatchEulerDiscreteSDESchedulerOutput
- Trainers access output.next_latents and output.log_prob directly
- No trainer changes needed for multi-modal generation

Key changes:
- forward(): accept unified latents, split/cat internally, return single output
- inference(): cat video+audio before loop, use single latent_collector
- LTX2Sample: replace audio_all_latents with video_seq_len split point
- Remove Tuple return type, keep dual scheduler (video SDE + audio ODE)
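
The split/step/concatenate flow can be sketched for a single sample with plain lists standing in for the token dimension. The function name and the callable steppers are hypothetical; the real `forward()` works on `(B, video_seq + audio_seq, C)` tensors and scheduler objects.

```python
def unified_step_sketch(latents, video_seq_len, video_step, audio_step):
    """Split unified latents, step each modality, and re-concatenate."""
    video = latents[:video_seq_len]
    audio = latents[video_seq_len:]
    next_video = [video_step(tok) for tok in video]  # stands in for the SDE step
    next_audio = [audio_step(tok) for tok in audio]  # stands in for the ODE step
    return next_video + next_audio


latents = [1.0, 2.0, 3.0, 10.0, 20.0]  # 3 video tokens followed by 2 audio tokens
next_latents = unified_step_sketch(
    latents,
    video_seq_len=3,
    video_step=lambda x: x + 0.5,
    audio_step=lambda x: x * 0.5,
)
```

Because the output has the same length and layout as the input, a caller only needs `video_seq_len` to recover either modality.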

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update COMMIT_PLAN — Step 7 complete, Steps 8-9 remaining

Step 7 resolved via unified latent interface:
- 7a: Type safety cleanup (FlowMatchEulerDiscreteSDESchedulerOutput)
- 7b: Promote num_frames/frame_rate to explicit LTX2Sample fields
- 7c: Unified forward() — cat(video,audio) input, single output, dual scheduler internal

Remaining: Step 8 (example YAML configs), Step 9 (test dataset)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update COMMIT_PLAN to reflect current state, remove obsolete STEP6_SUBPLAN

- Step 7 fully completed (7a-7c): type cleanup, explicit fields, unified latents
- Steps 8-9 remain: example configs + test dataset
- Add deferred features table with diffusers source availability
- Delete STEP6_SUBPLAN.md (diverged from implementation, no longer useful)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: correct deferred features — STG, prompt enhancement, x0-guidance all exist in diffusers

All of the following features are available in installed diffusers 0.38.0.dev0:
- STG: extra transformer forward with spatio_temporal_guidance_blocks
- Modality Isolation: extra forward with isolate_modalities=True
- Prompt Enhancement: Gemma3 text_encoder.generate() with system prompt
- x0-space guidance: convert_velocity_to_x0/convert_x0_to_velocity

Note: current adapter uses velocity-space CFG; x0-space is prerequisite
for STG and Modality Isolation Guidance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] feat: align forward() with official x0-space multi-guidance pipeline

Rewrite the guidance section to match official diffusers pipeline_ltx2.py:

1. x0-space guidance: convert velocity predictions to x0-space, compute all
   guidance deltas (CFG + STG + Modality Isolation) in x0-space, then convert
   back to velocity. This replaces the previous velocity-space CFG.

2. STG (Spatio-Temporal Guidance): optional extra transformer forward with
   spatio_temporal_guidance_blocks to perturb specific blocks. Separate
   stg_scale / audio_stg_scale for video and audio.

3. Modality Isolation Guidance: optional extra transformer forward with
   isolate_modalities=True (disables A2V/V2A cross-attention). Separate
   modality_scale / audio_modality_scale.

4. Prompt Enhancement: inference() supports system_prompt parameter to
   rewrite prompts via Gemma3 text_encoder.generate().

5. LTX-2.3 compatibility: pass sigma=timestep and use_cross_timestep
   to all transformer forward calls.

6. Independent audio guidance: separate audio_guidance_scale,
   audio_guidance_rescale, audio_stg_scale, audio_modality_scale
   (all default to their video counterparts).
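
The velocity-to-x0 round trip in item 1 can be sketched with scalars. This assumes the flow-matching parameterization `x_t = (1 - sigma) * x0 + sigma * noise` with `v = noise - x0`; the function names are hypothetical and do not reproduce the diffusers helper signatures.

```python
def velocity_to_x0(x_t, v, sigma):
    # Under the assumed parameterization, x0 = x_t - sigma * v.
    return x_t - sigma * v


def x0_to_velocity(x_t, x0, sigma):
    return (x_t - x0) / sigma


def guided_velocity_x0_space(x_t, v_uncond, v_cond, sigma, guidance_scale):
    """Apply CFG to x0 predictions, then convert back to velocity."""
    x0_uncond = velocity_to_x0(x_t, v_uncond, sigma)
    x0_cond = velocity_to_x0(x_t, v_cond, sigma)
    x0 = x0_uncond + guidance_scale * (x0_cond - x0_uncond)
    return x0_to_velocity(x_t, x0, sigma)


v_x0_space = guided_velocity_x0_space(1.0, 0.2, 0.6, sigma=0.5, guidance_scale=3.0)
v_velocity_space = 0.2 + 3.0 * (0.6 - 0.2)
```

For plain CFG the two spaces agree because the conversion is affine in `v`; the x0-space formulation only changes results once non-linear terms such as rescaling enter.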

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update COMMIT_PLAN with refined next steps (8a/8b/9)

Restructured remaining steps:
- 8a: Inference alignment refinements (validation, num_frames rounding,
  coord pre-duplication, distilled sigmas, embed concatenation)
- 8b: Example YAML configs (GRPO/NFT/AWM x lora/full)
- 9: VGGSound-50k test dataset integration

Updated design decisions to reflect x0-space guidance, sigma-based
conversion, and adapter design philosophy (inference=__call__, forward=step).
Removed obsolete deferred features (STG, prompt enhancement, x0-guidance
now implemented).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] refactor: align inference() with official pipeline

- Add _check_inputs(): validate height/width divisibility, STG block spec,
  auto-round num_frames to VAE-temporal-compatible value
- Pre-duplicate RoPE coords for CFG before denoising loop (official L1201)
- Pre-concatenate [neg, pos] connector embeds before loop to avoid
  re-catting every step; forward() receives _cfg_prepared flag
- Extract positive-only embeds from pre-catted tensors for STG/modality
  passes when _cfg_prepared=True
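
A sketch of the `num_frames` auto-rounding, assuming "VAE-temporal-compatible" means `num_frames = 8 * k + 1` and that rounding goes to the nearest valid value. Both points are assumptions; `_check_inputs()` may round differently.

```python
def round_num_frames_sketch(num_frames, temporal_ratio=8):
    """Round to the nearest value of the form temporal_ratio * k + 1."""
    k = max(0, round((num_frames - 1) / temporal_ratio))
    return k * temporal_ratio + 1


already_valid = round_num_frames_sketch(121)
rounded_up = round_num_frames_sketch(120)
rounded_down = round_num_frames_sketch(100)
```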

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [examples] feat: add LTX2 T2AV GRPO+LoRA example config

Verified with ff-train: config parses correctly, model architecture
resolves to ltx2_t2av, all parameters (resolution, num_frames, frame_rate,
audio_dir, guidance_scale, LoRA settings) are properly propagated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update config and dataset

* [logger,samples] feat: mux audio into video when logging T2AV samples

T2AV samples (e.g. LTX2) carry both video and audio, but the logging
pipeline silently dropped the audio track. This commit adds audio-video
muxing so wandb/other loggers receive a single MP4 with sound.

Changes:
- BaseSample: add `audio_sample_rate` field alongside `audio`
- LTX2 adapter: populate `audio_sample_rate` from pipeline vocoder config
- LogVideo: add optional `audio`/`audio_sample_rate` fields; when present,
  `get_value('mp4')` muxes H.264 video + AAC audio via PyAV (mirrors
  diffusers' encode_video); falls back to silent MP4 if PyAV unavailable
- LogFormatter: add T2AVSample dispatch → `_process_t2av_samples()` that
  creates LogVideo with audio attached and correct fps from frame_rate

Made-with: Cursor

* Fix _resolve_component_names

* [models/ltx2] fix: align connectors call with current diffusers API

Pass binary attention mask directly to LTX2TextConnectors.forward()
instead of pre-computing additive mask. The current diffusers version
handles binary-to-additive conversion internally and no longer accepts
the `additive_mask` keyword argument.

Made-with: Cursor

* update

* Fix

* [models] refactor: unify CFG control via guidance_scale across all adapters

Remove explicit `do_classifier_free_guidance` parameter from encode_prompt,
inference, and forward signatures. Instead, derive the CFG flag internally
from `guidance_scale` (>1.0 standard, >0.0 for Z-Image), ensuring
data_preprocessing and inference stages always use the same CFG decision.

Affected adapters: SD3.5, Z-Image, Wan T2V/I2V/V2V, LTX2 T2AV, Flux2 Klein.
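
The derived flag can be sketched as a single comparison. The function name and the `threshold` parameter are hypothetical; the thresholds themselves come from the description above.

```python
def uses_cfg(guidance_scale, threshold=1.0):
    """Derive the CFG decision from guidance_scale instead of a separate flag."""
    return guidance_scale > threshold


standard_off = uses_cfg(1.0)                  # scale 1.0 means no guidance
standard_on = uses_cfg(4.5)
z_image_on = uses_cfg(0.5, threshold=0.0)     # Z-Image style threshold
```

Computing the flag in one place from one value is what keeps preprocessing and inference from disagreeing about whether negative prompts are needed.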

Made-with: Cursor

* update

* Fix cfg

* Update config

* [models/ltx2] refactor: inline Gemma3 encoding into adapter encode_prompt

Eliminate delegation to pipeline.encode_prompt() and redundant tokenizer
call by inlining _get_gemma_prompt_embeds logic into a new _encode_text()
helper. This produces prompt_ids and negative_prompt_ids from the same
tokenization pass used for embeddings.

Made-with: Cursor

* [samples] refactor: include negative_prompt in unique_id and extract _hash_id_fields

Add negative_prompt/negative_prompt_ids to _id_fields and hash
computation so samples with different negative prompts are correctly
distinguished. Extract _hash_id_fields(hasher) to eliminate duplicated
prompt hashing in ImageConditionSample and VideoConditionSample.
Also parameterize digest length via num_bytes (default 16 = 128-bit).
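
A minimal sketch of the idea, assuming a hypothetical `compute_unique_id_sketch` that hashes only the prompt and negative prompt with SHA-256 and truncates the digest to `num_bytes`. The real `_hash_id_fields()` covers more fields and may encode them differently.

```python
import hashlib


def compute_unique_id_sketch(prompt, negative_prompt=None, num_bytes=16):
    """Hash the id fields and return the truncated digest as an integer."""
    hasher = hashlib.sha256()
    for name, value in (("prompt", prompt), ("negative_prompt", negative_prompt)):
        hasher.update(name.encode("utf-8") + b"\x00")
        if value is None:
            hasher.update(b"\x02")  # distinct marker so None differs from ""
        else:
            hasher.update(b"\x01" + value.encode("utf-8"))
    return int.from_bytes(hasher.digest()[:num_bytes], "big")


id_a = compute_unique_id_sketch("a cat", "blurry")
id_b = compute_unique_id_sketch("a cat", "low quality")
id_a_again = compute_unique_id_sketch("a cat", "blurry")
```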

Made-with: Cursor

* Update config

* Fix bytes

* Fix audio_sample_rate

* [models/ltx2,logger,utils] fix: address PR review — input validation, mux guard, ndim check

- Extend _check_inputs with prompt/embedding presence and CFG+negative
  consistency validation (Copilot comments #3, #5)
- Narrow audio mux guard to format == 'mp4' (comment #6)
- Add ndim validation in standardize_audio_batch (comment #7)

Made-with: Cursor

* [models/ltx2,utils] feat: integrate prompt enhancement with RNG isolation

Add isolated_rng context manager to utils/base.py for safe global RNG
seeding. Implement _enhance_prompt_batch in LTX2 adapter using official
pipeline.enhance_prompt with deterministic seed and RNG state isolation,
preventing seed leakage into downstream noise sampling. Enhancement is
opt-in via system_prompt config (null=disabled, "default"=Lightricks
prompt). Clean up placeholder enhancement code from inference().
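
The RNG isolation idea can be sketched with Python's `random` module standing in for the global torch RNG (an assumption made to keep the example dependency-free; the real `isolated_rng` in `utils/base.py` presumably handles torch state).

```python
import random
from contextlib import contextmanager


@contextmanager
def isolated_rng_sketch(seed):
    """Seed the global RNG inside the block, then restore the prior state."""
    state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)


random.seed(123)
expected_next = random.random()

random.seed(123)
with isolated_rng_sketch(0):
    inside_first = random.random()   # deterministic, driven by seed 0
after_block = random.random()        # unaffected by what happened inside

with isolated_rng_sketch(0):
    inside_second = random.random()
```

Draws inside the block are reproducible, and the draw after the block is the same value the outer stream would have produced had the block never run.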

Made-with: Cursor

* Reorder unique_id priority

* [reward] feat: add CLAP and ImageBind audio reward models

Add two audio-specific reward models for LTX2 audio-video generation:

- CLAPRewardModel: audio-text alignment via LAION CLAP (48 kHz mono,
  cosine similarity, zero new deps via transformers.ClapModel)
- ImageBindRewardModel: audio-video semantic alignment via Meta ImageBind
  (16 kHz mel-spectrogram, video spatial crops, multi-mode scoring)

Register both in reward registry, replace PickScore with CLAP + ImageBind
in ltx2_t2av.yaml config, and update COMMIT_PLAN.md Step 11 as done.

Made-with: Cursor

* [docs] feat: add flat inheritance rules for adapters and trainers

- Constraint #11: trainers MUST inherit from BaseTrainer directly
  (only GRPOGuardTrainer → GRPOTrainer sanctioned)
- Constraint #12: adapters MUST inherit from BaseAdapter directly;
  shared logic uses helpers, code duplication, or mixins
- Update architecture.md and cursor rule to reference both rules

Made-with: Cursor

* [docs] feat: add I2AV adapter and audio reward research plans

- I2AV_PLAN: LTX2 Image-to-Audio-Video adapter design (BaseAdapter
  flat hierarchy, conditioning mask, first-frame preservation)
- AUDIO_REWARD_PLAN: CLAP and ImageBind reward model integration
- COMMIT_PLAN: update GPU validation status and prompt enhancement
  analysis

Made-with: Cursor

* chore: update diffusers submodule

Made-with: Cursor

* [docs] feat: add two-layer sample hierarchy constraint

Model-specific samples must inherit from task-level samples (e.g.
LTX2I2AVSample → I2AVSample), never from other model-specific samples.
Updated constraint #14, architecture.md, and base-class-contract rule.

Made-with: Cursor

* [docs] update: expand I2AV plan with full implementation details

- Complete LTX2I2AVSample dataclass with all duplicated LTX2 fields
- Code duplication strategy (no _common.py shared module)
- encode_image with condition_image_size (flux2/qwen pattern)
- Dual-path inference: raw images or pre-encoded condition_images
- Full forward()/inference() structure with conditioning mask semantics
- enhance_prompt I2AV multimodal difference documented

Made-with: Cursor

* [reward] fix: add transformers v5 compatibility for PickScore get_*_features() API

In transformers >=5.0, get_text_features()/get_image_features() return
BaseModelOutputWithPooling instead of a tensor. Add _extract_feature_tensor()
helper to handle both v4 (tensor) and v5 (ModelOutput) return types.
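
One way such a helper could look, sketched with duck typing. Reading `pooler_output` is an assumption about which field the real `_extract_feature_tensor()` uses, and `SimpleNamespace` stands in for `BaseModelOutputWithPooling`.

```python
from types import SimpleNamespace


def extract_feature_tensor_sketch(output):
    """Return the feature tensor from either a raw tensor or a ModelOutput-like object."""
    pooled = getattr(output, "pooler_output", None)
    # v4-style return: the object is already the tensor.
    return output if pooled is None else pooled


v4_style = extract_feature_tensor_sketch([0.1, 0.2])
v5_style = extract_feature_tensor_sketch(SimpleNamespace(pooler_output=[0.3, 0.4]))
```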

Made-with: Cursor

* [samples,logger] feat: add I2AVSample dataclass and logger support

Introduce I2AVSample (Image-to-Audio-Video) task-level sample and
its logger handler, combining I2V condition-image table layout with
T2AV audio-muxed LogVideo for complete I2AV logging across backends.

Made-with: Cursor

* [adapter,registry] feat: add LTX2 I2AV adapter for image-conditioned audio-video generation

- LTX2_I2AV_Adapter(BaseAdapter) with conditioning_mask for frame-0 preservation
- forward(): CFG-doubles conditioning_mask internally, per-token video timestep
  masking, scheduler.step on generated frames only
- inference(): dual-path image input (raw PIL via encode_image or pre-encoded tensor)
- _enhance_prompt_batch(): multimodal Gemma3 enhancement with raw PIL images
- _standardize_image_input(): MultiImageBatch flattening for single-condition model
- Registry entry 'ltx2_i2av' and example YAML config

Made-with: Cursor

* [adapter] fix: align MultiImageBatch/MultiVideoBatch type annotations across adapters

Wan2 I2V and Wan2 V2V had correct runtime logic (is_multi_image_batch check
in _standardize_*_input) but inference() and encode_image() annotations used
ImageBatch/VideoBatch instead of MultiImageBatch/MultiVideoBatch.

Made-with: Cursor

* [docs] update: add nested batch convention to adapter docs

- adapter_conventions.md: document that inference() receives MultiImageBatch/
  MultiVideoBatch from collator; single-condition adapters must flatten via
  _standardize_*_input; append Gotcha #5; add cross-refs
- ff-new-model/SKILL.md: extend Pitfall #6 with single-condition handling
  guidance and back-ref to adapter_conventions.md Gotcha #5

Made-with: Cursor

* [style] fix: apply PR review fixes — logging, section-divider policy, unique_id

- Replace bare print() with logger.warning() in formatting.py
- Update section-divider rule: allow between methods, forbid inside functions
- Convert in-function decorative dividers to plain numbered comments
- Include negative_prompt in unique_id hash via _hash_id_fields() refactor
- Fix compute_unique_id default to 8 bytes (fits torch.int64)

Made-with: Cursor

* [reward,models/ltx2] fix: CLAP BatchNorm dtype + I2AV forward bugs

- CLAP: load model in float32 (BatchNorm requires it); fix audios->audio deprecation
- I2AV: remove conditioning_mask from _shared_fields to preserve batch dim
- I2AV: ensure next_latents in return_kwargs for frame-slicing logic
- Remove dtype: bfloat16 from CLAP reward config in example YAMLs

Made-with: Cursor

* Update config

* Update config

* Update config

* Fix dtype

* Update

* [diffusers] fix: cherry-pick flash_3_varlen_hub mask dtype fix

Cherry-pick commit 0f8a83fa6 from diffusers to support additive
(bfloat16) attention masks in _flash_attention_3_varlen_hub. This
fixes the ValueError when LTX2 connectors pass non-bool masks.

Made-with: Cursor

* [data_utils] perf: eliminate redundant preprocessed-dataset writes

Each rank's preprocessed Arrow shard is now written exactly once: the
orchestrator routes Dataset.map output directly to the final per-rank
location via cache_file_name= (under {merged_cache_path}.tmp/_parts/),
and the consolidator writes only state.json + dataset_info.json before
atomically renaming .tmp -> merged_cache_path. No row data is re-copied
during the merge, no duplicate cache lands under ~/.cache/huggingface,
and the build-dir sentinel _build_meta.json enables crash recovery for
unchanged num_shards while wiping cleanly on num_shards changes.

Single-process and distributed paths are unified through the same flow
(N=1 case for single-process); enable_preprocess=False bypasses the
consolidate pipeline entirely to preserve pre-refactor behavior. I/O
budget per cache build drops from ~4*N*S to ~N*S bytes touched.
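
The write-once-then-rename flow can be sketched with the standard library. JSON files stand in for Arrow shards, and the function and file names other than `_parts/` and `state.json` are hypothetical.

```python
import json
import os
import tempfile


def build_cache_atomically(final_path, shards):
    """Write shards under <final>.tmp/_parts/, then rename into place."""
    tmp_path = final_path + ".tmp"
    parts_dir = os.path.join(tmp_path, "_parts")
    os.makedirs(parts_dir, exist_ok=True)
    for rank, rows in enumerate(shards):
        # Each rank's shard is written exactly once, at its final relative location.
        with open(os.path.join(parts_dir, f"rank_{rank}.json"), "w") as f:
            json.dump(rows, f)
    # The consolidation step only adds small metadata files; no rows are copied.
    with open(os.path.join(tmp_path, "state.json"), "w") as f:
        json.dump({"num_shards": len(shards)}, f)
    os.rename(tmp_path, final_path)
    return final_path


root = tempfile.mkdtemp()
final = build_cache_atomically(os.path.join(root, "merged"), [[1, 2], [3]])
parts = sorted(os.listdir(os.path.join(final, "_parts")))
```

A reader either sees no cache directory or a complete one, since the rename is the only step that makes the data visible at its final path.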

Made-with: Cursor

* [docs] update: comprehensive audio + R7 no-op-default encoder docs sweep

Bring agent docs and the developer guides on feat/ltx2-audio-video-support
in line with the R6/R7 BaseAdapter contract changes (PR #129) AND extend
every audio-aware section to cover the new modality.

Critical contract fixes (mirror Wave A on data-utils-perf so the 0-hit
verification gate passes on this branch independently of merge order):
- constraints.md #12: 7 abstract methods -> 4 (load_pipeline,
  decode_latents, forward, inference); new "Optional encoder overrides
  (no-op default)" subsection naming all 4 encoders incl. encode_audio;
  preprocess_func note now explains the audios dispatch and "skip when
  None" semantics. Trailing "Adapter hierarchy" paragraph also fixed
  ("7-method contract" -> "4-abstract-method contract" with explanation).
- ff-new-model/SKILL.md: frontmatter description fixed; Phase-1 step-3
  mapping gains audio encoder/VAE row; Phase-2 step-3 implementation
  table reorders so the 4 truly-abstract methods are on top, marks all
  4 encoders as Abstract? No (no-op default; override if your model
  consumes this modality), adds the encode_audio row; Pitfall #6
  extended to include audios + the []-for-empty / never-unwrap contract
  (preserved the existing _standardize_*_input guidance and added a
  cross-ref to adapter_conventions.md Gotcha #6).
- guidance/new_model.md: Step 4 narrative replaced "three encoding
  methods" with "Override the encoders your model consumes"; updated
  the dispatcher pseudocode to enumerate all 4 modalities with the
  "if encoded is not None" skip; new #### encode_audio block parallel
  to encode_video (MultiAudioBatch signature, default return None);
  Step-5 inference signature comment adds "audios: Optional[MultiAudioBatch]"
  (commented as opt-in); checklist gains explicit encode_video() /
  encode_audio() items framed as override-only.

Audio-aware sweep (Wave-B-only, lives here because LTX-2 actually
consumes audio):
- adapter_conventions.md: Batch Dimension Convention extended to
  include audios/MultiAudioBatch on inference(); new bullet codifying
  the multi-media batch homogeneity guarantee; new Gotcha #6 for the
  []-for-empty / no-unwrap rule (applies symmetrically to images,
  videos, audios).
- guidance/new_model.md Data Format Conventions: new ### Audio table
  with audios/condition_audios/audio_features rows and a callout
  pointing to flow_factory.utils.audio.MultiAudioBatch; cross-cutting
  batch-boundary callout extended to include encode_audio().
- guidance/workflow.md Stage 1: goal narrative + Input row extended
  to include "audio files"; new audio-symmetry callout after the
  Flux.2 example explaining audio_dir is the third optional input
  handled by _preprocess_batch.
- architecture.md: Stage-1 ASCII box now lists audio + audio_features;
  Adapter Pattern subsection gains a one-liner that all 4 encoders are
  no-op by default (override only the modalities your model consumes).
- ff-develop/SKILL.md sec.2 Adapter Hierarchy: appended the R7 design
  lesson bullet (non-abstract no-op default + opt-in override over
  @abstractmethod for new modalities; the 4 abstract methods are
  intentionally minimal).
- fix_patterns.md Recorded Fix Patterns: replaced "(No records yet)"
  with two full entries using the documented template — R6 multi-modal
  batch homogeneity and R7 non-abstract encoder defaults; both link
  back to the actual code locations.

Pure docs change; no Python touched. Verified: zero hits across
.agents/ + guidance/ for "7 abstract methods", "7-method contract",
"three encoding methods", "Implement the three encoding"; encode_audio
+ MultiAudioBatch present in every doc per the matrix; ReadLints clean
on all 8 touched files.

Made-with: Cursor

* [diffusers] sync: bump submodule to upstream main 77f8cf8bf

Drops the locally cherry-picked commit 620286eb5 ("support ltx-2 type
masking in flash_3_hub_varlen") and resets the submodule to the
official huggingface/diffusers main HEAD as of 2026-04-18.

Functional consequence: _flash_attention_3_varlen_hub no longer casts
non-bool attn_mask via `attn_mask > -1`. Upstream already carries the
`isinstance(result, tuple)` defensive unpack, so only the bool-cast is
missing. Waiting for upstream to land an equivalent fix; until then,
LTX2 paths that pass a non-bool attn_mask to flash-attn-3 varlen-hub
may misbehave inside _normalize_attn_mask.

Made-with: Cursor

* [utils] feat: add move_tensors_to_device recursive helper

Adds a shape-agnostic device-move utility in utils/base.py that walks
list / tuple / dict containers depth-first, copying torch.Tensor leaves
to the target device. Non-tensor leaves (PIL, str, int, np.ndarray)
pass through unchanged. Containers are reconstructed immutably; the
input is not modified.

Signature:
  move_tensors_to_device(value, device, max_depth=None)

The optional max_depth bounds recursion (None = unbounded; 0 = only
move when value itself is a Tensor; N = walk N levels).

Designed for the upcoming reward path device adaptation, but kept as
a general utility so future callers (e.g., a future BaseSample.to
refactor delegating with max_depth=1) can reuse it.

Pure addition; no consumers in this commit.
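
A minimal sketch of the recursion described above, not the code in utils/base.py. To stay runnable without PyTorch it uses a hypothetical `FakeTensor` stand-in and an extra `tensor_type` parameter; neither is part of the signature stated in this commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTensor:
    """Stand-in for torch.Tensor (assumption) so the sketch runs without PyTorch."""

    name: str
    device: str = "cuda:0"

    def to(self, device):
        # Mirrors Tensor.to: a same-device move returns the object itself.
        return self if device == self.device else FakeTensor(self.name, device)


def move_tensors_to_device(value, device, max_depth=None, tensor_type=FakeTensor):
    # A tensor is always moved, whatever depth budget remains.
    if isinstance(value, tensor_type):
        return value.to(device)
    # Budget exhausted: return the container (or leaf) untouched.
    if max_depth is not None and max_depth <= 0:
        return value
    child_depth = None if max_depth is None else max_depth - 1
    if isinstance(value, dict):
        return {
            key: move_tensors_to_device(item, device, child_depth, tensor_type)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        moved = [
            move_tensors_to_device(item, device, child_depth, tensor_type)
            for item in value
        ]
        # Rebuild a new container so the input is never mutated.
        return moved if isinstance(value, list) else tuple(moved)
    # Non-tensor leaves (str, int, PIL images, ...) pass through unchanged.
    return value


batch = {"image": [FakeTensor("img0")], "prompt": ["a cat"], "latents": FakeTensor("lat")}
on_cpu = move_tensors_to_device(batch, "cpu")
shallow = move_tensors_to_device(batch, "cpu", max_depth=1)
```

With `max_depth=1` only the dict's direct values are visited, so `latents` moves while the tensor nested inside the `image` list stays put. Named tuples and other container subclasses are not handled in this sketch.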

Made-with: Cursor

* [reward] refactor: route reward inputs through move_tensors_to_device

Inserts a move_tensors_to_device call between _convert_media_format
and model(**batch_input) in three reward computation sites:
  - _compute_pointwise_batch
  - _compute_groupwise_group
  - _compute_groupwise_local (inner per-group loop)

The recursive helper walks list/tuple/dict containers and copies tensor
leaves to model.device. The local batch_input dict is reconstructed;
sample objects are NOT mutated.

Behavior with current GPU-resident samples: same-device .to() is a no-op,
so reward outputs remain bit-identical. The change is defensive prep for
the upcoming sample-loop CPU offload (commit 6) where samples will arrive
on CPU and reward models still run on their declared device.

The distributed groupwise path (_compute_groupwise_distributed) needs no
change: it already passes device=self.accelerator.device to gather_samples,
so its inputs are GPU-resident regardless of caller-side device.

Made-with: Cursor

* [hparams,trainer] feat: add offload_samples_to_cpu config and BaseTrainer helper

Adds the configuration switch and the producer-side helper for the
upcoming sample CPU-offload + lazy-reload pipeline.

hparams/training_args.py:
  TrainingArguments gains a new field
    offload_samples_to_cpu: bool = False
  placed next to enable_gradient_checkpointing (sibling memory switch).
  The help string documents the trade-off (D2H per sample + per-reward
  H2D ~100ms/epoch vs sample/optimize GPU peak reduction).

trainers/abc.py:
  BaseTrainer gains _maybe_offload_samples_to_cpu(samples), a non-
  abstract helper that no-ops when the config is False and otherwise
  walks the sample list calling BaseSample.to('cpu'). The docstring
  records the ordering invariant required by the consumer trainers
  (must be called BEFORE reward_buffer.add_samples) and points to
  RewardProcessor's move_tensors_to_device for the consumer-side H2D.

No call sites yet -- behaviour is unchanged in this commit. The helper
is wired into the five trainers' sample() loops in commit 6.

Made-with: Cursor

* [trainer] refactor: lazy per-batch reload in GRPO/GRPO-Guard/DPO optimize()

Replaces the eager pre-stacked sample_batches list with a single per-batch
loop that lazily reconstructs each micro-batch:

  for batch_idx in range(num_batches):
      batch_samples = [sample.to(device) for sample in shuffled_samples[...]]
      batch = BaseSample.stack(batch_samples)
      ...

Affected sites:
  - trainers/grpo.py GRPOTrainer.optimize()
  - trainers/grpo.py GRPOGuardTrainer.optimize()
  - trainers/dpo.py DPOTrainer.optimize() (chosen_samples / rejected_samples
    extraction inside the per-pair-batch loop)

Behaviour-preserving for the current GPU-resident sample buffer: every
sample.to(device) is a same-device no-op, so loss values, gradients, and
optimizer steps remain bit-identical to HEAD~1.

The change is the consumer-side prerequisite for the upcoming sample-loop
CPU offload (commit 6): once samples may be CPU-resident, this lazy reload
is what keeps optimize()'s GPU footprint bounded by a single micro-batch
instead of the full epoch.

NFT/AWM are deliberately not touched in this commit -- their optimize()
has an extra eager precompute layer that requires structural restructuring
(commit 5).
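
The lazy reload pattern above can be sketched as a generator that materialises one micro-batch at a time. `FakeSample` and `iter_micro_batches` are hypothetical names; the sketch assumes `.to()` returns a moved copy and omits the `BaseSample.stack` step.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FakeSample:
    """Stand-in for a sample object (assumption); only tracks its device."""

    idx: int
    device: str = "cpu"

    def to(self, device):
        # Same-device move is a no-op, matching the behaviour-preserving claim.
        return self if device == self.device else replace(self, device=device)


def iter_micro_batches(samples, batch_size, device):
    # Only the slice for the current micro-batch is moved to `device`;
    # earlier batches become garbage once the caller drops them.
    for start in range(0, len(samples), batch_size):
        yield [sample.to(device) for sample in samples[start:start + batch_size]]


cpu_samples = [FakeSample(i) for i in range(10)]
batch_sizes = []
batch_devices = set()
for batch in iter_micro_batches(cpu_samples, batch_size=4, device="cuda:0"):
    batch_sizes.append(len(batch))
    batch_devices.update(sample.device for sample in batch)
```

At any point only one list of at most `batch_size` device-resident samples is alive, which is the bound the commit message describes.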

Made-with: Cursor

* [trainer] refactor: NFT/AWM optimize() per-batch precompute interleave

Restructures NFT and AWM optimize() from the previous double-pass design
(eager precompute over ALL batches under sampling_context, then training
over ALL batches under current params) to a single-pass per-batch
interleave that matches the official DiffusionNFT and AWM implementations:

  for each micro-batch:
    1. lazy reload sample tensors to GPU and stack into a batch dict
    2. precompute under sampling policy:
         adapter.rollout()
         with sampling_context():
             compute (_all_timesteps, _all_random_noise, _old_v_pred_list
                      or _old_log_probs) for THIS batch only
    3. train under current policy:
         adapter.train()
         with self.autocast():
             for t_idx in range(num_train_timesteps):
                 forward / loss / backward / optimizer step

Memory savings: only the current batch's _all_random_noise plus
_old_v_pred_list (NFT) or _old_log_probs (AWM) lives on GPU at any time.
The previous design held all num_batches_per_epoch batches' precompute
output simultaneously, costing ~5+ GB on FLUX1 1024^2 LoRA at B=4 / T=40
and tens of GB on Wan video models (often the OOM trigger).

Train-inference consistency (philosophy #1; see
.agents/knowledge/topics/train_inference_consistency.md item #4):

  - Rollout (sample()/adapter.inference()) is unchanged.
  - EMA params are loaded via sampling_context() and restored before each
    batch's training forward, identical to the per-batch behavior of the
    previous design.
  - ema_step() runs only once per outer epoch in start(), so every batch
    within an optimize() call sees the SAME EMA snapshot regardless of
    interleave timing -> per-batch and eager designs are equivalent on
    the EMA invariant.

Note on RNG (regression test guidance):

  randn_tensor for batch K is now called after batch K-1's backward step
  (vs all noises sampled upfront). The CUDA RNG consumption order
  changes; under the same seed, the per-batch noise sequences are NOT
  bit-identical to the eager design. The algorithm is unchanged (noise
  is augmentation; equivalent in expectation). Regression tests should
  use statistical metrics (loss mean / reward trend across an epoch),
  NOT a numeric diff of loss values, when comparing against HEAD~1.
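
A toy illustration of the RNG-order effect, using Python's `random` module in place of the CUDA generator. It assumes the training step itself draws from the same generator (for example for dropout or stochastic sampling); that assumption is what makes the consumption order differ.

```python
import random


def eager_noise(seed, num_batches):
    rng = random.Random(seed)
    # Eager design: all per-batch noise is drawn upfront ...
    noises = [rng.normalvariate(0.0, 1.0) for _ in range(num_batches)]
    # ... and training consumes the generator afterwards.
    for _ in range(num_batches):
        rng.random()  # stand-in for RNG use inside a training step (assumption)
    return noises


def interleaved_noise(seed, num_batches):
    rng = random.Random(seed)
    noises = []
    for _ in range(num_batches):
        noises.append(rng.normalvariate(0.0, 1.0))  # noise for this batch only
        rng.random()  # training step runs before the next batch's noise
    return noises


eager = eager_noise(seed=0, num_batches=4)
interleaved = interleaved_noise(seed=0, num_batches=4)
```

The first batch's noise matches in both designs, while later batches diverge under the same seed, which is why a numeric diff of losses is not a meaningful regression check here.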

Sample-level lazy reload (`[sample.to(device) for sample in slice]`) is
folded into this same restructure -- NFT/AWM now share the same lazy
reload pattern as GRPO/DPO from commit 4.

The KL paths (use_ref_parameters / use_ema_parameters per timestep) and
all loss / backward / optimizer logic are unchanged in body and order.

Made-with: Cursor

* [trainer] feat: wire sample() loop CPU offload across all trainers

Inserts self._maybe_offload_samples_to_cpu(sample_batch) into every
trainer's sample() loop, immediately after adapter.inference() and
BEFORE both samples.extend() and reward_buffer.add_samples():

  sample_batch = self.adapter.inference(...)
  self._maybe_offload_samples_to_cpu(sample_batch)  # synchronous D2H
  samples.extend(sample_batch)
  self.reward_buffer.add_samples(sample_batch)

Affected sites (5 total):
  - GRPOTrainer.sample()        (trainers/grpo.py)
  - GRPOGuardTrainer.sample()   (trainers/grpo.py)
  - DiffusionNFTTrainer.sample() (trainers/nft.py)
  - AWMTrainer.sample()         (trainers/awm.py)
  - DPOTrainer.sample()         (trainers/dpo.py)

Why BEFORE add_samples (not after):
  reward_buffer.add_samples() in the async-reward path records a CUDA
  sync_event and dispatches workers that read sample.image / sample.video
  / etc. Calling the offload BEFORE add_samples guarantees the recorded
  event captures "D2H complete + data ready on CPU"; workers wait on the
  event and then deterministically see CPU-resident samples. Inverse
  order would race the worker thread's getattr against the main thread's
  in-place setattr that BaseSample.to('cpu') performs.

Behaviour gating:
  The helper short-circuits when training_args.offload_samples_to_cpu is
  False (the default), so the entire pipeline is wired but inert. Setting
  the flag to True in any trainer YAML now activates the producer side
  (D2H here), and the previously-landed pieces handle the consumer side:
    * commit 2: reward_processor moves the CPU input dict to model.device
                via move_tensors_to_device.
    * commits 4 & 5: optimize() loops lazily reload [sample.to(device)
                     for sample in slice] per micro-batch.

End-to-end VRAM saving (roughly num_batches_per_epoch x per_batch_size
of sample tensors) is unlocked for the first time in this commit. Wan
video YAMLs that opt in (commit 7) will exercise it.

Evaluate paths are intentionally NOT touched -- eval samples are usually
small and one-shot, and eval logging may rely on tensors being on the
adapter device.

Made-with: Cursor

* [examples] feat: enable offload_samples_to_cpu for Wan video models

Adds `offload_samples_to_cpu: true` to all 13 Wan video example configs
(GRPO LoRA / Full and NFT LoRA / Full across Wan2.1 and Wan2.2, T2V /
I2V / V2V variants). Inserted next to `enable_gradient_checkpointing`
so the two memory switches sit together.

Why required for video models:
  per-sample tensors (all_latents, condition videos, image_embeds, ...)
  are GB-scale on Wan; without the offload, sample()/optimize() OOMs as
  soon as num_batches_per_epoch > 1. The plumbing wired in commits 1-6
  is now actually exercised on these configs.

Files (13 total):
  examples/grpo/lora/wan21_t2v.yaml
  examples/grpo/lora/wan21_i2v.yaml
  examples/grpo/lora/wan21_v2v.yaml
  examples/grpo/lora/wan22_t2v.yaml
  examples/grpo/lora/wan22_i2v.yaml
  examples/grpo/full/wan21_t2v.yaml
  examples/grpo/full/wan21_i2v.yaml
  examples/grpo/full/wan22_t2v.yaml
  examples/grpo/full/wan22_i2v.yaml
  examples/nft/lora/wan21_t2v.yaml
  examples/nft/lora/wan21_i2v.yaml
  examples/nft/lora/wan22_t2v.yaml
  examples/nft/full/wan22_t2v.yaml

Non-video model YAMLs are intentionally not touched in this commit;
moderate-VRAM-pressure image models (Flux2, Qwen-Image-Edit-Plus) get
an explicit `false` + pros/cons comment in commit 8 so users see the
option as a documented decision point, while small/standard image
models (FLUX1 / SD3 / Qwen-Image / Z-Image / DPO / etc.) rely on the
code default `False` to avoid YAML noise.

Made-with: Cursor

* [examples] docs: expose offload_samples_to_cpu option in Flux2 and Qwen-Image-Edit-Plus configs

Adds an explicit `offload_samples_to_cpu: false` to 13 example configs
(11 Flux2 variants + 2 Qwen-Image-Edit-Plus) preceded by a multi-line
comment that documents the parameter, its pros/cons, and the conditions
under which a user should flip it to true:

  # offload_samples_to_cpu: CPU-offload sample tensor fields between
  #   sample() and optimize() to reduce GPU peak memory.
  # Pros (true): saves N x per_batch_size GPU memory ... no correctness
  #   or convergence impact.
  # Cons (true): adds ~100ms/epoch H2D in reward path; tiny per-batch
  #   H2D in optimize (<5ms each).
  # Recommended (true) for higher resolutions, larger batch sizes, or
  #   any sample()/optimize() OOM. Default false works for current
  #   example settings.
  offload_samples_to_cpu: false

Tier rationale (3-tier YAML strategy, deviating intentionally from the
strict ALL-YAML rule in .cursor/rules/examples-yaml-sync.mdc):

  T1 (commit 7, Wan video, 13 YAMLs): explicit `true` -- required to
                                      avoid OOM.
  T2 (this commit, Flux2 + Qwen-Edit, 13 YAMLs): explicit `false` +
                                      pros/cons comment -- moderate
                                      VRAM pressure, decision left to
                                      the user with documentation right
                                      next to it.
  T3 (untouched, 23 YAMLs of FLUX1 / SD3 / Qwen-Image / Z-Image / DPO /
      AWM-non-Flux2 / template): no field added, sane defaults via the
                                 code-level default. The field is
                                 documented in the upcoming
                                 topics/sample_lifecycle.md (commit 9).

This three-tier policy keeps T3 YAMLs noise-free while making T1/T2
both behaviour-correct (T1) and discoverable (T2).

Files (13 total):
  examples/grpo/lora/flux2_t2i.yaml
  examples/grpo/lora/flux2_i2i.yaml
  examples/grpo/lora/flux2_klein.yaml
  examples/grpo/lora/flux2_klein_base.yaml
  examples/grpo/full/flux2_t2i.yaml
  examples/grpo/full/flux2_i2i.yaml
  examples/grpo/full/flux2_klein.yaml
  examples/grpo/full/flux2_klein_base.yaml
  examples/awm/lora/flux2_klein_base.yaml
  examples/nft/lora/flux2_klein_base.yaml
  examples/nft/full/flux2_klein_base.yaml
  examples/grpo/lora/qwen_image_edit_plus.yaml
  examples/grpo/full/qwen_image_edit_plus.yaml

Made-with: Cursor

* [docs] feat: add sample lifecycle topic and README routing

New leaf doc .agents/knowledge/topics/sample_lifecycle.md (per
.cursor/rules/agents-docs-maintenance.mdc), covering:
  - default sample lifecycle with the offload pipeline
  - the offload_samples_to_cpu switch and its effect at each stage
  - the 3-tier example YAML adoption matrix (T1 Wan video / T2 Flux2 +
    Qwen-Image-Edit-Plus / T3 the rest), including the rationale for
    intentionally deviating from the strict ALL-YAML rule in
    .cursor/rules/examples-yaml-sync.mdc
  - reward path device responsibility (move_tensors_to_device contract)
  - async-reward race-free argument for offload-before-add_samples order
  - NFT/AWM per-batch precompute interleave summary (memory savings,
    train-inference consistency, RNG-order caveat)
  - extra_kwargs device asymmetry caveat
    (rewards on CPU, advantage on GPU, neither moved by BaseSample.to)
  - Cross-refs to constraints #11, #14, #15, train_inference_consistency
    item #4, dtype_precision

README.md routing table gains a corresponding row pointing at the new
topic with explicit triggers (sample/optimize data flow changes,
debugging sample/optimize OOM, adding high-resolution / video example
configs). The new topic is the authoritative reference for the
offload_samples_to_cpu switch -- T3 YAMLs do not list the field, so
users discover it through this routing entry.

This is the documentation closing the 9-commit refactor of the sample
CPU offload + lazy reload pipeline (commits 1 through 8). No code
changes in this commit.

Made-with: Cursor

* [trainer] docs: simplify NFT/AWM optimize() docstrings

The previous docstrings (introduced in commit 49e7b49) inlined the full
memory analysis, train-inference consistency proof, and RNG-order caveat
-- ~30 lines each. All of that material lives in the authoritative
.agents/knowledge/topics/sample_lifecycle.md (commit 8dc4a78), so the
docstrings now keep only the essential "what":

  - one-line summary
  - per-batch interleave shape (3-step pipeline)
  - AWM-specific note on decoupled sampling/training timesteps
  - pointer to topics/sample_lifecycle.md for "why"

Net change: -33 lines across the two files; bodies unchanged.

Made-with: Cursor

* [ltx2] fix: handle batched per-sample timestep in adapter forward()

`t.expand(batch_size * 2)` raised RuntimeError when forward() was called
during training with `t` of shape (B,) (distinct per-sample timesteps
from `batch['timesteps'][:, ti]`). `expand` cannot stretch a non-singleton
dim from B to 2B.

Normalize `t` to (B,) at function entry with fail-fast shape validation,
then use `torch.cat([t, t])` for the CFG-doubled batch (matching the
`torch.cat([lat, lat])` ordering of `[t0..tB-1, t0..tB-1]`). Accepts
0-D scalar (inference), (1,) singleton, and (B,) per-sample inputs;
other shapes raise an informative ValueError.

Applied identically to LTX2_T2AV_Adapter and LTX2_I2AV_Adapter.
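
The shape handling can be sketched with plain Python lists standing in for tensors, so the example runs without PyTorch. `normalize_timestep` and `double_for_cfg` are hypothetical helper names; in the adapter the doubling is `torch.cat([t, t])`.

```python
def normalize_timestep(t, batch_size):
    """Return a length-B list from a scalar, a 1-element, or a B-element input."""
    if isinstance(t, (int, float)):
        return [float(t)] * batch_size            # 0-D scalar (inference)
    values = [float(x) for x in t]
    if len(values) == 1:
        return values * batch_size                # (1,) singleton, broadcast to B
    if len(values) == batch_size:
        return values                             # (B,) per-sample timesteps
    raise ValueError(
        f"timestep must be a scalar, length 1, or length {batch_size}; got length {len(values)}"
    )


def double_for_cfg(t):
    # List concatenation is the analogue of torch.cat([t, t]):
    # the result is ordered [t0..tB-1, t0..tB-1], matching the doubled latents.
    return t + t


per_sample = normalize_timestep([900.0, 500.0, 100.0], batch_size=3)
doubled = double_for_cfg(per_sample)
```

Broadcasting alone covers the scalar and singleton cases; the per-sample case needs real concatenation because a length-B dimension cannot be stretched to 2B.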

Made-with: Cursor

* [data_utils] fix: stabilize preprocess cache key via deep signature collection

compute_cache_path previously hashed ALL preprocess_kwargs — including
training-infrastructure fields like num_batches_per_epoch and
gradient_accumulation_steps that leak through **training_args unpacking
and filter_kwargs pass-through. These fields have non-deterministic
values across launches (world-size-derived, __post_init__ timing, etc.),
causing merged_cache_path to differ even with identical YAML and
force_reprocess=False.

Fix: add _select_cache_relevant_kwargs() which uses "deep signature
collection" — it inspects the named parameters of preprocess_func AND
(when preprocess_func accepts **kwargs and is a bound adapter method)
the named parameters of all encode_* forwarding targets on the same
adapter instance. Only kwargs whose key appears in this union are
included in kwargs_hash. Training-only fields that no encoder declares
are excluded.

Safety: over-hash (hashing an encoder param that doesn't run at
runtime) is harmless; under-hash (missing a param that affects output)
is prevented by collecting from all four encoder methods regardless of
runtime data presence.
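
A sketch of the deep-signature idea using only `inspect` and `hashlib`. `ToyAdapter`, its parameter names, and the hashing details (JSON plus SHA-256) are assumptions made for illustration; the real `_select_cache_relevant_kwargs()` may differ.

```python
import hashlib
import inspect
import json


class ToyAdapter:
    """Hypothetical adapter; method and parameter names are illustrative only."""

    def encode_prompt(self, prompt, max_sequence_length=512): ...
    def encode_image(self, images, height=None, width=None): ...
    def encode_video(self, videos, num_frames=None): ...
    def encode_audio(self, audios, sample_rate=None): ...
    def preprocess_func(self, prompt=None, images=None, **kwargs): ...


def _named_params(func):
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    params = inspect.signature(func).parameters.values()
    return {p.name for p in params if p.kind not in variadic}


def _accepts_var_kwargs(func):
    params = inspect.signature(func).parameters.values()
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params)


def select_cache_relevant_kwargs(preprocess_func, kwargs):
    relevant = _named_params(preprocess_func)
    owner = getattr(preprocess_func, "__self__", None)  # set for bound methods
    if owner is not None and _accepts_var_kwargs(preprocess_func):
        # Collect from every encode_* method, whether or not it runs at runtime.
        for name in dir(owner):
            if name.startswith("encode_") and callable(getattr(owner, name)):
                relevant |= _named_params(getattr(owner, name))
    return {key: value for key, value in kwargs.items() if key in relevant}


def kwargs_hash(kwargs):
    payload = json.dumps(kwargs, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


adapter = ToyAdapter()
launch_a = {"max_sequence_length": 256, "height": 512, "num_batches_per_epoch": 8}
launch_b = {"max_sequence_length": 256, "height": 512, "num_batches_per_epoch": 16}
launch_c = {"max_sequence_length": 256, "height": 1024, "num_batches_per_epoch": 8}
hash_a, hash_b, hash_c = (
    kwargs_hash(select_cache_relevant_kwargs(adapter.preprocess_func, kw))
    for kw in (launch_a, launch_b, launch_c)
)
```

Launches A and B differ only in a training-infrastructure field, so they share a cache key; launch C changes an encoder parameter and gets a new one.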

Made-with: Cursor

* [examples] switch LoRA configs from DDP to DeepSpeed ZeRO-2

DeepSpeed ZeRO-2 provides optimizer-state sharding with negligible
overhead, making plain multi_gpu (DDP) redundant for LoRA training.

Made-with: Cursor

* [examples] docs: note LTX-2.3-Diffusers as an option in LTX2 configs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [examples] fix: enable offload_samples_to_cpu for LTX2 video configs

Matches Wan video-model configs. Without this, per-sample audio+video
tensors are GB-scale and sample()/optimize() OOMs at
num_batches_per_epoch > 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [examples] docs: rewrite stale Qwen attn_backend comment for LTX2

Replace misleading "for Qwen-Image Series" comment with a
model-agnostic description of available backend options.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [models/ltx2,rewards] style: apply black and isort to new PR files

Cosmetic-only changes (import reorder, string-quote normalization,
long-line wrapping). Scoped to the 5 files this PR adds; existing
unclean files on main are out of scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix dtype source

* Update submodule

* [models,docs] refactor: align CFG handling across all adapters with forward-stage warning

Ensure every CFG-capable adapter follows a consistent two-stage pattern:

1. encode_prompt: derive do_classifier_free_guidance from guidance_scale
   (>1.0 standard, >0.0 for Z-Image); default negative_prompt to "" when None.
2. forward: if guidance_scale > threshold but negative_prompt_embeds is None,
   emit logger.warning and gracefully fall back to the no-CFG path.

Adapter-specific changes:
- flux2_klein: extract do_classifier_free_guidance variable in _forward
- sd3_5: add forward warning; migrate to setup_logger; drop unused import
- z_image: add forward warning (threshold > 0.0)
- wan2_t2v/v2v/i2v: add forward warning; unify negative_prompt expansion
- qwen_image: add guidance_scale to encode_prompt; add forward warning
- qwen_image_edit_plus: add guidance_scale to encode_prompt; rename
  true_cfg_scale/do_true_cfg to guidance_scale/do_classifier_free_guidance;
  add _forward warning
- ltx2_t2av/i2av: add forward warning for multi-guidance (video + audio)

Document the CFG convention in .agents/knowledge/topics/adapter_conventions.md
with reference implementation, model-specific extensions table, and gotcha #7.
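
The two-stage convention can be sketched as follows. The function bodies, the `cfg_threshold` parameter, and the string placeholders for embeddings are assumptions; only the thresholds (1.0 standard, 0.0 for Z-Image), the `""` default, and the warn-then-fall-back behaviour come from the description above.

```python
import logging

logger = logging.getLogger("cfg_sketch")


def encode_prompt(prompt, negative_prompt=None, guidance_scale=1.0, cfg_threshold=1.0):
    # Stage 1: derive the CFG switch from guidance_scale alone.
    do_classifier_free_guidance = guidance_scale > cfg_threshold
    encoded = {"prompt_embeds": f"emb({prompt!r})"}  # placeholder for real embeddings
    if do_classifier_free_guidance:
        # Default the negative prompt to "" when the caller passes None.
        negative_prompt = "" if negative_prompt is None else negative_prompt
        encoded["negative_prompt_embeds"] = f"emb({negative_prompt!r})"
    return encoded


def forward(prompt_embeds, negative_prompt_embeds=None, guidance_scale=1.0, cfg_threshold=1.0):
    # Stage 2: CFG requested but no negative embeddings -> warn and fall back.
    do_classifier_free_guidance = guidance_scale > cfg_threshold
    if do_classifier_free_guidance and negative_prompt_embeds is None:
        logger.warning(
            "guidance_scale=%s requests CFG but negative_prompt_embeds is None; "
            "falling back to the no-CFG path.",
            guidance_scale,
        )
        do_classifier_free_guidance = False
    return "cfg" if do_classifier_free_guidance else "no_cfg"


encoded = encode_prompt("a cat", guidance_scale=4.0)
with_negatives = forward(**encoded, guidance_scale=4.0)
missing_negatives = forward(encoded["prompt_embeds"], None, guidance_scale=4.0)
z_image_style = forward("e", "n", guidance_scale=0.5, cfg_threshold=0.0)
```

Keeping the threshold as a parameter is one way to express the Z-Image variant (`> 0.0`) without duplicating the logic.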

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [examples,docs,agents] refactor: restructure examples, add LTX-2 docs, sync agent knowledge

- Restructure examples/ to algorithm/ft/model/variant.yaml with examples/README.md
- Add LTX-2/2.3 to README (News, model table, install note)
- Add .scratch/ constraint for agent temp files (#28), examples convention (#29)
- Sync agent knowledge: GroupDistributedSampler in samplers.md, LTX2 + RationalRewards in architecture.md
- Clean up .docs/ltx2-research/ dev artifacts
- Update LTX2 configs: guidance_scale=1.0, comment out attn_backend

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jayce-Ping added a commit that referenced this pull request Jun 14, 2026
Resync .agents/, .cursor/, guidance/, AGENTS.md and CLAUDE.md with the
current code after plugin growth (9 trainers, 14 model adapters, 13 reward
models). Fixes registry drift, wrong config/API facts and broken
cross-references found in a full audit.

- architecture.md/AGENTS.md: add diffusion-opd trainer, clap/imagebind/
  geneval rewards, Bagel/LTX2 models; fix RationalRewards* class names
- constraints.md: evaluate() is concrete (not abstract); index #28-29;
  paradigm (#7) and training-args (#16) lists; de-numbered line refs
- philosophy.md: Accelerate (DDP/DeepSpeed ZeRO-1-2/FSDP) backend; fix #27 ref
- guidance: scheduler.* config keys, real sample()/compute_advantages
  snippets, GenEval metadata convention, audio reward param, Bagel link
- skills: model_name_or_path, default_target_modules, data.datasets,
  rewards-as-list, 9 trainers; CLAUDE.md imports AGENTS.md to avoid drift
- topics/samplers.md: correct _resolve_sampler_type + AdvantageProcessor
  group_distributed paths; parity_testing set_scheduler_timesteps
- hparams/model_args.py: model_type Literal now matches registry keys

Co-authored-by: Cursor <cursoragent@cursor.com>
Jayce-Ping added a commit to Jayce-Ping/Flow-Factory-Private that referenced this pull request Jul 2, 2026
* Move `samples`

* Update trajectory collector

* update extra_callbacks selective

* Remove undefined log_prob

* Fix grpo guard
Jayce-Ping added a commit to Jayce-Ping/Flow-Factory-Private that referenced this pull request Jul 2, 2026
… support (X-GenGroup#118)

* docs: comprehensive analysis of samples, rewards, and video adapter architecture

Add detailed documentation covering:
- Complete BaseSample dataclass fields and sample type hierarchy (T2ISample, T2VSample, etc.)
- RewardModelOutput and reward model abstract classes (PointwiseRewardModel, GroupwiseRewardModel)
- WAN2 text-to-video adapter implementation and video generation pipeline
- Sample canonicalization and unique_id computation via SHA256 hashing
- Video format specifications (Tensor(T, C, H, W) with frame/dimension constraints)
- Reward model input/output modality handling and tensor input configuration
- Data flow examples from generation through sampling to reward computation
- Quick reference cards and integration patterns for implementing new adapters

Files created:
- START_HERE.md: Quick navigation guide
- ANALYSIS_REPORT.md: Comprehensive technical analysis
- QUICK_REFERENCE.md: Copy-paste templates and patterns
- MODALITY_FLOW.md: Complete data flow walkthrough
- CODEBASE_EXPLORATION.md: File structure and discovery process
- DOCUMENTATION_INDEX.md: Structured index of all findings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: move research docs to .docs/ltx2-research/

Move auto-generated analysis documents out of project root
into a dedicated temporary folder for LTX2 integration research.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [utils] feat: add audio utility module with type aliases, standardization, and loading

Add `utils/audio.py` following the exact pattern of `utils/image.py` and
`utils/video.py`. Provides a complete audio waveform toolkit:

- Type aliases: `AudioSingle` (C,T), `AudioBatch` (B,C,T)
- Validation: `is_audio()`, `is_audio_batch()`
- Loading/saving: `load_audio()`, `save_audio()` with 3-tier backend
  fallback (torchaudio → soundfile → stdlib wave)
- Conversion: `audio_to_tensor()`, `audio_to_numpy()`, `convert_audio()`
  for resampling and mono/stereo conversion
- Standardization: `standardize_audio_batch()` with output_type='pt'|'np'
- Hashing: `hash_audio()`, `hash_audio_list()` with int16 quantization

Design conventions follow diffusers/audiocraft:
- Channel-first (C, T) tensor layout
- [-1.0, 1.0] float32 value range
- Channel conversion: downmix=mean, upmix=repeat
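
A pure-Python sketch of the channel conventions and the int16-quantized hash. Nested lists stand in for (C, T) tensors, and the helper names, the SHA-256 choice, and the shape header in the hash are assumptions rather than the contents of `utils/audio.py`.

```python
import hashlib
import struct


def convert_channels(audio, target_channels):
    """audio is channel-first: a list of C channels, each a list of T floats."""
    channels = len(audio)
    if channels == target_channels:
        return [list(channel) for channel in audio]
    if target_channels == 1:
        # Downmix: mean over channels at each time step.
        length = len(audio[0])
        return [[sum(ch[i] for ch in audio) / channels for i in range(length)]]
    if channels == 1:
        # Upmix: repeat the single channel.
        return [list(audio[0]) for _ in range(target_channels)]
    raise ValueError(f"cannot convert {channels} channels to {target_channels}")


def hash_audio(audio):
    digest = hashlib.sha256()
    digest.update(struct.pack("<II", len(audio), len(audio[0])))  # (C, T) header
    for channel in audio:
        # Quantize [-1.0, 1.0] floats to int16 so tiny float noise hashes equally.
        quantized = [max(-32768, min(32767, round(s * 32767))) for s in channel]
        digest.update(struct.pack(f"<{len(quantized)}h", *quantized))
    return digest.hexdigest()


stereo = [[0.5, -0.5], [0.0, 1.0]]
mono = convert_channels(stereo, 1)
back_to_stereo = convert_channels(mono, 2)
```

Quantizing before hashing trades a little precision for stability: two waveforms that differ only below int16 resolution map to the same digest.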

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [samples] feat: add audio field to BaseSample and T2AVSample class

- Add `audio: Optional[torch.Tensor]` field to BaseSample with
  automatic promotion from 1D (T,) to 2D (C, T) via audio_to_tensor()
- Add T2AVSample(BaseSample) for text-to-audio-video generation tasks
- Update __init__.py exports

The audio field follows the same pattern as image/video: stored without
batch dimension, standardized in __post_init__, supports stack/to_dict.
Fully backward compatible — defaults to None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [data] feat: support audio loading and preprocessing in dataset pipeline

- Add `audio_dir` field to DataArguments (defaults to 'audios' subfolder)
- Add audio loading block in GeneralDataset._preprocess_batch() that
  mirrors the existing video loading pattern: detect 'audio' column in
  JSONL, load via load_audio(), pass to preprocess_func(audios=...)
- Add 'audios' to PREPROCESS_KEYS for metadata exclusion
- Pass audio_dir through fn_kwargs to .map() call

The audio_dir parameter flows automatically through loader.py via
filter_kwargs — no changes needed in loader.py. Fully backward
compatible: datasets without audio columns are unaffected.

JSONL format: {"prompt": "...", "audio": "file.wav"}

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models] feat: add encode_audio interface and audio_vae support to BaseAdapter

- Add `encode_audio()` default method that returns None — non-abstract,
  so existing adapters need no changes
- Add `audio_vae` property with getter and setter (mirrors vae pattern)
- Update `preprocess_func()` to accept `audios` parameter and route it
  through `encode_audio()` in the same loop as prompt/image/video
- Update `_freeze_vae()` to also freeze audio_vae when present, keeping
  `_freeze_components()` clean

Fully backward compatible: encode_audio returns None by default,
audio_vae returns None when pipeline has no audio_vae component.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [rewards] feat: add audio parameter to reward model interfaces

- Add `audio: Optional[List[torch.Tensor]]` parameter to both
  PointwiseRewardModel.__call__ and GroupwiseRewardModel.__call__,
  positioned after `video` and before `condition_images`
- Add 'audio' to RewardProcessor.MEDIA_FIELDS
- Add audio branch in _convert_media_format(): tensor passthrough
  when use_tensor_inputs=True, numpy conversion otherwise

Audio is always tensor-based (no PIL equivalent). Existing reward
models (PickScore, CLIP, etc.) are unaffected — their concrete
__call__ signatures don't include `audio`, so filter_kwargs strips
it automatically with zero overhead.
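
How signature-based filtering keeps older reward models unaffected can be sketched like this. The `filter_kwargs` body and the two toy reward classes are assumptions; the project's own helper may be implemented differently.

```python
import inspect


def filter_kwargs(func, **kwargs):
    params = inspect.signature(func).parameters
    # A callable that declares **kwargs accepts everything as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {key: value for key, value in kwargs.items() if key in params}


class ImageOnlyReward:
    """Hypothetical pre-audio reward model: no `audio` in its signature."""

    def __call__(self, prompt, image=None):
        return [float(len(p)) for p in prompt]


class AudioAwareReward:
    """Hypothetical reward model that opts in by declaring `audio`."""

    def __call__(self, prompt, audio=None):
        return [0.0 if audio is None else float(len(audio)) for _ in prompt]


batch_input = {"prompt": ["a cat"], "image": ["img"], "audio": ["wave0", "wave1"]}
image_model, audio_model = ImageOnlyReward(), AudioAwareReward()
image_kwargs = filter_kwargs(image_model.__call__, **batch_input)
audio_kwargs = filter_kwargs(audio_model.__call__, **batch_input)
image_scores = image_model(**image_kwargs)
audio_scores = audio_model(**audio_kwargs)
```

The image-only model never sees the `audio` key, so adding the field to the shared batch dict does not break its call.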

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update Step 6 sub-plan to v4 with verified diffusers 0.38.0.dev0 API

Key corrections from runtime verification:
- Connectors use additive mask, not padding_side
- Transformer forward has no sigma/audio_timestep/STG/modality params
- CFG is velocity-space with [uncond, cond] chunk order
- Compression ratios are instance attributes, not class properties
- Audio params from pipeline attributes, not transformer config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: clean up obsolete research docs, consolidate plan

Remove 8 early exploration documents that have been fully superseded
by actual code implementation and the two remaining plan files:
- COMMIT_PLAN.md: overall progress tracker (updated with final status)
- STEP6_SUBPLAN.md: detailed LTX2 adapter plan (v4, API-verified)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] feat: scaffold LTX2 adapter with sample dataclass and pipeline loading

Add LTX2 text-to-audio-video adapter scaffold:

- LTX2Sample(T2AVSample): dataclass with audio trajectory fields
  (audio_all_latents, audio_latent_index_map), connector embedding
  fields for both video/audio streams, and negative prompt fields
  for CFG during training

- LTX2_T2AV_Adapter(BaseAdapter): skeleton with:
  - load_pipeline(): LTX2Pipeline.from_pretrained with low_cpu_mem_usage=False
  - _create_audio_scheduler(): ODE-only FlowMatchEulerDiscreteSDEScheduler
    (separate instance to avoid step_index collision with video)
  - default_target_modules: 28 Linear layers per block, verified against
    LTX2VideoTransformerBlock.named_modules()
  - preprocessing_modules: ['text_encoders', 'connectors']
  - inference_modules: ['transformer', 'vae', 'audio_vae', 'connectors', 'vocoder']
  - Stub methods for encode/decode/forward/inference (NotImplementedError)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] feat: implement encode_prompt and decode_latents for LTX2

encode_prompt():
- Delegates to pipeline.encode_prompt() for Gemma3 all-layer encoding
  + _pack_text_embeds normalization
- Passes through connectors with additive attention mask
  (1 - binary_mask) * -1e6, additive_mask=True
- Splits [negative, positive] connector outputs for CFG
- Returns prompt_ids + video/audio connector embeddings + masks
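
The additive mask formula above can be checked on a small example. Plain lists replace tensors, and the generic softmax is used only to show the effect of the mask; it is not a claim about the connector's internals.

```python
import math


def to_additive_mask(binary_mask):
    # 1 -> 0 (keep), 0 -> -1e6 (suppress), as in (1 - binary_mask) * -1e6.
    return [(1 - m) * -1e6 for m in binary_mask]


def masked_softmax(scores, additive_mask):
    shifted = [s + a for s, a in zip(scores, additive_mask)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


binary_mask = [1, 1, 0]          # last token is padding
scores = [0.0, 0.0, 5.0]         # padding token has the largest raw score
additive = to_additive_mask(binary_mask)
weights = masked_softmax(scores, additive)
```

Even though the padded position has the highest raw score, adding -1e6 before the softmax drives its weight to zero.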

decode_latents():
- Video: unpack → denormalize → optional timestep conditioning with
  decode noise injection → VAE decode → postprocess
- Audio: denormalize → unpack → audio_vae decode → vocoder
  (denormalize BEFORE unpack — order differs from video!)
- All operations match pipeline source L1172-1218 exactly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] feat: implement forward() with CFG and dual scheduler steps

Single denoising step matching pipeline L1097-1154:
1. Prepare CFG inputs: duplicate latents + concat [neg, pos] embeddings
2. Joint transformer forward with cache_context("cond_uncond")
3. CFG in velocity-space: uncond + gs * (cond - uncond), with
   optional guidance_rescale
4. Video: SDE scheduler step (stochastic, with log_prob for RL)
5. Audio: ODE scheduler step (deterministic, no log_prob)
6. Attach audio_next_latents on video output for trajectory tracking

RoPE coords are computed on demand if not cached from inference loop.
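
Step 3 above (CFG in velocity space with `[uncond, cond]` chunk order) reduces to the following arithmetic. Lists stand in for tensors, the function names are hypothetical, and `guidance_rescale` is left out of this sketch.

```python
def split_cfg_batch(prediction):
    # The doubled batch is ordered [uncond, cond], so the first half is uncond.
    half = len(prediction) // 2
    return prediction[:half], prediction[half:]


def cfg_velocity(uncond, cond, guidance_scale):
    # v = v_uncond + gs * (v_cond - v_uncond), applied element-wise.
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


doubled_prediction = [0.0, 1.0, 1.0, 3.0]   # [uncond..., cond...]
uncond, cond = split_cfg_batch(doubled_prediction)
no_extra_guidance = cfg_velocity(uncond, cond, guidance_scale=1.0)
guided = cfg_velocity(uncond, cond, guidance_scale=2.0)
```

At `guidance_scale=1.0` the formula returns the conditional prediction unchanged, which is why a scale of 1.0 is treated as the no-CFG setting.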

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] feat: implement inference loop and register ltx2_t2av adapter

inference():
- Full denoising loop following pipeline L908-1226
- Encode prompts (with fallback to pre-encoded inputs)
- Compute latent dimensions: video (32x spatial, 8x temporal), audio
  (4x mel, 4x temporal, sr=16kHz, hop=160)
- Prepare video + audio latents via pipeline.prepare_latents/audio_latents
- Timestep shift with LTX2-specific mu (base_seq=1024, shift=0.95-2.05)
- Positional coords via transformer.rope / audio_rope
- Dual trajectory collection: video (for RL) + audio (for reconstruction)
- Decode both modalities and construct LTX2Sample per batch element

Registry:
- Add 'ltx2_t2av' entry mapping to LTX2_T2AV_Adapter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] refactor: improve type safety and code cleanup for LTX2 adapter

- Add self.scheduler type declaration (consistent with all other adapters)
- Replace SDESchedulerOutput with FlowMatchEulerDiscreteSDESchedulerOutput
- Change forward() return type to Tuple instead of dynamic attribute
- Fix Optional parameter annotations for forward() signature
- Remove unused imports (Any, DISTILLED_SIGMA_VALUES) and variables
- Replace unicode arrows with ASCII in comments for encoding safety

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update COMMIT_PLAN with completed steps summary and next steps (7-9)

Steps 1-6 + type cleanup are all committed. Remaining work:
- Step 7: Design multi-modal forward() return pattern for optimize()
- Step 8: Example YAML configs (GRPO, NFT, AWM)
- Step 9: Audio-video dataset integration for testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] refactor: promote num_frames and frame_rate to explicit LTX2Sample fields

Move num_frames and frame_rate from extra_kwargs to explicit dataclass
fields on LTX2Sample, consistent with height/width on BaseSample.
Add num_frames to _shared_fields (shared across batch, not stacked).
Keep duration_s in extra_kwargs as a derived value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] refactor: unified latent interface for forward()

Redesign forward() to accept concatenated video+audio latents as a single
tensor (B, video_seq + audio_seq, C). Internally splits by video_seq_len,
runs the joint transformer, steps video SDE and audio ODE schedulers
separately, then concatenates next_latents back into a unified output.

This makes the trainer interface identical to single-modality adapters:
- forward() returns a single FlowMatchEulerDiscreteSDESchedulerOutput
- Trainers access output.next_latents and output.log_prob directly
- No trainer changes needed for multi-modal generation

Key changes:
- forward(): accept unified latents, split/cat internally, return single output
- inference(): cat video+audio before loop, use single latent_collector
- LTX2Sample: replace audio_all_latents with video_seq_len split point
- Remove Tuple return type, keep dual scheduler (video SDE + audio ODE)
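
A minimal sketch of the split/concatenate idea, using plain Python lists in place of tensors so it runs without any dependencies. The function names are hypothetical; with real tensors of shape (B, video_seq + audio_seq, C) the same operations would be slicing along dim=1 and `torch.cat(..., dim=1)`.

```python
def split_unified_latents(latents, video_seq_len):
    """Split each sequence into (video tokens, audio tokens) at video_seq_len."""
    video = [tokens[:video_seq_len] for tokens in latents]
    audio = [tokens[video_seq_len:] for tokens in latents]
    return video, audio


def merge_unified_latents(video, audio):
    """Concatenate per-sample video and audio tokens back into one sequence."""
    return [v + a for v, a in zip(video, audio)]
```

Because the split point is stored once (video_seq_len), the trainer only ever sees a single latent sequence per sample.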

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update COMMIT_PLAN — Step 7 complete, Steps 8-9 remaining

Step 7 resolved via unified latent interface:
- 7a: Type safety cleanup (FlowMatchEulerDiscreteSDESchedulerOutput)
- 7b: Promote num_frames/frame_rate to explicit LTX2Sample fields
- 7c: Unified forward() — cat(video,audio) input, single output, dual scheduler internal

Remaining: Step 8 (example YAML configs), Step 9 (test dataset)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update COMMIT_PLAN to reflect current state, remove obsolete STEP6_SUBPLAN

- Step 7 fully completed (7a-7c): type cleanup, explicit fields, unified latents
- Steps 8-9 remain: example configs + test dataset
- Add deferred features table with diffusers source availability
- Delete STEP6_SUBPLAN.md (diverged from implementation, no longer useful)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: correct deferred features — STG, prompt enhancement, x0-guidance all exist in diffusers

All four features listed below are available in the installed diffusers 0.38.0.dev0:
- STG: extra transformer forward with spatio_temporal_guidance_blocks
- Modality Isolation: extra forward with isolate_modalities=True
- Prompt Enhancement: Gemma3 text_encoder.generate() with system prompt
- x0-space guidance: convert_velocity_to_x0/convert_x0_to_velocity

Note: current adapter uses velocity-space CFG; x0-space is prerequisite
for STG and Modality Isolation Guidance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] feat: align forward() with official x0-space multi-guidance pipeline

Rewrite the guidance section to match official diffusers pipeline_ltx2.py:

1. x0-space guidance: convert velocity predictions to x0-space, compute all
   guidance deltas (CFG + STG + Modality Isolation) in x0-space, then convert
   back to velocity. This replaces the previous velocity-space CFG.

2. STG (Spatio-Temporal Guidance): optional extra transformer forward with
   spatio_temporal_guidance_blocks to perturb specific blocks. Separate
   stg_scale / audio_stg_scale for video and audio.

3. Modality Isolation Guidance: optional extra transformer forward with
   isolate_modalities=True (disables A2V/V2A cross-attention). Separate
   modality_scale / audio_modality_scale.

4. Prompt Enhancement: inference() supports system_prompt parameter to
   rewrite prompts via Gemma3 text_encoder.generate().

5. LTX-2.3 compatibility: pass sigma=timestep and use_cross_timestep
   to all transformer forward calls.

6. Independent audio guidance: separate audio_guidance_scale,
   audio_guidance_rescale, audio_stg_scale, audio_modality_scale
   (all default to their video counterparts).
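
The x0-space guidance in item 1 can be illustrated on scalars. This sketch assumes the flow-matching convention x_t = (1 - sigma) * x0 + sigma * noise with velocity = noise - x0, and assumes each guidance term pushes away from its perturbed prediction; the function names and the exact combination rule are illustrative, not copied from the official pipeline.

```python
def velocity_to_x0(sample, velocity, sigma):
    # x0 = x_t - sigma * v under the assumed flow-matching convention.
    return sample - sigma * velocity


def x0_to_velocity(sample, x0, sigma):
    # Inverse of velocity_to_x0.
    return (sample - x0) / sigma


def guided_velocity(sample, sigma, v_cond, v_uncond=None, v_stg=None,
                    v_iso=None, guidance_scale=1.0, stg_scale=0.0,
                    modality_scale=0.0):
    """Combine CFG, STG and modality-isolation deltas in x0-space."""
    x0_cond = velocity_to_x0(sample, v_cond, sigma)
    x0 = x0_cond
    if v_uncond is not None:
        x0_uncond = velocity_to_x0(sample, v_uncond, sigma)
        x0 += (guidance_scale - 1.0) * (x0_cond - x0_uncond)
    if v_stg is not None:
        x0 += stg_scale * (x0_cond - velocity_to_x0(sample, v_stg, sigma))
    if v_iso is not None:
        x0 += modality_scale * (x0_cond - velocity_to_x0(sample, v_iso, sigma))
    return x0_to_velocity(sample, x0, sigma)
```

Since the conversion is affine in the velocity, plain CFG gives the same result in either space; the x0-space form matters once the extra STG and modality terms are combined in the same place.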

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update COMMIT_PLAN with refined next steps (8a/8b/9)

Restructured remaining steps:
- 8a: Inference alignment refinements (validation, num_frames rounding,
  coord pre-duplication, distilled sigmas, embed concatenation)
- 8b: Example YAML configs (GRPO/NFT/AWM x lora/full)
- 9: VGGSound-50k test dataset integration

Updated design decisions to reflect x0-space guidance, sigma-based
conversion, and adapter design philosophy (inference=__call__, forward=step).
Removed obsolete deferred features (STG, prompt enhancement, x0-guidance
now implemented).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [models/ltx2] refactor: align inference() with official pipeline

- Add _check_inputs(): validate height/width divisibility, STG block spec,
  auto-round num_frames to VAE-temporal-compatible value
- Pre-duplicate RoPE coords for CFG before denoising loop (official L1201)
- Pre-concatenate [neg, pos] connector embeds before loop to avoid
  re-catting every step; forward() receives _cfg_prepared flag
- Extract positive-only embeds from pre-catted tensors for STG/modality
  passes when _cfg_prepared=True
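
A hedged sketch of the kind of validation `_check_inputs()` describes. The 32-pixel divisibility and 8x temporal factor come from the commit messages above; the `8k + 1` frame rule and rounding down (rather than to nearest) are assumptions made for illustration.

```python
def check_inputs(height, width, num_frames, spatial_ratio=32, temporal_ratio=8):
    """Validate the resolution and return a VAE-compatible num_frames."""
    if height % spatial_ratio or width % spatial_ratio:
        raise ValueError(
            f"height and width must be divisible by {spatial_ratio}, "
            f"got {height}x{width}"
        )
    # Assumed rule: (num_frames - 1) must be a multiple of temporal_ratio.
    remainder = (num_frames - 1) % temporal_ratio
    if remainder:
        num_frames -= remainder  # round down; the direction is an assumption
    return num_frames
```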

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [examples] feat: add LTX2 T2AV GRPO+LoRA example config

Verified with ff-train: config parses correctly, model architecture
resolves to ltx2_t2av, all parameters (resolution, num_frames, frame_rate,
audio_dir, guidance_scale, LoRA settings) are properly propagated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update config and dataset

* [logger,samples] feat: mux audio into video when logging T2AV samples

T2AV samples (e.g. LTX2) carry both video and audio, but the logging
pipeline silently dropped the audio track. This commit adds audio-video
muxing so wandb/other loggers receive a single MP4 with sound.

Changes:
- BaseSample: add `audio_sample_rate` field alongside `audio`
- LTX2 adapter: populate `audio_sample_rate` from pipeline vocoder config
- LogVideo: add optional `audio`/`audio_sample_rate` fields; when present,
  `get_value('mp4')` muxes H.264 video + AAC audio via PyAV (mirrors
  diffusers' encode_video); falls back to silent MP4 if PyAV unavailable
- LogFormatter: add T2AVSample dispatch → `_process_t2av_samples()` that
  creates LogVideo with audio attached and correct fps from frame_rate

Made-with: Cursor

* Fix _resolve_component_names

* [models/ltx2] fix: align connectors call with current diffusers API

Pass binary attention mask directly to LTX2TextConnectors.forward()
instead of pre-computing additive mask. The current diffusers version
handles binary-to-additive conversion internally and no longer accepts
the `additive_mask` keyword argument.

Made-with: Cursor

* update

* Fix

* [models] refactor: unify CFG control via guidance_scale across all adapters

Remove explicit `do_classifier_free_guidance` parameter from encode_prompt,
inference, and forward signatures. Instead, derive the CFG flag internally
from `guidance_scale` (>1.0 standard, >0.0 for Z-Image), ensuring
data_preprocessing and inference stages always use the same CFG decision.

Affected adapters: SD3.5, Z-Image, Wan T2V/I2V/V2V, LTX2 T2AV, Flux2 Klein.
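
The derived flag can be sketched as a single helper. The name `cfg_enabled` and the `threshold` parameter are hypothetical; the thresholds (1.0 for standard adapters, 0.0 for Z-Image) are the ones stated above.

```python
def cfg_enabled(guidance_scale, threshold=1.0):
    """Derive the CFG decision from guidance_scale alone."""
    return guidance_scale > threshold
```

Calling the same helper from both preprocessing and inference is what keeps the two stages from disagreeing about whether negative prompts are needed.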

Made-with: Cursor

* update

* Fix cfg

* Update config

* [models/ltx2] refactor: inline Gemma3 encoding into adapter encode_prompt

Eliminate delegation to pipeline.encode_prompt() and redundant tokenizer
call by inlining _get_gemma_prompt_embeds logic into a new _encode_text()
helper. This produces prompt_ids and negative_prompt_ids from the same
tokenization pass used for embeddings.

Made-with: Cursor

* [samples] refactor: include negative_prompt in unique_id and extract _hash_id_fields

Add negative_prompt/negative_prompt_ids to _id_fields and hash
computation so samples with different negative prompts are correctly
distinguished. Extract _hash_id_fields(hasher) to eliminate duplicated
prompt hashing in ImageConditionSample and VideoConditionSample.
Also parameterize digest length via num_bytes (default 16 = 128-bit).
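
A minimal sketch of a truncated-SHA256 id that includes the negative prompt. The function name mirrors the one mentioned above, but the field framing (presence byte plus length prefix) and the signed integer conversion are illustrative assumptions, not the repository's implementation.

```python
import hashlib


def compute_unique_id(prompt, negative_prompt=None, num_bytes=16):
    """Hash the id fields and return the first num_bytes as a signed int."""
    hasher = hashlib.sha256()
    for field in (prompt, negative_prompt):
        if field is None:
            hasher.update(b"\x00")  # distinguishes None from an empty string
            continue
        data = field.encode("utf-8")
        hasher.update(b"\x01" + len(data).to_bytes(8, "big") + data)
    return int.from_bytes(hasher.digest()[:num_bytes], "big", signed=True)
```

With `num_bytes=8` the result always fits in a signed 64-bit integer; with the 16-byte default it generally does not.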

Made-with: Cursor

* Update config

* Fix bytes

* Fix audio_sample_rate

* [models/ltx2,logger,utils] fix: address PR review — input validation, mux guard, ndim check

- Extend _check_inputs with prompt/embedding presence and CFG+negative
  consistency validation (Copilot comments #3, #5)
- Narrow audio mux guard to format == 'mp4' (comment #6)
- Add ndim validation in standardize_audio_batch (comment #7)

Made-with: Cursor

* [models/ltx2,utils] feat: integrate prompt enhancement with RNG isolation

Add isolated_rng context manager to utils/base.py for safe global RNG
seeding. Implement _enhance_prompt_batch in LTX2 adapter using official
pipeline.enhance_prompt with deterministic seed and RNG state isolation,
preventing seed leakage into downstream noise sampling. Enhancement is
opt-in via system_prompt config (null=disabled, "default"=Lightricks
prompt). Clean up placeholder enhancement code from inference().
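
The RNG-isolation pattern can be shown with the standard library alone. This is an analogue, not the utility itself: it isolates Python's `random` module, whereas the helper described above presumably also has to save and restore the torch (and CUDA) generator state.

```python
import random
from contextlib import contextmanager


@contextmanager
def isolated_rng(seed):
    """Seed the global RNG inside the block, then restore the prior state."""
    state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)
```

Code after the `with` block draws exactly the numbers it would have drawn had the block never run, which is the property that prevents seed leakage into noise sampling.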

Made-with: Cursor

* Reorder unique_id priority

* [reward] feat: add CLAP and ImageBind audio reward models

Add two audio-specific reward models for LTX2 audio-video generation:

- CLAPRewardModel: audio-text alignment via LAION CLAP (48 kHz mono,
  cosine similarity, zero new deps via transformers.ClapModel)
- ImageBindRewardModel: audio-video semantic alignment via Meta ImageBind
  (16 kHz mel-spectrogram, video spatial crops, multi-mode scoring)

Register both in reward registry, replace PickScore with CLAP + ImageBind
in ltx2_t2av.yaml config, and update COMMIT_PLAN.md Step 11 as done.

Made-with: Cursor

* [docs] feat: add flat inheritance rules for adapters and trainers

- Constraint #11: trainers MUST inherit from BaseTrainer directly
  (only GRPOGuardTrainer → GRPOTrainer sanctioned)
- Constraint #12: adapters MUST inherit from BaseAdapter directly;
  shared logic uses helpers, code duplication, or mixins
- Update architecture.md and cursor rule to reference both rules

Made-with: Cursor

* [docs] feat: add I2AV adapter and audio reward research plans

- I2AV_PLAN: LTX2 Image-to-Audio-Video adapter design (BaseAdapter
  flat hierarchy, conditioning mask, first-frame preservation)
- AUDIO_REWARD_PLAN: CLAP and ImageBind reward model integration
- COMMIT_PLAN: update GPU validation status and prompt enhancement
  analysis

Made-with: Cursor

* chore: update diffusers submodule

Made-with: Cursor

* [docs] feat: add two-layer sample hierarchy constraint

Model-specific samples must inherit from task-level samples (e.g.
LTX2I2AVSample → I2AVSample), never from other model-specific samples.
Updated constraint #14, architecture.md, and base-class-contract rule.

Made-with: Cursor

* [docs] update: expand I2AV plan with full implementation details

- Complete LTX2I2AVSample dataclass with all duplicated LTX2 fields
- Code duplication strategy (no _common.py shared module)
- encode_image with condition_image_size (flux2/qwen pattern)
- Dual-path inference: raw images or pre-encoded condition_images
- Full forward()/inference() structure with conditioning mask semantics
- enhance_prompt I2AV multimodal difference documented

Made-with: Cursor

* [reward] fix: add transformers v5 compatibility for PickScore get_*_features() API

In transformers >=5.0, get_text_features()/get_image_features() return
BaseModelOutputWithPooling instead of a tensor. Add _extract_feature_tensor()
helper to handle both v4 (tensor) and v5 (ModelOutput) return types.
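
One way such a compatibility helper could look, assuming the v5 output exposes the features as `pooler_output` (the pooled field of BaseModelOutputWithPooling). `FakeModelOutput` is a stand-in used only so the sketch runs without transformers installed.

```python
class FakeModelOutput:
    """Stand-in for a ModelOutput that carries a pooler_output field."""

    def __init__(self, pooler_output):
        self.pooler_output = pooler_output


def extract_feature_tensor(output):
    """Return the feature tensor for both v4 (tensor) and v5 (ModelOutput)."""
    pooled = getattr(output, "pooler_output", None)
    # v4 returns the tensor itself, which has no pooler_output attribute.
    return output if pooled is None else pooled
```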

Made-with: Cursor

* [samples,logger] feat: add I2AVSample dataclass and logger support

Introduce I2AVSample (Image-to-Audio-Video) task-level sample and
its logger handler, combining I2V condition-image table layout with
T2AV audio-muxed LogVideo for complete I2AV logging across backends.

Made-with: Cursor

* [adapter,registry] feat: add LTX2 I2AV adapter for image-conditioned audio-video generation

- LTX2_I2AV_Adapter(BaseAdapter) with conditioning_mask for frame-0 preservation
- forward(): CFG-doubles conditioning_mask internally, per-token video timestep
  masking, scheduler.step on generated frames only
- inference(): dual-path image input (raw PIL via encode_image or pre-encoded tensor)
- _enhance_prompt_batch(): multimodal Gemma3 enhancement with raw PIL images
- _standardize_image_input(): MultiImageBatch flattening for single-condition model
- Registry entry 'ltx2_i2av' and example YAML config
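
The conditioning-mask semantics can be sketched on plain lists. Both helpers are hypothetical and assume a mask value of 1.0 marks conditioned (frame-0) tokens: those tokens see timestep 0 and keep their original latents, while generated tokens are stepped normally.

```python
def per_token_timesteps(timestep, conditioning_mask):
    """Zero the timestep for conditioned tokens, keep it for generated ones."""
    return [timestep * (1.0 - m) for m in conditioning_mask]


def keep_conditioned(latents, stepped, conditioning_mask):
    """Take scheduler output only where the token is not conditioned."""
    return [
        old if m == 1.0 else new
        for old, new, m in zip(latents, stepped, conditioning_mask)
    ]
```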

Made-with: Cursor

* [adapter] fix: align MultiImageBatch/MultiVideoBatch type annotations across adapters

Wan2 I2V and Wan2 V2V had correct runtime logic (is_multi_image_batch check
in _standardize_*_input) but inference() and encode_image() annotations used
ImageBatch/VideoBatch instead of MultiImageBatch/MultiVideoBatch.

Made-with: Cursor

* [docs] update: add nested batch convention to adapter docs

- adapter_conventions.md: document that inference() receives MultiImageBatch/
  MultiVideoBatch from collator; single-condition adapters must flatten via
  _standardize_*_input; append Gotcha #5; add cross-refs
- ff-new-model/SKILL.md: extend Pitfall #6 with single-condition handling
  guidance and back-ref to adapter_conventions.md Gotcha #5

Made-with: Cursor

* [style] fix: apply PR review fixes — logging, section-divider policy, unique_id

- Replace bare print() with logger.warning() in formatting.py
- Update section-divider rule: allow between methods, forbid inside functions
- Convert in-function decorative dividers to plain numbered comments
- Include negative_prompt in unique_id hash via _hash_id_fields() refactor
- Fix compute_unique_id default to 8 bytes (fits torch.int64)

Made-with: Cursor

* [reward,models/ltx2] fix: CLAP BatchNorm dtype + I2AV forward bugs

- CLAP: load model in float32 (BatchNorm requires it); fix audios->audio deprecation
- I2AV: remove conditioning_mask from _shared_fields to preserve batch dim
- I2AV: ensure next_latents in return_kwargs for frame-slicing logic
- Remove dtype: bfloat16 from CLAP reward config in example YAMLs

Made-with: Cursor

* Update config

* update config

* Update config

* Fix dtype

* Update

* [diffusers] fix: cherry-pick flash_3_varlen_hub mask dtype fix

Cherry-pick commit 0f8a83fa6 from diffusers to support additive
(bfloat16) attention masks in _flash_attention_3_varlen_hub. This
fixes the ValueError when LTX2 connectors pass non-bool masks.

Made-with: Cursor

* [data_utils] perf: eliminate redundant preprocessed-dataset writes

Each rank's preprocessed Arrow shard is now written exactly once: the
orchestrator routes Dataset.map output directly to the final per-rank
location via cache_file_name= (under {merged_cache_path}.tmp/_parts/),
and the consolidator writes only state.json + dataset_info.json before
atomically renaming .tmp -> merged_cache_path. No row data is re-copied
during the merge, no duplicate cache lands under ~/.cache/huggingface,
and the build-dir sentinel _build_meta.json enables crash recovery for
unchanged num_shards while wiping cleanly on num_shards changes.

Single-process and distributed paths are unified through the same flow
(N=1 case for single-process); enable_preprocess=False bypasses the
consolidate pipeline entirely to preserve pre-refactor behavior. I/O
budget per cache build drops from ~4*N*S to ~N*S bytes touched.
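
The write-once-then-rename flow can be sketched with the standard library. Everything here is illustrative: the `consolidate` name, the shard file names, and the manifest contents are assumptions (the real state.json schema belongs to the datasets library), and the sketch only demonstrates that shard files are never copied.

```python
import json
import os
import tempfile
from pathlib import Path


def consolidate(build_dir: Path, final_dir: Path, num_shards: int) -> None:
    """Write metadata next to existing shards, then publish via rename."""
    parts = sorted((build_dir / "_parts").glob("*.arrow"))
    if len(parts) != num_shards:
        raise RuntimeError(f"expected {num_shards} shards, found {len(parts)}")
    manifest = {"shards": [f"_parts/{p.name}" for p in parts]}  # hypothetical schema
    (build_dir / "state.json").write_text(json.dumps(manifest))
    os.rename(build_dir, final_dir)  # no row data is rewritten


def demo() -> list:
    """Build two fake shards in a temp dir and return the published layout."""
    with tempfile.TemporaryDirectory() as root:
        final_dir = Path(root) / "merged_cache"
        build_dir = Path(root) / "merged_cache.tmp"
        (build_dir / "_parts").mkdir(parents=True)
        for rank in range(2):
            (build_dir / "_parts" / f"rank{rank}.arrow").write_bytes(b"rows")
        consolidate(build_dir, final_dir, num_shards=2)
        return sorted(
            p.relative_to(final_dir).as_posix()
            for p in final_dir.rglob("*") if p.is_file()
        )
```

Readers either see the complete directory or nothing, because the final path only appears once the rename succeeds.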

Made-with: Cursor

* [docs] update: comprehensive audio + R7 no-op-default encoder docs sweep

Bring agent docs and the developer guides on feat/ltx2-audio-video-support
in line with the R6/R7 BaseAdapter contract changes (PR X-GenGroup#129) AND extend
every audio-aware section to cover the new modality.

Critical contract fixes (mirror Wave A on data-utils-perf so the 0-hit
verification gate passes on this branch independently of merge order):
- constraints.md #12: 7 abstract methods -> 4 (load_pipeline,
  decode_latents, forward, inference); new "Optional encoder overrides
  (no-op default)" subsection naming all 4 encoders incl. encode_audio;
  preprocess_func note now explains the audios dispatch and "skip when
  None" semantics. Trailing "Adapter hierarchy" paragraph also fixed
  ("7-method contract" -> "4-abstract-method contract" with explanation).
- ff-new-model/SKILL.md: frontmatter description fixed; Phase-1 step-3
  mapping gains audio encoder/VAE row; Phase-2 step-3 implementation
  table reorders so the 4 truly-abstract methods are on top, marks all
  4 encoders as Abstract? No (no-op default; override if your model
  consumes this modality), adds the encode_audio row; Pitfall #6
  extended to include audios + the []-for-empty / never-unwrap contract
  (preserved the existing _standardize_*_input guidance and added a
  cross-ref to adapter_conventions.md Gotcha #6).
- guidance/new_model.md: Step 4 narrative replaced "three encoding
  methods" with "Override the encoders your model consumes"; updated
  the dispatcher pseudocode to enumerate all 4 modalities with the
  "if encoded is not None" skip; new #### encode_audio block parallel
  to encode_video (MultiAudioBatch signature, default return None);
  Step-5 inference signature comment adds "audios: Optional[MultiAudioBatch]"
  (commented as opt-in); checklist gains explicit encode_video() /
  encode_audio() items framed as override-only.

Audio-aware sweep (Wave-B-only, lives here because LTX-2 actually
consumes audio):
- adapter_conventions.md: Batch Dimension Convention extended to
  include audios/MultiAudioBatch on inference(); new bullet codifying
  the multi-media batch homogeneity guarantee; new Gotcha #6 for the
  []-for-empty / no-unwrap rule (applies symmetrically to images,
  videos, audios).
- guidance/new_model.md Data Format Conventions: new ### Audio table
  with audios/condition_audios/audio_features rows and a callout
  pointing to flow_factory.utils.audio.MultiAudioBatch; cross-cutting
  batch-boundary callout extended to include encode_audio().
- guidance/workflow.md Stage 1: goal narrative + Input row extended
  to include "audio files"; new audio-symmetry callout after the
  Flux.2 example explaining audio_dir is the third optional input
  handled by _preprocess_batch.
- architecture.md: Stage-1 ASCII box now lists audio + audio_features;
  Adapter Pattern subsection gains a one-liner that all 4 encoders are
  no-op by default (override only the modalities your model consumes).
- ff-develop/SKILL.md sec.2 Adapter Hierarchy: appended the R7 design
  lesson bullet (non-abstract no-op default + opt-in override over
  @abstractmethod for new modalities; the 4 abstract methods are
  intentionally minimal).
- fix_patterns.md Recorded Fix Patterns: replaced "(No records yet)"
  with two full entries using the documented template — R6 multi-modal
  batch homogeneity and R7 non-abstract encoder defaults; both link
  back to the actual code locations.

Pure docs change; no Python touched. Verified: zero hits across
.agents/ + guidance/ for "7 abstract methods", "7-method contract",
"three encoding methods", "Implement the three encoding"; encode_audio
+ MultiAudioBatch present in every doc per the matrix; ReadLints clean
on all 8 touched files.

Made-with: Cursor

* [diffusers] sync: bump submodule to upstream main 77f8cf8bf

Drops the locally cherry-picked commit 620286eb5 ("support ltx-2 type
masking in flash_3_hub_varlen") and resets the submodule to the
official huggingface/diffusers main HEAD as of 2026-04-18.

Functional consequence: _flash_attention_3_varlen_hub no longer casts
non-bool attn_mask via `attn_mask > -1`. Upstream already carries the
`isinstance(result, tuple)` defensive unpack, so only the bool-cast is
missing. Waiting for upstream to land an equivalent fix; until then,
LTX2 paths that pass a non-bool attn_mask to flash-attn-3 varlen-hub
may misbehave inside _normalize_attn_mask.

Made-with: Cursor

* [utils] feat: add move_tensors_to_device recursive helper

Adds a shape-agnostic device-move utility in utils/base.py that walks
list / tuple / dict containers depth-first, copying torch.Tensor leaves
to the target device. Non-tensor leaves (PIL, str, int, np.ndarray)
pass through unchanged. Containers are reconstructed immutably; the
input is not modified.

Signature:
  move_tensors_to_device(value, device, max_depth=None)

The optional max_depth bounds recursion (None = unbounded; 0 = only
move when value itself is a Tensor; N = walk N levels).

Designed for the upcoming reward path device adaptation, but kept as
a general utility so future callers (e.g., a future BaseSample.to
refactor delegating with max_depth=1) can reuse it.

Pure addition; no consumers in this commit.
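
A dependency-free sketch of the recursion and the max_depth semantics described above. `FakeTensor` is a stand-in for torch.Tensor so the example runs without torch; the body is an illustration of the stated contract, not the code in utils/base.py.

```python
class FakeTensor:
    """Stand-in for torch.Tensor: .to() returns a copy on the new device."""

    def __init__(self, device="cuda"):
        self.device = device

    def to(self, device):
        return FakeTensor(device)


def move_tensors_to_device(value, device, max_depth=None):
    """Return a copy of value with tensor leaves moved to device."""
    if isinstance(value, FakeTensor):
        return value.to(device)
    if max_depth is not None and max_depth <= 0:
        return value  # depth budget exhausted: do not descend further
    next_depth = None if max_depth is None else max_depth - 1
    if isinstance(value, dict):
        return {k: move_tensors_to_device(v, device, next_depth)
                for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        moved = [move_tensors_to_device(v, device, next_depth) for v in value]
        return tuple(moved) if isinstance(value, tuple) else moved
    return value  # non-tensor leaves pass through unchanged
```

The containers are rebuilt rather than edited, so the caller's original structure still points at the original tensors.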

Made-with: Cursor

* [reward] refactor: route reward inputs through move_tensors_to_device

Inserts a move_tensors_to_device call between _convert_media_format
and model(**batch_input) in three reward computation sites:
  - _compute_pointwise_batch
  - _compute_groupwise_group
  - _compute_groupwise_local (inner per-group loop)

The recursive helper walks list/tuple/dict containers and copies tensor
leaves to model.device. The local batch_input dict is reconstructed;
sample objects are NOT mutated.

Behavior with current GPU-resident samples: same-device .to() is a no-op,
so reward outputs remain bit-identical. The change is defensive prep for
the upcoming sample-loop CPU offload (commit 6) where samples will arrive
on CPU and reward models still run on their declared device.

The distributed groupwise path (_compute_groupwise_distributed) needs no
change: it already passes device=self.accelerator.device to gather_samples,
so its inputs are GPU-resident regardless of caller-side device.

Made-with: Cursor

* [hparams,trainer] feat: add offload_samples_to_cpu config and BaseTrainer helper

Adds the configuration switch and the producer-side helper for the
upcoming sample CPU-offload + lazy-reload pipeline.

hparams/training_args.py:
  TrainingArguments gains a new field
    offload_samples_to_cpu: bool = False
  placed next to enable_gradient_checkpointing (sibling memory switch).
  The help string documents the trade-off (D2H per sample + per-reward
  H2D ~100ms/epoch vs sample/optimize GPU peak reduction).

trainers/abc.py:
  BaseTrainer gains _maybe_offload_samples_to_cpu(samples), a non-
  abstract helper that no-ops when the config is False and otherwise
  walks the sample list calling BaseSample.to('cpu'). The docstring
  records the ordering invariant required by the consumer trainers
  (must be called BEFORE reward_buffer.add_samples) and points to
  RewardProcessor's move_tensors_to_device for the consumer-side H2D.

No call sites yet -- behaviour is unchanged in this commit. The helper
is wired into the five trainers' sample() loops in commit 6.

Made-with: Cursor

* [trainer] refactor: lazy per-batch reload in GRPO/GRPO-Guard/DPO optimize()

Replaces the eager pre-stacked sample_batches list with a single per-batch
loop that lazily reconstructs each micro-batch:

  for batch_idx in range(num_batches):
      batch_samples = [sample.to(device) for sample in shuffled_samples[...]]
      batch = BaseSample.stack(batch_samples)
      ...

Affected sites:
  - trainers/grpo.py GRPOTrainer.optimize()
  - trainers/grpo.py GRPOGuardTrainer.optimize()
  - trainers/dpo.py DPOTrainer.optimize() (chosen_samples / rejected_samples
    extraction inside the per-pair-batch loop)

Behaviour-preserving for the current GPU-resident sample buffer: every
sample.to(device) is a same-device no-op, so loss values, gradients, and
optimizer steps remain bit-identical to HEAD~1.

The change is the consumer-side prerequisite for the upcoming sample-loop
CPU offload (commit 6): once samples may be CPU-resident, this lazy reload
is what keeps optimize()'s GPU footprint bounded by a single micro-batch
instead of the full epoch.

NFT/AWM are deliberately not touched in this commit -- their optimize()
has an extra eager precompute layer that requires structural restructuring
(commit 5).

Made-with: Cursor

* [trainer] refactor: NFT/AWM optimize() per-batch precompute interleave

Restructures NFT and AWM optimize() from the previous double-pass design
(eager precompute over ALL batches under sampling_context, then training
over ALL batches under current params) to a single-pass per-batch
interleave that matches the official DiffusionNFT and AWM implementations:

  for each micro-batch:
    1. lazy reload sample tensors to GPU and stack into a batch dict
    2. precompute under sampling policy:
         adapter.rollout()
         with sampling_context():
             compute (_all_timesteps, _all_random_noise, _old_v_pred_list
                      or _old_log_probs) for THIS batch only
    3. train under current policy:
         adapter.train()
         with self.autocast():
             for t_idx in range(num_train_timesteps):
                 forward / loss / backward / optimizer step

Memory savings: only the current batch's _all_random_noise plus
_old_v_pred_list (NFT) or _old_log_probs (AWM) lives on GPU at any time.
The previous design held all num_batches_per_epoch batches' precompute
output simultaneously, costing ~5+ GB on FLUX1 1024^2 LoRA at B=4 / T=40
and tens of GB on Wan video models (often the OOM trigger).

Train-inference consistency (philosophy #1; see
.agents/knowledge/topics/train_inference_consistency.md item #4):

  - Rollout (sample()/adapter.inference()) is unchanged.
  - EMA params are loaded via sampling_context() and restored before each
    batch's training forward, identical to the per-batch behavior of the
    previous design.
  - ema_step() runs only once per outer epoch in start(), so every batch
    within an optimize() call sees the SAME EMA snapshot regardless of
    interleave timing -> per-batch and eager designs are equivalent on
    the EMA invariant.

Note on RNG (regression test guidance):

  randn_tensor for batch K is now called after batch K-1's backward step
  (vs all noises sampled upfront). The CUDA RNG consumption order
  changes; under the same seed, the per-batch noise sequences are NOT
  bit-identical to the eager design. The algorithm is unchanged (noise
  is augmentation; equivalent in expectation). Regression tests should
  use statistical metrics (loss mean / reward trend across an epoch),
  NOT a numeric diff of loss values, when comparing against HEAD~1.

Sample-level lazy reload (`[sample.to(device) for sample in slice]`) is
folded into this same restructure -- NFT/AWM now share the same lazy
reload pattern as GRPO/DPO from commit 4.

The KL paths (use_ref_parameters / use_ema_parameters per timestep) and
all loss / backward / optimizer logic are unchanged in body and order.

Made-with: Cursor

* [trainer] feat: wire sample() loop CPU offload across all trainers

Inserts self._maybe_offload_samples_to_cpu(sample_batch) into every
trainer's sample() loop, immediately after adapter.inference() and
BEFORE both samples.extend() and reward_buffer.add_samples():

  sample_batch = self.adapter.inference(...)
  self._maybe_offload_samples_to_cpu(sample_batch)  # synchronous D2H
  samples.extend(sample_batch)
  self.reward_buffer.add_samples(sample_batch)

Affected sites (5 total):
  - GRPOTrainer.sample()        (trainers/grpo.py)
  - GRPOGuardTrainer.sample()   (trainers/grpo.py)
  - DiffusionNFTTrainer.sample() (trainers/nft.py)
  - AWMTrainer.sample()         (trainers/awm.py)
  - DPOTrainer.sample()         (trainers/dpo.py)

Why BEFORE add_samples (not after):
  reward_buffer.add_samples() in the async-reward path records a CUDA
  sync_event and dispatches workers that read sample.image / sample.video
  / etc. Calling the offload BEFORE add_samples guarantees the recorded
  event captures "D2H complete + data ready on CPU"; workers wait on the
  event and then deterministically see CPU-resident samples. Inverse
  order would race the worker thread's getattr against the main thread's
  in-place setattr that BaseSample.to('cpu') performs.
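
The ordering argument can be made concrete with stand-in classes. `FakeSample` and `FakeRewardBuffer` are hypothetical; the buffer simply records which device each sample was on at the moment it was handed over, which is what an async worker would observe.

```python
class FakeSample:
    """Stand-in for BaseSample: .to() mutates the sample in place."""

    def __init__(self):
        self.device = "cuda"

    def to(self, device):
        self.device = device
        return self


class FakeRewardBuffer:
    """Records the device of every sample at hand-over time."""

    def __init__(self):
        self.seen_devices = []

    def add_samples(self, samples):
        self.seen_devices.extend(s.device for s in samples)


def maybe_offload_samples_to_cpu(samples, enabled):
    if not enabled:
        return  # inert when the config flag is False
    for sample in samples:
        sample.to("cpu")


def sample_step(new_samples, all_samples, reward_buffer, offload):
    # Offload first, so the buffer never sees a sample mid-move.
    maybe_offload_samples_to_cpu(new_samples, offload)
    all_samples.extend(new_samples)
    reward_buffer.add_samples(new_samples)
```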

Behaviour gating:
  The helper short-circuits when training_args.offload_samples_to_cpu is
  False (the default), so the entire pipeline is wired but inert. Setting
  the flag to True in any trainer YAML now activates the producer side
  (D2H here), and the previously-landed pieces handle the consumer side:
    * commit 2: reward_processor moves the CPU input dict to model.device
                via move_tensors_to_device.
    * commits 4 & 5: optimize() loops lazily reload [sample.to(device)
                     for sample in slice] per micro-batch.

End-to-end VRAM saving (roughly num_batches_per_epoch x per_batch_size
of sample tensors) is unlocked for the first time in this commit. Wan
video YAMLs that opt in (commit 7) will exercise it.

Evaluate paths are intentionally NOT touched -- eval samples are usually
small and one-shot, and eval logging may rely on tensors being on the
adapter device.

Made-with: Cursor

* [examples] feat: enable offload_samples_to_cpu for Wan video models

Adds `offload_samples_to_cpu: true` to all 13 Wan video example configs
(GRPO LoRA / Full and NFT LoRA / Full across Wan2.1 and Wan2.2, T2V /
I2V / V2V variants). Inserted next to `enable_gradient_checkpointing`
so the two memory switches sit together.

Why required for video models:
  per-sample tensors (all_latents, condition videos, image_embeds, ...)
  are GB-scale on Wan; without the offload, sample()/optimize() OOMs as
  soon as num_batches_per_epoch > 1. The plumbing wired in commits 1-6
  is now actually exercised on these configs.

Files (13 total):
  examples/grpo/lora/wan21_t2v.yaml
  examples/grpo/lora/wan21_i2v.yaml
  examples/grpo/lora/wan21_v2v.yaml
  examples/grpo/lora/wan22_t2v.yaml
  examples/grpo/lora/wan22_i2v.yaml
  examples/grpo/full/wan21_t2v.yaml
  examples/grpo/full/wan21_i2v.yaml
  examples/grpo/full/wan22_t2v.yaml
  examples/grpo/full/wan22_i2v.yaml
  examples/nft/lora/wan21_t2v.yaml
  examples/nft/lora/wan21_i2v.yaml
  examples/nft/lora/wan22_t2v.yaml
  examples/nft/full/wan22_t2v.yaml

Non-video model YAMLs are intentionally not touched in this commit;
moderate-VRAM-pressure image models (Flux2, Qwen-Image-Edit-Plus) get
an explicit `false` + pros/cons comment in commit 8 so users see the
option as a documented decision point, while small/standard image
models (FLUX1 / SD3 / Qwen-Image / Z-Image / DPO / etc.) rely on the
code default `False` to avoid YAML noise.

Made-with: Cursor

* [examples] docs: expose offload_samples_to_cpu option in Flux2 and Qwen-Image-Edit-Plus configs

Adds an explicit `offload_samples_to_cpu: false` to 13 example configs
(11 Flux2 variants + 2 Qwen-Image-Edit-Plus) preceded by a multi-line
comment that documents the parameter, its pros/cons, and the conditions
under which a user should flip it to true:

  # offload_samples_to_cpu: CPU-offload sample tensor fields between
  #   sample() and optimize() to reduce GPU peak memory.
  # Pros (true): saves N x per_batch_size GPU memory ... no correctness
  #   or convergence impact.
  # Cons (true): adds ~100ms/epoch H2D in reward path; tiny per-batch
  #   H2D in optimize (<5ms each).
  # Recommended (true) for higher resolutions, larger batch sizes, or
  #   any sample()/optimize() OOM. Default false works for current
  #   example settings.
  offload_samples_to_cpu: false

Tier rationale (3-tier YAML strategy, deviating intentionally from the
strict ALL-YAML rule in .cursor/rules/examples-yaml-sync.mdc):

  T1 (commit 7, Wan video, 13 YAMLs): explicit `true` -- required to
                                      avoid OOM.
  T2 (this commit, Flux2 + Qwen-Edit, 13 YAMLs): explicit `false` +
                                      pros/cons comment -- moderate
                                      VRAM pressure, decision left to
                                      the user with documentation right
                                      next to it.
  T3 (untouched, 23 YAMLs of FLUX1 / SD3 / Qwen-Image / Z-Image / DPO /
      AWM-non-Flux2 / template): no field added, sane defaults via the
                                 code-level default. The field is
                                 documented in the upcoming
                                 topics/sample_lifecycle.md (commit 9).

This three-tier policy keeps T3 YAMLs noise-free while making T1/T2
both behaviour-correct (T1) and discoverable (T2).

Files (13 total):
  examples/grpo/lora/flux2_t2i.yaml
  examples/grpo/lora/flux2_i2i.yaml
  examples/grpo/lora/flux2_klein.yaml
  examples/grpo/lora/flux2_klein_base.yaml
  examples/grpo/full/flux2_t2i.yaml
  examples/grpo/full/flux2_i2i.yaml
  examples/grpo/full/flux2_klein.yaml
  examples/grpo/full/flux2_klein_base.yaml
  examples/awm/lora/flux2_klein_base.yaml
  examples/nft/lora/flux2_klein_base.yaml
  examples/nft/full/flux2_klein_base.yaml
  examples/grpo/lora/qwen_image_edit_plus.yaml
  examples/grpo/full/qwen_image_edit_plus.yaml

Made-with: Cursor

* [docs] feat: add sample lifecycle topic and README routing

New leaf doc .agents/knowledge/topics/sample_lifecycle.md (per
.cursor/rules/agents-docs-maintenance.mdc), covering:
  - default sample lifecycle with the offload pipeline
  - the offload_samples_to_cpu switch and its effect at each stage
  - the 3-tier example YAML adoption matrix (T1 Wan video / T2 Flux2 +
    Qwen-Image-Edit-Plus / T3 the rest), including the rationale for
    intentionally deviating from the strict ALL-YAML rule in
    .cursor/rules/examples-yaml-sync.mdc
  - reward path device responsibility (move_tensors_to_device contract)
  - async-reward race-free argument for offload-before-add_samples order
  - NFT/AWM per-batch precompute interleave summary (memory savings,
    train-inference consistency, RNG-order caveat)
  - extra_kwargs device asymmetry caveat
    (rewards on CPU, advantage on GPU, neither moved by BaseSample.to)
  - cross-refs to constraints #11, #14, #15, train_inference_consistency
    item #4, dtype_precision

README.md routing table gains a corresponding row pointing at the new
topic with explicit triggers (sample/optimize data flow changes,
debugging sample/optimize OOM, adding high-resolution / video example
configs). The new topic is the authoritative reference for the
offload_samples_to_cpu switch -- T3 YAMLs do not list the field, so
users discover it through this routing entry.

This documentation commit closes the 9-commit refactor of the sample
CPU offload + lazy reload pipeline (it follows commits 1 through 8). No
code changes in this commit.

Made-with: Cursor

* [trainer] docs: simplify NFT/AWM optimize() docstrings

The previous docstrings (introduced in commit 49e7b49) inlined the full
memory analysis, train-inference consistency proof, and RNG-order caveat
-- ~30 lines each. All of that material lives in the authoritative
.agents/knowledge/topics/sample_lifecycle.md (commit 8dc4a78), so the
docstrings now keep only the essential "what":

  - one-line summary
  - per-batch interleave shape (3-step pipeline)
  - AWM-specific note on decoupled sampling/training timesteps
  - pointer to topics/sample_lifecycle.md for "why"

Net change: -33 lines across the two files; bodies unchanged.

Made-with: Cursor

* [ltx2] fix: handle batched per-sample timestep in adapter forward()

`t.expand(batch_size * 2)` raised RuntimeError when forward() was called
during training with `t` of shape (B,) (distinct per-sample timesteps
from `batch['timesteps'][:, ti]`). `expand` cannot stretch a non-singleton
dim from B to 2B.

Normalize `t` to (B,) at function entry with fail-fast shape validation,
then use `torch.cat([t, t])` for the CFG-doubled batch (matching the
`torch.cat([lat, lat])` ordering of `[t0..tB-1, t0..tB-1]`). Accepts
0-D scalar (inference), (1,) singleton, and (B,) per-sample inputs;
other shapes raise an informative ValueError.

Applied identically to LTX2_T2AV_Adapter and LTX2_I2AV_Adapter.
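
A minimal sketch of the normalization described above, assuming PyTorch. The helper names `normalize_timestep` and `cfg_double` are hypothetical and are not the adapters' actual code; the sketch only shows why `torch.cat` works where `expand` cannot.

```python
import torch


def normalize_timestep(t: torch.Tensor, batch_size: int) -> torch.Tensor:
    """Return `t` as a 1-D tensor of shape (batch_size,), or raise ValueError."""
    if t.ndim == 0:
        # 0-D scalar (inference): broadcast to every sample.
        return t.expand(batch_size)
    if t.ndim == 1 and t.shape[0] == 1:
        # (1,) singleton: expand is valid because the dim has size 1.
        return t.expand(batch_size)
    if t.ndim == 1 and t.shape[0] == batch_size:
        # (B,) per-sample timesteps (training): already the right shape.
        return t
    raise ValueError(
        f"expected timestep of shape (), (1,) or ({batch_size},), "
        f"got {tuple(t.shape)}"
    )


def cfg_double(t: torch.Tensor, batch_size: int) -> torch.Tensor:
    """Build the CFG-doubled timestep [t0..tB-1, t0..tB-1] of shape (2B,)."""
    t = normalize_timestep(t, batch_size)
    # cat copies the values; expand could not stretch a size-B dim to 2B.
    return torch.cat([t, t])
```

Calling `cfg_double(torch.tensor([0.1, 0.5, 0.9]), 3)` yields a tensor of shape `(6,)` ordered the same way as `torch.cat([lat, lat])`.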

Made-with: Cursor

* [data_utils] fix: stabilize preprocess cache key via deep signature collection

compute_cache_path previously hashed ALL preprocess_kwargs — including
training-infrastructure fields like num_batches_per_epoch and
gradient_accumulation_steps that leak through **training_args unpacking
and filter_kwargs pass-through. These fields have non-deterministic
values across launches (world-size-derived, __post_init__ timing, etc.),
causing merged_cache_path to differ even with identical YAML and
force_reprocess=False.

Fix: add _select_cache_relevant_kwargs() which uses "deep signature
collection" — it inspects the named parameters of preprocess_func AND
(when preprocess_func accepts **kwargs and is a bound adapter method)
the named parameters of all encode_* forwarding targets on the same
adapter instance. Only kwargs whose key appears in this union are
included in kwargs_hash. Training-only fields that no encoder declares
are excluded.

Safety: over-hash (hashing an encoder param that doesn't run at
runtime) is harmless; under-hash (missing a param that affects output)
is prevented by collecting from all four encoder methods regardless of
runtime data presence.
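
The "deep signature collection" idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in data_utils: `select_cache_relevant_kwargs`, `kwargs_hash`, and `ToyAdapter` are hypothetical, and the toy adapter declares only two `encode_*` methods for brevity.

```python
import hashlib
import inspect
import json

_VARIADIC = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)


def _named_params(func):
    """Names of the explicitly declared (non-*args/**kwargs) parameters."""
    params = inspect.signature(func).parameters.values()
    return {p.name for p in params if p.kind not in _VARIADIC}


def select_cache_relevant_kwargs(preprocess_func, kwargs):
    """Keep only kwargs named by preprocess_func or its encode_* targets."""
    named = _named_params(preprocess_func)
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD
        for p in inspect.signature(preprocess_func).parameters.values()
    )
    owner = getattr(preprocess_func, "__self__", None)  # set for bound methods
    if accepts_var_kw and owner is not None:
        # Collect from every encode_* method, whether or not it runs.
        for attr in dir(owner):
            if attr.startswith("encode_") and callable(getattr(owner, attr)):
                named |= _named_params(getattr(owner, attr))
    return {k: v for k, v in kwargs.items() if k in named}


def kwargs_hash(kwargs):
    """Stable short hash of the filtered kwargs."""
    payload = json.dumps(kwargs, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


class ToyAdapter:
    # Hypothetical adapter used only for this illustration.
    def encode_prompt(self, prompt, max_sequence_length=512):
        return prompt

    def encode_image(self, image, resolution=512):
        return image

    def preprocess_func(self, batch, **kwargs):
        return batch
```

With this filter, two launches that differ only in `num_batches_per_epoch` or `gradient_accumulation_steps` produce the same hash, while a change to `resolution` still invalidates the cache.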

Made-with: Cursor

* [examples] switch LoRA configs from DDP to DeepSpeed ZeRO-2

DeepSpeed ZeRO-2 shards optimizer states and gradients across ranks with
negligible overhead, making plain multi_gpu (DDP) redundant for LoRA training.

Made-with: Cursor

* [examples] docs: note LTX-2.3-Diffusers as an option in LTX2 configs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [examples] fix: enable offload_samples_to_cpu for LTX2 video configs

Matches Wan video-model configs. Per-sample audio+video tensors are
GB-scale, so without this option sample()/optimize() OOMs at
num_batches_per_epoch > 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [examples] docs: rewrite stale Qwen attn_backend comment for LTX2

Replace misleading "for Qwen-Image Series" comment with a
model-agnostic description of available backend options.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [models/ltx2,rewards] style: apply black and isort to new PR files

Cosmetic-only changes (import reorder, string-quote normalization,
long-line wrapping). Scoped to the 5 files this PR adds; existing
unclean files on main are out of scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix dtype source

* Update submodule

* [models,docs] refactor: align CFG handling across all adapters with forward-stage warning

Ensure every CFG-capable adapter follows a consistent two-stage pattern:

1. encode_prompt: derive do_classifier_free_guidance from guidance_scale
   (>1.0 standard, >0.0 for Z-Image); default negative_prompt to "" when None.
2. forward: if guidance_scale > threshold but negative_prompt_embeds is None,
   emit logger.warning and fall back gracefully to the no-CFG path.
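
The two-stage pattern above can be sketched as follows. This is a simplified illustration, not any adapter's real code: the function names are hypothetical, plain strings stand in for prompt embeddings, and `threshold` is 1.0 for the standard case and 0.0 for the Z-Image case.

```python
import logging

logger = logging.getLogger("cfg_sketch")


def encode_prompt_stage(prompt, negative_prompt=None, guidance_scale=1.0,
                        threshold=1.0):
    """Stage 1: derive the CFG flag and default the negative prompt to ''."""
    do_classifier_free_guidance = guidance_scale > threshold
    if do_classifier_free_guidance and negative_prompt is None:
        negative_prompt = ""
    return {
        "prompt": prompt,
        # Only carry a negative prompt when CFG is actually enabled.
        "negative_prompt": negative_prompt if do_classifier_free_guidance else None,
        "do_classifier_free_guidance": do_classifier_free_guidance,
    }


def resolve_cfg_in_forward(guidance_scale, negative_prompt_embeds,
                           threshold=1.0):
    """Stage 2: return True only if CFG is requested and can actually run."""
    do_classifier_free_guidance = guidance_scale > threshold
    if do_classifier_free_guidance and negative_prompt_embeds is None:
        logger.warning(
            "guidance_scale=%s requests CFG but negative_prompt_embeds is "
            "None; falling back to the no-CFG path.",
            guidance_scale,
        )
        return False
    return do_classifier_free_guidance
```

Under these assumptions, `resolve_cfg_in_forward(4.0, None)` logs a warning and returns `False` instead of raising.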

Adapter-specific changes:
- flux2_klein: extract do_classifier_free_guidance variable in _forward
- sd3_5: add forward warning; migrate to setup_logger; drop unused import
- z_image: add forward warning (threshold > 0.0)
- wan2_t2v/v2v/i2v: add forward warning; unify negative_prompt expansion
- qwen_image: add guidance_scale to encode_prompt; add forward warning
- qwen_image_edit_plus: add guidance_scale to encode_prompt; rename
  true_cfg_scale/do_true_cfg to guidance_scale/do_classifier_free_guidance;
  add _forward warning
- ltx2_t2av/i2av: add forward warning for multi-guidance (video + audio)

Document the CFG convention in .agents/knowledge/topics/adapter_conventions.md
with a reference implementation, a model-specific extensions table, and gotcha #7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* [examples,docs,agents] refactor: restructure examples, add LTX-2 docs, sync agent knowledge

- Restructure examples/ to algorithm/ft/model/variant.yaml with examples/README.md
- Add LTX-2/2.3 to README (News, model table, install note)
- Add .scratch/ constraint for agent temp files (#28), examples convention (#29)
- Sync agent knowledge: GroupDistributedSampler in samplers.md, LTX2 + RationalRewards in architecture.md
- Clean up .docs/ltx2-research/ dev artifacts
- Update LTX2 configs: guidance_scale=1.0, comment out attn_backend

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>