Fix component management#27
Merged
Merged
Conversation
5 tasks
Jayce-Ping
added a commit
that referenced
this pull request
Apr 17, 2026
… unused import)
Each fix below maps to a specific Copilot review comment from Round-3
(2026-04-17). All five sites are documentation / comment changes that
bring the docstrings in line with current behavior, plus one trivial
unused-import removal flagged in the same review.
- audio.py load_audio (Comment id 3098052328):
Removed the false `Raises: ImportError` claim — load_audio never
raises ImportError because the implementation falls back to the stdlib
`wave` module if both torchaudio and soundfile are unavailable. Added
a Note: section listing all three backends in priority order
(torchaudio primary; soundfile wav/flac/ogg; stdlib wave WAV-only
16/32-bit PCM) so callers know which formats survive each fallback.
- audio.py _load_audio_backend (related to Comment 3098052328, found via
audit):
Replaced the cryptic "Priority: torchaudio > soundfile + torch" line
with a numbered backend chain that documents:
1. torchaudio.load — primary
2. soundfile.read — (T, C) transposed to (C, T)
3. stdlib wave — WAV-only last-resort
Added Raises: section for the WAV-fallback ValueError.
- audio.py hash_audio (Comment id 3098052292):
The docstring claimed "Longer audio is uniformly subsampled", but the
implementation uses `step = n // max_samples`, which collapses to
step == 1 (effective truncation to the first max_samples samples)
when max_samples < n < 2 * max_samples. Added a Note: section that
documents the truncation-vs-uniform-stride boundary at n = 2*max_samples
and clarifies that determinism (cache-key intent), not collision
resistance, is the design goal — keeping the existing implementation
intact and accurately documented.
- loader.py import torch (Comment id 3098052303):
Removed unused `import torch` (line 21). Only `torch.utils.data` is
used and that import on line 22 is satisfied independently.
- loader.py is_orchestrator block (Comment id 3098052316):
Rewrote the comment to be precise per case. The old comment claimed
"is_orchestrator prevents multi-node races ... when the FS is shared",
which is only true in "global" mode. Clarified:
- non-distributed: the lone process
- "global" mode: rank-0 globally — required when cache_dir is on a
shared FS (eliminates cross-node race on
shutil.rmtree / sentinel writes)
- "local" mode: per-node local main — ASSUMES cache_dir is on
node-local storage; pointing "local" mode at a
shared FS WILL race across node-local mains and
is unsupported
The user's contract (already documented on _create_or_load_dataset's
`preprocess_parallelism` arg) is now consistent with the comment at
the actual orchestrator selection site.
Comment id 3098052262 (save_audio "converted to int16") was already
addressed by the Round-3 commit (49d5a54) when save_audio was reduced to
a single torchaudio.save call — no separate fix needed here.
Per .agents/knowledge/constraints.md #27 (Google-style docstrings) and
philosophy.md #3 (fail-fast, no speculative code): the docstrings now
faithfully reflect what the code does, including the corner cases that
Round-2 narrowing exposed.
Verification: ReadLints — clean
black --check / diff — 358 lines, byte-identical to parent (49d5a54)
isort --check / diff — 18 lines, byte-identical to parent (49d5a54)
(all 358 + 18 are pre-existing on origin/main,
outside the edited regions; not touched per
"structural vs behavioral separation" — same
rationale as Phase 2/3.)
smoke test — load_audio docstring has no ImportError + names
all 3 backends; _load_audio_backend uses Backend
chain phrasing + names wave; hash_audio mentions
truncation + 2*max_samples boundary; loader.py
is_orchestrator comment includes 'unsupported'
and 'node-local'; no `^import torch$` survives
in loader.py.
Made-with: Cursor
5 tasks
Jayce-Ping
added a commit
that referenced
this pull request
Apr 17, 2026
… merge (#129) * [utils] feat: add audio utility module with type aliases, standardization, and loading Add `utils/audio.py` following the exact pattern of `utils/image.py` and `utils/video.py`. Provides a complete audio waveform toolkit: - Type aliases: `AudioSingle` (C,T), `AudioBatch` (B,C,T) - Validation: `is_audio()`, `is_audio_batch()` - Loading/saving: `load_audio()`, `save_audio()` with 3-tier backend fallback (torchaudio → soundfile → stdlib wave) - Conversion: `audio_to_tensor()`, `audio_to_numpy()`, `convert_audio()` for resampling and mono/stereo conversion - Standardization: `standardize_audio_batch()` with output_type='pt'|'np' - Hashing: `hash_audio()`, `hash_audio_list()` with int16 quantization Design conventions follow diffusers/audiocraft: - Channel-first (C, T) tensor layout - [-1.0, 1.0] float32 value range - Channel conversion: downmix=mean, upmix=repeat Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Made-with: Cursor * [data] feat: support audio loading and preprocessing in dataset pipeline - Add `audio_dir` field to DataArguments (defaults to 'audios' subfolder) - Add audio loading block in GeneralDataset._preprocess_batch() that mirrors the existing video loading pattern: detect 'audio' column in JSONL, load via load_audio(), pass to preprocess_func(audios=...) - Add 'audios' to PREPROCESS_KEYS for metadata exclusion - Pass audio_dir through fn_kwargs to .map() call The audio_dir parameter flows automatically through loader.py via filter_kwargs — no changes needed in loader.py. Fully backward compatible: datasets without audio columns are unaffected. JSONL format: {"prompt": "...", "audio": "file.wav"} Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [data_utils] perf: eliminate redundant preprocessed-dataset writes Each rank's preprocessed Arrow shard is now written exactly once: the orchestrator routes Dataset.map output directly to the final per-rank location via cache_file_name= (under {merged_cache_path}.tmp/_parts/), and the consolidator writes only state.json + dataset_info.json before atomically renaming .tmp -> merged_cache_path. No row data is re-copied during the merge, no duplicate cache lands under ~/.cache/huggingface, and the build-dir sentinel _build_meta.json enables crash recovery for unchanged num_shards while wiping cleanly on num_shards changes. Single-process and distributed paths are unified through the same flow (N=1 case for single-process); enable_preprocess=False bypasses the consolidate pipeline entirely to preserve pre-refactor behavior. I/O budget per cache build drops from ~4*N*S to ~N*S bytes touched. Made-with: Cursor * [data_utils] fix: address PR #129 review - loader.py: gate build-dir prep on a shared is_orchestrator predicate (rank-0 globally for "global" mode, local main for "local" mode) and reuse it for the consolidator. Avoids concurrent shutil.rmtree / _build_meta.json writes when multiple nodes share the FS. - loader.py: align shard log with dataset.py convention {idx}/{N-1}. Addresses Copilot review comments #2 and #3 on PR #129. Made-with: Cursor * [utils] feat: add ndim validation to standardize_audio_batch Cherry-picked from a902429 (audio.py portion only). Adds explicit ValueError when input ndim is not in {1, 2, 3}, replacing silent fall-through. Aligns with the no-defensive-except.mdc fail-fast rule. The data_utils path only calls load_audio, never standardize_audio_batch, so this is a pure independent safety improvement to the audio utility module. Made-with: Cursor * [utils] fix: address PR #129 Comments 4-7 — fail-fast in audio.py + meta_matches recovery - audio.py save_audio: narrow torchaudio block to ImportError only, using try/import + else/save so genuine save failures (permission, encoding, disk full) propagate instead of silently falling back to soundfile or wave (Copilot Comment #4). - audio.py _load_audio_backend: same fix for torchaudio.load and soundfile.read blocks; the always-available wave fallback at the end stays as-is (Comment #5). - audio.py _resample: same fix for torchaudio.functional.resample; the scipy block already uses except ImportError correctly (Comment #6). - loader.py _meta_matches: catch (json.JSONDecodeError, OSError) so a corrupted sentinel from a previously crashed mid-write is treated as "stale" and triggers the wipe-and-recreate path, matching the existing "missing -> return False" semantics. Behavior is documented inline per no-defensive-except.mdc (Comment #7). All five sites now obey .cursor/rules/no-defensive-except.mdc: narrow, documented try/except limited to optional-import detection or recovery from a known on-disk failure mode. Made-with: Cursor * [utils] refactor: drop fallback chains in save_audio and _resample (fail-fast) torchaudio>=2.4.0 is a hard core dependency in pyproject.toml, so the optional-backend fallback chains in save_audio (soundfile + stdlib wave) and _resample (scipy + linear interpolation) were never genuinely optional - they only ever ran on broken installs, and silently substituted a different backend's behavior for what the caller asked for. Per .cursor/rules/no-defensive-except.mdc, the fail-fast posture is to surface the failure directly: - save_audio: now a single torchaudio.save call. If torchaudio is missing (broken install) the ImportError surfaces; if torchaudio.save raises (permission, encoding, disk full, unsupported extension) the exception surfaces with the original cause and full traceback. - _resample: now a single torchaudio.functional.resample call. dtype, device, and rate-mismatch errors surface directly instead of being silently masked by scipy or linear interp. Round-2 narrowed the broad except (ImportError, Exception) to ImportError-only (commit f8fb452); Round-3 goes further and removes the fallbacks themselves, since the underlying premise - "torchaudio might be missing" - is false for this codebase. _load_audio_backend's soundfile fallback is intentionally retained: soundfile is NOT a hard dep, and is needed for users who load formats torchaudio doesn't support out-of-the-box. That stays as a documented optional-backend pattern. Made-with: Cursor * [utils,data_utils] docs: address PR #129 Round-3 review (docstrings + unused import) Each fix below maps to a specific Copilot review comment from Round-3 (2026-04-17). All five sites are documentation / comment changes that bring the docstrings in line with current behavior, plus one trivial unused-import removal flagged in the same review. - audio.py load_audio (Comment id 3098052328): Removed the false `Raises: ImportError` claim — load_audio never raises ImportError because the implementation falls back to the stdlib `wave` module if both torchaudio and soundfile are unavailable. Added a Note: section listing all three backends in priority order (torchaudio primary; soundfile wav/flac/ogg; stdlib wave WAV-only 16/32-bit PCM) so callers know which formats survive each fallback. - audio.py _load_audio_backend (related to Comment 3098052328, found via audit): Replaced the cryptic "Priority: torchaudio > soundfile + torch" line with a numbered backend chain that documents: 1. torchaudio.load — primary 2. soundfile.read — (T, C) transposed to (C, T) 3. stdlib wave — WAV-only last-resort Added Raises: section for the WAV-fallback ValueError. - audio.py hash_audio (Comment id 3098052292): The docstring claimed "Longer audio is uniformly subsampled", but the implementation uses `step = n // max_samples`, which collapses to step == 1 (effective truncation to the first max_samples samples) when max_samples < n < 2 * max_samples. Added a Note: section that documents the truncation-vs-uniform-stride boundary at n = 2*max_samples and clarifies that determinism (cache-key intent), not collision resistance, is the design goal — keeping the existing implementation intact and accurately documented. - loader.py import torch (Comment id 3098052303): Removed unused `import torch` (line 21). Only `torch.utils.data` is used and that import on line 22 is satisfied independently. - loader.py is_orchestrator block (Comment id 3098052316): Rewrote the comment to be precise per case. The old comment claimed "is_orchestrator prevents multi-node races ... when the FS is shared", which is only true in "global" mode. Clarified: - non-distributed: the lone process - "global" mode: rank-0 globally — required when cache_dir is on a shared FS (eliminates cross-node race on shutil.rmtree / sentinel writes) - "local" mode: per-node local main — ASSUMES cache_dir is on node-local storage; pointing "local" mode at a shared FS WILL race across node-local mains and is unsupported The user's contract (already documented on _create_or_load_dataset's `preprocess_parallelism` arg) is now consistent with the comment at the actual orchestrator selection site. Comment id 3098052262 (save_audio "converted to int16") was already addressed by the Round-3 commit (49d5a54) when save_audio was reduced to a single torchaudio.save call — no separate fix needed here. Per .agents/knowledge/constraints.md #27 (Google-style docstrings) and philosophy.md #3 (fail-fast, no speculative code): the docstrings now faithfully reflect what the code does, including the corner cases that Round-2 narrowing exposed. Verification: ReadLints — clean black --check / diff — 358 lines, byte-identical to parent (49d5a54) isort --check / diff — 18 lines, byte-identical to parent (49d5a54) (all 358 + 18 are pre-existing on origin/main, outside the edited regions; not touched per "structural vs behavioral separation" — same rationale as Phase 2/3.) smoke test — load_audio docstring has no ImportError + names all 3 backends; _load_audio_backend uses Backend chain phrasing + names wave; hash_audio mentions truncation + 2*max_samples boundary; loader.py is_orchestrator comment includes 'unsupported' and 'node-local'; no `^import torch$` survives in loader.py. Made-with: Cursor * [data_utils,guidance] docs: complete audio_dir docstring coverage (audit) Round-5 follow-up to the design-principle audit on PR #129 — the audit found three audio_dir-related docstring drifts introduced by our own commits (27b5967, fa2e4fd) that Copilot did not flag: 1. dataset.py GeneralDataset.__init__: - We added audio_dir/target_arrow_path to the signature in fa2e4fd and only documented the latter. This commit adds Args entries for image_dir + video_dir + audio_dir + extra_hash_strs (the directory trio kept identical wording for symmetry; extra_hash_strs was pre-PR undocumented but is the natural workaround for the cache issue documented in the new Note). - Added a Note: section calling out that image_dir/video_dir/audio_dir are NOT in compute_cache_path's fingerprint (pre-existing and not fixed in this PR per "structural vs behavioral separation" — see audit report on the PR), so a directory swap with relative JSONL paths needs force_reprocess=True or extra_hash_strs to bust the cache. Users were silently inheriting this footgun without warning. 2. dataset.py GeneralDataset._preprocess_batch: - Added audio_dir to the Args section. - Renumbered the Workflow from 5+Returns to 7 explicit steps; the original 4-step body (text, images, videos, call) became 1/2/3/5, and we inserted "4. Load and prepare audio inputs" + an explicit metadata-pack step (7) that the prose previously left implicit. - Added a cross-reference to utils.audio.load_audio so readers don't have to grep for the loader. 3. guidance/workflow.md Stage 1 _preprocess_batch snippet: - Signature now matches the real one (audio_dir is the third dir parameter). - Comment list extended to 7 steps, mirroring the docstring fix above. Audit summary that prompted this commit (full report posted as PR comment): - 0 new Copilot comments since Round-4 push (a790e8a). - 12 historical Copilot comments verified addressed across Rounds 1-4. - Distributed barriers correct: prep -> map -> consolidate, three wait_for_everyone() boundaries, no double-fire on enable_preprocess=False. - consolidate_parts is idempotent on retry (state.json/dataset_info.json rewrites are last-write-wins; os.replace recovers from partial wipe). - audio.py fail-fast posture is clean: only ImportError around optional backend imports; everything else (decode/resample/save) propagates. - Pre-existing patterns surfaced by the audit but intentionally NOT touched per Phase-2/3 convention (structural vs behavioral separation): * set_format try/except Exception: pass at L267-270 + L446-449 of dataset.py (introduced in aa0715e, predates our PR). * Bare 'except:' in _compute_function_hash (introduced in 884a9db, predates our PR). * data_args.py import order is third-party-before-stdlib (predates our PR; isort baseline on origin/main is 18 lines, identical here). - Latent risk worth user-side awareness (already covered by the new Note: in __init__): image_dir/video_dir/audio_dir omitted from fingerprint -> stale cache on directory swap with relative paths. Affects all three sibling fields equally; documenting was the appropriate scope for this PR. Verification: - ReadLints: 0 errors on both edited files. - 13/13 smoke-test assertions pass (audio_dir in __init__ Args, image_dir/video_dir entries added, Note: re fingerprint present, workflow numbered 1-7, audio_dir in _preprocess_batch Args, load_audio cross-ref present, workflow.md signature matches, workflow.md steps 4-7 in expected positions). - black --check fingerprint on dataset.py: 563 lines before my edits, 563 lines after my edits (byte-identical) - no new formatting drift. Made-with: Cursor * [data_utils,utils,models] refactor: homogenize multi-media batch types in _preprocess_batch + add encode_audio abstract method Why --- `_preprocess_batch` in `dataset.py` was producing three different per-sample types for the `audios` column within a single batch: empty sample -> None 1 audio -> Tensor N audios -> List[Tensor] This breaks HF Arrow's homogeneous-column requirement for `Dataset.map(batched=True)` / cache-write, and forces every downstream consumer to handle three input shapes instead of one. Image/video had a parallel latent bug: their empty-branch only updated `xx_args` but not `batch[xx]`, leaving `batch[images]`/`batch[videos]` shorter than the input batch size — only "didn't surface" because real datasets are typically all-or-nothing per modality. Additionally, the BaseAdapter ABC documented `prompt`/`images`/`videos` but had no `audios` plumbing, so the audio inputs prepared by the dataset preprocess step had no path to the model's encode step. What ---- `dataset.py _preprocess_batch`: * Drop the unwrap-single line (`audios[0] if len(audios) == 1 else audios`); audio always stored as `List[torch.Tensor]` per sample. * Empty image/video/audio branches now append `[]` to BOTH `xx_args` AND `batch`, so every column has length == input batch size and a uniform `List[List[Media]]` type. (`MultiImageBatch` / `MultiVideoBatch` / `MultiAudioBatch` invariant.) * Docstring rewritten to specify the new contract per modality and a Note explaining why `[]`-for-empty matters for HF Arrow. `utils/audio.py`: * Add `MultiAudioBatch = Union[List[AudioBatch], torch.Tensor, np.ndarray]` type alias mirroring `MultiImageBatch` (image.py) and `MultiVideoBatch` (video.py); register in `__all__` and module-level type-hierarchy docstring. `models/abc.py`: * Add `audios: Optional[List[Union[torch.Tensor, List[torch.Tensor]]]]` parameter to `BaseAdapter.preprocess_func` (mirrors the existing `images`/`videos` Union style). * Iterate `(audios, self.encode_audio)` in the encoder dispatch loop so audio inputs flow through to the model when present. * Add new `@abstractmethod encode_audio()` modeled on `encode_video`, documenting the (single / batch-of-single / batch-of-list) input forms and the `condition_audios` output convention. 11 concrete adapters (z_image, wan2_v2v/t2v/i2v, sd3_5, qwen_image/_edit_plus, flux1/1_kontext/2/2_klein): * Add a one-line `def encode_audio(...): pass` stub mirroring each file's local `encode_video` style and explanatory docstring (e.g., "Not needed for X models." / "X does not support audio encoding."). The ABC call path already short-circuits via `if input is not None`, so unused stubs are never invoked. Verified -------- * 0 lint errors on all 14 touched files. * AST smoke test passes: every dataset.py edit site landed; audio.py exports `MultiAudioBatch`; abc.py defines `encode_audio` as abstract with the expected docstring; all 11 adapters define `encode_audio`. * black-diff hunks confirm new lines follow each file's existing style (single-quoted dict keys in dataset.py; aligned-comment Union members in audio.py; `audios : Optional[...]` in abc.py to match `prompt :`, `images :`, `videos :`). No NEW pre-existing-style violations introduced. * No current adapter overrides `encode_audio` (audited via Grep), so the new abstract is a forward-looking contract for the LTX2 adapter on the feat branch and any future audio-enabled model. Refs ---- PR #129 follow-up (Round 6). Made-with: Cursor * [models] refactor: make all 4 BaseAdapter encoders non-abstract with no-op defaults Why --- Round-6 promoted `encode_audio` to a 4th `@abstractmethod` on `BaseAdapter`, which forced all 11 existing concrete adapters to gain a one-line `def encode_audio(...): pass` stub just to remain instantiable. That penalises adapters that legitimately don't need the modality (e.g., SD3.5, Flux.1/2/Kontext, Z-Image, Qwen-Image*, Wan2-T2V/I2V/V2V) — they have to write boilerplate forwarding nothing — and it scales linearly with every new modality we add (audio today, who-knows-what tomorrow). The right invariant for `BaseAdapter` is the OPPOSITE: an adapter only implements the encoders for the modalities it actually consumes; missing modalities should be a silent no-op handled by the base class. `preprocess_func` already has the right plumbing — it iterates over `(text, encode_prompt), (images, encode_image), (videos, encode_video), (audios, encode_audio)` and merges only non-`None` results — so a default `pass` (which returns `None`) integrates cleanly with that loop. What ---- 1. `models/abc.py`: - Drop `@abstractmethod` from `encode_prompt`, `encode_image`, `encode_video`, `encode_audio` (all 4 encoders). - Replace each body with a single `pass` (Python returns `None`, which `preprocess_func`'s "skip if not isinstance(res, dict)" branch already handles correctly — see lines ~1928-1948). - Tighten signatures to use the homogeneous batch types produced by `dataset.py._preprocess_batch` after Round-6: encode_prompt(prompt: List[str], **kwargs) encode_image(images: MultiImageBatch, **kwargs) encode_video(videos: MultiVideoBatch, **kwargs) encode_audio(audios: MultiAudioBatch, **kwargs) — i.e., always a batch (`List[List[Media]]` ragged or uniform-shape Tensor/ndarray), never a single sample. This matches what the dataset actually hands to the encoders and lets every override stop re-implementing batch detection. - Return type narrowed to `Optional[Dict[str, Union[List[Any], torch.Tensor]]]` to make the no-op default's `None` return explicit in the contract. - Rewrite docstrings in Google style (consistent with existing base methods, satisfies repo design-principle re docstrings): state the no-op default, the dispatch contract, the input `Multi*Batch` shape, and the canonical output keys (`condition_images` / `condition_videos` / `condition_audios`). 2. `models/{z_image,wan,stable_diffusion,qwen_image,flux}/*.py` (11 adapters): revert the Round-6 `def encode_audio(...): pass` stub. Each adapter now inherits the no-op default from `BaseAdapter`. Net diff vs `origin/main` for these 11 files is now exactly **0 lines** — they match the PR base again, so the PR's footprint shrinks from 12 files to 1 (`models/abc.py` only, on the `models/` side). 3. New imports in `models/abc.py`: `from ..utils.image import MultiImageBatch` `from ..utils.video import MultiVideoBatch` `from ..utils.audio import MultiAudioBatch` (the audio one was already added in Round-6 via `utils/audio.py`). Compat / risk ------------- - LTX-2 adapters (`ltx2_t2av.py`, `ltx2_i2v.py`) live on `feat/ltx2-audio-video-support` and already implement `encode_audio` properly. The new ABC accepts a more general `MultiAudioBatch` (a superset of what they currently expect) so their existing `Union[torch.Tensor, List[torch.Tensor]]` typed override is still valid Python. **No LTX-2 changes needed.** - Adapters that DO need text/image/video encoding still override the corresponding encoder (their override silently shadows the base no-op), so behaviour is unchanged for everyone. - `preprocess_func`'s strict-validation branch (lines ~1937-1948, "must return a non-empty dict ...") only fires when the encoder returns something OTHER than `None` and that something is not a proper dict — `None` from the default still skips integration cleanly (see `if res is not None:` guard above the validation). Verification ------------ - `git diff origin/main -- src/flow_factory/models/<adapter>.py` → 0 lines for all 11 adapters. - AST smoke test: - `BaseAdapter` defines exactly 4 encoders; none decorated with `@abstractmethod`. - Each encoder body is `pass` (with optional docstring). - Each encoder's first non-self arg has the expected `Multi*Batch` annotation; return type is `Optional[...]`. - No adapter (of the 11) defines `encode_audio` anymore. - `ReadLints` clean on all 12 touched files. - `black --check --diff` line counts on all 12 files DECREASED vs HEAD (Δ ranges -37 to -219), so this commit introduces zero new formatting violations. Made-with: Cursor * [docs] update: align BaseAdapter contract docs with R6/R7 (4 abstract + 4 no-op encoders, MultiAudioBatch) constraints.md #12, ff-new-model/SKILL.md, guidance/new_model.md were written for the pre-R6/R7 contract. Bring them in sync with what src/flow_factory/models/abc.py actually declares after this PR: - 4 abstract methods: load_pipeline, decode_latents, forward, inference (was 7). - 4 per-modality encoders (encode_prompt/image/video/audio) are non-abstract no-op defaults. Subclasses override only the modalities they consume; preprocess_func dispatches to all four and skips when the encoder returns None. - Multi-modal batch convention extended to include audios: every modality column from _preprocess_batch is List[List[Media]] (empty samples contribute [], single-item samples contribute [item]). - guidance/new_model.md gains an encode_audio block, audios in the inference signature comment, and an encode_audio checklist item. Pure docs change; no Python touched. Made-with: Cursor --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jayce-Ping
added a commit
that referenced
this pull request
Jun 14, 2026
Resync .agents/, .cursor/, guidance/, AGENTS.md and CLAUDE.md with the current code after plugin growth (9 trainers, 14 model adapters, 13 reward models). Fixes registry drift, wrong config/API facts and broken cross-references found in a full audit. - architecture.md/AGENTS.md: add diffusion-opd trainer, clap/imagebind/ geneval rewards, Bagel/LTX2 models; fix RationalRewards* class names - constraints.md: evaluate() is concrete (not abstract); index #28-29; paradigm (#7) and training-args (#16) lists; de-numbered line refs - philosophy.md: Accelerate (DDP/DeepSpeed ZeRO-1-2/FSDP) backend; fix #27 ref - guidance: scheduler.* config keys, real sample()/compute_advantages snippets, GenEval metadata convention, audio reward param, Bagel link - skills: model_name_or_path, default_target_modules, data.datasets, rewards-as-list, 9 trainers; CLAUDE.md imports AGENTS.md to avoid drift - topics/samplers.md: correct _resolve_sampler_type + AdvantageProcessor group_distributed paths; parity_testing set_scheduler_timesteps - hparams/model_args.py: model_type Literal now matches registry keys Co-authored-by: Cursor <cursoragent@cursor.com>
Jayce-Ping
added a commit
to Jayce-Ping/Flow-Factory-Private
that referenced
this pull request
Jul 2, 2026
Jayce-Ping
added a commit
to Jayce-Ping/Flow-Factory-Private
that referenced
this pull request
Jul 2, 2026
… merge (X-GenGroup#129) * [utils] feat: add audio utility module with type aliases, standardization, and loading Add `utils/audio.py` following the exact pattern of `utils/image.py` and `utils/video.py`. Provides a complete audio waveform toolkit: - Type aliases: `AudioSingle` (C,T), `AudioBatch` (B,C,T) - Validation: `is_audio()`, `is_audio_batch()` - Loading/saving: `load_audio()`, `save_audio()` with 3-tier backend fallback (torchaudio → soundfile → stdlib wave) - Conversion: `audio_to_tensor()`, `audio_to_numpy()`, `convert_audio()` for resampling and mono/stereo conversion - Standardization: `standardize_audio_batch()` with output_type='pt'|'np' - Hashing: `hash_audio()`, `hash_audio_list()` with int16 quantization Design conventions follow diffusers/audiocraft: - Channel-first (C, T) tensor layout - [-1.0, 1.0] float32 value range - Channel conversion: downmix=mean, upmix=repeat Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Made-with: Cursor * [data] feat: support audio loading and preprocessing in dataset pipeline - Add `audio_dir` field to DataArguments (defaults to 'audios' subfolder) - Add audio loading block in GeneralDataset._preprocess_batch() that mirrors the existing video loading pattern: detect 'audio' column in JSONL, load via load_audio(), pass to preprocess_func(audios=...) - Add 'audios' to PREPROCESS_KEYS for metadata exclusion - Pass audio_dir through fn_kwargs to .map() call The audio_dir parameter flows automatically through loader.py via filter_kwargs — no changes needed in loader.py. Fully backward compatible: datasets without audio columns are unaffected. JSONL format: {"prompt": "...", "audio": "file.wav"} Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [data_utils] perf: eliminate redundant preprocessed-dataset writes Each rank's preprocessed Arrow shard is now written exactly once: the orchestrator routes Dataset.map output directly to the final per-rank location via cache_file_name= (under {merged_cache_path}.tmp/_parts/), and the consolidator writes only state.json + dataset_info.json before atomically renaming .tmp -> merged_cache_path. No row data is re-copied during the merge, no duplicate cache lands under ~/.cache/huggingface, and the build-dir sentinel _build_meta.json enables crash recovery for unchanged num_shards while wiping cleanly on num_shards changes. Single-process and distributed paths are unified through the same flow (N=1 case for single-process); enable_preprocess=False bypasses the consolidate pipeline entirely to preserve pre-refactor behavior. I/O budget per cache build drops from ~4*N*S to ~N*S bytes touched. Made-with: Cursor * [data_utils] fix: address PR X-GenGroup#129 review - loader.py: gate build-dir prep on a shared is_orchestrator predicate (rank-0 globally for "global" mode, local main for "local" mode) and reuse it for the consolidator. Avoids concurrent shutil.rmtree / _build_meta.json writes when multiple nodes share the FS. - loader.py: align shard log with dataset.py convention {idx}/{N-1}. Addresses Copilot review comments X-GenGroup#2 and X-GenGroup#3 on PR X-GenGroup#129. Made-with: Cursor * [utils] feat: add ndim validation to standardize_audio_batch Cherry-picked from a902429 (audio.py portion only). Adds explicit ValueError when input ndim is not in {1, 2, 3}, replacing silent fall-through. Aligns with the no-defensive-except.mdc fail-fast rule. The data_utils path only calls load_audio, never standardize_audio_batch, so this is a pure independent safety improvement to the audio utility module. Made-with: Cursor * [utils] fix: address PR X-GenGroup#129 Comments 4-7 — fail-fast in audio.py + meta_matches recovery - audio.py save_audio: narrow torchaudio block to ImportError only, using try/import + else/save so genuine save failures (permission, encoding, disk full) propagate instead of silently falling back to soundfile or wave (Copilot Comment X-GenGroup#4). - audio.py _load_audio_backend: same fix for torchaudio.load and soundfile.read blocks; the always-available wave fallback at the end stays as-is (Comment X-GenGroup#5). - audio.py _resample: same fix for torchaudio.functional.resample; the scipy block already uses except ImportError correctly (Comment X-GenGroup#6). - loader.py _meta_matches: catch (json.JSONDecodeError, OSError) so a corrupted sentinel from a previously crashed mid-write is treated as "stale" and triggers the wipe-and-recreate path, matching the existing "missing -> return False" semantics. Behavior is documented inline per no-defensive-except.mdc (Comment X-GenGroup#7). All five sites now obey .cursor/rules/no-defensive-except.mdc: narrow, documented try/except limited to optional-import detection or recovery from a known on-disk failure mode. Made-with: Cursor * [utils] refactor: drop fallback chains in save_audio and _resample (fail-fast) torchaudio>=2.4.0 is a hard core dependency in pyproject.toml, so the optional-backend fallback chains in save_audio (soundfile + stdlib wave) and _resample (scipy + linear interpolation) were never genuinely optional - they only ever ran on broken installs, and silently substituted a different backend's behavior for what the caller asked for. Per .cursor/rules/no-defensive-except.mdc, the fail-fast posture is to surface the failure directly: - save_audio: now a single torchaudio.save call. If torchaudio is missing (broken install) the ImportError surfaces; if torchaudio.save raises (permission, encoding, disk full, unsupported extension) the exception surfaces with the original cause and full traceback. - _resample: now a single torchaudio.functional.resample call. dtype, device, and rate-mismatch errors surface directly instead of being silently masked by scipy or linear interp. Round-2 narrowed the broad except (ImportError, Exception) to ImportError-only (commit f8fb452); Round-3 goes further and removes the fallbacks themselves, since the underlying premise - "torchaudio might be missing" - is false for this codebase. _load_audio_backend's soundfile fallback is intentionally retained: soundfile is NOT a hard dep, and is needed for users who load formats torchaudio doesn't support out-of-the-box. That stays as a documented optional-backend pattern. Made-with: Cursor * [utils,data_utils] docs: address PR X-GenGroup#129 Round-3 review (docstrings + unused import) Each fix below maps to a specific Copilot review comment from Round-3 (2026-04-17). All five sites are documentation / comment changes that bring the docstrings in line with current behavior, plus one trivial unused-import removal flagged in the same review. - audio.py load_audio (Comment id 3098052328): Removed the false `Raises: ImportError` claim — load_audio never raises ImportError because the implementation falls back to the stdlib `wave` module if both torchaudio and soundfile are unavailable. Added a Note: section listing all three backends in priority order (torchaudio primary; soundfile wav/flac/ogg; stdlib wave WAV-only 16/32-bit PCM) so callers know which formats survive each fallback. - audio.py _load_audio_backend (related to Comment 3098052328, found via audit): Replaced the cryptic "Priority: torchaudio > soundfile + torch" line with a numbered backend chain that documents: 1. torchaudio.load — primary 2. soundfile.read — (T, C) transposed to (C, T) 3. stdlib wave — WAV-only last-resort Added Raises: section for the WAV-fallback ValueError. - audio.py hash_audio (Comment id 3098052292): The docstring claimed "Longer audio is uniformly subsampled", but the implementation uses `step = n // max_samples`, which collapses to step == 1 (effective truncation to the first max_samples samples) when max_samples < n < 2 * max_samples. Added a Note: section that documents the truncation-vs-uniform-stride boundary at n = 2*max_samples and clarifies that determinism (cache-key intent), not collision resistance, is the design goal — keeping the existing implementation intact and accurately documented. - loader.py import torch (Comment id 3098052303): Removed unused `import torch` (line 21). Only `torch.utils.data` is used and that import on line 22 is satisfied independently. - loader.py is_orchestrator block (Comment id 3098052316): Rewrote the comment to be precise per case. The old comment claimed "is_orchestrator prevents multi-node races ... when the FS is shared", which is only true in "global" mode. Clarified: - non-distributed: the lone process - "global" mode: rank-0 globally — required when cache_dir is on a shared FS (eliminates cross-node race on shutil.rmtree / sentinel writes) - "local" mode: per-node local main — ASSUMES cache_dir is on node-local storage; pointing "local" mode at a shared FS WILL race across node-local mains and is unsupported The user's contract (already documented on _create_or_load_dataset's `preprocess_parallelism` arg) is now consistent with the comment at the actual orchestrator selection site. Comment id 3098052262 (save_audio "converted to int16") was already addressed by the Round-3 commit (49d5a54) when save_audio was reduced to a single torchaudio.save call — no separate fix needed here. Per .agents/knowledge/constraints.md X-GenGroup#27 (Google-style docstrings) and philosophy.md X-GenGroup#3 (fail-fast, no speculative code): the docstrings now faithfully reflect what the code does, including the corner cases that Round-2 narrowing exposed. Verification: ReadLints — clean black --check / diff — 358 lines, byte-identical to parent (49d5a54) isort --check / diff — 18 lines, byte-identical to parent (49d5a54) (all 358 + 18 are pre-existing on origin/main, outside the edited regions; not touched per "structural vs behavioral separation" — same rationale as Phase 2/3.) smoke test — load_audio docstring has no ImportError + names all 3 backends; _load_audio_backend uses Backend chain phrasing + names wave; hash_audio mentions truncation + 2*max_samples boundary; loader.py is_orchestrator comment includes 'unsupported' and 'node-local'; no `^import torch$` survives in loader.py. Made-with: Cursor * [data_utils,guidance] docs: complete audio_dir docstring coverage (audit) Round-5 follow-up to the design-principle audit on PR X-GenGroup#129 — the audit found three audio_dir-related docstring drifts introduced by our own commits (27b5967, fa2e4fd) that Copilot did not flag: 1. dataset.py GeneralDataset.__init__: - We added audio_dir/target_arrow_path to the signature in fa2e4fd and only documented the latter. This commit adds Args entries for image_dir + video_dir + audio_dir + extra_hash_strs (the directory trio kept identical wording for symmetry; extra_hash_strs was pre-PR undocumented but is the natural workaround for the cache issue documented in the new Note). - Added a Note: section calling out that image_dir/video_dir/audio_dir are NOT in compute_cache_path's fingerprint (pre-existing and not fixed in this PR per "structural vs behavioral separation" — see audit report on the PR), so a directory swap with relative JSONL paths needs force_reprocess=True or extra_hash_strs to bust the cache. Users were silently inheriting this footgun without warning. 2. dataset.py GeneralDataset._preprocess_batch: - Added audio_dir to the Args section. - Renumbered the Workflow from 5+Returns to 7 explicit steps; the original 4-step body (text, images, videos, call) became 1/2/3/5, and we inserted "4. Load and prepare audio inputs" + an explicit metadata-pack step (7) that the prose previously left implicit. - Added a cross-reference to utils.audio.load_audio so readers don't have to grep for the loader. 3. guidance/workflow.md Stage 1 _preprocess_batch snippet: - Signature now matches the real one (audio_dir is the third dir parameter). - Comment list extended to 7 steps, mirroring the docstring fix above. Audit summary that prompted this commit (full report posted as PR comment): - 0 new Copilot comments since Round-4 push (a790e8a). - 12 historical Copilot comments verified addressed across Rounds 1-4. - Distributed barriers correct: prep -> map -> consolidate, three wait_for_everyone() boundaries, no double-fire on enable_preprocess=False. - consolidate_parts is idempotent on retry (state.json/dataset_info.json rewrites are last-write-wins; os.replace recovers from partial wipe). - audio.py fail-fast posture is clean: only ImportError around optional backend imports; everything else (decode/resample/save) propagates. - Pre-existing patterns surfaced by the audit but intentionally NOT touched per Phase-2/3 convention (structural vs behavioral separation): * set_format try/except Exception: pass at L267-270 + L446-449 of dataset.py (introduced in aa0715e, predates our PR). * Bare 'except:' in _compute_function_hash (introduced in 884a9db, predates our PR). * data_args.py import order is third-party-before-stdlib (predates our PR; isort baseline on origin/main is 18 lines, identical here). - Latent risk worth user-side awareness (already covered by the new Note: in __init__): image_dir/video_dir/audio_dir omitted from fingerprint -> stale cache on directory swap with relative paths. Affects all three sibling fields equally; documenting was the appropriate scope for this PR. Verification: - ReadLints: 0 errors on both edited files. - 13/13 smoke-test assertions pass (audio_dir in __init__ Args, image_dir/video_dir entries added, Note: re fingerprint present, workflow numbered 1-7, audio_dir in _preprocess_batch Args, load_audio cross-ref present, workflow.md signature matches, workflow.md steps 4-7 in expected positions). - black --check fingerprint on dataset.py: 563 lines before my edits, 563 lines after my edits (byte-identical) - no new formatting drift. Made-with: Cursor * [data_utils,utils,models] refactor: homogenize multi-media batch types in _preprocess_batch + add encode_audio abstract method Why --- `_preprocess_batch` in `dataset.py` was producing three different per-sample types for the `audios` column within a single batch: empty sample -> None 1 audio -> Tensor N audios -> List[Tensor] This breaks HF Arrow's homogeneous-column requirement for `Dataset.map(batched=True)` / cache-write, and forces every downstream consumer to handle three input shapes instead of one. Image/video had a parallel latent bug: their empty-branch only updated `xx_args` but not `batch[xx]`, leaving `batch[images]`/`batch[videos]` shorter than the input batch size — only "didn't surface" because real datasets are typically all-or-nothing per modality. Additionally, the BaseAdapter ABC documented `prompt`/`images`/`videos` but had no `audios` plumbing, so the audio inputs prepared by the dataset preprocess step had no path to the model's encode step. What ---- `dataset.py _preprocess_batch`: * Drop the unwrap-single line (`audios[0] if len(audios) == 1 else audios`); audio always stored as `List[torch.Tensor]` per sample. * Empty image/video/audio branches now append `[]` to BOTH `xx_args` AND `batch`, so every column has length == input batch size and a uniform `List[List[Media]]` type. (`MultiImageBatch` / `MultiVideoBatch` / `MultiAudioBatch` invariant.) * Docstring rewritten to specify the new contract per modality and a Note explaining why `[]`-for-empty matters for HF Arrow. `utils/audio.py`: * Add `MultiAudioBatch = Union[List[AudioBatch], torch.Tensor, np.ndarray]` type alias mirroring `MultiImageBatch` (image.py) and `MultiVideoBatch` (video.py); register in `__all__` and module-level type-hierarchy docstring. `models/abc.py`: * Add `audios: Optional[List[Union[torch.Tensor, List[torch.Tensor]]]]` parameter to `BaseAdapter.preprocess_func` (mirrors the existing `images`/`videos` Union style). * Iterate `(audios, self.encode_audio)` in the encoder dispatch loop so audio inputs flow through to the model when present. * Add new `@abstractmethod encode_audio()` modeled on `encode_video`, documenting the (single / batch-of-single / batch-of-list) input forms and the `condition_audios` output convention. 11 concrete adapters (z_image, wan2_v2v/t2v/i2v, sd3_5, qwen_image/_edit_plus, flux1/1_kontext/2/2_klein): * Add a one-line `def encode_audio(...): pass` stub mirroring each file's local `encode_video` style and explanatory docstring (e.g., "Not needed for X models." / "X does not support audio encoding."). The ABC call path already short-circuits via `if input is not None`, so unused stubs are never invoked. Verified -------- * 0 lint errors on all 14 touched files. * AST smoke test passes: every dataset.py edit site landed; audio.py exports `MultiAudioBatch`; abc.py defines `encode_audio` as abstract with the expected docstring; all 11 adapters define `encode_audio`. * black-diff hunks confirm new lines follow each file's existing style (single-quoted dict keys in dataset.py; aligned-comment Union members in audio.py; `audios : Optional[...]` in abc.py to match `prompt :`, `images :`, `videos :`). No NEW pre-existing-style violations introduced. * No current adapter overrides `encode_audio` (audited via Grep), so the new abstract is a forward-looking contract for the LTX2 adapter on the feat branch and any future audio-enabled model. Refs ---- PR X-GenGroup#129 follow-up (Round 6). Made-with: Cursor * [models] refactor: make all 4 BaseAdapter encoders non-abstract with no-op defaults Why --- Round-6 promoted `encode_audio` to a 4th `@abstractmethod` on `BaseAdapter`, which forced all 11 existing concrete adapters to gain a one-line `def encode_audio(...): pass` stub just to remain instantiable. That penalises adapters that legitimately don't need the modality (e.g., SD3.5, Flux.1/2/Kontext, Z-Image, Qwen-Image*, Wan2-T2V/I2V/V2V) — they have to write boilerplate forwarding nothing — and it scales linearly with every new modality we add (audio today, who-knows-what tomorrow). The right invariant for `BaseAdapter` is the OPPOSITE: an adapter only implements the encoders for the modalities it actually consumes; missing modalities should be a silent no-op handled by the base class. `preprocess_func` already has the right plumbing — it iterates over `(text, encode_prompt), (images, encode_image), (videos, encode_video), (audios, encode_audio)` and merges only non-`None` results — so a default `pass` (which returns `None`) integrates cleanly with that loop. What ---- 1. `models/abc.py`: - Drop `@abstractmethod` from `encode_prompt`, `encode_image`, `encode_video`, `encode_audio` (all 4 encoders). - Replace each body with a single `pass` (Python returns `None`, which `preprocess_func`'s "skip if not isinstance(res, dict)" branch already handles correctly — see lines ~1928-1948). - Tighten signatures to use the homogeneous batch types produced by `dataset.py._preprocess_batch` after Round-6: encode_prompt(prompt: List[str], **kwargs) encode_image(images: MultiImageBatch, **kwargs) encode_video(videos: MultiVideoBatch, **kwargs) encode_audio(audios: MultiAudioBatch, **kwargs) — i.e., always a batch (`List[List[Media]]` ragged or uniform-shape Tensor/ndarray), never a single sample. This matches what the dataset actually hands to the encoders and lets every override stop re-implementing batch detection. - Return type narrowed to `Optional[Dict[str, Union[List[Any], torch.Tensor]]]` to make the no-op default's `None` return explicit in the contract. - Rewrite docstrings in Google style (consistent with existing base methods, satisfies repo design-principle re docstrings): state the no-op default, the dispatch contract, the input `Multi*Batch` shape, and the canonical output keys (`condition_images` / `condition_videos` / `condition_audios`). 2. `models/{z_image,wan,stable_diffusion,qwen_image,flux}/*.py` (11 adapters): revert the Round-6 `def encode_audio(...): pass` stub. Each adapter now inherits the no-op default from `BaseAdapter`. Net diff vs `origin/main` for these 11 files is now exactly **0 lines** — they match the PR base again, so the PR's footprint shrinks from 12 files to 1 (`models/abc.py` only, on the `models/` side). 3. New imports in `models/abc.py`: `from ..utils.image import MultiImageBatch` `from ..utils.video import MultiVideoBatch` `from ..utils.audio import MultiAudioBatch` (the audio one was already added in Round-6 via `utils/audio.py`). Compat / risk ------------- - LTX-2 adapters (`ltx2_t2av.py`, `ltx2_i2v.py`) live on `feat/ltx2-audio-video-support` and already implement `encode_audio` properly. The new ABC accepts a more general `MultiAudioBatch` (a superset of what they currently expect) so their existing `Union[torch.Tensor, List[torch.Tensor]]` typed override is still valid Python. **No LTX-2 changes needed.** - Adapters that DO need text/image/video encoding still override the corresponding encoder (their override silently shadows the base no-op), so behaviour is unchanged for everyone. - `preprocess_func`'s strict-validation branch (lines ~1937-1948, "must return a non-empty dict ...") only fires when the encoder returns something OTHER than `None` and that something is not a proper dict — `None` from the default still skips integration cleanly (see `if res is not None:` guard above the validation). Verification ------------ - `git diff origin/main -- src/flow_factory/models/<adapter>.py` → 0 lines for all 11 adapters. - AST smoke test: - `BaseAdapter` defines exactly 4 encoders; none decorated with `@abstractmethod`. - Each encoder body is `pass` (with optional docstring). - Each encoder's first non-self arg has the expected `Multi*Batch` annotation; return type is `Optional[...]`. - No adapter (of the 11) defines `encode_audio` anymore. - `ReadLints` clean on all 12 touched files. - `black --check --diff` line counts on all 12 files DECREASED vs HEAD (Δ ranges -37 to -219), so this commit introduces zero new formatting violations. Made-with: Cursor * [docs] update: align BaseAdapter contract docs with R6/R7 (4 abstract + 4 no-op encoders, MultiAudioBatch) constraints.md X-GenGroup#12, ff-new-model/SKILL.md, guidance/new_model.md were written for the pre-R6/R7 contract. Bring them in sync with what src/flow_factory/models/abc.py actually declares after this PR: - 4 abstract methods: load_pipeline, decode_latents, forward, inference (was 7). - 4 per-modality encoders (encode_prompt/image/video/audio) are non-abstract no-op defaults. Subclasses override only the modalities they consume; preprocess_func dispatches to all four and skips when the encoder returns None. - Multi-modal batch convention extended to include audios: every modality column from _preprocess_batch is List[List[Media]] (empty samples contribute [], single-item samples contribute [item]). - guidance/new_model.md gains an encode_audio block, audios in the inference signature comment, and an encode_audio checklist item. Pure docs change; no Python touched. Made-with: Cursor --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.