adding part of pipeline as smoke merge to main #2007
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Greptile Summary
This PR adds a speaker ID pipeline (embedding extraction, AHC/BIRCH clustering, per-utterance confidence scoring) and a UTMOSv2 MOS-scoring stage to the NeMo Curator audio processing path, along with unit tests and Fern documentation pages.
Confidence Score: 3/5
Two concrete defects in changed code: one breaks the entire audio stages namespace on import without NeMo, and one silently writes malformed WAV data for multi-channel waveforms passed to the UTMOSv2 scorer. The NeMo hard-import causes nemo_curator/stages/audio/speaker_id/speaker_embedding_lhotse.py and nemo_curator/stages/audio/metrics/utmosv2_score.py
Important Files Changed
Reviews (1): Last reviewed commit: "adding part of pipeline as smoke merge t..."
        return f"gpu{cv.split(',')[0]}"
    return f"pid{os.getpid()}"

if TYPE_CHECKING:
    from lhotse import Cut, CutSet

try:
    from nemo.collections.common.data.lhotse.nemo_adapters import (
        LazyNeMoIterator,
        LazyNeMoTarredIterator,
Unconditional top-level NeMo import breaks environments without NeMo
The module raises RuntimeError at import time if NeMo is not installed, because the try/except around the NeMo import re-raises as a hard error. Since speaker_id/__init__.py eagerly imports SpeakerEmbeddingLhotseStage from this module, and nemo_curator/stages/audio/__init__.py re-exports it, any import nemo_curator.stages.audio will fail in environments where NeMo is not installed. Moving the LazyNeMoIterator/LazyNeMoTarredIterator imports inside the process() method would make NeMo a lazy dependency and avoid breaking all other audio stages.
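A minimal sketch of the lazy-import pattern suggested above, assuming the stage only needs the NeMo iterators once processing starts. The helper `require_optional` and the class `SpeakerEmbeddingStageSketch` are hypothetical names used for illustration and are not part of the PR; only the NeMo module path and the two iterator names come from the diff.

```python
import importlib


def require_optional(module_name, names, feature):
    """Import `names` from `module_name` on demand (hypothetical helper).

    Raises RuntimeError with a readable message only when the feature is
    actually used, instead of failing when the package is imported.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{feature} requires the optional module '{module_name}'"
        ) from exc
    return tuple(getattr(module, name) for name in names)


class SpeakerEmbeddingStageSketch:
    """Illustrative stand-in for the stage; defining it never imports NeMo."""

    def process(self, task):
        # NeMo is resolved here, at call time, not at module import time.
        lazy_iter_cls, lazy_tarred_iter_cls = require_optional(
            "nemo.collections.common.data.lhotse.nemo_adapters",
            ("LazyNeMoIterator", "LazyNeMoTarredIterator"),
            feature="SpeakerEmbeddingStageSketch",
        )
        return lazy_iter_cls, lazy_tarred_iter_cls
```

With this shape, importing the package that defines the stage succeeds without NeMo, and the error only appears when `process()` is called.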
if waveform.ndim > 1:
    waveform = waveform.mean(axis=1)
Wrong axis for multi-channel stereo mix-down silently corrupts audio
When the waveform is in channels-first format (channels, samples) as used by NeMo and Lhotse, mean(axis=1) averages over the time dimension, yielding shape (channels,) instead of (samples,). That array, which holds one value per channel, is then resampled and written as a WAV only a few samples long, so UTMOSv2 scores garbage audio with no error. The correct axis for channels-first is 0.
Suggested change:
- if waveform.ndim > 1:
-     waveform = waveform.mean(axis=1)
+ if waveform.ndim > 1:
+     waveform = waveform.mean(axis=0)
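The axis difference can be checked with a small NumPy sketch. The helper `to_mono` and its `channels_first` flag are hypothetical and only illustrate the point of the comment; the synthetic stereo array is made up for the demonstration.

```python
import numpy as np


def to_mono(waveform, channels_first=True):
    """Average channels into a mono signal (hypothetical helper).

    channels_first=True expects shape (channels, samples);
    channels_first=False expects shape (samples, channels).
    """
    waveform = np.asarray(waveform)
    if waveform.ndim == 1:
        return waveform  # already mono
    channel_axis = 0 if channels_first else -1
    return waveform.mean(axis=channel_axis)


# Synthetic channels-first stereo: left channel all 0.0, right channel all 1.0.
stereo = np.stack([np.zeros(16000), np.ones(16000)])  # shape (2, 16000)

wrong = stereo.mean(axis=1)  # averages over time: one value per channel
right = to_mono(stereo)      # averages over channels: one value per sample

print(wrong.shape, right.shape)
```

Averaging over axis 1 collapses the time dimension, so the result has one entry per channel rather than one per sample.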
results = self._model.predict(
    input_dir=wav_dir,
    batch_size=self.inference_batch_size,
    num_repetitions=self.num_repetitions,
    predict_dataset=self.predict_dataset,
    num_workers=0,
    verbose=False,
)
_score_dir result ordering is not guaranteed
model.predict(input_dir=wav_dir) returns results whose order UTMOSv2 does not document. The code zips results directly against valid_indices relying on lexicographic file-name order. The zero-padded names happen to work today, but if a future UTMOSv2 version changes traversal order, MOS scores will be silently assigned to the wrong entries.
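One way to avoid depending on traversal order is to key each result by its file name instead of zipping positionally. The sketch below assumes each result is a dict with a `file_path` and a `predicted_mos` entry; those key names are an assumption about the UTMOSv2 output and should be checked against the installed version. `scores_by_index` and `index_by_name` are hypothetical names.

```python
from pathlib import Path


def scores_by_index(results, index_by_name,
                    path_key="file_path", score_key="predicted_mos"):
    """Map each score back to its original entry index via the file name.

    `index_by_name` maps the WAV file name written to the temp dir to the
    index of the entry it came from. The key names are assumptions.
    """
    scores = {}
    for row in results:
        name = Path(row[path_key]).name
        if name not in index_by_name:
            raise KeyError(f"unexpected file in results: {name}")
        scores[index_by_name[name]] = float(row[score_key])
    missing = set(index_by_name.values()) - set(scores)
    if missing:
        raise ValueError(f"no score returned for indices: {sorted(missing)}")
    return scores


# Results deliberately listed out of order to show the mapping still holds.
fake_results = [
    {"file_path": "/tmp/wavs/000002.wav", "predicted_mos": 3.1},
    {"file_path": "/tmp/wavs/000000.wav", "predicted_mos": 4.2},
]
mapping = scores_by_index(fake_results, {"000000.wav": 7, "000002.wav": 9})
```

Here entry 7 receives 4.2 and entry 9 receives 3.1 regardless of the order in which the rows arrive, and a missing or unexpected file raises instead of shifting scores silently.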
try:
    import librosa
    return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
except ImportError:
    ratio = target_sr / orig_sr
    indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int)
    indices = indices[indices < len(audio)]
    return audio[indices]
Fallback resampler performs nearest-neighbor, not linear interpolation
The fallback uses np.round to pick the nearest existing sample index, which is nearest-neighbor resampling. The comment should say so, or the implementation should use np.interp for true linear interpolation.
Suggested change:
- try:
-     import librosa
-     return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
- except ImportError:
-     ratio = target_sr / orig_sr
-     indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int)
-     indices = indices[indices < len(audio)]
-     return audio[indices]
+ try:
+     import librosa
+     return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
+ except ImportError:
+     # Nearest-neighbour fallback (low quality - prefer installing librosa).
+     ratio = target_sr / orig_sr
+     indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int)
+     indices = indices[indices < len(audio)]
+     return audio[indices]
Description
Usage
# Add snippet demonstrating usage
Checklist