adding part of pipeline as smoke merge to main#2007

Open
Jorjeous wants to merge 1 commit into main from Test_pipeline_MR

Conversation

@Jorjeous
Member

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Jorjeous requested a review from a team as a code owner on May 21, 2026 12:50
Jorjeous requested review from oyilmaz-nvidia and removed the request for a team on May 21, 2026 12:50
@copy-pr-bot

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR adds a speaker ID pipeline (embedding extraction, AHC/BIRCH clustering, per-utterance confidence scoring) and a UTMOSv2 MOS-scoring stage to the NeMo Curator audio processing path, along with unit tests and Fern documentation pages.

  • speaker_embedding_lhotse.py imports NeMo at module level and raises RuntimeError immediately if NeMo is absent; since the audio package re-exports this class eagerly, the whole nemo_curator.stages.audio namespace breaks for users without NeMo installed.
  • utmosv2_score.py's _waveform_to_wav calls waveform.mean(axis=1) for stereo mix-down, which is the wrong axis for channels-first layout, silently writing a 1-sample WAV and scoring garbage audio.

Confidence Score: 3/5

Two concrete defects in changed code: one breaks the entire audio stages namespace on import without NeMo, and one silently writes malformed WAV data for multi-channel waveforms passed to the UTMOSv2 scorer.

The NeMo hard-import causes import nemo_curator.stages.audio to fail in any NeMo-free environment, affecting all audio pipeline users. The wrong-axis mix-down produces silently wrong MOS scores for multi-channel audio from NemoTarShardReaderStage without raising any exception.

nemo_curator/stages/audio/speaker_id/speaker_embedding_lhotse.py and nemo_curator/stages/audio/metrics/utmosv2_score.py

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/stages/audio/metrics/utmosv2_score.py | New UTMOSv2 MOS scoring stage; contains a wrong-axis stereo mix-down bug that silently produces garbage scores for multi-channel waveforms, and relies on undocumented UTMOSv2 result ordering. |
| nemo_curator/stages/audio/speaker_id/speaker_embedding_lhotse.py | New Lhotse-based speaker embedding stage; hard top-level NeMo import raises RuntimeError at import time in environments without NeMo, breaking the entire audio stages namespace. |
| nemo_curator/stages/audio/__init__.py | Eagerly re-exports SpeakerEmbeddingLhotseStage, transitively forcing a hard NeMo import at package load time. |
| nemo_curator/stages/audio/speaker_id/speaker_clustering_and_scoring.py | New AHC + confidence-scoring stage; global/shard/grouped clustering modes look correct; offset-based shard label slice-back logic is sound. |
| nemo_curator/stages/audio/speaker_id/clustering/large_scale_clustering_and_scoring.py | New BIRCH + AHC large-scale clustering pipeline; leaf-cap backoff logic and tiled assignment are well-implemented. |
| nemo_curator/stages/audio/speaker_id/speaker_embedding_audiotask.py | New AudioTask-native speaker embedding stage; per-batch NPZ flush with clear() after save looks correct. |
| nemo_curator/stages/audio/speaker_id/embedding/model_loader.py | Custom WeSpeaker model loader bypassing wespeaker/__init__.py via importlib; logic is sound. |
| nemo_curator/stages/audio/speaker_id/clustering/ahc.py | New AHC utility module; cosine-distance AHC, cluster quality, and per-utterance confidence all look correct. |

Reviews (1): Last reviewed commit: "adding part of pipeline as smoke merge t..."

Comment on lines +63 to +72
        return f"gpu{cv.split(',')[0]}"
    return f"pid{os.getpid()}"

if TYPE_CHECKING:
    from lhotse import Cut, CutSet

try:
    from nemo.collections.common.data.lhotse.nemo_adapters import (
        LazyNeMoIterator,
        LazyNeMoTarredIterator,

P1 Unconditional top-level NeMo import breaks environments without NeMo

The module raises RuntimeError at import time if NeMo is not installed, because the try/except around the NeMo import re-raises as a hard error. Since speaker_id/__init__.py eagerly imports SpeakerEmbeddingLhotseStage from this module, and nemo_curator/stages/audio/__init__.py re-exports it, any import nemo_curator.stages.audio will fail in environments where NeMo is not installed. Moving the LazyNeMoIterator/LazyNeMoTarredIterator imports inside the process() method would make NeMo a lazy dependency and avoid breaking all other audio stages.
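
A minimal sketch of the deferred-import pattern the comment describes, not the PR's actual code. The names `require_module` and `LazyDependencyStage` are hypothetical; the point is only that the optional dependency is resolved when `process()` runs, so defining or importing the stage class never fails.

```python
import importlib


def require_module(name: str, hint: str):
    """Import `name` on first use and raise a descriptive error if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise RuntimeError(f"{name} is required for this stage. {hint}") from exc


class LazyDependencyStage:
    """Hypothetical stage: constructing it never touches the optional dependency."""

    def __init__(self, module_name: str):
        self.module_name = module_name

    def process(self) -> str:
        # The import is deferred to call time, so a missing package only
        # affects callers that actually run this stage.
        module = require_module(self.module_name, "Install it to use this stage.")
        return module.__name__
```

With this shape, a package `__init__.py` can keep re-exporting the stage class; only users who call `process()` without the dependency installed see the RuntimeError.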

Comment on lines +188 to +189
if waveform.ndim > 1:
    waveform = waveform.mean(axis=1)

P1 Wrong axis for multi-channel stereo mix-down silently corrupts audio

When the waveform is in channels-first format (channels, samples) as used by NeMo and Lhotse, mean(axis=1) averages over the time dimension, yielding shape (channels,) instead of (samples,). That array is then resampled and written as a 1-sample WAV, so UTMOSv2 scores garbage audio with no error. The correct axis for channels-first is 0.

Suggested change
- if waveform.ndim > 1:
-     waveform = waveform.mean(axis=1)
+ if waveform.ndim > 1:
+     waveform = waveform.mean(axis=0)
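
The shape difference can be checked with a small NumPy example. This is an illustrative sketch that assumes a channels-first `(channels, samples)` array, as the comment does; the test signal is made up.

```python
import numpy as np

# Synthetic stereo clip, channels-first: shape (2, 16000).
stereo = np.stack([np.ones(16000), -np.ones(16000)])

# Averaging over axis=1 collapses the time dimension: one value per channel.
wrong = stereo.mean(axis=1)

# Averaging over axis=0 collapses the channel dimension: one value per sample.
mono = stereo.mean(axis=0)

print(wrong.shape)  # (2,)
print(mono.shape)   # (16000,)
```

If a stage can receive both `(channels, samples)` and `(samples, channels)` arrays, the axis cannot be inferred reliably from the shape alone, so the layout is better treated as an explicit contract of the upstream reader.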

Comment on lines +228 to +235
results = self._model.predict(
    input_dir=wav_dir,
    batch_size=self.inference_batch_size,
    num_repetitions=self.num_repetitions,
    predict_dataset=self.predict_dataset,
    num_workers=0,
    verbose=False,
)

P2 _score_dir result ordering is not guaranteed

model.predict(input_dir=wav_dir) returns results whose order UTMOSv2 does not document. The code zips results directly against valid_indices relying on lexicographic file-name order. The zero-padded names happen to work today, but if a future UTMOSv2 version changes traversal order, MOS scores will be silently assigned to the wrong entries.
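
One way to remove the dependence on traversal order is to key scores by file name instead of zipping by position. The sketch below works on plain dictionaries and does not call UTMOSv2. It assumes each result carries `file_path` and `predicted_mos` fields, which should be verified against the installed UTMOSv2 version; `scores_by_index` and the zero-padded stems are hypothetical.

```python
from pathlib import Path


def scores_by_index(results, index_for_stem):
    """Map each result to its original entry index via the WAV file stem."""
    scores = {}
    for item in results:
        stem = Path(item["file_path"]).stem
        if stem not in index_for_stem:
            raise KeyError(f"unexpected result file: {item['file_path']}")
        scores[index_for_stem[stem]] = float(item["predicted_mos"])
    missing = set(index_for_stem.values()) - set(scores)
    if missing:
        raise ValueError(f"no score returned for entries: {sorted(missing)}")
    return scores


# Results arrive in arbitrary order; the mapping stays correct regardless.
index_for_stem = {"000000": 4, "000001": 7}
results = [
    {"file_path": "/tmp/wavs/000001.wav", "predicted_mos": 3.1},
    {"file_path": "/tmp/wavs/000000.wav", "predicted_mos": 2.4},
]
print(scores_by_index(results, index_for_stem))  # {7: 3.1, 4: 2.4}
```

Failing loudly on unexpected or missing files turns a silent misassignment into a visible error.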

Comment on lines +120 to +127
try:
    import librosa
    return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
except ImportError:
    ratio = target_sr / orig_sr
    indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int)
    indices = indices[indices < len(audio)]
    return audio[indices]

P2 Fallback resampler performs nearest-neighbor, not linear interpolation

The fallback uses np.round to pick the nearest existing sample index, which is nearest-neighbor resampling. The comment should say so, or the implementation should use np.interp for true linear interpolation.

Suggested change
- try:
-     import librosa
-     return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
- except ImportError:
-     ratio = target_sr / orig_sr
-     indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int)
-     indices = indices[indices < len(audio)]
-     return audio[indices]
+ try:
+     import librosa
+     return librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
+ except ImportError:
+     # Nearest-neighbour fallback (low quality - prefer installing librosa).
+     ratio = target_sr / orig_sr
+     indices = np.round(np.arange(0, len(audio), 1 / ratio)).astype(int)
+     indices = indices[indices < len(audio)]
+     return audio[indices]
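
For the alternative the comment mentions, a linear-interpolation fallback built on `np.interp` could look like the sketch below. The function name `resample_linear` is hypothetical and this is not the PR's code. It applies no anti-aliasing filter, so it remains a low-quality fallback when downsampling.

```python
import numpy as np


def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample a 1-D signal by linear interpolation between neighbouring samples."""
    if orig_sr == target_sr or len(audio) == 0:
        return audio
    n_out = int(round(len(audio) * target_sr / orig_sr))
    # Position of each output sample, measured in input-sample units.
    src_pos = np.arange(n_out) * (orig_sr / target_sr)
    # np.interp clamps positions past the last input sample to its value.
    return np.interp(src_pos, np.arange(len(audio)), audio).astype(audio.dtype)


ramp = np.arange(8, dtype=np.float32)
up = resample_linear(ramp, orig_sr=8, target_sr=16)
print(len(up), up[:3])  # 16 [0.  0.5 1. ]
```

Unlike the index-picking version, upsampling here produces intermediate values (0.5 between 0 and 1) rather than repeated samples.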
