Draft: PnC with LLM for audio pipeline by sushmitha-deva-09 · Pull Request #2006 · NVIDIA-NeMo/Curator

sushmitha-deva-09 · 2026-05-21T06:16:10Z

Description

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

…_generic

copy-pr-bot · 2026-05-21T06:16:14Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-05-21T06:23:55Z

Greptile Summary

This PR introduces LLM-based Punctuation and Capitalization (PNC) into the NeMo Curator audio tagging pipeline via two new stages — PNCwithvLLMInferenceStage (vLLM inference) and CleanLLMOutputStage (CER-gated output validation) — backed by a refactored VLLMBase / VLLMInference class hierarchy in vllm_model.py.

Adds VLLMBase shared engine logic and VLLMInference chat-template helper, renames VLLMModel._llm → llm (public), and updates llm_cleanup.py to match.
Adds PNCwithvLLMInferenceStage and CleanLLMOutputStage with segment-level and top-level text processing, CER-based fallback to BERT PNC, and optional word-alignment update.
Adds unit tests, GPU-gated e2e tests, YAML pipeline configs, and comprehensive README documentation for standalone, 1st-pass, and 2nd-pass ASR usage patterns.
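
To make "CER-gated output validation" concrete, here is a minimal sketch of how such a gate could work: compare the ASR text with the LLM output after stripping case and punctuation, and fall back when the character error rate exceeds a threshold. The names `char_error_rate` and `should_use_bert_pnc`, the ASCII-only normalization, and the strict `>` comparison are assumptions for illustration; none of this is taken from the PR's code.

```python
import string


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Lowercase and drop ASCII punctuation so only the words are compared."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between normalized strings, divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def should_use_bert_pnc(asr_text: str, llm_text: str, cer_threshold: float = 0.0) -> bool:
    """Hypothetical gate: fall back when the LLM changed more than punctuation/case."""
    return char_error_rate(asr_text, llm_text) > cer_threshold
```

With a threshold of 0, output that only adds punctuation and capitalization passes the gate, while output that inserts or drops words triggers the fallback.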

Confidence Score: 3/5

The core PNC stage logic is sound, but two issues in the shared VLLMInference class need attention before merging.

The get_entry_prompt method has a dead unreachable fallback — the early return at line 404 means the if self.use_chat_api guard inside the except block is always False when reached, so a template error in chat-API mode silently returns [] instead of entry_chat. Separately, process_entry_prompts destroys the GPU engine after every call with no clear signal to callers; a second call without setup() raises RuntimeError immediately. Both issues live in the shared VLLMInference class used beyond the audio PNC stage.

nemo_curator/models/vllm_model.py for the dead branch in get_entry_prompt and the engine teardown in process_entry_prompts; tests/stages/audio/tagging/text/test_pnc_vllm.py for the missing setup() calls that leave _full_vocab_set empty.

Important Files Changed

Filename	Overview
nemo_curator/models/vllm_model.py	Adds VLLMBase shared engine class and VLLMInference chat-template helper; renames _llm to llm; get_entry_prompt has a dead unreachable fallback branch; process_entry_prompts silently destroys the vLLM engine after every call; VLLM_USE_V1 is set redundantly.
nemo_curator/stages/audio/tagging/text/pnc.py	New PNCwithvLLMInferenceStage and CleanLLMOutputStage for LLM-backed PNC; remove_pncs calls .lower() twice unnecessarily; logic is otherwise correct.
nemo_curator/stages/math/modifiers/llm_cleanup.py	One-line fix to reference renamed llm attribute (was _llm) in the setup guard; correct and complete.
tests/stages/audio/tagging/text/test_pnc_vllm.py	Comprehensive unit tests for PNC stages; several CleanLLMOutputStage tests skip setup(), leaving _full_vocab_set empty and making is_valid_text a no-op in those paths.
tests/models/test_vllm_inference.py	Good unit coverage of VLLMInference; defers heavy imports via importlib for CI compatibility.
tests/stages/audio/tagging/e2e/test_pnc_llm_e2e.py	New GPU-gated e2e tests for simple, first-pass, and second-pass PNC pipelines.
pyproject.toml	Adds nemo_curator[vllm] to the audio_cuda12 extras group; small, correct addition.

Sequence Diagram

sequenceDiagram
    participant Pipeline
    participant PNCStage as PNCwithvLLMInferenceStage
    participant VLLMInference
    participant VLLMBase
    participant CleanStage as CleanLLMOutputStage

    Pipeline->>PNCStage: setup_on_node()
    PNCStage->>VLLMInference: setup_on_node()

    Pipeline->>PNCStage: setup()
    PNCStage->>VLLMInference: setup()
    VLLMInference->>VLLMBase: _init_tokenizer()
    VLLMInference->>VLLMBase: _init_engine()

    Pipeline->>PNCStage: process_batch(tasks)
    PNCStage->>VLLMInference: process_batch(prompts)
    VLLMInference->>VLLMBase: _generate(prompts, use_chat)
    VLLMBase-->>PNCStage: RequestOutput list
    PNCStage-->>Pipeline: tasks with generation field

    Pipeline->>CleanStage: process(task)
    CleanStage->>CleanStage: clean_llm_output + CER + is_valid_text
    CleanStage-->>Pipeline: task with generation_cleaned + use_bert_pnc

    Pipeline->>PNCStage: teardown()
    PNCStage->>VLLMInference: clean_up()
    VLLMInference->>VLLMBase: _cleanup_gpu()

Reviews (1): Last reviewed commit: "Merge with main"

greptile-apps · 2026-05-21T06:24:00Z

+            if self.use_chat_api:
+                return entry_chat


Unreachable fallback branch

The if self.use_chat_api: return entry_chat check inside the except block can never execute. When use_chat_api=True the function always returns entry_chat at line 404–405, before tokenizer.apply_chat_template is even called — the except block is only reachable when use_chat_api=False. The inner guard evaluates to False every time, and the fallback return is dead code. A developer expecting chat-API mode to fall back to entry_chat on a template error will find it silently returns [] instead.
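
The control flow described in this comment can be sketched as follows. This is an assumed reconstruction based only on the review text, not the PR's actual `get_entry_prompt`; the `FakeTokenizer` class and both function names are hypothetical stand-ins.

```python
class FakeTokenizer:
    """Stand-in tokenizer whose chat template always fails (illustration only)."""

    def apply_chat_template(self, chat, tokenize=False, add_generation_prompt=True):
        raise ValueError("template error")


def get_entry_prompt_as_reviewed(entry_chat, tokenizer, use_chat_api):
    """Assumed shape of the reviewed control flow."""
    if use_chat_api:
        return entry_chat  # early return: chat-API mode never reaches the try block
    try:
        return tokenizer.apply_chat_template(
            entry_chat, tokenize=False, add_generation_prompt=True
        )
    except Exception:
        if use_chat_api:  # always False at this point, so this branch is dead
            return entry_chat
        return []


def get_entry_prompt_simplified(entry_chat, tokenizer, use_chat_api):
    """Same behavior with the unreachable guard removed."""
    if use_chat_api:
        return entry_chat
    try:
        return tokenizer.apply_chat_template(
            entry_chat, tokenize=False, add_generation_prompt=True
        )
    except Exception:
        return []
```

Both functions return the same value for every input, which is one way to confirm that the inner guard contributes nothing.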

greptile-apps · 2026-05-21T06:24:02Z

+    def process_entry_prompts(self, entry_prompts: list, batch_size: int = 10000) -> list:
+        """Generate in batches, then clean up."""


process_entry_prompts silently destroys the vLLM engine at the end of every call via self.clean_up(). This is a heavyweight, irreversible operation (GPU memory deallocation, process-group teardown) that callers will not expect from a method named process_entry_prompts. A second call without an intervening setup() raises RuntimeError. The docstring buries the side effect; at minimum expand it to warn callers that the engine cannot be reused after this call.
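
A toy model of the lifecycle problem, for illustration. `ToyEngine` is a hypothetical stand-in and not the PR's `VLLMInference`; the `cleanup` keyword is an assumed alternative to a docstring-only fix, shown only to indicate how the teardown could be made opt-out.

```python
class ToyEngine:
    """Toy stand-in for an inference wrapper with an expensive engine handle."""

    def __init__(self):
        self.llm = None

    def setup(self):
        self.llm = object()  # placeholder for a real engine

    def clean_up(self):
        self.llm = None  # placeholder for releasing GPU memory

    def process_batch(self, prompts):
        if self.llm is None:
            raise RuntimeError("engine not initialized; call setup() first")
        return [p.upper() for p in prompts]

    def process_entry_prompts(self, prompts, batch_size=2, cleanup=True):
        """Generate in batches; tear the engine down afterwards unless cleanup=False."""
        outputs = []
        try:
            for start in range(0, len(prompts), batch_size):
                outputs.extend(self.process_batch(prompts[start:start + batch_size]))
        finally:
            if cleanup:
                self.clean_up()
        return outputs
```

With the default `cleanup=True`, a second call without `setup()` raises `RuntimeError`, which mirrors the behavior flagged above; passing `cleanup=False` keeps the engine alive for repeated calls.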

Suggested change

-    def process_entry_prompts(self, entry_prompts: list, batch_size: int = 10000) -> list:
-        """Generate in batches, then clean up."""
+    def process_entry_prompts(self, entry_prompts: list, batch_size: int = 10000) -> list:
+        """Generate in batches, then destroy the vLLM engine.
+
+        .. warning::
+            This method calls :meth:`clean_up` after generation, which releases
+            all GPU memory and tears down the distributed process group.
+            **The engine cannot be used again** without calling :meth:`setup` first.
+            For repeated inference calls, use :meth:`process_batch` directly.
+        """

greptile-apps · 2026-05-21T06:24:04Z

+    def _init_engine(self, model_kwargs: dict[str, Any], sampling_kwargs: dict[str, Any]) -> None:
+        """Create the vLLM ``LLM`` engine and ``SamplingParams``.
+
+        Args:
+            model_kwargs: Keyword arguments forwarded to ``vllm.LLM``.
+            sampling_kwargs: Keyword arguments forwarded to ``vllm.SamplingParams``.
+        """
+        os.environ.setdefault("VLLM_USE_V1", "0")


os.environ.setdefault("VLLM_USE_V1", "0") is already called at module import time inside the try block that imports vllm, so the identical call here is redundant.
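
For reference, `os.environ.setdefault` only writes when the key is absent, so a repeated call with the same arguments changes nothing, and it never overrides a value the user has already set. The helper name `demo_setdefault` below is hypothetical and exists only to keep the demonstration free of side effects.

```python
import os


def demo_setdefault(name: str = "VLLM_USE_V1") -> tuple[str, str]:
    saved = os.environ.pop(name, None)  # remember any real value, start clean
    try:
        os.environ.setdefault(name, "0")  # key absent: value is set to "0"
        os.environ.setdefault(name, "0")  # key present: nothing changes
        after_repeat = os.environ[name]

        os.environ[name] = "1"  # simulate a user override
        os.environ.setdefault(name, "0")  # does not clobber the override
        after_override = os.environ[name]
        return after_repeat, after_override
    finally:
        # Restore the environment to its state before the demo.
        os.environ.pop(name, None)
        if saved is not None:
            os.environ[name] = saved
```

Calling `demo_setdefault()` returns `("0", "1")`.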

Suggested change

     def _init_engine(self, model_kwargs: dict[str, Any], sampling_kwargs: dict[str, Any]) -> None:
         """Create the vLLM ``LLM`` engine and ``SamplingParams``.

         Args:
             model_kwargs: Keyword arguments forwarded to ``vllm.LLM``.
             sampling_kwargs: Keyword arguments forwarded to ``vllm.SamplingParams``.
         """
-        os.environ.setdefault("VLLM_USE_V1", "0")

greptile-apps · 2026-05-21T06:24:07Z

+    @staticmethod
+    def remove_pncs(text: str, punct_marks: str) -> str:
+        text = re.sub(r"[.,?،؟.、？¿!,?।:;]", "", text.lower())  # noqa: RUF001
+        pattern = f"[{re.escape(punct_marks)}]"
+        text = re.sub(pattern, " ", text.lower())
+        return " ".join(text.split())


text.lower() is called twice: once as the argument to the first re.sub (producing an already-lowercase string), then again on the already-lowercase result for the second re.sub. The second call is a no-op.

Suggested change

     @staticmethod
     def remove_pncs(text: str, punct_marks: str) -> str:
-        text = re.sub(r"[.,?،؟.、？¿!,?।:;]", "", text.lower())  # noqa: RUF001
+        text = text.lower()
+        text = re.sub(r"[.,?،؟.、？¿!,?।:;]", "", text)  # noqa: RUF001
         pattern = f"[{re.escape(punct_marks)}]"
-        text = re.sub(pattern, " ", text.lower())
+        text = re.sub(pattern, " ", text)
         return " ".join(text.split())
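
A standalone, runnable version of the suggested function, written as a module-level function rather than a static method so it can be tried in isolation. The character class is copied from the snippet above; the sample input is made up for illustration.

```python
import re


def remove_pncs(text: str, punct_marks: str) -> str:
    """Lowercase, strip common punctuation, and collapse whitespace."""
    text = text.lower()  # lowercase once, up front
    text = re.sub(r"[.,?،؟.、？¿!,?।:;]", "", text)  # delete sentence punctuation
    pattern = f"[{re.escape(punct_marks)}]"
    text = re.sub(pattern, " ", text)  # caller-supplied marks become spaces
    return " ".join(text.split())


print(remove_pncs("Hello, World! How-are you?", "-"))
# prints: hello world how are you
```

Note that an empty `punct_marks` would produce the pattern `[]`, which Python's `re` module rejects, so callers are assumed to pass at least one character.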

greptile-apps · 2026-05-21T06:24:09Z

+        stage = CleanLLMOutputStage(
+            cer_threshold=0,
+            update_alignment=True,
+            alignment_key="alignment",
+            segments_key="__none__",
+        )
+        result = stage.process(AudioTask(data=data))
+
+        assert result.data["use_bert_pnc"] is False
+        words = [w["word"] for w in result.data["alignment"]]
+        assert words == ["Hello,", "world.", "Good", "morning,", "everyone."]
+        assert result.data["alignment"][0]["start"] == 0.0
+        assert result.data["alignment"][0]["confidence"] == 0.9
+


setup() not called — _full_vocab_set is empty

CleanLLMOutputStage._full_vocab_set is populated in setup(). Without it, the field defaults to an empty set, so is_valid_text(llm_cleaned, self._full_vocab_set) returns False for any non-empty text. Tests like test_invalid_chars_trigger_bert_fallback pass by coincidence (the CER or digit check fires first, or chars match so the vocab check is never reached), not because the vocab logic is exercised. Adding stage.setup() before stage.process(task) would make each test actually validate what it claims.
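
The effect can be illustrated with a toy stage. `ToyCleanStage` and its `is_valid` method are hypothetical stand-ins that assume a character-level membership check; the real `is_valid_text` may differ.

```python
class ToyCleanStage:
    """Toy stage whose vocabulary is only populated in setup()."""

    def __init__(self, vocab_chars: str = "abcdefghijklmnopqrstuvwxyz ,.'"):
        self._vocab_chars = vocab_chars
        self._full_vocab_set: set[str] = set()  # empty until setup() runs

    def setup(self) -> None:
        self._full_vocab_set = set(self._vocab_chars)

    def is_valid(self, text: str) -> bool:
        # Assumed check: every character must be in the vocabulary.
        return all(ch in self._full_vocab_set for ch in text.lower())


stage = ToyCleanStage()
before_setup = stage.is_valid("hello, world.")  # False: vocabulary is empty
stage.setup()
after_setup = stage.is_valid("hello, world.")   # True: all characters known
rejects_unknown = not stage.is_valid("héllo")   # True: 'é' is out of vocabulary
```

Under this assumption, a test that skips `setup()` cannot distinguish valid from invalid text, because everything non-empty is rejected for the same reason.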

sushmitha-deva-09 added 30 commits February 25, 2026 15:31

Update pyptoject.toml

dc9c4e0

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge branch 'main' of https://github.com/NVIDIA-NeMo/Curator into yt…

34db504

…_generic

Add generic audio tagging pipeline

3c4e97e

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update configs and benchmarking scripts

e90ec75

Rename files and use common get duration method

6f0276f

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix formatting

e4fa6de

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix minor bugs

cf93c3d

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update random usage in pyannote.py

db22329

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update get duration method

6f88859

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix minor issues

0dad1e1

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge with main

44cf69d

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Add inputs and outputs methods to all stages

79f9ccc

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix ruff check

b90875d

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update scripts

f47192e

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix bug prepare segments

47baa5e

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update scripts

498e740

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge with main

04a4b99

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

AudioBatch to AudioTask migration

2debd02

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Remove unwanted stages

c5152c5

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Remove metric stages

ce1f3ba

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update scripts

65f49ab

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Remove unused packages

e25b385

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix minor bugs

91bc035

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Add tts e2e test and specify key paramenters in stages

8d47252

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix extra whitespace and remove unwanted fixtures

09f9c9f

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update typehints for setup calls

9aaccab

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update readme

d99b879

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Enhance benchmark logs

edb431f

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Add metric stages and tests

41e45e6

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Remove cuda parameter

9dc2695

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

sushmitha-deva-09 added 23 commits May 4, 2026 19:33

Update lock file

5c741e5

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Remove pyarbic stage

9c1d873

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update scripts

4e6fb45

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Run tests in streaming mode

cebaf98

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update configs and tests

cc0e8b5

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update pnc script

7772fa6

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update pnc code

6072a4a

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge with main

a78cce7

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Remove bert based pnc inference

562ca68

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update tts yaml config

6efc74e

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Add torch squim metrics test

ca3d20c

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Move vllm code to models and update scripts

7758cb7

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update first pass reference

8eece4b

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix tests

a94ee81

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Implement batching for squim stage and update scripts

da03db0

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge with main

fde3e80

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Move metrics to generic metrics folder

2f54f9b

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update fleurs imports

92bc159

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Create vllm base class

148c5a9

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update yaml config

5337129

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update config

3be5352

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge with main

6e6db94

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge with main

cbad77f

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

sushmitha-deva-09 requested review from a team as code owners May 21, 2026 06:16

sushmitha-deva-09 requested review from suiyoubi and removed request for a team May 21, 2026 06:16

greptile-apps Bot reviewed May 21, 2026

sushmitha-deva-09 wants to merge 119 commits into NVIDIA-NeMo:main from sushmitha-deva-09:audio_core_3

2 participants

