Skip to content

Draft: PnC with LLM for audio pipeline#2006

Open
sushmitha-deva-09 wants to merge 119 commits into
NVIDIA-NeMo:mainfrom
sushmitha-deva-09:audio_core_3
Open

Draft: PnC with LLM for audio pipeline#2006
sushmitha-deva-09 wants to merge 119 commits into
NVIDIA-NeMo:mainfrom
sushmitha-deva-09:audio_core_3

Conversation

@sushmitha-deva-09
Copy link
Copy Markdown
Contributor

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
@sushmitha-deva-09 sushmitha-deva-09 requested review from a team as code owners May 21, 2026 06:16
@sushmitha-deva-09 sushmitha-deva-09 requested review from suiyoubi and removed request for a team May 21, 2026 06:16
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR introduces LLM-based Punctuation and Capitalization (PNC) into the NeMo Curator audio tagging pipeline via two new stages — PNCwithvLLMInferenceStage (vLLM inference) and CleanLLMOutputStage (CER-gated output validation) — backed by a refactored VLLMBase / VLLMInference class hierarchy in vllm_model.py.

  • Adds VLLMBase shared engine logic and VLLMInference chat-template helper, renames VLLMModel._llmllm (public), and updates llm_cleanup.py to match.
  • Adds PNCwithvLLMInferenceStage and CleanLLMOutputStage with segment-level and top-level text processing, CER-based fallback to BERT PNC, and optional word-alignment update.
  • Adds unit tests, GPU-gated e2e tests, YAML pipeline configs, and comprehensive README documentation for standalone, 1st-pass, and 2nd-pass ASR usage patterns."

Confidence Score: 3/5

The core PNC stage logic is sound, but two issues in the shared VLLMInference class need attention before merging.

The get_entry_prompt method has a dead unreachable fallback — the early return at line 404 means the if self.use_chat_api guard inside the except block is always False when reached, so a template error in chat-API mode silently returns [] instead of entry_chat. Separately, process_entry_prompts destroys the GPU engine after every call with no clear signal to callers; a second call without setup() raises RuntimeError immediately. Both issues live in the shared VLLMInference class used beyond the audio PNC stage.

nemo_curator/models/vllm_model.py for the dead branch in get_entry_prompt and the engine teardown in process_entry_prompts; tests/stages/audio/tagging/text/test_pnc_vllm.py for the missing setup() calls that leave _full_vocab_set empty.

Important Files Changed

Filename Overview
nemo_curator/models/vllm_model.py Adds VLLMBase shared engine class and VLLMInference chat-template helper; renames _llm to llm; get_entry_prompt has a dead unreachable fallback branch; process_entry_prompts silently destroys the vLLM engine after every call; VLLM_USE_V1 is set redundantly.
nemo_curator/stages/audio/tagging/text/pnc.py New PNCwithvLLMInferenceStage and CleanLLMOutputStage for LLM-backed PNC; remove_pncs calls .lower() twice unnecessarily; logic is otherwise correct.
nemo_curator/stages/math/modifiers/llm_cleanup.py One-line fix to reference renamed llm attribute (was _llm) in the setup guard; correct and complete.
tests/stages/audio/tagging/text/test_pnc_vllm.py Comprehensive unit tests for PNC stages; several CleanLLMOutputStage tests skip setup(), leaving _full_vocab_set empty and making is_valid_text a no-op in those paths.
tests/models/test_vllm_inference.py Good unit coverage of VLLMInference; defers heavy imports via importlib for CI compatibility.
tests/stages/audio/tagging/e2e/test_pnc_llm_e2e.py New GPU-gated e2e tests for simple, first-pass, and second-pass PNC pipelines.
pyproject.toml Adds nemo_curator[vllm] to the audio_cuda12 extras group; small, correct addition.

Sequence Diagram

sequenceDiagram
    participant Pipeline
    participant PNCStage as PNCwithvLLMInferenceStage
    participant VLLMInference
    participant VLLMBase
    participant CleanStage as CleanLLMOutputStage

    Pipeline->>PNCStage: setup_on_node()
    PNCStage->>VLLMInference: setup_on_node()

    Pipeline->>PNCStage: setup()
    PNCStage->>VLLMInference: setup()
    VLLMInference->>VLLMBase: _init_tokenizer()
    VLLMInference->>VLLMBase: _init_engine()

    Pipeline->>PNCStage: process_batch(tasks)
    PNCStage->>VLLMInference: process_batch(prompts)
    VLLMInference->>VLLMBase: _generate(prompts, use_chat)
    VLLMBase-->>PNCStage: RequestOutput list
    PNCStage-->>Pipeline: tasks with generation field

    Pipeline->>CleanStage: process(task)
    CleanStage->>CleanStage: clean_llm_output + CER + is_valid_text
    CleanStage-->>Pipeline: task with generation_cleaned + use_bert_pnc

    Pipeline->>PNCStage: teardown()
    PNCStage->>VLLMInference: clean_up()
    VLLMInference->>VLLMBase: _cleanup_gpu()
Loading

Reviews (1): Last reviewed commit: "Merge with main" | Re-trigger Greptile

Comment on lines +411 to +412
if self.use_chat_api:
return entry_chat
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Unreachable fallback branch

The if self.use_chat_api: return entry_chat check inside the except block can never execute. When use_chat_api=True the function always returns entry_chat at line 404–405, before tokenizer.apply_chat_template is even called — the except block is only reachable when use_chat_api=False. The inner guard evaluates to False every time, and the fallback return is dead code. A developer expecting chat-API mode to fall back to entry_chat on a template error will find it silently returns [] instead.

Comment on lines +419 to +420
def process_entry_prompts(self, entry_prompts: list, batch_size: int = 10000) -> list:
"""Generate in batches, then clean up."""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 process_entry_prompts silently destroys the vLLM engine at the end of every call via self.clean_up(). This is a heavyweight, irreversible operation (GPU memory deallocation, process-group teardown) that callers will not expect from a method named process_entry_prompts. A second call without an intervening setup() raises RuntimeError. The docstring buries the side effect; at minimum expand it to warn callers that the engine cannot be reused after this call.

Suggested change
def process_entry_prompts(self, entry_prompts: list, batch_size: int = 10000) -> list:
"""Generate in batches, then clean up."""
def process_entry_prompts(self, entry_prompts: list, batch_size: int = 10000) -> list:
"""Generate in batches, then destroy the vLLM engine.
.. warning::
This method calls :meth:`clean_up` after generation, which releases
all GPU memory and tears down the distributed process group.
**The engine cannot be used again** without calling :meth:`setup` first.
For repeated inference calls, use :meth:`process_batch` directly.
"""

Comment on lines +68 to +75
def _init_engine(self, model_kwargs: dict[str, Any], sampling_kwargs: dict[str, Any]) -> None:
"""Create the vLLM ``LLM`` engine and ``SamplingParams``.

Args:
model_kwargs: Keyword arguments forwarded to ``vllm.LLM``.
sampling_kwargs: Keyword arguments forwarded to ``vllm.SamplingParams``.
"""
os.environ.setdefault("VLLM_USE_V1", "0")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 os.environ.setdefault("VLLM_USE_V1", "0") is already called at module import time inside the try block that imports vllm, so the identical call here is redundant.

Suggested change
def _init_engine(self, model_kwargs: dict[str, Any], sampling_kwargs: dict[str, Any]) -> None:
"""Create the vLLM ``LLM`` engine and ``SamplingParams``.
Args:
model_kwargs: Keyword arguments forwarded to ``vllm.LLM``.
sampling_kwargs: Keyword arguments forwarded to ``vllm.SamplingParams``.
"""
os.environ.setdefault("VLLM_USE_V1", "0")
def _init_engine(self, model_kwargs: dict[str, Any], sampling_kwargs: dict[str, Any]) -> None:
"""Create the vLLM ``LLM`` engine and ``SamplingParams``.
Args:
model_kwargs: Keyword arguments forwarded to ``vllm.LLM``.
sampling_kwargs: Keyword arguments forwarded to ``vllm.SamplingParams``.
"""

Comment on lines +266 to +271
@staticmethod
def remove_pncs(text: str, punct_marks: str) -> str:
text = re.sub(r"[.,?،؟.、?¿!,?।:;]", "", text.lower()) # noqa: RUF001
pattern = f"[{re.escape(punct_marks)}]"
text = re.sub(pattern, " ", text.lower())
return " ".join(text.split())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 text.lower() is called twice: once as the argument to the first re.sub (producing an already-lowercase string), then again on the already-lowercase result for the second re.sub. The second call is a no-op.

Suggested change
@staticmethod
def remove_pncs(text: str, punct_marks: str) -> str:
text = re.sub(r"[.,?،؟.、?¿!,?।:;]", "", text.lower()) # noqa: RUF001
pattern = f"[{re.escape(punct_marks)}]"
text = re.sub(pattern, " ", text.lower())
return " ".join(text.split())
@staticmethod
def remove_pncs(text: str, punct_marks: str) -> str:
text = text.lower()
text = re.sub(r"[.,?،؟.、?¿!,?।:;]", "", text) # noqa: RUF001
pattern = f"[{re.escape(punct_marks)}]"
text = re.sub(pattern, " ", text)
return " ".join(text.split())

Comment on lines +488 to +501
stage = CleanLLMOutputStage(
cer_threshold=0,
update_alignment=True,
alignment_key="alignment",
segments_key="__none__",
)
result = stage.process(AudioTask(data=data))

assert result.data["use_bert_pnc"] is False
words = [w["word"] for w in result.data["alignment"]]
assert words == ["Hello,", "world.", "Good", "morning,", "everyone."]
assert result.data["alignment"][0]["start"] == 0.0
assert result.data["alignment"][0]["confidence"] == 0.9

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 setup() not called — _full_vocab_set is empty

CleanLLMOutputStage._full_vocab_set is populated in setup(). Without it, the field defaults to an empty set, so is_valid_text(llm_cleaned, self._full_vocab_set) returns False for any non-empty text. Tests like test_invalid_chars_trigger_bert_fallback pass by coincidence (the CER or digit check fires first, or chars match so the vocab check is never reached), not because the vocab logic is exercised. Adding stage.setup() before stage.process(task) would make each test actually validate what it claims.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants