Add Qwen Omni first inference stage#1967
Conversation
Greptile SummaryThis PR introduces the first Granary v2 Qwen-Omni audio inference stage for NeMo Curator, wiring together a
Confidence Score: 3/5Two known unfixed issues in the model and inference-stage files will cause import failures on non-GPU installs and silent KeyError crashes for tasks with missing waveform data. The unconditional top-level imports of
Important Files Changed
Reviews (5): Last reviewed commit: "Move qwen-asr runtime stack into audio_c..." | Re-trigger Greptile |
| with open(out_path, "a", encoding="utf-8") as f: | ||
| f.write(json.dumps(task.data, ensure_ascii=False) + "\n") |
There was a problem hiding this comment.
json.dumps will crash when keep_waveform=True
task.data can contain a numpy ndarray (keyed by waveform_key) when the upstream InferenceQwenOmniStage is configured with keep_waveform: true. json.dumps has no numpy serializer and raises TypeError: Object of type ndarray is not JSON serializable, crashing the entire shard write for all tasks in that batch. Either strip the waveform here before serialising, or document that keep_waveform must be false when this writer is used (and add a validation guard in setup or __post_init__).
There was a problem hiding this comment.
Thanks for flagging. This is already addressed on the current head (936fa17f):
_manifest_datafirst drops keys named indrop_manifest_keys(defaults to("waveform",)) so the configuredwaveform_keynever reaches serialisation, regardless ofkeep_waveform.- Anything else with
.shapeand.dtype(numpy ndarrays, torch tensors, etc.) is dropped via a duck-typing guard beforejson.dumpsis called. - The remaining
json.dumpscall is wrapped intry/except TypeError, so a previously-unseen non-serialisable value raises a focusedTypeErrorwith the offending key instead of crashing the shard.
Citation: nemo_curator/stages/audio/io/sharded_manifest_writer.py:96-111. Resolving as already-fixed.
| cfg = value if OmegaConf.is_config(value) else OmegaConf.create(value) | ||
| if "_target_" in cfg: | ||
| return hydra.utils.instantiate(cfg) | ||
| raw = OmegaConf.to_container(cfg, resolve=True) | ||
| return Resources(**raw) | ||
| msg = f"Invalid resources override: {value!r}" |
There was a problem hiding this comment.
Credentials exposed in startup log
logger.info(f"Hydra config:\n{OmegaConf.to_yaml(cfg)}") prints the full resolved config, including any hf_token passed as a Hydra override. The credential ends up in every log sink (stdout, files, observability stacks) in plaintext. Consider redacting the hf_token field before logging — for example, by building a sanitised copy of the config dict — or logging only a subset of non-sensitive keys.
There was a problem hiding this comment.
Thanks for flagging. This was addressed in commit 936fa17f:
- The startup log call no longer uses raw
OmegaConf.to_yaml(cfg). It now uses_safe_config_yaml(cfg)(tutorialmain.py:346), which builds a redacted copy of the config before rendering to YAML. _redact_secret_valueswalks the config recursively (main.py:170-179) and replaces values of any key matching_SECRET_KEY_NAMES(which explicitly includeshf_token,password,secret_key,token,credentials, …) or_SECRET_KEY_PARTSsubstrings with<redacted>.- Trailing-suffix matching also catches any custom secret named
*_token,*_secret, or*_password.
Note: the regression test that covered this behaviour (tests/stages/audio/inference/test_qwen_omni_tutorial.py::test_safe_config_yaml_redacts_hf_token_but_keeps_token_counts) was removed in this revision per @sarahyurick's "we shouldn't need pytests for tutorials" comment. The helper code itself is unchanged.
Citations: tutorials/audio/qwen_omni_inprocess/main.py:60-72, :160-184, :346. Resolving as already-fixed.
| completed.add(shard_key) | ||
| return completed | ||
|
|
||
| @staticmethod |
There was a problem hiding this comment.
Unchecked
None return from extractfile
tarfile.TarFile.extractfile returns None for members that are not regular files (hard links, directory entries embedded in some tar formats). The preceding tar_info.isfile() guard does not cover all cases where extractfile may return None — calling .read() on a None result would raise AttributeError. Add a None check before .read().
|
|
||
| def _get_prompt_text(self, language: str | None) -> str: | ||
| """Return the EN-specific prompt for English, otherwise the default prompt.""" | ||
| if language and language == "English" and self.en_prompt_text: |
There was a problem hiding this comment.
The leading
language and is redundant: if language == "English" evaluates to True the string is already truthy, so the extra truthiness check is dead code and can mislead readers into thinking an empty-string case needs guarding here.
| if language and language == "English" and self.en_prompt_text: | |
| if language == "English" and self.en_prompt_text: |
| max_num_seqs=self.max_num_seqs, | ||
| max_model_len=self.max_model_len, | ||
| seed=1234, | ||
| enable_prefix_caching=True, | ||
| prefix_caching_hash_algo="xxhash", | ||
| ) | ||
|
|
||
| from transformers import Qwen3OmniMoeProcessor | ||
|
|
||
| self._processor = Qwen3OmniMoeProcessor.from_pretrained(self.model_id) | ||
|
|
||
| self._sampling_params = SamplingParams( |
There was a problem hiding this comment.
trust_remote_code=True hardcoded without user control
trust_remote_code=True executes arbitrary Python bundled with the model weights; there is no constructor parameter or config knob to disable it. For downstream users who want to run audited/frozen snapshots, or who apply security policies, this silently bypasses those controls. Exposing it as an __init__ parameter (defaulting to True for backward-compat) would let callers opt out.
sarahyurick
left a comment
There was a problem hiding this comment.
I did an initial pass to start familiarizing myself with the PR for now. Left some minor comments.
| model_id: str = _QWEN3_OMNI_MODEL_ID, | ||
| prompt_text: str = "Transcribe the audio.", | ||
| en_prompt_text: str | None = None, | ||
| followup_prompt: str = "Now listen to the audio again and add any false starts, filler words and preserve colloquial words (like lemme, gonna, wanna, etc) as is spoken in the audio.", |
There was a problem hiding this comment.
Instead of having the long string here, let's create a script variable called _FOLLOWUP_PROMPT or similar.
| prefix_caching_hash_algo="xxhash", | ||
| ) | ||
|
|
||
| from transformers import Qwen3OmniMoeProcessor |
| self._sampling_params = None | ||
| gc.collect() | ||
| try: | ||
| import torch |
| def _prepare_turn2_single( | ||
| self, waveform_16k: np.ndarray, pred_text: str, language: str | None = None, | ||
| ) -> dict[str, Any] | None: | ||
| from qwen_omni_utils import process_mm_info |
| """ | ||
|
|
||
| name: str = "sharded_manifest_writer" | ||
| output_dir: str = "" |
There was a problem hiding this comment.
| output_dir: str = "" | |
| output_dir: str |
instead of checking for it in the post init.
| from nemo_curator.stages.audio.inference.qwen_omni import InferenceQwenOmniStage | ||
| from nemo_curator.stages.audio.io.nemo_tarred_reader import NemoTarredAudioReader, NemoTarShardReaderStage | ||
| from nemo_curator.stages.audio.io.sharded_manifest_writer import ShardedManifestWriterStage | ||
| from tutorials.audio.qwen_omni_inprocess.main import ( |
There was a problem hiding this comment.
We shouldn't need pytests for tutorials.
| and prefetches common HuggingFace model attributes without hardcoding a | ||
| full Granary v2 post-processing graph in this entry point. | ||
| """ | ||
| from huggingface_hub import hf_hub_download, snapshot_download |
| If you do not have `uv`, use pip: | ||
|
|
||
| ```bash | ||
| pip install -e ".[audio_cuda12]" | ||
| ``` |
There was a problem hiding this comment.
We should only encourage uv and not pip.
- hoist followup_prompt default into _FOLLOWUP_PROMPT module constant
- move 6 lazy imports to module scope (torch, transformers.Qwen3OmniMoeProcessor,
qwen_omni_utils.process_mm_info, huggingface_hub.{snapshot_download,
hf_hub_download}, yaml); keep existing vllm try/except guard
- drop redundant `language and` short-circuit in _get_prompt_text
- guard tar.extractfile(...) against None before .read() in
NemoTarShardReaderStage; add hard-link regression test
- make ShardedManifestWriterStage.output_dir a required field; drop empty-
string post_init check
- add Apache/NVIDIA copyright headers to inference/__init__.py and
qwen_omni_inprocess.yaml
- drop pip-fallback install block from tutorial README (uv-only)
- remove tests/stages/audio/inference/test_qwen_omni_tutorial.py per
"no pytests for tutorials"
- retarget @patch decorators in test_qwen_omni.py to the use-site
(nemo_curator.stages.audio.inference.qwen_omni.snapshot_download) so
the patches still bind after the import hoist
Signed-off-by: Aaftab V <aaftabv@nvidia.com>
| from qwen_omni_utils import process_mm_info | ||
| from transformers import Qwen3OmniMoeProcessor |
There was a problem hiding this comment.
Unconditional top-level imports break non-
audio_cuda12 installs
qwen_omni_utils is only shipped with the audio_cuda12 optional extra, but from qwen_omni_utils import process_mm_info and from transformers import Qwen3OmniMoeProcessor are both imported at module level, outside any guard. On a standard Curator installation (including the Mac/ARM case the PR description explicitly claims will work), import nemo_curator.models.qwen_omni fails immediately with ImportError: No module named 'qwen_omni_utils'. The vllm import immediately below correctly uses try/except ImportError + VLLM_AVAILABLE; these two imports need the same treatment — either fold them into that same try block, or defer them into setup() where VLLM_AVAILABLE is already checked.
Add qwen-asr and its lazy-imported runtime companions to Curator's audio extras so that the harvest.curator Docker image gets the full qwen-asr stack via Curator's uv sync rather than via post-uv pip installs in NvLLMOps. This honors the Algorithmic vs Data-Mover Dep Ownership Rule: algorithmic libraries belong in Curator, NvLLMOps owns only data-mover clients. audio_common gains the qwen-asr forced-aligner text-norm and audio-feature companions that qwen-asr 0.0.6's qwen3_forced_aligner.py imports lazily: nagisa==0.2.11, soynlp==0.0.493, pyarabic, opencc-python-reimplemented, and nnAudio. audio_cuda12 gains qwen-asr==0.0.6 itself for the Granary v2 Qwen-ASR recovery stage, and fasttext==0.9.3 for the Granary v2 LID stage. fasttext already lives in text_cpu but audio_cuda12 does not pull text_cpu, so the declaration is duplicated here. [tool.uv] override-dependencies replaces the broad huggingface-hub>=0.34,<1.0 override with three exact pins proven against qwen-asr by NvLLMOps commit 68f18e9b: transformers==4.57.6, accelerate==1.12.0, and huggingface-hub==0.36.0. These force-override qwen-asr's declared (incompatible) version pins so the resolver picks the proven-compatible versions for the entire graph. This change extends PR1967's scope from "first-stage Qwen-Omni inference only" to also cover Granary v2 algorithmic-dep self-containment, so that later Granary v2 PRs (Qwen-ASR recovery, text filtering, PnC, ITN, SED) can rely on Curator's audio_cuda12 extra without further pip-after-uv overrides in NvLLMOps. Lock churn: +18 packages including qwen-asr 0.0.6, nagisa, soynlp, pyarabic, opencc-python-reimplemented, nnaudio, and qwen-asr's gradio/flask transitive demo deps. transformers/huggingface-hub/accelerate stayed at the override-pinned versions, so no version drift for the qwen-omni stack. Signed-off-by: Aaftab V <aaftabv@nvidia.com>
| """ | ||
|
|
||
| name: str = "nemo_tar_shard_discovery" | ||
| yaml_path: str = "" |
There was a problem hiding this comment.
| yaml_path: str = "" | |
| yaml_path: str |
This can be empty instead of checking it in the post init.
| """ | ||
|
|
||
| name: str = "nemo_tarred_audio_reader" | ||
| yaml_path: str = "" |
There was a problem hiding this comment.
| yaml_path: str = "" | |
| yaml_path: str |
Same comment as above.
There was a problem hiding this comment.
@mohammadaaftabv For changes that override the deps and might have impact outside just the intended modality can we run the full benchmark suite with changes from this PR to ensure nothing regressed?
There was a problem hiding this comment.
@mohammadaaftabv For changes that override the deps and might have impact outside just the intended modality can we run the full benchmark suite with changes from this PR to ensure nothing regressed?
| # qwen-asr 0.0.6's qwen3_forced_aligner.py lazily imports the following text-norm | ||
| # and audio-feature helpers. Keep them in audio_common (not audio_cuda12) so the | ||
| # CPU audio extra picks them up too if a future qwen-asr CPU path appears. | ||
| "nagisa==0.2.11", |
There was a problem hiding this comment.
We try to avoid hard pins if possible since it may cause incompatibility with other packages due to more restrictive pinning.
| # qwen-asr by NvLLMOps commit 68f18e9b. Lazy-imported runtime companions | ||
| # (nagisa, soynlp, pyarabic, opencc-python-reimplemented, nnAudio) come from | ||
| # audio_common above. | ||
| "qwen-asr==0.0.6", |
There was a problem hiding this comment.
Same, is it possible to avoid hard pin. The comment here is too verbose.
| "qwen-asr==0.0.6", | ||
| # Granary v2 LID stage uses fasttext directly; it also lives in text_cpu but | ||
| # audio_cuda12 does not pull text_cpu, so declare it explicitly here. | ||
| "fasttext==0.9.3", |
There was a problem hiding this comment.
same comment about hard pins
| "huggingface-hub==0.36.0", # Pinned to qwen-asr 0.0.6 runtime compat (NvLLMOps commit 68f18e9b); also covers transformers vs data-designer disagreement | ||
| "transformers==4.57.6", # Pinned to qwen-asr 0.0.6 runtime compat (NvLLMOps commit 68f18e9b); overrides qwen-asr's incompatible declared transformers pin | ||
| "accelerate==1.12.0", # Pinned to qwen-asr 0.0.6 runtime compat (NvLLMOps commit 68f18e9b); overrides qwen-asr's incompatible declared accelerate pin |
There was a problem hiding this comment.
Is it possible to avoid hard pins here. Generally hf-hub and transformers are used across multiple modalities so it might be harder to unify this here.
| "distance; sys_platform == 'never'", | ||
| "huggingface-hub>=0.34,<1.0", # Override huggingface-hub, transformers and data-designer require two different versions of hugging-face hub | ||
| "huggingface-hub==0.36.0", # Pinned to qwen-asr 0.0.6 runtime compat (NvLLMOps commit 68f18e9b); also covers transformers vs data-designer disagreement | ||
| "transformers==4.57.6", # Pinned to qwen-asr 0.0.6 runtime compat (NvLLMOps commit 68f18e9b); overrides qwen-asr's incompatible declared transformers pin |
Summary
Adds a small Granary v2 Qwen-Omni first-stage inference path for audio. Curator keeps algorithm-only scope: Qwen3-Omni vLLM inference, local staged NeMo-tarred reader, sharded manifest writer, and explicit YAML tutorial stage graph. NvLLMOps remains responsible for data delivery, Kratos/MPI/Ray launch, GPU allocation, rank upload, and perf merge.
Pipeline shape
Scope boundary (Curator vs NvLLMOps)
NemoTarredAudioReader(streaming tar reads via lhotse, manifest-keyed lookup with deduplication, optional duration filtering, checkpoint skip-completed-shards),ShardedManifestWriterStage(actor-mode single-writer with.donemarker, per-shard perf JSONL, aggregateperf_summary.json), and the Hydra YAML tutorial that wires these into an explicit stage graph.Files in this PR
Source:
nemo_curator/models/qwen_omni.pynemo_curator/stages/audio/inference/__init__.pynemo_curator/stages/audio/inference/qwen_omni.pynemo_curator/stages/audio/io/nemo_tarred_reader.pynemo_curator/stages/audio/io/sharded_manifest_writer.pynemo_curator/stages/audio/metrics/performance.pyTutorial:
tutorials/audio/qwen_omni_inprocess/main.pytutorials/audio/qwen_omni_inprocess/qwen_omni_inprocess.yamltutorials/audio/qwen_omni_inprocess/README.mdtutorials/audio/README.md(index entry)Tests:
tests/stages/audio/inference/test_qwen_omni.pytests/stages/audio/io/test_nemo_tarred_reader.pytests/stages/audio/io/test_sharded_manifest_writer.pyPackaging:
pyproject.toml(add theaudio_cuda12extra; pulls in vLLM +qwen-omni-utils)uv.lockHow to run the tutorial (single rank)
uv sync --extra audio_cuda12 source .venv/bin/activate python tutorials/audio/qwen_omni_inprocess/main.py \ --config-path=tutorials/audio/qwen_omni_inprocess \ --config-name=qwen_omni_inprocess \ workspace_dir=/work \ input_manifest=/data/data_config.yamlHydra config knobs of note:
tensor_parallel_size,max_model_len,max_num_seqs,gpu_memory_utilization,batch_size,max_output_tokens,keep_waveform,prefetch_fail_on_error,default_language,prompt_text/en_prompt_file/followup_prompt,system_prompt.Testing
Local:
git diff --check— cleanpython -m py_compileon every PR-touched file — cleanpytest --confcutdir=tests/stages/audio tests/stages/audio/inference/test_qwen_omni.py tests/stages/audio/io/test_nemo_tarred_reader.py tests/stages/audio/io/test_sharded_manifest_writer.py— passesLive multi-node (via the NvLLMOps Kratos workflow, not part of this PR):
qwen_omni_inprocesspipeline only (reader + Qwen-Omni + writer)Succeeded(Kratos run3eca039b-7c40-4adc-ab4a-26660edc6c29)Out of scope for this PR
This PR is intentionally only the first Qwen-Omni inference stage. It does not include Qwen-ASR recovery, hallucination / cross-lingual filtering, LID, regex cleanup, PnC, ITN, SED, diarization, or any of the later Granary v2 stages — those will land in follow-up PRs.
Risk
vllm/transformers/qwen-omni-utilsstack via the optionalaudio_cuda12extra. Thevllmimport keeps the existingtry/except ImportError + VLLM_AVAILABLEguard (matches 5 sister model wrappers onmain) so Mac/ARM Curator installs can stillimport nemo_curator.models.qwen_omnicleanly.trust_remote_code=Trueis required by Qwen3-Omni-30B-A3B-Instruct today (the model card ships custom modeling code).