Update wavlm.md to match new model card template #40047
Open
reedrya wants to merge 1 commit into
stevhliu
reviewed
Aug 11, 2025
stevhliu
left a comment
Member
Thanks for your contribution!
| <div class="flex flex-wrap space-x-1"> | ||
| <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white"> | ||
| </div> | ||
| [WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model from Microsoft designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on HuBERT’s masked prediction approach but introduces denoising and data augmentation to make the learned representations more robust in noisy and multi-speaker conditions. |
Member
Suggested change
| [WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model from Microsoft designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on HuBERT’s masked prediction approach but introduces denoising and data augmentation to make the learned representations more robust in noisy and multi-speaker conditions. | |
| [WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on [HuBERT's](./hubert) masked prediction approach. It introduces gated relative position bias for better recognition accuracy and an unsupervised utterance-mixing strategy to improve speaker discrimination. |
| [WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model from Microsoft designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on HuBERT’s masked prediction approach but introduces denoising and data augmentation to make the learned representations more robust in noisy and multi-speaker conditions. | ||
|
|
||
| ## Overview | ||
| You can find all the original WavLM checkpoints under the [WavLM](https://huggingface.co/models?other=wavlm) collection. |
Member
Suggested change
| You can find all the original WavLM checkpoints under the [WavLM](https://huggingface.co/models?other=wavlm) collection. | |
| You can find all the original WavLM checkpoints under the [Microsoft](https://huggingface.co/microsoft/models?search=wavlm) organization. |
| > Click on the WavLM models in the right sidebar for more examples of how to apply WavLM to different audio tasks. | ||
|
|
||
| The abstract from the paper is the following: | ||
| The example below demonstrates how to extract audio features with [`Pipeline`] or the [`AutoModel`] class. |
Member
Suggested change
| The example below demonstrates how to extract audio features with [`Pipeline`] or the [`AutoModel`] class. | |
| The example below demonstrates how to automatically transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class. |
| <hfoption id="Pipeline"> | ||
|
|
||
| Relevant checkpoints can be found under https://huggingface.co/models?other=wavlm. | ||
| ```python |
Member
import torch
from transformers import pipeline

pipeline = pipeline(
    task="automatic-speech-recognition",
    model="patrickvonplaten/wavlm-libri-clean-100h-base-plus",
    torch_dtype=torch.float16,
    device=0
)
pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
|
|
||
| ## Resources | ||
| ```python | ||
| import torch |
Member
import torch
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForCTC

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = AutoProcessor.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus")
model = AutoModelForCTC.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus", torch_dtype=torch.float16)

# audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)

# transcribe speech
transcription = processor.batch_decode(predicted_ids)
transcription[0]
| </hfoption> | ||
| </hfoptions> | ||
|
|
||
| Quantization reduces the memory burden of large models by representing the weights in a lower precision. |
Member
The model is not that large, so we can remove the Quantization section.
| ``` | ||
|
|
||
| ## Notes | ||
| - WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use `Wav2Vec2Processor` for preprocessing. |
Member
Suggested change
| - WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use `Wav2Vec2Processor` for preprocessing. | |
| - WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use [`Wav2Vec2Processor`] for preprocessing. |
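For context, the input format this note describes can be sketched in plain Python. This is only an illustration under stated assumptions: `prepare_waveform` is a hypothetical helper, not part of Transformers, and the zero-mean, unit-variance scaling is an assumption about what the feature extractor does when normalization is enabled. In practice, [`Wav2Vec2Processor`] should handle this step.

```python
import math

EXPECTED_SAMPLING_RATE = 16_000  # the note above states that WavLM expects 16kHz audio


def prepare_waveform(samples, sampling_rate):
    """Hypothetical helper: validate a mono waveform and normalize it.

    Assumption: normalization means zero mean and unit variance per utterance.
    """
    if sampling_rate != EXPECTED_SAMPLING_RATE:
        raise ValueError(
            f"expected {EXPECTED_SAMPLING_RATE} Hz audio, got {sampling_rate} Hz; resample first"
        )
    if not samples or any(isinstance(s, (list, tuple)) for s in samples):
        raise ValueError("expected a non-empty 1D sequence of floats")

    values = [float(s) for s in samples]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    scale = math.sqrt(variance + 1e-7)  # small epsilon avoids division by zero on silence
    return [(v - mean) / scale for v in values]


normalized = prepare_waveform([0.0, 1.0, 0.0, -1.0], sampling_rate=16_000)
```

The sketch rejects audio at other sampling rates instead of resampling it, so a silent mismatch cannot slip through.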
|
|
||
| ## Notes | ||
| - WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use `Wav2Vec2Processor` for preprocessing. | ||
| - For CTC-based fine-tuning, model outputs should be decoded with `Wav2Vec2CTCTokenizer`. |
Member
Suggested change
| - For CTC-based fine-tuning, model outputs should be decoded with `Wav2Vec2CTCTokenizer`. | |
| - For CTC-based fine-tuning, model outputs should be decoded with [`Wav2Vec2CTCTokenizer`]. |
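To illustrate what decoding CTC outputs involves, here is a minimal sketch of greedy CTC decoding: take the per-frame argmax IDs, collapse consecutive repeats, and drop the blank token. The function name `ctc_greedy_decode`, the toy vocabulary, and the blank ID of 0 are all hypothetical and not taken from any checkpoint. The real tokenizer also handles details such as word delimiters that this sketch ignores.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions and remove blanks (greedy CTC decoding)."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the prediction changes and is not the blank token.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Toy vocabulary (assumption): ID 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "H", 2: "E", 3: "L", 4: "O"}
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
decoded = ctc_greedy_decode(frames, toy_vocab)  # "HELLO"
```

The blank between the two runs of `3` is what lets the double "L" survive the collapsing step.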
Comment on lines +101 to +105
| - The model works particularly well for tasks like speaker verification, identification, and diarization. | ||
|
|
||
| ## Resources | ||
| - [Audio classification task guide](https://huggingface.co/docs/transformers/en/tasks/audio_classification) | ||
| - [Automatic speech recognition task guide](https://huggingface.co/docs/transformers/en/tasks/asr) |
Member
Suggested change
| - The model works particularly well for tasks like speaker verification, identification, and diarization. | |
| ## Resources | |
| - [Audio classification task guide](https://huggingface.co/docs/transformers/en/tasks/audio_classification) | |
| - [Automatic speech recognition task guide](https://huggingface.co/docs/transformers/en/tasks/asr) |
What does this PR do?
This PR updates the WavLM model card to comply with the format introduced in #36979.
Before submitting
Who can review?
@stevhliu
Notes