Update wavlm.md to match new model card template by reedrya · Pull Request #40047 · huggingface/transformers

reedrya · 2025-08-08T23:51:49Z

What does this PR do?

This PR updates the WavLM model card to comply with the format introduced in #36979.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).

Who can review?

@stevhliu

Notes

I did not include the AttentionMaskVisualizer section since I'm unfamiliar with it. Please advise if it should be added.

stevhliu

Thanks for your contribution!

stevhliu · 2025-08-11T16:14:16Z

-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-</div>
+[WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model from Microsoft designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on HuBERT’s masked prediction approach but introduces denoising and data augmentation to make the learned representations more robust in noisy and multi-speaker conditions.


Suggested change

[WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model from Microsoft designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on HuBERT’s masked prediction approach but introduces denoising and data augmentation to make the learned representations more robust in noisy and multi-speaker conditions.

[WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on [HuBERT's](./hubert) masked prediction approach. It introduces gated relative position bias for better recognition accuracy and an unsupervised utterance-mixing strategy to improve speaker discrimination.
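For readers unfamiliar with utterance mixing, here is a rough sketch of the general idea: a chunk of a second utterance is overlaid onto part of the primary one, so the model has to keep track of the main speaker. This is an illustration only, not the paper's exact recipe. The function name `mix_utterances`, the 50% cap on the overlapped region, and the scale range are all assumptions made for this example.

```python
import random


def mix_utterances(primary, secondary, max_ratio=0.5, rng=None):
    """Overlay a random chunk of `secondary` onto a random region of `primary`.

    Hypothetical helper: `max_ratio` and the scale range are illustrative choices.
    """
    rng = rng or random.Random()
    # Cap the overlapped region so the primary speaker stays dominant.
    max_len = int(len(primary) * max_ratio)
    mix_len = rng.randint(1, max(1, min(max_len, len(secondary))))
    src = rng.randint(0, len(secondary) - mix_len)  # where the chunk is taken from
    dst = rng.randint(0, len(primary) - mix_len)    # where the chunk is overlaid
    scale = rng.uniform(0.1, 1.0)                   # assumed loudness range
    mixed = list(primary)                           # leave the input untouched
    for i in range(mix_len):
        mixed[dst + i] += scale * secondary[src + i]
    return mixed


primary = [0.0] * 100
secondary = [1.0] * 40
mixed = mix_utterances(primary, secondary, rng=random.Random(0))
```

The output has the same length as the primary utterance, and at most half of its samples are touched by the interfering chunk.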

stevhliu · 2025-08-11T16:15:00Z

+[WavLM](https://huggingface.co/papers/2110.13900) is a self-supervised speech representation model from Microsoft designed to work across the “full stack” of speech tasks, from automatic speech recognition (ASR) to speaker diarization and audio event detection. It builds on HuBERT’s masked prediction approach but introduces denoising and data augmentation to make the learned representations more robust in noisy and multi-speaker conditions.

-## Overview
+You can find all the original WavLM checkpoints under the [WavLM](https://huggingface.co/models?other=wavlm) collection.


Suggested change

You can find all the original WavLM checkpoints under the [WavLM](https://huggingface.co/models?other=wavlm) collection.

You can find all the original WavLM checkpoints under the [Microsoft](https://huggingface.co/microsoft/models?search=wavlm) organization.

stevhliu · 2025-08-11T16:18:24Z

+> Click on the WavLM models in the right sidebar for more examples of how to apply WavLM to different audio tasks.

-The abstract from the paper is the following:
+The example below demonstrates how to extract audio features with [`Pipeline`] or the [`AutoModel`] class.


Suggested change

The example below demonstrates how to extract audio features with [`Pipeline`] or the [`AutoModel`] class.

The example below demonstrates how to automatically transcribe speech into text with [`Pipeline`] or the [`AutoModel`] class.

stevhliu · 2025-08-11T16:18:34Z

+<hfoption id="Pipeline">

-Relevant checkpoints can be found under https://huggingface.co/models?other=wavlm.
+```python


```python
import torch
from transformers import pipeline

pipeline = pipeline(
    task="automatic-speech-recognition",
    model="patrickvonplaten/wavlm-libri-clean-100h-base-plus",
    torch_dtype=torch.float16,
    device=0
)

pipeline("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac")
```

stevhliu · 2025-08-11T16:25:22Z


-## Resources
+```python
+import torch


```python
from transformers import AutoProcessor, AutoModelForCTC
from datasets import load_dataset
import torch

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = AutoProcessor.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus")
model = AutoModelForCTC.from_pretrained("patrickvonplaten/wavlm-libri-clean-100h-base-plus", torch_dtype=torch.float16)

# audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)

# transcribe speech
transcription = processor.batch_decode(predicted_ids)
transcription[0]
```

stevhliu · 2025-08-11T16:25:56Z

+</hfoption>
+</hfoptions>
+
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. 


The model is not that large, so we can remove the Quantization section.

stevhliu · 2025-08-11T16:26:21Z

+```
+
+## Notes
+- WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use `Wav2Vec2Processor` for preprocessing.


Suggested change

- WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use `Wav2Vec2Processor` for preprocessing.

- WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use [`Wav2Vec2Processor`] for preprocessing.
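To make the "raw 16kHz, 1D float array" requirement concrete, here is a minimal sketch of the kind of preparation a processor performs before the model sees the audio. It is not the `Wav2Vec2Processor` implementation: the helper `prepare_waveform` is hypothetical, and the zero-mean, unit-variance normalization with a small epsilon is an assumption modeled on what feature extractors of this family typically do.

```python
import math

TARGET_SAMPLE_RATE = 16_000  # the sampling rate stated in the note above


def prepare_waveform(pcm_int16, sample_rate):
    """Turn 16-bit PCM samples into a normalized 1D list of floats.

    Hypothetical helper for illustration; resampling is deliberately left out.
    """
    if sample_rate != TARGET_SAMPLE_RATE:
        raise ValueError(f"expected {TARGET_SAMPLE_RATE} Hz audio, got {sample_rate} Hz")
    if not pcm_int16:
        raise ValueError("empty waveform")
    # Scale integers to floats in [-1.0, 1.0).
    samples = [s / 32768.0 for s in pcm_int16]
    # Zero-mean, unit-variance normalization (assumed, with a small epsilon).
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    std = math.sqrt(variance + 1e-7)
    return [(s - mean) / std for s in samples]


waveform = prepare_waveform([0, 16384, -16384, 0], sample_rate=16_000)
```

Audio recorded at another rate would need resampling first; this sketch raises an error rather than guessing how to do that.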

stevhliu · 2025-08-11T16:26:30Z

+
+## Notes
+- WavLM processes raw 16kHz audio waveforms provided as 1D float arrays. Use `Wav2Vec2Processor` for preprocessing.
+- For CTC-based fine-tuning, model outputs should be decoded with `Wav2Vec2CTCTokenizer`.


Suggested change

- For CTC-based fine-tuning, model outputs should be decoded with `Wav2Vec2CTCTokenizer`.

- For CTC-based fine-tuning, model outputs should be decoded with [`Wav2Vec2CTCTokenizer`].
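As background on what CTC decoding involves, here is a small sketch of greedy decoding: collapse repeated frame predictions, drop the blank token, and join the remaining characters. It is a simplified illustration, not the `Wav2Vec2CTCTokenizer` code. The function `ctc_greedy_decode`, the toy vocabulary, and the choice of id 0 as the blank are assumptions for this example; the `|` word delimiter follows the convention commonly used by these tokenizers.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeats, remove blanks, and map ids to text (illustrative only)."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Consecutive identical predictions count as a single emission.
        if token_id != previous:
            collapsed.append(token_id)
        previous = token_id
    # Blanks separate genuine repeats (e.g. the double "L") and are then dropped.
    chars = [id_to_token[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace(word_delimiter, " ").strip()


# Toy vocabulary, hypothetical: id 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
frames = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 0]
text = ctc_greedy_decode(frames, vocab)
```

In practice, `processor.batch_decode` on the argmax of the logits handles this step.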

stevhliu · 2025-08-11T16:26:41Z

+- The model works particularly well for tasks like speaker verification, identification, and diarization.
+
+## Resources
+- [Audio classification task guide](https://huggingface.co/docs/transformers/en/tasks/audio_classification)
+- [Automatic speech recognition task guide](https://huggingface.co/docs/transformers/en/tasks/asr)


Suggested change

- The model works particularly well for tasks like speaker verification, identification, and diarization.

## Resources

- [Audio classification task guide](https://huggingface.co/docs/transformers/en/tasks/audio_classification)

- [Automatic speech recognition task guide](https://huggingface.co/docs/transformers/en/tasks/asr)

Update wavlm.md to match new model card template

a851678

stevhliu mentioned this pull request Aug 11, 2025

[Community contributions] Model cards #36979

Closed

stevhliu reviewed Aug 11, 2025

View reviewed changes

evalstate mentioned this pull request Apr 29, 2026

Cumulative feature and defect updates from recent Transformers PRs evalstate/transformers#42

Open

reedrya wants to merge 1 commit into huggingface:main from reedrya:update-model-card-wavlm
