[webgpu] Whisper encoder fp16 precision issues #1590

@jozefchutka

System Info

transformers.js v4

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

When using the automatic-speech-recognition pipeline with the model onnx-community/whisper-large-v3-turbo_timestamped and return_timestamps: true, version 4.0.0-next.7 (and every preview I tested, back to at least 4.0.0-next.1) outputs a single segment covering the full audio duration, with most of the expected transcript missing. Version 3.8.1 correctly produces multiple shorter sentence/phrase-level segments with accurate boundaries.

This appears to be a regression in the v4 preview series, as the 3.x behavior better preserves the natural sentence/phrase structure of the audio.

Reproduction

  1. Download the attached audio file (ch.zip) and extract it. Use the exact same pipeline setup in both versions:
<script type="module">
// Swap which import is commented out to compare the two versions.
import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.8.1";
//import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.0.0-next.7";

// ch.pcm (extracted from ch.zip) is read as raw 32-bit float samples.
const audio = new Float32Array(await (await fetch("ch.pcm")).arrayBuffer());

// Load both the encoder and the merged decoder in fp16 on WebGPU.
const pipe = await pipeline("automatic-speech-recognition",
	"onnx-community/whisper-large-v3-turbo_timestamped",
	{dtype:{encoder_model:"fp16", decoder_model_merged:"fp16"},
	device:"webgpu"});

const result = await pipe(audio, {
	return_timestamps: true});

// Print one "start -> end text" entry per returned chunk.
console.log(result.chunks.map(chunk => `${chunk.timestamp[0]} -> ${chunk.timestamp[1]} ${chunk.text}`))
</script>
  2. v3.8.1 output in the console log:
"12 -> 16  Some stay dry and others feel the pain"
"16 -> 17  Chocolate Rain"
"17 -> 21  A baby born will die before the sin"
"21 -> 22  Chocolate Rain"
"22 -> 26  The school books say it can't be here again"
"26 -> 28  Chocolate Rain"

v4 output:

"0 -> 29.98  Chocolate Rain"

I tried varying the generation parameters: changing chunk_length_s (e.g., to 10, 15, 20), adjusting stride_length_s, and experimenting with other options such as num_beams and temperature. None of these changes improved the output in v4; the result remained a single merged segment regardless of the settings.
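For anyone who wants to repeat the parameter sweep, here is a minimal sketch of how the option combinations could be generated. The helper name buildSweep and the specific stride values are hypothetical; only chunk_length_s, stride_length_s and return_timestamps come from the report above. The filter that skips strides of half the chunk or more is a sanity assumption, not a documented rule.

```javascript
// Hypothetical helper: builds one options object per (chunk, stride) pair.
function buildSweep(chunkLengths, strideLengths) {
  const variants = [];
  for (const chunk_length_s of chunkLengths) {
    for (const stride_length_s of strideLengths) {
      // Assumption: the stride is applied on both sides of a chunk, so skip
      // pairs where two strides would consume the whole chunk.
      if (2 * stride_length_s >= chunk_length_s) continue;
      variants.push({ return_timestamps: true, chunk_length_s, stride_length_s });
    }
  }
  return variants;
}

// Usage with the `pipe` and `audio` from the reproduction above:
// for (const options of buildSweep([10, 15, 20], [1, 2, 5])) {
//   const result = await pipe(audio, options);
//   console.log(options, result.chunks.length);
// }
```

Logging only the number of chunks per combination makes it easy to see at a glance whether any setting restores more than one segment.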

I expect the output to retain sentence/phrase-level segmentation similar to 3.8.1 (multiple segments with timestamps aligned to lyrical/phrasal breaks), rather than aggressively merging everything into one incomplete chunk. The model itself is capable of predicting distinct segments, and previous versions respected that.
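To make the expected-versus-actual difference checkable in a script, the following sketch flags a result that has collapsed into one segment spanning nearly the whole clip. The function name looksCollapsed, the 0.9 coverage threshold, and the 30-second duration used in the example are all assumptions made for illustration. The chunk shape ({ timestamp: [start, end], text }) matches what the reproduction script reads.

```javascript
// Hypothetical check: true when the output is a single chunk that covers
// at least `coverage` of the audio duration.
function looksCollapsed(chunks, audioDurationS, coverage = 0.9) {
  if (chunks.length !== 1) return false;
  const [start, end] = chunks[0].timestamp;
  // Treat a missing end timestamp as running to the end of the audio.
  const effectiveEnd = end ?? audioDurationS;
  return (effectiveEnd - start) / audioDurationS >= coverage;
}

// Chunks copied from the console output above (first and last v3.8.1 entries
// plus the single v4 entry); the 30 s duration is an assumption.
const v3Chunks = [
  { timestamp: [12, 16], text: " Some stay dry and others feel the pain" },
  { timestamp: [26, 28], text: " Chocolate Rain" },
];
const v4Chunks = [{ timestamp: [0, 29.98], text: " Chocolate Rain" }];

console.log(looksCollapsed(v3Chunks, 30), looksCollapsed(v4Chunks, 30));
```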

This might be related to:

Labels: bug
