[webgpu] Whisper encoder fp16 precision issues

### System Info

transformers.js v4

### Environment/Platform

- [x] Website/web-app
- [ ] Browser extension
- [ ] Server-side (e.g., Node.js, Deno, Bun)
- [ ] Desktop app (e.g., Electron)
- [ ] Other (e.g., VSCode extension)

### Description

When using the `automatic-speech-recognition` pipeline with the model `onnx-community/whisper-large-v3-turbo_timestamped` and return_timestamps: true, version 4.0.0-next.7 (and starting from at least 4.0.0-next.1) outputs a single segment covering the full audio duration (and missing the expected transcripts), while 3.8.1 correctly produces multiple shorter sentence/phrase-level segments with accurate boundaries.

This appears to be a regression in the v4 preview series, as the 3.x behavior better preserves the natural sentence/phrase structure of the audio.

### Reproduction

1. Download attached audio file ( [ch.zip](https://github.com/user-attachments/files/26054105/ch.zip) ) and extract. Use the exact same pipeline setup in both versions:

```html
<script type="module">
import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.8.1";
//import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.0.0-next.7";
const audio = new Float32Array(await (await fetch("ch.pcm")).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition",
	"onnx-community/whisper-large-v3-turbo_timestamped",
	{dtype:{encoder_model:"fp16", decoder_model_merged:"fp16"},
	device:"webgpu"});

const result = await pipe(audio, {
	return_timestamps: true});

console.log(result.chunks.map(chunk => `${chunk.timestamp[0]} -> ${chunk.timestamp[1]} ${chunk.text}`))
</script>
```

2. v3.8.1 output in console log:

```
"12 -> 16  Some stay dry and others feel the pain"
"16 -> 17  Chocolate Rain"
"17 -> 21  A baby born will die before the sin"
"21 -> 22  Chocolate Rain"
"22 -> 26  The school books say it can't be here again"
"26 -> 28  Chocolate Rain"
```

v4 output:

```
"0 -> 29.98  Chocolate Rain"
```

I tried various values for generation parameters like changing chunk_length_s (e.g., to 10, 15, 20), adjusting stride_length_s, and experimenting with other options such as num_beams or temperature, but none of these changes improved the output in v4 - the result remained a single merged segment regardless of the settings.

I expect the output to retain sentence/phrase-level segmentation similar to 3.8.1 (multiple segments with timestamps aligned to lyrical/phrasal breaks), rather than aggressively merging everything into one incomplete chunk. The model itself is capable of predicting distinct segments, and previous versions respected that.

This might be related to:
- https://github.com/huggingface/transformers.js/issues/1357
- https://github.com/huggingface/transformers.js/issues/1358

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[webgpu] Whisper encoder fp16 precision issues #1590

System Info

Environment/Platform

Description

Reproduction

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[webgpu] Whisper encoder fp16 precision issues #1590

Description

System Info

Environment/Platform

Description

Reproduction

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions