ASR (Whisper) regression for detected timestamps #1684

### System Info

transformers.js 4.2.0

### Environment/Platform

- [x] Website/web-app
- [ ] Browser extension
- [ ] Server-side (e.g., Node.js, Deno, Bun)
- [ ] Desktop app (e.g., Electron)
- [ ] Other (e.g., VSCode extension)

### Description

With transformers.js 4.2.0, the `automatic-speech-recognition` pipeline with the standard `onnx-community/whisper-base` model produces less accurate timestamps than v3.8.1.


### Reproduction

Run the following demo.html in a browser:

```html
<script type="module">
//import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.8.1";
import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.2.0";
const audio = new Float32Array(await (await fetch("input.pcm")).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition",
	"onnx-community/whisper-base",
	{dtype:{encoder_model:"fp32", decoder_model_merged:"q4"},
	device:"webgpu"});

const result = await pipe(audio, {
	return_timestamps: true});

console.log(result.chunks.map(chunk => `${chunk.timestamp[0]} -> ${chunk.timestamp[1]} ${chunk.text}`))
</script>
```

which prints for 3.8.1:

```
0:"0 -> 7  This is a day I've been looking forward to for two and a half years."
1:"7 -> 13  [applause]"
2:"13 -> 21  Every once in a while, a revolutionary product comes along that changes everything."
3:"21 -> 29  And Apple has been, well first of all, one's very fortunate if you get to work on just one of these,"
```

while using the very same model running with 4.2.0 prints:

```
0:"0 -> 7  This is a day I've been looking forward to for two and a half years."
1:"7 -> 22  Every once in a while, a revolutionary product comes along that changes everything."
2:"22 -> 28.92  And Apple has been, well first of all, one's very fortunate if you get to work on just one"
3:"28.92 -> 29.92  of these."
```

Notice the difference in the timing of the sentence `Every once in a while...`, which actually starts at around the `13` second mark (as can be heard in the attached input audio: [input.zip](https://github.com/user-attachments/files/27711520/input.zip)). 3.8.1 reports this start time correctly, but 4.2.0 reports `7`. The `[applause]` chunk that 3.8.1 emits for `7 -> 13` is also absent from the 4.2.0 output.

- This is not a model problem, as both outputs come from the very same model.
- The issue looks similar to https://github.com/huggingface/transformers.js/issues/1590, which also reports a timestamp regression in transformers.js v4 (with a different model).
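
To make the comparison above easier to repeat on other audio, here is a minimal sketch of a helper that pairs chunks from two runs by their text and reports timestamp differences. `diffChunkTimestamps` is a hypothetical name and not part of transformers.js; matching by exact (trimmed, lowercased) text and the 0.5 s tolerance are assumptions chosen for illustration. The sample data is copied from the two outputs shown above.

```js
// Hypothetical helper (not part of transformers.js): compare the chunks of a
// baseline run against a candidate run and list what moved or disappeared.
function diffChunkTimestamps(baseline, candidate, tolerance = 0.5) {
  const normalize = (text) => text.trim().toLowerCase();
  // A null timestamp (possible for an open-ended chunk) is treated as "no difference".
  const delta = (a, b) => (a == null || b == null ? 0 : a - b);
  const candidateByText = new Map(candidate.map((c) => [normalize(c.text), c]));

  const report = [];
  for (const chunk of baseline) {
    const match = candidateByText.get(normalize(chunk.text));
    if (!match) {
      // No candidate chunk has the same text (dropped or split differently).
      report.push({ text: chunk.text.trim(), status: "missing" });
      continue;
    }
    const startDelta = delta(match.timestamp[0], chunk.timestamp[0]);
    const endDelta = delta(match.timestamp[1], chunk.timestamp[1]);
    if (Math.abs(startDelta) > tolerance || Math.abs(endDelta) > tolerance) {
      report.push({ text: chunk.text.trim(), status: "shifted", startDelta, endDelta });
    }
  }
  return report;
}

// Chunks as printed by 3.8.1 and 4.2.0 in this issue.
const v3Chunks = [
  { timestamp: [0, 7], text: " This is a day I've been looking forward to for two and a half years." },
  { timestamp: [7, 13], text: " [applause]" },
  { timestamp: [13, 21], text: " Every once in a while, a revolutionary product comes along that changes everything." },
  { timestamp: [21, 29], text: " And Apple has been, well first of all, one's very fortunate if you get to work on just one of these," },
];
const v4Chunks = [
  { timestamp: [0, 7], text: " This is a day I've been looking forward to for two and a half years." },
  { timestamp: [7, 22], text: " Every once in a while, a revolutionary product comes along that changes everything." },
  { timestamp: [22, 28.92], text: " And Apple has been, well first of all, one's very fortunate if you get to work on just one" },
  { timestamp: [28.92, 29.92], text: " of these." },
];

const report = diffChunkTimestamps(v3Chunks, v4Chunks);
console.log(report);
```

On this data the report flags `[applause]` as missing, `Every once in a while...` as shifted (start 6 seconds earlier, end 1 second later), and the last 3.8.1 chunk as missing because 4.2.0 splits it into two chunks. In the demo page the same function could be called on `result.chunks` from each version.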

Could you please look into the issue or provide a workaround?
