ASR (Whisper) regression for detected timestamps #1684

@jozefchutka

System Info

transformers.js 4.2.0

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

With transformers.js 4.2.0, the automatic-speech-recognition pipeline with the standard onnx-community/whisper-base model produces timestamps that are less accurate than those produced by v3.8.1.

Reproduction

Run the following demo.html in a browser:

<script type="module">
// Switch the active import to compare versions.
//import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.8.1";
import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.2.0";

// input.pcm is read as raw 32-bit float PCM samples.
const audio = new Float32Array(await (await fetch("input.pcm")).arrayBuffer());
const pipe = await pipeline("automatic-speech-recognition",
	"onnx-community/whisper-base",
	{
		dtype: { encoder_model: "fp32", decoder_model_merged: "q4" },
		device: "webgpu",
	});

// Request chunk-level timestamps.
const result = await pipe(audio, { return_timestamps: true });

// Print "start -> end text" for each chunk.
console.log(result.chunks.map(chunk => `${chunk.timestamp[0]} -> ${chunk.timestamp[1]} ${chunk.text}`));
</script>

With 3.8.1, this prints:

0:"0 -> 7  This is a day I've been looking forward to for two and a half years."
1:"7 -> 13  [applause]"
2:"13 -> 21  Every once in a while, a revolutionary product comes along that changes everything."
3:"21 -> 29  And Apple has been, well first of all, one's very fortunate if you get to work on just one of these,"

With 4.2.0 and the very same model, it prints:

0:"0 -> 7  This is a day I've been looking forward to for two and a half years."
1:"7 -> 22  Every once in a while, a revolutionary product comes along that changes everything."
2:"22 -> 28.92  And Apple has been, well first of all, one's very fortunate if you get to work on just one"
3:"28.92 -> 29.92  of these."

Notice the difference in the timing of the sentence "Every once in a while...", which actually starts around the 13-second mark (as can be heard in the attached audio, input.zip). 3.8.1 reports the start correctly, but 4.2.0 reports 7.
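
To make the drift easier to spot across versions, the two outputs can be compared programmatically. The following is only a rough sketch: `diffChunkStarts` and its `toleranceSec` parameter are hypothetical names, not part of transformers.js, and the arrays simply restate the first chunks printed above (text trimmed). It matches chunks by identical text and reports start times that differ by more than the tolerance.

```javascript
// First chunks of each run, copied from the outputs above (text trimmed).
const v381 = [
  { timestamp: [0, 7], text: "This is a day I've been looking forward to for two and a half years." },
  { timestamp: [7, 13], text: "[applause]" },
  { timestamp: [13, 21], text: "Every once in a while, a revolutionary product comes along that changes everything." },
];
const v420 = [
  { timestamp: [0, 7], text: "This is a day I've been looking forward to for two and a half years." },
  { timestamp: [7, 22], text: "Every once in a while, a revolutionary product comes along that changes everything." },
];

// Hypothetical helper: for each chunk whose text appears in both runs,
// report the start-time difference if it exceeds toleranceSec.
function diffChunkStarts(baseline, candidate, toleranceSec = 0.5) {
  const baselineStarts = new Map(baseline.map(c => [c.text.trim(), c.timestamp[0]]));
  const drifts = [];
  for (const c of candidate) {
    const text = c.text.trim();
    if (!baselineStarts.has(text)) continue; // chunk was split or worded differently
    const baselineStart = baselineStarts.get(text);
    const delta = c.timestamp[0] - baselineStart;
    if (Math.abs(delta) > toleranceSec) {
      drifts.push({ text, baselineStart, candidateStart: c.timestamp[0], delta });
    }
  }
  return drifts;
}

const drifts = diffChunkStarts(v381, v420);
console.log(drifts);
```

For the data above, this flags only the "Every once in a while..." chunk, whose start moves from 13 to 7. Chunks that are split differently between versions (such as the last sentence) are skipped, because the comparison relies on exact text matches.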

  • This is not a model problem, as both outputs are run on the very same model.
  • The issue looks similar to the one reported in "[webgpu] Whisper encoder fp16 precision issues" (#1590), which also exposes a timestamp regression in transformers.js v4 (with a different model).

Could you please look into the issue or provide a workaround?
