With transformers.js 4.2.0, the automatic-speech-recognition pipeline with the standard onnx-community/whisper-base model produces less accurate timestamps than v3.8.1.
Reproduction
Run the following demo.html in a browser:
<script type="module">
  // import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.8.1";
  import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.2.0";

  const audio = new Float32Array(await (await fetch("input.pcm")).arrayBuffer());
  const pipe = await pipeline("automatic-speech-recognition", "onnx-community/whisper-base", {
    dtype: { encoder_model: "fp32", decoder_model_merged: "q4" },
    device: "webgpu",
  });
  const result = await pipe(audio, { return_timestamps: true });
  console.log(result.chunks.map(chunk => `${chunk.timestamp[0]} -> ${chunk.timestamp[1]}${chunk.text}`));
</script>
which prints for 3.8.1:
0:"0 -> 7 This is a day I've been looking forward to for two and a half years."
1:"7 -> 13 [applause]"
2:"13 -> 21 Every once in a while, a revolutionary product comes along that changes everything."
3:"21 -> 29 And Apple has been, well first of all, one's very fortunate if you get to work on just one of these,"
while using the very same model running with 4.2.0 prints:
0:"0 -> 7 This is a day I've been looking forward to for two and a half years."
1:"7 -> 22 Every once in a while, a revolutionary product comes along that changes everything."
2:"22 -> 28.92 And Apple has been, well first of all, one's very fortunate if you get to work on just one"
3:"28.92 -> 29.92 of these."
Notice the difference in the timing of the sentence "Every once in a while...", which actually starts around the 13-second mark (listen to the attached input audio, input.zip). 3.8.1 reports this start correctly, but 4.2.0 reports 7.
This is not a model problem, since both outputs come from the very same model.
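To make the regression easier to quantify, here is a minimal sketch that compares chunk start times between the two outputs above. The helper name `startDrift` and the prefix-matching heuristic are assumptions made for illustration only; they are not part of transformers.js. The chunk objects follow the `{ timestamp, text }` shape used in the demo, with values copied from the two printouts.

```javascript
// Chunks copied from the 3.8.1 and 4.2.0 outputs in this report.
const v381 = [
  { timestamp: [0, 7], text: " This is a day I've been looking forward to for two and a half years." },
  { timestamp: [7, 13], text: " [applause]" },
  { timestamp: [13, 21], text: " Every once in a while, a revolutionary product comes along that changes everything." },
  { timestamp: [21, 29], text: " And Apple has been, well first of all, one's very fortunate if you get to work on just one of these," },
];
const v420 = [
  { timestamp: [0, 7], text: " This is a day I've been looking forward to for two and a half years." },
  { timestamp: [7, 22], text: " Every once in a while, a revolutionary product comes along that changes everything." },
  { timestamp: [22, 28.92], text: " And Apple has been, well first of all, one's very fortunate if you get to work on just one" },
  { timestamp: [28.92, 29.92], text: " of these." },
];

// Hypothetical helper: pair chunks by the first few characters of their text
// and report how far each candidate start time is from the reference start.
function startDrift(reference, candidate, prefixLength = 20) {
  const key = (chunk) => chunk.text.trim().slice(0, prefixLength);
  return reference.map((ref) => {
    const match = candidate.find((c) => key(c) === key(ref));
    return {
      text: key(ref),
      reference: ref.timestamp[0],
      candidate: match ? match.timestamp[0] : null, // null: chunk missing in candidate
      drift: match ? match.timestamp[0] - ref.timestamp[0] : null,
    };
  });
}

const report = startDrift(v381, v420);
console.log(report);
```

On the data above, the "Every once in a while" chunk starts 6 seconds early in 4.2.0, and the "[applause]" chunk has no counterpart at all.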
System Info
transformers.js 4.2.0
Environment/Platform
Could you please look into the issue or provide a workaround?