System Info
transformers.js v4
Environment/Platform
Description
When using the automatic-speech-recognition pipeline with the model onnx-community/whisper-large-v3-turbo_timestamped and return_timestamps: true, version 4.0.0-next.7 (and every preview since at least 4.0.0-next.1) returns a single segment spanning the full audio duration and drops most of the expected transcript. Version 3.8.1 correctly produces multiple shorter sentence/phrase-level segments with accurate boundaries.
This appears to be a regression in the v4 preview series, as the 3.x behavior better preserves the natural sentence/phrase structure of the audio.
Reproduction
- Download the attached audio file (ch.zip) and extract it. Use the same pipeline setup in both versions, switching only the import line:
<script type="module">
  // Swap the two imports to compare versions.
  import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.8.1";
  //import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.0.0-next.7";

  // ch.pcm is read as raw 32-bit float samples.
  const audio = new Float32Array(await (await fetch("ch.pcm")).arrayBuffer());

  const pipe = await pipeline(
    "automatic-speech-recognition",
    "onnx-community/whisper-large-v3-turbo_timestamped",
    {
      dtype: { encoder_model: "fp16", decoder_model_merged: "fp16" },
      device: "webgpu",
    },
  );

  const result = await pipe(audio, { return_timestamps: true });

  // Print one "start -> end text" line per returned segment.
  console.log(result.chunks.map(chunk => `${chunk.timestamp[0]} -> ${chunk.timestamp[1]} ${chunk.text}`));
</script>
- v3.8.1 output in console log:
"12 -> 16 Some stay dry and others feel the pain"
"16 -> 17 Chocolate Rain"
"17 -> 21 A baby born will die before the sin"
"21 -> 22 Chocolate Rain"
"22 -> 26 The school books say it can't be here again"
"26 -> 28 Chocolate Rain"
- v4.0.0-next.7 output in console log:
"0 -> 29.98 Chocolate Rain"
I tried varying the generation parameters: chunk_length_s (e.g., 10, 15, 20), stride_length_s, and other options such as num_beams and temperature. None of these changed the v4 output; the result remained a single merged segment regardless of the settings.
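For anyone who wants to repeat that parameter sweep, here is a minimal sketch. buildOptionGrid and runSweep are hypothetical helper names, not part of transformers.js, and the specific stride values are made up for illustration. The filter on 2 * stride_length_s is an assumption that each window needs to advance past its overlap on both sides; it is not taken from the library docs.

```javascript
// Hypothetical helper: build every option combination to try.
// Only chunk_length_s and stride_length_s are swept here; other options
// (num_beams, temperature) could be added the same way.
function buildOptionGrid(chunkLengths, strideLengths) {
  const grid = [];
  for (const chunk_length_s of chunkLengths) {
    for (const stride_length_s of strideLengths) {
      // Assumption: skip combinations where the two strides would consume
      // the whole window, so consecutive windows could not advance.
      if (2 * stride_length_s >= chunk_length_s) continue;
      grid.push({ return_timestamps: true, chunk_length_s, stride_length_s });
    }
  }
  return grid;
}

// Hypothetical helper: run the pipeline once per option set and record
// how many segments came back. `pipe` is the object returned by pipeline().
async function runSweep(pipe, audio, grid) {
  const rows = [];
  for (const options of grid) {
    const result = await pipe(audio, options);
    const chunks = result.chunks ?? [];
    rows.push({ ...options, segments: chunks.length });
  }
  return rows;
}
```

With the pipe and audio from the reproduction script, `console.table(await runSweep(pipe, audio, buildOptionGrid([10, 15, 20], [1, 5])))` would show the segment count per setting side by side.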
I expect the output to retain sentence/phrase-level segmentation similar to 3.8.1 (multiple segments with timestamps aligned to lyrical/phrasal breaks), rather than aggressively merging everything into one incomplete chunk. The model itself is capable of predicting distinct segments, and previous versions respected that.
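To make the expected behavior checkable, the following sketch flags the failure mode described here. looksCollapsed is a hypothetical helper, the 0.9 coverage threshold is an arbitrary choice, and the 30-second duration in the usage note is an assumption based on the 29.98 end timestamp, not a measured value.

```javascript
// Hypothetical check: does the result look like one merged segment
// covering (almost) the whole clip?
// chunks: array of { timestamp: [start, end], text } as returned with
// return_timestamps: true. audioDurationS: clip length in seconds.
function looksCollapsed(chunks, audioDurationS, minCoverage = 0.9) {
  if (chunks.length !== 1) return false;
  const [start, end] = chunks[0].timestamp;
  // Treat a missing end timestamp as running to the end of the audio.
  const span = (end ?? audioDurationS) - (start ?? 0);
  return span / audioDurationS >= minCoverage;
}
```

Fed the v4 output above with an assumed duration of 30, `looksCollapsed([{ timestamp: [0, 29.98], text: "Chocolate Rain" }], 30)` returns true, while the six-segment 3.8.1 output returns false.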
This might be related to: