<!-- <p align="left">Whisper-Based Automatic Speech Recognition (ASR) with improved timestamp accuracy + quality via forced phoneme alignment and voice-activity based batching for fast inference.</p> -->
<!-- <h2 align="left", id="what-is-it">What is it 🔎</h2> -->
This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
- ⚡️ Batched inference for 70x realtime transcription using whisper large-v2
- 🪶 [faster-whisper](https://github.com/guillaumekln/faster-whisper) backend, requires <8GB gpu memory for large-v2 with beam_size=5
- 🎯 Accurate word-level timestamps using wav2vec2 alignment
- 👯♂️ Multispeaker ASR using speaker diarization from [pyannote-audio](https://github.com/pyannote/pyannote-audio) (speaker ID labels)
- 🗣️ VAD preprocessing reduces hallucination and enables batching with no WER degradation — see the example command sketched below
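A sketch of a run combining these features (the audio path is a placeholder; diarization with pyannote requires a Hugging Face access token passed via `--hf_token`):

```bash
# Batched transcription with word-level alignment and speaker diarization
whisperx examples/sample01.wav \
    --model large-v2 \
    --batch_size 16 \
    --diarize \
    --hf_token YOUR_HF_TOKEN
```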
**Whisper** is an ASR model [developed by OpenAI](https://github.com/openai/whisper), trained on a large dataset of diverse audio. Whilst it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.
**Phoneme-Based ASR** A suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is [wav2vec2.0](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self).
<h2 align="left" id="highlights">New🚨</h2>
- 1st place at [Ego4d transcription challenge](https://eval.ai/web/challenges/challenge-page/1637/leaderboard/3931/WER) 🏆
- _WhisperX_ accepted at INTERSPEECH 2023
- v3 transcript segment-per-sentence: using nltk sent_tokenize for better subtitling & better diarization
- v3 released, 70x speed-up open-sourced. Using batched whisper with [faster-whisper](https://github.com/guillaumekln/faster-whisper) backend!
- v2 released, code cleanup, imports whisper library. VAD filtering is now turned on by default, as in the paper.
- Paper drop🎓👨‍🏫! Please see our [arXiv preprint](https://arxiv.org/abs/2303.00747) for benchmarking and details of WhisperX. We also introduce more efficient batch inference, resulting in large-v2 with *60-70x real-time speed*.
<h2 align="left" id="setup">Setup ⚙️</h2>
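The full setup instructions are not reproduced in this excerpt. As a minimal sketch, one common way to install is directly from the repository, assuming Python and a working PyTorch/CUDA environment are already in place:

```bash
# Install WhisperX from the repository (PyTorch and CUDA are assumed to be set up already)
pip install git+https://github.com/m-bain/whisperx.git
```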
Run whisper on an example segment (using default params, whisper small); adding `--highlight_words True` highlights word timings in the subtitle output:
```bash
whisperx path/to/audio.wav
```
Result using _WhisperX_ with forced alignment to wav2vec2.0 large:
For increased timestamp accuracy, at the cost of higher GPU memory, use bigger models (a bigger alignment model was not found to be that helpful; see paper), e.g.:
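A sketch of such a run — `WAV2VEC2_ASR_LARGE_LV60K_960H` is one of the torchaudio alignment pipelines, and the batch size here is illustrative:

```bash
# Larger ASR model plus a larger wav2vec2 alignment model (uses more GPU memory)
whisperx path/to/audio.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4
```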
To run on CPU instead of GPU (and for running on Mac OS X):
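A minimal sketch of a CPU-only run, using the `--device` option and the same `--compute_type int8` flag mentioned in the memory-saving tips further down:

```bash
# CPU-only inference; int8 compute avoids float16, which is not supported on CPU
whisperx path/to/audio.wav --device cpu --compute_type int8
```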
### Other languages
The phoneme ASR alignment model is _language-specific_, for tested languages these models are [automatically picked from torchaudio pipelines or huggingface](https://github.com/m-bain/whisperX/blob/f2da2f858e99e4211fe4f64b5f2938b007827e17/whisperx/alignment.py#L24-L58).
Just pass in the `--language` code, and use the whisper `--model large`.
Currently, default models are provided for `{en, fr, de, es, it}` via torchaudio pipelines, and for many other languages via Hugging Face. Please find the list of currently supported languages under `DEFAULT_ALIGN_MODELS_HF` in [alignment.py](https://github.com/m-bain/whisperX/blob/main/whisperx/alignment.py). If the detected language is not in this list, you need to find a phoneme-based ASR model from the [huggingface model hub](https://huggingface.co/models) and test it on your data.
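As a sketch, an alignment model can also be passed explicitly via `--align_model`; the language code and model ID below are placeholders showing the expected format, not a recommendation:

```bash
# Override the default alignment model with a wav2vec2 checkpoint from the Hugging Face hub
whisperx path/to/audio.wav --model large-v2 --language ja --align_model "NAME/wav2vec2-checkpoint-for-that-language"
```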
#### E.g. German
```bash
whisperx --model large-v2 --language de path/to/audio.wav
```
For specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint [paper](https://www.robots.ox.ac.uk/~vgg/publications/2023/Bain23/bain23.pdf).
To reduce GPU memory requirements, try any of the following (2. & 3. can affect quality):
1. reduce batch size, e.g. `--batch_size 4`
2. use a smaller ASR model `--model base`
3. use lighter compute type `--compute_type int8`
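For example, combining all three options (the audio path is a placeholder; the flags are those listed above):

```bash
# Lowest-memory configuration: small batch, base model, int8 compute
whisperx path/to/audio.wav --model base --batch_size 4 --compute_type int8
```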
Transcription differences from OpenAI's whisper:
1. Transcription without timestamps. To enable single-pass batching, whisper inference is performed with `--without_timestamps True`; this ensures one forward pass per sample in the batch. However, this can cause discrepancies with the default whisper output.
2. VAD-based segment transcription, unlike the buffered transcription of OpenAI's. In the WhisperX paper we show this reduces WER and enables accurate batched inference.
3. `--condition_on_prev_text` is set to `False` by default (reduces hallucination)
If you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good, send a pull request and some examples showing its success.
Bug finding and pull requests are also highly appreciated to keep this project going.
<h2 align="left" id="coming-soon">TODO 🗓</h2>
- [x] Multilingual init
- [x] Automatic align model selection based on language detection
- [x] Python usage
- [x] Incorporating speaker diarization
- [x] Model flush, for low gpu mem resources
- [x] Faster-whisper backend
- [x] Add max-line etc. see (openai's whisper utils.py)
- [x] Sentence-level segments (nltk toolbox)
- [x] Improve alignment logic
- [ ] update examples with diarization and word highlighting
- [ ] Subtitle .ass output <- bring this back (removed in v3)
- [ ] Add benchmarking code (TEDLIUM for spd/WER & word segmentation)
- [x] Allow silero-vad as alternative VAD option
- [ ] Improve diarization (word level). _Harder than first thought..._
<a href="https://www.buymeacoffee.com/maxhbain" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-orange.png" alt="Buy Me A Coffee" height="41" width="174"></a>
<h2 align="left" id="acks">Acknowledgements 🙏</h2>
This work, and my PhD, is supported by the [VGG (Visual Geometry Group)](https://www.robots.ox.ac.uk/~vgg/) and the University of Oxford.
Of course, this builds on [OpenAI's whisper](https://github.com/openai/whisper).
Borrows important alignment code from [PyTorch tutorial on forced alignment](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html)
And uses the wonderful pyannote VAD / Diarization https://github.com/pyannote/pyannote-audio