Commit 36d552c

fix: remove DiarizationPipeline from public API
1 parent 7d36b83 commit 36d552c

File tree: 2 files changed, +35 −56 lines

README.md

Lines changed: 35 additions & 47 deletions
@@ -22,26 +22,20 @@
   </a>
 </p>
 
-
 <img width="1216" align="center" alt="whisperx-arch" src="https://raw.githubusercontent.com/m-bain/whisperX/refs/heads/main/figures/pipeline.png">
 
-
 <!-- <p align="left">Whisper-Based Automatic Speech Recognition (ASR) with improved timestamp accuracy + quality via forced phoneme alignment and voice-activity based batching for fast inference.</p> -->
 
-
 <!-- <h2 align="left", id="what-is-it">What is it 🔎</h2> -->
 
-
 This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
 
 - ⚡️ Batched inference for 70x realtime transcription using whisper large-v2
 - 🪶 [faster-whisper](https://github.com/guillaumekln/faster-whisper) backend, requires <8GB gpu memory for large-v2 with beam_size=5
 - 🎯 Accurate word-level timestamps using wav2vec2 alignment
-- 👯‍♂️ Multispeaker ASR using speaker diarization from [pyannote-audio](https://github.com/pyannote/pyannote-audio) (speaker ID labels)
+- 👯‍♂️ Multispeaker ASR using speaker diarization from [pyannote-audio](https://github.com/pyannote/pyannote-audio) (speaker ID labels)
 - 🗣️ VAD preprocessing, reduces hallucination & batching with no WER degradation
 
-
-
 **Whisper** is an ASR model [developed by OpenAI](https://github.com/openai/whisper), trained on a large dataset of diverse audio. Whilst it does produces highly accurate transcriptions, the corresponding timestamps are at the utterance-level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.
 
 **Phoneme-Based ASR** A suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is [wav2vec2.0](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self).
@@ -54,12 +48,12 @@ This repository provides fast automatic speech recognition (70x realtime with la
 
 <h2 align="left", id="highlights">New🚨</h2>
 
-- 1st place at [Ego4d transcription challenge](https://eval.ai/web/challenges/challenge-page/1637/leaderboard/3931/WER) 🏆
-- _WhisperX_ accepted at INTERSPEECH 2023
+- 1st place at [Ego4d transcription challenge](https://eval.ai/web/challenges/challenge-page/1637/leaderboard/3931/WER) 🏆
+- _WhisperX_ accepted at INTERSPEECH 2023
 - v3 transcript segment-per-sentence: using nltk sent_tokenize for better subtitlting & better diarization
 - v3 released, 70x speed-up open-sourced. Using batched whisper with [faster-whisper](https://github.com/guillaumekln/faster-whisper) backend!
 - v2 released, code cleanup, imports whisper library VAD filtering is now turned on by default, as in the paper.
-- Paper drop🎓👨‍🏫! Please see our [ArxiV preprint](https://arxiv.org/abs/2303.00747) for benchmarking and details of WhisperX. We also introduce more efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed.
+- Paper drop🎓👨‍🏫! Please see our [ArxiV preprint](https://arxiv.org/abs/2303.00747) for benchmarking and details of WhisperX. We also introduce more efficient batch inference resulting in large-v2 with \*60-70x REAL TIME speed.
 
 <h2 align="left" id="setup">Setup ⚙️</h2>
 
@@ -118,21 +112,18 @@ Run whisper on example segment (using default params, whisper small) add `--high
 
 whisperx path/to/audio.wav
 
-
-Result using *WhisperX* with forced alignment to wav2vec2.0 large:
+Result using _WhisperX_ with forced alignment to wav2vec2.0 large:
 
 https://user-images.githubusercontent.com/36994049/208253969-7e35fe2a-7541-434a-ae91-8e919540555d.mp4
 
 Compare this to original whisper out the box, where many transcriptions are out of sync:
 
 https://user-images.githubusercontent.com/36994049/207743923-b4f0d537-29ae-4be2-b404-bb941db73652.mov
 
-
 For increased timestamp accuracy, at the cost of higher gpu mem, use bigger models (bigger alignment model not found to be that helpful, see paper) e.g.
 
 whisperx path/to/audio.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4
 
-
 To label the transcript with speaker ID's (set number of speakers if known e.g. `--min_speakers 2` `--max_speakers 2`):
 
 whisperx path/to/audio.wav --model large-v2 --diarize --highlight_words True
@@ -143,27 +134,26 @@ To run on CPU instead of GPU (and for running on Mac OS X):
 
 ### Other languages
 
-The phoneme ASR alignment model is *language-specific*, for tested languages these models are [automatically picked from torchaudio pipelines or huggingface](https://github.com/m-bain/whisperX/blob/f2da2f858e99e4211fe4f64b5f2938b007827e17/whisperx/alignment.py#L24-L58).
+The phoneme ASR alignment model is _language-specific_, for tested languages these models are [automatically picked from torchaudio pipelines or huggingface](https://github.com/m-bain/whisperX/blob/f2da2f858e99e4211fe4f64b5f2938b007827e17/whisperx/alignment.py#L24-L58).
 Just pass in the `--language` code, and use the whisper `--model large`.
 
 Currently default models provided for `{en, fr, de, es, it}` via torchaudio pipelines and many other languages via Hugging Face. Please find the list of currently supported languages under `DEFAULT_ALIGN_MODELS_HF` on [alignment.py](https://github.com/m-bain/whisperX/blob/main/whisperx/alignment.py). If the detected language is not in this list, you need to find a phoneme-based ASR model from [huggingface model hub](https://huggingface.co/models) and test it on your data.
 
-
 #### E.g. German
+
 whisperx --model large-v2 --language de path/to/audio.wav
 
 https://user-images.githubusercontent.com/36994049/208298811-e36002ba-3698-4731-97d4-0aebd07e0eb3.mov
 
-
 See more examples in other languages [here](EXAMPLES.md).
 
-## Python usage 🐍
+## Python usage 🐍
 
 ```python
 import whisperx
-import gc
+import gc
 
-device = "cuda"
+device = "cuda"
 audio_file = "audio.mp3"
 batch_size = 16 # reduce if low on GPU mem
 compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)
@@ -192,7 +182,7 @@ print(result["segments"]) # after alignment
 # import gc; gc.collect(); torch.cuda.empty_cache(); del model_a
 
 # 3. Assign speaker labels
-diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)
+diarize_model = whisperx.diarize.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)
 
 # add min/max number of speakers if known
 diarize_segments = diarize_model(audio)
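Taken together with the earlier steps of the README example, the updated call path looks roughly like the sketch below. It assumes `audio`, `result`, `device`, and `YOUR_HF_TOKEN` are already defined as in the preceding Python usage block, and that `assign_word_speakers` keeps its package-root export (see the `whisperx/__init__.py` diff below); treat it as a sketch of the intended usage, not a verbatim excerpt of the file.

```python
# 3. Assign speaker labels (updated import path from this commit)
# DiarizationPipeline is now reached through the whisperx.diarize submodule;
# the package-root alias whisperx.DiarizationPipeline no longer exists.
diarize_model = whisperx.diarize.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)

# Run diarization on the already-loaded audio; speaker-count hints can be given if known.
diarize_segments = diarize_model(audio)

# assign_word_speakers is still exported at the package root (see whisperx/__init__.py),
# so this call is unchanged: it merges speaker labels into the aligned transcript.
result = whisperx.assign_word_speakers(diarize_segments, result)
print(result["segments"])  # segments now carry speaker IDs
```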
@@ -205,25 +195,27 @@ print(result["segments"]) # segments are now assigned speaker IDs
 
 ## Demos 🚀
 
-[![Replicate (large-v3](https://img.shields.io/static/v1?label=Replicate+WhisperX+large-v3&message=Demo+%26+Cloud+API&color=blue)](https://replicate.com/victor-upmeet/whisperx)
-[![Replicate (large-v2](https://img.shields.io/static/v1?label=Replicate+WhisperX+large-v2&message=Demo+%26+Cloud+API&color=blue)](https://replicate.com/daanelson/whisperx)
-[![Replicate (medium)](https://img.shields.io/static/v1?label=Replicate+WhisperX+medium&message=Demo+%26+Cloud+API&color=blue)](https://replicate.com/carnifexer/whisperx)
+[![Replicate (large-v3](https://img.shields.io/static/v1?label=Replicate+WhisperX+large-v3&message=Demo+%26+Cloud+API&color=blue)](https://replicate.com/victor-upmeet/whisperx)
+[![Replicate (large-v2](https://img.shields.io/static/v1?label=Replicate+WhisperX+large-v2&message=Demo+%26+Cloud+API&color=blue)](https://replicate.com/daanelson/whisperx)
+[![Replicate (medium)](https://img.shields.io/static/v1?label=Replicate+WhisperX+medium&message=Demo+%26+Cloud+API&color=blue)](https://replicate.com/carnifexer/whisperx)
 
-If you don't have access to your own GPUs, use the links above to try out WhisperX.
+If you don't have access to your own GPUs, use the links above to try out WhisperX.
 
 <h2 align="left" id="whisper-mod">Technical Details 👷‍♂️</h2>
 
 For specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint [paper](https://www.robots.ox.ac.uk/~vgg/publications/2023/Bain23/bain23.pdf).
 
 To reduce GPU memory requirements, try any of the following (2. & 3. can affect quality):
+
 1. reduce batch size, e.g. `--batch_size 4`
-2. use a smaller ASR model `--model base`
-3. Use lighter compute type `--compute_type int8`
+2. use a smaller ASR model `--model base`
+3. Use lighter compute type `--compute_type int8`
 
 Transcription differences from openai's whisper:
+
 1. Transcription without timestamps. To enable single pass batching, whisper inference is performed `--without_timestamps True`, this ensures 1 forward pass per sample in the batch. However, this can cause discrepancies the default whisper output.
 2. VAD-based segment transcription, unlike the buffered transcription of openai's. In the WhisperX paper we show this reduces WER, and enables accurate batched inference
-3. `--condition_on_prev_text` is set to `False` by default (reduces hallucination)
+3. `--condition_on_prev_text` is set to `False` by default (reduces hallucination)
 
 <h2 align="left" id="limitations">Limitations ⚠️</h2>
 
@@ -232,7 +224,6 @@ Transcription differences from openai's whisper:
 - Diarization is far from perfect
 - Language specific wav2vec2 model is needed
 
-
 <h2 align="left" id="contribute">Contribute 🧑‍🏫</h2>
 
 If you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good send a pull request and some examples showing its success.
@@ -241,43 +232,40 @@ Bug finding and pull requests are also highly appreciated to keep this project g
 
 <h2 align="left" id="coming-soon">TODO 🗓</h2>
 
-* [x] Multilingual init
+- [x] Multilingual init
 
-* [x] Automatic align model selection based on language detection
+- [x] Automatic align model selection based on language detection
 
-* [x] Python usage
+- [x] Python usage
 
-* [x] Incorporating speaker diarization
+- [x] Incorporating speaker diarization
 
-* [x] Model flush, for low gpu mem resources
+- [x] Model flush, for low gpu mem resources
 
-* [x] Faster-whisper backend
+- [x] Faster-whisper backend
 
-* [x] Add max-line etc. see (openai's whisper utils.py)
+- [x] Add max-line etc. see (openai's whisper utils.py)
 
-* [x] Sentence-level segments (nltk toolbox)
+- [x] Sentence-level segments (nltk toolbox)
 
-* [x] Improve alignment logic
+- [x] Improve alignment logic
 
-* [ ] update examples with diarization and word highlighting
+- [ ] update examples with diarization and word highlighting
 
-* [ ] Subtitle .ass output <- bring this back (removed in v3)
+- [ ] Subtitle .ass output <- bring this back (removed in v3)
 
-* [ ] Add benchmarking code (TEDLIUM for spd/WER & word segmentation)
+- [ ] Add benchmarking code (TEDLIUM for spd/WER & word segmentation)
 
-* [x] Allow silero-vad as alternative VAD option
-
-* [ ] Improve diarization (word level). *Harder than first thought...*
+- [x] Allow silero-vad as alternative VAD option
 
+- [ ] Improve diarization (word level). _Harder than first thought..._
 
 <h2 align="left" id="contact">Contact/Support 📇</h2>
 
-
 Contact [email protected] for queries.
 
 <a href="https://www.buymeacoffee.com/maxhbain" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-orange.png" alt="Buy Me A Coffee" height="41" width="174"></a>
 
-
 <h2 align="left" id="acks">Acknowledgements 🙏</h2>
 
 This work, and my PhD, is supported by the [VGG (Visual Geometry Group)](https://www.robots.ox.ac.uk/~vgg/) and the University of Oxford.
@@ -286,8 +274,8 @@ Of course, this is builds on [openAI's whisper](https://github.com/openai/whispe
 Borrows important alignment code from [PyTorch tutorial on forced alignment](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html)
 And uses the wonderful pyannote VAD / Diarization https://github.com/pyannote/pyannote-audio
 
-
 Valuable VAD & Diarization Models from:
+
 - [pyannote audio][https://github.com/pyannote/pyannote-audio]
 - [silero vad][https://github.com/snakers4/silero-vad]

whisperx/__init__.py

Lines changed: 0 additions & 9 deletions
@@ -29,12 +29,3 @@ def load_audio(*args, **kwargs):
 def assign_word_speakers(*args, **kwargs):
     diarize = _lazy_import("diarize")
     return diarize.assign_word_speakers(*args, **kwargs)
-
-
-class DiarizationPipeline:
-    def __init__(self, *args, **kwargs):
-        diarize = _lazy_import("diarize")
-        self._pipeline = diarize.DiarizationPipeline(*args, **kwargs)
-
-    def __getattr__(self, name):
-        return getattr(self._pipeline, name)
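Since the lazy `DiarizationPipeline` wrapper is deleted from `whisperx/__init__.py`, code that relied on the package-root attribute needs a one-line migration; the remaining root-level helpers still go through `_lazy_import`, so nothing else about the public surface changes here. A minimal before/after sketch, where the token string and audio file name are hypothetical placeholders:

```python
import whisperx
from whisperx.diarize import DiarizationPipeline  # new home after this commit

HF_TOKEN = "hf_xxx"                       # hypothetical placeholder for a real Hugging Face token
audio = whisperx.load_audio("audio.mp3")  # load_audio keeps its package-root export

# Before: diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device="cuda")
# After:
diarize_model = DiarizationPipeline(use_auth_token=HF_TOKEN, device="cuda")
diarize_segments = diarize_model(audio)
```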
