<!-- <p align="left">Whisper-Based Automatic Speech Recognition (ASR) with improved timestamp accuracy + quality via forced phoneme alignment and voice-activity based batching for fast inference.</p> -->
<!-- <h2 align="left", id="what-is-it">What is it 🔎</h2> -->
This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
- ⚡️ Batched inference for 70x realtime transcription using whisper large-v2
- 🪶 [faster-whisper](https://github.com/guillaumekln/faster-whisper) backend, requires <8GB gpu memory for large-v2 with beam_size=5
- 🎯 Accurate word-level timestamps using wav2vec2 alignment
- 👯♂️ Multispeaker ASR using speaker diarization from [pyannote-audio](https://github.com/pyannote/pyannote-audio) (speaker ID labels)
- 🗣️ VAD preprocessing reduces hallucination and enables batching with no WER degradation — see the example command sketched below
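A sketch of a run combining these features (the audio path is a placeholder; diarization with pyannote requires a Hugging Face access token passed via `--hf_token`):

```bash
# Batched transcription with word-level alignment and speaker diarization
whisperx examples/sample01.wav \
    --model large-v2 \
    --batch_size 16 \
    --diarize \
    --hf_token YOUR_HF_TOKEN
```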
**Whisper** is an ASR model [developed by OpenAI](https://github.com/openai/whisper), trained on a large dataset of diverse audio. Whilst it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.
**Phoneme-Based ASR** A suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is [wav2vec2.0](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self).
<h2 align="left" id="highlights">New🚨</h2>
- 1st place at [Ego4d transcription challenge](https://eval.ai/web/challenges/challenge-page/1637/leaderboard/3931/WER) 🏆
- _WhisperX_ accepted at INTERSPEECH 2023
- v3 transcript segment-per-sentence: using nltk sent_tokenize for better subtitling & better diarization
- v3 released, 70x speed-up open-sourced. Using batched whisper with [faster-whisper](https://github.com/guillaumekln/faster-whisper) backend!
- v2 released, code cleanup, imports whisper library. VAD filtering is now turned on by default, as in the paper.
- Paper drop🎓👨‍🏫! Please see our [arXiv preprint](https://arxiv.org/abs/2303.00747) for benchmarking and details of WhisperX. We also introduce more efficient batch inference, resulting in large-v2 with *60-70x real-time speed*.
<h2 align="left" id="setup">Setup ⚙️</h2>
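The full setup instructions are not reproduced in this excerpt. As a minimal sketch, one common way to install is directly from the repository, assuming Python and a working PyTorch/CUDA environment are already in place:

```bash
# Install WhisperX from the repository (PyTorch and CUDA are assumed to be set up already)
pip install git+https://github.com/m-bain/whisperx.git
```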
Run whisper on an example segment (using default params, whisper small); adding `--highlight_words True` highlights word timings in the subtitle output:
```bash
whisperx path/to/audio.wav
```
Result using _WhisperX_ with forced alignment to wav2vec2.0 large:
For increased timestamp accuracy, at the cost of higher GPU memory, use bigger models (a bigger alignment model was not found to be that helpful; see paper), e.g.:
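A sketch of such a run — `WAV2VEC2_ASR_LARGE_LV60K_960H` is one of the torchaudio alignment pipelines, and the batch size here is illustrative:

```bash
# Larger ASR model plus a larger wav2vec2 alignment model (uses more GPU memory)
whisperx path/to/audio.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4
```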
To run on CPU instead of GPU (and for running on Mac OS X):
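A minimal sketch of a CPU-only run, using the `--device` option and the same `--compute_type int8` flag mentioned in the memory-saving tips further down:

```bash
# CPU-only inference; int8 compute avoids float16, which is not supported on CPU
whisperx path/to/audio.wav --device cpu --compute_type int8
```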
### Other languages
The phoneme ASR alignment model is _language-specific_, for tested languages these models are [automatically picked from torchaudio pipelines or huggingface](https://github.com/m-bain/whisperX/blob/f2da2f858e99e4211fe4f64b5f2938b007827e17/whisperx/alignment.py#L24-L58).
Just pass in the `--language` code, and use the whisper `--model large`.
Currently, default models are provided for `{en, fr, de, es, it}` via torchaudio pipelines, and for many other languages via Hugging Face. Please find the list of currently supported languages under `DEFAULT_ALIGN_MODELS_HF` in [alignment.py](https://github.com/m-bain/whisperX/blob/main/whisperx/alignment.py). If the detected language is not in this list, you need to find a phoneme-based ASR model from the [huggingface model hub](https://huggingface.co/models) and test it on your data.
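As a sketch, an alignment model can also be passed explicitly via `--align_model`; the language code and model ID below are placeholders showing the expected format, not a recommendation:

```bash
# Override the default alignment model with a wav2vec2 checkpoint from the Hugging Face hub
whisperx path/to/audio.wav --model large-v2 --language ja --align_model "NAME/wav2vec2-checkpoint-for-that-language"
```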
#### E.g. German
```bash
whisperx --model large-v2 --language de path/to/audio.wav
```
For specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint [paper](https://www.robots.ox.ac.uk/~vgg/publications/2023/Bain23/bain23.pdf).
To reduce GPU memory requirements, try any of the following (2. & 3. can affect quality):
1. reduce batch size, e.g. `--batch_size 4`
2. use a smaller ASR model `--model base`
3. use lighter compute type `--compute_type int8`
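For example, combining all three options (the audio path is a placeholder; the flags are those listed above):

```bash
# Lowest-memory configuration: small batch, base model, int8 compute
whisperx path/to/audio.wav --model base --batch_size 4 --compute_type int8
```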
Transcription differences from OpenAI's whisper:
1. Transcription without timestamps. To enable single-pass batching, whisper inference is performed with `--without_timestamps True`; this ensures one forward pass per sample in the batch. However, this can cause discrepancies with the default whisper output.
2. VAD-based segment transcription, unlike the buffered transcription of OpenAI's. In the WhisperX paper we show this reduces WER and enables accurate batched inference.
3. `--condition_on_prev_text` is set to `False` by default (reduces hallucination)
If you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good, send a pull request and some examples showing its success.
Bug finding and pull requests are also highly appreciated to keep this project going.
<h2 align="left" id="coming-soon">TODO 🗓</h2>
- [x] Multilingual init
- [x] Automatic align model selection based on language detection
- [x] Python usage
- [x] Incorporating speaker diarization
- [x] Model flush, for low gpu mem resources
- [x] Faster-whisper backend
- [x] Add max-line etc. see (openai's whisper utils.py)
- [x] Sentence-level segments (nltk toolbox)
- [x] Improve alignment logic
- [ ] update examples with diarization and word highlighting
- [ ] Subtitle .ass output <- bring this back (removed in v3)
- [ ] Add benchmarking code (TEDLIUM for spd/WER & word segmentation)
- [x] Allow silero-vad as alternative VAD option
- [ ] Improve diarization (word level). _Harder than first thought..._
<a href="https://www.buymeacoffee.com/maxhbain" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-orange.png" alt="Buy Me A Coffee" height="41" width="174"></a>
<h2 align="left" id="acks">Acknowledgements 🙏</h2>
This work, and my PhD, is supported by the [VGG (Visual Geometry Group)](https://www.robots.ox.ac.uk/~vgg/) and the University of Oxford.
Of course, this builds on [OpenAI's whisper](https://github.com/openai/whisper).
Borrows important alignment code from [PyTorch tutorial on forced alignment](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html)
And uses the wonderful pyannote VAD / Diarization https://github.com/pyannote/pyannote-audio