sarvamai/pyannote-audio
forked from pyannote/pyannote-audio
Description: Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
License: MIT
Stars: 1
Forks: 1
Open issues: 0
Created: 2025-06-21T15:19:14Z
Pushed: 2025-06-23T19:55:12Z
Default branch: main
Fork: yes
Parent repository: pyannote/pyannote-audio
Archived: no
README: Using pyannote.audio open-source toolkit in production? Consider switching to pyannoteAI for better and faster options.
pyannote.audio speaker diarization toolkit
pyannote.audio is an open-source toolkit written in Python for speaker diarization. Based on the [PyTorch](pytorch.org) machine learning framework, it comes with state-of-the-art pretrained models and pipelines that can be further fine-tuned on your own data for even better performance.
TL;DR
1. Install `pyannote.audio` with `pip install pyannote.audio`
2. Accept `pyannote/segmentation-3.0` user conditions
3. Accept `pyannote/speaker-diarization-3.1` user conditions
4. Create access token at `hf.co/settings/tokens`.
```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# send pipeline to GPU (only when one is available)
import torch
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))

# apply pretrained pipeline
diarization = pipeline("audio.wav")

# print the result
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
# start=0.2s stop=1.5s speaker_0
# start=1.8s stop=3.9s speaker_1
# start=4.2s stop=5.7s speaker_0
# ...
```
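A common next step is to aggregate the printed turns into total speaking time per speaker. The following is a minimal sketch, not part of the toolkit: `speaking_time` is a hypothetical helper, and the tuples are copied from the sample output above so that the snippet runs without pyannote.audio installed. With a real result, the same tuples could presumably be built from `(turn.start, turn.end, speaker)` inside the `itertracks` loop.

```python
from collections import defaultdict

# Turns copied from the sample output: (start, end, speaker label).
# These are illustrative values, not the result of running a model.
turns = [
    (0.2, 1.5, "speaker_0"),
    (1.8, 3.9, "speaker_1"),
    (4.2, 5.7, "speaker_0"),
]


def speaking_time(turns):
    """Sum turn durations (in seconds) per speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


totals = speaking_time(turns)
for speaker, seconds in sorted(totals.items()):
    print(f"{speaker}: {seconds:.1f}s")
```

For the sample turns this gives roughly 2.8 s for `speaker_0` and 2.1 s for `speaker_1`.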
Highlights
- :hugs: pretrained pipelines (and models) on :hugs: model hub
- :exploding_head: state-of-the-art performance (see [Benchmark](#benchmark))
- :snake: Python-first API
- :zap: multi-GPU training with pytorch-lightning
Documentation
- [Changelog](CHANGELOG.md)
- [Frequently asked questions](FAQ.md)
- Models
  - Available tasks explained
  - [Applying a pretrained model](tutorials/applying_a_model.ipynb)
  - [Training, fine-tuning, and transfer learning](tutorials/training_a_model.ipynb)
- Pipelines
  - Available pipelines explained
  - [Applying a pretrained pipeline](tutorials/applying_a_pipeline.ipynb)
  - [Adapting a pretrained pipeline to your own data](tutorials/adapting_pretrained_pipeline.ipynb)
  - [Training a pipeline](tutorials/voice_activity_detection.ipynb)
- Contributing
  - [Adding a new model](tutorials/add_your_own_model.ipynb)
  - [Adding a new task](tutorials/add_your_own_task.ipynb)
  - Adding a new pipeline
  - Sharing pretrained models and pipelines
- Blog
  - 2022-12-02 > ["How I reached 1st place at Ego4D 2022, 1st place at Albayzin 2022, and 6th place at VoxSRC 2022 speaker diarization challenges"](tutorials/adapting_pretrained_pipeline.ipynb)
  - 2022-10-23 > "One speaker segmentation model to rule them all"
  - 2021-08-05 > "Streaming voice activity detection with pyannote.audio"
- Videos
  - Introduction to speaker diarization / JSALT 2023 summer school / 90 min
  - Speaker segmentation model / Interspeech 2021 / 3 min
  - First release of pyannote.audio / ICASSP 2020 / 8 min
- Community contributions (not maintained by the core team)
  - 2024-04-05 > [Offline speaker diarization (speaker-diarization-3.1)](tutorials/community/offline_usage_speaker_diarization.ipynb) by Simon Ottenhaus
Benchmark
Out of the box, the pyannote.audio speaker diarization pipeline v3.1 is expected to be much better (and faster) than v2.x. The numbers below are diarization error rates (in %), so lower is better:
| Benchmark              | v2.1 | v3.1 | pyannoteAI |
| ---------------------- | ---- | ---- | ---------- |
| AISHELL-4              | 14.1 | 12.2 | 11.9       |
| AliMeeting (channel 1) | 27.4 | 24.4 | 22.5       |
| AMI (IHM)              | 18.9 | 18.8 | 16.6       |
| AMI (SDM)              | 27.1 | 22.4 | 20.9       |
| AVA-AVD                | 66.3 | 50.0 | 39.8       |
| CALLHOME (part 2)      | 31.6 | 28.4 | 22.2       |
| DIHARD 3 (full)        | 26.9 | 21.7 | 17.2       |
| Earnings21             | 17.0 | 9.4  | 9.0        |
| Ego4D (dev.)           | 61.5 | 51.2 | 43.8       |
| MSDWild                | 32.8 | 25.3 | 19.8       |
| RAMC                   | 22.5 | 22.2 | 18.4       |
| REPERE (phase2)        | 8.2  | 7.8  | 7.6        |
| VoxConverse (v0.3)     | 11.2 | 11.3 | 9.4        |
Diarization error rate (in %)
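To make the v2.1 to v3.1 comparison concrete, the sketch below computes the relative reduction in diarization error rate for each benchmark. The numbers are copied from the table above. `relative_reduction` is a hypothetical helper introduced for illustration, and "relative reduction" is just one reasonable way to summarize the gap, not a metric the README itself reports.

```python
# (v2.1, v3.1) diarization error rates in %, copied from the benchmark table.
der = {
    "AISHELL-4": (14.1, 12.2),
    "AliMeeting (channel 1)": (27.4, 24.4),
    "AMI (IHM)": (18.9, 18.8),
    "AMI (SDM)": (27.1, 22.4),
    "AVA-AVD": (66.3, 50.0),
    "CALLHOME (part 2)": (31.6, 28.4),
    "DIHARD 3 (full)": (26.9, 21.7),
    "Earnings21": (17.0, 9.4),
    "Ego4D (dev.)": (61.5, 51.2),
    "MSDWild": (32.8, 25.3),
    "RAMC": (22.5, 22.2),
    "REPERE (phase2)": (8.2, 7.8),
    "VoxConverse (v0.3)": (11.2, 11.3),
}


def relative_reduction(old, new):
    """Relative error reduction in %, positive when `new` is lower (better)."""
    return 100.0 * (old - new) / old


reductions = {name: relative_reduction(v2, v3) for name, (v2, v3) in der.items()}

# Benchmarks where v3.1 has a higher error rate than v2.1.
regressions = [name for name, r in reductions.items() if r < 0]

# Print benchmarks from largest to smallest relative improvement.
for name, r in sorted(reductions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {r:+6.1f}%")
```

On these figures the largest relative gain is on Earnings21 (about 45%), while VoxConverse (v0.3) is the only benchmark where v3.1 is marginally worse than v2.1.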
Citations
If you use pyannote.audio, please use the following citations:
@inproceedings{Plaquet23,
author={Alexis Plaquet and Hervé Bredin},
title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
}
@inproceedings{Bredin23,
author={Hervé…
Excerpt shown; open the source for the full document.
Notability: 2.0/10 (routine fork, low traction)