
DiariZen

DiariZen is a speaker diarization toolkit driven by AudioZen and Pyannote 3.1.

Installation

# create virtual python environment
conda create --name diarizen python=3.10
conda activate diarizen

# install diarizen 
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=12.1 "mkl<2024.1" -c pytorch -c nvidia -c defaults
pip install -r requirements.txt && pip install -e .

# install pyannote-audio
cd pyannote-audio && pip install -e .[dev,testing]

# install dscore
git submodule init
git submodule update

Usage

  • For model training, see recipes/diar_ssl/run_stage.sh.
  • For model pruning, see recipes/diar_ssl_pruning/run_stage.sh.
  • For inference, our models are available on Hugging Face 🤗. See below:
from diarizen.pipelines.inference import DiariZenPipeline

# load pre-trained model
diar_pipeline = DiariZenPipeline.from_pretrained("BUT-FIT/diarizen-wavlm-large-s80-md")
# apply diarization pipeline
diar_results = diar_pipeline('./example/EN2002a_30s.wav')

# print results
for turn, _, speaker in diar_results.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
# start=0.0s stop=2.7s speaker_0
# start=0.8s stop=13.6s speaker_3
# start=5.8s stop=6.4s speaker_0
# ...

# load pre-trained model and save RTTM result
diar_pipeline = DiariZenPipeline.from_pretrained(
    "BUT-FIT/diarizen-wavlm-large-s80-md",
    rttm_out_dir='.'
)
# apply diarization pipeline
diar_results = diar_pipeline('./example/EN2002a_30s.wav', sess_name='EN2002a')
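When `rttm_out_dir` is set, results are written in the standard RTTM format (one `SPEAKER` line per speech segment). As a minimal illustrative sketch — the helper below is ours, not part of the DiariZen API — such a file can be parsed like this:

```python
def parse_rttm_line(line: str):
    """Parse one SPEAKER line of an RTTM file into (session, start, end, speaker).

    RTTM field layout: TYPE FILE CHAN ONSET DUR <NA> <NA> SPEAKER <NA> <NA>
    """
    fields = line.split()
    assert fields[0] == "SPEAKER", "only SPEAKER lines carry diarization segments"
    session = fields[1]
    onset = float(fields[3])      # segment start time in seconds
    duration = float(fields[4])   # segment duration in seconds
    speaker = fields[7]
    return session, onset, onset + duration, speaker

# hypothetical line as the pipeline might emit for the example above
line = "SPEAKER EN2002a 1 0.031 2.726 <NA> <NA> speaker_0 <NA> <NA>"
session, start, end, speaker = parse_rttm_line(line)
print(session, round(start, 3), round(end, 3), speaker)
```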

Benchmark

We train DiariZen models on a compound dataset composed of the datasets listed in the table below, followed by structured pruning to remove redundant parameters. For the results below:

  • AISHELL-4 was converted to mono using sox in.wav -c 1 out.wav.
  • NOTSOFAR-1 contains only single-channel recordings, e.g. sc_plaza_0, sc_rockfall_0.
  • Diarization Error Rate (DER) is evaluated without applying a collar.
  • No domain adaptation is applied to any individual dataset.
  • All experiments use the same clustering hyperparameters across datasets.
| Dataset | Pyannote v3.1 | DiariZen-Base-s80 | DiariZen-Large-s80 | DiariZen-Large-s80-v2 |
| --- | --- | --- | --- | --- |
| AMI-SDM | 22.4 | 15.8 | 14.0 | 13.9 |
| AISHELL-4 | 12.2 | 10.7 | 9.8 | 10.1 |
| AliMeeting far | 24.4 | 14.1 | 12.5 | 10.8 |
| NOTSOFAR-1 | - | 20.3 | 17.9 | 16.7 |
| MSDWild | 25.3 | 17.4 | 15.6 | 15.8 |
| DIHARD3 full | 21.7 | 15.9 | 14.5 | 14.5 |
| RAMC | 22.2 | 11.4 | 11.0 | 11.0 |
| VoxConverse | 11.3 | 9.7 | 9.2 | 9.1 |
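DER, as reported above, is the sum of missed speech, false alarm, and speaker confusion time, divided by the total reference speech time, here with no forgiveness collar. A small sketch of the metric (the function and variable names are ours, for illustration only):

```python
def der(missed: float, false_alarm: float, confusion: float, total_speech: float) -> float:
    """Diarization Error Rate as a fraction: (miss + FA + confusion) / total reference speech."""
    return (missed + false_alarm + confusion) / total_speech

# e.g. 3 s missed + 2 s false alarm + 1 s speaker confusion over 50 s of speech
print(f"{der(3.0, 2.0, 1.0, 50.0):.1%}")  # 12.0%
```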

Updates

2026-01-31: Updated multi-channel WavLM support for speaker diarization.

2025-12-09: Updated benchmarks with DiariZen-Large-s80-v2.

2025-06-03: Uploaded structured pruning recipes, released new pre-trained models, and updated multiple benchmark results.

Citations

If you find this work helpful, please consider citing:

@inproceedings{han2025leveraging,
  title={Leveraging self-supervised learning for speaker diarization},
  author={Han, Jiangyu and Landini, Federico and Rohdin, Johan and Silnova, Anna and Diez, Mireia and Burget, Luk{\'a}{\v{s}}},
  booktitle={Proc. ICASSP},
  year={2025}
}

@article{han2025fine,
  title={Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization},
  author={Han, Jiangyu and Landini, Federico and Rohdin, Johan and Silnova, Anna and Diez, Mireia and Cernocky, Jan and Burget, Lukas},
  journal={arXiv preprint arXiv:2505.24111},
  year={2025}
}

@article{han2025efficient,
  title={Efficient and Generalizable Speaker Diarization via Structured Pruning of Self-Supervised Models},
  author={Han, Jiangyu and P{\'a}lka, Petr and Delcroix, Marc and Landini, Federico and Rohdin, Johan and Cernock{\`y}, Jan and Burget, Luk{\'a}{\v{s}}},
  journal={arXiv preprint arXiv:2506.18623},
  year={2025}
}

License

  • The code in this repository is licensed under the MIT license.
  • The pre-trained model weights are released under the CC BY-NC 4.0 license. By downloading these weights, you agree to the non-commercial terms and the compliance details outlined below.

Important Compliance Note

The CC BY-NC 4.0 license for model weights ensures compliance with the most restrictive training datasets (e.g., RAMC, MSDWild, and DIHARD-3) that explicitly prohibit commercial use.

Regarding License Compatibility: We acknowledge that certain datasets in our training mixture (e.g., AISHELL-4, AliMeeting) carry a CC BY-SA 4.0 (ShareAlike) license. Under the assumption that model weights could be viewed as "derivative works," a logical conflict arises between the "Must allow commercial use" (SA) and "Must forbid commercial use" (NC) requirements.

To prioritize the protection of non-commercial data providers and to prevent unauthorized commercial exploitation, we have applied the NC (Non-Commercial) restriction to this specific release. By downloading these weights, you agree:

  • To use them for research and academic purposes only.
  • That this release does not grant a commercial license, even for the portions of the model potentially influenced by SA-licensed data.
  • That you assume all legal risks if you attempt to use these weights in a commercial product.

Contact

If you have any comments or questions, please contact ihan@fit.vut.cz.
