docker-whisperX

This is the docker image for WhisperX: Automatic Speech Recognition with Word-Level Timestamps (and Speaker Diarization) from the community.

The objective of this project is to efficiently manage the continuous integration Docker build workflow on the GitHub free runner on a weekly basis. This includes building 175 Docker images in parallel, each 10GB in size. To keep the workflow running smoothly, I have focused on using Docker layer caches efficiently, maximizing layer reuse, carefully managing cache read/write order to prevent issues, and minimizing image size and build time.

Additionally, as a matter of personal preference, I strive to follow best practices, industry standards, and policies to the best of my ability.

Get the Dockerfile at GitHub, or pull the image from ghcr.io.

🚀 Get your Docker ready for GPU support

Windows

Once you have installed Docker Desktop, CUDA Toolkit, NVIDIA Windows Driver, and ensured that your Docker is running with WSL2, you are ready to go.

Here is the official documentation for further reference.
https://docs.nvidia.com/cuda/wsl-user-guide/index.html#nvidia-compute-software-support-on-wsl-2
https://docs.docker.com/desktop/wsl/use-wsl/#gpu-support

Linux, OSX

Install an NVIDIA GPU Driver if you do not already have one installed.
https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html

Install the NVIDIA Container Toolkit with this guide.
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
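Before pulling a 10GB image, it can help to confirm that the basic tools are in place. The following is a minimal sketch, not part of this project: the helper name `have` is an illustrative choice, and the check only looks at what is on `PATH`, so it does not prove that the container runtime can actually reach the GPU.

```shell
# Report whether a command is available on PATH.
# "have" is a hypothetical helper name used only in this sketch.
have() { command -v "$1" >/dev/null 2>&1; }

for tool in docker nvidia-smi; do
    if have "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# If both tools are present, a container-level smoke test modelled on the
# sample workload in the NVIDIA Container Toolkit documentation is:
#   docker run --rm --gpus all ubuntu nvidia-smi
```

If the smoke test prints a GPU table from inside the container, the `--gpus all` flag used throughout this README should work as well.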

[!TIP]
I have a Chinese blog about this topic:
Podman GPU Configuration Notes for Fedora/RHEL

📦 Available Pre-built Image

[!NOTE]
The WhisperX code base in these images aligns with the git submodule commit hash.
A scheduled CI workflow runs weekly against the main branch and rebuilds all Docker images.

docker run --gpus all -it -v ".:/app" ghcr.io/jim60105/whisperx:base-en     -- --output_format srt audio.mp3
docker run --gpus all -it -v ".:/app" ghcr.io/jim60105/whisperx:large-v3-ja -- --output_format srt audio.mp3
docker run --gpus all -it -v ".:/app" ghcr.io/jim60105/whisperx:no_model    -- --model tiny --language en --output_format srt audio.mp3

The image tags are formatted as WHISPER_MODEL-LANG, for example, tiny-en, base-de or large-v3-zh.
Please be aware that the whisper models *.en, large-v1, large-v2 have been excluded as I believe they are not frequently used. If you require these models, please refer to the following section to build them on your own.

You can find the actual build matrix in 04-build-matrix-images.yml and all available tags at ghcr.io.

In addition, there is also a no_model tag that does not include any pre-downloaded models, also referred to as latest.
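The tag scheme can be sketched as a small shell function for use in scripts. This is only an illustration: `REGISTRY_IMAGE` and `image_ref` are hypothetical names, and composing a tag does not guarantee that the combination is published, so check the build matrix or ghcr.io before relying on it.

```shell
# Compose an image reference from a model name and a language code.
# REGISTRY_IMAGE and image_ref are illustrative names, not part of this project.
REGISTRY_IMAGE="ghcr.io/jim60105/whisperx"

image_ref() {
    model="$1"
    lang="$2"
    if [ -z "$model" ]; then
        # No model requested: use the tag that ships without bundled models.
        printf '%s:no_model\n' "$REGISTRY_IMAGE"
    else
        # Tags follow the WHISPER_MODEL-LANG pattern.
        printf '%s:%s-%s\n' "$REGISTRY_IMAGE" "$model" "$lang"
    fi
}

image_ref large-v3 ja   # prints ghcr.io/jim60105/whisperx:large-v3-ja
image_ref tiny en       # prints ghcr.io/jim60105/whisperx:tiny-en
image_ref               # prints ghcr.io/jim60105/whisperx:no_model
```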

`distil-large-v3-en` model

A distilled variant of large-v3 published by HuggingFace as distil-whisper/distil-large-v3. At 756M parameters it is roughly half the size of openai large-v3 (1550M, ~51% smaller) and 6.3× faster in relative latency while still landing within 1% WER of large-v3 on long-form audio under both sequential and chunked transcription algorithms — a strong default for English-only batch workloads.

Only the English (en) language pairing is published since distil-whisper models are English-only by design. Pull it with ghcr.io/jim60105/whisperx:distil-large-v3-en.

`breeze-asr-26-zh` model

A Taiwanese Hokkien (Taigi / 台語) ASR model published by MediaTek Research as MediaTek-Research/Breeze-ASR-26 and re-packaged for faster-whisper runtime by paulpengtw/faster-whisper-Breeze-ASR-26. The model is fine-tuned from Whisper on ~10,000 hours of synthetic Taigi speech (including Taigi/Mandarin code-switching) and transcribes spoken Taigi into Mandarin Chinese characters, leveraging the substantial lexical overlap between the two languages for a pragmatic, reproducible benchmarking workflow.

Because the output script is Mandarin, this image is shipped under the zh language pairing. Pull it with ghcr.io/jim60105/whisperx:breeze-asr-26-zh.

Note: when transcribing genuine Taigi (台語) audio, phoneme-level alignment will not work — the bundled zh wav2vec2 alignment model is trained on Mandarin phonology and cannot reliably align Taigi pronunciations against the model's Mandarin-character output. Pass --no_align to skip the alignment pass for Taigi input, e.g. docker run ... ghcr.io/jim60105/whisperx:breeze-asr-26-zh -- --no_align audio.mp3.

⚡️ Preserve the download cache for the align models when working with various languages

Mount a volume at /.cache to share the alignment models between containers.
Use the no_model (latest) tag for this scenario.

docker run --gpus all -it -v ".:/app" -v whisper_cache:/.cache ghcr.io/jim60105/whisperx:latest -- --model large-v3 --language en --output_format srt audio.mp3
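When several languages are processed in turn, the same named volume can be reused for every run. The sketch below only prints the commands so they can be reviewed before running; `run_cmd` and the audio file names are hypothetical, while the volume name, image tag, and flags are copied from the command above.

```shell
# Print (do not run) a docker command for one language and one audio file.
# run_cmd is a hypothetical helper; every call mounts the same whisper_cache volume.
run_cmd() {
    # $1 = language code, $2 = audio file in the current directory
    printf 'docker run --gpus all -it -v ".:/app" -v whisper_cache:/.cache ghcr.io/jim60105/whisperx:latest -- --model large-v3 --language %s --output_format srt %s\n' "$1" "$2"
}

# Example file names, assumed for illustration only.
run_cmd en interview_en.mp3
run_cmd ja interview_ja.mp3
run_cmd de interview_de.mp3
```

Because every command mounts `whisper_cache` at `/.cache`, an alignment model downloaded during one run should still be available to later containers that need the same language.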

🛠️ Building the Docker Image

[!IMPORTANT]
Clone the Git repository recursively to include submodules:
git clone --recursive https://github.com/jim60105/docker-whisperX.git

Build Arguments

The Dockerfile builds an image with the models bundled. It accepts two build arguments: LANG and WHISPER_MODEL.

LANG: The language to transcribe. The default is en. See supported languages in load_align_model.py.
WHISPER_MODEL: The model name. The default is base. See faster-whisper for supported models.

If you need alignment models for multiple languages, pass a space-separated list of languages when building the image, for example "LANG=pl fr en". Note that WhisperX does not handle multiple languages within the same audio file well. Even if you do not provide the language parameter, it will still detect the language (or fall back to en) and use it to choose the alignment model. Alignment models are language-specific. This option simply embeds multiple alignment models into one Docker image.
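The space-separated list has to reach Docker as a single build argument, which means quoting it on the command line. The sketch below prints such a command rather than running it; `build_cmd` and the `-multi` tag suffix are illustrative choices, not conventions of this repository.

```shell
# Print a docker build command that embeds several alignment models.
# build_cmd and the "-multi" tag suffix are hypothetical, chosen for this sketch.
build_cmd() {
    model="$1"
    shift
    # "$*" joins the remaining arguments with spaces, e.g. "pl fr en".
    langs="$*"
    # The quotes around the language list keep it as one --build-arg value.
    printf 'docker build --build-arg LANG="%s" --build-arg WHISPER_MODEL=%s -t whisperx:%s-multi .\n' "$langs" "$model" "$model"
}

build_cmd large-v3 pl fr en
# prints: docker build --build-arg LANG="pl fr en" --build-arg WHISPER_MODEL=large-v3 -t whisperx:large-v3-multi .
```

Without the quotes, the shell would pass `fr` and `en` to `docker build` as separate arguments instead of as part of the LANG value.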

Build Command

For example, if you want to build the image with en language and large-v3 model:

docker build --build-arg LANG=en --build-arg WHISPER_MODEL=large-v3 -t whisperx:large-v3-en .

If you want to build the image without any pre-downloaded models:

docker build --target no_model -t whisperx:no_model .

If you want to build all images at once, we have a Docker bake file available:

docker buildx bake build no_model

Usage Command

Mount the current directory as /app and run WhisperX with additional input arguments:

docker run --gpus all -it -v ".:/app" whisperx:large-v3-en -- --output_format srt audio.mp3

[!NOTE]
Remember to prepend -- before the WhisperX arguments.
The --model and --language arguments are already set in the Dockerfile, so there is no need to specify them.

📝 LICENSE

The main program, WhisperX, is distributed under the BSD-4 license.
Please consult their repository for access to the source code and license.

The Dockerfile and CI workflow files in this repository are licensed under the MIT license.
