SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation
Le Shen*, Qian Qiao*, Tan Yu*, Ke Zhou, Tianhang Yu, Yu Zhan, Zhenjie Wang, Dingcheng Zhen, Ming Tao, Shunshun Yin, Siyuan Liu ✉
*Equal Contribution ✉Corresponding Author
🔥 News
- 2026.02.12 - We released SoulX-FlashHead, a streaming talking-head project that achieves real-time performance on consumer GPUs (e.g., RTX 4090/5090).
- 2026.01.08 - We released the inference code and the model weights.
- 2025.12.30 - We released the SoulX-FlashTalk project page.
- 2025.12.30 - We released the SoulX-FlashTalk technical report on arXiv and the GitHub repository.
🤫 Coming soon
A 4-GPU real-time version of SoulX-FlashTalk.
📑 Todo List
- Technical report
- Project Page
- Inference code
- Checkpoint release
- Online demo
📢 Live Streaming & Video Podcast
🎬 Online Demos
🌰 Examples
📖 Quickstart
🔧 Installation
1. Create a Conda environment
```bash
conda create -n flashtalk python=3.10
conda activate flashtalk
```
2. Install PyTorch on CUDA
```bash
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu128
```
3. Install other dependencies
```bash
pip install -r requirements.txt
```
4. Install Flash-Attention
```bash
pip install ninja
pip install flash_attn==2.8.0.post2 --no-build-isolation
```
5. Install FFmpeg
```bash
# Ubuntu / Debian
apt-get install ffmpeg
# CentOS / RHEL
yum install ffmpeg ffmpeg-devel
```
or
```bash
# Conda (no root required)
conda install -c conda-forge ffmpeg==7
```
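After completing the steps above, a quick sanity check can confirm that the key prerequisites are visible to Python. This is a minimal sketch, not part of the official repository; the package names are taken from the installation steps above.

```python
import importlib.util
import shutil
import sys

def check_environment():
    """Report whether the prerequisites from the installation steps are visible.

    Returns a dict mapping each requirement to a boolean; nothing is
    installed or modified by this check.
    """
    report = {
        # The Conda environment above targets Python 3.10.
        "python_ok": sys.version_info >= (3, 10),
        # FFmpeg must be on PATH for audio/video processing.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }
    # Packages installed in steps 2-4.
    for pkg in ("torch", "torchvision", "flash_attn"):
        report[pkg] = importlib.util.find_spec(pkg) is not None
    return report

print(check_environment())
```

Any `False` entry points at the installation step that still needs attention.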
🤗 Model download
| Model Component | Description | Link |
|---|---|---|
| SoulX-FlashTalk-14B | Our 14B model | 🤗 Huggingface |
| chinese-wav2vec2-base | Chinese wav2vec2 base audio encoder | 🤗 Huggingface |
```bash
# If you are in mainland China, run this first:
export HF_ENDPOINT=https://hf-mirror.com

pip install "huggingface_hub[cli]"
huggingface-cli download Soul-AILab/SoulX-FlashTalk-14B --local-dir ./models/SoulX-FlashTalk-14B
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./models/chinese-wav2vec2-base
```
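The same downloads can be scripted from Python. This is a hypothetical convenience wrapper, not part of the repository; it uses `huggingface_hub.snapshot_download` with the repo IDs and local directories from the CLI commands above.

```python
# Repo IDs and target directories, mirroring the CLI commands above.
MODEL_REPOS = {
    "Soul-AILab/SoulX-FlashTalk-14B": "./models/SoulX-FlashTalk-14B",
    "TencentGameMate/chinese-wav2vec2-base": "./models/chinese-wav2vec2-base",
}

def download_all(repos=MODEL_REPOS):
    """Download each model snapshot into its local directory."""
    # Imported lazily so the mapping can be inspected without
    # huggingface_hub installed.
    from huggingface_hub import snapshot_download
    for repo_id, local_dir in repos.items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_all()` fetches both models; set the `HF_ENDPOINT` environment variable first if you need the mirror.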
🚀 Inference
```bash
# Infer on a single GPU
# Requires more than 64 GB of VRAM; use --cpu_offload to reduce VRAM usage to 40 GB.
bash inference_script_single_gpu.sh

# Infer on multiple GPUs
# Real-time inference speed requires 8x H800 GPUs or better.
bash inference_script_multi_gpu.sh
```
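The VRAM figures above determine which script fits your hardware. As a small illustrative helper (a sketch, not part of the repository), the choice can be expressed as:

```python
def fits_single_gpu(vram_gb, cpu_offload=False):
    """Check whether single-GPU inference fits in the given VRAM.

    Thresholds come from the inference notes above: >64 GB normally,
    reduced to 40 GB with --cpu_offload.
    """
    required_gb = 40 if cpu_offload else 64
    return vram_gb >= required_gb
```

For example, a 48 GB card fails the default check but passes with CPU offloading enabled; real-time speed still requires the multi-GPU setup.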
👋 Online Demo
Coming Soon!
📧 Contact Us
If you have questions or feedback about our work, feel free to email le.shen@mail.dhu.edu.cn, qiaoqian@soulapp.cn, yutan@soulapp.cn, zhouke@soulapp.cn, or liusiyuan@soulapp.cn.
Because Group 1 has reached capacity, we have opened a new WeChat group. We also warmly welcome everyone to download SoulApp and join our Soul group for further technical discussions and updates!
Join WeChat Group (加入微信技术群) |
Download SoulApp & Join Group (下载SoulApp加入群组) |
📚 Citation
If you find our work useful in your research, please consider citing:
```bibtex
@misc{shen2025soulxflashtalktechnicalreport,
  title={SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation},
  author={Le Shen and Qian Qiao and Tan Yu and Ke Zhou and Tianhang Yu and Yu Zhan and Zhenjie Wang and Ming Tao and Shunshun Yin and Siyuan Liu},
  year={2025},
  eprint={2512.23379},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.23379},
}
```
🙇 Acknowledgement
- InfiniteTalk and Wan: the base models we built upon.
- Self-Forcing: the codebase we built upon.
- DMD and Self-Forcing++: the key distillation techniques used by our method.
> [!TIP]
> If you find our work useful, please also consider starring the original repositories of these foundational methods.