
Awesome TTS & Voice Generation Models

A curated list of open-source Text-to-Speech (TTS), voice cloning, and music generation models. Models are sorted by release date (newest first).

Text-to-Speech (TTS) Models

TTS Quick Comparison

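| Model | Voice Cloning | ASR | Languages | Streaming | License |
| --- | --- | --- | --- | --- | --- |
| LongCat-AudioDiT | | | Zh/En | | MIT |
| VoxCPM2 | | | 30 | | Apache-2.0 |
| MOSS-TTS-Nano | | | 20 | | Apache-2.0 |
| T5Gemma-TTS | | | En/Zh/Jp | | MIT |
| TinyTTS | | | En | | Apache-2.0 |
| LEMAS-TTS | | | 10 | | Apache-2.0 |
| OmniVoice | | | 600+ | | Apache-2.0 |
| LongCat-Next | | | Zh/En | | MIT |
| Voxtral-4B-TTS | | | 9 | | CC BY-NC 4.0 |
| Irodori-TTS-500M-v2 | | | Jp | | MIT |
| Fish Audio S2 Pro | | | 80+ | | Research License |
| KittenTTS | | | En+ | | Apache-2.0 |
| MOSS-TTS | | | 20 | | Apache-2.0 |
| SoulX-Singer | ✅ (Singing) | | Zh/En/Canto | | Apache-2.0 |
| SoproTTS | | | En | | Apache-2.0 |
| NeuTTS | | | En/Es/De/Fr | | Apache-2.0 |
| Qwen3-TTS | | | 10 | | Apache-2.0 |
| GLM-TTS | | | Zh/En | | Apache-2.0 |
| VibeVoice-Realtime | | | Multi | | MIT |
| Fun-CosyVoice 3.0 | | | 9 + 18 dialects | | Apache-2.0 |
| MioTTS-2.6B | | | En/Jp | | LFM |
| Supertonic 2 | | | 5 | | OpenRAIL-M |
| KugelAudio | | | 23 EU | | MIT |
| Kokoro-82M | | | 8 (54 voices) | | Apache-2.0 |
| KokoClone | | | 7 | | Apache-2.0 |
| IndexTTS2 | | | Zh/En | | Apache-2.0 |
| Maya1 | | | En | | Apache-2.0 |
| LFM2-Audio-1.5B | | | En | | LFM |
| Step-Audio-EditX | | | Zh/En/Jp/Ko | | Apache-2.0 |
| FireRedTTS2 | | | 7 langs | | Apache-2.0 |
| VoxCPM | | | Zh/En | | Apache-2.0 |
| LuxTTS | | | - | | Apache-2.0 |
| MegaTTS3 | | | Zh/En | | Apache-2.0 |
| Spark-TTS | | | Zh/En | | Apache-2.0 |
| Fish Speech | | | 8 langs | | Apache-2.0 |
| Step-Audio | | | Zh/En/Jp | | Apache-2.0 |
| SoulX-Podcast | | | Zh/En/Canto | | Apache-2.0 |
| Chatterbox | | | 23+ | | MIT |
| Orpheus-TTS | | | Multi | | Apache-2.0 |
| Dia | | | En | | Apache-2.0 |
| VieNeu-TTS | | | Vi | | Apache-2.0 |
| MiMo-Audio | | | Multi | | Apache-2.0 |
| Kimi-Audio | | | Multi | | MIT/Apache-2.0 |
| ZipVoice | | | Zh/En | | Apache-2.0 |

LongCat-AudioDiT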

Description: State-of-the-art diffusion-based TTS model operating directly in waveform latent space. Developed by Meituan's LongCat team, it requires only a Waveform VAE and Diffusion backbone, effectively mitigating compounding errors.

Release Date: March 30, 2026

| Feature | Value |
| --- | --- |
| Parameters | 1B / 3.5B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | Chinese, English |
| Streaming | |
| Sample Rate | 24000 Hz |
| License | MIT |

Key Innovation: Adaptive Projection Guidance (APG) replaces traditional classifier-free guidance for elevated generation quality. Outperforms Seed-TTS on zero-shot voice cloning benchmarks.
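
To make the contrast with classifier-free guidance concrete, here is a minimal sketch in plain Python. It is not LongCat's implementation: the text above only names APG, so the projection rule below is an assumed, simplified reading of projected guidance (the guidance update is split into components parallel and orthogonal to the conditional prediction, and the parallel part is down-weighted by a hypothetical `eta`). The real method may add steps such as norm rescaling or momentum.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cfg(cond: Vector, uncond: Vector, scale: float) -> Vector:
    """Classifier-free guidance: move the conditional prediction further
    away from the unconditional one by a factor of (scale - 1)."""
    return [c + (scale - 1.0) * (c - u) for c, u in zip(cond, uncond)]


def apg(cond: Vector, uncond: Vector, scale: float, eta: float = 0.0) -> Vector:
    """Projected-guidance sketch (assumed form, see lead-in).

    The update (cond - uncond) is decomposed relative to `cond`; the
    parallel component is scaled by `eta`, the orthogonal one is kept.
    """
    diff = [c - u for c, u in zip(cond, uncond)]
    norm_sq = dot(cond, cond)
    coeff = dot(diff, cond) / norm_sq if norm_sq > 0 else 0.0
    parallel = [coeff * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [
        c + (scale - 1.0) * (o + eta * p)
        for c, o, p in zip(cond, orthogonal, parallel)
    ]


cond = [1.0, 0.0]      # toy "conditional" prediction
uncond = [0.5, 0.5]    # toy "unconditional" prediction

print(cfg(cond, uncond, scale=3.0))            # [2.0, -1.0]
print(apg(cond, uncond, scale=3.0, eta=0.0))   # [1.0, -1.0]
```

With `eta=1.0` the sketch reduces to plain classifier-free guidance, which is a quick way to sanity-check it.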

Links: GitHub Hugging Face 1B Hugging Face 3.5B

VoxCPM2

Description: OpenBMB's next-generation tokenizer-free diffusion autoregressive TTS model with 2 billion parameters. Supports 30 languages with automatic detection, voice design from text descriptions, and high-fidelity voice cloning.

Release Date: 2026

| Feature | Value |
| --- | --- |
| Parameters | 2B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | ✅ (Voice Design) |
| Languages | 30 (+ 9 Chinese dialects) |
| Streaming | ✅ (RTF ~0.3) |
| Audio Output | 48 kHz |
| License | Apache-2.0 |

Key Innovation: Tokenizer-free design with LocEnc → TSLM → RALM → LocDiT pipeline. Built-in super-resolution via AudioVAE V2 for 48kHz output.

Links: GitHub Hugging Face Demo

MOSS-TTS-Nano

Description: Ultra-lightweight open-source multilingual speech generation model with only 0.1B parameters. Designed for realtime speech generation that runs directly on CPU without GPU.

Release Date: 2026

| Feature | Value |
| --- | --- |
| Parameters | 0.1B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | 20 |
| Streaming | ✅ (CPU-friendly) |
| Audio Output | 48 kHz Stereo |
| License | Apache-2.0 |

Key Innovation: Pure autoregressive architecture with MOSS-Audio-Tokenizer-Nano. Compresses audio to 12.5 Hz token stream using RVQ with 16 codebooks. Runs on 4-core CPU.
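
As a rough illustration of what a 12.5 Hz, 16-codebook token stream means in practice, the arithmetic can be sketched as below. The frame rate and codebook count come from the text above; the function name and the assumption that every frame carries exactly one code per codebook are illustrative only.

```python
import math

FRAME_RATE_HZ = 12.5   # token frames per second (from the text above)
NUM_CODEBOOKS = 16     # RVQ codebooks per frame (from the text above)


def rvq_code_count(duration_s: float) -> int:
    """Number of discrete codes for a clip, assuming one code per
    codebook per frame (an assumption, not a documented detail)."""
    frames = math.floor(duration_s * FRAME_RATE_HZ)
    return frames * NUM_CODEBOOKS


print(rvq_code_count(10.0))   # 125 frames * 16 codebooks = 2000
print(rvq_code_count(60.0))   # 750 frames * 16 codebooks = 12000
```

Under these assumptions a minute of audio is about 12,000 codes, which helps explain why a 0.1B-parameter autoregressive model can keep up on a CPU.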

Links: GitHub Hugging Face Demo

T5Gemma-TTS

Description: Multilingual TTS model with voice cloning and duration control, built on the T5Gemma encoder-decoder LLM architecture. Supports batch generation for multiple audio variations.

Release Date: 2026

| Feature | Value |
| --- | --- |
| Parameters | 2B-2B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | English, Chinese, Japanese |
| Streaming | |
| VRAM | 7.6-10.6 GB |
| License | MIT |

Key Innovation: PM-RoPE positional encoding with XCodec2 audio codec. Low-VRAM options with CPU offloading. Batch inference efficiency with single encoder pass.

Links: GitHub Hugging Face Demo

TinyTTS

Description: Billed as the smallest English TTS model, with only 1.6 million parameters. An end-to-end neural network that reaches roughly 53x real-time synthesis speed on CPU via ONNX optimization.

Release Date: 2026

| Feature | Value |
| --- | --- |
| Parameters | 1.6M |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | English |
| Streaming | ✅ (~53x RTF) |
| Model Size | ~3.4 MB (ONNX FP16) |
| License | Apache-2.0 |

Key Innovation: Ultra-compact architecture optimized for CPU-only deployment. Multi-platform support via Python and Node.js APIs. Works on laptops, edge devices, and embedded systems.

Links: GitHub Hugging Face Demo

LEMAS-TTS

Description: Part of the LEMAS (Large-scale Extensible Multilingual Audio Suite) project. Zero-shot multilingual TTS with 0.3B parameters supporting 10 languages with word-level precise editing capabilities.

Release Date: 2026

| Feature | Value |
| --- | --- |
| Parameters | 0.3B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | 10 (zh/en/de/fr/es/pt/it/ru/id/vi) |
| Streaming | |
| Special Feature | Word-level editing (LEMAS-Edit) |
| License | Apache-2.0 |

Key Innovation: Built on 150,000+ hours of multilingual speech data with word-level timestamps. Includes LEMAS-Edit for precise word-level speech editing via masked token infilling.

Links: Website Hugging Face TTS Hugging Face Edit

OmniVoice

Description: Massive multilingual zero-shot TTS model scaling to 600+ languages. Uses diffusion language model-style discrete non-autoregressive architecture with single-stage text-to-acoustic mapping.

Release Date: 2026

| Feature | Value |
| --- | --- |
| Parameters | - |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | ✅ (Pinyin/CMU) |
| Emotion Control | ✅ (Voice Design) |
| Languages | 600+ |
| Streaming | |
| Training Data | 581k hours |
| License | Apache-2.0 |

Key Innovation: Simplified single-stage architecture vs conventional two-stage pipelines. Full-codebook random masking strategy with LLM initialization for superior intelligibility. Noise-robust prompt processing.

Links: Website Hugging Face

LongCat-Next

Description: Native multimodal foundation model by Meituan LongCat Team processing text, vision, and audio under a single autoregressive objective. Industrial-strength model with strong speech synthesis and voice cloning.

Release Date: March 2026

| Feature | Value |
| --- | --- |
| Parameters | 3B (MoE A3B) |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | Chinese, English |
| Streaming | ✅ (Low latency) |
| Audio Output | 24 kHz |
| License | MIT |

Key Innovation: Discrete Native Autoregression Paradigm (DiNA) unifying modalities in shared discrete token space. Combines visual understanding, generation, and audio processing in single model.

Links: GitHub Hugging Face

Voxtral-4B-TTS

Description: Frontier, open-weights text-to-speech model developed by Mistral AI. Designed to be fast, instantly adaptable, and produces lifelike speech with natural prosody and emotional range.

Release Date: March 2026

| Feature | Value |
| --- | --- |
| Parameters | 4B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | ✅ (expressive speech) |
| Languages | 9 (En, Fr, Es, De, It, Pt, Nl, Ar, Hi) |
| Streaming | ✅ (RTF 0.103 at concurrency 1) |
| Audio Output | 24 kHz |
| License | CC BY-NC 4.0 |

Links: Hugging Face Demo Blog

Irodori-TTS-500M-v2

Description: Japanese Text-to-Speech model based on Rectified Flow Diffusion Transformer. Features emoji-based style and sound effect control by embedding emojis in input text for expressive speech generation.

Release Date: 2026

| Feature | Value |
| --- | --- |
| Parameters | 500M |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | ✅ (emoji-based) |
| Languages | Japanese |
| Streaming | |
| Output Quality | 48kHz waveform |
| License | MIT |

Key Feature: Emoji annotation control - insert specific emojis into text to control speaking styles, emotions, and sound effects.

Links: Hugging Face GitHub Demo

Fish Audio S2 Pro

Description: Fish Audio S2 Pro is a leading text-to-speech model with fine-grained inline control of prosody and emotion. It combines reinforcement learning alignment with a dual-autoregressive architecture for high-quality speech synthesis.

Release Date: March 10, 2026

| Feature | Value |
| --- | --- |
| Parameters | 5B (4B Slow AR + 400M Fast AR) |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | ✅ (15,000+ tags) |
| Emotion Control | ✅ (fine-grained inline control) |
| Languages | 80+ (Tier 1: En, Zh, Jp) |
| Streaming | ✅ (RTF 0.195, 100ms TTFA) |
| Model Size | ~10 GB (BF16) |
| License | Fish Audio Research License |

Links: GitHub Hugging Face

KittenTTS

Description: KittenTTS is an open-source realistic text-to-speech model designed for lightweight deployment. It is a state-of-the-art TTS model under 25MB with just 15 million parameters, running without GPU on any device.

Release Date: February 24, 2026 (v0.8.1)

| Feature | Value |
| --- | --- |
| Parameters | 15M-80M |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | English, Multiple |
| Streaming | |
| License | Apache-2.0 |

Links: GitHub Hugging Face

MOSS-TTS

Description: MOSS-TTS is a production-grade Text-to-Speech foundation model developed by OpenMOSS Team and MOSI.AI. Features state-of-the-art evaluation performance on Seed-TTS-eval benchmark with zero-shot voice cloning.

Release Date: February 10, 2026

| Feature | Value |
| --- | --- |
| Parameters | 8B (Delay), 1.7B (Local) |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | ✅ (Pinyin/Phoneme-level) |
| Emotion Control | |
| Languages | 20 languages |
| Streaming | |
| Max Duration | 1 hour |
| License | Apache-2.0 |

Links: GitHub Hugging Face Project Page

SoulX-Singer

Description: SoulX-Singer is a high-fidelity, zero-shot singing voice synthesis model for generating realistic singing voices for unseen singers without fine-tuning.

Release Date: February 6, 2026

| Feature | Value |
| --- | --- |
| Parameters | - |
| Zero-shot Voice Cloning | ✅ (Singing) |
| ASR | |
| Pronunciation Control | ✅ (MIDI/F0) |
| Emotion Control | |
| Languages | Mandarin, English, Cantonese |
| Streaming | |
| License | Apache-2.0 |

Links: GitHub Hugging Face arXiv

SoproTTS

Description: SoproTTS is a lightweight English text-to-speech model with zero-shot voice cloning. It uses dilated convolutions (WaveNet-style) and lightweight cross-attention layers instead of the common Transformer architecture.

Release Date: February 4, 2026 (v1.5)

| Feature | Value |
| --- | --- |
| Parameters | 135M |
| Zero-shot Voice Cloning | ✅ (3-12s) |
| ASR | |
| Pronunciation Control | |
| Emotion Control | ✅ (style_strength) |
| Languages | English |
| Streaming | ✅ (250ms TTFA) |
| RTF | 0.05 (CPU M3) |
| Training Cost | ~$100 |
| License | Apache-2.0 |

Links: GitHub Hugging Face

NeuTTS

Description: NeuTTS is a collection of open-source on-device TTS models with instant voice cloning. Built off LLM backbones with GGUF format quantizations for efficient on-device deployment.

Release Date: Early 2026

| Feature | Value |
| --- | --- |
| Parameters | 360M (Air), 120M (Nano) |
| Zero-shot Voice Cloning | ✅ (3-second) |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | English, Spanish, German, French |
| Streaming | |
| On-Device | ✅ (GGUF quantizations) |
| License | Apache-2.0 (Air), NeuTTS Open License 1.0 (Nano) |

Links: GitHub Hugging Face Hugging Face

Qwen3-TTS

Description: Qwen3-TTS is an open-source series of Text-to-Speech models developed by Alibaba Cloud. Supports stable, expressive, and streaming speech generation with free-form voice design.

Release Date: January 22, 2026

| Feature | Value |
| --- | --- |
| Parameters | 0.6B-1.7B |
| Zero-shot Voice Cloning | ✅ (3-second) |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian) |
| Streaming | ✅ (97ms latency) |
| License | Apache-2.0 |

Links: GitHub Hugging Face arXiv

GLM-TTS

Description: High-quality TTS synthesis system based on LLMs from ZhipuAI, supporting zero-shot voice cloning with Multi-Reward Reinforcement Learning.

Release Date: December 11, 2025

| Feature | Value |
| --- | --- |
| Parameters | - |
| Zero-shot Voice Cloning | ✅ (3-10s) |
| ASR | |
| Pronunciation Control | ✅ (Phoneme-level) |
| Emotion Control | ✅ (RL-enhanced) |
| Languages | Chinese, English |
| Streaming | |
| License | Apache-2.0 |

Links: GitHub Hugging Face arXiv

VibeVoice-Realtime

Description: Real-time TTS model from Microsoft with streaming text input and ultra-low latency (~300ms).

Release Date: December 3, 2025

| Feature | Value |
| --- | --- |
| Parameters | 0.5B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | Multilingual |
| Streaming | ✅ (300ms) |
| Max Duration | ~10 minutes |
| License | MIT |

Links: GitHub Hugging Face

Fun-CosyVoice 3.0

Description: Advanced TTS system based on LLMs for zero-shot multilingual speech synthesis from FunAudioLLM.

Release Date: December 2025

| Feature | Value |
| --- | --- |
| Parameters | 0.5B |
| Zero-shot Voice Cloning | ✅ (Multi-lingual/Cross-lingual) |
| ASR | |
| Pronunciation Control | ✅ (Pinyin/CMU) |
| Emotion Control | |
| Languages | 9 + 18+ Chinese dialects |
| Streaming | ✅ (150ms) |
| License | Apache-2.0 |

Links: GitHub Hugging Face arXiv

MioTTS-2.6B

Description: Lightweight, high-speed LLM-based TTS model for English and Japanese with minimal resource usage.

Release Date: 2026

| Feature | Value |
| --- | --- |
| Parameters | 2.6B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | English, Japanese |
| Streaming | |
| RTF | 0.135-0.145 |
| License | LFM Open License |

Links: Hugging Face GitHub

Supertonic 2

Description: Lightning-fast, on-device text-to-speech system designed for extreme performance with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on-device—no cloud, no API calls, no privacy concerns. Outperforms ElevenLabs Flash v2.5 by up to 42× in speed benchmarks.

Release Date: 2026

| Feature | Value |
| --- | --- |
| Parameters | 66M |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | English, Korean, Spanish, Portuguese, French |
| Streaming | |
| RTF | 0.001-0.015 (up to 167× realtime) |
| On-Device | ✅ (ONNX Runtime) |
| License | OpenRAIL-M |

Performance Comparison:

| System | Speed (chars/sec) | RTF |
| --- | --- | --- |
| Supertonic 2 (RTX 4090) | 12,164 | 0.001 |
| Supertonic 2 (M4 Pro CPU) | 1,263 | 0.012 |
| ElevenLabs Flash v2.5 | 287 | 0.5 |
| Kokoro (Open-source) | 117 | 1.3 |
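
RTF figures appear throughout this list, so a short sketch of how to read them may help. It assumes the common convention that the real-time factor is synthesis time divided by audio duration (lower is faster); the benchmark above does not state its methodology, and the function names here are illustrative.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# RTF values copied from the comparison table above.
rtf_by_system = {
    "Supertonic 2 (RTX 4090)": 0.001,
    "Supertonic 2 (M4 Pro CPU)": 0.012,
    "ElevenLabs Flash v2.5": 0.5,
    "Kokoro (Open-source)": 1.3,
}

for name, rtf in rtf_by_system.items():
    print(f"{name}: {speedup(rtf):.1f}x real time")
```

Under this convention an RTF of 0.5 means audio is generated twice as fast as it plays back, while an RTF above 1 means generation is slower than playback.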

Links: GitHub Hugging Face Demo

KugelAudio

Description: Open-source TTS for European languages with 7B parameters. Outperformed ElevenLabs in human preference testing.

Release Date: Early 2026

| Feature | Value |
| --- | --- |
| Parameters | 7B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | ✅ (Speaking styles) |
| Languages | 23 European languages |
| Streaming | |
| License | MIT |

Links: GitHub Hugging Face Website

Kokoro-82M

Description: Kokoro is an open-weight Text-to-Speech model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

Release Date: January 27, 2025 (v1.0)

| Feature | Value |
| --- | --- |
| Parameters | 82M |
| Architecture | StyleTTS 2, ISTFTNet |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | ✅ (via misaki G2P) |
| Emotion Control | ✅ (voice styles) |
| Languages | 8 (54 voices) |
| Streaming | ✅ (generator pattern) |
| Cost | <$0.06 per hour of audio |
| License | Apache-2.0 |
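
The "generator pattern" noted for streaming can be illustrated with a small, self-contained sketch. It does not use Kokoro's actual API: `fake_synthesize` is a hypothetical stand-in that returns silence, and the 24 kHz sample rate and sentence-level chunking are assumptions made for the example.

```python
import re
from typing import Iterator, List

SAMPLE_RATE = 24000                     # assumed output rate for the sketch
SAMPLES_PER_CHAR = SAMPLE_RATE // 20    # pretend each character lasts 50 ms


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def fake_synthesize(sentence: str) -> List[float]:
    """Hypothetical stand-in for a TTS call: returns silent samples."""
    return [0.0] * (len(sentence) * SAMPLES_PER_CHAR)


def stream_speech(text: str) -> Iterator[List[float]]:
    """Yield one audio chunk per sentence so playback can start
    before the whole text has been synthesized."""
    for sentence in split_sentences(text):
        yield fake_synthesize(sentence)


for index, chunk in enumerate(stream_speech("Hello there. This is a test!")):
    print(index, len(chunk) / SAMPLE_RATE, "seconds")
```

Because `stream_speech` is a generator, the caller can hand each chunk to an audio device as soon as it arrives instead of waiting for the full utterance.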

Links: GitHub Hugging Face Demo

KokoClone

Description: KokoClone is a fast, real-time compatible multilingual voice cloning system built on top of Kokoro-ONNX. It enables users to type text in multiple languages, provide a short 3-10 second reference audio clip, and instantly generate speech in that same voice.

Release Date: 2025

| Feature | Value |
| --- | --- |
| Parameters | 82M (Base: Kokoro-ONNX) |
| Zero-shot Voice Cloning | ✅ (3-10s reference) |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | 7 (En, Hi, Fr, Ja, Zh, It, Pt, Es) |
| Streaming | ✅ (CPU real-time) |
| License | Apache-2.0 |

Links: GitHub Hugging Face Demo

IndexTTS2

Description: AI-Enhanced Text-to-Speech System with Intelligent Optimization and self-learning capabilities.

Release Date: November 2025

| Feature | Value |
| --- | --- |
| Parameters | - |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | ✅ (5 emotions) |
| Languages | Chinese, English |
| Streaming | |
| Multi-speaker | ✅ (1-4 speakers) |
| License | Apache-2.0 |

Links: GitHub Hugging Face

Maya1

Description: State-of-the-art speech model for expressive voice generation with natural language voice control.

Release Date: November 2025

| Feature | Value |
| --- | --- |
| Parameters | 3B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | ✅ (Tags) |
| Languages | English (Multi-accent) |
| Streaming | ✅ (<100ms) |
| License | Apache-2.0 |

Links: Hugging Face Website

LFM2-Audio-1.5B

Description: Liquid AI's first end-to-end audio foundation model with low latency and real-time conversation.

Release Date: November 28, 2025

| Feature | Value |
| --- | --- |
| Parameters | 1.5B |
| Zero-shot Voice Cloning | |
| ASR | ✅ (Integrated) |
| Pronunciation Control | N/A |
| Emotion Control | |
| Languages | English |
| Streaming | |
| License | LFM Open License |

Links: Hugging Face Website

Step-Audio-EditX

Description: 3B-parameter LLM-based RL audio model specialized in expressive and iterative audio editing.

Release Date: November 2025

| Feature | Value |
| --- | --- |
| Parameters | 3B (4B BF16) |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | ✅ (Polyphone) |
| Emotion Control | ✅ (14 emotions) |
| Languages | Mandarin, English, Sichuanese, Cantonese, Japanese, Korean |
| Streaming | |
| License | Apache-2.0 |

Links: Hugging Face arXiv

FireRedTTS2

Description: Long-form streaming TTS system for multi-speaker dialogue generation with stable, natural speech.

Release Date: September 2025

| Feature | Value |
| --- | --- |
| Parameters | - |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | EN, ZH, JP, KO, FR, DE, RU |
| Streaming | ✅ (140ms) |
| Multi-speaker | ✅ (4 speakers) |
| Max Duration | 3 minutes |
| License | Apache-2.0 |

Links: GitHub Hugging Face arXiv

VoxCPM

Description: Tokenizer-free TTS system for context-aware speech generation and true-to-life voice cloning.

Release Date: September 16, 2025

| Feature | Value |
| --- | --- |
| Parameters | 640M-800M |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | Chinese, English |
| Streaming | ✅ (RTF 0.17) |
| License | Apache-2.0 |

Links: GitHub Hugging Face arXiv

LuxTTS

Description: Lightweight ZipVoice-based TTS model for high quality voice cloning at speeds exceeding 150x realtime.

Release Date: 2025

| Feature | Value |
| --- | --- |
| Parameters | - |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | - |
| Streaming | |
| RTF | 150x |
| VRAM | 1GB |
| License | Apache-2.0 |

Links: GitHub Hugging Face

MegaTTS3

Description: Advanced zero-shot speech synthesis with Sparse Alignment Enhanced Latent Diffusion Transformer.

Release Date: March 22, 2025

| Feature | Value |
| --- | --- |
| Parameters | 0.45B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | Chinese, English |
| Streaming | |
| License | Apache-2.0 |

Links: GitHub Hugging Face arXiv

Spark-TTS

Description: Efficient LLM-Based TTS Model with Single-Stream Decoupled Speech Tokens, built on Qwen2.5.

Release Date: March 2025

| Feature | Value |
| --- | --- |
| Parameters | 0.5B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | Chinese, English |
| Streaming | |
| License | Apache-2.0 |

Links: GitHub Hugging Face arXiv

Fish Speech

Description: State-of-the-art open source TTS and voice cloning model that generates natural, realistic, and emotionally rich speech.

Release Date: May 31, 2025 (v1.5.1)

| Feature | Value |
| --- | --- |
| Parameters | 4B (S1), 0.5B (S1-mini) |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | 8 (EN, JP, KO, ZH, FR, DE, AR, ES) |
| Streaming | |
| RTF | ~1:7 |
| License | Apache-2.0 |

Links: GitHub Website

Step-Audio

Description: Production-ready open-source framework for intelligent speech interaction with unified speech comprehension and generation.

Release Date: February 17, 2025

| Feature | Value |
| --- | --- |
| Parameters | 130B (Chat), 3B (TTS) |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | Chinese, English, Japanese |
| Streaming | |
| License | Apache-2.0 |

Links: GitHub Hugging Face arXiv

Audio Flamingo 3 (AF3) / Audio Flamingo Next

Description: NVIDIA ADLR's fully open-source Large Audio Language Model with state-of-the-art audio understanding. Audio Flamingo Next (AF-Next) is the latest generation featuring stronger general audio understanding, longer context support, and timestamp-grounded reasoning.

Release Date: July 2025 (AF3), 2026 (AF-Next)

| Feature | Value |
| --- | --- |
| Parameters | 7B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | N/A |
| Emotion Control | |
| Languages | Multi-lingual |
| Streaming | |
| Context | Up to 30 minutes |
| License | Apache-2.0 |

Key Innovation (AF-Next): Staged curriculum training with GRPO-based RL post-training. Three specialized checkpoints: Instruct, Think (reasoning), and Captioner. Temporal Audio Chain-of-Thought grounding intermediate reasoning to timestamps.

Links: GitHub Hugging Face AF3 Website AF-Next

SoulX-Podcast

Description: SOTA Multi-Speaker TTS model for generating realistic long-form podcasts with dialectal diversity.

Release Date: 2025

| Feature | Value |
| --- | --- |
| Parameters | - |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | Mandarin, English, Cantonese, Sichuanese, Henanese |
| Streaming | |
| Max Duration | 90+ minutes |
| License | Apache-2.0 |

Links: GitHub Hugging Face arXiv

Chatterbox

Description: Family of SOTA open-source TTS models by Resemble AI with zero-shot voice cloning and multilingual synthesis.

Release Date: June 13, 2025 (v0.1.2)

| Feature | Value |
| --- | --- |
| Parameters | 350M-500M |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | ✅ (Tags) |
| Languages | 23+ |
| Streaming | |
| License | MIT |

Links: GitHub Website

Orpheus-TTS

Description: SOTA open-source TTS built on a Llama 3B backbone, demonstrating the emergent capabilities of LLMs for speech synthesis.

Release Date: April 2025

| Feature | Value |
| --- | --- |
| Parameters | 3B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | Multilingual |
| Streaming | ✅ (200ms) |
| License | Apache-2.0 |

Links: GitHub Website

Dia

Description: 1.6B parameter TTS model by Nari Labs for generating ultra-realistic dialogue in one pass.

Release Date: June 27, 2024

| Feature | Value |
| --- | --- |
| Parameters | 1.6B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | English |
| Streaming | |
| License | Apache-2.0 |

Links: GitHub Hugging Face

VieNeu-TTS

Description: Advanced on-device Vietnamese TTS model with instant voice cloning from 3-5 seconds of reference audio.

Release Date: 2025

| Feature | Value |
| --- | --- |
| Parameters | 0.3B-0.6B |
| Zero-shot Voice Cloning | ✅ (3-5s) |
| ASR | |
| Pronunciation Control | |
| Emotion Control | |
| Languages | Vietnamese |
| Streaming | ✅ (On-device) |
| License | Apache-2.0 |

Links: Hugging Face GitHub

MiMo-Audio

Description: Audio Language Model by Xiaomi functioning as a Few-Shot Learner with SOTA audio understanding.

Release Date: 2025

| Feature | Value |
| --- | --- |
| Parameters | 7B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | N/A |
| Emotion Control | |
| Languages | Multi-lingual |
| Streaming | |
| License | Apache-2.0 |

Links: GitHub Hugging Face

Kimi-Audio

Description: Open-source audio foundation model by Moonshot AI for audio understanding, generation, and conversation.

Release Date: 2025

| Feature | Value |
| --- | --- |
| Parameters | 7B |
| Zero-shot Voice Cloning | |
| ASR | |
| Pronunciation Control | N/A |
| Emotion Control | |
| Languages | Multi-lingual |
| Streaming | |
| License | MIT/Apache-2.0 |

Links: GitHub Hugging Face

ZipVoice

Description: Fast and high-quality zero-shot TTS models based on flow matching.

Release Date: June 16, 2025

| Feature | Value |
| --- | --- |
| Parameters | 123M |
| Zero-shot Cloning | |
| Languages | Chinese, English |
| Dialogue | |
| License | Apache-2.0 |

Links: GitHub Website arXiv


Music Generation Models

Music Quick Comparison

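| Model | Music Gen | Languages | Streaming | License |
| --- | --- | --- | --- | --- |
| ACE-Step 1.5 | | 50+ | | MIT |
| LeVo 2 | | Zh/En | | Apache-2.0 |
| Foundation-1 | ✅ (Samples) | - | | Stability AI |
| Music Flamingo | | - | - | Apache-2.0 |
| Magenta Realtime | | - | | Apache-2.0/CC-BY-4.0 |
| Uni-MoE (Audio) | | - | | Apache-2.0 |

ACE-Step 1.5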

Description: The most powerful local music generation model outperforming most commercial alternatives. Supports Mac, AMD, Intel, and CUDA devices.

Release Date: February 20, 2026 (v0.1.2)

| Feature | Value |
| --- | --- |
| Parameters | 0.6B-4B (LM), DiT variants |
| Music Generation | |
| Lyrics Support | ✅ (50+ languages) |
| Voice2BGM | |
| Reference Audio | |
| Track Separation | |
| Duration | 10s - 10min |
| VRAM | <4GB |
| Platforms | CUDA, MPS, ROCm, XPU, CPU |
| License | MIT |

Links: GitHub Hugging Face Website arXiv

LeVo 2 (SongGeneration 2)

Description: Open-source foundation model for commercial-grade music generation by Tencent AI Lab. It outperforms open-source baselines and rivals commercial systems in Overall Quality, Melody, Arrangement, Sound Quality, and Structure.

Release Date: 2025

| Feature | Value |
| --- | --- |
| Architecture | Hybrid LLM-Diffusion |
| Music Generation | |
| Lyrics Support | ✅ (Chinese, English) |
| Multilingual | ✅ (Zh, En) |
| Text/Audio Prompts | |
| VRAM | 12GB-22GB |
| License | Apache-2.0 |

Links: GitHub Hugging Face Demo

Foundation-1

Description: Structured text-to-sample generation model for music production workflows. Generates tempo-synced, key-aware, bar-aware samples, with support for instrument identity, timbre control, and FX processing.

Release Date: 2025

| Feature | Value |
| --- | --- |
| Type | Text-to-Sample (Music) |
| Base Model | stabilityai/stable-audio-open-1.0 |
| Instrument Control | |
| Timbre Descriptors | ✅ (Warm, Bright, etc.) |
| FX Tags | ✅ (Reverb, Delay, etc.) |
| Musical Notation | ✅ (Chord, Melody, Arp) |
| VRAM | ~8GB |
| License | Stability AI Community License |

Links: Hugging Face

Music Flamingo

Description: Large audio-language model designed to advance music (including song) understanding. Achieves SOTA on 10+ music benchmarks.

Release Date: 2025

| Feature | Value |
| --- | --- |
| Parameters | - |
| Music Understanding | |
| Music Generation | |
| Rich Captions | |
| Music QA | |
| Reasoning | ✅ (Chain-of-thought) |
| Long-form | |
| License | Apache-2.0 |

Links: Website Hugging Face

Magenta Realtime

Description: Open music generation model from Google DeepMind enabling continuous generation of musical audio steered by text prompts or audio examples.

Release Date: August 2025

| Feature | Value |
| --- | --- |
| Parameters | - |
| Music Generation | ✅ (Real-time) |
| Text-to-Music | |
| Audio-to-Music | |
| Reference Audio | |
| Continuous Generation | |
| Latency | Style prompt 2s+ |
| Context | 10 seconds |
| Training Data | ~190k hours |
| License | Apache-2.0 (code), CC-BY-4.0 (model) |

Links: GitHub Hugging Face arXiv

SoulX-Singer

(Already listed in TTS - singing voice synthesis)

| Feature | Value |
| --- | --- |
| Parameters | - |
| Singing Generation | |
| Zero-shot | |
| Melody Control | ✅ (F0/MIDI) |
| Languages | Mandarin, English, Cantonese |
| License | Apache-2.0 |

Uni-MoE (Audio)

Description: MoE-based omnimodal model with voice cloning, TTS, T2M (text-to-music), and V2M (video-to-music).

Release Date: October 16, 2025 (Uni-MoE-Audio)

| Feature | Value |
| --- | --- |
| Parameters | - |
| Voice Cloning | |
| TTS | |
| Text-to-Music | |
| Video-to-Music | |
| Dynamic Routing | |
| License | Apache-2.0 |

Links: GitHub arXiv


Anything to Audio

Models that can generate audio from multiple input modalities (video, text, image, audio). These are unified frameworks for multimodal audio synthesis.

Anything to Audio Quick Comparison

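| Model | Text | Video | Image | Audio | License |
| --- | --- | --- | --- | --- | --- |
| Woosh | | | | | Apache-2.0 |
| PrismAudio | | | | | Apache-2.0 |
| ThinkSound | | | | | Apache-2.0 |
| HunyuanVideo-Foley | | | | | Research Only |
| MMAudio | | | | | Apache-2.0 |
| AudioX | | | | | Apache-2.0 |
| Uni-MoE (Audio) | | | | | Apache-2.0 |

AudioX / Audio-Omni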

Description: Audio-Omni is the first end-to-end framework unifying understanding, generation, and editing across general sound, music, and speech domains. Presented at SIGGRAPH 2026. AudioX is a unified framework integrating text, video, image, and audio conditions.

Release Date: March 2025 (AudioX), 2026 (Audio-Omni)

| Feature | Value |
| --- | --- |
| Parameters | - |
| Text-to-Audio | |
| Text-to-Music | |
| Text-to-Speech | |
| Video-to-Audio/Music | |
| Audio Editing | ✅ (Add/Remove/Extract/Style) |
| Voice Conversion | |
| License | Apache-2.0 / CC-BY-NC-4.0 |

Key Innovation: First unified framework covering all three audio domains. Combines frozen multimodal LLM (Qwen2.5-Omni) with trainable Diffusion Transformer for high-fidelity synthesis. Any-to-any audio processing.

Links: GitHub AudioX GitHub Audio-Omni Hugging Face AudioX Hugging Face Audio-Omni arXiv AudioX

MMAudio

Description: Multimodal joint training framework for high-quality synchronized audio generation from video and/or text inputs. State-of-the-art open source model for generating sounds for videos, images, and text prompts.

Release Date: December 2024 (CVPR 2025)

| Feature | Value |
| --- | --- |
| Parameters | - |
| Video-to-Audio | |
| Text-to-Audio | |
| Image-to-Audio | |
| Synchronized Audio | |
| Multimodal Joint Training | |
| License | Apache-2.0 |

Links: GitHub Hugging Face Demo arXiv

HunyuanVideo-Foley

Description: Tencent's end-to-end video sound effect generation model for professional-grade AI Foley sound generation. Analyzes footage and creates immersive audio that matches the visual content perfectly.

Release Date: 2025

| Feature | Value |
| --- | --- |
| Parameters | - |
| Video-to-Audio (Foley) | |
| Text-to-Audio | |
| High-Quality Foley | |
| Context-Aware | |
| Output Quality | 48 kHz |
| License | Research & Non-commercial only |

Links: GitHub Demo Website arXiv

ThinkSound

Description: Unified Any2Audio generation framework with flow matching guided by Chain-of-Thought (CoT) reasoning. Supports generating or editing audio from video, text, audio, or their combinations. Accepted to NeurIPS 2025.

Release Date: 2025

| Feature | Value |
| --- | --- |
| Parameters | - |
| Video-to-Audio | ✅ (SOTA) |
| Text-to-Audio | |
| Audio-to-Audio | |
| Audio Editing | |
| CoT-Driven Reasoning | |
| Interactive Object-centric Editing | |
| License | Apache-2.0 (Research only) |

Links: GitHub Hugging Face Demo

Woosh

Description: Sony AI's sound effect foundation model for text-to-audio and video-to-audio generation. Includes Woosh-AE (audio encoder/decoder), Woosh-Flow/DFlow (T2A), and Woosh-VFlow/DVFlow (V2A) with distilled fast inference variants.

Release Date: 2026

| Feature | Value |
| --- | --- |
| Text-to-Audio | |
| Video-to-Audio | |
| Audio Encoding | |
| Fast Inference | ✅ (Distilled models) |
| Architecture | Flow-based generative models |
| License | Apache-2.0 |

Key Innovation: Optimized for sound effects (not general audio) with both public and private model versions. Video-conditioned generation without requiring captions. Competitive with Stable Audio Open and TangoFlux.

Links: GitHub arXiv

PrismAudio

Description: Video-to-Audio generation framework with Reinforcement Learning and specialized Chain-of-Thought (CoT) planning. Decomposes reasoning into four specialized modules (Semantic, Temporal, Aesthetic, Spatial CoT) for comprehensive video understanding. Built upon ThinkSound.

Release Date: 2025 (ICLR 2026)

| Feature | Value |
| --- | --- |
| Parameters | 518M |
| Video-to-Audio | |
| CoT Planning | ✅ (4 modules) |
| Multi-Dimensional RL | |
| Fast-GRPO | ✅ (Hybrid ODE-SDE) |
| Inference Time | 0.63 seconds |
| License | Apache-2.0 |

Performance Benchmarks:

| Metric | VGGSound | AudioCanvas |
| --- | --- | --- |
| Semantic (CLAP) | 0.47 | 0.52 |
| Temporal (DeSync↓) | 0.41 | 0.36 |
| Aesthetic (MOS-Q) | 4.21±0.35 | 4.12±0.28 |

Links: GitHub Hugging Face Demo arXiv

Uni-MoE (Audio)

Description: MoE-based omnimodal model with voice cloning, TTS, T2M (text-to-music), and V2M (video-to-music).

Release Date: October 16, 2025 (Uni-MoE-Audio)

| Feature | Value |
| --- | --- |
| Parameters | - |
| Voice Cloning | |
| TTS | |
| Text-to-Music | |
| Video-to-Music | |
| Dynamic Routing | |
| License | Apache-2.0 |

Links: GitHub arXiv


Audio Restoration & Enhancement

Audio Restoration & Enhancement Quick Comparison

| Model | Type | Bandwidth Extension | Inpainting | License |
| --- | --- | --- | --- | --- |
| NVIDIA A2SB | Restoration | | | NVIDIA Non-Commercial |
| NovaSR | Enhancement | | | Apache-2.0 |
| AudioSR | Enhancement | | | MIT |

NVIDIA A2SB (Audio-to-Audio Schrodinger Bridges)

Description: Diffusion-based audio restoration model tailored for high-resolution music at 44.1kHz. An end-to-end, vocoder-free, multi-task model capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Can restore hour-long audio inputs without boundary artifacts.

Release Date: January 2025

| Feature | Value |
| --- | --- |
| Architecture | End-to-end vocoder-free |
| Bandwidth Extension | |
| Audio Inpainting | |
| High-Resolution | ✅ (44.1kHz) |
| Long Audio | ✅ (hour-long) |
| Streaming | |
| License | NVIDIA OneWay NonCommercial License |

Links: GitHub Hugging Face arXiv

NovaSR

Description: Very fast audio upsampler: a roughly 50 kB model that upscales 16 kHz audio to 48 kHz at about 3500x real time.

Release Date: 2025

| Feature | Value |
| --- | --- |
| Size | 52kB |
| Speed | 3600x realtime (A100) |
| Input | 16kHz |
| Output | 48kHz |
| VRAM | Minimal |
| License | Apache-2.0 |

Links: GitHub Hugging Face

AudioSR

Description: Audio super resolution model using latent diffusion to upscale low-quality audio to 48kHz.

Release Date: February 12, 2026 (v1.1.1)

| Feature | Value |
| --- | --- |
| Input | 8kHz-48kHz |
| Output | 48kHz |
| VRAM | 6GB min |
| Stereo | |
| Long Audio | |
| License | MIT |

Links: GitHub arXiv


Speech Recognition (ASR)

ASR Quick Comparison

| Model | Languages | Streaming | License |
| --- | --- | --- | --- |
| Cohere Transcribe | 14 | | Apache-2.0 |
| VibeVoice-ASR | 50+ | | MIT |
| FunASR | 50+ | | MIT |

Cohere Transcribe

Description: Open-source automatic speech recognition (ASR) model developed by Cohere. A 2 billion parameter dedicated audio-in, text-out model that ranks #1 on the English ASR leaderboard.

Release Date: March 2026

| Feature | Value |
| --- | --- |
| Parameters | 2B |
| Architecture | Conformer-based encoder-decoder |
| ASR | |
| Languages | 14 (En, Fr, De, It, Es, Pt, Gr, Nl, Pl, Zh, Jp, Ko, Vi, Ar) |
| Streaming | |
| RTFx | Up to 3x faster than comparable models |
| License | Apache-2.0 |

Key Features:

  • Long-form transcription with automatic chunking (>35 seconds)
  • Optional punctuation control
  • Batched inference support
  • vLLM integration for production serving
  • Apple Silicon support via mlx-audio
  • WebGPU browser deployment via transformers.js
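
The long-form chunking mentioned above can be pictured with a small sketch. This is not Cohere's algorithm: it simply assumes that audio longer than 35 seconds is cut into windows of at most 35 seconds, and the function name and optional `overlap_s` parameter are hypothetical.

```python
from typing import List, Tuple


def chunk_spans(
    total_s: float,
    max_chunk_s: float = 35.0,
    overlap_s: float = 0.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole recording,
    with each span no longer than max_chunk_s."""
    if max_chunk_s <= 0:
        raise ValueError("max_chunk_s must be positive")
    if not 0 <= overlap_s < max_chunk_s:
        raise ValueError("overlap_s must be in [0, max_chunk_s)")

    spans: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_s:
        end = min(start + max_chunk_s, total_s)
        spans.append((start, end))
        if end >= total_s:
            break
        # Step back by the overlap so neighbouring chunks share context.
        start = end - overlap_s
    return spans


print(chunk_spans(80.0))   # [(0.0, 35.0), (35.0, 70.0), (70.0, 80.0)]
print(chunk_spans(30.0))   # [(0.0, 30.0)]
```

Each span would then be transcribed separately and the texts joined; a small overlap is one common way to avoid cutting a word in half at a boundary.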

Links: Hugging Face Demo Blog

VibeVoice-ASR

Description: Microsoft's unified speech-to-text model for 60-minute long-form audio processing with speaker diarization and timestamping.

Release Date: January 21, 2026

| Feature | Value |
| --- | --- |
| Parameters | 7B |
| ASR | |
| Languages | 50+ |
| Streaming | |
| License | MIT |

Links: GitHub Hugging Face

FunASR

Description: Fundamental end-to-end speech recognition toolkit with SOTA pretrained models.

Release Date: Ongoing (First: 2023)

| Feature | Value |
| --- | --- |
| ASR | |
| VAD | |
| Punctuation | |
| Speaker Diarization | |
| Multi-talker ASR | |
| Timestamp | |
| Emotion Recognition | |
| Languages | 50+ |
| License | MIT/Model License |

Links: GitHub Website


Additional Resources

ComfyUI Integrations

Leaderboard

| Resource | Description | Link |
| --- | --- | --- |
| Open ASR Leaderboard | Hugging Face leaderboard for comparing ASR model performance across languages and metrics. | Hugging Face |

Contributing

This list is continuously evolving. If you have any models to add or updates to suggest, please feel free to contribute!


Last Updated: March 2026
