Awesome TTS & Voice Generation Models

A curated list of open-source Text-to-Speech (TTS), voice cloning, and music generation models. Models are sorted by release date (newest first).

Text-to-Speech (TTS) Models

TTS Quick Comparison

Model	Voice Cloning	ASR	Languages	Streaming	License
LongCat-AudioDiT	✅	❌	Zh/En	❌	MIT
VoxCPM2	✅	❌	30	✅	Apache-2.0
MOSS-TTS-Nano	✅	❌	20	✅	Apache-2.0
T5Gemma-TTS	✅	❌	En/Zh/Jp	❌	MIT
TinyTTS	❌	❌	En	✅	Apache-2.0
LEMAS-TTS	✅	❌	10	❌	Apache-2.0
OmniVoice	✅	❌	600+	❌	Apache-2.0
LongCat-Next	✅	✅	Zh/En	✅	MIT
Voxtral-4B-TTS	✅	❌	9	✅	CC BY-NC 4.0
Irodori-TTS-500M-v2	✅	❌	Jp	❌	MIT
Fish Audio S2 Pro	✅	❌	80+	✅	Research License
KittenTTS	✅	❌	En+	✅	Apache-2.0
MOSS-TTS	✅	❌	20	✅	Apache-2.0
SoulX-Singer	✅ (Singing)	❌	Zh/En/Canto	✅	Apache-2.0
SoproTTS	✅	❌	En	✅	Apache-2.0
NeuTTS	✅	❌	En/Es/De/Fr	✅	Apache-2.0
Qwen3-TTS	✅	❌	10	✅	Apache-2.0
GLM-TTS	✅	❌	Zh/En	✅	Apache-2.0
VibeVoice-Realtime	✅	❌	Multi	✅	MIT
Fun-CosyVoice 3.0	✅	❌	9 + 18 dialects	✅	Apache-2.0
MioTTS-2.6B	✅	❌	En/Jp	✅	LFM
Supertonic 2	❌	❌	5	✅	OpenRAIL-M
KugelAudio	✅	❌	23 EU	✅	MIT
Kokoro-82M	✅	❌	8 (54 voices)	✅	Apache-2.0
KokoClone	✅	❌	8	✅	Apache-2.0
IndexTTS2	✅	❌	Zh/En	✅	Apache-2.0
Maya1	✅	❌	En	✅	Apache-2.0
LFM2-Audio-1.5B	✅	✅	En	✅	LFM
Step-Audio-EditX	✅	❌	Zh/En/Jp/Ko	✅	Apache-2.0
FireRedTTS2	✅	❌	7 langs	✅	Apache-2.0
VoxCPM	✅	❌	Zh/En	✅	Apache-2.0
LuxTTS	✅	❌	-	✅	Apache-2.0
MegaTTS3	✅	❌	Zh/En	✅	Apache-2.0
Spark-TTS	✅	❌	Zh/En	✅	Apache-2.0
Fish Speech	✅	❌	8 langs	✅	Apache-2.0
Step-Audio	✅	✅	Zh/En/Jp	✅	Apache-2.0
SoulX-Podcast	✅	❌	Zh/En/Canto	✅	Apache-2.0
Chatterbox	✅	❌	23+	✅	MIT
Orpheus-TTS	✅	❌	Multi	✅	Apache-2.0
Dia	✅	❌	En	✅	Apache-2.0
VieNeu-TTS	✅	❌	Vi	✅	Apache-2.0
MiMo-Audio	✅	✅	Multi	✅	Apache-2.0
Kimi-Audio	✅	✅	Multi	✅	MIT/Apache-2.0
ZipVoice	✅	❌	Zh/En	✅	Apache-2.0
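
The table above is tab-separated, so it can be queried programmatically. The following is a minimal sketch, assuming a handful of rows copied verbatim from the table; the helper names `load_rows` and `streaming_cloners` and the `PERMISSIVE` set are illustrative choices, not part of any listed project.

```python
import csv
import io

# A few rows copied from the comparison table above (tab-separated).
TABLE = (
    "Model\tVoice Cloning\tASR\tLanguages\tStreaming\tLicense\n"
    "LongCat-AudioDiT\t✅\t❌\tZh/En\t❌\tMIT\n"
    "VoxCPM2\t✅\t❌\t30\t✅\tApache-2.0\n"
    "TinyTTS\t❌\t❌\tEn\t✅\tApache-2.0\n"
    "Voxtral-4B-TTS\t✅\t❌\t9\t✅\tCC BY-NC 4.0\n"
    "Chatterbox\t✅\t❌\t23+\t✅\tMIT\n"
)

# Assumption: which licenses count as "permissive" is a choice made for this example.
PERMISSIVE = {"MIT", "Apache-2.0"}


def load_rows(text):
    """Parse the tab-separated table into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def streaming_cloners(rows):
    """Return models that clone voices, stream, and carry a permissive license."""
    return [
        row["Model"]
        for row in rows
        if row["Voice Cloning"].startswith("✅")
        and row["Streaming"].startswith("✅")
        and row["License"] in PERMISSIVE
    ]


print(streaming_cloners(load_rows(TABLE)))
```

Using `startswith("✅")` rather than equality keeps annotated cells such as `✅ (Singing)` in the result set.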

LongCat-AudioDiT

Description: State-of-the-art diffusion-based TTS model operating directly in waveform latent space. Developed by Meituan's LongCat team, it requires only a Waveform VAE and Diffusion backbone, effectively mitigating compounding errors.

Release Date: March 30, 2026

Feature	Value
Parameters	1B / 3.5B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	❌
Emotion Control	❌
Languages	Chinese, English
Streaming	❌
Sample Rate	24000 Hz
License	MIT

Key Innovation: Adaptive Projection Guidance (APG) replaces traditional classifier-free guidance for elevated generation quality. Outperforms Seed-TTS on zero-shot voice cloning benchmarks.

VoxCPM2

Description: OpenBMB's next-generation tokenizer-free diffusion autoregressive TTS model with 2 billion parameters. Supports 30 languages with automatic detection, voice design from text descriptions, and high-fidelity voice cloning.

Release Date: 2026

Feature	Value
Parameters	2B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅ (Voice Design)
Languages	30 (+ 9 Chinese dialects)
Streaming	✅ (RTF ~0.3)
Audio Output	48 kHz
License	Apache-2.0

Key Innovation: Tokenizer-free design with LocEnc → TSLM → RALM → LocDiT pipeline. Built-in super-resolution via AudioVAE V2 for 48kHz output.

MOSS-TTS-Nano

Description: Ultra-lightweight open-source multilingual speech generation model with only 0.1B parameters. Designed for realtime speech generation that runs directly on CPU without GPU.

Release Date: 2026

Feature	Value
Parameters	0.1B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	❌
Emotion Control	❌
Languages	20
Streaming	✅ (CPU-friendly)
Audio Output	48 kHz Stereo
License	Apache-2.0

Key Innovation: Pure autoregressive architecture with MOSS-Audio-Tokenizer-Nano. Compresses audio to 12.5 Hz token stream using RVQ with 16 codebooks. Runs on 4-core CPU.
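
To make the tokenizer figures concrete, the arithmetic can be sketched as follows. This only counts discrete codes, assuming each 12.5 Hz frame carries one code per codebook; how the model orders or groups those codes during decoding is not stated here, and the function names are illustrative.

```python
import math


def tokens_per_second(frame_rate_hz: float, num_codebooks: int) -> float:
    """Codes emitted per second of audio: one code per codebook per frame."""
    return frame_rate_hz * num_codebooks


def tokens_for_clip(seconds: float, frame_rate_hz: float = 12.5, num_codebooks: int = 16):
    """Return (frames, total codes) needed to represent a clip of the given length."""
    frames = math.ceil(seconds * frame_rate_hz)
    return frames, frames * num_codebooks


print(tokens_per_second(12.5, 16))   # codes per second of audio
print(tokens_for_clip(10))           # (frames, codes) for a 10-second clip
```

Under these assumptions a 10-second clip is 125 frames, or 2,000 codes in total.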

T5Gemma-TTS

Description: Multilingual TTS model with voice cloning and duration control, built on the T5Gemma encoder-decoder LLM architecture. Supports batch generation for multiple audio variations.

Release Date: 2026

Feature	Value
Parameters	2B-2B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	❌
Languages	English, Chinese, Japanese
Streaming	❌
VRAM	7.6-10.6 GB
License	MIT

Key Innovation: PM-RoPE positional encoding with XCodec2 audio codec. Low-VRAM options with CPU offloading. Batch inference efficiency with single encoder pass.

TinyTTS

Description: The smallest English TTS model with only 1.6 million parameters. End-to-end neural network achieving ~53x real-time synthesis speed on CPU via ONNX optimization.

Release Date: 2026

Feature	Value
Parameters	1.6M
Zero-shot Voice Cloning	❌
ASR	❌
Pronunciation Control	✅
Emotion Control	❌
Languages	English
Streaming	✅ (~53x realtime)
Model Size	~3.4 MB (ONNX FP16)
License	Apache-2.0

Key Innovation: Ultra-compact architecture optimized for CPU-only deployment. Multi-platform support via Python and Node.js APIs. Works on laptops, edge devices, and embedded systems.
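
As a rough sanity check on the size figures, weight storage can be estimated from the parameter count and numeric precision. This is a back-of-the-envelope sketch: the `BYTES_PER_PARAM` table and function name are illustrative, and real model files also carry graph metadata and possibly tensors stored at other precisions.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_megabytes(num_params: float, dtype: str = "fp16") -> float:
    """Estimate raw weight storage in MB (10^6 bytes), ignoring file overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


print(weight_megabytes(1.6e6, "fp16"))   # raw FP16 weights for 1.6M parameters
print(weight_megabytes(5e9, "bf16"))     # raw BF16 weights for a 5B-parameter model
```

For 1.6M parameters in FP16 the estimate is 3.2 MB, slightly below the listed ~3.4 MB, with the remainder presumably being file overhead. The same arithmetic gives about 10 GB for a 5B-parameter BF16 model, in line with the size listed for Fish Audio S2 Pro.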

LEMAS-TTS

Description: Part of the LEMAS (Large-scale Extensible Multilingual Audio Suite) project. Zero-shot multilingual TTS with 0.3B parameters supporting 10 languages with word-level precise editing capabilities.

Release Date: 2026

Feature	Value
Parameters	0.3B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅
Languages	10 (zh/en/de/fr/es/pt/it/ru/id/vi)
Streaming	❌
Special Feature	Word-level editing (LEMAS-Edit)
License	Apache-2.0

Key Innovation: Built on 150,000+ hours of multilingual speech data with word-level timestamps. Includes LEMAS-Edit for precise word-level speech editing via masked token infilling.

OmniVoice

Description: Massive multilingual zero-shot TTS model scaling to 600+ languages. Uses diffusion language model-style discrete non-autoregressive architecture with single-stage text-to-acoustic mapping.

Release Date: 2026

Feature	Value
Parameters	-
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅ (Pinyin/CMU)
Emotion Control	✅ (Voice Design)
Languages	600+
Streaming	❌
Training Data	581k hours
License	Apache-2.0

Key Innovation: Simplified single-stage architecture vs conventional two-stage pipelines. Full-codebook random masking strategy with LLM initialization for superior intelligibility. Noise-robust prompt processing.

LongCat-Next

Description: Native multimodal foundation model by Meituan LongCat Team processing text, vision, and audio under a single autoregressive objective. Industrial-strength model with strong speech synthesis and voice cloning.

Release Date: March 2026

Feature	Value
Parameters	3B (MoE A3B)
Zero-shot Voice Cloning	✅
ASR	✅
Pronunciation Control	✅
Emotion Control	✅
Languages	Chinese, English
Streaming	✅ (Low latency)
Audio Output	24 kHz
License	MIT

Key Innovation: Discrete Native Autoregression Paradigm (DiNA) unifying modalities in shared discrete token space. Combines visual understanding, generation, and audio processing in single model.

Voxtral-4B-TTS

Description: Frontier, open-weights text-to-speech model developed by Mistral AI. Designed to be fast, instantly adaptable, and produces lifelike speech with natural prosody and emotional range.

Release Date: March 2026

Feature	Value
Parameters	4B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	❌
Emotion Control	✅ (expressive speech)
Languages	9 (En, Fr, Es, De, It, Pt, Nl, Ar, Hi)
Streaming	✅ (RTF 0.103 at concurrency 1)
Audio Output	24 kHz
License	CC BY-NC 4.0

Irodori-TTS-500M-v2

Description: Japanese Text-to-Speech model based on Rectified Flow Diffusion Transformer. Features emoji-based style and sound effect control by embedding emojis in input text for expressive speech generation.

Release Date: 2026

Feature	Value
Parameters	500M
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	❌
Emotion Control	✅ (emoji-based)
Languages	Japanese
Streaming	❌
Output Quality	48kHz waveform
License	MIT

Key Feature: Emoji annotation control - insert specific emojis into text to control speaking styles, emotions, and sound effects.

Fish Audio S2 Pro

Description: Fish Audio S2 Pro is a leading text-to-speech model with fine-grained inline control of prosody and emotion. It combines reinforcement learning alignment with a dual-autoregressive architecture for high-quality speech synthesis.

Release Date: March 10, 2026

Feature	Value
Parameters	5B (4B Slow AR + 400M Fast AR)
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅ (15,000+ tags)
Emotion Control	✅ (fine-grained inline control)
Languages	80+ (Tier 1: En, Zh, Jp)
Streaming	✅ (RTF 0.195, 100ms TTFA)
Model Size	~10 GB (BF16)
License	Fish Audio Research License

KittenTTS

Description: KittenTTS is an open-source, realistic-sounding text-to-speech model designed for lightweight deployment. Its smallest variant is under 25 MB with just 15 million parameters and runs on CPU without a GPU.

Release Date: February 24, 2026 (v0.8.1)

Feature	Value
Parameters	15M-80M
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	❌
Emotion Control	✅
Languages	English, Multiple
Streaming	✅
License	Apache-2.0

MOSS-TTS

Description: MOSS-TTS is a production-grade Text-to-Speech foundation model developed by OpenMOSS Team and MOSI.AI. Features state-of-the-art evaluation performance on Seed-TTS-eval benchmark with zero-shot voice cloning.

Release Date: February 10, 2026

Feature	Value
Parameters	8B (Delay), 1.7B (Local)
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅ (Pinyin/Phoneme-level)
Emotion Control	✅
Languages	20 languages
Streaming	✅
Max Duration	1 hour
License	Apache-2.0

SoulX-Singer

Description: SoulX-Singer is a high-fidelity, zero-shot singing voice synthesis model for generating realistic singing voices for unseen singers without fine-tuning.

Release Date: February 6, 2026

Feature	Value
Parameters	-
Zero-shot Voice Cloning	✅ (Singing)
ASR	❌
Pronunciation Control	✅ (MIDI/F0)
Emotion Control	✅
Languages	Mandarin, English, Cantonese
Streaming	✅
License	Apache-2.0

SoproTTS

Description: SoproTTS is a lightweight English text-to-speech model with zero-shot voice cloning. It uses dilated convolutions (WaveNet-style) and lightweight cross-attention layers instead of the common Transformer architecture.

Release Date: February 4, 2026 (v1.5)

Feature	Value
Parameters	135M
Zero-shot Voice Cloning	✅ (3-12s)
ASR	❌
Pronunciation Control	❌
Emotion Control	✅ (style_strength)
Languages	English
Streaming	✅ (250ms TTFA)
RTF	0.05 (CPU M3)
Training Cost	~$100
License	Apache-2.0

NeuTTS

Description: NeuTTS is a collection of open-source on-device TTS models with instant voice cloning. Built off LLM backbones with GGUF format quantizations for efficient on-device deployment.

Release Date: Early 2026

Feature	Value
Parameters	360M (Air), 120M (Nano)
Zero-shot Voice Cloning	✅ (3-second)
ASR	❌
Pronunciation Control	❌
Emotion Control	❌
Languages	English, Spanish, German, French
Streaming	✅
On-Device	✅ (GGUF quantizations)
License	Apache-2.0 (Air), NeuTTS Open License 1.0 (Nano)

Qwen3-TTS

Description: Qwen3-TTS is an open-source series of Text-to-Speech models developed by Alibaba Cloud. Supports stable, expressive, and streaming speech generation with free-form voice design.

Release Date: January 22, 2026

Feature	Value
Parameters	0.6B-1.7B
Zero-shot Voice Cloning	✅ (3-second)
ASR	❌
Pronunciation Control	✅
Emotion Control	✅
Languages	10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian)
Streaming	✅ (97ms latency)
License	Apache-2.0

GLM-TTS

Description: High-quality TTS synthesis system based on LLMs from ZhipuAI, supporting zero-shot voice cloning with Multi-Reward Reinforcement Learning.

Release Date: December 11, 2025

Feature	Value
Parameters	-
Zero-shot Voice Cloning	✅ (3-10s)
ASR	❌
Pronunciation Control	✅ (Phoneme-level)
Emotion Control	✅ (RL-enhanced)
Languages	Chinese, English
Streaming	✅
License	Apache-2.0

VibeVoice-Realtime

Description: Real-time TTS model from Microsoft with streaming text input and ultra-low latency (~300ms).

Release Date: December 3, 2025

Feature	Value
Parameters	0.5B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅
Languages	Multilingual
Streaming	✅ (300ms)
Max Duration	~10 minutes
License	MIT

Fun-CosyVoice 3.0

Description: Advanced TTS system based on LLMs for zero-shot multilingual speech synthesis from FunAudioLLM.

Release Date: December 2025

Feature	Value
Parameters	0.5B
Zero-shot Voice Cloning	✅ (Multi-lingual/Cross-lingual)
ASR	❌
Pronunciation Control	✅ (Pinyin/CMU)
Emotion Control	✅
Languages	9 + 18+ Chinese dialects
Streaming	✅ (150ms)
License	Apache-2.0

MioTTS-2.6B

Description: Lightweight, high-speed LLM-based TTS model for English and Japanese with minimal resource usage.

Release Date: 2026

Feature	Value
Parameters	2.6B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	❌
Emotion Control	❌
Languages	English, Japanese
Streaming	✅
RTF	0.135-0.145
License	LFM Open License

Supertonic 2

Description: Lightning-fast, on-device text-to-speech system designed for extreme performance with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on-device—no cloud, no API calls, no privacy concerns. Outperforms ElevenLabs Flash v2.5 by up to 42× in speed benchmarks.

Release Date: 2026

Feature	Value
Parameters	66M
Zero-shot Voice Cloning	❌
ASR	❌
Pronunciation Control	❌
Emotion Control	❌
Languages	English, Korean, Spanish, Portuguese, French
Streaming	✅
RTF	0.001-0.015 (up to 167× realtime)
On-Device	✅ (ONNX Runtime)
License	OpenRAIL-M

Performance Comparison:

System	Speed (chars/sec)	RTF
Supertonic 2 (RTX 4090)	12,164	0.001
Supertonic 2 (M4 Pro CPU)	1,263	0.012
ElevenLabs Flash v2.5	287	0.5
Kokoro (Open-source)	117	1.3
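
The two columns above measure different things: characters per second is raw throughput, while RTF (real-time factor) is processing time divided by the duration of the audio produced, so values below 1 are faster than realtime. A minimal sketch of how the figures relate, using only the numbers in the table (the function names are illustrative):

```python
# (characters per second, RTF) as listed in the table above.
SYSTEMS = {
    "Supertonic 2 (RTX 4090)": (12164, 0.001),
    "Supertonic 2 (M4 Pro CPU)": (1263, 0.012),
    "ElevenLabs Flash v2.5": (287, 0.5),
    "Kokoro (Open-source)": (117, 1.3),
}


def realtime_multiple(rtf: float) -> float:
    """Seconds of audio generated per second of compute (the inverse of RTF)."""
    return 1.0 / rtf


def speedup(system: str, baseline: str = "ElevenLabs Flash v2.5") -> float:
    """Throughput ratio in characters per second relative to a baseline system."""
    return SYSTEMS[system][0] / SYSTEMS[baseline][0]


for name, (chars_per_sec, rtf) in SYSTEMS.items():
    print(f"{name}: {realtime_multiple(rtf):.1f}x realtime, "
          f"{speedup(name):.1f}x baseline throughput")
```

The RTX 4090 throughput ratio works out to roughly 42x the ElevenLabs figure, which matches the "up to 42×" claim in the description. An RTF of 1.3 corresponds to slower-than-realtime generation.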

KugelAudio

Description: Open-source TTS for European languages with 7B parameters. Outperformed ElevenLabs in human preference testing.

Release Date: Early 2026

Feature	Value
Parameters	7B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅ (Speaking styles)
Languages	23 European languages
Streaming	✅
License	MIT

Kokoro-82M

Description: Kokoro is an open-weight Text-to-Speech model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

Release Date: January 27, 2025 (v1.0)

Feature	Value
Parameters	82M
Architecture	StyleTTS 2, ISTFTNet
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅ (via misaki G2P)
Emotion Control	✅ (voice styles)
Languages	8 (54 voices)
Streaming	✅ (generator pattern)
Cost	<$0.06 per hour of audio
License	Apache-2.0

KokoClone

Description: KokoClone is a fast, real-time compatible multilingual voice cloning system built on top of Kokoro-ONNX. It enables users to type text in multiple languages, provide a short 3-10 second reference audio clip, and instantly generate speech in that same voice.

Release Date: 2025

Feature	Value
Parameters	82M (Base: Kokoro-ONNX)
Zero-shot Voice Cloning	✅ (3-10s reference)
ASR	❌
Pronunciation Control	❌
Emotion Control	✅
Languages	8 (En, Hi, Fr, Ja, Zh, It, Pt, Es)
Streaming	✅ (CPU real-time)
License	Apache-2.0

IndexTTS2

Description: AI-Enhanced Text-to-Speech System with Intelligent Optimization and self-learning capabilities.

Release Date: November 2025

Feature	Value
Parameters	-
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅ (5 emotions)
Languages	Chinese, English
Streaming	✅
Multi-speaker	✅ (1-4 speakers)
License	Apache-2.0

Maya1

Description: State-of-the-art speech model for expressive voice generation with natural language voice control.

Release Date: November 2025

Feature	Value
Parameters	3B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅ (Tags)
Languages	English (Multi-accent)
Streaming	✅ (<100ms)
License	Apache-2.0

LFM2-Audio-1.5B

Description: Liquid AI's first end-to-end audio foundation model with low latency and real-time conversation.

Release Date: November 28, 2025

Feature	Value
Parameters	1.5B
Zero-shot Voice Cloning	✅
ASR	✅ (Integrated)
Pronunciation Control	N/A
Emotion Control	✅
Languages	English
Streaming	✅
License	LFM Open License

Step-Audio-EditX

Description: 3B-parameter LLM-based RL audio model specialized in expressive and iterative audio editing.

Release Date: November 2025

Feature	Value
Parameters	3B (4B BF16)
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅ (Polyphone)
Emotion Control	✅ (14 emotions)
Languages	Mandarin, English, Sichuanese, Cantonese, Japanese, Korean
Streaming	✅
License	Apache-2.0

FireRedTTS2

Description: Long-form streaming TTS system for multi-speaker dialogue generation with stable, natural speech.

Release Date: September 2025

Feature	Value
Parameters	-
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅
Languages	EN, ZH, JP, KO, FR, DE, RU
Streaming	✅ (140ms)
Multi-speaker	✅ (4 speakers)
Max Duration	3 minutes
License	Apache-2.0

VoxCPM

Description: Tokenizer-free TTS system for context-aware speech generation and true-to-life voice cloning.

Release Date: September 16, 2025

Feature	Value
Parameters	640M-800M
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅
Languages	Chinese, English
Streaming	✅ (RTF 0.17)
License	Apache-2.0

LuxTTS

Description: Lightweight ZipVoice-based TTS model for high quality voice cloning at speeds exceeding 150x realtime.

Release Date: 2025

Feature	Value
Parameters	-
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	❌
Emotion Control	❌
Languages	-
Streaming	✅
Speed	150x realtime
VRAM	1GB
License	Apache-2.0

MegaTTS3

Description: Advanced zero-shot speech synthesis with Sparse Alignment Enhanced Latent Diffusion Transformer.

Release Date: March 22, 2025

Feature	Value
Parameters	0.45B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅
Languages	Chinese, English
Streaming	✅
License	Apache-2.0

Spark-TTS

Description: Efficient LLM-Based TTS Model with Single-Stream Decoupled Speech Tokens, built on Qwen2.5.

Release Date: March 2025

Feature	Value
Parameters	0.5B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅
Languages	Chinese, English
Streaming	✅
License	Apache-2.0

Fish Speech

Description: State-of-the-art open source TTS and voice cloning model that generates natural, realistic, and emotionally rich speech.

Release Date: May 31, 2025 (v1.5.1)

Feature	Value
Parameters	4B (S1), 0.5B (S1-mini)
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅
Languages	8 (EN, JP, KO, ZH, FR, DE, AR, ES)
Streaming	✅
RTF	~1:7
License	Apache-2.0

Step-Audio

Description: Production-ready open-source framework for intelligent speech interaction with unified speech comprehension and generation.

Release Date: February 17, 2025

Feature	Value
Parameters	130B (Chat), 3B (TTS)
Zero-shot Voice Cloning	✅
ASR	✅
Pronunciation Control	✅
Emotion Control	✅
Languages	Chinese, English, Japanese
Streaming	✅
License	Apache-2.0

Audio Flamingo 3 (AF3) / Audio Flamingo Next

Description: NVIDIA ADLR's fully open-source Large Audio Language Model with state-of-the-art audio understanding. Audio Flamingo Next (AF-Next) is the latest generation featuring stronger general audio understanding, longer context support, and timestamp-grounded reasoning.

Release Date: July 2025 (AF3), 2026 (AF-Next)

Feature	Value
Parameters	7B
Zero-shot Voice Cloning	❌
ASR	✅
Pronunciation Control	N/A
Emotion Control	✅
Languages	Multi-lingual
Streaming	✅
Context	Up to 30 minutes
License	Apache-2.0

Key Innovation (AF-Next): Staged curriculum training with GRPO-based RL post-training. Three specialized checkpoints: Instruct, Think (reasoning), and Captioner. Temporal Audio Chain-of-Thought grounding intermediate reasoning to timestamps.

SoulX-Podcast

Description: SOTA Multi-Speaker TTS model for generating realistic long-form podcasts with dialectal diversity.

Release Date: 2025

Feature	Value
Parameters	-
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅
Languages	Mandarin, English, Cantonese, Sichuanese, Henanese
Streaming	✅
Max Duration	90+ minutes
License	Apache-2.0

Chatterbox

Description: Family of SOTA open-source TTS models by Resemble AI with zero-shot voice cloning and multilingual synthesis.

Release Date: June 13, 2025 (v0.1.2)

Feature	Value
Parameters	350M-500M
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅ (Tags)
Languages	23+
Streaming	✅
License	MIT

Orpheus-TTS

Description: SOTA open-source TTS built on Llama-3b backbone demonstrating emergent capabilities of LLMs for speech synthesis.

Release Date: April 2025

Feature	Value
Parameters	3B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅
Languages	Multilingual
Streaming	✅ (200ms)
License	Apache-2.0

Dia

Description: 1.6B parameter TTS model by Nari Labs for generating ultra-realistic dialogue in one pass.

Release Date: June 27, 2024

Feature	Value
Parameters	1.6B
Zero-shot Voice Cloning	✅
ASR	❌
Pronunciation Control	✅
Emotion Control	✅
Languages	English
Streaming	✅
License	Apache-2.0

VieNeu-TTS

Description: Advanced on-device Vietnamese TTS model with instant voice cloning from 3-5 seconds of reference audio.

Release Date: 2025

Feature	Value
Parameters	0.3B-0.6B
Zero-shot Voice Cloning	✅ (3-5s)
ASR	❌
Pronunciation Control	✅
Emotion Control	❌
Languages	Vietnamese
Streaming	✅ (On-device)
License	Apache-2.0

MiMo-Audio

Description: Audio Language Model by Xiaomi functioning as a Few-Shot Learner with SOTA audio understanding.

Release Date: 2025

Feature	Value
Parameters	7B
Zero-shot Voice Cloning	✅
ASR	✅
Pronunciation Control	N/A
Emotion Control	✅
Languages	Multi-lingual
Streaming	✅
License	Apache-2.0

Kimi-Audio

Description: Open-source audio foundation model by Moonshot AI for audio understanding, generation, and conversation.

Release Date: 2024

Feature	Value
Parameters	7B
Zero-shot Voice Cloning	✅
ASR	✅
Pronunciation Control	N/A
Emotion Control	✅
Languages	Multi-lingual
Streaming	✅
License	MIT/Apache-2.0

ZipVoice

Description: Fast and high-quality zero-shot TTS models based on flow matching.

Release Date: June 16, 2025

Feature	Value
Parameters	123M
Zero-shot Voice Cloning	✅
Languages	Chinese, English
Dialogue	✅
License	Apache-2.0

Music Generation Models

Music Quick Comparison

Model	Music Gen	Languages	Streaming	License
ACE-Step 1.5	✅	50+	✅	MIT
LeVo 2	✅	Zh/En	❌	Apache-2.0
Foundation-1	✅ (Samples)	-	❌	Stability AI
Music Flamingo	❌	-	-	Apache-2.0
Magenta Realtime	✅	-	✅	Apache-2.0/CC-BY-4.0
Uni-MoE (Audio)	✅	-	✅	Apache-2.0

ACE-Step 1.5

Description: A local music generation model that its authors describe as outperforming most commercial alternatives. Supports Mac, AMD, Intel, and CUDA devices.

Release Date: February 20, 2026 (v0.1.2)

Feature	Value
Parameters	0.6B-4B (LM), DiT variants
Music Generation	✅
Lyrics Support	✅ (50+ languages)
Voice2BGM	✅
Reference Audio	✅
Track Separation	✅
Duration	10s - 10min
VRAM	<4GB
Platforms	CUDA, MPS, ROCm, XPU, CPU
License	MIT

LeVo 2 (SongGeneration 2)

Description: Open-source foundation model for commercial-grade music generation by Tencent AI Lab. It outperforms open-source baselines and rivals commercial systems in Overall Quality, Melody, Arrangement, Sound Quality, and Structure.

Release Date: 2025

Feature	Value
Architecture	Hybrid LLM-Diffusion
Music Generation	✅
Lyrics Support	✅ (Chinese, English)
Multilingual	✅ (Zh, En)
Text/Audio Prompts	✅
VRAM	12GB-22GB
License	Apache-2.0

Foundation-1

Description: Structured text-to-sample generation model for music production workflows. Generates tempo-synced, key-aware, bar-aware samples, with support for instrument identity, timbre control, and FX processing.

Release Date: 2025

Feature	Value
Type	Text-to-Sample (Music)
Base Model	stabilityai/stable-audio-open-1.0
Instrument Control	✅
Timbre Descriptors	✅ (Warm, Bright, etc.)
FX Tags	✅ (Reverb, Delay, etc.)
Musical Notation	✅ (Chord, Melody, Arp)
VRAM	~8GB
License	Stability AI Community License

Music Flamingo

Description: Large audio-language model designed to advance music (including song) understanding. Achieves SOTA on 10+ music benchmarks.

Release Date: 2025

Feature	Value
Parameters	-
Music Understanding	✅
Music Generation	❌
Rich Captions	✅
Music QA	✅
Reasoning	✅ (Chain-of-thought)
Long-form	✅
License	Apache-2.0

Magenta Realtime

Description: Open music generation model from Google DeepMind enabling continuous generation of musical audio steered by text prompts or audio examples.

Release Date: August 2025

Feature	Value
Parameters	-
Music Generation	✅ (Real-time)
Text-to-Music	✅
Audio-to-Music	✅
Reference Audio	✅
Continuous Generation	✅
Latency	Style prompt 2s+
Context	10 seconds
Training Data	~190k hours
License	Apache-2.0 (code), CC-BY-4.0 (model)

SoulX-Singer

(Already listed in TTS - singing voice synthesis)

Feature	Value
Parameters	-
Singing Generation	✅
Zero-shot	✅
Melody Control	✅ (F0/MIDI)
Languages	Mandarin, English, Cantonese
License	Apache-2.0

Uni-MoE (Audio)

Description: MoE-based omnimodal model with voice cloning, TTS, T2M (text-to-music), and V2M (video-to-music).

Release Date: October 16, 2025 (Uni-MoE-Audio)

Feature	Value
Parameters	-
Voice Cloning	✅
TTS	✅
Text-to-Music	✅
Video-to-Music	✅
Dynamic Routing	✅
License	Apache-2.0

Anything to Audio

Models that can generate audio from multiple input modalities (video, text, image, audio). These are unified frameworks for multimodal audio synthesis.

Anything to Audio Quick Comparison

Model	Text	Video	Image	Audio	License
Woosh	✅	✅	❌	✅	Apache-2.0
PrismAudio	❌	✅	❌	❌	Apache-2.0
ThinkSound	✅	✅	❌	✅	Apache-2.0
HunyuanVideo-Foley	✅	✅	❌	❌	Research Only
MMAudio	✅	✅	✅	❌	Apache-2.0
AudioX	✅	✅	✅	✅	Apache-2.0
Uni-MoE (Audio)	✅	✅	❌	✅	Apache-2.0

AudioX / Audio-Omni

Description: Audio-Omni is the first end-to-end framework unifying understanding, generation, and editing across general sound, music, and speech domains. Presented at SIGGRAPH 2026. AudioX is a unified framework integrating text, video, image, and audio conditions.

Release Date: March 2025 (AudioX), 2026 (Audio-Omni)

Feature	Value
Parameters	-
Text-to-Audio	✅
Text-to-Music	✅
Text-to-Speech	✅
Video-to-Audio/Music	✅
Audio Editing	✅ (Add/Remove/Extract/Style)
Voice Conversion	✅
License	Apache-2.0 / CC-BY-NC-4.0

Key Innovation: First unified framework covering all three audio domains. Combines frozen multimodal LLM (Qwen2.5-Omni) with trainable Diffusion Transformer for high-fidelity synthesis. Any-to-any audio processing.

MMAudio

Description: Multimodal joint training framework for high-quality synchronized audio generation from video and/or text inputs. State-of-the-art open source model for generating sounds for videos, images, and text prompts.

Release Date: December 2024 (CVPR 2025)

Feature	Value
Parameters	-
Video-to-Audio	✅
Text-to-Audio	✅
Image-to-Audio	✅
Synchronized Audio	✅
Multimodal Joint Training	✅
License	Apache-2.0

HunyuanVideo-Foley

Description: Tencent's end-to-end video sound effect generation model for professional-grade AI Foley sound generation. Analyzes footage and generates immersive audio that closely matches the visual content.

Release Date: 2025

Feature	Value
Parameters	-
Video-to-Audio (Foley)	✅
Text-to-Audio	✅
High-Quality Foley	✅
Context-Aware	✅
Output Quality	48 kHz
License	Research & Non-commercial only

ThinkSound

Description: Unified Any2Audio generation framework with flow matching guided by Chain-of-Thought (CoT) reasoning. Supports generating or editing audio from video, text, audio, or their combinations. Accepted to NeurIPS 2025.

Release Date: 2025

Feature	Value
Parameters	-
Video-to-Audio	✅ (SOTA)
Text-to-Audio	✅
Audio-to-Audio	✅
Audio Editing	✅
CoT-Driven Reasoning	✅
Interactive Object-centric Editing	✅
License	Apache-2.0 (Research only)

Woosh

Description: Sony AI's sound effect foundation model for text-to-audio and video-to-audio generation. Includes Woosh-AE (audio encoder/decoder), Woosh-Flow/DFlow (T2A), and Woosh-VFlow/DVFlow (V2A) with distilled fast inference variants.

Release Date: 2026

Feature	Value
Text-to-Audio	✅
Video-to-Audio	✅
Audio Encoding	✅
Fast Inference	✅ (Distilled models)
Architecture	Flow-based generative models
License	Apache-2.0

Key Innovation: Optimized for sound effects (not general audio) with both public and private model versions. Video-conditioned generation without requiring captions. Competitive with Stable Audio Open and TangoFlux.

PrismAudio

Description: Video-to-Audio generation framework with Reinforcement Learning and specialized Chain-of-Thought (CoT) planning. Decomposes reasoning into four specialized modules (Semantic, Temporal, Aesthetic, Spatial CoT) for comprehensive video understanding. Built upon ThinkSound.

Release Date: 2025 (ICLR 2026)

Feature	Value
Parameters	518M
Video-to-Audio	✅
CoT Planning	✅ (4 modules)
Multi-Dimensional RL	✅
Fast-GRPO	✅ (Hybrid ODE-SDE)
Inference Time	0.63 seconds
License	Apache-2.0

Performance Benchmarks:

Metric	VGGSound	AudioCanvas
Semantic (CLAP)	0.47	0.52
Temporal (DeSync↓)	0.41	0.36
Aesthetic (MOS-Q)	4.21±0.35	4.12±0.28

Uni-MoE (Audio)

Description: MoE-based omnimodal model with voice cloning, TTS, T2M (text-to-music), and V2M (video-to-music).

Release Date: October 16, 2025 (Uni-MoE-Audio)

Feature	Value
Parameters	-
Voice Cloning	✅
TTS	✅
Text-to-Music	✅
Video-to-Music	✅
Dynamic Routing	✅
License	Apache-2.0

Audio Restoration & Enhancement

Audio Restoration & Enhancement Quick Comparison

Model	Type	Bandwidth Extension	Inpainting	License
NVIDIA A2SB	Restoration	✅	✅	NVIDIA Non-Commercial
NovaSR	Enhancement	✅	❌	Apache-2.0
AudioSR	Enhancement	✅	❌	MIT

NVIDIA A2SB (Audio-to-Audio Schrodinger Bridges)

Description: Diffusion-based audio restoration model tailored for high-resolution music at 44.1kHz. An end-to-end, vocoder-free, multi-task model capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Can restore hour-long audio inputs without boundary artifacts.

Release Date: January 2025

Feature	Value
Architecture	End-to-end vocoder-free
Bandwidth Extension	✅
Audio Inpainting	✅
High-Resolution	✅ (44.1kHz)
Long Audio	✅ (hour-long)
Streaming	❌
License	NVIDIA OneWay NonCommercial License

NovaSR

Description: Lightning-fast audio upsampler: a roughly 50 kB model that upscales 16 kHz audio to 48 kHz at roughly 3500x realtime.

Release Date: 2025

Feature	Value
Size	52kB
Speed	3600x realtime (A100)
Input	16kHz
Output	48kHz
VRAM	Minimal
License	Apache-2.0

AudioSR

Description: Audio super resolution model using latent diffusion to upscale low-quality audio to 48kHz.

Release Date: February 12, 2026 (v1.1.1)

Feature	Value
Input	8kHz-48kHz
Output	48kHz
VRAM	6GB min
Stereo	✅
Long Audio	✅
License	MIT

Speech Recognition (ASR)

ASR Quick Comparison

Model	Languages	Streaming	License
Cohere Transcribe	14	✅	Apache-2.0
VibeVoice-ASR	50+	✅	MIT
FunASR	50+	✅	MIT

Cohere Transcribe

Description: Open-source automatic speech recognition (ASR) model developed by Cohere. A 2 billion parameter dedicated audio-in, text-out model that ranks #1 on the English ASR leaderboard.

Release Date: March 2026

Feature	Value
Parameters	2B
Architecture	Conformer-based encoder-decoder
ASR	✅
Languages	14 (En, Fr, De, It, Es, Pt, Gr, Nl, Pl, Zh, Jp, Ko, Vi, Ar)
Streaming	✅
RTFx	Up to 3x faster than comparable models
License	Apache-2.0

Key Features:

Long-form transcription with automatic chunking (>35 seconds)
Optional punctuation control
Batched inference support
vLLM integration for production serving
Apple Silicon support via mlx-audio
WebGPU browser deployment via transformers.js
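
The long-form chunking idea can be sketched in a few lines. This is a hypothetical illustration, not Cohere's implementation: only the 35-second limit comes from the list above, while the fixed-length splitting strategy, the `overlap` parameter, and the function name `chunk_spans` are assumptions. A production pipeline would more likely cut at silences.

```python
def chunk_spans(total_seconds, max_chunk=35.0, overlap=0.0):
    """Split [0, total_seconds] into (start, end) spans no longer than max_chunk.

    Consecutive spans share `overlap` seconds so words cut at a boundary
    appear whole in at least one chunk.
    """
    if max_chunk <= 0 or not 0 <= overlap < max_chunk:
        raise ValueError("need max_chunk > 0 and 0 <= overlap < max_chunk")

    spans = []
    start = 0.0
    step = max_chunk - overlap  # how far each new chunk advances
    while start < total_seconds:
        end = min(start + max_chunk, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the final span already reaches the end of the audio
        start += step
    return spans


print(chunk_spans(80))             # three spans, the last one shorter
print(chunk_spans(80, overlap=5))  # spans advance 30 s at a time
```

Audio at or under the limit comes back as a single span, so the same code path serves short and long inputs.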

VibeVoice-ASR

Description: Microsoft's unified speech-to-text model for 60-minute long-form audio processing with speaker diarization and timestamping.

Release Date: January 21, 2026

Feature	Value
Parameters	7B
ASR	✅
Languages	50+
Streaming	✅
License	MIT

FunASR

Description: Fundamental end-to-end speech recognition toolkit with SOTA pretrained models.

Release Date: Ongoing (First: 2023)

Feature	Value
ASR	✅
VAD	✅
Punctuation	✅
Speaker Diarization	✅
Multi-talker ASR	✅
Timestamp	✅
Emotion Recognition	✅
Languages	50+
License	MIT/Model License

Additional Resources

ComfyUI Integrations

Leaderboard

Resource	Description	Link
Open ASR Leaderboard	Hugging Face leaderboard for comparing ASR model performance across languages and metrics.

Contributing

This list is continuously evolving. If you have any models to add or updates to suggest, please feel free to contribute!

Last Updated: March 2026
