
ViiTorVoice-NAR


ViiTorVoice is a non-autoregressive speech generation system for voice cloning and local speech editing. The current deployment path uses split gRPC v2 services, with an HTTP gateway for end-to-end calls.

Core capabilities:

  • Voice cloning: provide prompt audio or a prompt audio codebook, and synthesize speech for the target text.
  • Local editing: provide source audio, original text, and edited text; the system locates the changed region and resynthesizes only the local segment.
  • Emotion and paralinguistic control: insert emotion and paralinguistic tags into the text condition, then strengthen their effect with classifier-free guidance (CFG).
  • Low-latency inference: supports first-block inference, with an end-to-end first-frame latency of around 60 ms.

For model architecture, features, and technical details, see Technical Notes.

Inference Environment Setup

Run the initialization script from the repository root:

bash init_env.sh

The script creates .venv and installs the dependencies required for inference. Service startup uses this virtual environment by default.

Model Download

Download the model files into local_models/ under the repository root. Do not use symlinks; the model files must physically exist inside the local_models/ directory.

Model page:

https://huggingface.co/ZzWater/ViiTorVoice-NAR

mkdir -p local_models
huggingface-cli download ZzWater/ViiTorVoice-NAR \
  --local-dir local_models \
  --local-dir-use-symlinks False

If you use another download tool, keep the same rule: place the downloaded files under local_models/, and do not use symlinks.
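
As a quick sanity check for the no-symlink rule, the sketch below walks local_models/ and reports any symlinks it finds. The helper name find_symlinks is hypothetical and not part of the repository; it only uses the Python standard library.

```python
import os
from pathlib import Path


def find_symlinks(model_dir):
    """Return every symlink (file or directory) found under model_dir."""
    found = []
    # followlinks=False: symlinked directories are listed but not descended into.
    for dirpath, dirnames, filenames in os.walk(model_dir, followlinks=False):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                found.append(path)
    return sorted(found)


if __name__ == "__main__":
    links = find_symlinks("local_models")
    if links:
        print("Symlinks found; replace them with real files:")
        for link in links:
            print(f"  {link}")
    else:
        print("No symlinks under local_models/")
```

Running it from the repository root after the download should print the "No symlinks" message; a missing local_models/ directory also yields an empty result, so check that the directory exists first.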

Service Startup And Management

Services are managed by run_grpc_v2.sh. The default path is all-in-one startup: the all target starts the encoder, llm, decoder, orchestrator, and http services.

./run_grpc_v2.sh start all
./run_grpc_v2.sh status all
./run_grpc_v2.sh logs orchestrator
./run_grpc_v2.sh stop all
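
If the services need to be driven from Python (for example in a test harness), a thin wrapper around the script may be enough. This is a minimal sketch: the function names are hypothetical, and it assumes the individual service names listed above are accepted as targets in the same way that all and orchestrator are in the commands shown.

```python
import subprocess

# Actions and targets taken from the commands and service list above.
ACTIONS = ("start", "status", "logs", "stop")
SERVICES = ("encoder", "llm", "decoder", "orchestrator", "http", "all")


def service_command(action, service="all", script="./run_grpc_v2.sh"):
    """Build the argv list for one run_grpc_v2.sh invocation."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service!r}")
    return [script, action, service]


def run_service_command(action, service="all"):
    """Run the script from the repository root; raises on a non-zero exit."""
    return subprocess.run(service_command(action, service), check=True)
```

Keeping command construction separate from execution makes the wrapper easy to test without starting any service.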

The HTTP service listens on 0.0.0.0:7861 by default. Local access uses http://127.0.0.1:7861. For other ports, model paths, GPU settings, log directories, and environment variables, see viitorvoice/grpc_server/deploy.env.

HTTP Inference Examples

Default local HTTP endpoint:

BASE_URL="http://127.0.0.1:7861"

Health Check

curl "$BASE_URL/health"
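
Model loading can take a while after start all, so a client may want to poll the health endpoint before sending requests. The sketch below assumes that a healthy service answers GET /health with HTTP 200; the README does not describe the response body, so the code does not inspect it. Both function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_until_healthy(base_url, probe=http_ok, attempts=30, delay=1.0):
    """Poll <base_url>/health until the probe succeeds or attempts run out."""
    url = base_url.rstrip("/") + "/health"
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A typical call would be wait_until_healthy("http://127.0.0.1:7861"); the probe argument exists so the polling logic can be exercised without a running server.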

Voice Cloning

To clone a voice without a reference transcript, omit ref_text and set allow_missing_ref_text=true:

curl -X POST "$BASE_URL/v1/voice-clone" \
  -F 'ref_audio=@prompt.wav' \
  -F 'text=今天天气不错,我们下午一起去公园散步吧。' \
  -F 'language=zh' \
  -F 'allow_missing_ref_text=true' \
  --output clone_no_ref_text.wav
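
The same request can be issued from Python. The following is a rough equivalent of the curl call, not the project's official client: it assumes the third-party requests package is installed and that the response body is the raw audio, as the --output flag suggests. The helper names are hypothetical; the form field names come from the example above.

```python
def build_clone_fields(text, language, ref_text=None):
    """Build the non-file form fields for /v1/voice-clone."""
    fields = {"text": text, "language": language}
    if ref_text is None:
        # No reference transcript: mirror the curl example above.
        fields["allow_missing_ref_text"] = "true"
    else:
        fields["ref_text"] = ref_text
    return fields


def clone_voice(base_url, ref_audio_path, out_path, text, language, ref_text=None):
    """POST the prompt audio and text, then write the returned audio to out_path."""
    import requests  # third-party; imported lazily so the builder works without it

    with open(ref_audio_path, "rb") as audio:
        resp = requests.post(
            f"{base_url}/v1/voice-clone",
            data=build_clone_fields(text, language, ref_text),
            files={"ref_audio": audio},
            timeout=120,
        )
    resp.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(resp.content)
```

For example, clone_voice("http://127.0.0.1:7861", "prompt.wav", "clone_no_ref_text.wav", text="今天天气不错,我们下午一起去公园散步吧。", language="zh") corresponds to the command above.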

Emotion And Paralinguistic Control

After adding emotion or paralinguistic tags to the text, use CFG parameters to strengthen the control effect:

curl -X POST "$BASE_URL/v1/voice-clone" \
  -F 'ref_audio=@prompt.wav' \
  -F 'text=<|emotion-happy|>I finally finished the project, and I feel really happy.' \
  -F 'language=en' \
  -F 'emotion_guidance_scale=6.0' \
  -F 'nvv_guidance_scale=2.0' \
  --output clone_emotion.wav

The available tag set depends on the training data and model configuration. If no corresponding tag is present, the related CFG parameters do not take effect.
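
Because a guidance scale without a matching tag has no effect, a client can check the text before attaching the parameter. The sketch below recognizes only the <|emotion-NAME|> form seen in the example; the full tag grammar, including the paralinguistic tags, is not documented here, so the regular expression and the helper names are assumptions.

```python
import re

# Matches tags shaped like <|emotion-happy|>, as in the example above.
EMOTION_TAG = re.compile(r"<\|emotion-([A-Za-z0-9_]+)\|>")


def emotion_tags(text):
    """Return the emotion names found in text, in order of appearance."""
    return EMOTION_TAG.findall(text)


def emotion_fields(text, language, emotion_guidance_scale):
    """Build form fields, attaching the emotion CFG scale only if a tag is present."""
    fields = {"text": text, "language": language}
    if emotion_tags(text):
        fields["emotion_guidance_scale"] = str(emotion_guidance_scale)
    return fields
```

Dropping the scale when no tag is found does not change the server's behavior; it only makes the request reflect what will actually be applied.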

Local Editing

Upload source audio, original text, and the complete edited text:

curl -X POST "$BASE_URL/v1/text-local-edit" \
  -F 'source_audio=@source.wav' \
  -F 'original_text=Please send the meeting notes before Friday.' \
  -F 'edited_text=Please send the meeting notes before Monday.' \
  -F 'language=en' \
  -F 'align_granularity=word' \
  -F 'expand_mask_ratio=1.5' \
  -F 'output_format=wav' \
  --output edited.wav
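
The service locates the changed region itself, so clients do not need to compute it. To illustrate what a word-level comparison of the two texts looks like, here is a sketch based on Python's difflib. It is not the project's alignment code, and changed_word_span is a hypothetical helper.

```python
import difflib


def changed_word_span(original, edited):
    """Return (start, end) word indices in original covering all edits, or None.

    The span is half-open; a pure insertion yields start == end.
    """
    a, b = original.split(), edited.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    ops = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not ops:
        return None
    # Opcodes are ordered, so the first and last non-equal ops bound the region.
    return ops[0][1], ops[-1][2]


span = changed_word_span(
    "Please send the meeting notes before Friday.",
    "Please send the meeting notes before Monday.",
)
print(span)  # (6, 7): only the seventh word differs
```

In the real system the text span still has to be mapped onto audio frames, which this sketch does not attempt.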

For more HTTP parameters, codebook input, base64 input, and Python examples, see HTTP API Usage. To call the orchestrator gRPC service directly, see gRPC API Usage.

Acknowledgements

The model architecture and training ideas in this project are inspired by:
