ViiTorVoice-NAR
ViiTorVoice is a non-autoregressive speech generation system for voice cloning and local speech editing. The current deployment path uses split gRPC v2 services, with an HTTP gateway for end-to-end calls.
Core capabilities:
- Voice cloning: provide prompt audio or a prompt audio codebook, and synthesize speech for the target text.
- Local editing: provide source audio, original text, and edited text; the system locates the changed region and resynthesizes only the local segment.
- Emotion and paralinguistic control: insert emotion tags and paralinguistic information into text conditions, then enhance them with CFG.
- Low-latency inference: supports first block inference, with end-to-end first-frame latency around 60 ms.
For model architecture, features, and technical details, see Technical Notes.
Inference Environment Setup
Run the initialization script from the repository root:
bash init_env.sh
The script creates .venv and installs the dependencies required for inference. Service startup uses this virtual environment by default.
Model Download
Download the model files into local_models/ under the repository root. Do not use symlinks; the model files must physically exist inside local_models/.
Model page:
https://huggingface.co/ZzWater/ViiTorVoice-NAR
mkdir -p local_models
huggingface-cli download ZzWater/ViiTorVoice-NAR \
  --local-dir local_models \
  --local-dir-use-symlinks False
If you use another download tool, keep the same rule: place the downloaded files under local_models/, and do not use symlinks.
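Whichever download tool you use, the no-symlink rule can be checked before starting the services. The following is a minimal sketch, not part of the repository: the helper name find_symlinks is hypothetical, and it only reports symbolic links under a directory without checking which model files are expected.

```python
import os
from pathlib import Path


def find_symlinks(root):
    """Return the sorted paths under `root` that are symbolic links."""
    links = []
    # os.walk does not follow symlinked directories by default, so a linked
    # directory is reported once (via dirnames) and never descended into.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                links.append(path)
    return sorted(links)


def check_model_dir(model_dir="local_models"):
    """Raise if `model_dir` is missing or contains any symlink."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"{model_dir}/ does not exist; download the model first")
    links = find_symlinks(model_dir)
    if links:
        listing = ", ".join(str(p) for p in links)
        raise RuntimeError(f"symlinks found under {model_dir}/: {listing}")
```

Calling check_model_dir() from the repository root raises an error that names the offending paths, and returns silently when every entry is a regular file or directory.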
Service Startup And Management
Services are managed by run_grpc_v2.sh. The default path is all-in-one startup: the all target starts the encoder, llm, decoder, orchestrator, and http services.
./run_grpc_v2.sh start all
./run_grpc_v2.sh status all
./run_grpc_v2.sh logs orchestrator
./run_grpc_v2.sh stop all
The HTTP service listens on 0.0.0.0:7861 by default. Local access uses http://127.0.0.1:7861. For other ports, model paths, GPU settings, log directories, and environment variables, see viitorvoice/grpc_server/deploy.env.
HTTP Inference Examples
Default local HTTP endpoint:
BASE_URL="http://127.0.0.1:7861"
Health Check
curl "$BASE_URL/health"
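Model loading can take a while after startup, so a script may want to poll the health endpoint before sending requests. The sketch below uses only the Python standard library. It assumes that an HTTP 200 from /health means the gateway is ready; the response body format is not documented here, so it is not inspected. The function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200 (assumed to mean ready)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(base_url, attempts=30, delay=2.0, probe=is_healthy, sleep=time.sleep):
    """Poll the health endpoint; return True as soon as one probe succeeds."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For example, wait_until_healthy("http://127.0.0.1:7861") returns False if the service has not answered after 30 attempts.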
Voice Cloning
To clone without reference text (no-ref-text cloning), omit ref_text and set allow_missing_ref_text=true:
curl -X POST "$BASE_URL/v1/voice-clone" \
  -F 'ref_audio=@prompt.wav' \
  -F 'text=今天天气不错,我们下午一起去公园散步吧。' \
  -F 'language=zh' \
  -F 'allow_missing_ref_text=true' \
  --output clone_no_ref_text.wav
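The same request can be built from Python. This is a hedged sketch using only the standard library; it is not the project's own Python example (see HTTP API Usage for that). The form field names come from the curl command above, while encode_multipart and build_clone_request are hypothetical helpers. The functions build the request without sending it.

```python
import uuid
import urllib.request


def encode_multipart(fields, files):
    """Encode text fields and files as multipart/form-data.

    fields: dict of name -> str
    files:  dict of name -> (filename, bytes)
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(header.encode("utf-8") + value.encode("utf-8") + b"\r\n")
    for name, (filename, data) in files.items():
        header = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f'Content-Type: application/octet-stream\r\n\r\n'
        )
        parts.append(header.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_clone_request(base_url, ref_audio_bytes, text, language):
    """Build (but do not send) a POST to /v1/voice-clone that omits ref_text."""
    body, content_type = encode_multipart(
        fields={"text": text, "language": language, "allow_missing_ref_text": "true"},
        files={"ref_audio": ("prompt.wav", ref_audio_bytes)},
    )
    return urllib.request.Request(
        f"{base_url}/v1/voice-clone",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

To send it, pass the request to urllib.request.urlopen and write the response bytes to a file. This assumes the response body is the audio itself, as the curl --output usage suggests.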
Emotion And Paralinguistic Control
After adding emotion or paralinguistic tags to the text, use CFG parameters to strengthen the control effect:
curl -X POST "$BASE_URL/v1/voice-clone" \
  -F 'ref_audio=@prompt.wav' \
  -F 'text=<|emotion-happy|>I finally finished the project, and I feel really happy.' \
  -F 'language=en' \
  -F 'emotion_guidance_scale=6.0' \
  -F 'nvv_guidance_scale=2.0' \
  --output clone_emotion.wav
The available tag set depends on the training data and model configuration. If no corresponding tag is present, the related CFG parameters do not take effect.
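Because the CFG parameters only matter when a tag is present, a client can attach them conditionally. The sketch below is illustrative. The tag syntax <|emotion-NAME|> is inferred from the single example above, the helper name emotion_form_fields is hypothetical, and the set of valid emotion names is not known here. The paralinguistic parameter nvv_guidance_scale is left out because no paralinguistic tag syntax is shown in this section.

```python
import re

# Tag syntax inferred from the one example in this section: <|emotion-NAME|>.
EMOTION_TAG = re.compile(r"<\|emotion-[^|>]+\|>")


def emotion_form_fields(text, language, emotion=None, emotion_guidance_scale=6.0):
    """Build form fields, adding the emotion CFG scale only when a tag is present."""
    if emotion is not None:
        # Prefix the text with the tag, as in the curl example.
        text = f"<|emotion-{emotion}|>{text}"
    fields = {"text": text, "language": language}
    if EMOTION_TAG.search(text):
        fields["emotion_guidance_scale"] = str(emotion_guidance_scale)
    return fields
```

The returned dictionary holds only the text fields of the form; the ref_audio file still has to be attached as in the voice cloning request.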
Local Editing
Upload source audio, original text, and the complete edited text:
curl -X POST "$BASE_URL/v1/text-local-edit" \
  -F 'source_audio=@source.wav' \
  -F 'original_text=Please send the meeting notes before Friday.' \
  -F 'edited_text=Please send the meeting notes before Monday.' \
  -F 'language=en' \
  -F 'align_granularity=word' \
  -F 'expand_mask_ratio=1.5' \
  -F 'output_format=wav' \
  --output edited.wav
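To get a feel for what "locating the changed region" means at word granularity, the text-side step can be approximated with a plain diff. This is only an illustration with Python's difflib and is not the service's actual alignment algorithm: it splits on whitespace (which would not suit Chinese text), and it ignores the audio alignment and the expand_mask_ratio expansion entirely.

```python
import difflib


def changed_word_spans(original_text, edited_text):
    """Return (tag, original_words, edited_words) for each non-matching region."""
    orig = original_text.split()
    edit = edited_text.split()
    matcher = difflib.SequenceMatcher(a=orig, b=edit, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is one of "replace", "delete", or "insert".
            spans.append((tag, orig[i1:i2], edit[j1:j2]))
    return spans


spans = changed_word_spans(
    "Please send the meeting notes before Friday.",
    "Please send the meeting notes before Monday.",
)
print(spans)  # [('replace', ['Friday.'], ['Monday.'])]
```

Only the final word differs, so a word-level edit of this sentence pair would touch a single short region of the audio.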
For more HTTP parameters, codebook input, base64 input, and Python examples, see HTTP API Usage. To call the orchestrator gRPC service directly, see gRPC API Usage.
Acknowledgements
The model architecture and training ideas in this project are inspired by: