# FunASR vLLM Inference Engine Guide

---

## Benchmark

**Test set**: 184 files, 11,541 seconds total. Models: Fun-ASR-Nano / GLM-ASR-Nano.

| Model | Engine | VAD | RTFx | CER | Notes |
|-------|--------|-----|------|-----|-------|
| Fun-ASR-Nano | PyTorch | dynamic | 21 | 8.06% | Baseline |
| Fun-ASR-Nano | **vLLM batch** | dynamic | **340** | **8.20%** | 16x speedup |
| Fun-ASR-Nano | **Offline service (no SPK)** | dynamic | **102** | 8.14% | |
| Fun-ASR-Nano | **Offline service (+SPK)** | dynamic | **46** | 8.19% | SPK off by default |
| GLM-ASR-Nano | **vLLM batch** | fixed | **265** | 12.93% | No long-audio support |

> For Fun-ASR-Nano, vLLM keeps CER within 0.2 percentage points of the PyTorch baseline (8.20% vs. 8.06%) while running about 16x faster (RTFx 340 vs. 21).

---

## Table of Contents

1. [Installation & Environment](#1-installation--environment)
2. [vLLM Engine Architecture](#2-vllm-engine-architecture)
3. [Offline SDK Inference](#3-offline-sdk-inference)
4. [Streaming SDK Inference](#4-streaming-sdk-inference)
5. [Offline Speech Recognition Service](#5-offline-speech-recognition-service)
6. [Streaming Speech Recognition Service](#6-streaming-speech-recognition-service)
7. [Dynamic VAD](#7-dynamic-vad)
8. [API Reference](#8-api-reference)
9. [FAQ](#9-faq)

---

## 1. Installation & Environment

```bash
pip install torch torchaudio
# Quote the requirement so the shell does not treat ">=" as a redirection.
pip install "funasr>=1.3.0"
# Install vLLM separately after choosing a version compatible with your NVIDIA driver, CUDA runtime, and PyTorch wheel.
pip install safetensors tiktoken websockets regex fastapi uvicorn python-multipart
cd /path/to/FunASR && pip install -e .
```

**Hardware**: GPU ≥ 8 GB VRAM, CUDA ≥ 11.8. 16 GB+ recommended.

Install a PyTorch/torchaudio/vLLM combination that matches your NVIDIA driver and CUDA runtime. Do not blindly keep the newest wheel if it was built for a newer CUDA runtime than your driver supports; PyTorch can fail during CUDA initialization with `The NVIDIA driver on your system is too old` before FunASR starts.
If that happens, reinstall compatible PyTorch, torchaudio, and vLLM wheels for the CUDA version reported by `nvidia-smi`, or update the NVIDIA driver first.

---

## 2. vLLM Engine Architecture

### Overall Architecture

FunASR's vLLM integration splits the ASR model into two independently running components:

```
┌──────────────────────────────────────────────────────────────────┐
│               FunASR + vLLM Inference Architecture               │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌───────────────── PyTorch (single GPU) ─────────────────┐      │
│  │                                                        │      │
│  │  Audio ──→ Frontend ──→ Audio Encoder ──→ Adaptor      │      │
│  │            (fbank)      (SenseVoice/      (Transformer/│      │
│  │                          Whisper)          MLP)        │      │
│  │                                              │         │      │
│  │                                              ▼         │      │
│  │                                      Audio Embeddings  │      │
│  │                                              │         │      │
│  │  Text Prompt ──→ Tokenize ───────────────→ Embed       │      │
│  │  (system/user/                               │         │      │
│  │   hotwords/language)                         │         │      │
│  │                                              ▼         │      │
│  │                                     [Concat Embeddings]│      │
│  └──────────────────────────────────────────────┼─────────┘      │
│                                                 │                │
│                                                 ▼ EmbedsPrompt   │
│  ┌───────────────────── vLLM Engine ──────────────────────┐      │
│  │                                                        │      │
│  │  PagedAttention + Continuous Batching                  │      │
│  │  KV Cache management + CUDA Graph                      │      │
│  │  Tensor Parallel (multi-GPU)                           │      │
│  │                                                        │      │
│  │  Qwen3-0.6B / Llama-2B (LLM decoding)                  │      │
│  │                                                        │      │
│  └──────────────────────────────────────────────┬─────────┘      │
│                                                 │                │
│                                                 ▼                │
│                                          Generated Text          │
│                                                 │                │
│  ┌──────────────────────────────────────────────┼─────────┐      │
│  │  (Optional) CTC Decoder ──→ Forced Alignment           │      │
│  │                         ──→ Character-level timestamps │      │
│  └────────────────────────────────────────────────────────┘      │
└──────────────────────────────────────────────────────────────────┘
```

### Why vLLM?

| Feature | PyTorch generate() | vLLM |
|---------|-------------------|------|
| KV Cache management | Fixed allocation, wastes memory | PagedAttention, on-demand allocation |
| Batching | Manual padding required | Continuous Batching, automatic scheduling |
| CUDA optimization | None | CUDA Graph + operator fusion |
| Multi-GPU parallelism | Manual implementation | Tensor Parallel with one-line config |
| Throughput | RTFx ~20 | **RTFx 340+** |

### Supported Models

| Model | LLM component | Audio encoder | vLLM speedup |
|-------|--------------|---------------|-------------|
| **Fun-ASR-Nano** | Qwen3-0.6B | SenseVoice | ✓ 21.7x |
| **GLM-ASR-Nano** | Llama-2B | Whisper-like | ✓ 7.6x |
| LLMASR | Qwen/Vicuna | Whisper | ✓ |
| Paraformer | No LLM | - | ✗ Non-autoregressive |
| SenseVoice | No LLM | - | ✗ Encoder-decoder |

### Key Implementation Details

1. **Weight separation**: LLM weights are extracted from `model.pt` and converted to HuggingFace format for vLLM loading
2. **EmbedsPrompt**: Audio embeddings and text embeddings are concatenated and fed to vLLM as a single prompt embedding
3. **use_low_frame_rate**: Fun-ASR-Nano's adaptor output must be truncated to the correct token count via a formula (critical for consistency)
4. **Batch encode**: Multiple audio files pass through `extract_fbank` → `audio_encoder` → `audio_adaptor` in a single forward pass
5. **CTC timestamps**: Encoder output is retained; after text generation, forced alignment yields character-level timing

---

## 3. Offline SDK Inference

Best suited for large-scale audio transcription and offline batch processing. vLLM's batching capability provides the greatest advantage in this scenario.

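
To put the throughput figures in context for batch jobs, here is a small worked example. It assumes the usual convention that RTFx is audio duration divided by processing time, which is the inverse of the `rtf` field returned by the offline service. The helper names are illustrative and not part of FunASR.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds: float, rtfx_value: float) -> float:
    """Wall-clock time needed to transcribe `audio_seconds` at a given RTFx."""
    return audio_seconds / rtfx_value


# Benchmark test set from this guide: 11,541 s of audio.
TEST_SET_SECONDS = 11_541
pytorch_time = estimated_processing_seconds(TEST_SET_SECONDS, 21)   # roughly 550 s
vllm_time = estimated_processing_seconds(TEST_SET_SECONDS, 340)     # roughly 34 s
speedup = pytorch_time / vllm_time                                  # roughly 16x

print(f"PyTorch: {pytorch_time:.0f} s, vLLM: {vllm_time:.0f} s, speedup: {speedup:.1f}x")
```

The same arithmetic applies to any corpus: divide the total audio duration by the RTFx of the chosen engine to estimate wall-clock time, excluding model startup.
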
### Design Principles

Offline SDK inference splits the ASR pipeline into two stages executed independently:

```
┌─────────────────────────────────────────────────────────────────────┐
│ Stage 1: Audio Encoding (PyTorch, single GPU)                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│ Audio file list ──→ Group (batch of 8) ──→ Frontend (Fbank)         │
│      │                                            │                 │
│      │                                            ▼                 │
│      │                                   SenseVoice Encoder         │
│      │                                            │                 │
│      │                                            ▼                 │
│      │                                      Audio Adaptor           │
│      │                               (dim transform + LFR trunc)    │
│      │                                            │                 │
│      └─── Shared text prompt encoding ────┐       ▼                 │
│           (system/hotwords/language)      │ audio_embeds            │
│                        │                  │       │                 │
│                        ▼                  │       ▼                 │
│                   prefix_emb ──→ [concat: prefix | audio | suffix]  │
│                                                   │                 │
│                                                   ▼                 │
│                                       EmbedsPrompt (N samples)      │
└───────────────────────────────────────────────────┼─────────────────┘
                                                    │
                                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│ Stage 2: LLM Decoding (vLLM, multi-GPU Tensor Parallel)             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  EmbedsPrompt × N ──→ vLLM Continuous Batching                      │
│                       (PagedAttention + CUDA Graph)                 │
│                                   │                                 │
│                                   ▼                                 │
│                        Generated token_ids × N                      │
│                                   │                                 │
│                                   ▼                                 │
│            Decode + post-processing (strip special tokens)          │
│                                   │                                 │
│                                   ▼                                 │
│           (Optional) CTC Forced Alignment → char timestamps         │
└─────────────────────────────────────────────────────────────────────┘
```

**Key design decisions:**

1. **Weight separation**: On first run, weights with the `llm.*` prefix are extracted from `model.pt` and saved in HuggingFace safetensors format for vLLM (cached in the `Qwen3-0.6B-vllm/` directory)
2. **Embedding concatenation**: The text prompt is encoded through the LLM's `embed_tokens` layer into embeddings, then concatenated with the audio adaptor output along the sequence dimension: `[prefix_emb | audio_emb | suffix_emb]`, and submitted to vLLM as an `EmbedsPrompt`
3. **Low Frame Rate truncation**: Adaptor output must be truncated to the correct length using: `fake_token_len = ((((fbank_len - 3 + 2) // 2 - 3 + 2) // 2) - 1) // 2 + 1`, ensuring consistency with the PyTorch training pipeline
4. **Batch audio encoding**: Multiple audio files are grouped in batches of 8 through the encoder + adaptor forward pass, reducing GPU kernel launch overhead
5. **Shared text prompt**: When hotwords and language are identical within a batch, prefix_emb and suffix_emb are computed only once
6. **CTC timestamps**: Encoder output is preserved; after LLM text generation, forced alignment produces character-level timestamps

**Why is it faster than PyTorch `generate()`?**

| Dimension | PyTorch | vLLM |
|-----------|---------|------|
| KV Cache | Fixed pre-allocation (wastes memory) | PagedAttention on-demand allocation |
| Batching | Manual padding alignment | Continuous Batching auto-scheduling |
| CUDA | Sequential per-sample execution | CUDA Graph + operator fusion |
| Multi-GPU | Manual implementation | Tensor Parallel one-line config |
| Result | RTFx ~20 | **RTFx 340+** (16x speedup) |

### Universal Interface (Recommended)

```python
from funasr.auto.auto_model_vllm import AutoModelVLLM

model = AutoModelVLLM(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    hub="ms",                    # or "hf"
    tensor_parallel_size=2,      # multi-GPU parallel
    gpu_memory_utilization=0.8,
)

results = model.generate(
    ["audio1.wav", "audio2.wav"],
    language="中文",
    hotwords=["张三", "北京"],
)

for r in results:
    print(f"[{r['key']}] {r['text']}")
```

### Direct Interface

```python
from funasr.models.fun_asr_nano.inference_vllm import FunASRNanoVLLM

engine = FunASRNanoVLLM.from_pretrained(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    tensor_parallel_size=4,
)

results = engine.generate(
    inputs="wav.scp",            # supports scp/jsonl/file lists
    hotwords=["开放时间"],
    language="中文",
    max_new_tokens=512,
)
```

### Command Line

```bash
cd examples/industrial_data_pretraining/fun_asr_nano

# Single file
python demo_vllm.py --input audio.wav --language 中文

# Batch + multi-GPU
python demo_vllm.py --input wav.scp --tensor-parallel-size 4 --batch-size 32

# With hotwords + save results
python demo_vllm.py --input audio.wav --hotwords 张三 北京 --output results.jsonl
```

---

## 4. Streaming SDK Inference

Processes audio incrementally in 720 ms chunks, emitting recognition results that stabilize as more audio arrives. Suited to real-time subtitle scenarios integrated through the SDK.

### Design Principles

```
Audio stream (720 ms chunks)
           │  Cumulative re-encoding (each chunk covers all audio from the start)
           ▼
┌──────────────────────────┐
│ Stage 1: First 10 chunks │ ← No prev_text; batch generation
│ Identify stable output   │
└──────────┬───────────────┘
           ▼
┌──────────────────────────┐
│ Stage 2: Subsequent      │ ← Use stable output as prev_text
└──────────┬───────────────┘
           ▼
Each chunk: [fixed region (confirmed)] + [8-char unfixed (may change)]
```

### Usage

```python
from funasr.models.fun_asr_nano.inference_vllm_streaming import FunASRNanoStreamingVLLM

engine = FunASRNanoStreamingVLLM.from_pretrained(
    model="FunAudioLLM/Fun-ASR-Nano-2512",
    chunk_ms=720,
    rollback_chars=8,
)

for result in engine.streaming_generate("audio.wav", language="中文"):
    if result["is_final"]:
        print(f"Final: {result['text']}")
    else:
        print(f"[{result['audio_duration_ms']:.0f}ms] Confirmed: {result['fixed_text']}")
```

### Output Characteristics

| Accumulated audio | Output quality |
|-------------------|---------------|
| < 1.5 s | Empty or noise |
| 1.5–3.0 s | Partially correct |
| > 3.0 s | Accurate output |

> Note: `repetition_penalty=1.3` is hardcoded internally to prevent repetition degradation on short chunks.

---

## 5. Offline Speech Recognition Service

### 5.1 Service Architecture

```
Client                            serve_vllm.py
│                                       │
│── HTTP / OpenAI / WebSocket ─────────→│
│                                       │
│                                  ┌────┴────────────────────────┐
│                                  │ 1. Receive complete audio   │
│                                  │ 2. Dynamic VAD (≤60 s/seg)  │
│                                  │ 3. vLLM batch all segments  │
│                                  │ 4. CTC timestamps (per-char)│
│                                  │ 5. Speaker diarization (opt)│
│                                  └────┬────────────────────────┘
│                                       │
│←── JSON result ───────────────────────│
```

**Characteristics**:

- Processes audio only after it arrives in full, which makes it ideal for file transcription
- Dynamic VAD preserves long segments (≤60 s), reducing boundary-cut losses
- Batch inference over all VAD segments maximizes throughput
- Automatically outputs character-level timestamps
- Speaker diarization is off by default; clients can enable it

### 5.2 Starting the Service

```bash
CUDA_VISIBLE_DEVICES=0 python examples/industrial_data_pretraining/fun_asr_nano/serve_vllm.py \
  --port 8899 \
  --model FunAudioLLM/Fun-ASR-Nano-2512 \
  --gpu-memory-utilization 0.5
```

### 5.3 Protocol 1: HTTP REST (`POST /asr`)

The most feature-complete interface, supporting speaker diarization, timestamps, and hotwords.

**Request**: `multipart/form-data`

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `file` | file | required | Audio file (wav/mp3/flac) |
| `language` | string | None | Language ("中文" / "English" / ...), None for auto |
| `hotwords` | string | "" | Hotwords, comma-separated |
| `spk` | bool | false | Enable speaker diarization |
| `timestamp` | bool | true | Output character-level timestamps |

**Response**:

```json
{
  "text": "Full transcription text",
  "segments": [
    {
      "text": "Segment text",
      "start": 1.7,
      "end": 14.8,
      "speaker": "SPK0",
      "words": [
        {"word": "砸", "start": 2.02, "end": 2.08},
        {"word": "了", "start": 2.26, "end": 2.32}
      ]
    }
  ],
  "duration": 227.4,
  "processing_time": 3.422,
  "rtf": 0.015
}
```

**Client examples**:

```bash
# cURL
curl -X POST http://localhost:8899/asr \
  -F "file=@meeting.wav" -F "language=中文" -F "spk=true"
```

```python
# Python requests
import requests

# Open the file in a context manager so the handle is closed after the upload.
with open("audio.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8899/asr",
        files={"file": f},
        data={"language": "中文", "spk": "true"},
    )
result = resp.json()
```

```javascript
// JavaScript fetch
const form = new FormData();
form.append("file", audioBlob, "audio.wav");
form.append("language", "中文");
form.append("spk", "true");
const resp = await fetch("http://localhost:8899/asr", { method: "POST", body: form });
const result = await resp.json();
```

### 5.4 Protocol 2: OpenAI Whisper Compatible (`POST /v1/audio/transcriptions`)

Compatible with the OpenAI Whisper API standard; works directly with the OpenAI SDK.

**Request**: `multipart/form-data`

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `file` | file | required | Audio file |
| `model` | string | "fun-asr-nano" | Model name (compatibility field) |
| `language` | string | None | Language |
| `response_format` | string | "json" | "json" / "text" / "verbose_json" |
| `timestamp_granularities` | string | "word" | "word" / "segment" |
| `spk` | bool | false | Speaker diarization (FunASR extension) |

**Response** (`verbose_json`):

```json
{
  "task": "transcribe",
  "language": "zh",
  "duration": 5.17,
  "text": "我一直没有照顾孩子,但是我想要抚养权。",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 5.15,
      "text": "我一直没有照顾孩子,但是我想要抚养权。",
      "words": [{"word": "我", "start": 0.42, "end": 0.48}, ...]
    }
  ]
}
```

**Client examples**:

```python
# OpenAI SDK (recommended)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8899/v1", api_key="none")
with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="fun-asr-nano",
        file=f,
        response_format="verbose_json",
    )
print(result.text)
```

```bash
# cURL
curl -X POST http://localhost:8899/v1/audio/transcriptions \
  -F "file=@audio.wav" -F "model=fun-asr-nano" -F "response_format=verbose_json"
```

### 5.5 Protocol 3: WebSocket (`ws://host:port/ws`)

WebSocket interface for the offline service. Send complete audio, then receive results. Speaker clustering is performed automatically on STOP, and results include the `spk` field.

**Client → Server**:

| Message | Description |
|---------|-------------|
| `"START"` | Begin session |
| `"LANGUAGE:中文"` | Set language (optional) |
| `"HOTWORDS:word1,word2"` | Set hotwords (optional) |
| `[binary]` | PCM16 16 kHz mono audio data |
| `"STOP"` | End session; request recognition result |

**Server → Client**:

```json
{"event": "started"}
{"event": "language_set", "language": "中文"}
{"sentences": [{"text":"...","start":..,"end":..}], "is_final": true, "duration_ms": 5170}
{"event": "stopped"}
```

**Client example**:

```python
import asyncio, json
import soundfile as sf
import websockets

async def offline_ws(audio_path):
    # Read samples directly as 16-bit PCM; the file is assumed to be 16 kHz mono.
    pcm, sr = sf.read(audio_path, dtype="int16")

    async with websockets.connect("ws://localhost:8899/ws") as ws:
        await ws.send("START")
        await ws.recv()
        await ws.send("LANGUAGE:中文")
        await ws.recv()

        # Send complete audio
        await ws.send(pcm.tobytes())
        await ws.send("STOP")

        # Receive result
        async for msg in ws:
            data = json.loads(msg)
            if data.get("is_final"):
                for s in data["sentences"]:
                    print(f"[{s['start']/1000:.1f}s] {s['text']}")
                break

asyncio.run(offline_ws("audio.wav"))
```

---

## 6. Streaming Speech Recognition Service

### 6.1 Service Architecture

```
Client (microphone / audio stream)    serve_realtime_ws.py
│                                     │
│── WebSocket PCM16 16 kHz ──────────→│
│   (~100 ms per frame, continuous)   │
│                                     │
│                                ┌────┴─────────────────────────┐
│                                │ Real-time loop:              │
│                                │ ├─ Dynamic VAD (60 ms chunk) │
│                                │ ├─ Endpoint → vLLM decode    │
│                                │ ├─ No endpoint → partial     │
│                                │ └─ Streaming SPK assignment  │
│                                └────┬─────────────────────────┘
│                                     │
│←── JSON real-time push ─────────────│
```

**Characteristics**:

- Audio arrives frame by frame; processing starts immediately
- Natural sentence segmentation based on VAD endpoints
- Confirmed segment text is locked and never changes; partial text updates in real time
- Streaming speaker assignment + global re-clustering on STOP
- First-word latency ~480 ms

### 6.2 Starting the Service

```bash
CUDA_VISIBLE_DEVICES=0 python examples/industrial_data_pretraining/fun_asr_nano/serve_realtime_ws.py \
  --port 10095 --language 中文 --hotword-file hotword_list
```

### 6.3 WebSocket Protocol

**Connection**: `ws://host:10095`

**Client → Server**:

| Message | Format | Description |
|---------|--------|-------------|
| Start | `"START"` | Initialize session |
| Hotwords | `"HOTWORDS:word1,word2"` | Optional |
| Language | `"LANGUAGE:中文"` | Optional |
| Audio | `binary` | PCM16 16 kHz mono |
| End | `"STOP"` | Final decode + SPK re-clustering |

**Server → Client**:

```json
{"event": "started"}
{"sentences": [{"text":"你好","start":300,"end":1200,"spk":0}], "partial": "世界", "is_final": false}
{"sentences": [...], "is_final": true}
{"event": "stopped"}
```

**Fields**: `sentences[]` = locked segments, `partial` = text being spoken (may change), `is_final` = true after STOP.

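
The field semantics above can be turned into a small client-side reducer. The sketch below is an illustration only: `apply_message`, `display_text`, and the `state` dictionary are hypothetical names, and it assumes each non-event message carries the full current list of locked sentences, as the sample messages suggest.

```python
import json


def apply_message(state: dict, raw: str) -> dict:
    """Fold one server message (a JSON string) into the display state."""
    msg = json.loads(raw)
    if "event" in msg:
        # Control messages such as "started" / "stopped" carry no text.
        state["last_event"] = msg["event"]
        return state
    # Assumption: `sentences` is the complete list of locked segments so far.
    state["sentences"] = msg.get("sentences", [])
    # `partial` may be absent, for example in the final message.
    state["partial"] = msg.get("partial", "")
    state["is_final"] = bool(msg.get("is_final", False))
    return state


def display_text(state: dict) -> str:
    """Locked text followed by the still-changing partial text."""
    locked = "".join(s["text"] for s in state.get("sentences", []))
    return locked + state.get("partial", "")


state = {}
for raw in [
    '{"event": "started"}',
    '{"sentences": [{"text": "你好", "start": 300, "end": 1200, "spk": 0}],'
    ' "partial": "世界", "is_final": false}',
]:
    state = apply_message(state, raw)

print(display_text(state))
```

A UI would typically render the locked part in a stable style and the partial part in a lighter one, since only the latter can still change.
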
**Sequence diagram**:

```
Client                Server
│── START ──────────────→│
│←─ started ─────────────│
│── [audio] ────────────→│
│←─ {partial} ───────────│
│── [audio] ────────────→│
│←─ {sentences+partial} ─│  (VAD cut a sentence)
│── STOP ───────────────→│
│←─ {is_final:true} ─────│
│←─ stopped ─────────────│
```

### 6.4 Client Usage

**Python CLI**:

```bash
python client_python.py --server ws://localhost:10095 --mic
python client_python.py --server ws://localhost:10095 --file audio.wav
```

**Browser**: Open `client_mic.html`

**Custom Python**:

```python
import asyncio, json
import soundfile as sf
import websockets

async def stream(audio_path):
    # Read samples directly as 16-bit PCM; the file is assumed to be 16 kHz mono.
    pcm, sr = sf.read(audio_path, dtype="int16")
    async with websockets.connect("ws://localhost:10095") as ws:
        await ws.send("START")
        await ws.recv()
        # 1600 samples = 100 ms of audio at 16 kHz; a 50 ms sleep sends at about 2x real time.
        for i in range(0, len(pcm), 1600):
            await ws.send(pcm[i:i+1600].tobytes())
            await asyncio.sleep(0.05)
        await ws.send("STOP")
        # Skip queued partial updates and print the final result.
        async for msg in ws:
            data = json.loads(msg)
            if data.get("is_final"):
                for s in data["sentences"]:
                    print(f"[{s['start']/1000:.1f}s] {s['text']}")
                break

asyncio.run(stream("audio.wav"))
```

---

## 7. Dynamic VAD

fsmn-vad enables dynamic silence thresholds by default: the silence duration required to cut a segment shrinks as the current segment grows. Offline and streaming modes use different configurations.

| Accumulated duration | Offline silence threshold (preserve long segs ≤60 s) | Streaming silence threshold (balance latency) |
|---------------------|-----------------------------------|-----------------------------|
| ≤ 5 s | 2000 ms | 2000 ms |
| 5–10 s | 2000 ms | 1500 ms |
| 10–15 s | 1000 ms | 1000 ms |
| 15–20 s | 1000 ms | 800 ms |
| 20–30 s | 800 ms | 800 ms |
| 30–45 s | 600 ms | 400 ms |
| 45–60 s | 200–400 ms | 100 ms |
| > 60 s | 100 ms | 100 ms |

Offline mode favors longer segments to reduce boundary-cut losses; streaming mode tightens faster to reduce latency.

### Customization

```python
model.generate(input="audio.wav",
    silence_schedule=[(5000,1500), (20000,800), (float('inf'),300)])
```

> GLM-ASR does not support long-segment inference; pass `dynamic_silence=False` when using it.

---

## 8. API Reference

| Parameter | AutoModelVLLM | serve_vllm.py | serve_realtime_ws.py |
|-----------|--------------|---------------|---------------------|
| model | ✓ | --model | --model |
| gpu_memory_utilization | ✓ | --gpu-memory-utilization | --gpu-memory-utilization |
| tensor_parallel_size | ✓ | - | --tensor-parallel-size |
| max_model_len | ✓ | --max-model-len | --max-model-len |
| language | generate() param | API param | --language / LANGUAGE: |
| hotwords | generate() param | API param | --hotword-file / HOTWORDS: |

---

## 9. FAQ

**Q: Offline or streaming?**
Complete files → offline (high throughput). Microphone / live stream → streaming (low latency).

**Q: Can GLM-ASR use dynamic VAD?**
No. It does not support long-segment inference, so pass `dynamic_silence=False`.

**Q: Performance impact of SPK?**
RTFx drops from 102 to 46. CER is essentially unchanged (8.14% vs. 8.19%). SPK is disabled by default.

**Q: Entry points for custom development?**

- Offline: `serve_vllm.process_audio()` / `FunASRNanoVLLM.generate()`
- Streaming: `serve_realtime_ws.RealtimeASRSession`

**Q: Slow first startup?**
vLLM initialization takes 60–90 s (KV Cache + CUDA Graph warmup). This is a one-time cost per process; later requests reuse the initialized engine and skip the warmup.
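
Because of this startup delay, scripts that launch a service and immediately send requests may fail. Below is a minimal sketch of a readiness wait using only the standard library. `wait_for_port` is a hypothetical helper, and it assumes the server begins accepting TCP connections only once initialization has finished, which should be verified for your deployment.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 120.0,
                  interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry.
            time.sleep(interval_s)
    return False
```

For example, `wait_for_port("localhost", 8899)` could be called after starting `serve_vllm.py` with the port from section 5.2, before sending the first request.
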