VoxCPM Text-to-Speech
Synthesize speech from Chinese/English/multilingual text using OpenBMB's VoxCPM. Two model tiers, three input modes, thirty languages, nine Chinese dialects. Apache-2.0 licensed.
Verified 2026-07-04: voxcpm 2.0.3 is on PyPI and its dependency graph resolves cleanly (torch ≥ 2.5, transformers ≥ 4.36, gradio 6.x). PyPI is reachable. Model weights are hosted on Hugging Face (openbmb/VoxCPM2, openbmb/VoxCPM-0.5B).
This environment (QoderWork sandbox) has no GPU and cannot reach huggingface.co directly, so audio generation is not exercised here. To actually synthesize, deploy on a machine with CUDA or Apple Silicon that can reach huggingface.co or an HF mirror.
Prerequisites
| Component | Requirement |
|---|---|
| Python | ≥ 3.10, < 3.13 (officially tested) |
| PyTorch | ≥ 2.5.0 |
| CUDA | ≥ 12.0 (recommended); MPS or CPU also work (slower) |
| GPU VRAM | 5 GB for VoxCPM-0.5B, 8 GB for VoxCPM2 (2B) |
| Disk | ~2 GB per model (weights auto-download from HF) |
| Network | Access to huggingface.co (or set HF_ENDPOINT to a mirror) |
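The Python and PyTorch rows of the table can be checked before installing anything. The sketch below is one way to do that with the standard library only; the helper names `python_supported` and `torch_installed` are illustrative and not part of voxcpm.

```python
import importlib.util
import sys


def python_supported(version=None):
    """Return True for Python >= 3.10 and < 3.13, the range listed in the table."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 10) <= major_minor < (3, 13)


def torch_installed():
    """Report whether torch is importable, without actually importing it."""
    return importlib.util.find_spec("torch") is not None


if __name__ == "__main__":
    print("python ok:", python_supported())
    print("torch installed:", torch_installed())
```

If torch is installed, `torch.cuda.is_available()` is the follow-up check for the CUDA row.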
Install:
pip install voxcpm
For deployments in mainland China where Hugging Face is slow, point HF_ENDPOINT at a mirror, or install ModelScope and download from there instead:
export HF_ENDPOINT=https://hf-mirror.com
# or use ModelScope directly:
pip install modelscope
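The mirror setting can also be applied from Python. This is a minimal sketch, assuming the variable is set before anything imports huggingface_hub, which reads HF_ENDPOINT when it is first imported. `use_hf_mirror` is a hypothetical helper name.

```python
import os


def use_hf_mirror(endpoint="https://hf-mirror.com", env=None):
    """Set HF_ENDPOINT unless one is already chosen; return the value in effect."""
    env = os.environ if env is None else env
    return env.setdefault("HF_ENDPOINT", endpoint)


# Call this before `from voxcpm import VoxCPM` so the setting is picked up.
if __name__ == "__main__":
    print(use_hf_mirror())
```

Using `setdefault` means an endpoint exported in the shell still wins over the default in the script.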
Quick Start
CLI (simplest)
# Basic synthesis to WAV
voxcpm design --text "你好,欢迎收听三分钟 AI 快报。" --output greeting.wav
# With voice-style natural-language description
voxcpm design --text "(年轻女声,温柔甜美) 今天天气真好,我们来聊聊 AI。" --output warm.wav
# English
voxcpm design --text "Welcome to today's AI briefing." --output en.wav
# Batch
voxcpm batch --input-file lines.txt --output-dir out/
Python SDK (programmatic)
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)
wav = model.generate(
    text="你好,这是一段测试语音。",
    cfg_value=2.0,           # classifier-free guidance strength (higher = more expressive)
    inference_timesteps=10,  # diffusion steps (more = better quality, slower)
    seed=42,                 # reproducibility
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)  # 24 kHz output
Streaming synthesis (for long text)
for chunk in model.generate_streaming(text="很长的一段文本..."):
    # Each chunk is an np.ndarray of PCM samples; process() stands in for your own handler.
    process(chunk)
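If the chunks only need to end up in a file, they can be written as they arrive instead of being held in memory. The sketch below assumes each chunk is a sequence of float samples in [-1.0, 1.0] (an assumption; check the dtype your version returns) and converts them to 16-bit PCM with the standard-library wave module. `floats_to_pcm16` and `write_stream_to_wav` are illustrative names.

```python
import sys
import wave
from array import array


def floats_to_pcm16(samples):
    """Clip float samples to [-1, 1] and scale them to signed 16-bit integers."""
    return array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples))


def write_stream_to_wav(chunks, path, sample_rate):
    """Append each chunk to a mono 16-bit WAV file; return the frames written."""
    frames = 0
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)            # 2 bytes per sample = 16-bit
        out.setframerate(sample_rate)
        for chunk in chunks:
            pcm = floats_to_pcm16(chunk)
            if sys.byteorder == "big":
                pcm.byteswap()         # WAV data is little-endian
            out.writeframes(pcm.tobytes())
            frames += len(pcm)
    return frames
```

A call would look like `write_stream_to_wav(model.generate_streaming(text=...), "long.wav", model.tts_model.sample_rate)`.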
Input Modes
1. Design (natural-language voice control)
Wrap the voice description in parentheses at the start of the text:
"(年轻女声,语气甜美,稍快语速) 今天要给大家介绍一款有意思的 AI 工具。"
"(中年男声,稳重、专业) 欢迎收听本期节目。"
"(带四川口音的女生,轻松) 这个咋回事嘛!"
Available cues (composable):
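- Voice type: 男声 (male) / 女声 (female) / 童声 (child) / 老年男声 (elderly male) / 中年女声 (middle-aged female) ...
- Tone: 温柔 (gentle) / 严肃 (serious) / 兴奋 (excited) / 平静 (calm) / 忧郁 (melancholy) / 慵懒 (languid) ...
- Pace: 快语速 (fast) / 慢语速 (slow) / 稍快 (slightly faster) / 稍慢 (slightly slower)
- Style: 播音腔 (broadcast announcer) / 口语化 (conversational) / 说书人 (storyteller) / 儿童剧 (children's drama) / 广告腔 (commercial voice-over)
- Dialect: 四川话 (Sichuanese) / 粤语 (Cantonese) / 吴语 (Wu) / 东北话 (Northeastern Mandarin) / 河南话 (Henan) / 陕西话 (Shaanxi) / 山东话 (Shandong) / 天津话 (Tianjin) / 闽南话 (Hokkien)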
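Because the cues are composable, the prefix can be assembled from a list rather than typed by hand. A small sketch, assuming cues are joined with a full-width comma as in the examples above; `design_text` is a hypothetical helper, not a voxcpm function.

```python
def design_text(text, *cues):
    """Prefix text with a parenthesised voice description built from the cues."""
    cleaned = [c.strip() for c in cues if c and c.strip()]
    if not cleaned:
        return text  # no cues: leave the text untouched
    return f"({','.join(cleaned)}) {text}"


print(design_text("欢迎收听本期节目。", "中年男声", "稳重", "播音腔"))
```

The returned string is what would be passed as `text` to the model or to `--text` on the CLI.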
2. Voice cloning (zero-shot)
Provide a 3-30 second reference clip plus its exact transcript:
wav = model.generate(
    text="要合成的新文本。",  # the new text to synthesize
    prompt_wav_path="reference.wav",
    prompt_text="参考音频对应的原文(准确抄写)",  # verbatim transcript of reference.wav
)
The generated voice matches the reference speaker.
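The 3-30 second window is easy to check before calling the model. The sketch below uses the standard-library wave module, so it assumes the reference is an uncompressed PCM WAV file. The function names are illustrative.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def reference_length_ok(path, min_s=3.0, max_s=30.0):
    """True when the clip length falls inside the 3-30 s window given above."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

A clip that fails this check can be trimmed or re-recorded before it is passed as `prompt_wav_path`.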
3. Combined (clone + style tweak)
wav = model.generate(
    text="(稍快,兴奋) 要合成的新文本。",  # "(slightly faster, excited)" + the new text
    reference_wav_path="reference.wav",  # softer form of clone: keeps timbre, but the style prefix overrides delivery
)
Common Workflows
End-to-end: script → audio
Pair with podcast-scriptwriter. Given a script:
A(轻松):欢迎收听《三分钟 AI 快报》。
B(认真):本周动作不小。
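import re

import numpy as np
import soundfile as sf
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)
sample_rate = model.tts_model.sample_rate

# One reference clip per speaker; alternatively drop the clip and rely on a
# design prefix such as "(年轻女声,语气理性犀利)".
voices = {
    'A': "reference_female.wav",
    'B': "reference_male.wav",
}

# Matches "A(轻松):..." with ASCII or full-width parentheses and colon.
line_re = re.compile(r'^([AB])[((]([^))]+)[))][::](.+)$')

segments = []
with open('script.txt', encoding='utf-8') as f:
    for line in f:
        m = line_re.match(line.strip())
        if not m:
            continue
        speaker, tone, text = m.groups()
        styled = f"({tone}) {text.strip()}"
        # Combined mode: reference_wav_path needs no transcript, whereas
        # prompt_wav_path is shown above together with prompt_text.
        wav = model.generate(text=styled, reference_wav_path=voices[speaker],
                             cfg_value=2.0, inference_timesteps=10)
        segments.append(wav)
        segments.append(np.zeros(int(0.4 * sample_rate), dtype=np.float32))  # 0.4 s pause

full = np.concatenate(segments)
sf.write('episode.wav', full, sample_rate)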
Multi-language podcast
Feed the same script translated to en/ja/ko/fr; VoxCPM auto-detects language per line. Keep the same reference audio to preserve host identity across languages.
High-throughput vLLM serving
For volume, use the OpenAI-compatible vLLM Omni server:
vllm serve openbmb/VoxCPM2 --omni --port 8000
Then hit /v1/audio/speech like OpenAI TTS.
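A client call can be sketched with the standard library alone. The payload below assumes the server accepts the same JSON fields as OpenAI's speech endpoint (`model`, `input`, `voice`, `response_format`); the `voice` value and both function names are placeholders, so check the server's own documentation before relying on them.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="openbmb/VoxCPM2", voice="default"):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,             # assumption: placeholder value
        "response_format": "wav",   # assumption: mirrors the OpenAI field
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and save the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.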
Fallback: edge-tts (no GPU)
If VoxCPM cannot run in the environment (no GPU, HF blocked, tight time), fall back to Microsoft Edge TTS (free, no API key, 400+ voices):
pip install edge-tts
edge-tts --voice zh-CN-XiaoxiaoNeural --text "你好世界" --write-media hello.mp3
Trade-offs: no voice cloning, fixed voice pool, network-dependent, not open-source. Acceptable stop-gap.
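To call the fallback from a script, the CLI line above can be wrapped with subprocess. This sketch uses only the flags already shown (`--voice`, `--text`, `--write-media`); the two function names are hypothetical.

```python
import subprocess


def edge_tts_command(text, out_path, voice="zh-CN-XiaoxiaoNeural"):
    """Return the argv list for the edge-tts CLI call shown above."""
    return ["edge-tts", "--voice", voice, "--text", text, "--write-media", out_path]


def speak_with_edge_tts(text, out_path, voice="zh-CN-XiaoxiaoNeural"):
    """Run edge-tts; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(edge_tts_command(text, out_path, voice), check=True)
```

Passing the arguments as a list avoids shell quoting problems with Chinese text and punctuation.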
Limitations
- GPU strongly recommended — CPU works but ~10-50× slower.
- First-run downloads ~2 GB from HF; use HF_ENDPOINT for a mirror if slow.
- CUDA 12+ requires a recent driver: older drivers may need a CUDA 11 build; check torch.cuda.is_available().
- Long text (> 2 min): use generate_streaming and stitch chunks; pure generate may hit VRAM ceiling.
- English-only clone: works but sample quality slightly lower than Chinese-native.
- Dialects require prefix format: (粤语) 早晨。 not 粤语:早晨。
- No word-level timing output: for karaoke use whisper-like STT alignment on the produced WAV.
- Sample rate is fixed at 24 kHz — resample downstream if needed.
- Python 3.13+ untested — pin to 3.10-3.12.
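For the fixed sample rate, downstream resampling can be as simple as linear interpolation. The pure-Python sketch below illustrates the idea only: it applies no anti-aliasing filter, so a dedicated resampling library is the better choice for production audio. `resample_linear` is an illustrative name.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a sequence of floats by linear interpolation between neighbours."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # source samples advanced per output sample
    out = []
    for i in range(out_len):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out
```

For example, `resample_linear(list(wav), 24000, 16000)` would produce two output samples for every three input samples.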
Failure Modes
| Symptom | Cause | Action |
|---|---|---|
| torch.cuda.OutOfMemoryError | Model too big for GPU | Switch to openbmb/VoxCPM-0.5B, or use CPU |
| Model download stalls | HF connectivity | Set HF_ENDPOINT=https://hf-mirror.com |
| Robotic / metallic voice | cfg_value too high | Try 1.5-2.5; sweet spot is 2.0 |
| Voice clone doesn't match | Reference too short/noisy | Use 5-15 s clean sample; enable load_denoiser=True |
| Language auto-detected wrong | Text mixed | Split by language, generate per-segment |
| CLI voxcpm: command not found | Not in venv PATH | pip install voxcpm in the active env; or python -m voxcpm ... |
References
- Repo: https://github.com/OpenBMB/VoxCPM ⭐ 32.4k
- Docs: https://voxcpm.readthedocs.io/
- HF model: https://huggingface.co/openbmb/VoxCPM2 (2B) or https://huggingface.co/openbmb/VoxCPM-0.5B (small)
- Demo: https://openbmb.github.io/VoxCPM-demopage/
- License: Apache-2.0
- Fallback: Microsoft Edge TTS (pip install edge-tts): no API key, no clone