ModelOpenBMB (MiniCPM)OpenBMB (MiniCPM)published Dec 5, 2025seen 5d

openbmb/VoxCPM1.5

Open original ↗

Captured source

source ↗
published Dec 5, 2025seen 5dcaptured 14hhttp 200method plaintask text-to-speechlicense apache-2.0library voxcpmparams 802Mdownloads 11klikes 362

🎙️ VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

  • VoxCPM1.5

🎉 VoxCPM1.5 Updates

Release Date: December 5, 2025

VoxCPM1.5 brings improvements in audio quality and efficiency:

| Feature | VoxCPM | VoxCPM1.5 | |---------|------------|------------| | Audio VAE Sampling Rate | 16kHz | 44.1kHz | | LM Token Rate | 12.5Hz | 6.25Hz | | Patch Size | 2 | 4 | | SFT Support | ✅ | ✅ | | LoRA Support | ✅ | ✅ |

Key Improvements:

  • 🔊 Higher Quality: 44.1kHz sampling rate preserves more high-frequency details for better voice cloning
  • More Efficient: Reduced token rate (6.25Hz) lowers computational cost while maintaining performance
  • 🎓 Fine-tuning Support: Train personalized voice models with SFT or LoRA

Note: Output quality depends on the prompt speech quality. VoxCPM-0.5B remains fully supported with backward compatibility.

📚 Model Overview

VoxCPM is a novel tokenizer-free Text-to-Speech (TTS) system that redefines realism in speech synthesis. By modeling speech in a continuous space, it overcomes the limitations of discrete tokenization and enables two flagship capabilities: context-aware speech generation and true-to-life zero-shot voice cloning.

Unlike mainstream approaches that convert speech to discrete tokens, VoxCPM uses an end-to-end diffusion autoregressive architecture that directly generates continuous speech representations from text. Built on MiniCPM-4 backbone, it achieves implicit semantic-acoustic decoupling through hierachical language modeling and FSQ constraints, greatly enhancing both expressiveness and generation stability.

🚀 Key Features

  • Context-Aware, Expressive Speech Generation - VoxCPM comprehends text to infer and generate appropriate prosody, delivering speech with remarkable expressiveness and natural flow. It spontaneously adapts speaking style based on content, producing highly fitting vocal expression trained on a massive 1.8 million-hour bilingual corpus.
  • True-to-Life Voice Cloning - With only a short reference audio clip, VoxCPM performs accurate zero-shot voice cloning, capturing not only the speaker’s timbre but also fine-grained characteristics such as accent, emotional tone, rhythm, and pacing to create a faithful and natural replica.
  • High-Efficiency Synthesis - VoxCPM supports streaming synthesis with a Real-Time Factor (RTF) as low as 0.17 on a consumer-grade NVIDIA RTX 4090 GPU, making it possible for real-time applications.

Quick Start

🔧 Install from PyPI

pip install voxcpm

1. Model Download (Optional)

By default, when you first run the script, the model will be downloaded automatically, but you can also download the model in advance.

  • Download VoxCPM1.5
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM1.5")
  • Or Download VoxCPM-0.5B
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM-0.5B")
  • Download ZipEnhancer and SenseVoice-Small. We use ZipEnhancer to enhance speech prompts and SenseVoice-Small for speech prompt ASR in the web demo.
from modelscope import snapshot_download
snapshot_download('iic/speech_zipenhancer_ans_multiloss_16k_base')
snapshot_download('iic/SenseVoiceSmall')

2. Basic Usage

import soundfile as sf
import numpy as np
from voxcpm import VoxCPM

model = VoxCPM.from_pretrained("openbmb/VoxCPM1.5")

# Non-streaming
wav = model.generate(
text="VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.",
prompt_wav_path=None, # optional: path to a prompt speech for voice cloning
prompt_text=None, # optional: reference text
cfg_value=2.0, # LM guidance on LocDiT, higher for better adherence to the prompt, but maybe worse
inference_timesteps=10, # LocDiT inference timesteps, higher for better result, lower for fast speed
normalize=False, # enable external TN tool, but will disable native raw text support
denoise=False, # enable external Denoise tool, but it may cause some distortion and restrict the sampling rate to 16kHz
retry_badcase=True, # enable retrying mode for some bad cases (unstoppable)
retry_badcase_max_times=3, # maximum retrying times
retry_badcase_ratio_threshold=6.0, # maximum length restriction for bad case detection (simple but effective), it could be adjusted for slow pace speech
)

sf.write("output.wav", wav, model.tts_model.sample_rate)
print("saved: output.wav")

# Streaming
chunks = []
for chunk in model.generate_streaming(
text = "Streaming text to speech is easy with VoxCPM!",
# supports same args as above
):
chunks.append(chunk)
wav = np.concatenate(chunks)

sf.write("output_streaming.wav", wav, model.tts_model.sample_rate)
print("saved: output_streaming.wav")

3. CLI Usage

After installation, the entry point is voxcpm (or use python -m voxcpm.cli).

# 1) Direct synthesis (single text)
voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." --output out.wav

# 2) Voice cloning (reference audio + transcript)
voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
--prompt-audio path/to/voice.wav \
--prompt-text "reference transcript" \
--output out.wav \
# --denoise

# (Optinal) Voice cloning (reference audio + transcript file)
voxcpm --text "VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech." \
--prompt-audio path/to/voice.wav \
--prompt-file "/path/to/text-file" \
--output out.wav \
# --denoise

# 3) Batch processing (one text per line)
voxcpm --input examples/input.txt --output-dir outs
# (optional) Batch + cloning
voxcpm --input examples/input.txt --output-dir outs \
--prompt-audio path/to/voice.wav \
--prompt-text "reference transcript" \
# --denoise

# 4) Inference parameters (quality/speed)
voxcpm --text "..." --output out.wav \
--cfg-value 2.0 --inference-timesteps 10 --normalize

# 5) Model loading
# Prefer local path
voxcpm --text "..." --output out.wav --model-path /path/to/VoxCPM_model_dir
# Or from Hugging Face (auto download/cache)
voxcpm --text "..." --output out.wav \
--hf-model-id openbmb/VoxCPM1.5 --cache-dir ~/.cache/huggingface --local-files-only

# 6) Denoiser control
voxcpm --text "..." --output out.wav \…

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable model release with good community traction.