Repo · Zhipu AI (GLM) · published Dec 6, 2025

zai-org/GLM-TTS

Python


Description: GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning

Language: Python

License: Apache-2.0

Stars: 1020

Forks: 129

Open issues: 45

Created: 2025-12-06T04:50:56Z

Pushed: 2026-04-10T08:50:23Z

Default branch: main

Fork: no

Archived: no

README:

GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning

[Read in Chinese](README_zh.md)

📜 Paper | 🤗 HuggingFace | 🤖 ModelScope | 🛠️Audio.Z.AI

Model Introduction

GLM-TTS is a high-quality text-to-speech (TTS) synthesis system based on large language models, supporting zero-shot voice cloning and streaming inference. The system adopts a two-stage architecture: an LLM first generates speech token sequences, and a Flow model then converts those tokens into high-quality audio waveforms. By introducing a Multi-Reward Reinforcement Learning framework, GLM-TTS generates more expressive and emotional speech than traditional TTS systems.

News & Updates

  • [2025.12.11] 🎉 The project is officially open-sourced, featuring inference scripts and a series of model weights.
  • [2025.12.17] GLM-TTS Technical Report is available on arXiv: 2512.14291.
  • [Coming Soon] 2D Vocos vocoder update in progress.
  • [Coming Soon] Model Weights Optimized via Reinforcement Learning

Features

  • Zero-shot Voice Cloning: Clone any speaker's voice with just 3-10 seconds of prompt audio
  • RL-enhanced Emotion Control: Achieve more natural emotional expression and prosody control through a multi-reward reinforcement learning framework
  • Streaming Inference: Support real-time streaming audio generation, suitable for interactive applications
  • High-quality Synthesis: Generate natural and expressive speech with quality comparable to commercial systems
  • Multi-language Support: Primarily supports Chinese, and also handles mixed Chinese-English text
  • Phoneme-level Modeling: Support phoneme-level text-to-speech conversion
  • Flexible Inference Methods: Support multiple sampling strategies and inference modes
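The 3-10 second prompt length mentioned in the feature list can be checked before inference. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and not part of the GLM-TTS codebase, and it assumes the prompt is an uncompressed PCM WAV file.

```python
import wave

# Bounds taken from the feature list: 3 to 10 seconds of prompt audio.
MIN_SECONDS = 3.0
MAX_SECONDS = 10.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_prompt(path):
    """Check whether a WAV prompt falls inside the recommended length range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

A prompt outside this range may still work; the check only flags audio that falls outside the length the README recommends.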

Quick Start

Environment Setup

Use Python 3.10 through 3.12.
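As a quick sanity check before installing dependencies, the interpreter version can be compared against the supported range. This is a small illustrative helper, not something shipped with the repository; the name `is_supported` is an assumption.

```python
import sys

# Supported range taken from the README: Python 3.10 through 3.12.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def is_supported(version=None):
    """Return True if (major, minor) falls within the supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION
```

Calling `is_supported()` with no argument checks the running interpreter; passing a tuple such as `(3, 11)` checks an arbitrary version.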

For GPU

# Clone repository
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS

# Install dependencies
pip install -r requirements.txt

# Install reinforcement learning related dependencies (optional)
cd grpo/modules
git clone https://github.com/s3prl/s3prl
git clone https://github.com/omine-me/LaughterSegmentation
# Download wavlm_large_finetune.pth and place it in grpo/ckpt directory

For NPU

Obtain CANN image

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the CANN image tag as needed
export IMAGE=quay.io/ascend/cann:8.5.1-910b-ubuntu22.04-py3.11
docker run --rm \
--name vllm-ascend-env \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
# Clone repository
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS

pip config set global.extra-index-url "https://download.pytorch.org/whl/cpu/"
python -m pip install -r requirements_npu.txt --no-build-isolation

# Install reinforcement learning related dependencies (optional)
cd grpo/modules
git clone https://github.com/s3prl/s3prl
git clone https://github.com/omine-me/LaughterSegmentation
# Download wavlm_large_finetune.pth and place it in grpo/ckpt directory

Download Pre-trained Models

We support downloading the complete model weights (including Tokenizer, LLM, Flow, Vocoder, and Frontend) from HuggingFace or ModelScope.

# Create model directory
mkdir -p ckpt

# Option 1: Download from HuggingFace
pip install -U huggingface_hub
huggingface-cli download zai-org/GLM-TTS --local-dir ckpt

# Option 2: Download from ModelScope
pip install -U modelscope
modelscope download --model ZhipuAI/GLM-TTS --local_dir ckpt
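The HuggingFace download can also be done from Python rather than the CLI. The sketch below assumes `huggingface_hub` is installed and uses its `snapshot_download` function; the wrapper name `download_weights` is hypothetical.

```python
from pathlib import Path

REPO_ID = "zai-org/GLM-TTS"  # HuggingFace repo named in the commands above
LOCAL_DIR = Path("ckpt")     # same target directory as the CLI example


def download_weights(repo_id=REPO_ID, local_dir=LOCAL_DIR):
    """Download the full model snapshot into local_dir and return its path."""
    # Imported lazily so this module loads even before huggingface_hub is installed.
    from huggingface_hub import snapshot_download

    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
```

Calling `download_weights()` from the repository root should leave the weights under `ckpt`, mirroring Option 1.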

Running Inference Demo

Command Line Inference

python glmtts_inference.py \
--data=example_zh \
--exp_name=_test \
--use_cache
# Add --phoneme to the command above to enable phoneme capabilities.
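When the script needs to be launched from another Python program, the same command can be assembled as an argument list. This sketch uses only the flags shown above (`--data`, `--exp_name`, `--use_cache`, `--phoneme`); the helper names are hypothetical, and it assumes the working directory is the repository root.

```python
import subprocess
import sys


def build_inference_cmd(data="example_zh", exp_name="_test",
                        use_cache=True, phoneme=False):
    """Assemble the glmtts_inference.py command as a list of arguments."""
    cmd = [
        sys.executable,
        "glmtts_inference.py",
        f"--data={data}",
        f"--exp_name={exp_name}",
    ]
    if use_cache:
        cmd.append("--use_cache")
    if phoneme:
        cmd.append("--phoneme")
    return cmd


def run_inference(**kwargs):
    """Run the script; raises CalledProcessError if it exits non-zero."""
    return subprocess.run(build_inference_cmd(**kwargs), check=True)
```

Building the command as a list avoids shell quoting issues and makes the optional `--phoneme` flag an explicit parameter.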

Shell Script Inference

bash glmtts_inference.sh

Interactive Web Interface

python -m tools.gradio_app

System Architecture

Overview

GLM-TTS adopts a two-stage design. In the first stage, a large language model (LLM) based on the Llama architecture converts input text into speech token sequences. In the second stage, a Flow Matching model converts these token sequences into mel-spectrograms, and a vocoder then generates the audio waveform. The system supports zero-shot voice cloning by extracting speaker features from the prompt audio, with no fine-tuning for specific speakers.
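The data flow described above can be sketched as a chain of three callables. This is a structural illustration only: the class, field names, and type signatures are assumptions rather than the repository's API, and how the speaker features condition each stage is likewise assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Mel = List[List[float]]  # frames x mel bins


@dataclass
class TwoStagePipeline:
    # Stage 1: LLM maps text (plus speaker features) to speech tokens.
    text_to_tokens: Callable[[str, Sequence[float]], List[int]]
    # Stage 2: flow-matching model maps tokens to a mel-spectrogram.
    tokens_to_mel: Callable[[List[int], Sequence[float]], Mel]
    # Vocoder maps the mel-spectrogram to waveform samples.
    mel_to_wave: Callable[[Mel], List[float]]

    def synthesize(self, text: str, speaker: Sequence[float]) -> List[float]:
        """Run text -> tokens -> mel -> waveform in order."""
        tokens = self.text_to_tokens(text, speaker)
        mel = self.tokens_to_mel(tokens, speaker)
        return self.mel_to_wave(mel)
```

Keeping the stages as separate callables mirrors the README's description, in which the LLM, the Flow model, and the vocoder are distinct components that ship as separate weights.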

Fine-grained Pronunciation Control (Phoneme-in)

For scenarios demanding high pronunciation accuracy, such as educational assessments and audiobooks, GLM-TTS introduces the Phoneme-in mechanism to resolve pronunciation ambiguity in polyphonic characters (e.g., "行", which can be read as *xíng* or *háng*) and rare characters. This mechanism supports "Hybrid Phoneme + Text" input, enabling precise, targeted control over the pronunciation of specific words.

  • Hybrid Training

During training, random G2P (Grapheme-to-Phoneme) conversion is applied to parts of the text. This strategy compels the model to adapt to hybrid input sequences, preserving its ability to understand pure text while enhancing generalization for phoneme inputs.
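A toy version of this random conversion might look as follows. The lexicon, the `<phoneme>` token format, and the `ratio` parameter are all assumptions made for illustration; the README does not specify how the real front end represents phonemes or how much of the text is converted.

```python
import random

# Toy lexicon; a real system would use a full G2P front end.
TOY_G2P = {"银": "yin2", "行": "hang2", "很": "hen3", "远": "yuan3"}


def random_hybrid(text, ratio=0.3, rng=None):
    """Replace each known character with its phoneme with probability `ratio`."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch in TOY_G2P and rng.random() < ratio:
            out.append(f"<{TOY_G2P[ch]}>")  # phoneme token (hypothetical format)
        else:
            out.append(ch)                  # keep the original character
    return out
```

With `ratio=0.0` the model sees pure text, and with `ratio=1.0` every character in the lexicon becomes a phoneme token; intermediate values produce the mixed sequences the training strategy relies on.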

  • Targeted Inference

Inference follows a G2P -> Table Lookup Replacement -> Hybrid Input workflow:

  1. Global Conversion: Obtain the complete phoneme sequence for the input text.
  2. Dynamic Replacement: Using a "Dynamic Controllable Dictionary," automatically identify polyphones or rare characters and replace them with specified target phonemes.
  3. Hybrid Generation: Feed the combination of replaced phonemes and original text into GLM-TTS as a hybrid input. This ensures…
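The three steps can be sketched with a toy lexicon and a small control dictionary. Everything here is a hypothetical stand-in: the default readings, the dictionary layout (word mapped to one target phoneme per character), and the `<phoneme>` token format are assumptions, not the repository's actual data structures.

```python
# Hypothetical default readings, standing in for a real G2P front end.
DEFAULT_G2P = {"银": "yin2", "行": "xing2", "很": "hen3", "长": "chang2"}


def build_hybrid_input(text, control_dict):
    """Return a mixed list of characters and <phoneme> tokens.

    control_dict maps a word to a list with one target phoneme per character.
    """
    # Step 1: global conversion, one default phoneme (or None) per character.
    phonemes = [DEFAULT_G2P.get(ch) for ch in text]
    controlled = [False] * len(text)

    # Step 2: dictionary lookup, overriding phonemes wherever a word matches.
    for word, targets in control_dict.items():
        if not word or len(targets) != len(word):
            raise ValueError(f"need one phoneme per character for {word!r}")
        start = text.find(word)
        while start != -1:
            for offset, target in enumerate(targets):
                phonemes[start + offset] = target
                controlled[start + offset] = True
            start = text.find(word, start + len(word))

    # Step 3: hybrid output, phonemes at controlled positions and text elsewhere.
    return [f"<{p}>" if c else ch
            for ch, p, c in zip(text, phonemes, controlled)]
```

For example, passing `{"银行": ["yin2", "hang2"]}` forces the *háng* reading of "行" inside "银行" while leaving the rest of the sentence as plain text.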
