meituan-longcat/LongCat-Audio-Codec
Python
Description: LongCat Audio Tokenizer and Detokenizer
Language: Python
License: MIT
Stars: 301
Forks: 23
Open issues: 1
Created: 2025-10-16T12:06:05Z
Pushed: 2026-05-09T10:22:27Z
Default branch: main
Fork: no
Archived: no
README:
LongCat-Audio-Codec is an audio tokenizer and detokenizer solution designed for speech large language models. It generates semantic and acoustic tokens in parallel, enabling high-fidelity audio reconstruction at extremely low bitrates and providing strong backend support for speech LLMs.
✨ Key Features
- High Fidelity at Ultra-Low Bitrates: As a codec, it achieves high-intelligibility audio reconstruction at extremely low bitrates.
- Low-Frame-Rate Tokenizer: As a tokenizer, it extracts semantic tokens and acoustic tokens in parallel at a low frame rate of 16.6 Hz, with flexible acoustic codebook configurations to adapt to different downstream tasks.
- Low-Latency Streaming Detokenizer: Equipped with a specially designed streaming-capable detokenizer that requires minimal future information to deliver high-quality audio output with low latency.
- Super-Resolution Capability: Integrates audio super-resolution processing into the detokenizer, enabling the generation of high-quality audio with a higher sample rate than the original input.
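To make the 16.6 Hz frame rate concrete, the sketch below estimates how many token frames a clip produces and how many tokens per second a given number of codebooks yields. It is a back-of-the-envelope calculation, not code from this repository: the function names are hypothetical, and `codebook_size` is a placeholder parameter because the codebook sizes are not stated here.

```python
import math

FRAME_RATE_HZ = 16.6  # frame rate quoted in the feature list


def num_frames(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Approximate number of token frames for a clip (rounded to nearest)."""
    return round(duration_s * frame_rate_hz)


def tokens_per_second(num_codebooks: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Each frame carries one token per codebook (semantic + acoustic)."""
    return num_codebooks * frame_rate_hz


def bitrate_bps(num_codebooks: int, codebook_size: int,
                frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Bitrate if every codebook had `codebook_size` entries.

    ASSUMPTION: codebook_size is a hypothetical input; the real sizes
    are not given in this document.
    """
    return num_codebooks * math.log2(codebook_size) * frame_rate_hz


if __name__ == "__main__":
    print(num_frames(10.0))          # frames in a 10-second clip
    print(tokens_per_second(4))      # 1 semantic + 3 acoustic codebooks
    print(bitrate_bps(4, 1024))      # 1024 is an illustrative size only
```

The relation is simply frames × codebooks × bits per token, so dropping acoustic codebooks lowers the bitrate proportionally.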
🚀Quick Start
🛠️Installation
1. Create a conda environment and install PyTorch
Note: The torch version listed below is just an example. Please install the version of PyTorch that matches your specific hardware configuration.
conda create -n LongCat-Audio-Codec python=3.10
conda activate LongCat-Audio-Codec
pip install torch==2.7.1 torchaudio==2.7.1
2. Other dependencies:
pip install -r requirements.txt
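After installation, a quick check like the following can confirm which versions ended up in the environment. This is a minimal sketch using only the standard library (`importlib.metadata`); the helper name `installed_versions` is hypothetical and not part of this repository.

```python
from importlib import metadata


def installed_versions(packages=("torch", "torchaudio")):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it inside the activated `LongCat-Audio-Codec` environment; a `not installed` line means the corresponding `pip install` step did not take effect there.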
📦Model Preparation
1. Model Download
| Models | Download Link | Notes |
| ----------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| LongCatAudioCodec_encoder | 🤗 Huggingface | encoder weights with semantic encoder and acoustic encoder |
| LongCatAudioCodec_encoder_cmvn | 🤗 Huggingface | coefficients of Cepstral Mean and Variance Normalization, used by semantic encoder |
| LongCatAudioCodec_decoder16k_4codebooks | 🤗 Huggingface | 16k decoder, supply 1 semantic codebook and at most 3 acoustic codebooks |
| LongCatAudioCodec_decoder24k_2codebooks | 🤗 Huggingface | 24k decoder, supply 1 semantic codebook and 1 acoustic codebook, SFT on limited speakers |
| LongCatAudioCodec_decoder24k_4codebooks | 🤗 Huggingface | 24k decoder, supply 1 semantic codebook and at most 3 acoustic codebooks |
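The decoder rows of the table can be summarized as a small lookup, which may help when deciding which checkpoint to download. This is an illustrative sketch only: `DecoderSpec` and `pick_decoders` are hypothetical names, "16k"/"24k" are assumed to mean output sample rates of 16,000 Hz and 24,000 Hz, and "at most 3" is taken literally because the table gives no lower bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderSpec:
    name: str
    sample_rate_hz: int          # ASSUMPTION: "16k"/"24k" read as Hz
    max_acoustic_codebooks: int  # on top of the single semantic codebook
    fixed_count: bool            # True if the table says exactly this many


DECODERS = (
    DecoderSpec("LongCatAudioCodec_decoder16k_4codebooks", 16_000, 3, False),
    DecoderSpec("LongCatAudioCodec_decoder24k_2codebooks", 24_000, 1, True),
    DecoderSpec("LongCatAudioCodec_decoder24k_4codebooks", 24_000, 3, False),
)


def pick_decoders(sample_rate_hz: int, acoustic_codebooks: int) -> list[str]:
    """List decoder names from the table that fit the requested setup."""
    matches = []
    for spec in DECODERS:
        if spec.sample_rate_hz != sample_rate_hz:
            continue
        if spec.fixed_count:
            fits = acoustic_codebooks == spec.max_acoustic_codebooks
        else:
            fits = acoustic_codebooks <= spec.max_acoustic_codebooks
        if fits:
            matches.append(spec.name)
    return matches


if __name__ == "__main__":
    print(pick_decoders(24_000, 1))
```

Note that the table describes the 24k 2-codebook decoder as fine-tuned on limited speakers, so a name match alone does not settle which checkpoint suits a given task.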
2. Link LongCat-Audio-Codec Model to Right Path
After downloading the model checkpoint files (.pt files), you have two options for making them accessible to the inference script:
Option 1: Place Models in the Default ckpts Directory (Recommended)
For a quick setup, all models and configuration files must be placed within the LongCat-Audio-Codec/ project root directory.
The final, correct project structure should look exactly like this:
LongCat-Audio-Codec/
[...]
Contact longcat-team@meituan.com or join our WeChat Group if you have any questions.
Notability: 6.0/10. Solid new audio codec repo, 301 stars