What does this repo signal mean?

Moonshot AI (Kimi) published the Python repository MoonshotAI/Kimi-Audio. A repository signal like this can surface tooling, evaluation, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo MoonshotAI/Kimi-Audio · language Python · high-starred audio repo. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Moonshot AI (Kimi) Repo: MoonshotAI/Kimi-Audio

Captured source

GitHub/github.com/MoonshotAI/Kimi-Audio

MoonshotAI/Kimi-Audio repository metadata

published Apr 25, 2025 · seen 5d · captured 15h · http 200 · method plain

MoonshotAI/Kimi-Audio

Description: Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation

Language: Python

Stars: 4649

Forks: 361

Open issues: 112

Created: 2025-04-25T10:00:18Z

Pushed: 2025-06-21T15:30:28Z

Default branch: master

Fork: no

Archived: no

README:

Kimi-Audio-7B 🤗 | Kimi-Audio-7B-Instruct 🤗 | 📑 Paper

We present Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation. This repository contains the official implementation, models, and evaluation toolkit for Kimi-Audio.

🔥🔥🔥 News!!

May 29, 2025: 👋 We release a finetuning example of Kimi-Audio-7B.
April 27, 2025: 👋 We release pretrained model weights of Kimi-Audio-7B.
April 25, 2025: 👋 We release the inference code and model weights of Kimi-Audio-7B-Instruct.
April 25, 2025: 👋 We release the audio evaluation toolkit Kimi-Audio-Evalkit. Our results and the baselines can be easily reproduced with this toolkit!
April 25, 2025: 👋 We release the technical report of Kimi-Audio.

[Introduction](#introduction)
[Architecture Overview](#architecture-overview)
[Quick Start](#quick-start)
[Evaluation](#evaluation)
[Speech Recognition](#automatic-speech-recognition-asr)
[Audio Understanding](#audio-understanding)
[Audio-to-Text Chat](#audio-to-text-chat)
[Speech Conversation](#speech-conversation)
[Finetune](#finetune)
[Evaluation Toolkit](#evaluation-toolkit)
[Generation Testset](#generation-testset)
[License](#license)
[Acknowledgements](#acknowledgements)
[Citation](#citation)
[Contact Us](#contact-us)

Introduction

Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

Universal Capabilities: Handle diverse tasks like automatic speech recognition (ASR), audio question answering (AQA), automatic audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), and end-to-end speech conversation.
State-of-the-Art Performance: Achieve SOTA results on numerous audio benchmarks (see [Evaluation](#evaluation) and the Technical Report).
Large-Scale Pre-training: Pre-train on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding.
Novel Architecture: Employ a hybrid audio input (continuous acoustic vectors + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
Efficient Inference: Feature a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
Open-Source: Release the code and model checkpoints for both pre-training and instruction fine-tuning, and release a comprehensive evaluation toolkit to foster community research and development.

Architecture Overview

Kimi-Audio consists of three main components:

1. Audio Tokenizer: Converts input audio into:

Discrete semantic tokens (12.5Hz) using vector quantization.
Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5Hz).

2. Audio LLM: A transformer-based model (initialized from a pre-trained text LLM like Qwen 2.5 7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.

3. Audio Detokenizer: Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.
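
To make the 12.5Hz token rate and the chunk-wise streaming idea concrete, here is a minimal sketch. It is not taken from the Kimi-Audio code: the helper names `num_tokens` and `chunks_with_lookahead`, and the chunk and look-ahead sizes used with them, are hypothetical and chosen only for illustration. The actual detokenizer settings are not given in this excerpt.

```python
from typing import Iterator, List, Sequence, Tuple

TOKEN_RATE_HZ = 12.5  # from the README: semantic tokens and acoustic features run at 12.5Hz


def num_tokens(duration_s: float, rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Number of whole token frames that cover `duration_s` seconds of audio."""
    return int(duration_s * rate_hz)


def chunks_with_lookahead(
    tokens: Sequence[int], chunk_size: int, lookahead: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (chunk, lookahead_context) pairs over a token sequence.

    Each chunk would be decoded to audio on its own; the look-ahead tokens
    are extra right-hand context that a decoder could peek at.
    Hypothetical helper, not the library's implementation.
    """
    for start in range(0, len(tokens), chunk_size):
        end = start + chunk_size
        yield list(tokens[start:end]), list(tokens[end:end + lookahead])
```

At 12.5Hz each token covers 80 ms of audio, so `num_tokens(10.0)` gives 125 tokens for a 10-second clip. Smaller chunks reduce the delay before the first audio is produced, at the cost of more decoding calls.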

Getting Started

Step 1: Get the Code

git clone https://github.com/MoonshotAI/Kimi-Audio.git
cd Kimi-Audio
git submodule update --init --recursive
pip install -r requirements.txt

Kimi-Audio can now be installed directly via pip.

pip install torch
pip install git+https://github.com/MoonshotAI/Kimi-Audio.git
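
After either install route, a quick check that the package is visible to the interpreter can save a failed model load later. The sketch below assumes the top-level module is named `kimia_infer`, as in the Quick Start import. `is_installed` is a hypothetical helper, not part of the repository.

```python
import importlib.util


def is_installed(module_name: str = "kimia_infer") -> bool:
    """Return True if `module_name` can be located on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("kimia_infer importable:", is_installed())
```

This only locates the module. It does not import it, download weights, or load a model.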

Quick Start

This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn.

import soundfile as sf
from kimia_infer.api.kimia import KimiAudio

# --- 1. Load Model ---
model_path = "moonshotai/Kimi-Audio-7B-Instruct"
model = KimiAudio(model_path=model_path, load_detokenizer=True)

# --- 2. Define Sampling Parameters ---
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

# --- 3. Example 1: Audio-to-Text (ASR) ---
messages_asr = [
    # You can provide context or instructions as text
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    # Provide the audio file path
    {"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"}
]

# Generate only text output
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output) # Expected output: "这并不是告别，这是一个篇章的结束，也是新篇章的开始。" (English: "This is not a farewell; it is the end of one chapter and the beginning of a new one.")

# --- 4. Example 2: Audio-to-Audio/Text Conversation ---
messages_conversation = [
    # Start conversation with an audio query
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"}
]

# Generate both audio and text output
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")

# Save the generated audio
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000) # Assuming 24kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output) # Expected output: "当然可以，这很简单。一二三四五六七八九十。" (English: "Of course, that's easy. One, two, three, four, five, six, seven, eight, nine, ten.")

# --- 5. Example 3: Audio-to-Audio/Text Conversation with Multiturn ---

messages = [
    {"role": "user", "message_type": "audio", "content": "test_audios/multiturn/case2/multiturn_q1.wav"},
    #…
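
The sampling parameters in the example above (`audio_temperature`, `audio_top_k`, `text_temperature`, `text_top_k`) follow the usual temperature and top-k vocabulary. As a generic illustration of what such settings commonly mean, here is a small standalone sketch. It is not Kimi-Audio's decoding code: `sample_top_k` is a hypothetical function, and treating a temperature of 0 as greedy decoding is an assumption about the convention, not something this excerpt confirms.

```python
import math
import random
from typing import List, Optional


def sample_top_k(
    logits: List[float],
    temperature: float,
    top_k: int,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick a token index from `logits` using temperature and top-k filtering."""
    if temperature <= 0.0:
        # Assumption: temperature 0 means greedy decoding (take the arg-max).
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Keep the indices of the k highest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the surviving logits, scaled by temperature.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

Under this reading, a text temperature of 0.0 makes the text output deterministic, while an audio temperature of 0.8 with a top-k of 10 leaves some variation in the generated speech tokens.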

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

High-starred audio repo