What does this repo signal mean?

StepFun published stepfun-ai/Step-Audio2 (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo stepfun-ai/Step-Audio2 · language Python · Notable audio model repo with 1.46k stars. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

StepFun Repo: stepfun-ai/Step-Audio2

Captured source

source ↗

GitHub/github.com/stepfun-ai/Step-Audio2

stepfun-ai/Step-Audio2 repository metadata

Source ↗

published Jul 15, 2025seen Jun 5captured Jun 11http 200method plain

stepfun-ai/Step-Audio2

Description: Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.

Language: Python

License: Apache-2.0

Stars: 1460

Forks: 107

Open issues: 55

Created: 2025-07-15T09:14:32Z

Pushed: 2026-03-16T04:06:21Z

Default branch: main

Fork: no

Archived: no

README:

Step-Audio 2

🔥🔥🔥 News!!

Sep 15, 2025: 👋 We release Step-Audio 2 mini Think and its corresponding [examples](examples-think.py).
Sep 3, 2025: 👋 We release our vLLM backend and corresponding [examples](examples-vllm.py).
Aug 29, 2025: 👋 We are pleased to open-source Step-Audio 2 mini, Step-Audio 2 mini Base and their corresponding inference [examples](examples.py). Technical report is also updated.
Jul 24, 2025: 👋 We release demonstration videos for Step-Audio 2.
Jul 23, 2025: 👋 We release our benchmark for paralinguistic information understanding, StepEval-Audio-Paralinguistic.
Jul 23, 2025: 👋 We release our benchmark for tool calling, StepEval-Audio-Toolcall.
Jul 23, 2025: 👋 We release the technical report of Step-Audio 2.

WeChat Developer Group

Introduction

Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.

Advanced Speech and Audio Understanding: Promising performance in ASR and audio understanding by comprehending and reasoning semantic information, para-linguistic and non-vocal information.

Intelligent Speech Conversation: Achieving natural and intelligent interactions that are contextually appropriate for various conversational scenarios and paralinguistic information.

Emotional Reasoning: Analyzing user's paralinguistic information such as age and emotion, leading to more accurate and intelligent interpretation of the audio context.

Tool Calling and Multimodal RAG: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations for diverse scenarios, while also having the ability to switch timbres based on retrieved speech.

State-of-the-Art Performance: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. (See [Evaluation](#evaluation) and Technical Report).

+ Open-source: Step-Audio 2 mini, Step-Audio 2 mini Base and Step-Audio 2 mini Think are released under [Apache 2.0](LICENSE) license.

Model Download

| Models | 🤗 Hugging Face | ModelScope | |-------|-------|-------| | Step-Audio 2 mini | stepfun-ai/Step-Audio-2-mini | stepfun-ai/Step-Audio-2-mini | | Step-Audio 2 mini Base | stepfun-ai/Step-Audio-2-mini-Base | stepfun-ai/Step-Audio-2-mini-Base | | Step-Audio 2 mini Think | stepfun-ai/Step-Audio-2-mini-Think | stepfun-ai/Step-Audio-2-mini-Think |

Model Usage

🔧 Dependencies and Installation

conda create -n stepaudio2 python=3.10
conda activate stepaudio2
pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml

git clone https://github.com/stepfun-ai/Step-Audio2.git
cd Step-Audio2
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-2-mini
# git clone https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base

🔧 vLLM docker image

We highly recommend using our vLLM backend for faster and streaming inference, also deploying across multiple GPUs.

# (Optional) build the docker image yourself (very slow and requires 32GiB of memory)
# docker build -t stepfun2025/vllm:step-audio-2-v20250909 .

# run vLLM docker
docker run --rm -ti --gpus all \
-v Step-Audio-2-mini:/Step-Audio-2-mini \
-p 8000:8000 \
stepfun2025/vllm:step-audio-2-v20250909 \
-- vllm serve /Step-Audio-2-mini \
--served-model-name step-audio-2-mini \
--port 8000 \
--max-model-len 16384 \
--max-num-seqs 32 \
--tensor-parallel-size 1 \
--enable-auto-tool-choice \
--tool-call-parser step_audio_2 \
--tokenizer-mode step_audio_2 \
--chat_template_content_format string \
--audio-parser step_audio_2_tts_ta4 \
--trust-remote-code

🚀 Inference Scripts

python examples.py
# python examples-base.py
# python examples-vllm.py
# python examples-think.py

🚀 Local web demonstration

pip install gradio
python web_demo.py
# python web_demo_vllm.py

Online demonstration

StepFun realtime console

Both Step-Audio 2 and Step-Audio 2 mini are available in our StepFun realtime console with web search tool enabled.
You will need an API key from the StepFun Open Platform.

StepFun AI Assistant

Step-Audio 2 is also available in our StepFun AI Assistant mobile App with both web and audio search tools enabled.
Please scan the following QR code to download it from your app store then tap the phone icon in the top-right corner.

WeChat group

You can scan the following QR code to join our WeChat group for communication and discussion.

Evaluation

Automatic speech recognition

CER for Chinese, Cantonese and Japanese and WER for Arabian and English. N/A indicates that the language is...

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable audio model repo with 1.46k stars