RepoStepFunStepFunpublished Jul 15, 2025seen 5d

stepfun-ai/Step-Audio2

Python

Open original ↗

Captured source

source ↗
published Jul 15, 2025seen 5dcaptured 14hhttp 200method plain

stepfun-ai/Step-Audio2

Description: Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.

Language: Python

License: Apache-2.0

Stars: 1460

Forks: 107

Open issues: 55

Created: 2025-07-15T09:14:32Z

Pushed: 2026-03-16T04:06:21Z

Default branch: main

Fork: no

Archived: no

README:

Step-Audio 2

🔥🔥🔥 News!!

WeChat Developer Group

Introduction

Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.

  • Advanced Speech and Audio Understanding: Promising performance in ASR and audio understanding by comprehending and reasoning semantic information, para-linguistic and non-vocal information.
  • Intelligent Speech Conversation: Achieving natural and intelligent interactions that are contextually appropriate for various conversational scenarios and paralinguistic information.
  • Emotional Reasoning: Analyzing user's paralinguistic information such as age and emotion, leading to more accurate and intelligent interpretation of the audio context.
  • Tool Calling and Multimodal RAG: By leveraging tool calling and RAG to access real-world knowledge (both textual and acoustic), Step-Audio 2 can generate responses with fewer hallucinations for diverse scenarios, while also having the ability to switch timbres based on retrieved speech.
  • State-of-the-Art Performance: Achieving state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. (See [Evaluation](#evaluation) and Technical Report).

+ Open-source: Step-Audio 2 mini, Step-Audio 2 mini Base and Step-Audio 2 mini Think are released under [Apache 2.0](LICENSE) license.

Model Download

| Models | 🤗 Hugging Face | ModelScope | |-------|-------|-------| | Step-Audio 2 mini | stepfun-ai/Step-Audio-2-mini | stepfun-ai/Step-Audio-2-mini | | Step-Audio 2 mini Base | stepfun-ai/Step-Audio-2-mini-Base | stepfun-ai/Step-Audio-2-mini-Base | | Step-Audio 2 mini Think | stepfun-ai/Step-Audio-2-mini-Think | stepfun-ai/Step-Audio-2-mini-Think |

Model Usage

🔧 Dependencies and Installation

conda create -n stepaudio2 python=3.10
conda activate stepaudio2
pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml

git clone https://github.com/stepfun-ai/Step-Audio2.git
cd Step-Audio2
git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-2-mini
# git clone https://huggingface.co/stepfun-ai/Step-Audio-2-mini-Base

🔧 vLLM docker image

We highly recommend using our vLLM backend for faster and streaming inference, also deploying across multiple GPUs.

# (Optional) build the docker image yourself (very slow and requires 32GiB of memory)
# docker build -t stepfun2025/vllm:step-audio-2-v20250909 .

# run vLLM docker
docker run --rm -ti --gpus all \
-v Step-Audio-2-mini:/Step-Audio-2-mini \
-p 8000:8000 \
stepfun2025/vllm:step-audio-2-v20250909 \
-- vllm serve /Step-Audio-2-mini \
--served-model-name step-audio-2-mini \
--port 8000 \
--max-model-len 16384 \
--max-num-seqs 32 \
--tensor-parallel-size 1 \
--enable-auto-tool-choice \
--tool-call-parser step_audio_2 \
--tokenizer-mode step_audio_2 \
--chat_template_content_format string \
--audio-parser step_audio_2_tts_ta4 \
--trust-remote-code

🚀 Inference Scripts

python examples.py
# python examples-base.py
# python examples-vllm.py
# python examples-think.py

🚀 Local web demonstration

pip install gradio
python web_demo.py
# python web_demo_vllm.py

Online demonstration

StepFun realtime console

StepFun AI Assistant

  • Step-Audio 2 is also available in our StepFun AI Assistant mobile App with both web and audio search tools enabled.
  • Please scan the following QR code to download it from your app store then tap the phone icon in the top-right corner.

WeChat group

You can scan the following QR code to join our WeChat group for communication and discussion.

Evaluation

Automatic speech recognition

CER for Chinese, Cantonese and Japanese and WER for Arabian and English. N/A indicates that the language is…

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable audio model repo with 1.46k stars