ModelStepFunStepFunpublished Jan 14, 2026seen 5d

stepfun-ai/Step-Audio-R1.1

Open original ↗

Captured source

source ↗
published Jan 14, 2026seen 5dcaptured 14hhttp 200method plaintask audio-text-to-textlicense apache-2.0library transformersparams 33Bdownloads 178likes 181

Overview of Step-Audio-R1.1

 

Introduction

Step-Audio R1.1 (Realtime) is a major upgrade to Step-Audio-R1, designed for interactive spoken dialogue with both real-time responsiveness and strong reasoning capability.

Unlike conventional streaming speech models that trade intelligence for latency, R1.1 enables *thinking while speaking*, achieving high intelligence without sacrificing speed.

Mind-Paced Speaking (Low Latency)

Based on the research [*Mind-Paced Speaking*](MPS.pdf), the Realtime variant adopts a Dual-Brain Architecture:

  • A Formulation Brain responsible for high-level reasoning
  • An Articulation Brain dedicated to speech generation

This decoupling allows the model to perform Chain-of-Thought reasoning during speech output, maintaining ultra-low latency while handling complex tasks in real time.

Acoustic-Grounded Reasoning (High Intelligence)

To address the *inverted scaling* issue—where reasoning over transcripts can degrade performance—Step-Audio R1.1 grounds its reasoning directly in acoustic representations rather than text alone.

Through iterative self-distillation, extended deliberation becomes a strength instead of a liability. This enables effective test-time compute scaling and leads to state-of-the-art performance, including top-ranking results on the AA benchmark.

!image

!image

!image

Online demonstration

StepFun Audio Studio

WeChat group

You can scan the following QR code to join our WeChat group for communication and discussion.

Model Usage

📜 Requirements

  • GPU: NVIDIA GPUs with CUDA support (tested on 4×L40S/H100/H800/H20).
  • Operating System: Linux.
  • Python: >= 3.10.0.

⬇️ Download Model

First, you need to download the Step-Audio-R1 model weights.

Method A · Git LFS

git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-R1.1

Method B · Hugging Face CLI

hf download stepfun-ai/Step-Audio-R1.1 --local-dir ./Step-Audio-R1.1

🚀 Deployment and Execution

We provide two ways to serve the model: Docker (recommended) or compiling the customized vLLM backend.

🐳 Method 1 · Run with Docker (Recommended)

A customized vLLM image is required.

1. Pull the image:

docker pull stepfun2025/vllm:step-audio-2-v20250909

2. Start the service: Assuming the model is downloaded in the Step-Audio-R1 folder in the current directory.

docker run --rm -ti --gpus all \
-v $(pwd)/Step-Audio-R1.1:/Step-Audio-R1.1 \
-p 9999:9999 \
stepfun2025/vllm:step-audio-2-v20250909 \
-- vllm serve /Step-Audio-R1.1 \
--served-model-name Step-Audio-R1.1 \
--port 9999 \
--max-model-len 16384 \
--max-num-seqs 32 \
--tensor-parallel-size 4 \
--chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("\n", "") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"''"'"' -}}{%- endif -%}{{- '"'"'tool_json_schemas\n'"'"' + tools|tojson + '"'"''"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"''"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'human\n'"'"' + render_content(message["content"]) + '"'"''"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"''"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"''"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"''"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'assistant\n\n'"'"' -}}{%- endif -%}' \
--enable-log-requests \
--interleave-mm-strings \
--trust-remote-code

After the service starts, it will listen on localhost:9999.

🐳 Method 2 · Run from Source (Compile vLLM)

Step-Audio-R1 requires a customized vLLM backend.

1. Download Source Code:

git clone https://github.com/stepfun-ai/vllm.git
cd vllm

2. Prepare Environment:

python3 -m venv .venv
source .venv/bin/activate

3. Install and Compile: vLLM contains both C++ and Python code. We mainly modified the Python code, so the C++ part can use the pre-compiled version to speed up the process.

# Use pre-compiled C++ extensions (Recommended)
VLLM_USE_PRECOMPILED=1 pip install -e .

4. Switch Branch: After compilation, switch to the branch that supports Step-Audio.

git checkout feat/step-audio-support

5. Start the Service:

# Ensure you are in the vllm directory and the virtual environment is activated
source .venv/bin/activate

python3 -m vllm.entrypoints.openai.api_server \
--model ../Step-Audio-R1.1 \
--served-model-name Step-Audio-R1.1 \
--port 9999 \
--host 0.0.0.0 \
--max-model-len 65536 \
--max-num-seqs 128 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.85 \
--trust-remote-code \
--enable-log-requests \
--interleave-mm-strings \
--chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{-…

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

222 downloads, minor release.