What does this model signal mean?

StepFun published stepfun-ai/Step-Audio-R1.1. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: license apache-2.0 · 264 HF downloads · A speech recognition model from StepFun.. onlylabs links this event to 1 captured evidence page and 6 related model signals.

StepFun Model: stepfun-ai/Step-Audio-R1.1

Captured source

source ↗

Hugging Face/huggingface.co/stepfun-ai/Step-Audio-R1.1

stepfun-ai/Step-Audio-R1.1 model card

Source ↗

published Jan 14, 2026seen Jun 6captured Jun 11http 200method plaintask audio-text-to-textlicense apache-2.0library transformersparams 33Bdownloads 264likes 186

Overview of Step-Audio-R1.1

&ensp;

Introduction

Step-Audio R1.1 (Realtime) is a major upgrade to Step-Audio-R1, designed for interactive spoken dialogue with both real-time responsiveness and strong reasoning capability.

Unlike conventional streaming speech models that trade intelligence for latency, R1.1 enables *thinking while speaking*, achieving high intelligence without sacrificing speed.

Mind-Paced Speaking (Low Latency)

Based on the research [*Mind-Paced Speaking*](MPS.pdf), the Realtime variant adopts a Dual-Brain Architecture:

A Formulation Brain responsible for high-level reasoning
An Articulation Brain dedicated to speech generation

This decoupling allows the model to perform Chain-of-Thought reasoning during speech output, maintaining ultra-low latency while handling complex tasks in real time.

Acoustic-Grounded Reasoning (High Intelligence)

To address the *inverted scaling* issue—where reasoning over transcripts can degrade performance—Step-Audio R1.1 grounds its reasoning directly in acoustic representations rather than text alone.

Through iterative self-distillation, extended deliberation becomes a strength instead of a liability. This enables effective test-time compute scaling and leads to state-of-the-art performance, including top-ranking results on the AA benchmark.

!image

Online demonstration

StepFun Audio Studio

Both Step-Audio-R1.1 are available in our StepFun Audio Studio.
You will need an API key from the StepFun Open Platform.

WeChat group

You can scan the following QR code to join our WeChat group for communication and discussion.

Model Usage

📜 Requirements

GPU: NVIDIA GPUs with CUDA support (tested on 4×L40S/H100/H800/H20).
Operating System: Linux.
Python: >= 3.10.0.

⬇️ Download Model

First, you need to download the Step-Audio-R1 model weights.

Method A · Git LFS

git lfs install
git clone https://huggingface.co/stepfun-ai/Step-Audio-R1.1

Method B · Hugging Face CLI

hf download stepfun-ai/Step-Audio-R1.1 --local-dir ./Step-Audio-R1.1

🚀 Deployment and Execution

We provide two ways to serve the model: Docker (recommended) or compiling the customized vLLM backend.

🐳 Method 1 · Run with Docker (Recommended)

A customized vLLM image is required.

1. Pull the image:

docker pull stepfun2025/vllm:step-audio-2-v20250909

2. Start the service: Assuming the model is downloaded in the Step-Audio-R1 folder in the current directory.

docker run --rm -ti --gpus all \
-v $(pwd)/Step-Audio-R1.1:/Step-Audio-R1.1 \
-p 9999:9999 \
stepfun2025/vllm:step-audio-2-v20250909 \
-- vllm serve /Step-Audio-R1.1 \
--served-model-name Step-Audio-R1.1 \
--port 9999 \
--max-model-len 16384 \
--max-num-seqs 32 \
--tensor-parallel-size 4 \
--chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{- content.replace("\n", "") -}}{%- elif content is mapping -%}{{- content['"'"'value'"'"'] if '"'"'value'"'"' in content else content['"'"'text'"'"'] -}}{%- elif content is iterable -%}{%- for item in content -%}{%- if item.type == '"'"'text'"'"' -%}{{- item['"'"'value'"'"'] if '"'"'value'"'"' in item else item['"'"'text'"'"'] -}}{%- elif item.type == '"'"'audio'"'"' -%}{%- endif -%}{%- endfor -%}{%- endif -%}{%- endmacro -%}{%- if tools -%}{{- '"'"'system\n'"'"' -}}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- render_content(messages[0]['"'"'content'"'"']) + '"'"''"'"' -}}{%- endif -%}{{- '"'"'tool_json_schemas\n'"'"' + tools|tojson + '"'"''"'"' -}}{%- else -%}{%- if messages[0]['"'"'role'"'"'] == '"'"'system'"'"' -%}{{- '"'"'system\n'"'"' + render_content(messages[0]['"'"'content'"'"']) + '"'"''"'"' -}}{%- endif -%}{%- endif -%}{%- for message in messages -%}{%- if message["role"] == "user" -%}{{- '"'"'human\n'"'"' + render_content(message["content"]) + '"'"''"'"' -}}{%- elif message["role"] == "assistant" -%}{{- '"'"'assistant\n'"'"' + (render_content(message["content"]) if message["content"] else '"'"''"'"') -}}{%- set is_last_assistant = true -%}{%- for m in messages[loop.index:] -%}{%- if m["role"] == "assistant" -%}{%- set is_last_assistant = false -%}{%- endif -%}{%- endfor -%}{%- if not is_last_assistant -%}{{- '"'"''"'"' -}}{%- endif -%}{%- elif message["role"] == "function_output" -%}{%- else -%}{%- if not (loop.first and message["role"] == "system") -%}{{- '"'"''"'"' + message["role"] + '"'"'\n'"'"' + render_content(message["content"]) + '"'"''"'"' -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '"'"'assistant\n\n'"'"' -}}{%- endif -%}' \
--enable-log-requests \
--interleave-mm-strings \
--trust-remote-code

After the service starts, it will listen on localhost:9999.

🐳 Method 2 · Run from Source (Compile vLLM)

Step-Audio-R1 requires a customized vLLM backend.

1. Download Source Code:

git clone https://github.com/stepfun-ai/vllm.git
cd vllm

2. Prepare Environment:

python3 -m venv .venv
source .venv/bin/activate

3. Install and Compile: vLLM contains both C++ and Python code. We mainly modified the Python code, so the C++ part can use the pre-compiled version to speed up the process.

# Use pre-compiled C++ extensions (Recommended)
VLLM_USE_PRECOMPILED=1 pip install -e .

4. Switch Branch: After compilation, switch to the branch that supports Step-Audio.

git checkout feat/step-audio-support

5. Start the Service:

# Ensure you are in the vllm directory and the virtual environment is activated
source .venv/bin/activate

python3 -m vllm.entrypoints.openai.api_server \
--model ../Step-Audio-R1.1 \
--served-model-name Step-Audio-R1.1 \
--port 9999 \
--host 0.0.0.0 \
--max-model-len 65536 \
--max-num-seqs 128 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.85 \
--trust-remote-code \
--enable-log-requests \
--interleave-mm-strings \
--chat-template '{%- macro render_content(content) -%}{%- if content is string -%}{{-...

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

222 downloads, minor release.