RepoQwen (Alibaba Cloud)Qwen (Alibaba Cloud)published Sep 21, 2025seen 6d

QwenLM/Qwen3-Omni

Jupyter Notebook

Open original ↗

Captured source

source ↗
published Sep 21, 2025seen 6dcaptured 15hhttp 200method plain

QwenLM/Qwen3-Omni

Description: Qwen3-omni is a natively end-to-end, omni-modal LLM developed by the Qwen team at Alibaba Cloud, capable of understanding text, audio, images, and video, as well as generating speech in real time.

Language: Jupyter Notebook

License: Apache-2.0

Stars: 3823

Forks: 262

Open issues: 9

Created: 2025-09-21T09:46:10Z

Pushed: 2026-04-23T10:58:01Z

Default branch: main

Fork: no

Archived: no

README:

Qwen3-Omni

💜 Qwen Chat&nbsp&nbsp | &nbsp&nbsp🤗 Hugging Face&nbsp&nbsp | &nbsp&nbsp🤖 ModelScope&nbsp&nbsp | &nbsp&nbsp📑 Blog&nbsp&nbsp | &nbsp&nbsp📚 Cookbooks&nbsp&nbsp | &nbsp&nbsp📑 Paper&nbsp&nbsp

🖥️ Hugging Face Demo&nbsp&nbsp | &nbsp&nbsp 🖥️ ModelScope Demo&nbsp&nbsp | &nbsp&nbsp💬 WeChat (微信)&nbsp&nbsp | &nbsp&nbsp🫨 Discord&nbsp&nbsp | &nbsp&nbsp📑 API

We release Qwen3-Omni, the natively end-to-end multilingual omni-modal foundation models. It is designed to process diverse inputs including text, images, audio, and video, while delivering real-time streaming responses in both text and natural speech. Click the video below for more information 😃

English Version

Chinese Version

News

  • 2025.09.26: ⭐️⭐️⭐️ Qwen3-Omni reaches top-1 on Hugging Face Trending!
  • 2025.09.22: 🎉🎉🎉 We have released Qwen3-Omni. For more details, please check our blog!

Contents

  • [Overview](#overview)
  • [Introduction](#introduction)
  • [Model Architecture](#model-architecture)
  • [Cookbooks for Usage Cases](#cookbooks-for-usage-cases)
  • [QuickStart](#quickstart)
  • [Model Description and Download](#model-description-and-download)
  • [Transformers Usage](#transformers-usage)
  • [vLLM Usage](#vllm-usage)
  • [DashScope API Usage](#dashscope-api-usage)
  • [Usage Tips (Recommended Reading)](#usage-tips-recommended-reading)
  • [Interaction with Qwen3-Omni](#interaction-with-qwen3-omni)
  • [Online Demo](#online-demo)
  • [Real-Time Interaction](#real-time-interaction)
  • [Launch Local Web UI Demo](#launch-local-web-ui-demo)
  • [Docker](#-docker)
  • [Evaluation](#evaluation)
  • [Performance of Qwen3-Omni](#performance-of-qwen3-omni)
  • [Setting for Evaluation](#setting-for-evaluation)
  • [Citation](#citation)

Overview

Introduction

Qwen3-Omni is the natively end-to-end multilingual omni-modal foundation models. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. We introduce several architectural upgrades to improve performance and efficiency. Key features:

  • State-of-the-art across modalities: Early text-first pretraining and mixed multimodal training provide native multimodal support. While achieving strong audio and audio-video results, unimodal text and image performance does not regress. Reaches SOTA on 22 of 36 audio/video benchmarks and open-source SOTA on 32 of 36; ASR, audio understanding, and voice conversation performance is comparable to Gemini 2.5 Pro.
  • Multilingual: Supports 119 text languages, 19 speech input languages, and 10 speech output languages.
  • Speech Input: English, Chinese, Korean, Japanese, German, Russian, Italian, French, Spanish, Portuguese, Malay, Dutch, Indonesian, Turkish, Vietnamese, Cantonese, Arabic, Urdu.
  • Speech Output: English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean.
  • Novel Architecture: MoE-based Thinker–Talker design with AuT pretraining for strong general representations, plus a multi-codebook design that drives latency to a minimum.
  • Real-time Audio/Video Interaction: Low-latency streaming with natural turn-taking and immediate text or speech responses.
  • Flexible Control: Customize behavior via system prompts for fine-grained control and easy adaptation.
  • Detailed Audio Captioner: Qwen3-Omni-30B-A3B-Captioner is now open source: a general-purpose, highly detailed, low-hallucination audio captioning model that fills a critical gap in the open-source community.

Model Architecture

Cookbooks for Usage Cases

Qwen3-Omni supports a wide range of multimodal application scenarios, covering various domain tasks involving audio, image, video, and audio-visual modalities. Below are several cookbooks demonstrating the usage cases of Qwen3-Omni and these cookbooks include our actual execution logs. You can first follow the [QuickStart](#quickstart) guide to download the model and install the necessary inference environment dependencies, then run and experiment locally—try modifying prompts or switching model types, and enjoy exploring the capabilities of Qwen3-Omni!

Category Cookbook Description Open

Audio Speech Recognition Speech recognition, supporting multiple languages and long audio.

Speech Translation Speech-to-Text / Speech-to-Speech translation.

Music Analysis Detailed analysis and appreciation of any music, including style, genre, rhythm, etc.

Sound Analysis Description and analysis of various sound effects and audio signals.

Audio Caption Audio captioning, detailed description of any audio input.

Mixed Audio Analysis Analysis of mixed audio content, such as speech, music, and environmental sounds.

Visual OCR OCR for complex images.

Object Grounding Target detection and grounding.

Image Question Answering arbitrary questions about any image.

Image Math Solving complex mathematical problems in images, highlighting the capabilities of the Thinking model.

Video Description Detailed description of video content.

Video Navigation Generating navigation commands from first-person motion videos.

Video Scene Transition Analysis of scene transitions in videos.

Audio-Visual Audio Visual Question Answering arbitrary questions in audio-visual scenarios, demonstrating the model's ability to model temporal alignment between audio and video.

Audio Visual Interaction Interactive communication with the model using audio-visual inputs, including task specification via audio.

Audio Visual Dialogue Conversational interaction with the model using audio-visual inputs, showcasing its capabilities in casual chat and assistant-like behavior.

Agent Audio Function Call Using audio input to perform function calls, enabling agent-like behaviors.

Downstream Task Fine-tuning Omni Captioner Introduction and capability demonstration of Qwen3-Omni-30B-A3B-Captioner,…

Excerpt shown — open the source for the full document.

Notability

notability 10.0/10

Major frontier model release, huge traction