Repo: Qwen (Alibaba Cloud), published Jan 8, 2026

QwenLM/Qwen3-VL-Embedding


Language: Python

License: Apache-2.0

Stars: 1282

Forks: 107

Open issues: 54

Created: 2026-01-08T03:42:57Z

Pushed: 2026-04-08T05:01:00Z

Default branch: main

Fork: no

Archived: no

README:

Qwen3-VL-Embedding & Qwen3-VL-Reranker

State-of-the-art multimodal embedding and reranking models built on Qwen3-VL, supporting text, images, screenshots, videos, and mixed-modal inputs for advanced information retrieval and cross-modal understanding.

---

Table of Contents

  • [Overview](#overview)
  • [Features](#features)
  • [Model Architecture](#model-architecture)
  • [Installation](#installation)
  • [Usage](#usage)
  • [Examples](#examples)
  • [Model Performance](#model-performance)
  • [Citation](#citation)

---

Overview

The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities.

Building on the success of our text-oriented Qwen3-Embedding and Qwen3-Reranker series, these multimodal models extend best-in-class performance to visual and video understanding tasks. The models work in tandem: the Embedding model handles the initial recall stage by generating semantically rich vectors, while the Reranking model manages the re-ranking stage with precise relevance scoring, significantly enhancing final retrieval accuracy.
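
The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not the repository's API: the vectors stand in for Embedding model outputs, and `rerank_score` is a hypothetical callable standing in for the Reranker model.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(query_vec: Sequence[float],
           doc_vecs: Sequence[Sequence[float]],
           k: int) -> List[int]:
    """Stage 1: return the indices of the k documents closest to the query."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:k]]


def retrieve(query_vec: Sequence[float],
             doc_vecs: Sequence[Sequence[float]],
             rerank_score: Callable[[int], float],
             k: int,
             top_n: int) -> List[int]:
    """Stage 2: re-order the recalled candidates with a (hypothetical) reranker."""
    candidates = recall(query_vec, doc_vecs, k)
    return sorted(candidates, key=rerank_score, reverse=True)[:top_n]


# Toy data: three 2-d "embeddings" and made-up reranker scores per document.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_scores = {0: 0.2, 1: 0.9, 2: 0.99}
best = retrieve([1.0, 0.0], docs, toy_scores.__getitem__, k=2, top_n=1)
```

Document 2 has the highest reranker score in this toy data but is never scored, because only candidates that survive the recall stage are reranked. That is the usual trade-off of this design: the recall budget `k` bounds what the reranker can recover.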

---

Features

  • 🎨 Multimodal Versatility: Seamlessly process inputs containing text, images, screenshots, and video within a unified framework. Achieve state-of-the-art performance across diverse tasks including image-text retrieval, video-text matching, visual question answering (VQA), and multimodal content clustering.
  • 🔄 Unified Representation Space: Leverage the Qwen3-VL architecture to generate semantically rich vectors that capture both visual and textual information in a shared space, facilitating efficient similarity estimation and retrieval across different modalities.
  • 🎯 High-Precision Reranking: The reranking model accepts input pairs (Query, Document)—where both can consist of arbitrary single or mixed modalities—and outputs precise relevance scores for superior retrieval accuracy.
  • 🌍 Exceptional Practicality:
      • Support for over 30 languages, ideal for global applications
      • Customizable instructions for task-specific optimization
      • Flexible vector dimensions with Matryoshka Representation Learning (MRL)
      • Strong performance with quantized embeddings for efficient deployment
      • Easy integration into existing retrieval pipelines
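
To make the MRL and quantization bullets concrete, here is a minimal sketch of how such embeddings are commonly post-processed. It assumes the usual MRL convention of keeping the leading dimensions and re-normalizing, and a simple symmetric int8 scheme; the README does not specify which quantization method the models were evaluated with, and all function names here are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def truncate_mrl(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and re-normalize (assumed MRL usage)."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return l2_normalize(vec[:dim])


def quantize_int8(vec: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric scalar quantization to integer codes in [-127, 127]."""
    max_abs = max((abs(x) for x in vec), default=0.0)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(x / scale) for x in vec], scale


def dequantize_int8(codes: Sequence[int], scale: float) -> List[float]:
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


full = [3.0, 4.0, 12.0, 0.0]       # stand-in for a model embedding
small = truncate_mrl(full, 2)      # unit-length 2-d prefix
codes, scale = quantize_int8(full)
restored = dequantize_int8(codes, scale)
```

Truncation trades accuracy for index size and search speed, while int8 codes cut storage to a quarter of float32. How much accuracy each step costs for these models is an empirical question that this sketch does not answer.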

---

Model Architecture

Model Specifications

| Model | Size | Layers | Sequence Length | Embedding Dimension | Quantization Support | MRL Support | Instruction Aware |
|---|---|---|---|---|---|---|---|
| Qwen3-VL-Embedding-2B | 2B | 28 | 32K | 2048 | ✅ | ✅ | ✅ |
| Qwen3-VL-Embedding-8B | 8B | 36 | 32K | 4096 | ✅ | ✅ | ✅ |
| Qwen3-VL-Reranker-2B | 2B | 28 | 32K | - | - | - | ✅ |
| Qwen3-VL-Reranker-8B | 8B | 36 | 32K | - | - | - | ✅ |

LoRA Configs

| Model | rank | alpha | target_modules |
|------|------|-------|----------------|
| Qwen3-VL-Embedding | 32 | 32 | q_proj v_proj k_proj up_proj down_proj gate_proj |
| Qwen3-VL-Reranker | 32 | 32 | q_proj v_proj k_proj up_proj down_proj gate_proj |
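
As a rough illustration of what these settings imply, the sketch below records them in a small dataclass. The field names mirror the `r`, `lora_alpha`, and `target_modules` parameters of Hugging Face PEFT's `LoraConfig`, but the README does not say which library was used, so treat that mapping as an assumption. The `alpha / r` scaling is the standard LoRA formulation, and the 2048 x 2048 layer shape in the usage lines is purely hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoraSettings:
    """Plain container for the values in the table above (illustrative only)."""
    r: int
    lora_alpha: int
    target_modules: Tuple[str, ...]

    @property
    def scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update B @ A by alpha / r.
        return self.lora_alpha / self.r

    def added_params(self, d_in: int, d_out: int) -> int:
        # A has shape (r, d_in) and B has shape (d_out, r).
        return self.r * (d_in + d_out)


QWEN3_VL_LORA = LoraSettings(
    r=32,
    lora_alpha=32,
    target_modules=("q_proj", "v_proj", "k_proj",
                    "up_proj", "down_proj", "gate_proj"),
)

scale = QWEN3_VL_LORA.scaling                        # 1.0
extra = QWEN3_VL_LORA.added_params(2048, 2048)       # hypothetical layer shape
```

With rank equal to alpha the update is applied at unit scale, and the adapters cover every attention and MLP projection except `o_proj`.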

Architecture Design

Qwen3-VL-Embedding: Dual-Tower Architecture

  • Receives single-modal or mixed-modal input and maps it into a high-dimensional semantic vector
  • Extracts the hidden state vector corresponding to the [EOS] token from the base model's last layer as the final semantic representation
  • Enables efficient, independent encoding necessary for large-scale retrieval
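
The pooling step in the second bullet can be sketched without any framework. This version assumes the [EOS] token is the last non-padding position of each sequence, and it uses nested lists in place of tensors; the repository's own implementation may differ.

```python
from typing import List, Sequence


def last_token_pool(hidden_states: Sequence[Sequence[Sequence[float]]],
                    attention_mask: Sequence[Sequence[int]]) -> List[List[float]]:
    """Pick the last-layer hidden state at the final attended position.

    hidden_states: [batch][seq_len][dim], attention_mask: [batch][seq_len] of 0/1.
    Works for both left- and right-padded batches.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        attended = [i for i, m in enumerate(mask) if m]
        if not attended:
            raise ValueError("sequence has no attended tokens")
        pooled.append(list(states[attended[-1]]))
    return pooled


# Two toy sequences of length 3 with 2-d hidden states; the first is right-padded.
hidden = [
    [[1.0, 1.0], [2.0, 2.0], [9.0, 9.0]],
    [[3.0, 3.0], [4.0, 4.0], [5.0, 5.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]
embeddings = last_token_pool(hidden, mask)  # [[2.0, 2.0], [5.0, 5.0]]
```

Because each input is encoded on its own, document vectors can be computed once and indexed ahead of time, which is what makes this design practical for large-scale recall.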

Qwen3-VL-Reranker: Single-Tower Architecture

  • Receives an input pair (Query, Document) and performs pointwise reranking
  • Uses a cross-attention mechanism for deeper, finer-grained inter-modal interaction and information fusion
  • Expresses the relevance score as the predicted generation probability of the special tokens yes and no
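
One common way to turn the yes and no token predictions into a single score is a softmax over the two logits, which reduces to a sigmoid of their difference. The sketch below assumes that formulation; the README does not spell out the exact computation, and the function name is illustrative.

```python
import math


def relevance_score(yes_logit: float, no_logit: float) -> float:
    """P(yes) under a two-way softmax over the yes/no logits.

    Equivalent to sigmoid(yes_logit - no_logit), computed in a
    numerically stable way for large positive or negative differences.
    """
    diff = yes_logit - no_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


neutral = relevance_score(0.0, 0.0)    # 0.5: the model has no preference
relevant = relevance_score(2.0, 0.0)   # above 0.5: leans towards "yes"
```

Scores of this kind are pointwise: each (Query, Document) pair is judged independently, so the values can be compared across candidates or thresholded directly.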

Feature Comparison

| | Qwen3-VL-Embedding | Qwen3-VL-Reranker |
|---------|-------------------|-------------------|
| Core Function | Semantic Representation, Embedding Generation | Relevance Scoring, Pointwise Re-ranking |
| Input | Single modality or mixed modalities | (Query, Document) pair with single- or mixed-modal inputs |
| Architecture | Dual-Tower | Single-Tower |
| Mechanism | Efficient Retrieval | Deep Inter-Modal Interaction, Precise Alignment |
| Output | Semantic Vector | Relevance Score |

Both models are built through a multi-stage training paradigm that fully leverages the powerful general multimodal semantic understanding capabilities of Qwen3-VL, providing high-quality semantic representations and precise re-ranking mechanisms for complex, large-scale multimodal retrieval tasks.

---

Installation

Setup Environment

# Clone the repository
git clone https://github.com/QwenLM/Qwen3-VL-Embedding.git
cd Qwen3-VL-Embedding

# Run the script to set up the environment
bash scripts/setup_environment.sh

The setup script will automatically:

  • Install uv if not already installed
  • Install all project dependencies

After setup completes, activate the environment:

source .venv/bin/activate

Download Models

Our models are available on both Hugging Face and ModelScope.

| Model | Hugging Face | ModelScope |
|-------|--------------|------------|
| Qwen3-VL-Embedding-2B | Link | Link |
| Qwen3-VL-Embedding-8B | Link | Link |
| Qwen3-VL-Reranker-2B | Link | Link |
| Qwen3-VL-Reranker-8B | Link | Link |

**Install download…

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable new VLM embedding model from Qwen