QwenLM/Qwen3-VL-Embedding
Language: Python
License: Apache-2.0
Stars: 1282
Forks: 107
Open issues: 54
Created: 2026-01-08T03:42:57Z
Pushed: 2026-04-08T05:01:00Z
Default branch: main
Fork: no
Archived: no
README:
Qwen3-VL-Embedding & Qwen3-VL-Reranker
State-of-the-art multimodal embedding and reranking models built on Qwen3-VL, supporting text, images, screenshots, videos, and mixed-modal inputs for advanced information retrieval and cross-modal understanding.
---
Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Model Architecture](#model-architecture)
- [Installation](#installation)
- [Usage](#usage)
- [Examples](#examples)
- [Model Performance](#model-performance)
- [Citation](#citation)
---
Overview
The Qwen3-VL-Embedding and Qwen3-VL-Reranker model series are the latest additions to the Qwen family, built upon the recently open-sourced and powerful Qwen3-VL foundation model. Specifically designed for multimodal information retrieval and cross-modal understanding, this suite accepts diverse inputs including text, images, screenshots, and videos, as well as inputs containing a mixture of these modalities.
Building on the success of our text-oriented Qwen3-Embedding and Qwen3-Reranker series, these multimodal models extend best-in-class performance to visual and video understanding tasks. The models work in tandem: the Embedding model handles the initial recall stage by generating semantically rich vectors, while the Reranking model manages the re-ranking stage with precise relevance scoring, significantly enhancing final retrieval accuracy.
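The recall-then-rerank flow described above can be sketched in a few lines. This is a minimal illustration, not the repository's API: `cosine`, `retrieve_then_rerank`, and the `rerank_score` callback are hypothetical names, the vectors stand in for embedding model outputs, and the callback stands in for a reranker call on a (Query, Document) pair.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Vector,
    doc_vecs: Sequence[Vector],
    rerank_score: Callable[[int], float],  # hypothetical: reranker score for doc index
    recall_k: int = 3,
    final_k: int = 2,
) -> List[int]:
    """Return document indices after a cheap recall stage and a precise rerank stage."""
    # Stage 1 (recall): rank all documents by embedding similarity, keep the top recall_k.
    by_similarity = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:recall_k]
    # Stage 2 (rerank): rescore only the recalled candidates with the costlier scorer.
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:final_k]


# Toy data: 2-dimensional vectors and fixed scores in place of real model outputs.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
toy_scores = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}
top = retrieve_then_rerank([1.0, 0.0], docs, lambda i: toy_scores[i])
```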
---
Features
- 🎨 Multimodal Versatility: Seamlessly process inputs containing text, images, screenshots, and video within a unified framework. Achieve state-of-the-art performance across diverse tasks including image-text retrieval, video-text matching, visual question answering (VQA), and multimodal content clustering.
- 🔄 Unified Representation Space: Leverage the Qwen3-VL architecture to generate semantically rich vectors that capture both visual and textual information in a shared space, facilitating efficient similarity estimation and retrieval across different modalities.
- 🎯 High-Precision Reranking: The reranking model accepts input pairs (Query, Document)—where both can consist of arbitrary single or mixed modalities—and outputs precise relevance scores for superior retrieval accuracy.
- 🌍 Exceptional Practicality:
  - Support for over 30 languages, ideal for global applications
  - Customizable instructions for task-specific optimization
  - Flexible vector dimensions with Matryoshka Representation Learning (MRL)
  - Strong performance with quantized embeddings for efficient deployment
  - Easy integration into existing retrieval pipelines
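With MRL-trained embeddings, a shorter vector is typically obtained by keeping the leading dimensions and renormalizing. The sketch below shows that general recipe only; `truncate_embedding` is a hypothetical helper, and whether the repository exposes its own option for choosing the output dimension is not covered by this excerpt.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and L2-normalize the result."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalize
    return [x / norm for x in head]


# Toy 3-dimensional vector standing in for a full-size embedding.
short = truncate_embedding([3.0, 4.0, 12.0], 2)
```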
---
Model Architecture
Model Specifications
| Model | Size | Layers | Sequence Length | Embedding Dimension | Quantization Support | MRL Support | Instruction Aware |
|---|---|---|---|---|---|---|---|
| Qwen3-VL-Embedding-2B | 2B | 28 | 32K | 2048 | ✅ | ✅ | ✅ |
| Qwen3-VL-Embedding-8B | 8B | 36 | 32K | 4096 | ✅ | ✅ | ✅ |
| Qwen3-VL-Reranker-2B | 2B | 28 | 32K | - | - | - | ✅ |
| Qwen3-VL-Reranker-8B | 8B | 36 | 32K | - | - | - | ✅ |
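The table lists quantization support for the embedding models but this excerpt does not say which scheme is meant. As one illustration, the sketch below applies generic symmetric int8 scalar quantization to a vector; `quantize_int8` and `dequantize` are hypothetical helpers and this is not necessarily the scheme the authors evaluated.

```python
from typing import List, Sequence, Tuple


def quantize_int8(vec: Sequence[float]) -> Tuple[List[int], float]:
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale


def dequantize(codes: Sequence[int], scale: float) -> List[float]:
    """Approximately recover the original floats from the integer codes."""
    return [c * scale for c in codes]


# Toy vector standing in for a model embedding.
original = [0.5, -1.0, 0.25]
codes, scale = quantize_int8(original)
restored = dequantize(codes, scale)
```

Stored as one byte per dimension, a 2048-dimensional vector takes 2048 bytes instead of 8192 bytes in float32, at the cost of a rounding error of at most half a quantization step per component.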
LoRA Configs
| Model | rank | alpha | target_modules |
|------|------|-------|----------------|
| Qwen3-VL-Embedding | 32 | 32 | q_proj, v_proj, k_proj, up_proj, down_proj, gate_proj |
| Qwen3-VL-Reranker | 32 | 32 | q_proj, v_proj, k_proj, up_proj, down_proj, gate_proj |
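To make the table concrete, the settings can be written down as a small record. The `LoraSettings` class below is hypothetical, and the two derived quantities assume the standard LoRA formulation (update scaled by alpha / rank, with two low-rank matrices per adapted linear layer); the excerpt does not confirm which variant was used in training.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoraSettings:
    rank: int
    alpha: int
    target_modules: Tuple[str, ...]

    @property
    def scaling(self) -> float:
        """Multiplier on the low-rank update in standard LoRA (assumed here)."""
        return self.alpha / self.rank

    def added_params(self, d_in: int, d_out: int) -> int:
        """Adapter parameters for one linear layer: A is rank x d_in, B is d_out x rank."""
        return self.rank * (d_in + d_out)


# Values copied from the table; they are the same for both model families.
settings = LoraSettings(
    rank=32,
    alpha=32,
    target_modules=("q_proj", "v_proj", "k_proj", "up_proj", "down_proj", "gate_proj"),
)
```

With rank and alpha both 32, the assumed scaling factor is 1.0, so the low-rank update is added to the base weights without rescaling.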
Architecture Design
Qwen3-VL-Embedding: Dual-Tower Architecture
- Receives single-modal or mixed-modal input and maps it into a high-dimensional semantic vector
- Extracts the hidden state vector corresponding to the `[EOS]` token from the base model's last layer as the final semantic representation
- Enables efficient, independent encoding necessary for large-scale retrieval
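The `[EOS]` pooling step can be sketched on plain lists. This assumes the `[EOS]` token is the last position the attention mask marks as real (so it works with left or right padding) and that the vector is L2-normalized afterwards; both are assumptions, and `last_token_pool` is a hypothetical helper rather than the repository's function.

```python
import math
from typing import List, Sequence


def last_token_pool(
    hidden_states: Sequence[Sequence[float]],  # last-layer states, shape [seq_len][dim]
    attention_mask: Sequence[int],             # 1 for real tokens, 0 for padding
) -> List[float]:
    """Return the normalized hidden state of the last non-padding token."""
    positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not positions:
        raise ValueError("attention_mask has no real tokens")
    vec = list(hidden_states[positions[-1]])
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# Toy sequence of three positions, the last of which is padding.
states = [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]]
pooled = last_token_pool(states, [1, 1, 0])
```

Because each input is pooled on its own, document vectors can be computed once and indexed ahead of query time.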
Qwen3-VL-Reranker: Single-Tower Architecture
- Receives an input pair `(Query, Document)` and performs pointwise reranking
- Uses a cross-attention mechanism for deeper, finer-grained inter-modal interaction and information fusion
- Expresses the relevance score by predicting the generation probability of special tokens (`yes` and `no`)
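One common way to turn the `yes` and `no` token predictions into a single score is to normalize over just those two logits, which reduces to a sigmoid of their difference. The sketch below assumes that normalization; the excerpt does not spell out the exact formula, and `relevance_score` is a hypothetical helper.

```python
import math


def relevance_score(yes_logit: float, no_logit: float) -> float:
    """Probability of `yes` under a softmax restricted to the two tokens."""
    diff = yes_logit - no_logit
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Toy logits standing in for the model's next-token logits.
score = relevance_score(2.0, -1.0)
```

The score lies in [0, 1] and depends only on the gap between the two logits, so pairs can be ranked by it directly.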
Feature Comparison
| | Qwen3-VL-Embedding | Qwen3-VL-Reranker |
|---------|-------------------|-------------------|
| Core Function | Semantic Representation, Embedding Generation | Relevance Scoring, Pointwise Re-ranking |
| Input | Single modality or mixed modalities | (Query, Document) pair with single- or mixed-modal inputs |
| Architecture | Dual-Tower | Single-Tower |
| Mechanism | Efficient Retrieval | Deep Inter-Modal Interaction, Precise Alignment |
| Output | Semantic Vector | Relevance Score |
Both models are built through a multi-stage training paradigm that fully leverages the powerful general multimodal semantic understanding capabilities of Qwen3-VL, providing high-quality semantic representations and precise re-ranking mechanisms for complex, large-scale multimodal retrieval tasks.
---
Installation
Setup Environment
    # Clone the repository
    git clone https://github.com/QwenLM/Qwen3-VL-Embedding.git
    cd Qwen3-VL-Embedding

    # Run the script to set up the environment
    bash scripts/setup_environment.sh
The setup script will automatically:
- Install `uv` if not already installed
- Install all project dependencies
After setup completes, activate the environment:
source .venv/bin/activate
Download Models
Our models are available on both Hugging Face and ModelScope.
| Model | Hugging Face | ModelScope |
|-------|--------------|------------|
| Qwen3-VL-Embedding-2B | Link | Link |
| Qwen3-VL-Embedding-8B | Link | Link |
| Qwen3-VL-Reranker-2B | Link | Link |
| Qwen3-VL-Reranker-8B | Link | Link |
**Install download…
Excerpt shown — open the source for the full document.