Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens
January 27, 2025 · Qwen Team
Introduction

Two months after upgrading Qwen2.5-Turbo to support a context length of up to one million tokens, we are back with the open-source Qwen2.5-1M models and the corresponding inference framework support. Here’s what you can expect from this release:

Open-source Models: We’re releasing two new checkpoints, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, marking the first time we’ve upgraded our open-source Qwen models to handle 1M-token contexts.

Inference Framework: To help developers deploy the Qwen2.5-1M series models more efficiently, we’ve fully open-sourced our inference framework based on vLLM. By integrating sparse attention methods, our framework can process 1M-token inputs 3x to 7x faster.

Technical Report: We’re also sharing the technical details behind the Qwen2.5-1M series, including design insights for the training and inference frameworks, as well as ablation experiments.
You can try the Qwen2.5-1M models online by visiting our demos on HuggingFace and ModelScope. Additionally, we recently introduced Qwen Chat, an advanced AI assistant from the Qwen series. With Qwen Chat, you can engage in conversations, write code, perform searches, generate images and videos, and use various tools. Notably, Qwen Chat also features the Qwen2.5-Turbo model, which supports long-context processing with a context length of up to 1M tokens.

Model Performance

Let’s start with the performance of the Qwen2.5-1M series models on both long-context and short-text tasks.

Long-Context Tasks

First, we evaluate the Qwen2.5-1M models on the Passkey Retrieval task with a context length of 1 million tokens. The results show that these models can accurately retrieve hidden information from documents containing up to 1M tokens, with only minor errors observed in the 7B model.

For more complex long-context understanding tasks, we select RULER, LV-Eval, and LongbenchChat, as used in this blog. From these results, we can draw a few key conclusions:

Significantly Superior to the 128K Version: The Qwen2.5-1M series models significantly outperform their 128K counterparts in most long-context tasks, especially for sequences exceeding 64K tokens in length.

Notable Performance Advantage: The Qwen2.5-14B-Instruct-1M model not only beats Qwen2.5-Turbo but also consistently outperforms GPT-4o-mini across multiple datasets, offering a robust open-source alternative for long-context tasks.
Short-Context Tasks

Besides performance on long sequences, we’re equally interested in how these models handle short sequences. So we compare the Qwen2.5-1M models and their 128K versions on widely used academic benchmarks, with GPT-4o-mini included for comparison. Here’s what we find:

Both Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M maintain performance on short-text tasks that is similar to their 128K versions, so their fundamental capabilities haven’t been compromised by the addition of long-sequence processing abilities.

Compared to GPT-4o-mini, both Qwen2.5-14B-Instruct-1M and Qwen2.5-Turbo achieve similar performance on short-text tasks while supporting a context length that’s eight times longer.
Key Techniques

Here, we’ll briefly introduce the key techniques behind building Qwen2.5-1M. For more details, please check out our technical report.

Long-Context Training

Training with long sequences demands substantial computational resources, so we adopt a progressive approach to expand the context length for Qwen2.5-1M through multiple stages:

We begin with an intermediate checkpoint of pre-trained Qwen2.5, which had a 4K-token context length.

In Pretraining, we gradually increase the context length from 4K to 256K tokens while using Adjusted Base Frequency, raising the RoPE base from 10,000 to 10,000,000.

In Supervised Fine-tuning, we split training into two stages to preserve performance on shorter sequences:

Stage 1: Fine-tuned only on short instructions (up to 32K tokens) using the same data and steps as the 128K versions of Qwen2.5.

Stage 2: Mixed short (up to 32K) and long (up to 256K) instructions to enhance long-context task performance while maintaining short-task quality.

In Reinforcement Learning, we train models on short texts of up to 8K tokens, which sufficiently improves alignment with human preferences and generalizes well to long-context tasks.
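To make the Adjusted Base Frequency step in the pretraining stage more concrete, here is a minimal sketch of how the RoPE base changes the wavelength of the slowest-rotating frequency pair. It uses the standard RoPE frequency formula; the head dimension of 128 and the helper name `rope_wavelengths` are assumptions chosen for illustration, not details taken from this post.

```python
import math


def rope_wavelengths(base: float, head_dim: int) -> list[float]:
    """Wavelength, in tokens, of each rotary frequency pair for a given base.

    Standard RoPE uses theta_i = base ** (-2i / head_dim), so the pair with
    index i completes one full rotation every 2 * pi / theta_i tokens.
    """
    return [
        2 * math.pi * base ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]


HEAD_DIM = 128  # assumed head dimension, for illustration only
TARGET_CONTEXT = 256 * 1024  # the 256K training length mentioned in the text

for base in (10_000, 10_000_000):
    longest = max(rope_wavelengths(base, HEAD_DIM))
    covers = longest > TARGET_CONTEXT
    print(f"base={base:>10,}  longest wavelength ~ {longest:,.0f} tokens  "
          f"exceeds 256K: {covers}")
```

Under these assumptions, the slowest pair wraps around well before 256K tokens with a base of 10,000, while with a base of 10,000,000 its wavelength is far longer than the 256K training length, which is one common intuition for why a larger base helps at longer contexts.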
The final instruction-tuned models are capable of handling sequences up to 256K tokens.

Length Extrapolation

During training, we develop an instruction-tuned model with a context length of 256K tokens. To extend this to 1M tokens, we employ length extrapolation techniques. The degradation of RoPE-based LLMs on long-context tasks is mainly due to large relative positional distances between queries and keys that were never seen during training and that appear when computing attention weights. We employ Dual Chunk Attention (DCA), which addresses this issue by remapping relative positions to smaller values, avoiding the large distances not seen during training.

We evaluate the Qwen2.5-1M models and their 128K counterparts with and without the length extrapolation method. We find that even models trained on just 32K tokens, such as Qwen2.5-7B-Instruct, achieve nearly perfect accuracy on passkey retrieval tasks with 1M-token contexts. This underscores the remarkable ability of DCA to extend supported context lengths without any additional training.

Sparse Attention

For long-context language models, inference speed is crucial for user experience. We introduce a sparse attention mechanism based on MInference to accelerate the prefill phase. Furthermore, we propose several improvements:

Integrating with Chunked Prefill: Directly processing sequences of 1M tokens results in substantial memory overhead to store the activations in MLP layers, consuming 71GB of VRAM in Qwen2.5-7B. By integrating chunked prefill with a chunk length of 32,768 tokens, activation VRAM usage is reduced by 96.7%, leading to a significant decrease in...
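The activation saving quoted for chunked prefill can be sanity-checked with simple arithmetic. The sketch below assumes that MLP activation memory scales linearly with the number of tokens processed at once and reads "1M" as 1,000,000 tokens; both are assumptions for illustration, and the helper name `activation_reduction` is hypothetical. Only the 71GB, 32,768-token, and 96.7% figures come from the text.

```python
def activation_reduction(total_tokens: int, chunk_tokens: int) -> float:
    """Fractional memory saving if activation memory is proportional to the
    number of tokens held in a single forward pass (an assumption)."""
    return 1.0 - chunk_tokens / total_tokens


TOTAL_TOKENS = 1_000_000      # assumption: "1M" interpreted as 10**6 tokens
CHUNK_TOKENS = 32_768         # chunk length stated in the text
FULL_ACTIVATION_GB = 71.0     # activation VRAM for Qwen2.5-7B stated in the text

saving = activation_reduction(TOTAL_TOKENS, CHUNK_TOKENS)
# Derived estimate under the linear-scaling assumption, not a reported figure.
chunked_gb = FULL_ACTIVATION_GB * (1.0 - saving)

print(f"reduction: {saving:.1%}")  # prints "reduction: 96.7%"
print(f"estimated activation memory per chunk: {chunked_gb:.2f} GB")
```

Under the linear-scaling assumption, the computed reduction matches the 96.7% figure, which suggests the saving comes almost entirely from holding one 32,768-token chunk of activations in memory at a time instead of the full sequence.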