Qwen3: Think Deeper, Act Faster
Captured source
source ↗Qwen3: Think Deeper, Act Faster | Qwen
We have a new blog! View this page at qwen.ai . This page will automatically redirect in 5 seconds. If you are not redirected automatically, please click the button below. Go Now
Qwen3: Think Deeper, Act Faster April 29, 2025 · 10 min · 2036 words · Qwen Team | Translations: 简体中文
QWEN CHAT GitHub Hugging Face ModelScope Kaggle DEMO DISCORD Introduction # Today, we are excited to announce the release of Qwen3 , the latest addition to the Qwen family of large language models. Our flagship model, Qwen3-235B-A22B , achieves competitive results in benchmark evaluations of coding, math, general capabilities, etc., when compared to other top-tier models such as DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. Additionally, the small MoE model, Qwen3-30B-A3B , outcompetes QwQ-32B with 10 times of activated parameters, and even a tiny model like Qwen3-4B can rival the performance of Qwen2.5-72B-Instruct. We are open-weighting two MoE models: Qwen3-235B-A22B , a large model with 235 billion total parameters and 22 billion activated parameters, and Qwen3-30B-A3B , a smaller MoE model with 30 billion total parameters and 3 billion activated parameters. Additionally, six dense models are also open-weighted, including Qwen3-32B , Qwen3-14B , Qwen3-8B , Qwen3-4B , Qwen3-1.7B , and Qwen3-0.6B , under Apache 2.0 license. Models Layers Heads (Q / KV) Tie Embedding Context Length Qwen3-0.6B 28 16 / 8 Yes 32K Qwen3-1.7B 28 16 / 8 Yes 32K Qwen3-4B 36 32 / 8 Yes 32K Qwen3-8B 36 32 / 8 No 128K Qwen3-14B 40 40 / 8 No 128K Qwen3-32B 64 64 / 8 No 128K Models Layers Heads (Q / KV) # Experts (Total / Activated) Context Length Qwen3-30B-A3B 48 32 / 4 128 / 8 128K Qwen3-235B-A22B 94 64 / 4 128 / 8 128K The post-trained models, such as Qwen3-30B-A3B , along with their pre-trained counterparts (e.g., Qwen3-30B-A3B-Base ), are now available on platforms like Hugging Face , ModelScope , and Kaggle . For deployment, we recommend using frameworks like SGLang and vLLM . For local usage, tools such as Ollama , LMStudio , MLX , llama.cpp , and KTransformers are highly recommended. These options ensure that users can easily integrate Qwen3 into their workflows, whether in research, development, or production environments. We believe that the release and open-sourcing of Qwen3 will significantly advance the research and development of large foundation models. Our goal is to empower researchers, developers, and organizations around the world to build innovative solutions using these cutting-edge models. Feel free to try Qwen3 out in Qwen Chat Web ( chat.qwen.ai ) and mobile APP!
Key Features # Hybrid Thinking Modes
Qwen3 models introduce a hybrid approach to problem-solving. They support two modes: Thinking Mode: In this mode, the model takes time to reason step by step before delivering the final answer. This is ideal for complex problems that require deeper thought. Non-Thinking Mode: Here, the model provides quick, near-instant responses, suitable for simpler questions where speed is more important than depth.
This flexibility allows users to control how much “thinking” the model performs based on the task at hand. For example, harder problems can be tackled with extended reasoning, while easier ones can be answered directly without delay. Crucially, the integration of these two modes greatly enhances the model’s ability to implement stable and efficient thinking budget control. As demonstrated above, Qwen3 exhibits scalable and smooth performance improvements that are directly correlated with the computational reasoning budget allocated. This design enables users to configure task-specific budgets with greater ease, achieving a more optimal balance between cost efficiency and inference quality. Multilingual Support
Qwen3 models are supporting 119 languages and dialects . This extensive multilingual capability opens up new possibilities for international applications, enabling users worldwide to benefit from the power of these models. Language Family Languages & Dialects Indo-European English, French, Portuguese, German, Romanian, Swedish, Danish, Bulgarian, Russian, Czech, Greek, Ukrainian, Spanish, Dutch, Slovak, Croatian, Polish, Lithuanian, Norwegian Bokmål, Norwegian Nynorsk, Persian, Slovenian, Gujarati, Latvian, Italian, Occitan, Nepali, Marathi, Belarusian, Serbian, Luxembourgish, Venetian, Assamese, Welsh, Silesian, Asturian, Chhattisgarhi, Awadhi, Maithili, Bhojpuri, Sindhi, Irish, Faroese, Hindi, Punjabi, Bengali, Oriya, Tajik, Eastern Yiddish, Lombard, Ligurian, Sicilian, Friulian, Sardinian, Galician, Catalan, Icelandic, Tosk Albanian, Limburgish, Dari, Afrikaans, Macedonian, Sinhala, Urdu, Magahi, Bosnian, Armenian Sino-Tibetan Chinese (Simplified Chinese, Traditional Chinese, Cantonese), Burmese Afro-Asiatic Arabic (Standard, Najdi, Levantine, Egyptian, Moroccan, Mesopotamian, Ta’izzi-Adeni, Tunisian), Hebrew, Maltese Austronesian Indonesian, Malay, Tagalog, Cebuano, Javanese, Sundanese, Minangkabau, Balinese, Banjar, Pangasinan, Iloko, Waray (Philippines) Dravidian Tamil, Telugu, Kannada, Malayalam Turkic Turkish, North Azerbaijani, Northern Uzbek, Kazakh, Bashkir, Tatar Tai-Kadai Thai, Lao Uralic Finnish, Estonian, Hungarian Austroasiatic Vietnamese, Khmer Other Japanese, Korean, Georgian, Basque, Haitian, Papiamento, Kabuverdianu, Tok Pisin, Swahili Improved Agentic Capabilities
We have optimized the Qwen3 models for coding and agentic capabilities, and also we have strengthened the support of MCP as well. Below we provide examples to show how Qwen3 thinks and interacts with the environment.
Pre-training # In terms of pretraining, the dataset for Qwen3 has been significantly expanded compared to Qwen2.5. While Qwen2.5 was pre-trained on 18 trillion tokens, Qwen3 uses nearly twice that amount, with approximately 36 trillion tokens covering 119 languages and dialects. To build this large dataset, we collected data not only from the web but also from PDF-like documents. We used Qwen2.5-VL to extract text from these documents and Qwen2.5 to improve the quality of the extracted content. To increase the amount of math and code data, we used Qwen2.5-Math and Qwen2.5-Coder to generate synthetic data. This includes textbooks, question-answer pairs, and code snippets. The pre-training process consists of three stages. In the first stage (S1), the model was pretrained on over 30 trillion tokens with a context length of 4K tokens....
Excerpt shown — open the source for the full document.
Notability
notability 9.0/10Major frontier model release from Qwen