
Qwen (Alibaba Cloud) Writing: Qwen2.5-LLM: Extending the boundary of LLMs

Captured source

qwenlm.github.io/qwenlm.github.io/blog/qwen2.5-llm


published Sep 19, 2024 · seen Jun 5 · captured Jun 7 · http 200 · method plain



Qwen2.5-LLM: Extending the boundary of LLMs September 19, 2024 · 13 min · 2680 words · Qwen Team | Translations: 简体中文

Introduction

In this blog, we delve into the details of our latest Qwen2.5 series language models. We have developed a range of decoder-only dense models, with seven of them open-sourced, spanning from 0.5B to 72B parameters. Our research indicates a significant interest among users in models within the 10-30B range for production use, as well as 3B models for mobile applications. To meet these demands, we are open-sourcing Qwen2.5-3B, Qwen2.5-14B, and Qwen2.5-32B. Furthermore, we are excited to offer additional models, including Qwen-Plus and Qwen-Turbo, available through API services via Alibaba Cloud Model Studio.

Compared with the Qwen2 series, the Qwen2.5 series has the following upgrades:

Full-scale Open-source: Because users have a strong interest in models in the 10-30B range for production and in 3B models for mobile applications, Qwen2.5 continues to open-source the four sizes shared with Qwen2 (0.5B, 1.5B, 7B, and 72B) and adds two cost-effective medium-sized models, Qwen2.5-14B and Qwen2.5-32B, plus a mobile-side model, Qwen2.5-3B. All models are highly competitive with open-source models of the same level. For example, Qwen2.5-32B beats Qwen2-72B, and Qwen2.5-14B outperforms Qwen2-57B-A14B in our comprehensive evaluations.

Larger and Higher-Quality Pre-training Dataset: The size of the pre-training dataset is expanded from 7 trillion tokens to a maximum of 18 trillion tokens.

Knowledge Enhancement: Qwen2.5 has acquired significantly more knowledge. On the MMLU benchmark, Qwen2.5-7B improves from 70.3 to 74.2 and Qwen2.5-72B from 84.2 to 86.1, compared to Qwen2-7B and Qwen2-72B. We observe that Qwen2.5 also has significant improvements on the GPQA, MMLU-Pro, MMLU-redux, and ARC-C benchmarks.

Coding Enhancement: Thanks to the technical breakthrough of Qwen2.5-Coder, Qwen2.5 has greatly improved capabilities in coding. Qwen2.5-72B-Instruct achieves scores of 55.5, 75.1, and 88.2 on LiveCodeBench (2305-2409), MultiPL-E, and MBPP, respectively, outperforming Qwen2-72B-Instruct, which scores 32.2, 69.2, and 80.2.

Math Enhancement: After integrating Qwen2-Math's technology, the mathematical ability of Qwen2.5 has also improved substantially. On the MATH benchmark, Qwen2.5-7B-Instruct and Qwen2.5-72B-Instruct score 75.5 and 83.1, up from 52.9 and 69.0 for Qwen2-7B-Instruct and Qwen2-72B-Instruct.

Better Human Preference: Qwen2.5 is capable of generating responses that align more closely with human preferences. Specifically, the Arena-Hard score for Qwen2.5-72B-Instruct has increased significantly from 48.1 to 81.2, and the MT-Bench score has improved from 9.12 to 9.35, compared to Qwen2-72B-Instruct.

Other Core Capabilities Enhancement: Qwen2.5 achieves significant improvements in instruction following, generating long texts (increased from 1K to over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. Furthermore, Qwen2.5 models are generally more resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
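To put the quoted before-and-after scores on a common footing, the following is a minimal sketch that computes the absolute and relative gain for each pair listed above. The dictionary layout and the `gains` helper are illustrative names, not part of the Qwen release; the numbers are copied from the upgrade list. Note that MT-Bench is scored on a different scale from the other benchmarks, so its absolute gain is not directly comparable.

```python
# (Qwen2 score, Qwen2.5 score) pairs quoted in the upgrade list above.
# Keys and helper names are illustrative assumptions, not from the blog.
QUOTED_SCORES = {
    "MMLU, 7B": (70.3, 74.2),
    "MMLU, 72B": (84.2, 86.1),
    "LiveCodeBench (2305-2409), 72B-Instruct": (32.2, 55.5),
    "MultiPL-E, 72B-Instruct": (69.2, 75.1),
    "MBPP, 72B-Instruct": (80.2, 88.2),
    "MATH, 7B-Instruct": (52.9, 75.5),
    "MATH, 72B-Instruct": (69.0, 83.1),
    "Arena-Hard, 72B-Instruct": (48.1, 81.2),
    "MT-Bench, 72B-Instruct": (9.12, 9.35),  # different scale from the rest
}


def gains(scores):
    """Return {benchmark: (absolute gain, relative gain in percent)}."""
    result = {}
    for name, (old, new) in scores.items():
        absolute = round(new - old, 2)
        relative = round(100.0 * (new - old) / old, 1)
        result[name] = (absolute, relative)
    return result


if __name__ == "__main__":
    # Print the benchmarks ordered by absolute gain, largest first.
    ranked = sorted(gains(QUOTED_SCORES).items(),
                    key=lambda item: item[1][0], reverse=True)
    for name, (absolute, relative) in ranked:
        print(f"{name:42s} {absolute:+6.2f} points ({relative:+.1f}%)")
```

Sorting by absolute gain makes it easy to see that the largest quoted jumps are on Arena-Hard, LiveCodeBench, and MATH, while the MMLU gains are comparatively small.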

Model Card

Here is a model card detailing the key parameters of the Qwen2.5 LLM models. This release includes seven open-sourced models with sizes ranging from 0.5B to 72B. Most models support a context length of 128K (131,072) tokens and can generate up to 8K tokens, enabling the production of extensive text outputs. The majority of these models are licensed under Apache 2.0, while Qwen2.5-3B and Qwen2.5-72B are governed by the Qwen Research License and Qwen License, respectively.

| Models | Params | Non-Emb Params | Layers | Heads (KV) | Tie Embedding | Context Length | Generation Length | License |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-0.5B | 0.49B | 0.36B | 24 | 14 / 2 | Yes | 32K | 8K | Apache 2.0 |
| Qwen2.5-1.5B | 1.54B | 1.31B | 28 | 12 / 2 | Yes | 32K | 8K | Apache 2.0 |
| Qwen2.5-3B | 3.09B | 2.77B | 36 | 16 / 2 | Yes | 32K | 8K | Qwen Research |
| Qwen2.5-7B | 7.61B | 6.53B | 28 | 28 / 4 | No | 128K | 8K | Apache 2.0 |
| Qwen2.5-14B | 14.7B | 13.1B | 48 | 40 / 8 | No | 128K | 8K | Apache 2.0 |
| Qwen2.5-32B | 32.5B | 31.0B | 64 | 40 / 8 | No | 128K | 8K | Apache 2.0 |
| Qwen2.5-72B | 72.7B | 70.0B | 80 | 64 / 8 | No | 128K | 8K | Qwen |

Performance

This section presents the performance metrics for both base language models and instruction-tuned models across various benchmark evaluations, encompassing a diverse array of domains and tasks.

Qwen2.5 Base Language Model Evaluation

The evaluation of base models primarily emphasizes their performance in natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, and multilingual capabilities.
The evaluation datasets include:

General Tasks: MMLU (5-shot), MMLU-Pro (5-shot), MMLU-redux (5-shot), BBH (3-shot), ARC-C (25-shot), TruthfulQA (0-shot), Winogrande (5-shot), HellaSwag (10-shot)

Math & Science Tasks: GPQA (5-shot), TheoremQA (5-shot), GSM8K (4-shot), MATH (4-shot)

Coding Tasks: HumanEval (0-shot), HumanEval+ (0-shot), MBPP (0-shot), MBPP+ (0-shot), MultiPL-E (0-shot) (Python, C++, Java, PHP, TypeScript, C#, Bash, JavaScript)

Multilingual Tasks: Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), Multi-Translation (Flores-101 5-shot)

Qwen2.5-72B Performance

| Datasets | Llama-3-70B | Mixtral-8x22B | Llama-3-405B | Qwen2-72B | Qwen2.5-72B |
|---|---|---|---|---|---|
| General Tasks | | | | | |
| MMLU | 79.5 | 77.8 | 85.2 | 84.2 | 86.1 |
| MMLU-Pro | 52.8 | 51.6 | 61.6 | 55.7 | 58.1 |
| MMLU-redux | 75.0 | 72.9 | - | 80.5 | 83.9 |
| BBH | 81.0 | 78.9 | 85.9 | 82.4 | 86.3 |
| ARC-C | 68.8 | 70.7 | - | 68.9 | 72.4 |
| TruthfulQA | 45.6 | 51.0 | - | 54.8 | 60.4 |
| Winogrande | 85.3 | 85.0 | 86.7 | 85.1 | 83.9 |
| HellaSwag | 88.0 | 88.7 | - | 87.3 | 87.6 |
| Mathematics & Science Tasks | | | | | |
| GPQA | 36.3 | 34.3 | - | 37.4 | 45.9 |
| TheoremQA | 32.3 | 35.9 | - | 42.8 | 42.4 |
| MATH | 42.5 | 41.7 | 53.8 | 50.9 | 62.1 |
| MMLU-stem | 73.7 | 71.7 | - | 79.6 | 82.7 |
| GSM8K | 77.6 | 83.7 | 89.0 | 89.0 | 91.5 |
| Coding Tasks | | | | | |
| HumanEval | 48.2 | 46.3 | 61.0 | 64.6 | 59.1 |
| HumanEval+ | 42.1 | 40.2 | - | 56.1 | 51.2 |
| MBPP | 70.4 | 71.7 | 73.0 | 76.9 | 84.7 |
| MBPP+ | 58.4 | 58.1 | - | 63.9 | 69.2 |
| MultiPL-E | 46.3 | 46.7 | - | 59.6 | 60.5 |
| Multilingual Tasks | | | | | |
| Multi-Exam | 70.0 | 63.5 | - | 76.6 | 78.7 |
| Multi-Understanding | 79.9 | 77.7 | - | 80.7 | 89.6 |
| Multi-Mathematics | 67.1 | 62.9 | - | 76.0 | 76.7 |

...
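Because the comparison table mixes reported scores with "-" entries, reading off the leader per benchmark takes some care. The following is a small sketch, assuming a handful of rows copied from the table above and using `None` for unreported values. The `ROWS` subset and the `leader` helper are illustrative choices, not part of the Qwen evaluation tooling.

```python
from typing import Dict, List, Optional

# Column order as in the table above.
MODELS = ["Llama-3-70B", "Mixtral-8x22B", "Llama-3-405B",
          "Qwen2-72B", "Qwen2.5-72B"]

# A subset of rows copied from the table; None stands for "-" (not reported).
ROWS: Dict[str, List[Optional[float]]] = {
    "MMLU": [79.5, 77.8, 85.2, 84.2, 86.1],
    "MMLU-Pro": [52.8, 51.6, 61.6, 55.7, 58.1],
    "GPQA": [36.3, 34.3, None, 37.4, 45.9],
    "MATH": [42.5, 41.7, 53.8, 50.9, 62.1],
    "HumanEval": [48.2, 46.3, 61.0, 64.6, 59.1],
    "MBPP": [70.4, 71.7, 73.0, 76.9, 84.7],
    "Winogrande": [85.3, 85.0, 86.7, 85.1, 83.9],
}


def leader(scores: List[Optional[float]]) -> str:
    """Return the model with the highest reported score, skipping None."""
    reported = [(score, model) for score, model in zip(scores, MODELS)
                if score is not None]
    best_score, best_model = max(reported, key=lambda pair: pair[0])
    return best_model


if __name__ == "__main__":
    for benchmark, scores in ROWS.items():
        print(f"{benchmark:12s} -> {leader(scores)}")
```

On this subset, Qwen2.5-72B does not lead every row: Llama-3-405B is ahead on MMLU-Pro and Winogrande, and Qwen2-72B is ahead on HumanEval, which matches what the table itself shows.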

Excerpt shown — open the source for the full document.

Notability

notability 9.0/10

Major frontier model release from Qwen team.