Qwen (Alibaba Cloud) · published Sep 19, 2024

Qwen2.5-Math: The world's leading open-sourced mathematical LLMs


September 19, 2024 · 6 min · 1258 words · Qwen Team

🚨 Qwen2.5-Math mainly supports solving English and Chinese math problems through CoT and TIR. We do not recommend using this series of models for other tasks.

Introduction

A month ago, we released Qwen2-Math, the first series of mathematical LLMs in our Qwen family. Today, we have upgraded it and open-sourced the Qwen2.5-Math series, including the base models Qwen2.5-Math-1.5B/7B/72B, the instruction-tuned models Qwen2.5-Math-1.5B/7B/72B-Instruct, and the mathematical reward model Qwen2.5-Math-RM-72B.

Unlike the Qwen2-Math series, which only supports Chain-of-Thought (CoT) for solving English math problems, the Qwen2.5-Math series supports both CoT and Tool-integrated Reasoning (TIR) for solving math problems in both Chinese and English. With CoT, the Qwen2.5-Math models achieve significant performance improvements over the Qwen2-Math models on Chinese and English mathematics benchmarks.

While CoT plays a vital role in enhancing the reasoning capabilities of LLMs, it struggles with computational accuracy and with complex mathematical or algorithmic reasoning tasks, such as finding the roots of a quadratic equation or computing the eigenvalues of a matrix. TIR can further improve the model's proficiency in precise computation, symbolic manipulation, and algorithmic manipulation. Using TIR, Qwen2.5-Math-1.5B/7B/72B-Instruct achieve 79.7, 85.3, and 87.8 respectively on the MATH benchmark.

Qwen2.5-Math: Base Models

The overall specialization pipelines of Qwen2-Math and Qwen2.5-Math are shown in the figure above. After training the Qwen2-Math base models, we further upgrade them to Qwen2.5-Math models through three primary avenues:

Utilizing Qwen2-Math-72B-Instruct models to synthesize additional high-quality mathematical pre-training data.

Aggregating more high-quality mathematical data, particularly in Chinese, from web sources, books, and code across multiple recall cycles.

Leveraging the Qwen2.5 series base model for parameter initialization, which shows more powerful language understanding, code generation, and text reasoning capabilities.
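The introduction names two computations where step-by-step CoT arithmetic tends to slip and TIR helps: the roots of a quadratic equation and the eigenvalues of a matrix. As a rough illustration of what a tool call might delegate to a Python interpreter, here is a minimal sketch using only the standard library. The function names `quadratic_roots` and `eigenvalues_2x2` are hypothetical and chosen for this example; the source does not show the code the models actually generate.

```python
import cmath


def quadratic_roots(a, b, c):
    """Return both roots of a*x^2 + b*x + c = 0 (a must be nonzero)."""
    if a == 0:
        raise ValueError("not a quadratic: a must be nonzero")
    # cmath.sqrt handles a negative discriminant by returning a complex value.
    disc = cmath.sqrt(b * b - 4 * a * c)
    return ((-b + disc) / (2 * a), (-b - disc) / (2 * a))


def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix [[p, q], [r, s]].

    Solves the characteristic polynomial
    lambda^2 - trace * lambda + det = 0.
    """
    (p, q), (r, s) = m
    trace = p + s
    det = p * s - q * r
    return quadratic_roots(1, -trace, det)


# x^2 - 3x + 2 = 0 has roots 2 and 1.
roots = quadratic_roots(1, -3, 2)

# A diagonal matrix has its diagonal entries as eigenvalues: 3 and 2.
eigs = eigenvalues_2x2([[2, 0], [0, 3]])

print(roots)
print(eigs)
```

The point of the sketch is the division of labor: the model reasons about which equation to solve, and the interpreter carries out the exact arithmetic.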

Ultimately, we construct Qwen Math Corpus v2 for Qwen2.5-Math-1.5B/7B/72B pre-training, maintaining a context length of 4K. Compared to Qwen Math Corpus v1 used for Qwen2-Math training, the total token count of Qwen Math Corpus v2 has increased from 700B to over 1T.

We evaluate our Qwen2.5-Math base models on three widely used English math benchmarks: GSM8K, MATH, and MMLU-STEM. We also evaluate on three Chinese math benchmarks: CMATH, GaoKao Math Cloze, and GaoKao Math QA. All evaluations use few-shot chain-of-thought prompting. Compared to Qwen2-Math-1.5B/7B/72B, Qwen2.5-Math-1.5B/7B/72B achieve significant improvements on all benchmarks. For example, they improve by 5.4, 5.0, and 6.3 points on MATH, and by 3.4, 12.2, and 19.8 points on GaoKao Math QA.

Qwen2.5-Math-Instruct: Instruction-Tuned Models

Similar to Qwen2-Math-Instruct, we train a math-specific reward model, Qwen2.5-Math-RM-72B, based on Qwen2.5-Math-72B. This RM is used to construct the SFT data through Rejection Sampling and also in reinforcement learning with Group Relative Policy Optimization (GRPO) after SFT. In the development of Qwen2.5-Math-Instruct, an additional iteration is conducted using the Qwen2-Math-Instruct models and Qwen2.5-Math-RM-72B to further polish response quality during Rejection Sampling. Compared with the post-training of Qwen2-Math, we further introduce TIR data and SFT data in Chinese and English for Qwen2.5-Math post-training.

We evaluate Qwen2.5-Math-Instruct on mathematical benchmarks in both English and Chinese. In addition to widely used benchmarks such as GSM8K and MATH, we include more challenging exams to fully inspect the capabilities of Qwen2.5-Math-Instruct, such as OlympiadBench, CollegeMath, GaoKao, AIME2024, and AMC2023.
For Chinese mathematical benchmarks, we use CMATH, GaoKao (Chinese College Entrance Examination 2024), and CN Middle School 24 (China High School Entrance Examination 2024). We report greedy, Maj@8, and RM@8 performance on all benchmarks in the zero-shot setting, except for the multiple-choice benchmarks (including MMLU STEM and multiple-choice problems in GaoKao and CN Middle School 24), which use a 5-shot setting.

The Qwen2.5-Math-72B-Instruct model outperforms the Qwen2-Math-72B-Instruct model by an average margin of 4.4 and 6.1 points in English and Chinese, respectively, establishing itself as the best open-source mathematical model currently available.

The flagship model, Qwen2.5-Math-72B-Instruct, significantly outperforms both open-source models and leading closed-source models (e.g., GPT-4o, Gemini Math-Specialized 1.5 Pro). Under the TIR setting with RM@8, it achieves a score of 92.9 on MATH.

With the aid of synthesized pre-training and supervised fine-tuning data from the 72B model, Qwen2.5-Math-7B-Instruct surpasses Qwen2-Math-72B-Instruct in performance. Under the CoT and TIR settings, it achieves MATH scores of 83.6 and 85.3, respectively.

Even our smallest 1.5B model achieves a MATH score of around 80 when utilizing the Python Interpreter, outperforming the majority of current models in this domain.

In more complex mathematical competition evaluations such as AIME 2024 and AMC 2023, Qwen2.5-Math-Instruct also performs well across various settings, including Greedy, Maj@64, RM@64, and RM@256. With the support of Qwen2.5-Math-RM-72B, Qwen2.5-Math-1.5B-Instruct, using RM@256 in CoT mode, solves 29 out of 40 problems on AMC 2023. Moreover, Qwen2.5-Math-72B-Instruct nearly achieves a perfect score in TIR mode, solving almost all the problems. On the extremely difficult AIME 2024 benchmark, Claude3 Opus, GPT-4 Turbo, and Gemini 1.5 Pro manage to solve only 1 or 2 questions out of 30. In contrast, Qwen2.5-Math-72B-Instruct solves 9 problems in Greedy decoding CoT mode and 12 problems in TIR mode. With the help of the RM,...
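The Maj@N and RM@N settings reported above are not defined in this excerpt. Assuming their usual meaning (Maj@N takes the most frequent final answer among N sampled responses, RM@N takes the answer of the response the reward model scores highest), the selection step can be sketched as follows. The helper names `maj_at_n` and `rm_at_n`, the sampled answers, and the reward scores are all made up for illustration and are not taken from the Qwen evaluation code.

```python
from collections import Counter


def maj_at_n(answers):
    """Majority voting: return the most frequent final answer among N samples."""
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(answers).most_common(1)[0][0]


def rm_at_n(answers, rm_scores):
    """Best-of-N: return the answer of the response with the highest reward score."""
    if len(answers) != len(rm_scores):
        raise ValueError("need exactly one reward score per sampled answer")
    best = max(range(len(answers)), key=lambda i: rm_scores[i])
    return answers[best]


# Hypothetical data: final answers from 8 sampled responses to one problem,
# with invented reward-model scores for each response.
answers = ["42", "41", "42", "40", "41", "41", "42", "41"]
scores = [0.91, 0.35, 0.88, 0.10, 0.40, 0.38, 0.95, 0.30]

print(maj_at_n(answers))         # "41" appears 4 times, more than any other answer
print(rm_at_n(answers, scores))  # "42" belongs to the top-scored response (0.95)
```

In this toy case the two strategies disagree: majority voting follows the most common answer, while the reward model can pick a minority answer it rates more highly. A benchmark score under either setting would then be the fraction of problems whose selected answer matches the reference.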
