Qwen2-Math

Qwen2-Math

Over the past year, we have devoted considerable effort to researching and improving the reasoning capabilities of large language models, particularly their ability to tackle arithmetic and mathematical problems. Today, we’re excited to introduce Qwen2-Math, a specialized series of math-focused large language models within our Qwen2 series, including Qwen2-Math-Instruct-1.5B/7B/72B.

These models are built on the Qwen2 LLMs and significantly surpass the mathematical abilities of both open-source and closed-source models (e.g., GPT4o). We aim for Qwen2-Math to contribute to the scientific community by solving advanced mathematical problems that require complex, multi-step logical reasoning.

The flagship model, Qwen2-Math-72B-Instruct, outperforms proprietary models, including GPT-4o and Claude 3.5, in math related downstream tasks!

Requirements

  • transformers>=4.40.0 for Qwen2-Math models. The latest version is recommended.

Qwen2-Math can be deployed and inferred in the same way as Qwen2. Here we show a code snippet to show you how to use the chat model with transformers.

Performance

We evaluated our Qwen2-Math-Base models using three widely recognized English math benchmarks: GSM8K, Math, and MMLU-STEM. Additionally, we assessed three Chinese math benchmarks: CMATH, GaoKao Math Cloze, and GaoKao Math QA. All evaluations employed few-shot chain-of-thought prompting.

Performance

For Qwen2-Math-Instruct, we conducted evaluations on mathematical benchmarks in both English and Chinese. Beyond the commonly used benchmarks like GSM8K and Math, we included more challenging exams to thoroughly test the capabilities of Qwen2-Math-Instruct, such as OlympiadBench, CollegeMath, GaoKao, AIME2024, and AMC2023. For the Chinese benchmarks, we used CMATH, the 2024 Gaokao (Chinese college entrance examination), and the 2024 CN Middle School 24 (China High School Entrance Examination).

Performance

We reported performance across all benchmarks in the zero-shot setting using greedy, Maj@8, and RM@8 metrics, with the exception of multi-choice benchmarks (including MMLU STEM and multiple-choice problems in GaoKao and CN Middle School 24), which were evaluated using a 5-shot setting. Qwen2-Math-Instruct demonstrated superior performance among models of the same size, with RM@8 consistently outperforming Maj@8, particularly in the 1.5B and 7B models. This highlights the effectiveness of our Math Reward Model.

HuggingFace Collections:

https://huggingface.co/collections/Qwen/qwen2-math-66b4c9e072eda65b5ec7534d

Evaluation

Our evaluation is adapted from math-evaluation-harness. Feel free to reproduce the results of all instruction models in the Qwen2-Math series with scripts in evaluation.

Author: Qwen.

Read related articles: