PersonaMath: Boosting Mathematical Reasoning via Persona-Driven Data Augmentation

Luo, Jing; Chen, Longze; Luo, Run; Zhu, Liang; Ao, Chang; Li, Jiaming; Chen, Yukun; Cheng, Xin; Yang, Wen; Su, Jiayuan; Argha, Ahmadreza; Alinejad-Rokny, Hamid; Li, Chengming; Ni, Shiwen; Yang, Min


arXiv:2410.01504 (cs)

[Submitted on 2 Oct 2024 (v1), last revised 21 Feb 2025 (this version, v2)]



Abstract: While closed-source Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities, open-source models still face challenges with such tasks. To bridge this gap, we propose a data augmentation approach and introduce PersonaMathQA, a dataset derived from MATH and GSM8K, on which we train the PersonaMath models. Our approach consists of two stages: the first stage focuses on learning from Persona Diversification, and the second stage emphasizes learning from Reflection. In the first stage, we regenerate detailed chain-of-thought (CoT) solutions as instructions using a closed-source LLM and introduce a persona-driven data augmentation technique. This technique innovatively classifies personas based on occupations, significantly enhancing the dataset's diversity and quality. In the second stage, we incorporate reflection to fully leverage more challenging and valuable questions. Evaluation of our PersonaMath models on MATH and GSM8K reveals that the PersonaMath-7B model (based on Qwen2.5-7B) achieves an accuracy of 61.2% on MATH and 87.8% on GSM8K, surpassing all baseline methods and achieving state-of-the-art performance. Notably, our dataset contains only 128.9K data points (merely 32.6% of MetaMathQA and 49.5% of MathInstruct), yet our model outperforms these baselines, demonstrating the high quality and diversity of our dataset, which enables more efficient model training. We open-source the PersonaMathQA dataset, PersonaMath models, and our code for public usage.
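The size comparison in the abstract can be sanity-checked with simple arithmetic. The sketch below is illustrative only and is not part of the authors' code: it takes the three figures quoted in the abstract (128.9K, 32.6%, 49.5%) and derives the baseline dataset sizes they imply. The function and constant names are hypothetical.

```python
# Back-of-the-envelope check of the dataset-size ratios quoted in the abstract.
# Only the three numbers below come from the abstract; the names are hypothetical.

PERSONAMATHQA_SIZE_K = 128.9  # PersonaMathQA size, in thousands of data points

# Fraction of each baseline dataset that PersonaMathQA represents.
RATIO_OF_BASELINE = {
    "MetaMathQA": 0.326,
    "MathInstruct": 0.495,
}


def implied_baseline_sizes(size_k, ratios):
    """Return the baseline sizes (in thousands) implied by size_k / ratio."""
    return {name: size_k / ratio for name, ratio in ratios.items()}


if __name__ == "__main__":
    sizes = implied_baseline_sizes(PERSONAMATHQA_SIZE_K, RATIO_OF_BASELINE)
    for name, implied_k in sizes.items():
        print(f"{name}: about {implied_k:.1f}K data points")
```

Under these figures, the implied baseline sizes come out to roughly 395K data points for MetaMathQA and roughly 260K for MathInstruct.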

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2410.01504 [cs.CL]
	(or arXiv:2410.01504v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.01504

Submission history

From: Jing Luo
[v1] Wed, 2 Oct 2024 12:57:12 UTC (254 KB)
[v2] Fri, 21 Feb 2025 06:33:16 UTC (185 KB)
