ChatGPT as a Math Questioner? Evaluating ChatGPT on Generating Pre-university Math Questions

Phuoc Pham Van Long∗, Duc Anh Vu∗, Nhat M. Hoang∗
Nanyang Technological University, Singapore
[email protected], [email protected], [email protected]

Xuan Long Do, Anh Tuan Luu
Singapore
[email protected], [email protected]
ABSTRACT
Mathematical questioning is crucial for assessing students' problem-solving skills. Since manually creating such questions requires substantial effort, automatic methods have been explored. Existing state-of-the-art models rely on fine-tuning strategies and struggle to generate questions that heavily involve multiple steps of logical and arithmetic reasoning. Meanwhile, large language models (LLMs) such as ChatGPT have excelled in many NLP tasks involving logical and arithmetic reasoning. Nonetheless, their applications in generating educational questions are underutilized, especially in the field of mathematics. To bridge this gap, we take the first step to conduct an in-depth analysis of ChatGPT in generating pre-university math questions. Our analysis is categorized into two main settings: context-aware and context-unaware. In the context-aware setting, we evaluate ChatGPT on existing math question-answering benchmarks covering elementary, secondary, and tertiary classes. In the context-unaware setting, we evaluate ChatGPT in generating math questions for each lesson from pre-university math curriculums that we crawl. Our crawling results in TopicMath1, a comprehensive and novel collection of pre-university math curriculums collected from 121 math topics and 428 lessons from elementary, secondary, and tertiary classes. Through this analysis, we aim to provide insight into the potential of ChatGPT as a math questioner1.

CCS CONCEPTS
• Artificial Intelligence; • Computational Linguistics → Large Language Models; • Educational Question Generation;

KEYWORDS
ACM proceedings, LaTeX, text tagging

ACM Reference Format:
Phuoc Pham Van Long, Duc Anh Vu, Nhat M. Hoang, Xuan Long Do, and Anh Tuan Luu. 2024. ChatGPT as a Math Questioner? Evaluating ChatGPT on Generating Pre-university Math Questions. In Proceedings of ACM SAC Conference (SAC'24). ACM, New York, NY, USA, Article 4, 9 pages. https://doi.org/xx.xxx/xxx_x

∗ Equal contribution.
† Corresponding author.
1 Our codes and data are publicly available at https://github.com/dxlong2000/ChatGPT-as-a-Math-Questioner.

SAC'24, April 8–April 12, 2024, Avila, Spain. © 2024 Association for Computing Machinery. ACM ISBN 979-8-4007-0243-3/24/04. https://doi.org/xx.xxx/xxx_x

1 INTRODUCTION
Math problems are essential educational tools for evaluating students' logical and problem-solving abilities [10, 36]. Engaging students in answering those expert-designed questions has been shown to improve their learning outcomes [12, 28]. Nonetheless, manually crafting such questions demands substantial human effort and expertise, making it time-consuming, non-generalizable, and impractical for scalability [22]. Therefore, automatic tools to generate mathematical questions have received growing attention [20, 37]. Existing state-of-the-art frameworks primarily rely on fine-tuning strategies [34, 35, 37, 42]. However, these approaches are criticized for their limitations in generating questions that necessitate multi-step reasoning [15]. Recent progress in large language models (LLMs), like ChatGPT [24], has garnered significant interest and demonstrated remarkable efficacy in numerous natural language processing (NLP) tasks through the use of prompts. Nevertheless, their potential and benefits in crafting educational questions, especially within mathematics, remain underinvestigated.

In this work, we take the first step to conduct an in-depth analysis of the potential of applying ChatGPT in automatically generating pre-university math questions. We categorize our analysis into two main scenarios: (1) context-aware, where the model is given a context to generate math questions either with or without an expected answer, and (2) context-unaware, where the model generates math questions based solely on an instructional prompt. Under the context-aware setting, ChatGPT is evaluated on three math question-answering benchmarks from elementary, secondary, and tertiary classes, respectively. In the context-unaware scenario, where no prior context is available, assessing ChatGPT is more challenging because model performance varies significantly with different instructional prompts. Nonetheless, this setting is more realistic and helpful, since teachers may not have any contexts or stories at hand from which to generate math questions.
In addition, our evaluation reveals that the performance of the model varies when generating questions on different math topics. Therefore, to exhaustively evaluate the model in the context-unaware setting, we hire expert students, who are high-school national math olympians from universities, to crawl 428 math lessons from Khan Academy2 with their mathematical definitions and exemplary problems, spanning 121 math topics and covering most topics from grade 1 to tertiary classes. We then instruct ChatGPT to generate question-answer pairs for each lesson, given desired difficulty levels. Through our analysis, we derive a number of noteworthy findings. Our contributions are summarized below:
(i) We are the first to conduct a comprehensive analysis of the feasibility of leveraging ChatGPT in generating pre-university math questions.
(ii) We study two main settings in generating mathematical questions, and we further extend our evaluation to a large number of math topics and lessons covering most of the pre-university curriculum.
(iii) We contribute TopicMath, a novel and comprehensive collection of expert-authored pre-university math curriculums.
(iv) We provide 11 findings about the capability of ChatGPT in generating pre-university math questions. We hope these findings offer useful insights for teachers and researchers in utilizing modern AI technologies like ChatGPT for educational purposes.

2 RELATED WORK

2.1 Large Language Models & Prompting
Recently, LLMs have shown remarkable zero-shot and few-shot abilities in various language generation contexts [2, 25, 39]. However, they still face challenges in more complex tasks like mathematical reasoning [9, 11], often requiring expensive computational resources for fine-tuning. To address this, researchers have been exploring novel prompting methods to instruct LLMs in these tasks, including chain-of-thought (CoT) prompting [40]. CoT enables LLMs to perform intermediate reasoning steps, which significantly enhances their reasoning abilities, especially for complex mathematical and decision-making tasks.

2.2 Pre-university Math Problems Generation
Pre-university math problems have received increasing attention from the AI research community, with benchmarks such as SVAMP [27] for elementary-level math, GSM8K [5], which offers diverse solution templates at the secondary-school level, and the MATH [10] dataset, which provides complex tertiary/olympiad problems with step-by-step solutions. Recently, interest has grown in other tertiary math topics such as geometry problems and mathematical theorem proving [3, 32]. In education, automatic question generation (QG) has gained attention for enhancing teaching activities [16], and QG with LLMs, particularly ChatGPT, has attracted significant interest for generating practice questions in various subjects [13, 38]. However, their potential for generating pre-university mathematics problems remains largely unexplored. This study, therefore, evaluates ChatGPT's performance using three well-established datasets: SVAMP, GSM8K, and MATH, covering pre-university grades and various difficulty levels.

3 PROBLEM FORMULATION
We study ChatGPT3 on generating math problems in both context-aware and context-unaware settings across various pre-university difficulty levels, including elementary, secondary, and tertiary.

• Context-aware. We evaluate models in both answer-aware and answer-unaware modes. In the answer-aware setting, we provide a context C and evaluate the models on generating math questions, with each sample represented as (C, Q, A), where C is the context, Q is the question, and A is the answer. The models are then fine-tuned or run in inference mode to generate Q given C and A. In the answer-unaware setting, models generate questions conditioned solely on C, with A being unavailable.

• Context-unaware. The absence of a context C poses a unique challenge for assessing ChatGPT's math problem generation capabilities. Nonetheless, this scenario is essential, as teachers often seek to prompt language models like ChatGPT for math questions without any prior context. To address this, we manually collect math curricula for three pre-university levels and propose a prompting framework to create PRE-UMATH, a novel dataset with 16K question-answer pairs spanning 121 pre-university math topics and 428 lessons. Our evaluation on PRE-UMATH provides valuable insights into ChatGPT's math question generation capability.

4 CONTEXT-AWARE METHODOLOGY

4.1 Fine-tuning Baselines
We fine-tune the baselines to generate the question Q, given the context C with or without the expected answer A, by concatenating the input in the format Context: C [with/without] Answer: A. The model then learns to generate Q.

4.2 Prompting ChatGPT
We prompt ChatGPT to generate a math question using C with or without A. Empirical experiments in Table 1 demonstrate that imposing constraints produces questions closer to the ground truth. Hence, we propose the following constraints for this task. To ensure coherence and comparability with ground-truth questions, we instruct ChatGPT to generate concise questions (1) without excessive context repetition. This constraint minimizes disparities with the ground-truth question, improving fluency (e.g., before: [Context] [Question] [Redundant Context]; after: [Context] [Question]). To maintain consistency, we emphasize that the generated question should (2) match the tense of the provided context. This constraint helps ensure that the question appears grammatically correct and coherent within the given context (e.g., before: [Past-tense Context] [Present-tense Question]; after: [Past-tense Context] [Past-tense Question]). Finally, to promote brevity and clarity in the generated questions, we set a (3) word limit of no more than 20 words.

2 https://www.khanacademy.org/
3 https://openai.com/blog/chatgpt
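To make Sections 4.1 and 4.2 concrete, the sketch below shows one way to assemble the fine-tuning input string and a constrained ChatGPT prompt. It is a minimal illustration: the exact constraint wording, the helper names, and the example values are assumptions, not the authors' released code.

    # Minimal sketch (Python). Constraint wording and helper names are assumptions.

    def build_finetune_input(context: str, answer: str | None = None) -> str:
        """Seq2seq input for the fine-tuned baselines (Section 4.1):
        Context: C [with/without] Answer: A."""
        if answer is not None:
            return f"Context: {context} Answer: {answer}"
        return f"Context: {context}"

    def build_chatgpt_prompt(context: str, answer: str | None = None) -> str:
        """Answer-aware/unaware prompt encoding the three constraints of Section 4.2."""
        constraints = ("Constraints: (1) do not repeat the context excessively; "
                       "(2) match the tense of the given context; "
                       "(3) use at most 20 words.")
        parts = ["Generate one concise math question for the context below.",
                 constraints,
                 f"Context: {context}"]
        if answer is not None:  # answer-aware mode: condition on the expected answer
            parts.append(f"Expected answer: {answer}")
        return "\n".join(parts)

    # Example usage (answer-aware): the string is sent to ChatGPT as one user message.
    print(build_chatgpt_prompt(
        "A football team played 22 games. They won 8 more than they lost.", "15"))

In the answer-unaware mode, the same helpers are simply called with answer=None.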
Table 1: Comparison of ChatGPT's performance with and without constraints: Contextual Independence, Tense Matching, and Word Limit
5 CONTEXT-UNAWARE METHODOLOGY

5.1 TopicMath Creation
Math problems cover a wide range of math topics with varying levels of logical complexity. Our preliminary experiments suggest that the performance of ChatGPT on different math topics might not be the same. To evaluate its capability thoroughly, we specifically examine the model on multiple topics and lessons that we crawl for elementary, secondary, and tertiary classes. Since, after exhaustive searches, we found no internationally standard math curriculums consisting of topics and lessons for these classes, we choose Khan Academy4 as our source of curriculums, as it is well recognized both in the US and internationally. Our crawling results in TopicMath, a comprehensive and novel collection of pre-university math curriculums collected from 121 math topics and 428 lessons from elementary, secondary, and tertiary classes. Its statistics are presented in Table 2.

Table 2: TopicMath statistics: topics and subtopics collected by grade level

Class         #collected topics   #total topics   #collected subtopics   #total subtopics   %removed subtopics
Grade 1               2                 3                  4                   17                76.47%
Grade 2               6                 8                 24                   33                27.27%
Grade 3              11                14                 36                   65                44.62%
Grade 4              11                14                 20                   75                73.33%
Grade 5              15                16                 30                   64                52.13%
Grade 6              11                11                 30                   74                59.46%
Grade 7               7                 7                 23                   47                51.06%
Grade 8               7                 7                 32                   52                38.46%
High school          51                81                229                  476                51.89%
Total               121               161                428                  903                52.60%

(1) Curriculum Collection. We first hire six undergraduate students in mathematics who achieved high-school national mathematical olympiad medals. They are divided into three groups of two students each, and each group is instructed as follows to collect the math curriculums from Khan Academy. First, they collect all the math topics (chapters' titles) and lessons' titles from the 14 courses on Khan Academy, ranging from elementary school to tertiary. In addition to the titles, students are also asked to collect one exemplary question per lesson from the Example section, rate its difficulty following our definitions in Section 6.3, and collect the lessons' definitions from the About section or the FAQ and Review sections at the end of each chapter. If students could not find any definition in the above sections, they were asked to attempt to find the lesson's definition from the introductory Video. If the students could not find an appropriate definition or example for a lesson, the lesson was not collected. Among the 14 available Khan Academy courses, spanning from elementary school to tertiary, the rate of removed lessons is 52.60% (Table 2).

(2) Create Examples' Answers. After getting topics, lessons, definitions, and exemplary questions with their difficulties, we ask annotators to prompt ChatGPT via zero-shot Chain-of-Thought [14] to obtain the questions' explainable solutions. These solutions are then reviewed and edited as needed. As per the data presented in Table 3, the average edit rate at the token level stands at 12.74%.

Table 3: Data collecting and annotating process, with an edit rate of 12.75%. Bracketed values show the parts of ChatGPT's answer that annotators deleted and replaced.
Source: Khan Academy
Question: What is the value of x in the figure shown below: ...
Edited ChatGPT's answer: Based on the information given ... and Congruence of Triangles - SSS to ... According to the Congruence of Triangles - SSS ... Hence, angle ABC = angle BCD = 88. Substituting ... 88 degrees [was: 90 degrees] + 39 degrees = 180 degrees. Therefore, the value of x in the figure is 53 degrees [was: 51 degrees].

(3) Curriculum Expert Verification. In our final step, we hire three educators who hold degrees in education and are currently math teachers in elementary, secondary, and tertiary schools. They are invited to subjectively verify the correctness and appropriateness of all the collected topics, lessons, definitions, and question-answer pairs with their difficulties. If any collected data is found to be inappropriate or theoretically incorrect, educators have the option to edit or discard it. Their approval rate is 87.70%. Finally, we collect 121 topics and 428 lessons with 428 examples. We name this dataset TopicMath.

4 https://www.khanacademy.org/
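As an illustration of the zero-shot Chain-of-Thought prompting used in step (2), the sketch below appends the standard "Let's think step by step." trigger from [14] to a collected exemplary question; the ask_chatgpt helper is an assumed wrapper around the ChatGPT API rather than code from the paper.

    # Sketch of step (2): obtaining an explainable solution via zero-shot CoT.
    # `ask_chatgpt` is an assumed helper that sends one user message and returns the reply text.

    def cot_solution_prompt(question: str) -> str:
        """Zero-shot Chain-of-Thought prompt in the style of Kojima et al. [14]."""
        return f"Q: {question}\nA: Let's think step by step."

    def generate_explainable_solution(question: str, ask_chatgpt) -> str:
        # The returned step-by-step solution is then reviewed and edited by the annotators.
        return ask_chatgpt(cot_solution_prompt(question))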
5.2 TopicMath Analysis
Topic & Subtopic Distribution. We observe that the number of collected grade-1 topics is the smallest: because the difficulty levels at this grade are not diverse and the number of mathematical operators and methods is limited, fewer grade-1-level math questions are collected compared to other levels. Meanwhile, grade 5 has the highest number of topics, since a significant number of grade-5-level math questions proved to be highly compatible with our collection criteria and constraints.

Removal Ratio Analysis. In the process of collecting math definitions, questions, and answers for grade 1, our annotators observe the absence of definitions and an overabundance of similar question types. Consequently, a significant portion of grade-1-level questions had to be excluded from our collection due to the stringent criteria and constraints we employ.

5.3 Prompting ChatGPT to Generate Educational Questions from Math Topics
We prompt ChatGPT to generate QA pairs from TopicMath for our evaluation purposes. Our inference strategy uses prompts that promote token diversification, topic alignment, and difficulty alignment. The procedure is presented in Algorithm 1. Specifically, we maintain a list of generated QA pairs for each lesson in TopicMath. Given a lesson, our prompt consists of its definition, the name of the topic it belongs to, and one demonstration randomly selected from its list of generated QA pairs. After obtaining a newly generated QA pair, we accept it only if its question has a ROUGE-L score below 0.7 with every question already generated for any lesson; otherwise, we filter it out. To promote token diversity in the generated educational questions, we utilize two strategies. For grades 1-8, we ask ChatGPT to enrich the generated questions with objects and stories by adding "You could introduce characters, objects or scenarios to make your math problem context more diverse in terms of token" to the prompts. For tertiary classes, the problems might be complicated and require more rigorous and abstract thinking. Therefore, instead of requiring a real-life context, we instruct the model to introduce more variables in naming the objects by supplementing "Your questions are required to be diverse in terms of tokens, which can be achieved by paraphrasing the question or introducing/renaming variables" to the prompts, so that the abstract contexts can be generalized. We name the dataset generated from ChatGPT for our evaluations PRE-UMATH; it consists of 16K QA pairs. Our prompt template is below.

Algorithm 1: Prompting ChatGPT to generate math questions
Input: course, topic, subtopic, gen_limit
Initialize: q_pool = set()
for lesson in course[topic][subtopic] do
    definition = lesson["definition"]
    qa_queue = lesson["qa_queue"]
    for i in range(gen_limit) do
        random.shuffle(qa_queue)
        qa_pair = qa_queue[0]
        difficulty = qa_pair["difficulty"]
        prompt = create_prompt(topic, subtopic, definition, difficulty, qa_pair)
        q, a = get_chatgpt_answer(prompt)
        if not filter_question(q, q_pool) then
            qa_queue.push((q, a))
            q_pool.push(q)
        end
    end
end

Base prompt for generating pre-university math questions:
    Define Difficulty Level is: ...
    Define the [subtopic name] topic as: [subtopic definition].
    Generate a math problem with its answer at Difficulty Level [difficulty] in the topic of [topic]: [subtopic name].
    Your questions are required to be diverse ...
    You are also given an example: [prompt demonstration]
    Generated question: ...

We also conduct an in-depth analysis of PRE-UMATH to better understand how large and diverse our evaluation is in terms of topic, lesson, and difficulty. Its statistics are presented in Table 4. Our analysis offers several key insights. First, regarding difficulty, PRE-UMATH encompasses question-answer pairs from five distinct difficulty levels, with Level 4 being the most prevalent at 54.7% and Level 5 accounting for 13.5%. In terms of topic and lesson distribution, grade 1 exhibits the fewest topics (2), while tertiary classes have the most (51), likely due to their broad subject range. Lesson distribution mirrors topic distribution, with grade 1 having the fewest lessons and tertiary classes the most. Regarding the distribution of generated QA pairs, we observe that for certain lessons, such as Polynomial factorization (437 QA pairs) and Trigonometry (516 QA pairs), ChatGPT can generate substantial numbers of QA pairs, whilst for other lessons, such as Absolute value & piecewise functions (19 QA pairs), these numbers are significantly smaller. This is because problems in certain lessons can have multiple conditions and mathematical scenarios, which results in a high number of generated variants, while questions in other lessons can be either too narrow or too specific, leading to limited variants. Therefore, the number of generated QA pairs does not increase monotonically with the number of lessons. By grade, the number of generated QA pairs is highest for tertiary classes (11,032), while among the secondary and elementary grades, grades 6 and 8 have the highest counts and grade 1 the lowest.

Table 4: PRE-UMATH statistics by grade

Class         #topics   #subtopics   #QA pairs   Difficulty distribution (1 / 2 / 3 / 4 / 5)
Grade 1           2          4            61      61 / 0 / 0 / 0 / 0
Grade 2           6         24           327     327 / 0 / 0 / 0 / 0
Grade 3          11         36           506     506 / 0 / 0 / 0 / 0
Grade 4          11         20           350       0 / 350 / 0 / 0 / 0
Grade 5          15         30           558       0 / 558 / 0 / 0 / 0
Grade 6          11         30          1165       0 / 0 / 1165 / 0 / 0
Grade 7           7         23           853       0 / 0 / 853 / 0 / 0
Grade 8           7         32          1231       0 / 0 / 1231 / 0 / 0
High school      51        229         11032       0 / 0 / 60 / 8796 / 2176
Total           121        428         16083     894 / 908 / 3309 / 8796 / 2176
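The acceptance test in Algorithm 1 can be sketched as follows. The LCS-based ROUGE-L F-measure and the 0.7 threshold follow the description above, while the whitespace tokenization and function signatures are illustrative assumptions rather than the authors' exact implementation.

    # Sketch of the ROUGE-L-based duplicate filter used in Algorithm 1 (assumed details:
    # whitespace tokenization and an unweighted LCS F-measure).

    def _lcs_length(a: list[str], b: list[str]) -> int:
        """Length of the longest common subsequence between two token lists."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, tok_a in enumerate(a, 1):
            for j, tok_b in enumerate(b, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b else max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    def rouge_l(candidate: str, reference: str) -> float:
        """ROUGE-L F1 between two questions."""
        c, r = candidate.lower().split(), reference.lower().split()
        if not c or not r:
            return 0.0
        lcs = _lcs_length(c, r)
        prec, rec = lcs / len(c), lcs / len(r)
        return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

    def filter_question(question: str, q_pool: set[str], threshold: float = 0.7) -> bool:
        """Return True if the question is too similar to any previously generated one."""
        return any(rouge_l(question, prev) >= threshold for prev in q_pool)

A newly generated question is kept only when filter_question returns False, mirroring the "if not filter_question(q, q_pool)" test in Algorithm 1.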
6 EXPERIMENTATION

6.1 Context-aware Experimentation
• Datasets. We use SVAMP [27], GSM8K [5], and MATH [10] as our math question generation benchmarks. SVAMP suits elementary-level math, GSM8K targets secondary-school students, whilst MATH encompasses tertiary and olympiad levels, covering a wide range of mathematical topics.

• Data Pre-processing. While the SVAMP and GSM8K datasets provide the context and question separately, the MATH dataset lacks this separation. To address this, we first segment MATH contexts into individual sentences; the annotators then identify the most suitable sentence for forming a question, and the rest becomes the context. In cases where the information is insufficient, the annotators can exclude these samples. Finally, in contrast to GSM8K and MATH, which provide separate train and test sets, we split the SVAMP dataset into train and test sets ourselves due to the absence of this division.

• Baselines. We use GPT-2 [30], BART [17], T5 [31], MixQG [23], Flan-T5 [4], and ProphetNet [29] as our fine-tuned question generation baselines. They were initialized with [41] checkpoints and fine-tuned for 10k iterations using the AdamW optimizer [21]. Learning rates of 1e-5, 5e-5, and 5e-5 were used for SVAMP, GSM8K, and MATH, respectively.

• Automatic Evaluation. In the answer-aware setup, our aim is to generate questions that resemble the ground-truth questions as closely as possible. Following previous works [7, 8, 23], we use BLEU-4 [26], ROUGE-L [19], and METEOR [1] as our n-gram evaluation metrics, and use BERTScore [43] to measure the similarity between the generated candidate and ground-truth questions. In the answer-unaware setting, where the answer and the ground-truth question are unavailable, we follow [6, 33] and measure the Diversity of generated questions with Distinct-1,2 [18] and their Relevancy with respect to the context using BERTScore.

• Human Evaluation. To further assess the quality of the generated questions against human preferences, we conduct a human study on 200 randomly selected cases from each dataset. The best-performing fine-tuned baseline (based on the average of all metrics) and ChatGPT are selected for evaluation. Three native-English educators are then hired to rate the models (1-5) on 5 criteria: Difficulty, Relevancy, Grammaticality, Answerability, and Usefulness. The detailed scoring criteria for these metrics are provided in Section 6.3.

6.2 Context-unaware Experimentation
In the context-unaware setting, since there are no ground-truth questions, we rely solely on human evaluations. We perform human evaluation on 500 randomly selected samples, with 100 questions coming from each prompted difficulty. We hire three educators who are native English speakers to evaluate ChatGPT on 5 criteria: (1) Grammaticality, assessing the grammatical accuracy of the generated question; (2) Answerability, measuring how plausibly the generated question can be answered; (3) Topic Alignment, assessing the question's relevancy to the topic; (4) Difficulty Alignment, comparing the expected and generated difficulty; and (5) Usefulness, assessing the mathematical usefulness of the generated question to education in general. The scoring criteria for these metrics are provided in Section 6.3.

6.3 Human Rating System
This section presents the human evaluation criteria employed to assess the quality of the datasets in both the context-aware and context-unaware settings. These criteria were selected with their widespread usage in mind, to ensure an effective evaluation of the datasets' quality.

For both the answer-aware and answer-unaware settings, our human evaluators assess questions on multiple criteria: Difficulty, Relevancy, Grammaticality, and Answerability. When evaluating Difficulty, we employ a 1 to 5 scale, with 1 signifying suitability for lower elementary school students (grades 1-3) and 5 representing a level of challenge appropriate for mathematics contests and tertiary-level students. Relevancy scores span from 1 to 5, with 1 indicating low relevance (0-20%) and 5 denoting high relevance (80-100%) between the context and the generated question. Grammaticality is rated 1, 3, or 5, where 1 reflects the presence of severe grammatical errors, 3 indicates the question is good but contains minor grammatical errors, and 5 indicates a question that is both grammatically and factually correct. For Answerability, we consider two scenarios. In the answer-aware setting, a score of 1 means the question is not answerable, 3 indicates that the question is answerable but does not match the ground-truth answer, and 5 means the answer matches the ground truth. In the answer-unaware setting, only a score of 5 is used, indicating that the question is answerable.

To evaluate questions within the PRE-UMATH framework, human evaluators employ a set of diverse criteria: Difficulty, Grammaticality, Answerability, Topic Alignment, Difficulty Alignment, Answer Quality, and Usefulness. Difficulty is rated on a 1 to 5 scale, indicating the question's suitability for varying educational levels, from elementary to olympiad. Grammaticality is scored 1, 3, or 5, reflecting the presence of grammatical errors. Answerability is rated either 1 (unanswerable) or 5 (answerable). Topic Alignment is rated 1 (relevant to neither the topic nor the lesson), 3 (relevant to the topic but not the lesson), or 5 (relevant to both the topic and the lesson). Difficulty Alignment is assessed as 1 (deviation of more than one level from the given difficulty), 3 (one-level deviation), or 5 (matches the given difficulty level). Answer Quality is rated 1 (incorrect step-by-step explanation), 3 (partially correct step-by-step explanation but incorrect final answer), or 5 (both a correct step-by-step explanation and the correct final answer). Finally, Usefulness gauges the utility of the generated questions and solutions, rated 1 (not useful), 3 (useful but requires editing), or 5 (useful with no editing required).
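For reference, the Distinct-1/2 diversity metric used in Section 6.1 can be computed as the ratio of unique n-grams to total n-grams over the generated questions, following [18]; the whitespace tokenization in this sketch is an assumption.

    # Distinct-n diversity (Li et al., [18]): unique n-grams / total n-grams over a corpus.
    # Whitespace tokenization is an illustrative assumption.

    def distinct_n(questions: list[str], n: int) -> float:
        ngrams = []
        for q in questions:
            tokens = q.lower().split()
            ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

    generated = [
        "How many apples does Tom have now?",
        "How many apples are left in the basket?",
    ]
    print(distinct_n(generated, 1), distinct_n(generated, 2))  # Distinct-1, Distinct-2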
7 RESULTS AND DISCUSSIONS

7.1 Automatic Evaluation
It is worth noting that our automatic evaluations are only conducted in the context-aware setting. In the answer-aware setting, the fine-tuned baselines consistently outperform ChatGPT across all automatic metrics on the three benchmarks. However, in the answer-unaware setting, we derive interesting insights. Firstly, ChatGPT generates more diverse questions at the token level on the challenging GSM8K and MATH benchmarks, which underscores its practical potential for educational purposes. Conversely, GPT-2 excels on the SVAMP dataset by yielding higher distinct scores than ChatGPT. This might be because the questions generated by GPT-2 are generally short and consist of nonsense tokens. For example: Context: "At the stop 8 more people got on the train. There were 11 people on the train."; Question: "@@ Is there a limit on the number of people on the bus?"

Table 6: Answer-unaware question generation results in the context-aware setting

Model        Mode        SVAMP (Dist-1 / Dist-2 / Rel.)   GSM8K (Dist-1 / Dist-2 / Rel.)   MATH (Dist-1 / Dist-2 / Rel.)
GPT-2        fine-tune   21.88 / 61.89 / 86.28            12.72 / 50.76 / 86.53             4.72 / 19.40 / 81.08
BART         fine-tune   16.07 / 41.34 / 88.11            15.45 / 46.64 / 87.13             4.62 / 16.20 / 88.80
T5           fine-tune   15.80 / 41.85 / 88.22            15.67 / 46.96 / 87.54             4.05 / 12.68 / 88.48
Mix-QG       fine-tune   15.33 / 39.14 / 88.30            16.59 / 47.48 / 87.56             4.84 / 14.92 / 83.33
Flan-T5      fine-tune   15.63 / 40.15 / 88.28            16.17 / 46.97 / 87.56             3.35 / 10.39 / 83.03
ProphetNet   fine-tune   16.35 / 39.60 / 88.01             9.28 / 35.76 / 85.54             2.29 / 16.36 / 59.62
GPT-3        zero-shot   17.36 / 47.10 / 88.81            15.65 / 51.79 / 87.85            10.62 / 31.50 / 84.85
GPT-3.5      zero-shot   17.48 / 43.02 / 87.76            16.31 / 50.55 / 87.36            10.32 / 29.37 / 83.94
ChatGPT      zero-shot   19.91 / 51.67 / 89.40            17.15 / 56.11 / 88.28            11.27 / 35.66 / 86.25

7.2 Context-aware Human Evaluation
Through our careful manual evaluations, we have obtained 6 insightful findings.

(1) ChatGPT generates questions with minimal grammatical errors. As shown in Table 7, ChatGPT consistently attains grammaticality scores exceeding 4.9, underscoring its proficiency in generating grammatically correct text across all pre-university levels. Notably, we observe that grammatical errors predominantly emerge when ChatGPT attempts to generate highly complex problems.

(2) ChatGPT generates questions that are highly relevant to the input context. Our manual evaluations reveal that the questions generated by ChatGPT are highly relevant to the input contexts. Remarkably, these questions exhibit minimal presence of unrelated characters or variables not found in the context, resulting in nearly perfect relevancy scores across most datasets and sub-settings (see Table 7). However, an intriguing observation is the lower relevancy score in the answer-aware setting for MATH compared to its answer-unaware counterpart.

(3) ChatGPT frequently repeats information from the context. Despite explicit constraints outlined in the prompt template regarding repetition, the model occasionally reiterates random segments (tertiary level) or the whole context (lower levels). This repetition leads to the generation of overly lengthy questions, as exemplified by instances such as: "Context: A football team played 22 games. They won 8 more than they lost."; "Generated Question: How many games did the football team win if they played 22 games and won 8 more than they lost?". Subsequent human evaluations unveil that this issue predominantly afflicts the GSM8K dataset, occurring about 50% of the time.

(4) With an expected answer, ChatGPT tends to generate answerable questions; without it, this likelihood is lower. When additional hypotheses are required to construct a complete question (e.g., "Context: Mary is two years younger than Joan, who is five years older than Jessa"), our empirical evaluation indicates that ChatGPT tends to struggle in the absence of an expected answer as a reference. For instance, within the answer-unaware setup and considering the context mentioned, ChatGPT only asks "How old is Jessa?".

(5) Without an expected answer, ChatGPT frequently generates trivial questions. In the answer-unaware scenario, ChatGPT often exhibits a combination of the aforementioned behaviors (3) and (4), redundantly repeating information from the context and formulating it as a question. This behavior occurs irrespective of the context's complexity, resulting in simplistic questions that merely require looking up information in the context itself. For instance, when the context is as straightforward as "Darrell and Allen's ages are in the ratio of 7:11", ChatGPT redundantly repeats the entire context and asks, "What is the ratio of Darrell's age to Allen's age?". While this phenomenon occurs less frequently than (4), it affects about 5% of the generated questions.

(6) Even with good contextual understanding, ChatGPT struggles to understand the relationships between mathematical objects. This problem manifests in both the answer-aware and answer-unaware settings. In the answer-aware mode, ChatGPT tends to order subtraction operations inaccurately, while in the answer-unaware mode, it is reluctant to generate questions about the relationships between objects. This phenomenon occurs about 2% of the time.

7.3 Context-unaware Human Evaluation
Along with the same observation about question grammaticality, we provide 5 more findings in the context-unaware setting.

(1) ChatGPT generates questions with high diversity in terms of context. Across all three class levels, ChatGPT consistently excels
Table 7: Human evaluation results in the context-aware setting, answer-unaware rows (Difficulty / Relevancy / Grammaticality / Answerability / Usefulness), with Krippendorff's Alpha agreement

SVAMP:   GPT-2    1.02 / 4.53 / 4.06 / 3.41 / 2.96    GPT-3   1.14 / 4.82 / 4.96 / 4.84 / 4.27    ChatGPT   1.06 / 4.99 / 4.98 / 4.81 / 4.61
GSM8K:   Mix-QG   1.66 / 4.92 / 4.98 / 3.42 / 3.47    GPT-3   1.66 / 4.95 / 4.85 / 3.81 / 3.61    ChatGPT   1.74 / 4.96 / 4.98 / 4.69 / 4.14
MATH:    GPT-2    1.02 / 3.21 / 2.46 / 1.53 / 2.09    GPT-3   3.44 / 4.81 / 4.85 / 4.16 / 3.75    ChatGPT   3.47 / 4.98 / 4.92 / 4.24 / 4.07
Krippendorff's Alpha:   SVAMP 66.23 / 71.18 / 75.56 / 74.09 / 58.43    GSM8K 64.33 / 71.57 / 73.68 / 75.09 / 60.55    MATH 64.26 / 70.01 / 71.23 / 65.09 / 69.91
(4) When ChatGPT generates hard questions, it cannot handle the complexity and produces nonsense. ChatGPT demonstrates proficiency in introducing new objects within questions but struggles to

ACKNOWLEDGEMENTS
This project is supported by MoE Tier 1 research grant number RS21/20, Singapore. Do Xuan Long is supported by the A*STAR Computing and Information (ACIS) scholarship, A*STAR, Singapore.
REFERENCES
[1] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In IEEvaluation@ACL. https://api.semanticscholar.org/CorpusID:7164502
[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
[3] Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. 2022. UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 3313–3323. https://aclanthology.org/2022.emnlp-main.218
[4] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).
[5] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021).
[6] Xuan Long Do, Bowei Zou, Shafiq Joty, Anh Tai Tran, Liangming Pan, Nancy F. Chen, and Ai Ti Aw. 2023. Modeling What-to-ask and How-to-ask for Answer-unaware Conversational Question Generation. arXiv preprint arXiv:2305.03088 (2023).
[7] Xuan Long Do, Bowei Zou, Liangming Pan, Nancy F. Chen, Shafiq Joty, and Ai Ti Aw. 2022. CoHS-CQG: Context and History Selection for Conversational Question Generation. In Proceedings of the 29th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 580–591. https://aclanthology.org/2022.coling-1.48
[8] Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to Ask: Neural Question Generation for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 1342–1352. https://doi.org/10.18653/v1/P17-1123
[9] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2022. PAL: Program-aided Language Models. arXiv preprint arXiv:2211.10435 (2022).
[10] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=7Bywt2mQsCe
[11] Shima Imani, Liang Du, and Harsh Shrivastava. 2023. MathPrompter: Mathematical reasoning using large language models. arXiv preprint arXiv:2303.05398 (2023).
[12] Jeffrey D. Karpicke and Henry L. Roediger III. 2008. The critical importance of retrieval for learning. Science 319, 5865 (2008), 966–968.
[13] Enkelejda Kasneci, Kathrin Sessler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, George Louis Groh, Stephan Günnemann, Eyke Hüllermeier, Stephan Krusche, Gitta Kutyniok, Tilman Michaeli, Claudia Nerdel, J. Pfeffer, Oleksandra Poquet, Michael Sailer, Albrecht Schmidt, Tina Seidel, Matthias Stadler, Jochen Weller, Jochen Kuhn, and Gjergji Kasneci. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences (2023).
[14] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. arXiv preprint arXiv:2205.11916 (2022).
[15] Saurabh Kulshreshtha and Anna Rumshisky. 2022. Reasoning Circuits: Few-shot Multihop Question Generation with Structured Rationales. arXiv preprint arXiv:2211.08466 (2022).
[16] Ghader Kurdi, Jared Leo, Bijan Parsia, Uli Sattler, and Salam Al-Emari. 2019. A Systematic Review of Automatic Question Generation for Educational Purposes. International Journal of Artificial Intelligence in Education 30 (2019), 121–204.
[17] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 7871–7880. https://doi.org/10.18653/v1/2020.acl-main.703
[18] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, 110–119. https://doi.org/10.18653/v1/N16-1014
[19] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:964287
[20] Tianqiao Liu, Qiang Fang, Wenbiao Ding, Hang Li, Zhongqin Wu, and Zitao Liu. 2021. Mathematical Word Problem Generation from Commonsense Knowledge Graph and Equations. arXiv:2010.06196 [cs.CL]
[21] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations. https://openreview.net/forum?id=Bkg6RiCqY7
[22] Owen HT Lu, Anna YQ Huang, Danny CL Tsai, and Stephen JH Yang. 2021. Expert-authored and machine-generated short-answer questions for assessing students learning performance. Educational Technology & Society 24, 3 (2021), 159–173.
[23] Lidiya Murakhovs'ka, Chien-Sheng Wu, Philippe Laban, Tong Niu, Wenhao Liu, and Caiming Xiong. 2022. MixQG: Neural Question Generation with Mixed Answer Types. In Findings of the Association for Computational Linguistics: NAACL 2022. Association for Computational Linguistics, Seattle, United States, 1486–1497. https://doi.org/10.18653/v1/2022.findings-naacl.111
[24] OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt
[25] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.). https://openreview.net/forum?id=TG8KACxEON
[26] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:11080756
[27] Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP Models really able to Solve Simple Math Word Problems? arXiv:2103.07191 [cs.CL]
[28] Michael Prince. 2004. Does active learning work? A review of the research. Journal of Engineering Education 93, 3 (2004), 223–231.
[29] Weizhen Qi, Yu Yan, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang, and Ming Zhou. 2020. ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training. arXiv:2001.04063 [cs.CL]
[30] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[31] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html
[32] Mrinmaya Sachan, Avinava Dubey, Eduard H. Hovy, Tom M. Mitchell, Dan Roth, and Eric P. Xing. 2019. Discourse in Multimedia: A Case Study in Extracting Geometry Knowledge from Textbooks. Computational Linguistics 45, 4 (Dec. 2019), 627–665. https://doi.org/10.1162/coli_a_00360
[33] Lei Shen, Fandong Meng, Jinchao Zhang, Yang Feng, and Jie Zhou. 2021. GTM: A Generative Triple-wise Model for Conversational Question Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 3495–3506. https://doi.org/10.18653/v1/2021.acl-long.271
[34] Kumar Shridhar, Jakub Macina, Mennatallah El-Assady, Tanmay Sinha, Manu Kapur, and Mrinmaya Sachan. 2022. Automatic Generation of Socratic Subquestions for Teaching Math Word Problems. arXiv preprint arXiv:2211.12835 (2022).
[35] Rahul Singhal, Martin Henz, and Kevin McGee. 2014. Automated Generation of Geometry Questions for High School Mathematics. In Proceedings of the 6th International Conference on Computer Supported Education - Volume 2 (Barcelona, Spain) (CSEDU 2014). SCITEPRESS - Science and Technology Publications, Lda, Setubal, PRT, 14–25. https://doi.org/10.5220/0004795300140025
[36] Lieven Verschaffel, Stanislaw Schukajlow, Jon Star, and Wim Van Dooren. 2020. Word problems in mathematics education: A survey. ZDM 52 (2020), 1–16.
[37] Zichao Wang, Andrew Lan, and Richard Baraniuk. 2021. Math Word Problem Generation with Mathematical Consistency and Problem Context Constraints. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 5986–5999. https://doi.org/10.18653/v1/2021.emnlp-main.484
[38] Zichao Wang, Jakob Valdez, Debshila Basu Mallick, and Richard Baraniuk. 2022. Towards Human-Like Educational Question Generation with Large Language Models. In International Conference on Artificial Intelligence in Education.
[39] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned Language Models Are Zero-Shot Learners. arXiv:2109.01652 [cs.CL]
[40] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H. Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems.
[41] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 [cs.CL]
[42] Qinzhuo Wu, Qi Zhang, and Xuanjing Huang. 2022. Automatic Math Word Problem Generation With Topic-Expression Co-Attention Mechanism and Reinforcement Learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022), 1061–1072. https://doi.org/10.1109/TASLP.2022.3155284
[43] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations. https://openreview.net/forum?id=SkeHuCVFDr