
Automatic Generation of Socratic Subquestions for Teaching Math Word Problems

Kumar Shridhar∗♠  Jakub Macina∗♠Φ  Mennatallah El-Assady♠Φ  Tanmay SinhaH  Manu KapurH  Mrinmaya Sachan♠

♠ Department of Computer Science, ETH Zurich
Φ ETH AI Center
H Professorship for Learning Sciences and Higher Education, ETH Zurich

∗ Equal contribution

arXiv:2211.12835v1 [cs.CL] 23 Nov 2022

Abstract

Socratic questioning is an educational method that allows students to discover answers to complex problems by asking them a series of thoughtful questions. Generation of didactically sound questions is challenging, requiring understanding of the reasoning process involved in the problem. We hypothesize that such a questioning strategy can not only enhance human performance, but also assist math word problem (MWP) solvers. In this work, we explore the ability of large language models (LMs) to generate sequential questions for guiding math word problem-solving. We propose various guided question generation schemes based on input conditioning and reinforcement learning. On both automatic and human quality evaluations, we find that LMs constrained with desirable question properties generate superior questions and improve the overall performance of a math word problem solver. We conduct a preliminary user study to examine the potential value of such question generation models in the education domain. Results suggest that the difficulty level of problems plays an important role in determining whether questioning improves or hinders human performance. We discuss the future of using such questioning strategies in education. Code is available at https://github.com/eth-nlped/scaffolding-generation.

[Figure 1: A math word problem with candidate sub-questions. Problem: "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?" Candidate questions: "How many eggs does Janet sell?" (marked focused and goal-driven), "Is duck an animal?", "How many eggs does each duck lay?", "How much does Janet make at the farmers' market?"]

Figure 1: Math word problems can be procedurally solved in multiple reasoning steps. One operationalization of Socratic questioning is to map each step in the procedure to a question. Asking (machines/humans) the right set of questions in a certain sequence (shown in green) can be an effective way to do so. In order to be effective, the Socratic questioning should be focused and goal-driven.
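To make the step-to-question mapping concrete, the problem in Figure 1 decomposes into two gold solution steps (these step values appear in the GSM8K solution shown later in Figure 2), each paired with one sub-question:

```latex
% Each reasoning step of the MWP is answered by one Socratic sub-question
\text{Q}_1:\ \text{How many eggs does Janet sell?} \qquad 16 - 3 - 4 = 9 \\
\text{Q}_2:\ \text{How much does she make at the market?} \qquad 9 \times \$2 = \$18
```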
1 Introduction

Questioning can be a valuable way of supporting student thinking. It can be conceived as a scaffold (Wood et al., 1976; Quintana et al., 2004), where a more knowledgeable tutor helps a student in solving problems otherwise too difficult. One approach well-suited for mathematics is funneling (Wood, 1994), which uses prompting questions to guide students towards a solution. Figure 1 shows an example of a math word problem where this questioning strategy might be beneficial. We hypothesize that these questions can not only help humans in understanding the problem better and improve their performance, but can also assist MWP solvers.

Even though question generation (QG) models have been studied for factual SQuAD-like questions (Rajpurkar et al., 2016; Puri et al., 2020), these models fail to generate sequentially coherent questions (Reddy et al., 2019; Choi et al., 2018). Furthermore, domain-specific questioning is challenging, as the QG model needs to understand the reasoning process required to provide fine-grained responses. Moreover, the role of a teacher using questioning is to interject questions that focus on the most critical points in an explanation and take the understanding forward (Anghileri, 2006). As seen in bold in Figure 1, we refer to these properties of questioning as focused and goal-driven.

In this work, we explore the use of large language models (Raffel et al., 2020; Radford et al., 2019) to generate guiding sub-questions for math word problems.
In particular, we use reinforcement learning (RL) with rewards from various sources, including math question answering (Math QA) models, and various forms of input conditioning for generating these questions. We train and evaluate our models on the recently released GSM8K MathQA dataset (Cobbe et al., 2021) of multi-step reasoning MWPs. We illustrate the benefit of our RL-based generation strategy using both automatic and human evaluation metrics. Our evaluation shows that our guided approach makes the generation model ask more logically relevant and structurally correct questions, which follow the appropriate sequencing of questioning at the right granularity level.

We further show that our generated questions, when provided as additional context, can aid a math question answering model, thereby providing further empirical justification of the value of questioning for math QA model training. Questioning could facilitate the reasoning of MWP solvers by making intermediate reasoning steps explicit. Finally, we explore the didactic usefulness of our questioning strategy by conducting a preliminary user study, and use it to show that the generated sequence of questions may have the potential to improve students' problem-solving. However, we cautiously note that achieving this would require further progress on many fronts in AI and Education.

In what follows, we begin by discussing related work and introducing our research questions in section 2 and section 3. We propose ways to induce these properties in LMs using planning and reinforcement learning in section 4; section 5 empirically demonstrates the effectiveness of inducing the questioning strategy in LMs and the quality of the generated questions, evaluated using automatic metrics and by humans. Finally, we evaluate the potential of using such questions as an educational tool for helping students solve MWPs in section 6.

2 Related Work

Socratic questioning approaches have evolved within the learning sciences community into the theory of scaffolding (Wood et al., 1976; Reiser, 2004), which broadly refers to assisting students in problem-solving beyond their zone of proximal development (Quintana et al., 2004). Computer-based scaffolds (e.g., in the form of hints, prompts, feedback) have moderate effects on student learning outcomes (Kim et al., 2018), and our work can be used to automatically generate such scaffolds in the form of questioning prompts. For mathematics, Wood (1994) analyzed interactions in math classrooms and proposed two distinct interaction patterns: funneling, which functions by guiding students using leading/prompting questions to a predetermined solution procedure, and focusing, which functions by drawing student attention to the critical aspects of the problem. We draw inspiration from this strand of work. Our overall question generation approach can be conceived as similar to funneling, with specific sub-questions focusing on the important domain concepts.

Research on question generation includes visual question generation (Fan et al., 2018; Wang et al., 2022), generation of questions for student assessment (Stasaski and Hearst, 2017; Wang et al., 2018), generation of factual questions based on Wikipedia articles (Rajpurkar et al., 2016; Ko et al., 2020), and generation of sequential information-seeking questions in dialogue-based scenarios (Reddy et al., 2019; Choi et al., 2018). Other work has explored similar ideas of improving answerability by question-asking (Klein and Nabi, 2019; Shwartz et al., 2020; Perez et al., 2020; Pan et al., 2021) and ranking questions (Rao and Daumé III, 2018). However, factual questions do not usually require much reasoning and mostly boil down to information retrieval from text. In this work, we focus on question generation for reasoning problems.

Prior work on guided and controlled question generation uses either entities as a guiding mechanism (Huang et al., 2021) or a reinforcement learning-based graph-to-sequence approach (Chen et al., 2019). Identification of entities and relationships present in the text often relies on rule-based or off-the-shelf extraction tools, which are hard to extend (Dhingra et al., 2020). Often, such single-hop questions are combined to form a multi-hop question that requires complex reasoning to solve (Pan et al., 2021; Sachan et al., 2020). Controllable text generation has been studied in the past for general text generation (Hu et al., 2017; Miladinović et al., 2022; Carlsson et al., 2022), Wikipedia texts (Liu et al., 2018; Prabhumoye et al., 2018) and data-to-text generation (Puduppully and Lapata, 2021; Su et al., 2021). Controlled text generation is particularly useful for ensuring that the information is correct or that the numbers are encapsulated properly (Gong et al., 2020). Our task has similar requirements.
[Figure 2: pipeline diagram. A Content Planner (seq2seq) reads the MWP and predicts a plan — operators (e.g., s*_1: (- -), s*_2: (*)) or equations. The Question Generator (seq2seq) consumes P ⊕ [s_1, s_2] and emits sub-questions (q*_1, ..., q*_n), which a Reward Module scores with R = R_fl + R_g + R_ans (fluency, granularity, answerability). Example input P: {Context C: "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and ...", Question Q: "How much in dollars does she make every day?"}; gold sub-questions q_1: "How many eggs does Janet sell?" with s_1: <<16-3-4=9>>9 and q_2: "How much does Janet make?" with s_2: <<9*2=18>>18; answer A: 18; generated q*_1: "How many eggs are left with Janet to sell?", q*_2: "What is the total amount that Janet makes?"]

Figure 2: Our overall methodology: two Socratic properties, focused (red dotted box) and goal-driven (green dotted box) question generation, are added to the question generation model with a combination of content planning and reward-based finetuning. Here, ⊕ represents the concatenation operation.

A final strand of related work lies in the ballpark of math problem solvers (Hosseini et al., 2014; Kushman et al., 2014; Roy et al., 2015; Seo et al., 2015; Sachan and Xing, 2017; Sachan et al., 2017, 2018, inter alia). Recent work in this area uses specialized architectures such as graph-based encoders (Zhang et al., 2020) and tree-based decoders (Xie and Sun, 2019), and more recently, large pretrained LMs which show state-of-the-art results (Cobbe et al., 2021; Shen et al., 2021; Kojima et al., 2022; Wei et al., 2022; Chowdhery et al., 2022). The application of these approaches to MWP datasets like GSM8K (our data context) still leaves considerable room for improvement, primarily in reasoning capabilities: the majority of the latest approaches are still unable to solve many of the problems and are sensitive to even the slightest modifications in the problem (Patel et al., 2021; Stolfo et al., 2022; Srivastava et al., 2022).

3 Research Questions

We now discuss the usefulness of questions in solving a math word problem and then study the different properties of a good questioning strategy.

RQ1: Does sub-questioning help in understanding a math word problem better? Question prompts as a teaching strategy act as instructions that guide students throughout a problem-solving process (Wood, 1994). Such questioning, as a valid scaffolding strategy (Kim et al., 2018), is valuable in supporting student thinking and is commonplace in high-quality math instruction (Boston and Candela, 2018). We explored the sub-questioning strategy with our trained NLP model and found that sub-questioning helps answer MWPs more effectively (Table 1). Experiments with NLP models and humans establish the usefulness of sub-questioning in solving MWPs.

RQ2: What are the properties of a good questioning strategy? Once we established that sub-questioning is helpful, we performed the same sub-questioning experiment as in RQ1 with NLP models, but with permuted ordering of the sub-questions, changed granularity of the sub-questions, or changed content (Table 2; an illustrative sketch of these manipulations follows the list below). We observed a decrease in the answering capabilities of the QA model in all cases, establishing that the right sequence of disciplined questions with relevant content is an essential component of a good questioning strategy. Based on our results and inspired by prior work (Wood, 1994; Anghileri, 2006), we hypothesize the most important components of a Socratic questioning strategy to be:

(A) Focused: An essential property of a good questioning strategy is to ask questions directed towards the most critical domain-specific content. Irrelevant questions not only make the process difficult but also divert the focus and may increase the cognitive load that a student experiences.

(B) Goal-driven: Asking the right sequence of relevant questions that can assist students in reaching the final goal (solving the main question, in the case of math word problems) is a further important part of good questioning.
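A minimal sketch of the question-set manipulations behind RQ2 (and Table 2 below): keeping a random fraction k of the gold sub-questions, or shuffling their order. This is an illustrative reconstruction of the experimental variations, not the authors' exact code:

```python
# Variations of a gold sub-question set for RQ2: {q}^k subsampling and
# shuffle({q}) reordering. Seeds and helper names are illustrative.
import random

def subsample_questions(questions, k, seed=0):
    # {q}^k: keep a randomly selected fraction k, preserving original order
    rng = random.Random(seed)
    n_keep = round(k * len(questions))
    keep = set(rng.sample(range(len(questions)), n_keep))
    return [q for i, q in enumerate(questions) if i in keep]

def shuffle_questions(questions, seed=0):
    # shuffle({q}): all sub-questions, permuted order
    shuffled = list(questions)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```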
4 Methodology

We discuss our approach to modeling Socratic questioning using large LMs. We begin by defining our MWP dataset D as a collection of MWPs. Each MWP P in the dataset is accompanied by its solution S and the numerical answer A. We do not always assume the existence of the problem solutions S and answers A, as they can be automatically derived from various MathQA models. Each MWP P = (C, Q) consists of the story context C and the question Q. The problem solution S consists of n solution steps S = (s_1, ..., s_n). We define Socratic questioning such that each solution step s_i can be mapped to a sub-question q_i. We refer to q as the collection of all Socratic questions q_1, ..., q_n for a given MWP P. An example MWP is shown in Figure 2.

Our main module is the Question Generator (QG) module, a transformer-based (Vaswani et al., 2017) encoder-decoder model. The QG model takes the reference math word problem P and generates Socratic questions q* as close to the true sub-questions q as possible. The learning objective of the QG module is:

\mathcal{L}_{QG} = -\sum_{i=1}^{n} \log P_{Dec}(q_i \mid q_{:i-1}; \mathrm{Enc}(P))    (1)

where Enc represents the encoder and Dec represents the decoder of the seq2seq QG model. Note that the sub-questions q_i are decoded word by word in an auto-regressive setting.

Next, we propose to inject the two Socratic questioning properties into our QG model as follows:

4.1 Focused questions

To learn a sequence of disciplined questions focused on specific reasoning steps in the MWP, it is important to ask the right set of questions. We propose a content planner ψ that serves as a guiding principle for the QG model to ask the right focused questions. In principle, the content planner module can extract any relevant information to assist the QG model, but for the task of math word problems, we restrict it to operators and equations.¹ Our planning strategies are defined as:

¹ We also do not consider the step-by-step solutions S in our work, as creating step-by-step textual solutions requires a lot of time and effort from teachers, and even the largest language models fail to understand MWPs easily (Wei et al., 2022; Chowdhery et al., 2022).

Operators: Given an MWP P, the content planner learns to identify the operations and operators (e.g., addition, multiplication, ...) involved in the problem. Since the operators play a significant role in a given MWP, the generated operators are used as the guiding principle for the QG model to generate sub-questions.

Equations: Equations contain important information for an MWP, as they involve not just the operators but also the quantities involved in the problem. Similar to operators, equations can serve as an important guiding principle for asking more focused questions leading towards a correct solution.

We use the same seq2seq architecture for the content planner module as for our QG model, with the only difference being that the output comprises a set of equations s*_1, ..., s*_n or just the operators within the equations (instead of the sub-questions). The generated operators/equations are appended to the input MWP P in the encoder of the QG module, and the modified focused learning objective is:

\mathcal{L}_{QG_f} = -\sum_{i=1}^{n} \log P_{Dec}(q_i \mid q_{:i-1}; \mathrm{Enc}([P \oplus plan]))    (2)

Here, plan denotes the content planner module's output and ⊕ denotes the concatenation operation.
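For concreteness, a minimal sketch of this focused objective using a Huggingface T5 backbone (which Section 5 states is the backbone used); the example strings and the plain-text plan separator are illustrative, not the paper's exact preprocessing:

```python
# Minimal sketch of the focused QG objective (Eq. 2) with a Huggingface T5
# backbone. Eq. (1) is recovered by dropping `plan` from the encoder input.
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

problem = ("Janet's ducks lay 16 eggs per day. She eats three for breakfast "
           "and bakes muffins with four. She sells the remainder for $2 per "
           "egg. How much in dollars does she make every day at the market?")
plan = "16 - 3 - 4 = 9 ; 9 * 2 = 18"    # content planner output (equations)
gold = "How many eggs does Janet sell? How much does Janet make?"

# Encode [P ⊕ plan]; with `labels` set, the model returns the token-level
# cross-entropy  -Σ_i log P_Dec(q_i | q_{:i-1}; Enc([P ⊕ plan])).
enc = tokenizer(problem + " plan: " + plan, return_tensors="pt", truncation=True)
labels = tokenizer(gold, return_tensors="pt", truncation=True).input_ids
loss = model(**enc, labels=labels).loss   # L_QGf for one training example
loss.backward()
```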
4.2 Goal-driven questions

An essential element of a good questioning strategy is to ask goal-driven questions that are not only factually associated with the main problem but also eventually help in answering the main question. However, there can be any number of goal-driven questions for a given MWP. Thus, our goal is to optimize the questioning strategy such that it is goal-driven, efficient, and rewarding at each step, making sure that the final goal can be achieved with these individual questions. We induce these properties in our QG model using various rewards that force the model to stay relevant to the problem. These rewards are defined as follows:

Fluency: It is important that the generated sub-questions are easily understandable and fluent in the meaning they represent. Although the QG training objective ensures the syntax and semantics of the generated questions, rewarding the system to stay fluent is necessary to remove repetitions and illogical questions.

Granularity: As solving an MWP usually involves multiple reasoning steps, asking a relevant question at each step can help in solving the MWP. Moreover, our questioning strategy is based on the premise that the questions are organised, structured and follow a sequence. With the granularity reward, the model can learn to ask the right number of questions (compared to the number of reasoning steps needed to solve the MWP) in a specific sequence and refrain from unstructured questions.
Answerability: For every generated question, it is important to evaluate whether it can be answered given the context C and whether it helps in answering the overall MWP. We trained an external QA model that answers MWPs with the help of the sub-questions and evaluated whether the generated questions can assist in answering the main problem. The answerability reward is provided both on a step-by-step basis (if the QA model can answer a sub-part of the main problem) and overall (whether the final answer is correct when using all sub-questions).

During training, the QG model samples a set of sub-questions q′ and calculates various rewards based on q′. The parameters of the QG model are updated using the REINFORCE algorithm (Williams, 1992) as:

\mathcal{L}_{RL} = -\,\mathbb{E}_{q' \sim P_{Dec}}\big[R(q, q', P)\big] = -\sum_{i=1}^{n} R(q, q', P)\, \log P_{Dec}(q_i \mid q_{:i-1}; \mathrm{Enc}(P))

The reward function R(q, q′, P) combines the individual rewards for fluency, granularity and answerability, calculated as:

Fluency: R_{fl} = \mathrm{BLEU}(q, q'), where BLEU(·,·) denotes the BLEU score (Papineni et al., 2002).

Granularity: R_g = F(q, q'), where F(q, q') = 1 - \frac{\big||q| - |q'|\big|}{|q'|}, and |q| and |q′| denote the number of questions in q and q′, respectively.

Answerability: R_{ans} = F(A, A'), where F(A, A') = 1 if the final answer from the QA model is correct when it is given the sub-questions q′ alongside the MWP P, and 0 otherwise. A′ denotes the answer from the QA model and A denotes the true answer.

We also evaluated the step-by-step performance of the QA model on the generated sub-questions, checking whether the QA model can answer each generated sub-question correctly. This allows us to provide partial rewards at each step of the generation model. The modified sub-step answerability reward is F(A, A') = \frac{\#a'}{|q'|}, where #a′ and |q′| denote the number of correctly answered generated sub-questions and the total number of generated questions, respectively.
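Putting these pieces together, an illustrative sketch of the reward computation and the REINFORCE loss; the QA check (`final_answer_correct`) stands in for the paper's external GPT-2 solver, and the helper names are ours:

```python
# Illustrative reward computation and REINFORCE loss for Section 4.2.
import sacrebleu
import torch

def total_reward(gold_qs, sampled_qs, final_answer_correct):
    # R_fl: BLEU between the gold and sampled question sequences
    r_fl = sacrebleu.sentence_bleu(" ".join(sampled_qs),
                                   [" ".join(gold_qs)]).score / 100.0
    # R_g = 1 - ||q| - |q'|| / |q'|: penalize a wrong number of questions
    r_g = 1.0 - abs(len(gold_qs) - len(sampled_qs)) / max(len(sampled_qs), 1)
    # R_ans: 1 if the QA model solves the MWP given the sampled questions
    # (the step-wise variant instead uses #a' / |q'|)
    r_ans = 1.0 if final_answer_correct else 0.0
    return r_fl + r_g + r_ans

def reinforce_loss(seq_log_prob: torch.Tensor, reward: float) -> torch.Tensor:
    # L_RL = -R(q, q', P) * Σ_i log P_Dec(q_i | q_{:i-1}; Enc(P)), where
    # `seq_log_prob` is the summed log-probability of the sampled sequence
    return -reward * seq_log_prob
```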
Variation     GPT-2          GPT-3
P             5.45 (↓ 47%)   29 (↓ 38%)
P ⊕ {q}       10.46          47

Table 1: Comparison of Math QA accuracy (in %) with and without Socratic questions on the GSM8K test set. (↓) denotes the drop in accuracy compared to using Socratic questions (P ⊕ {q}). ⊕ denotes the concatenation operation. The GPT-2 model was trained with and without Socratic questions, while the GPT-3 model (Brown et al., 2020) was prompted with a one-shot example (more details in Appendix subsection B.2).

4.3 Overall Loss Function

Finally, with the induced Socratic properties in the QG model, the total loss is defined as a combination of the focused learning loss and the reward loss:

\mathcal{L} = \alpha \, \mathcal{L}_{QG_f} + (1 - \alpha) \, \mathcal{L}_{RL}    (3)

where α is a weighting factor.

5 Empirical Analysis

We now demonstrate the effectiveness of inducing the defined questioning properties in large LMs.

Dataset: We study the properties of Socratic questioning on the GSM8K dataset² (Cobbe et al., 2021), which consists of 8.5K grade school math word problems. Each problem requires 2 to 8 reasoning steps to solve, and solutions primarily involve a sequence of elementary calculations using the basic arithmetic operations (+ − ∗ /). The dataset is segmented into 7.5K training problems and 1K test problems.

² https://github.com/openai/grade-school-math
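As a concrete illustration of the data format, a small loader for the GSM8K release: the `question`/`answer` JSONL fields and the "#### <number>" final-answer marker follow the official repository linked above, while the file path is an assumption:

```python
# Small GSM8K loading sketch; fields follow the official release format.
import json

def load_gsm8k(path="grade_school_math/data/train.jsonl"):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            # `answer` interleaves text with calculator annotations such as
            # <<16-3-4=9>>9 and ends with the final answer after "####".
            final = ex["answer"].split("####")[-1].strip()
            examples.append({"problem": ex["question"],
                             "solution": ex["answer"],
                             "final_answer": final})
    return examples
```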
Models: We used T5 (Raffel et al., 2020) as the backbone of both our QG and content planning modules. As the reward-generating QA model, we used GPT-2 (Radford et al., 2019) in all RQ2 experiments because of resource constraints; a better QA model such as GPT-3 (Brown et al., 2020) can be used in the future. Both the QG and content planning models are fine-tuned on the GSM8K training set using the Huggingface library (Wolf et al., 2020).

Implementation Details: For training the models, we used an Nvidia Tesla A100 with 40 GB of GPU memory. We ran each experiment for 50 epochs, with periodic evaluation on the validation set. Training time without rewards is 10 minutes per epoch; with rewards, the training time per epoch increases to several hours. We used the T5-large model without modifications for the content planner and question generation modules, and GPT-2 small as the QA solver.

Evaluation Metrics: We report automatic evaluation using SacreBLEU (Post, 2018), which is based on exact word overlap, and the BERT F1 score (Zhang et al., 2019) with DeBERTa (He et al., 2020) as the similarity model. We also report #Q, the number of questions generated compared to the number of ground truth reasoning steps (same as the granularity reward), and Math QA solver accuracy (same as the overall answerability reward), to assess whether our generated questions helped the QA model reach the final numerical solution.
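A sketch of how these automatic metrics can be computed with the cited libraries (sacrebleu and bert-score); the DeBERTa model string and the "?"-counting heuristic for #Q are assumptions consistent with the descriptions above:

```python
# Sketch of the automatic evaluation: corpus SacreBLEU, BERTScore F1 with
# a DeBERTa backbone, and a #Q ratio mirroring the granularity reward.
import sacrebleu
from bert_score import score as bert_score

def f_gran(n_gold, n_gen):
    # F(q, q') = 1 - ||q| - |q'|| / |q'|
    return 1.0 - abs(n_gold - n_gen) / max(n_gen, 1)

def evaluate(generated, references):
    bleu = sacrebleu.corpus_bleu(generated, [references]).score
    _, _, f1 = bert_score(generated, references, lang="en",
                          model_type="microsoft/deberta-xlarge-mnli")
    # Heuristic: count "?" to estimate the number of questions per example
    n_q = sum(f_gran(r.count("?"), g.count("?"))
              for g, r in zip(generated, references)) / len(generated)
    return {"bleu": bleu, "bert_f1": f1.mean().item(), "#Q": n_q}
```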
Variation            QA Accuracy
Granularity
P ⊕ {q}^0            5.45 (↓ 45%)
P ⊕ {q}^0.25         3.94 (↓ 62%)
P ⊕ {q}^0.5          3.35 (↓ 67%)
P ⊕ {q}^0.75         9.70 (↓ 7%)
P ⊕ {q}^1            10.46
Order
P ⊕ shuffle({q})     8.94 (↓ 14%)
Relevance
P ⊕ <base-ques>      2.57 (↓ 75%)

Table 2: Comparison of Math QA accuracy (in %) for different experimental variations with ground truth data. {q}^k denotes that only a randomly selected fraction k of the ground truth sub-questions is used; e.g., {q}^0.25 uses only 25% of the sub-questions. shuffle({q}) denotes all sub-questions, but in shuffled order. Finally, <base-ques> are sub-questions generated by a T5-large model without fine-tuning on our task. (↓) denotes the drop in accuracy compared to the full set of Socratic questions (P ⊕ {q}). ⊕ denotes the concatenation operation. GPT-2 small was used as the QA model for all the above experiments.

Planning      BLEU    BERT F1   #Q
None          51.53   0.783     0.428
Operators     54.98   0.788     0.642
+ planner     45.05   0.779     0.346
Equations     58.82   0.813     0.807
+ planner     52.48   0.787     0.485

Table 3: Focused questions: QG model performance on the gold set of ground truth test questions with different planning strategies. Note that in the "+ planner" rows, the content planning information from the ground truth data is replaced with the output of the content planner model.

5.1 RQ1: Does sub-questioning help in understanding math concepts better?

We hypothesize that high-quality sub-questioning helps Math QA solvers reach the correct solution, especially when the questions are relevant to the concept to be learnt, in the right sequence (ordering), and with high granularity in their structure. We verify our hypothesis with a GPT-2 model as the QA solver after fine-tuning it on the GSM8K training set, and with the GPT-3 model using one-shot prompting. Table 1 demonstrates that Socratic questioning improves the performance of the QA solver by as much as 45%. We then vary the properties of the test questions and examine the performance of the QA solver. Table 2 demonstrates that Socratic questions significantly improve model performance, from 5.45% to 10.46%. Sub-questioning even helps when only 75% of the Socratic questions are retained (denoted {q}^0.75 in the table) or when the order is shuffled (the latter might be an artefact of the dataset containing only a minority of examples with a strict order). An interesting observation is that when the number of Socratic questions is reduced to half or fewer (while preserving their order), the model gets confused and performs worse than with no sub-questions at all. Finally, we take the pre-trained T5 model without fine-tuning it for our task and use its outputs alongside the problem P as additional information for solving the problem. The performance drops as low as 2.57%, indicating that non-relevant information degrades the performance.

5.2 RQ2: What are the properties of a good questioning strategy?

We now present our analysis of inducing the two Socratic properties in LMs.

Focused generation: Table 3 compares the two planning strategies. Results demonstrate that planning strategies improve the baseline methods by more than 3% BLEU with operators as planning, and by more than 7% with equations. Similar to the BLEU score, we achieve better performance on BERT F1 scores too. Finally, the correct question count (#Q) improves with planning and doubles compared to the no-planning variant. However, the results show that in all variants the number of generated sub-questions is lower than the number of reasoning steps. This could be improved further by oversampling during beam search (beam search settings are the same for all variants in this experiment). The results degrade when the ground truth content (both equations and operators) is replaced by the output of our content planner module. This is expected, as errors in the content planning module cascade when generating sub-questions. However, with more powerful models, errors in the content planner can be reduced, leading to improvements on all metrics. See the Appendix for experiments with iterative splitting of the MWP into multiple parts for generation.

Strategy           BLEU    BERT F1   #Q
Baseline           13.02   0.566     0.056
Fine-tuned         51.53   0.783     0.428
+ fluency          52.21   0.784     0.440
+ # of questions   51.86   0.784     0.431
+ QA               52.22   0.783     0.417
+ all weighted     53.39   0.781     0.431
Eq planning        58.82   0.813     0.807
+ fluency          59.52   0.816     0.818
+ # of questions   59.75   0.814     0.811
+ QA               59.37   0.813     0.799
+ all weighted     59.62   0.815     0.815

Table 4: Goal-driven questions: QG model performance compared to the gold set of ground truth questions with different rewards.

Goal-driven generation: Table 4 summarizes the results for rewards as a strategy to incentivize the model to generate goal-driven and rewarding questions. We can observe the gains associated with each reward for both the baseline model and the best-performing model from Table 3 (the equation-based content planning model in our case), suggesting the importance of rewards.

Strategy       QA Accuracy
No planning    6.74
+ rewards      6.75
Operators      7.50
+ rewards      7.52
Equations      8.49
+ rewards      8.50

Table 5: Overall model variations and their influence on Math QA solver accuracy (in %) with different planning and reward strategies. Here, GPT-2 small is used as the QA model. Note that the upper limit using ground truth questions is 10.46%, as shown in Table 1.

QA performance: We study the impact of the QG model considering both Socratic properties, as shown in Table 5. Sub-questions with operators and equations as planning improve the QA performance by 1−2%. Rewards, although they improve the QG quality, have a negligible effect on QA performance. This is mainly because a slight improvement in sub-question quality does not necessarily help in reaching the final goal.

5.3 Human quality evaluation

Next, we perform a human evaluation of the questions generated for 100 randomly selected test MWPs to assess the quality of our model's generations (our best model) compared to the baseline (no planning or reward-based strategies). For this analysis, we divided the questions among 4 annotators with an overlap of 40% of the questions among them³ and evaluated the generated question quality on the following factors, using a 5-point Likert scale ranging from 1 (poor) to 5 (very good) for each dimension: repetition - whether questions are repeated; factuality - whether all questions can be solved with the information given in the problem; logical relevance - whether the question is logically related to the MWP; right sequence - correct sequencing of questions leading to the final answer; granularity - questions are granular enough to solve the problem while remaining relevant, with no retrieval or basic common sense questions asked; completeness - questions are complete, with all steps covered to reach the final answer; and fluency - grammatical correctness and fluency of the language.

³ The overlap allows us to compute inter-annotator agreement.
[Figure 3: bar chart comparing the baseline and our model on a 1-5 Likert scale across repetition, factuality, logical relevance, right sequence, granularity, completeness, and fluency.]

Figure 3: Comparison of baseline versus our model's generated sub-questions on several metrics from our human evaluations (showing mean and standard deviation).

Figure 3 presents our findings, clearly demonstrating that our planning and reward strategies lead to superior quality questions on the MWP task. Although both the baseline and our model-generated text achieve an almost full score (5) on the fluency parameter, our model-generated questions are more aligned to the MWP, leading to a higher score on all the other parameters. We also present a randomly selected sample of generated questions in the Appendix.

5.4 Ablation study: Manipulating question properties

Both planning strategies help generate better questions. To gain a deeper understanding of how the content planner ψ affects the generated questions, we further analyze the influence of operators as a planning strategy. Here, we randomize the operators and their sequence and measure the change in performance. Table 6 shows that the correct sequence of operators with the correct number of operators guides the generation process better than the randomized versions. The gap between the correct and random operator counts indicates that having the correct number of operators (of any type) is more valuable than having the exact types of operators. We observed that the number of operators guides the model in terms of the number of questions that need to be asked, while the type changes the overall quality. Needless to say, for the same number of operators, quality matters.

Planning          BLEU    BERT F1   #Q
None              51.53   0.783     0.428
Diff op, diff #   51.59   0.785     0.415
Diff op, same #   54.26   0.786     0.546
Operators (op)    54.98   0.788     0.642

Table 6: Manipulating the planning inputs influences the quality of the generated questions and overall QG model performance. "same #" has the same number of operators as reasoning steps, but the types (+ − ∗ /) are shuffled; "diff #" has both the number and type of operators shuffled.

6 A preliminary user study with learners

Finally, we designed a preliminary user study to evaluate whether our generated questions, when presented as further problem-solving exercises (as typically used in educational settings), can help learners on the way to solving the overall problem. Given our research question, we hypothesized that guidance with questions can increase the overall problem-solving success rate for users in the questions (treatment) group compared to the no-questions control group. Our study uses Socratic questions as the main pedagogical intervention. We focus on participants who cannot solve a problem on the first attempt, to clearly distinguish the impact of automated sub-questioning. The key metric we measure is the success rate, defined as the percentage of correctly solved problems.

For our study, we built a simple user interface which allowed participants to solve math word problems (see Figure 5 and Figure 6 in the appendix). The interface contained a calculator which the users could use if needed. The study comprises 5 pre-test problems and 8 problem-solving exercises. These problems were randomly selected from the GSM8K test set. Our user study with this interface was then deployed on Mechanical Turk; participants were hired through the platform and paid $10-12 per hour. We selected participants with moderate levels of prior knowledge, using the pre-test scores as the selection criterion: only those scoring in the range of 40-80% were selected for the study. This way, we excluded both low-prior-knowledge participants and experts from our study, to ensure there was a learning possibility.

We randomly split the participants into two groups: a no-questions group (N = 19) with no question prompts, and a questions group (N = 17) with questions generated by our model. Both groups used the same interface for solving math word problems and had the opportunity to revise their answers after the first incorrect submission. The only difference was that after incorrectly solving a problem on the first submission, participants in the questions group saw sub-questions, while those in the no-questions group were only prompted to try again. The sub-questions were generated using the best-performing model with planning and rewards.

Group          1st success      2nd success
               M      SD        M      SD
No-questions   58.4   23.0      35.8   32.5
Questions      66.0   21.1      31.0   27.9

Table 7: User study success rates (in %) before and after the introduction of sub-questions. 1st success is the proportion of exercises solved correctly on the first attempt; 2nd success is the proportion of exercises solved correctly on the second attempt (out of all those solved incorrectly on the first attempt).

The results of the user study are shown in Table 7. The first attempt success rate is 58.4% for the control group and 66.0% for the treatment group, which might be the result of a slightly skewed prior knowledge distribution of 0.68 and 0.65 for the treatment and control groups, respectively. Even though participants in the treatment group (M = 124.9, SD = 92.1) spend significantly more time (p < 0.01) solving problems during the second attempt relative to the control group (M = 41.5, SD = 31.4), we did not find any statistically significant difference between the groups in the second submission success rate (p = 0.659, BF01 = 2.755, Cohen's d = 0.157), indicating weak odds favouring the null hypothesis and a rather small effect size.

[Figure 4]

Figure 4: Second submission success rate for problems with at least 10% occurrence in each group (excluding the two simplest problems, 1 and 6). Difficulty level is annotated blind to the correct solution.

As our study was unable to establish overall performance improvements, we further analysed the second submission success rate per problem (see Figure 4) and correlated it with the difficulty of the question. This analysis indicated that sub-questioning seems to improve the success on simpler problems and degrade the accuracy on relatively more complex problems. Prior work has suggested that the effectiveness of question prompts varies according to an individual's prior knowledge (Kim et al., 2018), and with insufficient prior knowledge, performance on complex problems may suffer. A posthoc inspection of the generated sub-questions for the more complex problems shows that they also scored lower in the human quality evaluation. Thus, we hypothesize that for more complex questions, the generated sub-questions are not good enough, and so they may make the task more challenging for participants.

While we were not able to establish any direct benefits of automatic Socratic questioning in a real learning scenario, we leave a more complete user study for future work. Deployment of Socratic questioning systems in real educational scenarios would require a better assessment of question generation quality as well as a better understanding of learners. We believe this is an interesting avenue for future research and encourage future work to address these issues.

7 Conclusion

We study the importance of sub-questioning for learning a mathematical concept and explore how LMs may generate these sub-questions. We demonstrate the usefulness of Socratic questioning strategies and propose ways to induce these properties in LMs. We further evaluate whether these questions can assist students in learning domain concepts. We found that the generated questions were generic for each student; had they been adapted to students' prior knowledge and intermediate solutions, their effectiveness could have been greater.

A discussion on limitations of our work

Our questioning strategy, although it utilizes information from the content planner and the reward strategy, leaves much to be desired in terms of controllability. Based on our user study, we need to be careful in using the questioning strategy in real educational contexts, as improper content can sometimes do more harm than good.
Based on prior work, we focused on two aspects of goodness in questioning in math education. However, this is not a complete list, and other aspects could also be important. We note that our user study was focused on the intermediate success rate rather than on actual learning. From a learning standpoint, asking questions that are always easily answerable won't lead to deeper, wider learning. If learners do not have to struggle to answer the sub-questions being asked and are instead repeating something verbatim or offering a slightly reconfigured version of what they have been asked, they are probably answering sub-questions that do not require conceptual understanding. Another limitation of our work is that our user study was underpowered due to resource constraints, which prevents us from drawing strong conclusions at this point. A larger user study is, however, forthcoming.

Finally, we choose to focus on Socratic questioning in a rather narrow sense of trying to call learners' attention to relevant facts and then implicitly stimulating them to integrate facts and draw conclusions. However, when taken together with all its nuances, the effectiveness of Socratic questioning can be posited to depend on other critical question types that seek clarification (e.g., can you rephrase?), evidence (e.g., can you provide an example?) and implication (e.g., why do you think...?) from learners too, all of which are truly dialogic and naturally leave room for learner questions. When both the teacher and learners are jointly responsible for pushing the dialogue forward, intermediate success may also not always be desirable, as learner errors and misconceptions may offer an important hook for the teacher to nudge the dialogue productively.

Acknowledgements

This project was made possible by an ETH AI Center Doctoral Fellowship to Jakub Macina, with partial support from the Asuera Stiftung and the ETH Zurich Foundation. Many thanks to the group members and our reviewers for their valuable feedback.
References

Julia Anghileri. 2006. Scaffolding practices that enhance mathematics learning. Journal of Mathematics Teacher Education, 9(1):33–52.
Melissa D Boston and Amber G Candela. 2018. The instructional quality assessment as a tool for reflecting on instructional practice. ZDM, 50(3):427–444.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
Fredrik Carlsson, Joey Öhman, Fangyu Liu, Severine Verlinden, Joakim Nivre, and Magnus Sahlgren. 2022. Fine-grained controllable text generation using non-residual prompting. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6837–6857, Dublin, Ireland. Association for Computational Linguistics.
Yu Chen, Lingfei Wu, and Mohammed J. Zaki. 2019. Reinforcement learning based graph-to-sequence model for natural question generation. CoRR, abs/1908.04942.
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium. Association for Computational Linguistics.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachandran, Graham Neubig, Ruslan Salakhutdinov, and William W. Cohen. 2020. Differentiable reasoning over a virtual knowledge base. CoRR, abs/2002.10640.
Zhihao Fan, Zhongyu Wei, Piji Li, Yanyan Lan, and Xuanjing Huang. 2018. A question type driven framework to diversify visual question generation. In IJCAI, pages 4048–4054.
Heng Gong, Wei Bi, Xiaocheng Feng, Bing Qin, Xiaojiang Liu, and Ting Liu. 2020. Enhancing content planning for table-to-text generation with data understanding and verification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2905–2914, Online. Association for Computational Linguistics.
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654.
Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 1587–1596. JMLR.org.
Qingbao Huang, Mingyi Fu, Linzhang Mo, Yi Cai, Jingyun Xu, Pijian Li, Qing Li, and Ho-fung Leung. 2021. Entity guided question generation with contextual structure and sequence information capturing. Proceedings of the AAAI Conference on Artificial Intelligence, 35:13064–13072.
Nam Ju Kim, Brian R Belland, and Andrew E Walker. 2018. Effectiveness of computer-based scaffolding in the context of problem-based learning for STEM education: Bayesian meta-analysis. Educational Psychology Review, 30(2):397–429.
Tassilo Klein and Moin Nabi. 2019. Learning to answer by learning to ask: Getting the best of GPT-2 and BERT worlds. arXiv preprint arXiv:1911.02365.
Wei-Jen Ko, Te-yuan Chen, Yiyan Huang, Greg Durrett, and Junyi Jessy Li. 2020. Inquisitive question generation for high level text comprehension. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6544–6555, Online. Association for Computational Linguistics.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 271–281.
Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198.
Ðord̄e Miladinović, Kumar Shridhar, Kushal Jain, Max B Paulus, Joachim M Buhmann, and Carl Allen. 2022. Learning to drop out: An adversarial approach to training sequence VAEs. arXiv preprint arXiv:2209.12590.
Liangming Pan, Wenhu Chen, Wenhan Xiong, Min-Yen Kan, and William Yang Wang. 2021. Unsupervised multi-hop question answering by question generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5866–5880, Online. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.
Ethan Perez, Patrick Lewis, Wen-tau Yih, Kyunghyun Cho, and Douwe Kiela. 2020. Unsupervised question decomposition for question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8864–8880, Online. Association for Computational Linguistics.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 866–876, Melbourne, Australia. Association for Computational Linguistics.
Ratish Puduppully and Mirella Lapata. 2021. Data-to-text generation with macro planning. Transactions of the Association for Computational Linguistics, 9:510–527.
Raul Puri, Ryan Spring, Mohammad Shoeybi, Mostofa Patwary, and Bryan Catanzaro. 2020. Training question answering models from synthetic data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5811–5826, Online. Association for Computational Linguistics.
Chris Quintana, Brian J. Reiser, Elizabeth A. Davis, Joseph Krajcik, Eric Fretz, Ravit Golan Duncan, Eleni Kyza, Daniel Edelson, and Elliot Soloway. 2004. A scaffolding design framework for software to support science inquiry. Journal of the Learning Sciences, 13(3):337–386.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Sudha Rao and Hal Daumé III. 2018. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2737–2746, Melbourne, Australia. Association for Computational Linguistics.
Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.
Brian J. Reiser. 2004. Scaffolding complex learning: The mechanisms of structuring and problematizing student work. Journal of the Learning Sciences, 13(3):273–304.
Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics, 3:1–13.
Devendra Singh Sachan, Lingfei Wu, Mrinmaya Sachan, and William Hamilton. 2020. Stronger transformers for neural multi-hop question generation. arXiv preprint arXiv:2010.11374.
Mrinmaya Sachan, Kumar Dubey, and Eric Xing. 2017. From textbooks to knowledge: A case study in harvesting axiomatic knowledge from textbooks to solve geometry problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 773–784.
Mrinmaya Sachan, Kumar Avinava Dubey, Tom M Mitchell, Dan Roth, and Eric P Xing. 2018. Learning pipelines with limited data and domain knowledge: A study in parsing physics problems. Advances in Neural Information Processing Systems, 31.
Mrinmaya Sachan and Eric Xing. 2017. Learning to solve geometry problems from natural language demonstrations in textbooks. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 251–261.
Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. 2015. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1466–1476, Lisbon, Portugal. Association for Computational Linguistics.
Jianhao Shen, Yichun Yin, Lin Li, Lifeng Shang, Xin Jiang, Ming Zhang, and Qun Liu. 2021. Generate & rank: A multi-task framework for math word problems. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2269–2279, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Vered Shwartz, Peter West, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Unsupervised commonsense question answering with self-talk. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4615–4629, Online. Association for Computational Linguistics.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
Katherine Stasaski and Marti A. Hearst. 2017. Multiple choice question generation utilizing an ontology. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 303–312, Copenhagen, Denmark. Association for Computational Linguistics.
Alessandro Stolfo, Zhijing Jin, Kumar Shridhar, Bernhard Schölkopf, and Mrinmaya Sachan. 2022. A causal framework to quantify the robustness of mathematical reasoning with language models. arXiv preprint arXiv:2210.12023.
Yixuan Su, David Vandyke, Sihui Wang, Yimai Fang, and Nigel Collier. 2021. Plan-then-generate: Controlled data-to-text generation via planning. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 895–909, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
Ruonan Wang, Yuxi Qian, Fangxiang Feng, Xiaojie Wang, and Huixing Jiang. 2022. Co-VQA: Answering by interactive sub question sequence. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2396–2408, Dublin, Ireland. Association for Computational Linguistics.
Zichao Wang, Andrew S Lan, Weili Nie, Andrew E Waters, Phillip J Grimaldi, and Richard G Baraniuk. 2018. QG-Net: a data-driven question generation model for educational content. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale, pages 1–10.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. CoRR, abs/2201.11903.
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
David Wood, Jerome S Bruner, and Gail Ross. 1976. The role of tutoring in problem solving. Child Psychology & Psychiatry & Allied Disciplines.
Terry Wood. 1994. Patterns of interaction and the culture of mathematics classrooms. In Cultural Perspectives on the Mathematics Classroom, pages 149–168. Springer.
Zhipeng Xie and Shichao Sun. 2019. A goal-driven tree-structured neural model for math word problems. In IJCAI, pages 5299–5305.
Jipeng Zhang, Lei Wang, Roy Ka-Wei Lee, Yi Bin, Yan Wang, Jie Shao, and Ee-Peng Lim. 2020. Graph-to-tree learning for solving math word problems. Association for Computational Linguistics.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
A Details of User Study

We perform a user study using Amazon Mechanical Turk. Participants who did not spend a minimum amount of time per question were excluded from the analysis. The generated questions used in the questions group are listed in Table 9.

[Figure 5]

Figure 5: Interface for our user study (cf. Section 6). For each problem, the first screen contains the MWP text, a calculator, and an input box to submit the answer.

[Figure 6]

Figure 6: After submitting an incorrect solution on the first attempt in the treatment group, our model's generated sub-questions are shown to the participants to guide them through the problem-solving process. The control group only sees a prompt to try again.

B Experimental details

B.1 Iterative

Besides the global strategy of generating all questions given an MWP, we experimented with iterative generation at the sentence level. This also explains why #Q in the iterative case is not equal to 1: the model generates some duplicates, and sometimes the sentence split is not perfect.

Planning    BLEU    BERT F1   #Q
None        49.39   0.763     0.390
Operators   55.25   0.779     0.752
Equations   58.31   0.795     0.819

Table 8: QG model performance on the gold set of ground truth test questions with different planning strategies in an iterative setting.

B.2 GPT-3 prompting

We used one-shot prompting for GPT-3, meaning that we provide one example (Q, A) to the model and let it predict the answer (A) for the next question (Q) provided.

No sub-questions. Problem: John has 10 hectares of a pineapple field. There are 100 pineapples per hectare. John can harvest his pineapples every 3 months.
Q: How many pineapples can John harvest within a year? A: John has 100 x 10 = <<100*10=1000>>1000 pineapples on his field. John can harvest his Pineapple 12 / 3 = <<12/3=4>>4 times per year. Therefore John can harvest 1000 x 4 = <<1000*4=4000>>4000 pineapples per year. #### 4000

Socratic sub-questions. Problem: John has 10 hectares of a pineapple field. There are 100 pineapples per hectare. John can harvest his pineapples every 3 months.
Q: How many pineapples does John have? A: John has 100 x 10 = <<100*10=1000>>1000 pineapples on his field.
Q: How many times can John harvest his pineapples? A: John can harvest his Pineapple 12 / 3 = <<12/3=4>>4 times per year.
Q: How many pineapples can John harvest within a year? A: Therefore John can harvest 1000 x 4 = <<1000*4=4000>>4000 pineapples per year. #### 4000
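A small sketch of assembling these one-shot prompts; the example strings are abbreviated from the ones above, and the final completion call (to GPT-3 or any other text-completion model) is left out since it depends on the API used:

```python
# Sketch of the one-shot prompt construction of Appendix B.2. The helper
# and its string layout are illustrative, not the authors' exact code.
ONE_SHOT_EXAMPLE = (
    "Problem: John has 10 hectares of a pineapple field. There are 100 "
    "pineapples per hectare. John can harvest his pineapples every 3 months.\n"
    "Q: How many pineapples can John harvest within a year? "
    "A: John has 100 x 10 = <<100*10=1000>>1000 pineapples on his field. "
    "John can harvest his Pineapple 12 / 3 = <<12/3=4>>4 times per year. "
    "Therefore John can harvest 1000 x 4 = <<1000*4=4000>>4000 pineapples "
    "per year. #### 4000\n"
)

def build_prompt(problem, question, sub_questions=None):
    """One-shot prompt: the worked example, then the new problem. If
    `sub_questions` are given, they are inserted before the main question,
    mirroring the Socratic variant (P ⊕ {q})."""
    parts = [ONE_SHOT_EXAMPLE, f"Problem: {problem}\n"]
    if sub_questions:
        parts += [f"Q: {q}\n" for q in sub_questions]
    parts.append(f"Q: {question} A:")
    return "".join(parts)
```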
C Error analysis of generated questions

Table 9 presents some of the errors we encountered with our question generation strategy.
1. Problem: James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week? (540)
   Generated: How many sprints does James run in a week? How many meters does James run in a week?

2. Problem: Carla is downloading a 200 GB file. Normally she can download 2 GB/minute, but 40% of the way through the download, Windows forces a restart to install updates, which takes 20 minutes. Then Carla has to restart the download from the beginning. How long does it take to download the file? (160)
   Generated: How many minutes does it take to install updates? How many minutes does it take to restart the download? How many minutes does it take to download the file? [Missing: time to download 40% of the file]

3. Problem: John buys twice as many red ties as blue ties. The red ties cost 50% more than blue ties. He spent $200 on blue ties that cost $40 each. How much did he spend on ties? (800)
   Generated: How many blue ties did John buy? How many red ties did John buy? How much did the red ties cost? How much did John spend on ties?

4. Problem: Mr. Tan sold his house for $400 000. He paid the transfer fees that amount to 3% of the selling price and also paid a brokerage fee that is 5% of the selling price. If he also paid $250 000 for the remaining loan amount of the house, how much is Mr. Tan's net proceeds from selling the house? (118000)
   Generated: How much did Mr. Tan pay in transfer fees and brokerage fees? How much did Mr. Tan pay in total? How much is Mr. Tan's net proceeds from selling the house?

5. Problem: John drives for 3 hours at a speed of 60 mph and then turns around because he realizes he forgot something very important at home. He tries to get home in 4 hours but spends the first 2 hours in standstill traffic. He spends the next half-hour driving at a speed of 30mph, before being able to drive the remaining time of the 4 hours going at 80 mph. How far is he from home at the end of those 4 hours? (45)
   Generated: How far did John drive in the first 3 hours? How far did John drive in the remaining 2 hours? How fast did John drive in the next half-hour? How fast did John drive in the remaining 2 hours? [Error: should be the remaining 1 and a half hours] How far did John drive in those 4 hours? How far is John from home at the end of those 4 hours?

6. Problem: Charlie wants to sell beeswax candles. For every pound of beeswax, he can make 10 tapered candles. One pound of beeswax and the wicks cost $10.00 in supplies. If he sells each candle for $2.00 each, what is his net profit if he makes and sells 20 candles? (20)
   Generated: How many pounds of beeswax does Charlie need? How much will each candle cost? How much will Charlie sell the candles for? What is Charlie's net profit? [Missing: selling price for 20 candles]

7. Problem: Shiela bought five cell phones for $150 each for a 3-month installment. A 2% interest will be charged for each unit. How much in total will Shiela pay each month for the period of 3 months? (255)
   Generated: How much is the interest? How much will Shiela pay in total? How much will Shiela pay each month for 3 months? [Missing: question about the price of the 5 cell phones]

8. Problem: Melanie is a door-to-door saleswoman. She sold a third of her vacuum cleaners at the green house, 2 more to the red house, and half of what was left at the orange house. If Melanie has 5 vacuum cleaners left, how many did she start with? (18)
   Generated: How many vacuum cleaners did Melanie sell at the green house? How many vacuum cleaners did Melanie sell at the red house? How many vacuum cleaners did Melanie have left after selling to the red house? How many vacuum cleaners did Melanie start with? [Wrong order: the questions ask 1) before visiting the orange house, 2) before visiting the red house, 3) before visiting the green house]

Table 9: User study problems and generated sub-questions, with annotated errors in brackets.
