Automatic Generation of Socratic Subquestions For Teaching Math Word Problems

arXiv:2211.12835v1 [cs.CL] 23 Nov 2022

Abstract

Socratic questioning is an educational method that allows students to discover answers to [...]

[Figure 2 (diagram): An example MWP P = {Context C: "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg.", Question Q: "How much in dollars does she make every day at the farmers' market?"} is passed through a seq2seq planner and a seq2seq generator, which produce sub-questions (q*1, ..., q*n), e.g., q*1: "How many eggs is Janet left with to sell?" and q*2: "What is the total amount that Janet makes?". The ground-truth decomposition is q1: "How many eggs does Janet sell?" with s1: <<16-3-4=9>>9, and q2: "How much does Janet make?" with s2: <<9*2=18>>18, giving A: 18. A reward module scores the generated sub-questions for fluency, granularity (Rg), and answerability (Rans).]
Figure 2: Our overall methodology: two Socratic properties, focused (red dotted box) and goal-driven (green dotted box) question generation, are added to the question generation model with a combination of content planning and reward-based fine-tuning. Here, ⊕ represents the concatenation operation.
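As a minimal illustration of the decomposition in Figure 2, the two ground-truth sub-questions map one-to-one onto the solution steps, and chaining the steps yields the final answer (the helper name below is ours for illustration, not from the paper's code):

```python
# Worked example of the Socratic decomposition from Figure 2.
# Each sub-question q_i corresponds to one solution step s_i;
# executing the steps in order yields the final answer A.

def solve_janet_mwp() -> int:
    # q1: "How many eggs does Janet sell?" -> s1: <<16-3-4=9>>
    eggs_sold = 16 - 3 - 4
    # q2: "How much does Janet make?"      -> s2: <<9*2=18>>
    dollars_per_day = eggs_sold * 2
    return dollars_per_day  # final answer A = 18

print(solve_janet_mwp())  # 18
```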
math problem solvers (Hosseini et al., 2014; Kushman et al., 2014; Roy et al., 2015; Seo et al., 2015; Sachan and Xing, 2017; Sachan et al., 2017, 2018, inter alia). Recent work in this area uses specialized architectures such as graph-based encoders (Zhang et al., 2020) and tree-based decoders (Xie and Sun, 2019), and more recently, large pretrained LMs, which show state-of-the-art results (Cobbe et al., 2021; Shen et al., 2021; Kojima et al., 2022; Wei et al., 2022; Chowdhery et al., 2022). The application of these approaches to MWP datasets like GSM8K (our data context) still leaves considerable room for improvement, primarily in reasoning capabilities: the majority of the latest approaches are still unable to solve many of the problems and are sensitive to even the slightest modifications of a problem (Patel et al., 2021; Stolfo et al., 2022; Srivastava et al., 2022).

3 Research Questions

We now discuss the usefulness of questions in solving a math word problem and then study the different properties of a good questioning strategy.

RQ1: Does sub-questioning help in understanding a math word problem better? Question prompts as a teaching strategy act as instructions that guide students throughout a problem-solving process (Wood, 1994). Such questioning, as a valid scaffolding strategy (Kim et al., 2018), is valuable in supporting student thinking and is commonplace in high-quality math instruction (Boston and Candela, 2018). We explored the sub-questioning strategy with our trained NLP model and found that sub-questioning helps answer MWPs more effectively (Table 1). Experiments with NLP models and humans establish the usefulness of sub-questioning in solving MWPs.

RQ2: What are the properties of a good questioning strategy? Once we had established that sub-questioning is helpful, we performed the same sub-questioning experiment as in RQ1 with NLP models, but with permuted ordering of the sub-questions, changed granularity of the sub-questions, or changed content (Table 2). We observed a decrease in the answering capabilities of the QA model in all cases, establishing that the right sequence of disciplined questions with relevant content is an essential component of a good questioning strategy. Based on our results and inspired by prior work (Wood, 1994; Anghileri, 2006), we hypothesize the most important components of a Socratic questioning strategy to be:

(A) Focused: An essential property of a good questioning strategy is to ask questions that are directed towards the most critical domain-specific content. Irrelevant questions not only make the process difficult but also force a diversion in focus and may increase the cognitive load that a student experiences.

(B) Goal-driven: Asking the right sequence of relevant questions that can assist students in reaching the final goal (solving the main question, in the case of math word problems) is a further important part of good questioning.

4 Methodology

We discuss our approach to modeling Socratic questioning using large LMs. We begin by defining our MWP dataset D as a collection of MWPs. Each MWP P in the dataset is accompanied by its solution S and the numerical answer A. We do not
always assume the existence of the problem solutions S and answers A, as they can be derived automatically from various MathQA models. Each MWP P = (C, Q) consists of the story context C and the question Q. The problem solution S consists of n solution steps S = (s_1, ..., s_n). We define Socratic questioning such that each solution step s_i can be mapped to a sub-question q_i. We refer to q as the collection of all Socratic questions q_1, ..., q_n for a given MWP P. An example MWP is presented in Figure 2.

Our main module is the Question Generation (QG) module, a transformer-based (Vaswani et al., 2017) encoder-decoder model. The QG model takes the reference math word problem P and generates the Socratic questions q* as close to the true sub-questions q as possible. The learning objective of the QG module is:

    L_QG = - Σ_{i=1}^{n} log P_Dec(q_i | q_{:i-1}; Enc(P))    (1)

where Enc represents the encoder and Dec represents the decoder of the seq2seq QG model. Note that the sub-questions q_i are decoded word by word in an auto-regressive setting.

Next, we propose to inject the two Socratic questioning properties into our QG model as follows:

4.1 Focused questions

To learn a sequence of disciplined questions focused on specific reasoning steps in the MWP, it is important to ask the right set of questions. We propose a content planner ψ that serves as a guiding principle for the QG model to ask the right focused questions. In principle, the content planner module can extract any relevant information to assist the QG model, but for the task of math word problems, we restrict it to operators and equations.[1] Our planning strategies are defined as follows:

Operators: Given an MWP P, the content planner learns to identify the operations and operators (e.g., addition, multiplication, ...) involved in the problem. Since the operators play a significant role in a given MWP, the generated operators are used as the guiding principle for generating sub-questions with the QG model.

Equations: Equations contain important information for an MWP, as they involve not just the operators but also the quantities in the problem. Similar to operators, equations can serve as an important guiding principle for asking more focused questions leading towards a correct solution.

We use the same seq2seq architecture for the content planner module as for our QG model, with the only difference being that the output comprises a set of equations s*_1, ..., s*_n or just the operators within the equations (instead of the sub-questions). The generated operators/equations are appended to the input MWP P in the encoder of the QG module, and the modified focused learning objective L_QGf is:

    L_QGf = - Σ_{i=1}^{n} log P_Dec(q_i | q_{:i-1}; Enc([P ⊕ plan]))    (2)

Here, plan denotes the content planner module's output and ⊕ denotes the concatenation operation.

4.2 Goal-driven questions

An essential element of a good questioning strategy is to ask goal-driven questions that are not only factually associated with the main problem but also eventually help in answering the main question. However, any number of goal-driven questions can be asked for an MWP. Thus, our goal is to optimize the questioning strategy such that it is goal-driven, efficient, and rewarding at each step, making sure that the final goal can be achieved with these individual questions. We induce these properties in our QG model using various rewards that force the model to stay relevant to the problem. These rewards are defined as follows:

Fluency: It is important that the generated sub-questions are easily understandable and fluent in the meaning they represent. Although the QG training objective ensures the syntax and semantics of the generated questions, rewarding the system to stay fluent is necessary to remove repetitions and illogical questions.

Granularity: As solving an MWP usually involves multiple reasoning steps, asking a relevant question at each step can help in solving the MWP. Moreover, our questioning strategy is based on the premise that the questions are organised, structured, and follow a sequence. With the granularity reward, the model can learn to ask the right number of questions (compared to the number of reasoning steps needed to solve the MWP) in a specific sequence and refrain from unstructured questions.

Answerability: For every generated question, it is important to evaluate whether it can be answered given the context C and whether it can help in answering the overall MWP. We trained an external QA model that answers MWPs with the help of the sub-questions, and evaluated whether the generated questions can assist in answering the main problem. The answerability reward is provided both on a step-by-step basis (if the QA model can answer a sub-part of the main problem) and overall (whether the final answer is correct when all sub-questions are used).

During training, the QG model samples a set of sub-questions q' and calculates the various rewards based on q'. The parameters of the QG model are updated using the REINFORCE algorithm (Williams, 1992) as:

    L_RL = - E_{q' ~ P_Dec}[R(q, q', P)]
         = - Σ_{i=1}^{n} R(q, q', P) log P_Dec(q_i | q_{:i-1}; Enc(P))

The reward function R(q, q', P) combines the individual rewards for fluency, granularity, and answerability, which are calculated as follows:

Fluency: R_fl = BLEU(q, q'), where BLEU(.,.) represents the BLEU score (Papineni et al., 2002).

Granularity: R_g = F(q, q'), where F(q, q') = 1 - ||q| - |q'|| / |q'|, and |q| and |q'| denote the number of questions in q and q' respectively.

Answerability: R_ans = F(A, A'), where F(A, A') = 1 if the final answer from the QA model is correct when it is given the sub-questions q' alongside the MWP P, and 0 otherwise. A' denotes the answer from the QA model and A denotes the true answer.

We also evaluated the step-by-step performance of the QA model on the generated sub-questions to check whether it can answer each generated sub-question correctly. This allows us to provide partial rewards at each step of the generation model. The modified sub-step answerability reward is F(A, A') = #a' / |q'|, where #a' and |q'| denote the number of correctly answered generated sub-questions and the total number of generated questions, respectively.

Variation    GPT-2           GPT-3
P            5.45 (↓ 47%)    29 (↓ 38%)
P ⊕ {q}      10.46           47

Table 1: Comparison of Math QA accuracy (in %) with and without Socratic questions on the GSM8K test dataset. (↓) represents the drop in accuracy compared to using the Socratic questions (P ⊕ {q}). ⊕ represents the concatenation operation. The GPT-2 model was trained with and without Socratic questions, while the GPT-3 model (Brown et al., 2020) was prompted using a one-shot example (more details in Appendix subsection B.2).

4.3 Overall Loss Function

Finally, with the induced Socratic properties in the QG model, the total loss is defined as a combination of the focused learning loss L_QGf and the reward-based loss L_RL:

    L = α L_QGf + (1 - α) L_RL    (3)

where α is a weighting factor.

5 Empirical Analysis

We now demonstrate the effectiveness of inducing the defined questioning properties in large LMs.

Dataset We study the properties of Socratic questioning on the GSM8K dataset[2] (Cobbe et al., 2021), which consists of 8.5K grade-school math word problems. Each problem requires 2 to 8 reasoning steps to solve, and solutions primarily involve a sequence of elementary calculations using the basic arithmetic operations (+ - * /). The dataset is segmented into 7.5K training problems and 1K test problems.

Models We used T5 (Raffel et al., 2020) as the backbone of both our QG and content planning modules. For the reward-generating QA model, we used GPT-2 (Radford et al., 2019) for all RQ2 experiments because of resource constraints; a better QA model such as GPT-3 (Brown et al., 2020) could be used in the future. Both the QG and content planning models are fine-tuned on the GSM8K training set using the Huggingface library (Wolf et al., 2020).

[1] We also do not consider the step-by-step solutions S in our work, as creating step-by-step textual solutions requires a lot of time and effort from teachers, and even the largest language models fail to understand MWPs easily (Wei et al., 2022; Chowdhery et al., 2022).
[2] https://github.com/openai/grade-school-math
Variation            QA Accuracy
Granularity
P ⊕ {q}_0            5.45 (↓ 45%)
P ⊕ {q}_0.25         3.94 (↓ 62%)
P ⊕ {q}_0.5          3.35 (↓ 67%)
P ⊕ {q}_0.75         9.70 (↓ 7%)
P ⊕ {q}_1            10.46
Order
P ⊕ shuffle({q})     8.94 (↓ 14%)
Relevance
P ⊕ <base-ques>      2.57 (↓ 75%)

Table 2: Comparison of Math QA accuracy (in %) for different variations of experiments with ground-truth data. {q}_k indicates that only a randomly selected fraction k of the ground-truth sub-questions is used; e.g., {q}_0.25 means only 25% of the sub-questions are used. shuffle({q}) uses all sub-questions, but in shuffled order. Finally, <base-ques> are sub-questions generated from a T5-large model without fine-tuning on our task. (↓) represents the drop in accuracy compared to using the full Socratic questions (P ⊕ {q}). ⊕ represents the concatenation operation. GPT-2 small was used as the QA model for all the above experiments.

Planning        BLEU     BERT F1    #Q
None            51.53    0.783      0.428
Operators       54.98    0.788      0.642
 + planner      45.05    0.779      0.346
Equations       58.82    0.813      0.807
 + planner      52.48    0.787      0.485

Table 3: Focused questions: QG model performance compared on the gold set of ground-truth test questions with different planning strategies. Note that for the planner rows, content planning information from the ground-truth data is replaced with the output from the content planner model.

Implementation Details For training the models, we used an Nvidia Tesla A100 with 40 GB of GPU memory. We ran each experiment for 50 epochs, with periodical evaluation on the validation set. Training time without rewards is 10 minutes per epoch; with rewards, the training time per epoch increases to several hours. We used the T5-large model without modifications for the content planner and question generation modules, and GPT-2 small as the QA solver.

Evaluation Metrics We report automatic evaluation using SacreBLEU (Post, 2018), which is based on exact word overlap, and BERT F1 score (Zhang et al., 2019), with DeBERTa (He et al., 2020) as the similarity model. We also report #Q, the number of questions generated compared to the number of ground-truth reasoning steps (same as the Granularity reward), and Math QA Solver accuracy (same as the overall Answerability reward) to assess whether our generated questions helped the QA model reach the final numerical solution.

5.1 RQ1: Does sub-questioning help in understanding math concepts better?

We hypothesize that high-quality sub-questioning helps Math QA solvers reach the correct solution, especially when the questions are relevant to the concept to be learnt, in the right sequence (ordering), and with high granularity in their structure. We verify our hypothesis with a GPT-2 model as the QA solver after fine-tuning it on the training set of the GSM8K dataset, and with the GPT-3 model using one-shot prompting. Table 1 demonstrates that Socratic questioning improves the performance of the QA solver by as much as 45%. Then, we vary the properties of the test questions and examine the performance of the QA solver. Table 2 demonstrates that Socratic questions significantly improve the model performance, from 5.45% to 10.46%. Sub-questioning even helps when only 75% of the Socratic questions are retained (denoted as {q}_0.75 in the table) or when the order is shuffled (this might be an artefact of the dataset containing a minority of examples with a strict order). An interesting observation is that when the number of Socratic questions is reduced to half or fewer (while preserving their order), the model gets confused and performs worse than when it had no sub-questions. Finally, we take the pre-trained T5 model without fine-tuning it for our task, and use its outputs alongside the problem P as additional information to solve the problem. The performance drops as low as 2.57%, indicating that non-relevant information degrades the performance.

5.2 RQ2: What are the properties of a good questioning strategy?

We now present our analysis on inducing the two Socratic properties in LMs.

Focused generation: Table 3 compares the two planning strategies. The results demonstrate that planning strategies improve the baseline methods by more than 3% on BLEU score with operators as
planning, and by more than 7% with equations. Similar to the BLEU score, we achieve better performance on BERT F1 scores too. Finally, the correct-question count improves with planning and doubles compared to the no-planning variant. However, the results show that in all the variants the number of generated sub-questions is less than the number of reasoning steps. This could be improved further by oversampling during the beam search (beam search settings are the same for all variants in this experiment). The results degrade when the ground-truth content (both equations and planning) is replaced by our content planner module. This is expected, as errors in the content planning module cascade when generating sub-questions. However, with more powerful models, errors in the content planner can be reduced, leading to improvements in all the metrics. See the Appendix for experiments with the iterative splitting of an MWP into multiple parts for generation.

Strategy             BLEU     BERT F1    #Q
Baseline             13.02    0.566      0.056
Fine-tuned           51.53    0.783      0.428
 + fluency           52.21    0.784      0.440
 + # of questions    51.86    0.784      0.431
 + QA                52.22    0.783      0.417
 + all weighted      53.39    0.781      0.431
Eq planning          58.82    0.813      0.807
 + fluency           59.52    0.816      0.818
 + # of questions    59.75    0.814      0.811
 + QA                59.37    0.813      0.799
 + all weighted      59.62    0.815      0.815

Table 4: Goal-driven questions: QG model performance compared to the gold set of ground-truth questions with different rewards.

Goal-driven generation: Table 4 summarizes the results for the rewards as a strategy to incentivize the model to generate goal-driven and rewarding questions. We can observe the gains associated with each reward, for both the baseline model and the best-performing model from Table 3 (the equation-based content planning model in our case), suggesting the importance of the rewards.

Strategy        QA Accuracy
No planning     6.74
 + rewards      6.75
Operators       7.50
 + rewards      7.52
Equations       8.49
 + rewards      8.50

Table 5: Overall model variations and their influence on Math QA solver accuracy (in %) with different planning and reward strategies. Here, GPT-2 small is used as the QA model. Note that the upper limit using ground-truth questions is 10.46%, as shown in Table 1.

QA performance We study the impact of the QG model considering both Socratic properties, as shown in Table 5. Sub-questions with operators and equations as planning improve the QA performance by 1-2%. The rewards, although they improve the QG quality, have a negligible effect on QA performance. This is mainly because a slight improvement in sub-question quality does not necessarily help in reaching the final goal.

Planning             BLEU     BERT F1    #Q
None                 51.53    0.783      0.428
Diff op, diff #      51.59    0.785      0.415
Diff op, same #      54.26    0.786      0.546
Operators (op)       54.98    0.788      0.642

Table 6: Manipulating the planning inputs influences the quality of generated questions and overall QG model performance. same # has the same number of operators as the number of reasoning steps, but the types (+-/*) are shuffled; diff # has both the number and the type of operators shuffled.

5.3 Human quality evaluation

Next, we perform a human evaluation of the questions generated for 100 randomly selected test MWPs to assess the quality of our model's generations (from our best model) compared to the baseline (with no planning or reward-based strategies). For this analysis, we divided the questions among 4 annotators with an overlap of 40% of the questions among them (the overlap allows us to compute inter-annotator agreement), to evaluate the generated question quality on the following factors, using a 5-point Likert scale ranging from 1 (poor) to 5 (very good) for each dimension of quality assessment: repetition - whether questions are repeated; factuality - whether all questions can be solved by the information given in the problem; logical relevance - whether the question is logically related to the MWP; right sequence - correct sequence of questions leading to the final answer; granularity - questions are granular enough to solve the problem but are still
relevant, and no retrieval or basic common-sense questions are asked; completeness - questions are complete, with all steps covered to reach the final answer; and fluency - grammatical correctness and fluency of the language.

[Figure 3: bar chart comparing the baseline and our model on repetition, factuality, logical relevance, right sequence, granularity, completeness, and fluency, on a Likert scale from 1.0 to 5.0.]

Figure 3: Comparison of baseline versus our model-generated sub-questions on several metrics from our human evaluations (showing mean and standard deviation).

Figure 3 presents our findings, clearly demonstrating that our planning and reward strategies lead to superior-quality questions on the MWP task. Although both the baseline and our model-generated text achieve an almost full score (5) on the fluency parameter, our model-generated questions are more aligned to the MWP, thus leading to a higher score on all the other parameters. We also present a randomly selected sample of generated questions in the Appendix.

5.4 Ablation study: Manipulating question properties

Both planning strategies help generate better questions. To gain a deeper understanding of how the content planner ψ affects the generated questions, we further analyze the influence of operators as a planning strategy. Here, we randomize the operators and their sequence and measure the change in performance. Table 6 shows that the correct sequence of operators with the correct number of operators guides the generation process better than the randomized versions. The gap between the correct count of operators and a random count indicates that having a correct number of operators (of any type) is more valuable than having the exact type of operators. We observed that the number of operators guides the model in terms of the number of questions that need to be asked, while the type changes the overall quality. Needless to say, for the same number of operators, quality matters.

6 A preliminary user study with learners

Finally, we designed a preliminary user study to evaluate whether our generated questions, when presented as further problem-solving exercises (as typically used in educational settings), can help learners on the way to solving the overall problem. Given our research question, we hypothesized that guidance with questions can increase the overall problem-solving success rate for users in the questions (treatment) group compared to the no-questions control group. Our study uses Socratic questions as the main pedagogical intervention. We focus on participants who cannot solve a problem on the first attempt, to clearly distinguish the impact of automated sub-questioning. The key metric we measure is the success rate, defined as the percentage of correctly solved problems.

For our study, we built a simple user interface which allowed participants to solve math word problems (see Figure 5 and Figure 6 in the appendix). The interface contained a calculator which users could use if needed. The study comprises 5 pre-test problems and 8 problem-solving exercises, randomly selected from the GSM8K test set. Our user study with this interface was then deployed on Mechanical Turk; participants were hired through the platform and paid $10-12 per hour. We selected participants with moderate levels of prior knowledge, using the pre-test scores as the selection criterion: only those scoring in the range of 40-80% were admitted to the study. This way, we excluded both low-prior-knowledge participants and experts, to ensure there was a learning possibility. We randomly split the participants into two groups - a no-questions group (N = 19) with no question prompts, and a questions group (N = 17) with questions generated by our model. Both groups used the same interface for solving math word problems and had the opportunity to revise their answers after a first incorrect submission. The only difference was that, after incorrectly solving a problem on the first submission, participants in the questions group saw sub-questions, while those in the no-questions group were only prompted to try again. The sub-questions were generated using the best-performing model with planning and rewards.

The results of the user study are shown in Table 7. The first-attempt success rate is 58.4% for the control group and 66.0% for the treatment group, which might be the result of a slightly skewed prior-knowledge distribution of 0.68 and 0.65 for the treatment and control groups, respectively. Even though participants in the treatment group (M = 124.9, SD = 92.1) spent significantly more time (p < 0.01) solving problems during the second attempt relative to the control group (M = 41.5, SD = 31.4), we did not find any statistically significant difference between the groups in the second-submission success rate (p = 0.659, BF01 = 2.755, Cohen's d = 0.157), indicating weak odds favouring the null hypothesis and rather a small effect size.

Group            1st success        2nd success
                 M       SD         M       SD
No-questions     58.4    23.0       35.8    32.5
Questions        66.0    21.1       31.0    27.9

Table 7: User study success rates (in %) before and after the introduction of sub-questions. 1st success is the proportion of exercises solved correctly on the first attempt, and 2nd success is the proportion of correctly solved exercises on the second attempt (out of all those solved incorrectly on the first attempt).
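The success-rate bookkeeping behind the user-study metrics can be sketched as follows (illustrative data and naming; the per-participant logs are not part of the paper):

```python
# Success-rate computation for the user study: 1st success is the
# share of exercises solved on the first attempt; 2nd success is the
# share of retried exercises (failed first attempt) that were then
# solved on the second attempt.

def success_rates(first_ok, second_ok):
    n = len(first_ok)
    first_rate = 100.0 * sum(first_ok) / n
    # Only exercises failed on the first attempt count toward 2nd success.
    retried = [s for f, s in zip(first_ok, second_ok) if not f]
    second_rate = 100.0 * sum(retried) / len(retried) if retried else 0.0
    return first_rate, second_rate

# Hypothetical participant: 8 exercises, 5 solved on the first try,
# and 1 of the 3 retried exercises solved on the second attempt.
first = [True, True, True, True, True, False, False, False]
second = [False] * 5 + [True, False, False]
print(success_rates(first, second))  # first: 62.5%, second: ~33.3%
```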
Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 271–281.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198.

Ðord̄e Miladinović, Kumar Shridhar, Kushal Jain, Max B. Paulus, Joachim M. Buhmann, and Carl Allen. 2022. Learning to drop out: An adversarial approach to training sequence VAEs. arXiv preprint arXiv:2209.12590.

Liangming Pan, Wenhu Chen, Wenhan Xiong, Min-Yen Kan, and William Yang Wang. 2021. Unsupervised multi-hop question answering by question generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5866–5880, Online. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.

Raul Puri, Ryan Spring, Mohammad Shoeybi, Mostofa Patwary, and Bryan Catanzaro. 2020. Training question answering models from synthetic data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5811–5826, Online. Association for Computational Linguistics.

Chris Quintana, Brian J. Reiser, Elizabeth A. Davis, Joseph Krajcik, Eric Fretz, Ravit Golan Duncan, Eleni Kyza, Daniel Edelson, and Elliot Soloway. 2004. A scaffolding design framework for software to support science inquiry. Journal of the Learning Sciences, 13(3):337–386.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Sudha Rao and Hal Daumé III. 2018. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2737–2746, Melbourne, Australia. Association for Computational Linguistics.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266.

Brian J. Reiser. 2004. Scaffolding complex learning: The mechanisms of structuring and problematizing student work. Journal of the Learning Sciences, 13(3):273–304.

Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics, 3:1–13.

Devendra Singh Sachan, Lingfei Wu, Mrinmaya Sachan, and William Hamilton. 2020. Stronger transformers for neural multi-hop question generation. arXiv preprint arXiv:2010.11374.

Mrinmaya Sachan, Kumar Dubey, and Eric Xing. 2017. From textbooks to knowledge: A case study in harvesting axiomatic knowledge from textbooks to solve geometry problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 773–784.

Mrinmaya Sachan, Kumar Avinava Dubey, Tom M. Mitchell, Dan Roth, and Eric P. Xing. 2018. Learning pipelines with limited data and domain knowledge: A study in parsing physics problems. Advances in Neural Information Processing Systems, 31.

Mrinmaya Sachan and Eric Xing. 2017. Learning to solve geometry problems from natural language demonstrations in textbooks. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 251–261.

Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. 2015. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1466–1476, Lisbon, Portugal. Association for Computational Linguistics.

Jianhao Shen, Yichun Yin, Lin Li, Lifeng Shang, Xin Jiang, Ming Zhang, and Qun Liu. 2021. Generate & rank: A multi-task framework for math word problems.

Aarohi Srivastava et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Katherine Stasaski and Marti A. Hearst. 2017. Multiple choice question generation utilizing an ontology. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 303–312, Copenhagen, Denmark. Association for Computational Linguistics.

Alessandro Stolfo, Zhijing Jin, Kumar Shridhar, Bernhard Schölkopf, and Mrinmaya Sachan. 2022. A causal framework to quantify the robustness of mathematical reasoning with language models. arXiv preprint arXiv:2210.12023.

Yixuan Su, David Vandyke, Sihui Wang, Yimai Fang, and Nigel Collier. 2021. Plan-then-generate: Controlled data-to-text generation via planning. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 895–909, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Ruonan Wang, Yuxi Qian, Fangxiang Feng, Xiaojie Wang, and Huixing Jiang. 2022. Co-VQA: Answering by interactive sub question sequence. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2396–2408, Dublin, Ireland. Association for Computational Linguistics.

Zichao Wang, Andrew S. Lan, Weili Nie, Andrew E. Waters, Phillip J. Grimaldi, and Richard G. Baraniuk. 2018. QG-Net: a data-driven question generation model for educational content. In Proceedings of the Fifth Annual ACM Conference on Learning at Scale, pages 1–10.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. CoRR, abs/2201.11903.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256.
In Findings of the Association for Computational Lin-
guistics: EMNLP 2021, pages 2269–2279, Punta Cana, Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
Dominican Republic. Association for Computational Chaumond, Clement Delangue, Anthony Moi, Pierric
Linguistics. Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe
Davison, Sam Shleifer, Patrick von Platen, Clara Ma,
Vered Shwartz, Peter West, Ronan Le Bras, Chan- Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao,
dra Bhagavatula, and Yejin Choi. 2020. Unsuper- Sylvain Gugger, Mariama Drame, Quentin Lhoest, and
vised commonsense question answering with self-talk. Alexander Rush. 2020. Transformers: State-of-the-
In Proceedings of the 2020 Conference on Empirical art natural language processing. In Proceedings of
Methods in Natural Language Processing (EMNLP), the 2020 Conference on Empirical Methods in Natural
pages 4615–4629, Online. Association for Computa- Language Processing: System Demonstrations, pages
tional Linguistics. 38–45, Online. Association for Computational Linguis-
tics.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao,
Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, David Wood, Jerome S Bruner, and Gail Ross. 1976.
Adam R Brown, Adam Santoro, Aditya Gupta, Adrià The role of tutoring in problem solving. Child Psychol-
Garriga-Alonso, et al. 2022. Beyond the imitation ogy & Psychiatry & Allied Disciplines.
Terry Wood. 1994. Patterns of interaction and the cul-
ture of mathematics classrooms. In Cultural perspec-
tives on the mathematics classroom, pages 149–168.
Springer.
Zhipeng Xie and Shichao Sun. 2019. A goal-driven
tree-structured neural model for math word problems.
In IJCAI, pages 5299–5305.
Jipeng Zhang, Lei Wang, Roy Ka-Wei Lee, Yi Bin, Yan
Wang, Jie Shao, and Ee-Peng Lim. 2020. Graph-to-tree
learning for solving math word problems. Association
for Computational Linguistics.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q
Weinberger, and Yoav Artzi. 2019. Bertscore: Eval-
uating text generation with bert. arXiv preprint Figure 6: After submitting an incorrect solution on the
arXiv:1904.09675. first attempt in the treatment group, our model gen-
erated sub-questions are shown to the participants to
A Details of User Study guide them through the problem-solving process. The
control group only sees a prompt to try again.
We perform a user study using Amazon Mechanical Turk. Participants who did not spend a minimum time per question were excluded from the analysis. The generated questions used in the questions group are listed in Table 9.

not equal to 1, as there are some duplicates generated by the model and sometimes the split is not perfect.