Solving Quantitative Reasoning Problems With Language Models
NeurIPS 2022
Google Research
Abstract
Language models have achieved remarkable performance on a wide range of tasks
that require natural language understanding. Nevertheless, state-of-the-art models
have generally struggled with tasks that require quantitative reasoning, such as
solving mathematics, science, and engineering problems at the college level. To
help close this gap, we introduce Minerva, a large language model pretrained on
general natural language data and further trained on technical content. The model
achieves state-of-the-art performance on technical benchmarks without the use of
external tools. We also evaluate our model on over two hundred undergraduate-
level problems in physics, biology, chemistry, economics, and other sciences that
require quantitative reasoning, and find that the model can correctly answer nearly
a third of them.
1 Introduction
Artificial neural networks have seen remarkable success in a variety of domains including computer
vision, speech recognition, audio and image generation, translation, game playing, and robotics. In
particular, large language models have achieved excellent performance across a variety of natural
language tasks including common-sense reasoning, question answering, and summarization [1, 2, 3,
4, 5]. However, these models have struggled with tasks that require quantitative reasoning, such as
solving mathematics, science, and engineering problems [6, 7, 8].
Quantitative reasoning problems are an interesting domain of application for language models because
they test the capability of models on several fronts. They require the solver to correctly parse a
natural language input, potentially recall world knowledge that pertains to the problem, and apply
an algorithm or series of computations to the information provided in order to arrive at a correct
solution. They also require that the solver is able to correctly parse and generate precise sequences
of mathematical tokens, as well as apply a computational procedure to tokens via symbolic and
numerical manipulation. Finally, such problems are a proving ground for research toward robust
quantitative reasoning solvers that are useful in supporting the work of humans in scientific and
technical fields.
Previous research has shown that large language models achieve impressive performance on math and
programming questions after training on domain specific datasets [9, 10, 11]. In this work, we apply
this approach to quantitative reasoning problems, in which the model is required to provide a complete
and self-contained solution, without relying on external tools. Such tasks include mathematics word
problems, competition mathematics evaluations, and many problems in science and engineering.
* Equal leadership and advising contribution.
† Equal contribution.
Left panel — Model output: The line is parallel to $y = 4x + 6$, which means that it has the same slope as $y = 4x + 6$. Since the line passes through $(5, 10)$, we can use the point-slope form of a line to find the equation of the line:
$$y - 10 = 4(x - 5)$$
$$y - 10 = 4x - 20$$
$$y = 4x - 10$$
Now that we have the equation of the line, we can find the $y$-coordinate of the point where the line crosses the $y$-axis by substituting $x = 0$ into the equation: $y = 4 \cdot 0 - 10 = -10$.

Right panel — Question: Prove that for every $a, b$ with $b \neq a$, it holds that $\frac{a^2 + b^2}{2} > \left(\frac{a+b}{2}\right)^2$.
Model output: We have
$$\frac{a^2 + b^2}{2} > \left(\frac{a+b}{2}\right)^2$$
$$\frac{a^2 + b^2}{2} > \frac{a^2 + 2ab + b^2}{4}$$
$$2a^2 + 2b^2 > a^2 + 2ab + b^2$$
$$a^2 - 2ab + b^2 > 0$$
$$(a - b)^2 > 0,$$
which is true for all $a$ and $b$.
Figure 1: Example questions with solutions from Minerva 62B. These solutions were selected by majority vote
over 256 samples, or maj1@256. All questions were prompted with the same prompt (a 4-shot prompt from
MATH; see appendix). Left: Question from the MATH dataset (Algebra 4). The model derives the solution by
writing down a line equation, simplifying it, substituting x = 0, and solving for y. The model outputs valid
LaTeX along with natural language explanations. This detailed approach is different from the one used in the
ground truth target. Top right: Example question from the Joint Entrance Examination Main Math 2020 exam,
taken each year by almost 2M Indian high-school students intending to study engineering and similar fields.
Bottom right: Example question from the National Math Exam in Poland (May 2022). The exam is taken by
approximately 270K high-school students every year.
We present Minerva, a language model that achieves strong performance on many quantitative
reasoning tasks. Our model is able to process scientific and mathematical questions formulated in
natural language, and generate step-by-step solutions using correct LaTeX notation. Figures 1 and 2
show example responses from Minerva to mathematics and physics questions.
Minerva is based on PaLM general language models [5] that are further trained on a high-quality
dataset containing scientific and mathematical data. In particular, we start with 8B, 62B, and 540B
parameter pretrained models, and continue training them on our technical content dataset. We achieve
state-of-the-art performance on MATH [6], GSM8k [7], and a STEM subset of the MMLU [12]
dataset, all of which include math and science questions formulated in natural language. Notably, our
models achieve robust performance in a few-shot setting with no explicit training on these datasets.
The main novelty of this paper is the use of a large training dataset that juxtaposes natural language
with correctly formatted mathematical content in LaTeX. The data is collected from the arXiv
preprint server and from web pages that we carefully process to minimise the loss of mathematical
content. In addition to achieving state-of-the-art results, our work sets a new baseline for the
performance achievable on quantitative reasoning by increasing data quality and model size.
To further increase the coverage of quantitative reasoning benchmarks, we build a dataset of over 200
undergraduate-level questions in science and mathematics from MIT’s OpenCourseWare (OCW).
Question: A particle moves so that it is at $\left(3\sin\frac{t}{4},\, 3\cos\frac{t}{4}\right)$ at time $t$. Find the speed of the particle, measured in unit of distance per unit of time.

Model output: The speed is the magnitude of the velocity vector. Therefore, we must first find the velocity vector, which is the derivative of the position vector. We have
$$\mathbf{v} = \frac{d}{dt}\left(3\sin\frac{t}{4},\, 3\cos\frac{t}{4}\right) = \left(3\cos\frac{t}{4}\cdot\frac{1}{4},\, -3\sin\frac{t}{4}\cdot\frac{1}{4}\right) = \left(\frac{3}{4}\cos\frac{t}{4},\, -\frac{3}{4}\sin\frac{t}{4}\right).$$
Therefore, the magnitude of the velocity vector is
$$|\mathbf{v}| = \sqrt{\left(\frac{3}{4}\cos\frac{t}{4}\right)^2 + \left(-\frac{3}{4}\sin\frac{t}{4}\right)^2} = \sqrt{\frac{9}{16}\cos^2\frac{t}{4} + \frac{9}{16}\sin^2\frac{t}{4}} = \frac{3}{4}\sqrt{\cos^2\frac{t}{4} + \sin^2\frac{t}{4}} = \frac{3}{4}.$$
Therefore, the speed of the particle is $\frac{3}{4}$.
Figure 2: Example question from the MATH dataset (Precalculus 4), along with the ground truth target and
the Minerva solution. The model correctly takes the derivative of a position vector to compute the velocity. It
then computes the norm of the resulting vector, and uses a trigonometric identity to obtain a final numerical
answer. Minerva takes a completely different approach from the ground truth solution. The model output was
not modified, except for a few newlines that were removed to save space.
This provides a measure of our model’s quantitative reasoning abilities in a chain-of-thought context
beyond a pure mathematical setting.
Code generation. Applying code generating models to mathematical problems has been an active
area of exploration. PaLM [5] showed that a large language model with code in its training dataset
can achieve good performance on a code version of GSM8k. Furthermore, the Codex model [9] can
generate code solutions to MATH problems [11]. These solutions often rely on external libraries to
perform mathematical operations such as solving equations or taking limits. This approach is complementary
to ours: we directly probe the model's ability to arrive at an answer by relying only on its own reasoning
capability.
Formal mathematics. Mathematics developed as a discipline based in natural language, but its
axiomatic fundamentals make it possible to simulate mathematical thinking. This can be achieved
using specialized programming languages that facilitate the simulation of logical and mathematical
thinking using a computer, such as Coq [20], Isabelle [21], HOL4 [22], Lean [23], Metamath [24]
and Mizar [25]. Work on automation of proof assistants and automated theorem provers such as
E [26], leanCoP [27], and Vampire [28] has substantially benefited from integration with machine
learning methods [29, 30, 31, 32, 33].
Language models applied to formal and synthetic mathematical problems. Previous work
trained language models to predict mathematical expressions [34, 31, 32, 35, 36, 37, 38, 39]. In turn,
such a predictive model can be used to guide a proof search, as done by [32]. Large language models
excel in modelling natural language, though in the case of formal languages, models that facilitate
retaining information about the graph structure of a given mathematical formula, such as GNNs, are
still very competitive.
Modelling mathematics as a discipline of natural language. New benchmark datasets [6, 40]
cover more advanced mathematical topics. For specific subtasks that pertain to understanding
mathematics in natural language, such as searching for related equations, it is possible to use graph
convolutional networks [41]. However, for directly solving such tasks in natural language, language
models face limited competition from other classes of models.
Table 1: Proportion of data, and number of tokens, from each source in the technical training dataset. The
General Natural Language dataset is a subset of the dataset used to pretrain the model.
Table 2: Model architecture and continued training hyperparameters. Model training was resumed from the
pretrained PaLM models, and the number of steps quoted refers only to continued training on our technical
dataset.
Figure 3: Performance on MATH and MMLU-STEM by subtopic. Minerva achieves state-of-the-art results
on both datasets. maj1@k denotes evaluations where k samples were generated for each problem and only the
majority vote (most common answer) was selected [19]. For MATH, k = 256 for Minerva 8B and 62B, and
k = 64 for 540B. For MMLU-STEM, k = 16. davinci-002 is the latest publicly available model from OpenAI.
When sampling multiple times per problem, we use nucleus sampling [42] with temperature T = 0.6 and p = 0.95. For
generative tasks, the model produces a chain-of-thought answer and demarcates a final answer. We
evaluate a solution as correct if the final answer matches the ground truth solution, independent of the
quality of the chain-of-thought preceding it. To evaluate correctness, we parse the final answers and
compare them using the SymPy library [43]. This is done in order to correctly identify answers that
are mathematically equivalent, such as $1/\sqrt{3}$ and $\sqrt{3}/3$. See Appendix D.1 for further details.
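As an illustration of this kind of check (a minimal sketch, not our actual evaluation code; it assumes both answers are given as SymPy-parsable strings), equivalence can be tested by simplifying the difference of the two expressions:

```python
import sympy
from sympy.parsing.sympy_parser import parse_expr

def answers_equivalent(answer: str, target: str) -> bool:
    """Check whether two final-answer strings are mathematically equivalent.

    Minimal sketch: both strings are assumed to be SymPy-parsable expressions;
    equivalence is tested by simplifying their difference to zero.
    """
    try:
        difference = sympy.simplify(parse_expr(answer) - parse_expr(target))
        return difference == 0
    except (sympy.SympifyError, SyntaxError, TypeError):
        # Fall back to exact string comparison when parsing fails.
        return answer.strip() == target.strip()

# Example: 1/sqrt(3) and sqrt(3)/3 are recognized as the same answer.
print(answers_equivalent("1/sqrt(3)", "sqrt(3)/3"))  # True
```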
The existing datasets on which we focus are:
• MATH: a dataset of 12K middle school and high school mathematics problems [6]. Problem
statements are written in LaTeX. We prompt the model with a fixed 4-shot prompt (listed
in Appendix D.2; a sketch of how such a prompt is assembled appears after this list). This
prompt includes four random examples from the training dataset whose ground truth targets
are not too long.
• GSM8k: middle school math word problems [7]. Models are evaluated using the chain-of-
thought prompt from Wei et al. [16]. Previous models evaluated on GSM8k made use of an
external calculator. In this work, our model does not have access to any external tools.
• MMLU-STEM: subset of the MMLU dataset [12] focused on science, technology, engi-
neering, and mathematics (STEM). For the original version, we use the 5-shot prompt from
the development set for each task. We also consider chain-of-thought prompting for this
task, where we prompt the model with examples that include step-by-step solutions. We
use a multiple-choice version of the MATH prompt for topics that involve mathematical
reasoning, and add step-by-step solutions to the standard 5-shot prompts for the rest of the
topics. See Appendix G for more details.
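To make the few-shot setups above concrete, the following is a minimal sketch of assembling a fixed k-shot prompt from worked examples and a new question. The `Problem:`/`Solution:` labels are illustrative assumptions; the exact prompt formats we use are listed in Appendices D.2 and G.

```python
def build_few_shot_prompt(examples, question):
    """Assemble a fixed few-shot prompt: worked examples followed by the new question.

    `examples` is a list of (problem, solution) string pairs, e.g. the four MATH
    training problems in the fixed 4-shot prompt. The "Problem:"/"Solution:" labels
    are illustrative; the actual prompt format is given in the appendix.
    """
    parts = []
    for problem, solution in examples:
        parts.append(f"Problem:\n{problem}\n\nSolution:\n{solution}\n")
    # The model is expected to continue from the final "Solution:" marker.
    parts.append(f"Problem:\n{question}\n\nSolution:\n")
    return "\n".join(parts)
```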
Table 3: Model performance on several quantitative reasoning datasets. For majority voting we use
k = 256 (64 for 540B) samples for MATH, k = 64 for OCWCourses, k = 100 (40 for 540B) for GSM8k,
and k = 16 for MMLU-STEM. The PaLM GSM8k results do not use a calculator and were reported in [5].
We evaluated datasets that did not have published results on recent models using OpenAI davinci-002. Despite
MMLU-STEM being a multiple-choice task, we can apply majority voting by prompting the model to generate a
rationale prior to the final answer, sampling multiple times, and then taking a majority vote on the final answers.
Superscripts denote results quoted from previous work: (a) GPT-2 [6], (b) PaLM 540B maj1@40 [19], and
(c) Chinchilla [44].
We find that we can considerably outperform greedy decoding by sampling k > 1 solutions (with
a non-zero temperature) and selecting one using majority voting [19]. This consists of grouping
predictions with respect to their final answer and selecting the most common answer. We denote this
as maj1@k, following [17]. A variation of this algorithm, denoted majn@k, involves selecting the n
most common answers. Intuitively, the reason majority voting improves performance is that while
there are many ways to answer a question incorrectly, there are typically very few ways to answer
correctly. In other words, the truth is a Schelling point.
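As a minimal sketch (assuming the final answers have already been extracted from each sampled solution and normalized so that equivalent answers share a canonical form), maj1@k and majn@k can be computed as follows:

```python
from collections import Counter

def maj1_at_k(final_answers):
    """maj1@k: return the most common final answer among k sampled solutions.

    `final_answers` is assumed to contain one already-extracted, normalized
    final answer per sample.
    """
    answer, _ = Counter(final_answers).most_common(1)[0]
    return answer

def maj_n_at_k(final_answers, n):
    """majn@k: return the n most common final answers."""
    return [answer for answer, _ in Counter(final_answers).most_common(n)]

# Example: 5 samples, three of which agree on "3/4".
print(maj1_at_k(["3/4", "1/2", "3/4", "3/4", "2"]))  # "3/4"
```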
Contrast majority voting with pass@k, where a task is considered solved if any single sample solves
it out of k samples. See Section 4.2 for more details on pass@k performance. In Appendix E.1,
we report on how performance depends on k for different metrics. We find that while pass@k
continues to improve as k is increased, majority voting performance saturates faster: 97% of the large
k accuracy is achieved at k = 64 for MATH and k = 16 for GSM8k. This is likely because majority
voting selects the most common answer in the modeled distribution, and the error of this estimate
decreases with increasing k. This is in contrast to pass@k where the performance improvement
comes from the tail of the distribution, which can keep improving as k is increased.
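For reference, a common way to estimate pass@k from n ≥ k samples of which c are correct is the unbiased estimator of Chen et al. [9]; the sketch below illustrates that estimator and is not necessarily the exact procedure used in our evaluations.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct (Chen et al. [9]).

    pass@k = 1 - C(n - c, k) / C(n, k): the probability that at least one of k
    samples drawn without replacement from the n generated samples is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with hypothetical counts: 256 samples per problem, 40 of them correct.
print(pass_at_k(256, 40, 1))   # 0.15625 (equals c / n for k = 1)
print(pass_at_k(256, 40, 64))  # close to 1.0
```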
Log-likelihood is another metric that can be used to rerank samples. We found that majority voting
performs significantly better than log-likelihood reranking (see Appendix E.2).
3 Results
Table 3 summarizes the results for Minerva models and other models, on the evaluation datasets
described in Section 2.3. Figure 3 presents a breakdown of the MATH dataset results by subtopic.
For MMLU evaluations, unless otherwise noted, performance is measured by using the standard
5-shot prompt per topic and picking the answer with the highest score. When evaluating MMLU with
majority voting, we sample k = 16 model answers using a chain-of-thought prompt.
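A minimal sketch of the highest-score selection rule is given below; `score_fn` is a hypothetical stand-in for a function returning the model's log-likelihood of a candidate answer given the 5-shot prompt, and only the argmax selection is illustrated.

```python
def pick_best_choice(score_fn, prompt, choices):
    """Select the multiple-choice option that the model scores highest.

    `score_fn(prompt, choice)` is a hypothetical stand-in for the model's
    log-likelihood of the candidate answer text given the few-shot prompt.
    """
    scores = [score_fn(prompt, choice) for choice in choices]
    best_index = max(range(len(choices)), key=scores.__getitem__)
    return choices[best_index]

# Usage with a toy scoring function:
toy_scores = {"(A)": -4.2, "(B)": -1.3, "(C)": -3.8, "(D)": -5.0}
print(pick_best_choice(lambda p, c: toy_scores[c], "prompt text", list(toy_scores)))  # "(B)"
```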
We present model output samples in Figures 1 and 2, and additional output samples are listed in the
Appendix. In addition, we evaluated Minerva 62B on the National Math Exam in Poland and found
that it achieves a score of 57%, which happened to be the national average in 2021 [45, p. 23]. The
540B model achieves 65%.
We include results on the latest publicly available language model from OpenAI, davinci-002,
evaluated using the OpenAI API with temperature set to the official recommendation (T = 0.2). The
combination of training data, scale, and inference techniques yields state-of-the-art results on all the
technical tasks that we considered. For all tasks except GSM8k, the improvement over previous results
is considerable.
While our main focus is on few-shot evaluation, we also tried finetuning Minerva on MATH; this did
not yield any improvement. Interestingly, however, finetuning PaLM on MATH did give a significant
improvement. Further details can be found in Appendix E.3.
4 Performance Analysis
4.1 Model Mistakes
To better understand the types of mistakes our models make, we compare the performance of
Minerva 8B and Minerva 62B on 216 problems for which both models produced high-confidence majority
decisions. Specifically, we selected examples where the top answer received at least 15% of votes, and
where either Minerva 8B was correct and Minerva 62B was incorrect (15 samples), or vice versa (201
samples). The categories and examples for each category are described in Appendix J.2.
As shown in Table 4, the prevailing errors of the 8B model were related to incorrect reasoning or
calculations. Many of the calculation errors were relatively benign arithmetic mistakes. Solutions
that were too short were relatively rare (in these cases, the model immediately produces an incorrect
answer without any intermediate reasoning steps). Finally, in a few cases, the model hallucinates an
equation or mathematical fact that is not real.
In the samples where the 62B model was incorrect, the dominating failure modes were again incorrect
reasoning and incorrect calculations. In summary, we find that the 62B Minerva model retains most
of the skills of the 8B model and improves upon both reasoning and calculation robustness.
Table 4: Failure modes of the 8B Minerva model, out of 201 samples which the 62B model solved correctly
and the 8B model did not.
4.2 False Positives
Our evaluation checks only the final answer and does not validate the model’s chain of reasoning. This leaves open the possibility of false positives: samples which have
the correct final answer, but for which the reasoning is incomplete or incorrect.
We selected 100 random questions from MATH (20 per difficulty level), along with answers sampled
at zero temperature from the 62B model. We then manually inspected the answers to determine
the false positive rate, i.e. the ratio between the number of false positive examples and the number of
examples for which the final answer is correct. For difficulty levels 1, 2, 3, 4, and 5, the false positive
rates are < 5%, 10%, < 5%, 15%, and 30%, respectively. Averaging these per-level false positive
rates (weighted by true positive rates), the overall false positive rate is estimated to be 8%. Although
this overall rate is low, we find that it does increase with difficulty level.
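Written out, this aggregation is a weighted average of the per-level rates (a sketch of the weighting described above, not an exact reproduction of the computation):
$$\mathrm{FPR}_{\mathrm{overall}} \;\approx\; \frac{\sum_{\ell=1}^{5} \mathrm{TPR}_\ell \cdot \mathrm{FPR}_\ell}{\sum_{\ell=1}^{5} \mathrm{TPR}_\ell} \;\approx\; 8\%,$$
where $\mathrm{TPR}_\ell$ and $\mathrm{FPR}_\ell$ denote the true positive and false positive rates at difficulty level $\ell$.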
Our focus on pass@1 and majority voting as the primary evaluation metrics is due in part to the fact
that they are less susceptible to false positives than pass@k [17]. While the pass@256 accuracy
is 84.5% for the 62B model, false positives account for part of it. We inspected the samples that
failed in majority voting but passed on pass@k due to a single correct answer, and estimate the
false positive rate for pass@256 to be 30% among samples selected in this way. After removing
false positives, we estimate the pass@256 accuracy to be greater than 68%; see Appendix J.3 for
details.
5 Memorization
A central question in interpreting Minerva’s solutions is whether performance reflects genuine analytic
capability or instead rote memorization. This is especially relevant as there has been much prior work
indicating that language models often memorize some fraction of their training data [46, 47, 48].
When examining model solutions, we find that memorization of intermediate facts, such as numerical
values of square roots or trigonometric identities, is a crucial element of model solutions. We would
like to investigate a strong form of memorization, where model performance is a result of memorizing
the explicit problems and solutions in our evaluation set, as well as a weaker form, where the model
has memorized alternate answers to the same questions.
In order to evaluate the degree to which our models solve problems by recalling information memo-
rized from training data, we conduct three analyses on the MATH dataset. First we directly search
for problems and solutions in our training corpus. Next, we generate modified versions of problems
and evaluate our models’ robustness to these changes. Finally, we measure the degree of overlap
between the ground truth solutions and solutions generated by our model and measure the effect of
this similarity on model performance. Overall, we find little evidence that the model’s performance
can be attributed to memorization.
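As an illustration of the third analysis, the following is a minimal sketch of measuring solution overlap with sentence-level BLEU. NLTK is used here as an assumed implementation; the exact BLEU variant and tokenization are not specified in the text above.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def solution_overlap(ground_truth: str, generated: str) -> float:
    """BLEU score of a generated solution against the ground-truth solution.

    Sketch using NLTK sentence-level BLEU with whitespace tokenization and
    smoothing; the actual implementation and tokenization may differ.
    """
    reference = [ground_truth.split()]
    hypothesis = generated.split()
    return sentence_bleu(reference, hypothesis,
                         smoothing_function=SmoothingFunction().method1)

# Samples with BLEU at or below a threshold are kept when recomputing
# majority-vote accuracy (Figure 4, right).
```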
Figure 4: Results indicating lack of memorization on MATH. Left, Center: Accuracy of original questions
from the MATH dataset and their modified versions. Each point represents a question. The x axis is accuracy on
the original question, and the y axis is accuracy on the modified one. Right: Majority vote accuracy, computed
only on samples with BLEU score to the ground truth solution less than or equal to the x-axis value.
A further limitation is that the model has no mechanism for automatically verifying the correctness of its outputs. If these issues could be solved, we expect the impacts of this model
to be broadly positive. A direct application could be an accessible and affordable math tutor, which
could help improve educational inequalities.
7 Acknowledgments
We thank David Andre, Jacob Austin, Maarten Bosma, Aakanksha Chowdhery, Sergey Ioffe, Colin
Raffel, Charles Sutton, Christian Szegedy, Stanislas Polu, and Harrison Edwards for helpful discus-
sions.
References
[1] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[3] Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song,
John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language
models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446,
2021.
[4] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari,
Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using
deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language
model. arXiv preprint arXiv:2201.11990, 2022.
[5] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker
Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes,
Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson,
Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin,
Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier
Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David
Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani
Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat,
Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei
Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy
Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language
modeling with pathways, 2022. URL https://arxiv.org/abs/2204.02311.
[6] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn
Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset,
2021. URL https://arxiv.org/abs/2103.03874.
[7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John
Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.
org/abs/2110.14168.
[8] Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta
Baral, and Ashwin Kalyan. Numglue: A suite of fundamental yet challenging mathematical
reasoning tasks, 2022. URL https://arxiv.org/abs/2204.05660.
[9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto,
Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul
Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke
Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad
Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias
Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex
Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain,
William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant
Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie
Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and
Wojciech Zaremba. Evaluating large language models trained on code, 2021. URL https:
//arxiv.org/abs/2107.03374.
[10] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David
Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large
language models. arXiv preprint arXiv:2108.07732, 2021.
[11] Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin
Liu, Linda Chen, Sunny Tran, Newman Cheng, Roman Wang, Nikhil Singh, Taylor L. Patti,
Jayson Lynch, Avi Shporer, Nakul Verma, Eugene Wu, and Gilbert Strang. A neural network
solves, explains, and generates university math problems by program synthesis and few-shot
learning at human level, 2021. URL https://arxiv.org/abs/2112.15594.
[12] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. CoRR, abs/2009.03300,
2020. URL https://arxiv.org/abs/2009.03300.
[13] Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Du-
mas Ang. Parsing algebraic word problems into equations. Transactions of the Association for
Computational Linguistics, 3, 2015.
[14] Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning
to solve arithmetic word problems with verb categorization. In EMNLP, pages 523–533, 2014.
[15] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin,
David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton,
and Augustus Odena. Show Your Work: Scratchpads for Intermediate Computation with
Language Models. arXiv e-prints, art. arXiv:2112.00114, November 2021.
[16] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny
Zhou. Chain of thought prompting elicits reasoning in large language models, 2022. URL
https://arxiv.org/abs/2201.11903.
[17] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond,
Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy,
Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl,
Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson,
Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level
code generation with alphacode, 2022. URL https://arxiv.org/abs/2203.07814.
[18] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha,
Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee,
Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun,
Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts,
Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch,
Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel
Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen,
Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina,
Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm,
Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-
Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. Lamda: Language models for dialog
applications, 2022. URL https://arxiv.org/abs/2201.08239.
[19] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha
Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language
models, 2022. URL https://arxiv.org/abs/2203.11171.
[20] The Coq development team. The Coq reference manual, 2022. URL http://coq.inria.
fr. Version 8.15.
[21] Makarius Wenzel, Lawrence C. Paulson, and Tobias Nipkow. The isabelle framework. In
Otmane Ait Mohamed, César Muñoz, and Sofiène Tahar, editors, Theorem Proving in Higher
Order Logics, pages 33–38, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg. ISBN
978-3-540-71067-7.
[22] John Harrison. Hol light: A tutorial introduction. In Mandayam Srivas and Albert Camilleri,
editors, Formal Methods in Computer-Aided Design, pages 265–269, Berlin, Heidelberg, 1996.
Springer Berlin Heidelberg. ISBN 978-3-540-49567-3.
[23] Leonardo Mendonça de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob
von Raumer. The lean theorem prover (system description). In Amy P. Felty and Aart Mid-
deldorp, editors, CADE, volume 9195 of Lecture Notes in Computer Science, pages 378–388.
Springer, 2015. ISBN 978-3-319-21400-9. URL http://dblp.uni-trier.de/db/
conf/cade/cade2015.html#MouraKADR15.
[24] Norman D. Megill and David A. Wheeler. Metamath: A Computer Lan-
guage for Pure Mathematics. Lulu Press, Morrisville, North Carolina,
2019. URL http://us.metamath.org/downloads/metamath.pdf.
[25] Adam Grabowski, Artur Kornilowicz, and Adam Naumowicz. Mizar in a nutshell. J. Formalized
Reasoning, 3(2):153–245, 2010. doi: 10.6092/issn.1972-5787/1980. URL https://doi.
org/10.6092/issn.1972-5787/1980.
[26] Stephan Schulz. System Description: E 1.8. In Ken McMillan, Aart Middeldorp, and Andrei
Voronkov, editors, Proc. of the 19th LPAR, volume 8312 of LNCS. Springer, 2013.
[27] Jens Otten. leancop 2.0 and ileancop 1.2: High performance lean theorem proving in classical
and intuitionistic logic (system descriptions). In Alessandro Armando, Peter Baumgartner,
and Gilles Dowek, editors, Automated Reasoning, pages 283–291, Berlin, Heidelberg, 2008.
Springer Berlin Heidelberg. ISBN 978-3-540-71070-7.
[28] Laura Kovács and Andrei Voronkov. First-order theorem proving and Vampire. In CAV, 2013.
[29] Alexander A. Alemi, François Chollet, Niklas Een, Geoffrey Irving, Christian Szegedy, and
Josef Urban. Deepmath - Deep Sequence Models for Premise Selection. In Proceedings
of the 30th International Conference on Neural Information Processing Systems, NIPS’16,
pages 2243–2251, USA, 2016. Curran Associates Inc. ISBN 978-1-5108-3881-9. URL
http://dl.acm.org/citation.cfm?id=3157096.3157347.
[30] Zarathustra Amadeus Goertzel, Karel Chvalovský, Jan Jakubuv, Miroslav Olsák, and Josef
Urban. Fast and slow enigmas and parental guidance. CoRR, abs/2107.06750, 2021. URL
https://arxiv.org/abs/2107.06750.
[31] Wenda Li, Lei Yu, Yuhuai Wu, and Lawrence C. Paulson. Isarstep: a benchmark for high-level
mathematical reasoning. In ICLR, 2021.
[32] Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving.
arXiv preprint arXiv:2009.03393, 2020.
[33] Cezary Kaliszyk, Josef Urban, Henryk Michalewski, and Miroslav Olšák. Reinforcement
learning of theorem proving. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31.
Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/
2018/file/55acf8539596d25624059980986aaa78-Paper.pdf.
[34] Markus Norman Rabe, Dennis Lee, Kshitij Bansal, and Christian Szegedy. Mathematical
reasoning via self-supervised skip-tree training. 2021. URL https://openreview.net/
forum?id=YmqAnY0CMEy.
[35] Yuhuai Wu, Markus N. Rabe, Wenda Li, Jimmy Ba, Roger B. Grosse, and Christian Szegedy.
LIME: learning inductive bias for primitives of mathematical reasoning. In Marina Meila and
Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning,
ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning
Research, pages 11251–11262. PMLR, 2021. URL http://proceedings.mlr.press/
v139/wu21c.html.
[36] Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward Ayers, and Stanislas Polu. Proof artifact
co-training for theorem proving with language models. In International Conference on Learning
Representations, 2022. URL https://openreview.net/forum?id=rpxJc9j04U.
[37] Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya
Sutskever. Formal mathematics statement curriculum learning. CoRR, abs/2202.01344, 2022.
URL https://arxiv.org/abs/2202.01344.
[38] Albert Q. Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygózdz,
Piotr Milos, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding hammers to integrate language
models and automated theorem provers. CoRR, abs/2205.10893, 2022. doi: 10.48550/arXiv.
2205.10893. URL https://doi.org/10.48550/arXiv.2205.10893.
[39] Yuhuai Wu, Albert Q. Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and
Christian Szegedy. Autoformalization with large language models. CoRR, abs/2205.12615,
2022. doi: 10.48550/arXiv.2205.12615. URL https://doi.org/10.48550/arXiv.
2205.12615.
[40] Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, and Kyunghyun
Cho. Naturalproofs: Mathematical theorem proving in natural language. CoRR, abs/2104.01112,
2021. URL https://arxiv.org/abs/2104.01112.
[41] Lukas Pfahler and Katharina Morik. Semantic search in millions of equations, 2020.
[42] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural
text degeneration, 2019. URL https://arxiv.org/abs/1904.09751.
[43] Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev,
Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rath-
nayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam
Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Štěpán Roučka,
Ashutosh Saboo, Isuru Fernando, Sumith Kulal, Robert Cimrman, and Anthony Scopatz. Sympy:
symbolic computing in python. PeerJ Computer Science, 3:e103, January 2017. ISSN 2376-
5992. doi: 10.7717/peerj-cs.103. URL https://doi.org/10.7717/peerj-cs.103.
[44] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom
Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia
Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent
Sifre. Training compute-optimal large language models, 2022. URL https://arxiv.org/
abs/2203.15556.
[45] CKE. Skale centylowe wyników – matura 2021 [Percentile scores of results – matura 2021], 2021. URL
https://g.infor.pl/p/_files/37182000/1-matura-2021-centyle-37181904.pdf.
[46] Trieu H Trinh and Quoc V Le. A simple method for commonsense reasoning. arXiv preprint
arXiv:1806.02847, 2018.
[47] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[48] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and
Chiyuan Zhang. Quantifying memorization across neural language models, 2022. URL
https://arxiv.org/abs/2202.07646.
[49] Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel
Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor
Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini
Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis
Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan
Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten
Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan
Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, and Andrea Gesmundo. Scaling
up models and data with t5x and seqio, 2022. URL https://arxiv.org/abs/2203.
17189.
[50] Abhilasha Ravichander, Aakanksha Naik, Carolyn Rose, and Eduard Hovy. Equate: A bench-
mark evaluation framework for quantitative reasoning in natural language inference, 2019. URL
https://arxiv.org/abs/1901.03735.
Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Section 6
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See
Section 6
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experi-
mental results (either in the supplemental material or as a URL)? [No] The code and
models are proprietary.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes]
(c) Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [No] We only run the experiments once. This is the standard
approach for work with large language models because of their significant compute
requirements.
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes] OCWCourses will be released with
license CC BY-NC-SA 4.0. [No] The mathematical webpages dataset is proprietary.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
OCWCourses questions are part of supplementary material
(d) Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [Yes] We share the main instructions given to contractors.
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [No] We hired a dedicated workforce for this
task rather than using a crowdsourcing service. As such, our approach to hiring and
compensation is proprietary.