Solving Quantitative Reasoning Problems With Language Models
NeurIPS 2022
Google Research
Abstract
Language models have achieved remarkable performance on a wide range of tasks
that require natural language understanding. Nevertheless, state-of-the-art models
have generally struggled with tasks that require quantitative reasoning, such as
solving mathematics, science, and engineering problems at the college level. To
help close this gap, we introduce Minerva, a large language model pretrained on
general natural language data and further trained on technical content. The model
achieves state-of-the-art performance on technical benchmarks without the use of
external tools. We also evaluate our model on over two hundred undergraduate-
level problems in physics, biology, chemistry, economics, and other sciences that
require quantitative reasoning, and find that the model can correctly answer nearly
a third of them.
1 Introduction
Artificial neural networks have seen remarkable success in a variety of domains including computer
vision, speech recognition, audio and image generation, translation, game playing, and robotics. In
particular, large language models have achieved excellent performance across a variety of natural
language tasks including common-sense reasoning, question answering, and summarization [1, 2, 3,
4, 5]. However, these models have struggled with tasks that require quantitative reasoning, such as
solving mathematics, science, and engineering problems [6, 7, 8].
Quantitative reasoning problems are an interesting domain of application for language models because
they test the capability of models on several fronts. They require the solver to correctly parse a
natural language input, potentially recall world knowledge that pertains to the problem, and apply
an algorithm or series of computations to the information provided in order to arrive at a correct
solution. They also require that the solver is able to correctly parse and generate precise sequences
of mathematical tokens, as well as apply a computational procedure to tokens via symbolic and
numerical manipulation. Finally, such problems are a proving ground for research toward robust
quantitative reasoning solvers that are useful in supporting the work of humans in scientific and
technical fields.
Previous research has shown that large language models achieve impressive performance on math and
programming questions after training on domain specific datasets [9, 10, 11]. In this work, we apply
this approach to quantitative reasoning problems, in which the model is required to provide a complete
and self-contained solution, without relying on external tools. Such tasks include mathematics word
problems, competition mathematics evaluations, and many problems in science and engineering.
* Equal leadership and advising contribution.
† Equal contribution.
Left panel — Model output: The line is parallel to $y = 4x + 6$, which means that it has the same slope as $y = 4x + 6$. Since the line passes through $(5, 10)$, we can use the point-slope form of a line to find the equation of the line:
$$y - 10 = 4(x - 5)$$
$$y - 10 = 4x - 20$$
$$y = 4x - 10$$
Now that we have the equation of the line, we can find the $y$-coordinate of the point where the line crosses the $y$-axis by substituting $x = 0$ into the equation: $y = 4 \cdot 0 - 10 = -10$.

Right panel — Question: Prove that for every $a, b$ with $b \neq a$, it holds that $\frac{a^2 + b^2}{2} > \left(\frac{a+b}{2}\right)^2$.
Model output: We have
$$\frac{a^2 + b^2}{2} > \left(\frac{a+b}{2}\right)^2$$
$$\frac{a^2 + b^2}{2} > \frac{a^2 + 2ab + b^2}{4}$$
$$2a^2 + 2b^2 > a^2 + 2ab + b^2$$
$$a^2 - 2ab + b^2 > 0$$
$$(a - b)^2 > 0,$$
which is true for all $a$ and $b$.
Figure 1: Example questions with solutions from Minerva 62B. These solutions were selected by majority vote
over 256 samples, or maj1@256. All questions were prompted with the same prompt (a 4-shot prompt from
MATH; see appendix). Left: Question from the MATH dataset (Algebra 4). The model derives the solution by
writing down a line equation, simplifying it, substituting x = 0, and solving for y. The model outputs valid
LaTeX along with natural language explanations. This detailed approach is different from the one used in the
ground truth target. Top right: Example question from the Joint Entrance Examination Main Math 2020 exam,
taken each year by almost 2M Indian high-school students intending to study engineering and similar fields.
Bottom right: Example question from the National Math Exam in Poland (May 2022). The exam is taken by
approximately 270K high-school students every year.
We present Minerva, a language model that achieves strong performance on many quantitative
reasoning tasks. Our model is able to process scientific and mathematical questions formulated in
natural language, and generate step-by-step solutions using correct LaTeX notation. Figures 1 and 2
show example responses from Minerva to mathematics and physics questions.
Minerva is based on PaLM general language models [5] that are further trained on a high-quality
dataset containing scientific and mathematical data. In particular, we start with 8B, 62B, and 540B
parameter pretrained models, and continue training them on our technical content dataset. We achieve
state-of-the-art performance on MATH [6], GSM8k [7], and a STEM subset of the MMLU [12]
dataset, all of which include math and science questions formulated in natural language. Notably, our
models achieve robust performance in a few-shot setting with no explicit training on these datasets.
The main novelty of this paper is the use of a large training dataset that juxtaposes natural language
with correctly formatted mathematical content in LaTeX. The data is collected from the arXiv
preprint server and from web pages that we carefully process to minimise the loss of mathematical
content. In addition to achieving state-of-the-art results, our work sets a new baseline for the
performance achievable on quantitative reasoning by increasing data quality and model size.
To further increase the coverage of quantitative reasoning benchmarks, we build a dataset of over 200
undergraduate-level questions in science and mathematics from MIT’s OpenCourseWare (OCW).
Question: A particle moves so that it is at $\left(3\sin\frac{t}{4},\, 3\cos\frac{t}{4}\right)$ at time $t$. Find the speed of the particle, measured in unit of distance per unit of time.

Model output: The speed is the magnitude of the velocity vector. Therefore, we must first find the velocity vector, which is the derivative of the position vector. We have
$$\mathbf{v} = \frac{d}{dt}\left(3\sin\frac{t}{4},\, 3\cos\frac{t}{4}\right) = \left(3\cos\frac{t}{4}\cdot\frac{1}{4},\, -3\sin\frac{t}{4}\cdot\frac{1}{4}\right) = \left(\frac{3}{4}\cos\frac{t}{4},\, -\frac{3}{4}\sin\frac{t}{4}\right).$$
Therefore, the magnitude of the velocity vector is
$$|\mathbf{v}| = \sqrt{\left(\frac{3}{4}\cos\frac{t}{4}\right)^2 + \left(-\frac{3}{4}\sin\frac{t}{4}\right)^2} = \sqrt{\frac{9}{16}\cos^2\frac{t}{4} + \frac{9}{16}\sin^2\frac{t}{4}} = \frac{3}{4}\sqrt{\cos^2\frac{t}{4} + \sin^2\frac{t}{4}} = \frac{3}{4}.$$
Therefore, the speed of the particle is $\frac{3}{4}$.
Figure 2: Example question from the MATH dataset (Precalculus 4), along with the ground truth target and
the Minerva solution. The model correctly takes the derivative of a position vector to compute the velocity. It
then computes the norm of the resulting vector, and uses a trigonometric identity to obtain a final numerical
answer. Minerva takes a completely different approach from the ground truth solution. The model output was
not modified, except for a few newlines that were removed to save space.
This provides a measure of our model’s quantitative reasoning abilities in a chain-of-thought context
beyond a pure mathematical setting.
Code generation. Applying code generating models to mathematical problems has been an active
area of exploration. PaLM [5] showed that a large language model with code in its training dataset
can achieve good performance on a code version of GSM8k. Furthermore, the Codex model [9] can
generate code solutions to MATH problems [11]. These solutions often rely on external libraries to
perform mathematical operations such as solving equations or taking limits. This approach is complementary
to ours: we directly probe the model's ability to arrive at an answer by relying only on its own reasoning
capability.
Formal mathematics. Mathematics developed as a discipline based in natural language, but its
axiomatic fundamentals make it possible to simulate mathematical thinking. This can be achieved
using specialized programming languages that facilitate the simulation of logical and mathematical
thinking using a computer, such as Coq [20], Isabelle [21], HOL4 [22], Lean [23], Metamath [24]
and Mizar [25]. Work on automation of proof assistants and automated theorem provers such as
E [26], leanCoP [27], and Vampire [28] has substantially benefited from integration with machine
learning methods [29, 30, 31, 32, 33].
Language models applied to formal and synthetic mathematical problems. Previous work
trained language models to predict mathematical expressions [34, 31, 32, 35, 36, 37, 38, 39]. In turn,
such a predictive model can be used to guide a proof search, as done by [32]. Large language models
excel in modelling natural language, though in the case of formal languages, models that facilitate
retaining information about the graph structure of a given mathematical formula, such as GNNs, are
still very competitive.
Modelling mathematics as a discipline of natural language. New benchmark datasets [6, 40]
cover more advanced mathematical topics. For specific subtasks that pertain to understanding
mathematics in natural language, such as searching for related equations, it is possible to use graph
convolutional networks [41]. However, for directly solving such tasks in natural language, language
models face limited competition from other classes of models.
Table 1: Proportion of data, and number of tokens, from each source in the technical training dataset. The
General Natural Language dataset is a subset of the dataset used to pretrain the model.
Table 2: Model architecture and continued training hyperparameters. Model training was resumed from the
pretrained PaLM models, and the number of steps quoted refers only to continued training on our technical
dataset.
Figure 3: Performance on MATH and MMLU-STEM by subtopic. Minerva achieves state-of-the-art results
on both datasets. maj1@k denotes evaluations where k samples were generated for each problem and only the
majority vote (most common answer) was selected [19]. For MATH, k = 256 for Minerva 8B and 62B, and
k = 64 for 540B. For MMLU-STEM, k = 16. davinci-002 is the latest publicly available model from OpenAI.
When sampling multiple times per problem, we use nucleus sampling [42] with temperature T = 0.6 and p = 0.95. For
generative tasks, the model produces a chain-of-thought answer and demarcates a final answer. We
evaluate a solution as correct if the final answer matches the ground truth solution, independent of the
quality of the chain-of-thought preceding it. To evaluate correctness, we parse the final answers and
compare them using the SymPy library [43]. This is done in order to correctly identify answers that
are mathematically equivalent, such as $1/\sqrt{3}$ and $\sqrt{3}/3$. See Appendix D.1 for further details.
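As an illustration of this kind of check (a minimal sketch, not our actual evaluation code; it assumes both answers are given as SymPy-parsable strings), equivalence can be tested by simplifying the difference of the two expressions:

```python
import sympy
from sympy.parsing.sympy_parser import parse_expr

def answers_equivalent(answer: str, target: str) -> bool:
    """Check whether two final-answer strings are mathematically equivalent.

    Minimal sketch: both strings are assumed to be SymPy-parsable expressions;
    equivalence is tested by simplifying their difference to zero.
    """
    try:
        difference = sympy.simplify(parse_expr(answer) - parse_expr(target))
        return difference == 0
    except (sympy.SympifyError, SyntaxError, TypeError):
        # Fall back to exact string comparison when parsing fails.
        return answer.strip() == target.strip()

# Example: 1/sqrt(3) and sqrt(3)/3 are recognized as the same answer.
print(answers_equivalent("1/sqrt(3)", "sqrt(3)/3"))  # True
```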
The existing datasets on which we focus are:
• MATH: a dataset of 12K middle school and high school mathematics problems [6]. Problem
statements are written in LaTeX. We prompt the model with a fixed 4-shot prompt (listed
in Appendix D.2; a sketch of how such a prompt is assembled appears after this list). This
prompt includes four random examples from the training dataset whose ground truth targets
are not too long.
• GSM8k: middle school math word problems [7]. Models are evaluated using the chain-of-
thought prompt from Wei et al. [16]. Previous models evaluated on GSM8k made use of an
external calculator. In this work, our model does not have access to any external tools.
• MMLU-STEM: subset of the MMLU dataset [12] focused on science, technology, engi-
neering, and mathematics (STEM). For the original version, we use the 5-shot prompt from
the development set for each task. We also consider chain-of-thought prompting for this
task, where we prompt the model with examples that include step-by-step solutions. We
use a multiple-choice version of the MATH prompt for topics that involve mathematical
reasoning, and add step-by-step solutions to the standard 5-shot prompts for the rest of the
topics. See Appendix G for more details.
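To make the few-shot setups above concrete, the following is a minimal sketch of assembling a fixed k-shot prompt from worked examples and a new question. The `Problem:`/`Solution:` labels are illustrative assumptions; the exact prompt formats we use are listed in Appendices D.2 and G.

```python
def build_few_shot_prompt(examples, question):
    """Assemble a fixed few-shot prompt: worked examples followed by the new question.

    `examples` is a list of (problem, solution) string pairs, e.g. the four MATH
    training problems in the fixed 4-shot prompt. The "Problem:"/"Solution:" labels
    are illustrative; the actual prompt format is given in the appendix.
    """
    parts = []
    for problem, solution in examples:
        parts.append(f"Problem:\n{problem}\n\nSolution:\n{solution}\n")
    # The model is expected to continue from the final "Solution:" marker.
    parts.append(f"Problem:\n{question}\n\nSolution:\n")
    return "\n".join(parts)
```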
Table 3: Model performance on several quantitative reasoning datasets. For majority voting we use
k = 256 (64 for 540B) samples for MATH, k = 64 for OCWCourses, k = 100 (40 for 540B) for GSM8k,
and k = 16 for MMLU-STEM. The PaLM GSM8k results do not use a calculator and were reported in [5].
We evaluated datasets that did not have published results on recent models using OpenAI davinci-002. Despite
MMLU-STEM being a multiple-choice task, we can apply majority voting by prompting the model to generate a
rationale prior to the final answer, sampling multiple times, and then taking a majority vote on the final answers.
Superscripts denote results quoted from previous work: (a) GPT-2 [6], (b) PaLM 540B maj1@40 [19], and
(c) Chinchilla [44].
We find that we can considerably outperform greedy decoding by sampling k > 1 solutions (with
a non-zero temperature) and selecting one using majority voting [19]. This consists of grouping
predictions with respect to their final answer and selecting the most common answer. We denote this
as maj1@k, following [17]. A variation of this algorithm, denoted majn@k, involves selecting the n
most common answers. Intuitively, the reason majority voting improves performance is that while
there are many ways to answer a question incorrectly, there are typically very few ways to answer
correctly. In other words, the truth is a Schelling point.
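As a minimal sketch (assuming the final answers have already been extracted from each sampled solution and normalized so that equivalent answers share a canonical form), maj1@k and majn@k can be computed as follows:

```python
from collections import Counter

def maj1_at_k(final_answers):
    """maj1@k: return the most common final answer among k sampled solutions.

    `final_answers` is assumed to contain one already-extracted, normalized
    final answer per sample.
    """
    answer, _ = Counter(final_answers).most_common(1)[0]
    return answer

def maj_n_at_k(final_answers, n):
    """majn@k: return the n most common final answers."""
    return [answer for answer, _ in Counter(final_answers).most_common(n)]

# Example: 5 samples, three of which agree on "3/4".
print(maj1_at_k(["3/4", "1/2", "3/4", "3/4", "2"]))  # "3/4"
```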
Contrast majority voting with pass@k, where a task is considered solved if any single sample solves
it out of k samples. See Section 4.2 for more details on pass@k performance. In Appendix E.1,
we report on how performance depends on k for different metrics. We find that while pass@k
continues to improve as k is increased, majority voting performance saturates faster: 97% of the large
k accuracy is achieved at k = 64 for MATH and k = 16 for GSM8k. This is likely because majority
voting selects the most common answer in the modeled distribution, and the error of this estimate
decreases with increasing k. This is in contrast to pass@k where the performance improvement
comes from the tail of the distribution, which can keep improving as k is increased.
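For reference, a common way to estimate pass@k from n ≥ k samples of which c are correct is the unbiased estimator of Chen et al. [9]; the sketch below illustrates that estimator and is not necessarily the exact procedure used in our evaluations.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct (Chen et al. [9]).

    pass@k = 1 - C(n - c, k) / C(n, k): the probability that at least one of k
    samples drawn without replacement from the n generated samples is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with hypothetical counts: 256 samples per problem, 40 of them correct.
print(pass_at_k(256, 40, 1))   # 0.15625 (equals c / n for k = 1)
print(pass_at_k(256, 40, 64))  # close to 1.0
```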
Log-likelihood is another metric that can be used to rerank samples. We found that majority voting
performs significantly better than log-likelihood reranking (see Appendix E.2).
3 Results
Table 3 summarizes the results for Minerva models and other models, on the evaluation datasets
described in Section 2.3. Figure 3 presents a breakdown of the MATH dataset results by subtopic.
For MMLU evaluations, unless otherwise noted, performance is measured by using the standard
5-shot prompt per topic and picking the answer with the highest score. When evaluating MMLU with
majority voting, we sample k = 16 model answers using a chain-of-thought prompt.
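A minimal sketch of the highest-score selection rule is given below; `score_fn` is a hypothetical stand-in for a function returning the model's log-likelihood of a candidate answer given the 5-shot prompt, and only the argmax selection is illustrated.

```python
def pick_best_choice(score_fn, prompt, choices):
    """Select the multiple-choice option that the model scores highest.

    `score_fn(prompt, choice)` is a hypothetical stand-in for the model's
    log-likelihood of the candidate answer text given the few-shot prompt.
    """
    scores = [score_fn(prompt, choice) for choice in choices]
    best_index = max(range(len(choices)), key=scores.__getitem__)
    return choices[best_index]

# Usage with a toy scoring function:
toy_scores = {"(A)": -4.2, "(B)": -1.3, "(C)": -3.8, "(D)": -5.0}
print(pick_best_choice(lambda p, c: toy_scores[c], "prompt text", list(toy_scores)))  # "(B)"
```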
We present model output samples in Figures 1 and 2, and additional output samples are listed in the
Appendix. In addition, we evaluated Minerva 62B on the National Math Exam in Poland and found
that it achieves a score of 57%, which happened to be the national average in 2021 [45, p. 23]. The
540B model achieves 65%.
We include results on the latest publicly available language model from OpenAI, davinci-002,
evaluated using the OpenAI API with temperature set to the official recommendation (T = 0.2). The
combination of training data, scale, and inference techniques yields state-of-the-art results on all the
technical tasks that we considered. For all tasks except GSM8k, the improvement over previous results
is considerable.
While our main focus is on few-shot evaluation, we also tried finetuning Minerva on MATH; this did
not yield any improvement. Interestingly, however, finetuning PaLM on MATH did give a significant
improvement. Further details can be found in Appendix E.3.
4 Performance Analysis
4.1 Model Mistakes
To better understand the types of mistakes our models make, we compare the performance of
Minerva 8B and Minerva 62B on 216 problems for which both models produced high-confidence majority
decisions. Specifically, we selected examples where the top answer received at least 15% of votes, and
where either Minerva 8B was correct and Minerva 62B was incorrect (15 samples), or vice versa (201
samples). The categories and examples for each category are described in Appendix J.2.
As shown in Table 4, the prevailing errors of the 8B model were related to incorrect reasoning or
calculations. Many of the calculation errors were relatively benign arithmetic mistakes. Solutions
that were too short were relatively rare (in these cases, the model immediately produces an incorrect
answer without any intermediate reasoning steps). Finally, in a few cases, the model hallucinates an
equation or mathematical fact that is not real.
In the samples where the 62B model was incorrect, the dominating failure modes were again incorrect
reasoning and incorrect calculations. In summary, we find that the 62B Minerva model retains most
of the skills of the 8B model and improves upon both reasoning and calculation robustness.
Table 4: Failure modes of the 8B Minerva model, out of 201 samples which the 62B model solved correctly
and the 8B model did not.
4.2 False Positives
Our evaluation checks only the final answer and does not validate the model’s chain of reasoning. This leaves open the possibility of false positives: samples which have
the correct final answer, but for which the reasoning is incomplete or incorrect.
We selected 100 random questions from MATH (20 per difficulty level), along with answers sampled
at zero temperature from the 62B model. We then manually inspected the answers to determine
the false positive rate, i.e. the ratio between the number of false positive examples and the number of
examples for which the final answer is correct. For difficulty levels 1, 2, 3, 4, and 5, the false positive
rates are < 5%, 10%, < 5%, 15%, and 30%, respectively. Averaging these per-level false positive
rates (weighted by true positive rates), the overall false positive rate is estimated to be 8%. Although
this overall rate is low, we find that it does increase with difficulty level.
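Written out, this aggregation is a weighted average of the per-level rates (a sketch of the weighting described above, not an exact reproduction of the computation):
$$\mathrm{FPR}_{\mathrm{overall}} \;\approx\; \frac{\sum_{\ell=1}^{5} \mathrm{TPR}_\ell \cdot \mathrm{FPR}_\ell}{\sum_{\ell=1}^{5} \mathrm{TPR}_\ell} \;\approx\; 8\%,$$
where $\mathrm{TPR}_\ell$ and $\mathrm{FPR}_\ell$ denote the true positive and false positive rates at difficulty level $\ell$.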
Our focus on pass@1 and majority voting as the primary evaluation metrics is due in part to the fact
that they are less susceptible to false positives than pass@k [17]. While the pass@256 accuracy
is 84.5% for the 62B model, false positives account for part of it. We inspected the samples that
failed in majority voting but passed on pass@k due to a single correct answer, and estimate the
false positive rate for pass@256 to be 30% among samples selected in this way. After removing
false positives, we estimate the pass@256 accuracy to be greater than 68%; see Appendix J.3 for
details.
5 Memorization
A central question in interpreting Minerva’s solutions is whether performance reflects genuine analytic
capability or instead rote memorization. This is especially relevant as there has been much prior work
indicating that language models often memorize some fraction of their training data [46, 47, 48].
When examining model solutions, we find that memorization of intermediate facts, such as numerical
values of square roots or trigonometric identities, is a crucial element of model solutions. We would
like to investigate a strong form of memorization, where model performance is a result of memorizing
the explicit problems and solutions in our evaluation set, as well as a weaker form, where the model
has memorized alternate answers to the same questions.
In order to evaluate the degree to which our models solve problems by recalling information memo-
rized from training data, we conduct three analyses on the MATH dataset. First we directly search
for problems and solutions in our training corpus. Next, we generate modified versions of problems
and evaluate our models’ robustness to these changes. Finally, we measure the degree of overlap
between the ground truth solutions and solutions generated by our model and measure the effect of
this similarity on model performance. Overall, we find little evidence that the model’s performance
can be attributed to memorization.
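As an illustration of the third analysis, the following is a minimal sketch of measuring solution overlap with sentence-level BLEU. NLTK is used here as an assumed implementation; the exact BLEU variant and tokenization are not specified in the text above.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def solution_overlap(ground_truth: str, generated: str) -> float:
    """BLEU score of a generated solution against the ground-truth solution.

    Sketch using NLTK sentence-level BLEU with whitespace tokenization and
    smoothing; the actual implementation and tokenization may differ.
    """
    reference = [ground_truth.split()]
    hypothesis = generated.split()
    return sentence_bleu(reference, hypothesis,
                         smoothing_function=SmoothingFunction().method1)

# Samples with BLEU at or below a threshold are kept when recomputing
# majority-vote accuracy (Figure 4, right).
```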
Figure 4: Results indicating lack of memorization on MATH. Left, Center: Accuracy of original questions
from the MATH dataset and their modified versions. Each point represents a question. The x axis is accuracy on
the original question, and the y axis is accuracy on the modified one. Right: Majority vote accuracy, computed
only on samples with BLEU score to the ground truth solution less than or equal to the x-axis value.
A further limitation is that the model has no mechanism for automatically verifying the correctness of its outputs. If these issues could be solved, we expect the impacts of this model
to be broadly positive. A direct application could be an accessible and affordable math tutor, which
could help improve educational inequalities.
7 Acknowledgments
We thank David Andre, Jacob Austin, Maarten Bosma, Aakanksha Chowdhery, Sergey Ioffe, Colin
Raffel, Charles Sutton, Christian Szegedy, Stanislas Polu, and Harrison Edwards for helpful discus-
sions.
References
[1] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified
text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[3] Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song,
John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language
models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446,
2021.
[4] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari,
Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using
deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language
model. arXiv preprint arXiv:2201.11990, 2022.
[5] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker
Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes,
Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson,
Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin,
Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier
Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David
Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani
Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat,
Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei
Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy
Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language
modeling with pathways, 2022. URL https://arxiv.org/abs/2204.02311.
[6] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn
Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset,
2021. URL https://arxiv.org/abs/2103.03874.
[7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John
Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.
org/abs/2110.14168.
[8] Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta
Baral, and Ashwin Kalyan. Numglue: A suite of fundamental yet challenging mathematical
reasoning tasks, 2022. URL https://arxiv.org/abs/2204.05660.
[9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto,
Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul
Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke
Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad
Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias
Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex
Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain,
William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant
Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie
Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and
Wojciech Zaremba. Evaluating large language models trained on code, 2021. URL https:
//arxiv.org/abs/2107.03374.
[10] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David
Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large
language models. arXiv preprint arXiv:2108.07732, 2021.
[11] Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin
Liu, Linda Chen, Sunny Tran, Newman Cheng, Roman Wang, Nikhil Singh, Taylor L. Patti,
Jayson Lynch, Avi Shporer, Nakul Verma, Eugene Wu, and Gilbert Strang. A neural network
solves, explains, and generates university math problems by program synthesis and few-shot
learning at human level, 2021. URL https://arxiv.org/abs/2112.15594.
[12] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. CoRR, abs/2009.03300,
2020. URL https://arxiv.org/abs/2009.03300.
[13] Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Du-
mas Ang. Parsing algebraic word problems into equations. Transactions of the Association for
Computational Linguistics, 3, 2015.
[14] Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning
to solve arithmetic word problems with verb categorization. In EMNLP, pages 523–533, 2014.
[15] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin,
David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton,
and Augustus Odena. Show Your Work: Scratchpads for Intermediate Computation with
Language Models. arXiv e-prints, art. arXiv:2112.00114, November 2021.
[16] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny
Zhou. Chain of thought prompting elicits reasoning in large language models, 2022. URL
https://arxiv.org/abs/2201.11903.
[17] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond,
Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy,
Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl,
Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson,
Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level
code generation with alphacode, 2022. URL https://arxiv.org/abs/2203.07814.
[18] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha,
Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee,
Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun,
Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts,
Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch,
Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel
Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen,
Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina,
Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm,
Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-
Arcas, Claire Cui, Marian Croak, Ed Chi, and Quoc Le. Lamda: Language models for dialog
applications, 2022. URL https://arxiv.org/abs/2201.08239.
[19] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha
Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language
models, 2022. URL https://arxiv.org/abs/2203.11171.
[20] The Coq development team. The Coq reference manual, 2022. URL http://coq.inria.
fr. Version 8.15.
[21] Makarius Wenzel, Lawrence C. Paulson, and Tobias Nipkow. The isabelle framework. In
Otmane Ait Mohamed, César Muñoz, and Sofiène Tahar, editors, Theorem Proving in Higher
Order Logics, pages 33–38, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg. ISBN
978-3-540-71067-7.
[22] John Harrison. Hol light: A tutorial introduction. In Mandayam Srivas and Albert Camilleri,
editors, Formal Methods in Computer-Aided Design, pages 265–269, Berlin, Heidelberg, 1996.
Springer Berlin Heidelberg. ISBN 978-3-540-49567-3.
[23] Leonardo Mendonça de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn, and Jakob
von Raumer. The lean theorem prover (system description). In Amy P. Felty and Aart Mid-
deldorp, editors, CADE, volume 9195 of Lecture Notes in Computer Science, pages 378–388.
Springer, 2015. ISBN 978-3-319-21400-9. URL http://dblp.uni-trier.de/db/
conf/cade/cade2015.html#MouraKADR15.
[24] Norman D. Megill and David A. Wheeler. Metamath: A Computer Lan-
guage for Pure Mathematics. Lulu Press, Morrisville, North Carolina,
2019. URL http://us.metamath.org/downloads/metamath.pdf.
[25] Adam Grabowski, Artur Kornilowicz, and Adam Naumowicz. Mizar in a nutshell. J. Formalized
Reasoning, 3(2):153–245, 2010. doi: 10.6092/issn.1972-5787/1980. URL https://doi.
org/10.6092/issn.1972-5787/1980.
[26] Stephan Schulz. System Description: E 1.8. In Ken McMillan, Aart Middeldorp, and Andrei
Voronkov, editors, Proc. of the 19th LPAR, volume 8312 of LNCS. Springer, 2013.
[27] Jens Otten. leancop 2.0 and ileancop 1.2: High performance lean theorem proving in classical
and intuitionistic logic (system descriptions). In Alessandro Armando, Peter Baumgartner,
and Gilles Dowek, editors, Automated Reasoning, pages 283–291, Berlin, Heidelberg, 2008.
Springer Berlin Heidelberg. ISBN 978-3-540-71070-7.
[28] Laura Kovács and Andrei Voronkov. First-order theorem proving and Vampire. In CAV, 2013.
[29] Alexander A. Alemi, François Chollet, Niklas Een, Geoffrey Irving, Christian Szegedy, and
Josef Urban. Deepmath - Deep Sequence Models for Premise Selection. In Proceedings
of the 30th International Conference on Neural Information Processing Systems, NIPS’16,
pages 2243–2251, USA, 2016. Curran Associates Inc. ISBN 978-1-5108-3881-9. URL
http://dl.acm.org/citation.cfm?id=3157096.3157347.
[30] Zarathustra Amadeus Goertzel, Karel Chvalovský, Jan Jakubuv, Miroslav Olsák, and Josef
Urban. Fast and slow enigmas and parental guidance. CoRR, abs/2107.06750, 2021. URL
https://arxiv.org/abs/2107.06750.
[31] Wenda Li, Lei Yu, Yuhuai Wu, and Lawrence C. Paulson. Isarstep: a benchmark for high-level
mathematical reasoning. In ICLR, 2021.
[32] Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving.
arXiv preprint arXiv:2009.03393, 2020.
[33] Cezary Kaliszyk, Josef Urban, Henryk Michalewski, and Miroslav Olšák. Reinforcement
learning of theorem proving. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31.
Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/
2018/file/55acf8539596d25624059980986aaa78-Paper.pdf.
[34] Markus Norman Rabe, Dennis Lee, Kshitij Bansal, and Christian Szegedy. Mathematical
reasoning via self-supervised skip-tree training. 2021. URL https://openreview.net/
forum?id=YmqAnY0CMEy.
[35] Yuhuai Wu, Markus N. Rabe, Wenda Li, Jimmy Ba, Roger B. Grosse, and Christian Szegedy.
LIME: learning inductive bias for primitives of mathematical reasoning. In Marina Meila and
Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning,
ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning
Research, pages 11251–11262. PMLR, 2021. URL http://proceedings.mlr.press/
v139/wu21c.html.
[36] Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward Ayers, and Stanislas Polu. Proof artifact
co-training for theorem proving with language models. In International Conference on Learning
Representations, 2022. URL https://openreview.net/forum?id=rpxJc9j04U.
[37] Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya
Sutskever. Formal mathematics statement curriculum learning. CoRR, abs/2202.01344, 2022.
URL https://arxiv.org/abs/2202.01344.
[38] Albert Q. Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygózdz,
Piotr Milos, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding hammers to integrate language
models and automated theorem provers. CoRR, abs/2205.10893, 2022. doi: 10.48550/arXiv.
2205.10893. URL https://doi.org/10.48550/arXiv.2205.10893.
[39] Yuhuai Wu, Albert Q. Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and
Christian Szegedy. Autoformalization with large language models. CoRR, abs/2205.12615,
2022. doi: 10.48550/arXiv.2205.12615. URL https://doi.org/10.48550/arXiv.
2205.12615.
[40] Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, and Kyunghyun
Cho. Naturalproofs: Mathematical theorem proving in natural language. CoRR, abs/2104.01112,
2021. URL https://arxiv.org/abs/2104.01112.
[41] Lukas Pfahler and Katharina Morik. Semantic search in millions of equations, 2020.
[42] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural
text degeneration, 2019. URL https://arxiv.org/abs/1904.09751.
[43] Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev,
Matthew Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rath-
nayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam
Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Štěpán Roučka,
Ashutosh Saboo, Isuru Fernando, Sumith Kulal, Robert Cimrman, and Anthony Scopatz. Sympy:
symbolic computing in python. PeerJ Computer Science, 3:e103, January 2017. ISSN 2376-
5992. doi: 10.7717/peerj-cs.103. URL https://doi.org/10.7717/peerj-cs.103.
[44] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom
Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia
Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent
Sifre. Training compute-optimal large language models, 2022. URL https://arxiv.org/
abs/2203.15556.
[45] CKE. Skale centylowe wyników – matura 2021 [Percentile scores of results – matura 2021], 2021. URL
https://g.infor.pl/p/_files/37182000/1-matura-2021-centyle-37181904.pdf.
[46] Trieu H Trinh and Quoc V Le. A simple method for commonsense reasoning. arXiv preprint
arXiv:1806.02847, 2018.
[47] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[48] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and
Chiyuan Zhang. Quantifying memorization across neural language models, 2022. URL
https://arxiv.org/abs/2202.07646.
[49] Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel
Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor
Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini
Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis
Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan
Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten
Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan
Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, and Andrea Gesmundo. Scaling
up models and data with t5x and seqio, 2022. URL https://arxiv.org/abs/2203.
17189.
[50] Abhilasha Ravichander, Aakanksha Naik, Carolyn Rose, and Eduard Hovy. Equate: A bench-
mark evaluation framework for quantitative reasoning in natural language inference, 2019. URL
https://arxiv.org/abs/1901.03735.
Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Section 6
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See
Section 6
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experi-
mental results (either in the supplemental material or as a URL)? [No] The code and
models are proprietary.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes]
(c) Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [No] We only run the experiments once. This is the standard
approach for work with large language models because of their significant compute
requirements.
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes] OCWCourses will be released with
license CC BY-NC-SA 4.0. [No] The mathematical webpages dataset is proprietary.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
OCWCourses questions are part of supplementary material
(d) Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [Yes] We share the main instructions given to contractors.
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [No] We hired a dedicated workforce for this
task rather than using a crowdsourcing service. As such, our approach to hiring and
compensation is proprietary.