Mathematical Capabilities of ChatGPT
4 Faculty of Mathematics, University of Vienna, Vienna, Austria
5 Research Network Data Science, University of Vienna, Vienna, Austria
6 School of Mathematics, Institute for Advanced Study, Princeton, US
February 1, 2023
Abstract
We investigate the mathematical capabilities of ChatGPT by testing it on publicly available datasets, as
well as hand-crafted ones, and measuring its performance against other models trained on a mathematical
corpus, such as Minerva. We also test whether ChatGPT can be a useful assistant to professional
mathematicians by emulating various use cases that arise in their daily work (question answering,
theorem searching). In contrast to formal mathematics, where
large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of
natural-language mathematics, used to benchmark language models, only cover elementary mathematics.
We address this issue by introducing a new dataset: GHOSTS. It is the first natural-language dataset made
and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics
and (2) provides a holistic overview of the mathematical capabilities of language models. We benchmark
ChatGPT on GHOSTS and evaluate its performance against fine-grained criteria. We make this new
dataset publicly available1 to facilitate a community-driven comparison of ChatGPT with (future) large
language models in terms of advanced mathematical comprehension. We conclude that, contrary to many
positive reports in the media (a potential case of selection bias), ChatGPT’s mathematical abilities are
significantly below those of an average mathematics graduate student. Our results show that ChatGPT
often understands the question but fails to provide correct solutions. Hence, if your goal is to use it to
pass a university exam, you would be better off copying from your average peer!
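At a very high level, the evaluation workflow described above can be summarized by the following Python sketch. It is illustrative only: the file layout, the field name "prompt", and the query_model helper are assumptions made for this sketch rather than the actual interface of the GHOSTS repository, and the grading itself is carried out by human raters against the fine-grained criteria.

import json
from pathlib import Path

def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to the language model under test."""
    raise NotImplementedError

def run_benchmark(dataset_dir: str) -> list:
    """Collect model answers for every (hypothetical) prompt file in `dataset_dir`."""
    records = []
    for path in sorted(Path(dataset_dir).glob("*.json")):
        item = json.loads(path.read_text())
        answer = query_model(item["prompt"])
        # Human raters would then score `answer` against the fine-grained
        # criteria before aggregating results per subdataset.
        records.append({"file": path.name, "prompt": item["prompt"], "answer": answer})
    return records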
1 Introduction
Since its introduction, ChatGPT has rapidly become a widely known question-and-answer dialogue system.
It has been mentioned in traditional media across the globe [33, 28, 22] and across all major internet
platforms [40, 43]. According to Twitter data, it is by far the most talked-about language model to date; cf.
Figure 1.
The performance of ChatGPT has been analyzed in a large number of exam-related use cases, with varying
degrees of scientific rigor, ranging from detailed studies to anecdotal evidence. Use cases include passing the
United States Medical Licensing Examination [17], scoring highly on the Psychology Today Verbal-Linguistic
Intelligence IQ Test [34], and answering (and generating) Operations Management exam questions that were
∗ Corresponding author: [email protected]. The subsequent author list is ordered randomly.
1 github.com/friederrr/science-GHOSTS