Mathematical Capabilities of ChatGPT
4 Faculty of Mathematics, University of Vienna, Vienna, Austria
5 Research Network Data Science, University of Vienna, Vienna, Austria
6 School of Mathematics, Institute for Advanced Study, Princeton, US
February 1, 2023
Abstract
We investigate the mathematical capabilities of ChatGPT by testing it on publicly available datasets, as
well as hand-crafted ones, and measuring its performance against other models trained on a mathematical
corpus, such as Minerva. We also test whether ChatGPT can be a useful assistant to professional
mathematicians by emulating various use cases that arise in their daily work (question answering,
theorem searching). In contrast to formal mathematics, where
large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of
natural-language mathematics, used to benchmark language models, only cover elementary mathematics.
We address this issue by introducing a new dataset: GHOSTS. It is the first natural-language dataset made
and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics
and (2) provides a holistic overview of the mathematical capabilities of language models. We benchmark
ChatGPT on GHOSTS and evaluate its performance against fine-grained criteria. We make this new
dataset publicly available1 to facilitate a community-driven comparison of ChatGPT with (future) large
language models in terms of advanced mathematical comprehension. We conclude that, contrary to many
positive reports in the media (a potential case of selection bias), ChatGPT’s mathematical abilities are
significantly below those of an average mathematics graduate student. Our results show that ChatGPT
often understands the question but fails to provide correct solutions. Hence, if your goal is to use it to
pass a university exam, you would be better off copying from your average peer!
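At a very high level, the evaluation workflow described above can be summarized by the following Python sketch. It is illustrative only: the file layout, the field name "prompt", and the query_model helper are assumptions made for this sketch rather than the actual interface of the GHOSTS repository, and the grading itself is carried out by human raters against the fine-grained criteria.

import json
from pathlib import Path

def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to the language model under test."""
    raise NotImplementedError

def run_benchmark(dataset_dir: str) -> list:
    """Collect model answers for every (hypothetical) prompt file in `dataset_dir`."""
    records = []
    for path in sorted(Path(dataset_dir).glob("*.json")):
        item = json.loads(path.read_text())
        answer = query_model(item["prompt"])
        # Human raters would then score `answer` against the fine-grained
        # criteria before aggregating results per subdataset.
        records.append({"file": path.name, "prompt": item["prompt"], "answer": answer})
    return records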
1 Introduction
Since its introduction, ChatGPT has rapidly become a widely known question-and-answer dialogue system.
It has been mentioned in traditional media across the globe [33, 28, 22] and across all major internet
platforms [40, 43]. According to Twitter data, it is by far the most talked-about language model to date; cf.
Figure 1.
The performance of ChatGPT has been analyzed in a large number of exam-related use cases, with varying
degrees of scientific rigor, ranging from detailed studies to anecdotal evidence. Use cases include passing the
United States Medical Licensing Examination [17], scoring highly on the Psychology Today Verbal-Linguistic
Intelligence IQ Test [34], and answering (and generating) Operations Management exam questions that were
∗ Corresponding author: [email protected]. The subsequent author list is ordered randomly.
1 github.com/friederrr/science-GHOSTS