Solving olympiad geometry without human demonstrations
https://doi.org/10.1038/s41586-023-06747-5 Trieu H. Trinh1,2 ✉, Yuhuai Wu1, Quoc V. Le1, He He2 & Thang Luong1 ✉
Accepted: 13 October 2023
Published online: 17 January 2024
Open access

Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning1–4, owing to their reputed difficulty among the world's best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges1,5, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004.
Proving theorems showcases the mastery of logical reasoning and the ability to search through an infinitely large space of actions towards a target, signifying a remarkable problem-solving skill. Since the 1950s (refs. 6,7), the pursuit of better theorem-proving capabilities has been a constant focus of artificial intelligence (AI) research8. Mathematical olympiads are the most reputed theorem-proving competitions in the world, with a similarly long history dating back to 1959, playing an instrumental role in identifying exceptional talents in problem solving. Matching top human performances at the olympiad level has become a notable milestone of AI research2–4.

Theorem proving is difficult for learning-based methods because training data of human proofs translated into machine-verifiable languages are scarce in most mathematical domains. Geometry stands out among other olympiad domains because it has very few proof examples in general-purpose mathematical languages such as Lean9, owing to translation difficulties unique to geometry1,5. Geometry-specific languages, on the other hand, are narrowly defined and thus unable to express many human proofs that use tools beyond the scope of geometry, such as complex numbers (Extended Data Figs. 3 and 4). Overall, this creates a data bottleneck, causing geometry to lag behind in recent progress that uses human demonstrations2–4. Current approaches to geometry, therefore, still primarily rely on symbolic methods and human-designed, hard-coded search heuristics10–14.

We present an alternative method for theorem proving using synthetic data, thus sidestepping the need for translating human-provided proof examples. We focus on Euclidean plane geometry and exclude topics such as geometric inequalities and combinatorial geometry.

By using existing symbolic engines on a diverse set of random theorem premises, we extracted 100 million synthetic theorems and their proofs, many with more than 200 proof steps, four times longer than the average proof length of olympiad theorems. We further define and use the concept of dependency difference in synthetic proof generation, allowing our method to produce nearly 10 million synthetic proof steps that construct auxiliary points, reaching beyond the scope of pure symbolic deduction. Auxiliary construction is geometry's instance of exogenous term generation, representing the infinite branching factor of theorem proving, and is widely recognized in other mathematical domains as the key challenge to proving many hard theorems1,2. Our work therefore demonstrates a successful case of generating synthetic data and learning to solve this key challenge. With this solution, we present a general guiding framework and discuss its applicability to other domains in Methods section 'AlphaGeometry framework and applicability to other domains'.

We pretrain a language model on all generated synthetic data and fine-tune it to focus on auxiliary construction during proof search, delegating all deduction proof steps to specialized symbolic engines. This follows standard settings in the literature, in which language models such as GPT-f (ref. 15), after being trained on human proof examples, can generate exogenous proof terms as inputs to fast and accurate symbolic engines such as nlinarith or ring2,3,16, using the best of both worlds.
1Google DeepMind, Mountain View, CA, USA. 2Computer Science Department, New York University, New York, NY, USA. ✉e-mail: [email protected]; [email protected]
Fig. 1 | Overview of our neuro-symbolic AlphaGeometry and how it solves both a simple problem and the IMO 2015 Problem 3. The top row shows how AlphaGeometry solves a simple problem. a, The simple example and its diagram. b, AlphaGeometry initiates the proof search by running the symbolic deduction engine. The engine exhaustively deduces new statements from the theorem premises until the theorem is proven or new statements are exhausted. c, Because the symbolic engine fails to find a proof, the language model constructs one auxiliary point, growing the proof state before the symbolic engine retries. The loop continues until a solution is found. d, For the simple example, the loop terminates after the first auxiliary construction "D as the midpoint of BC". The proof consists of two other steps, both of which make use of the midpoint properties: "BD = DC" and "B, D, C are collinear", highlighted in blue. The bottom row shows how AlphaGeometry solves the IMO 2015 Problem 3 (IMO 2015 P3). e, The IMO 2015 P3 problem statement and diagram. f, The solution of IMO 2015 P3 has three auxiliary points. In both solutions, we arrange language model outputs (blue) interleaved with symbolic engine outputs to reflect their execution order. Note that the proof for IMO 2015 P3 in f is greatly shortened and edited for illustration purposes. Its full version is in the Supplementary Information.
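The deduce-construct-retry loop described in panels b-d can be sketched in a few lines. This is our illustrative sketch, not the released implementation: Horn-clause forward chaining stands in for the DD + AR engine, and `propose_aux` stands in for the fine-tuned language model.

```python
# Illustrative sketch of the AlphaGeometry proof loop (Fig. 1b-d), not the
# released code. Rules are (frozenset_of_premises, conclusion) pairs; the
# real system uses the DD + AR engine and a transformer in these roles.

def closure(facts, rules):
    """Forward-chain Horn rules to a fixpoint, returning all deducible facts."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and body <= known:
                known.add(head)
                changed = True
    return known

def prove(premises, goal, rules, propose_aux, max_constructions=10):
    """Alternate symbolic deduction with auxiliary constructions.

    `propose_aux(state, goal)` plays the role of the language model: it
    returns one new construction (a fact string) or None. Returns the
    final fact set containing `goal`, or None on failure.
    """
    state = set(premises)
    for _ in range(max_constructions + 1):
        facts = closure(state, rules)       # exhaustive deduction (Fig. 1b)
        if goal in facts:
            return facts                    # the proof state reaches the goal
        aux = propose_aux(state, goal)      # one auxiliary point (Fig. 1c)
        if aux is None or aux in state:
            return None                     # nothing new to try
        state.add(aux)                      # grow the proof state and retry
    return None
```

With a toy rule set encoding the midpoint properties from panel d, constructing "D as the midpoint of BC" is exactly what unlocks the goal.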
to the symbolic deduction engine to generate its derivations. A full list of actions used for this sampling can be found in Extended Data Table 1. In our work, we sampled nearly 1 billion such premises in a highly parallelized setting, described in Methods. Note that we do not make use of any existing theorem premises from human-designed problem sets and sampled the eligible constructions uniformly at random.

Next we use a symbolic deduction engine on the sampled premises. The engine quickly deduces new true statements by following forward inference rules, as shown in Fig. 3b. This returns a directed acyclic graph of all reachable conclusions. Each node in the directed acyclic graph is a reachable conclusion, with edges connecting to its parent nodes thanks to the traceback algorithm described in Methods. This allows a traceback process to run recursively starting from any node N, at the end returning its dependency subgraph G(N), with its root being N and its leaves being a subset of the sampled premises. Denoting this subset as P, we obtained a synthetic training example (premises, conclusion, proof) = (P, N, G(N)).

In geometry, the symbolic deduction engine is a deduction database (DD) (refs. 10,17), with the ability to efficiently deduce new statements from the premises by means of geometric rules. DD follows deduction rules in the form of definite Horn clauses, that is, Q(x) ← P1(x), …, Pk(x), in which x are point objects, whereas P1, …, Pk and Q are predicates such as 'equal segments' or 'collinear'.

[Fig. 2 bar chart: number of IMO-AG-30 problems solved by the previous state of the art (Wu's method), honorable mentions, bronze, silver and gold medallists, and AlphaGeometry, with the average IMO contestant marked.]

Fig. 2 | AlphaGeometry advances the current state of geometry theorem prover from below human level to near gold-medallist level. The test benchmark includes official IMO problems from 2000 to the present that can be represented in the geometry environment used in our work. Human performance is estimated by rescaling their IMO contest scores between 0 and 7 to between 0 and 1, to match the binary outcome of failure/success of the machines. For example, a contestant's score of 4 out of 7 will be scaled to 0.57 problems in this comparison. On the other hand, the score for AlphaGeometry and other machine solvers on any problem is either 0 (not solved) or 1 (solved). Note that this is only an approximate comparison with humans on classical geometry, who operate on natural-language statements rather than narrow, domain-specific translations. Further, the general IMO contest also includes other types of problem, such as geometric inequality or combinatorial geometry, and other domains of mathematics, such as algebra, number theory and combinatorics.

A full list of deduction rules
[Fig. 3 diagrams: sampled constructions and deduced statements such as cyclic(E,A,D,H), ∠EAH = ∠EDH, ∠EDH = ∠ECB, cyclic(E,B,C,D), EC ⊥ EA and HA ⊥ BC.]
Fig. 3 | AlphaGeometry synthetic-data-generation process. a, We first sample a large set of random theorem premises. b, We use the symbolic deduction engine to obtain a deduction closure. This returns a directed acyclic graph of statements. For each node in the graph, we perform traceback to find its minimal set of necessary premises and dependency deductions. For example, for the rightmost node 'HA ⊥ BC', traceback returns the green subgraph. c, The minimal premise and the corresponding subgraph constitute a synthetic problem and its solution. In the bottom example, points E and D took part in the proof despite being irrelevant to the construction of HA and BC; therefore, they are learned by the language model as auxiliary constructions.
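The pipeline in this caption can be condensed into a toy sketch (ours, for illustration only; the paper's engine operates on geometric predicates and is far richer). `deduction_closure` builds the DAG of panel b, and `traceback` recovers the minimal premises P and dependency subgraph G(N) of panel c.

```python
# Illustrative sketch of the synthetic-data pipeline in Fig. 3, not the
# paper's code: deduce a closure DAG from sampled premises, then trace
# back from each conclusion N to its minimal premises P and subproof G(N).

def deduction_closure(premises, rules):
    """Forward-chain rules (frozenset_body, head); record each fact's parents."""
    parents = {p: frozenset() for p in premises}   # premises have no parents
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in parents and body <= parents.keys():
                parents[head] = frozenset(body)    # edges of the DAG
                changed = True
    return parents

def traceback(node, parents):
    """Return (minimal premises P, dependency subgraph G) for `node`."""
    subgraph, leaves, stack = {}, set(), [node]
    while stack:
        fact = stack.pop()
        if fact in subgraph or fact in leaves:
            continue
        deps = parents[fact]
        if deps:                                   # a derived fact
            subgraph[fact] = deps
            stack.extend(deps)
        else:                                      # a sampled premise
            leaves.add(fact)
    return leaves, subgraph

def synthetic_examples(premises, rules):
    """Yield (premises, conclusion, proof) = (P, N, G(N)) training triples."""
    parents = deduction_closure(premises, rules)
    for n in parents:
        if parents[n]:                             # skip bare premises
            p, g = traceback(n, parents)
            yield p, n, g
```

Note how a sampled premise that the proof never touches is absent from the returned leaves; this pruning is what later enables the generalization shown in Fig. 5.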
can be found in ref. 10. To widen the scope of the generated synthetic theorems and proofs, we also introduce another component to the symbolic engine that can deduce new statements through algebraic rules (AR), as described in Methods. AR is necessary to perform angle, ratio and distance chasing, as often required in many olympiad-level proofs. We included concrete examples of AR in Extended Data Table 2. The combination DD + AR, which includes both their forward deduction and traceback algorithms, is a new contribution in our work and represents a new state of the art in symbolic reasoning in geometry.

Training a language model on synthetic data
The transformer18 language model is a powerful deep neural network that learns to generate text sequences through next-token prediction, powering substantial advances in generative AI technology. We serialize (P, N, G(N)) into a text string with the structure '<premises><conclusion><proof>'. By training on such sequences of symbols, a language model effectively learns to generate the proof, conditioning on theorem premises and conclusion.
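The serialization step can be as simple as the following sketch. The delimiter placement and step format here are our assumptions; the paper specifies only the '<premises><conclusion><proof>' structure.

```python
# Hypothetical serialization of a (P, N, G(N)) triple into one training
# string. The exact tokenization in the paper may differ.

def serialize(premises, conclusion, proof_steps):
    """premises: iterable of facts; proof_steps: ordered (fact, parents) pairs."""
    ptext = "; ".join(sorted(premises))
    steps = "; ".join(
        f"{', '.join(sorted(deps))} => {fact}" for fact, deps in proof_steps
    )
    return f"<premises> {ptext} <conclusion> {conclusion} <proof> {steps}"
```

Training on many such strings teaches the model to emit the proof segment conditioned on everything before the `<proof>` tag.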
[Fig. 4, left: histogram of synthetic proof lengths (count on a log scale); markers indicate where the AlphaGeometry proofs of IMO 2015 P3 (length 112, roughly the 0.05% tail of the data) and IMO 2019 P6 (length 187, roughly the 0.001% tail) fall. Right: diagrams of two example synthetic theorems.]
Fig. 4 | Analysis of the generated synthetic data. Of the generated synthetic proofs, 9% contain auxiliary constructions. Only roughly 0.05% of the synthetic training proofs are longer than the average AlphaGeometry proof for the test-set problems. The most complex synthetic proof has an impressive length of 247 with two auxiliary constructions. Most synthetic theorem premises tend not to be symmetrical like human-discovered theorems, as they are not biased towards any aesthetic standard.
focuses on a methodology for theorem proving. For this reason, we adapted geometry problems from the IMO competitions since 2000 to a narrower, specialized environment for classical geometry used in interactive graphical proof assistants13,17,19, as discussed in Methods. Among all non-combinatorial geometry-related problems, 75% can be represented, resulting in a test set of 30 classical geometry problems. Geometric inequality and combinatorial geometry, for example, cannot be translated, as their formulation is markedly different to classical geometry. We include the full list of statements and translations for all 30 problems in the Supplementary Information. The final test set is named IMO-AG-30, highlighting its source, method of translation and its current size.

Geometry theorem prover baselines
Geometry theorem provers in the literature fall into two categories. The first category is computer algebra methods, which treat geometry statements as polynomial equations of their point coordinates. Proving is accomplished with specialized transformations of large polynomials. Gröbner bases20 and Wu's method21 are representative approaches in this category, with theoretical guarantees to successfully decide the truth value of all geometry theorems in IMO-AG-30, albeit without a human-readable proof. Because these methods often have large time and memory complexity, especially when processing IMO-sized problems, we report their result by assigning success to any problem that can be decided within 48 h using one of their existing implementations17.

AlphaGeometry belongs to the second category of solvers, often described as search/axiomatic or sometimes 'synthetic' methods. These methods treat the problem of theorem proving as a step-by-step search problem using a set of geometry axioms. Thanks to this, they typically return highly interpretable proofs accessible to human readers. Baselines in this category generally include symbolic engines equipped with human-designed heuristics. For example, Chou et al. provided 18 heuristics such as "If OA ⊥ OB and OA = OB, construct C on the opposite ray of OA such that OC = OA", besides 75 deduction rules for the symbolic engine. Large language models22–24 such as GPT-4 (ref. 25) can be considered to be in this category. Large language models have demonstrated remarkable reasoning ability on a variety of reasoning tasks26–29. When producing full natural-language proofs on IMO-AG-30, however, GPT-4 has a success rate of 0%, often making syntactic and semantic errors throughout its outputs, showing little understanding of geometry knowledge and of the problem statements themselves. Note that the performance of GPT-4 on IMO problems can also be contaminated by public solutions in its training data. A better GPT-4 performance is therefore still not comparable with other solvers. In general, search methods have no theoretical guarantee in their proving performance and are known to be weaker than computer algebra methods13.

Synthetic data generation rediscovers known theorems and beyond
We find that our synthetic data generation can rediscover some fairly complex theorems and lemmas known to the geometry literature, as shown in Fig. 4, despite starting from randomly sampled theorem premises.
[Fig. 5 diagrams: the translate, solve, traceback and generalize steps for IMO 2004 P1; the legend distinguishes unused premises, used premises, neural net output and symbolic solver output.]
Fig. 5 | AlphaGeometry discovers a more general theorem than the translated IMO 2004 P1. Left, top to bottom, the IMO 2004 P1 stated in natural language, its translated statement and the AlphaGeometry solution. Thanks to the traceback algorithm necessary to extract the minimal premises, AlphaGeometry identifies a premise unnecessary for the proof to work: O does not have to be the midpoint of BC for P, B, C to be collinear. Right, top, the original theorem diagram; bottom, the generalized theorem diagram, in which O is freed from its midpoint position and P still stays on line BC. Note that the original problem requires P to be between B and C, a condition that the generalized theorem and solution do not guarantee.
[Fig. 6 scatter plot, 'Difficulty of IMO problems for AlphaGeometry versus humans': AlphaGeometry proof length (up to about 200) per problem, distinguishing problems that need construction (for example, 2019 P6) from deduction-only ones.]

Hard problems are reflected in AlphaGeometry proof length
Figure 6 measures the difficulty of solved problems using public scores of human contestants at the IMO and plots them against the corresponding AlphaGeometry proof length.
Extended Data Fig. 2 | Side-by-side comparison of AlphaGeometry proof versus human proof on the translated IMO 2004 P1. Both the AlphaGeometry and human solutions recognize the axis of symmetry between M and N through O. AlphaGeometry constructs point K to materialize this axis, whereas humans simply use the existing point R for the same purpose. This is a case in which proof pruning itself cannot remove K and a sign of similar redundancy in our synthetic data. To prove five-point concyclicity, AlphaGeometry outputs very lengthy, low-level steps, whereas humans use a high-level insight (OR is the symmetrical axis of both LN and AM) to obtain a broad set of conclusions all at once. For algebraic deductions, AlphaGeometry cannot flesh out its intermediate derivations, which are implicitly carried out by Gaussian elimination, therefore leading to low readability. Overall, this comparison points to the use of higher-level tools to improve the synthetic data, proof search and readability of AlphaGeometry. Note that in the original IMO 2004 P1, the point P is proven to be between B and C. The generalized version needs further constraints on the position of O to satisfy this betweenness requirement.
Extended Data Fig. 3 | Side-by-side comparison of human proof and AlphaGeometry proof for the IMO 2000 P6. This is a harder problem (average human score = 1.05/7), with a large number of objects in the problem statements, resulting in a very crowded diagram. Left, the human solution uses complex numbers. With a well-chosen coordinate system, the problem is greatly simplified and a solution follows naturally through algebraic manipulation. Right, the AlphaGeometry solution involves two auxiliary constructions and more than 100 deduction steps, with many low-level steps that are extremely tedious to a human reader. This is a case in which the search-based solution is much less readable and much less intuitive than coordinate bashing. A more structural organization, that is, a high-level proof outline, can improve readability of the AlphaGeometry solution substantially. Again, this suggests building into AlphaGeometry many higher-level deduction rules to encapsulate large groups of low-level deductions into fewer proof steps.
Extended Data Fig. 4 | Side-by-side comparison of human proof and AlphaGeometry proof for the IMO 2019 P2. This is one out of five unsolved problems by AlphaGeometry. Left, the human solution uses both auxiliary constructions and barycentric coordinates. With a well-chosen coordinate system, a solution becomes available through advanced algebraic manipulation. Right, the AlphaGeometry solution when provided with the ground-truth auxiliary construction for a synthetic proof. This auxiliary construction can be found quickly with the knowledge of Reim's theorem, which is not included in the deduction rule list used by the symbolic engine during synthetic data generation. Including such high-level theorems in the synthetic data generation can greatly improve the coverage of synthetic data and thus improve auxiliary construction capability. Further, higher-level steps using Reim's theorem also cut down the current proof length by a factor of 3.
Extended Data Fig. 5 | Human proof for the IMO 2008 P6. This is an unsolved problem by AlphaGeometry and also the hardest one among all 30 problems, with an average human score of only 0.28/7. This human proof uses four auxiliary constructions (diameters of circles W1 and W2) and high-level theorems such as the Pitot theorem and the notion of homothety. These high-level concepts are not available to our current version of the symbolic deduction engine both during synthetic data generation and proof search. Supplying AlphaGeometry with the auxiliary constructions used in this human proof also does not yield any solution. There is also no guarantee that a synthetic solution exists for AlphaGeometry, across all possible auxiliary constructions, without enhancing its symbolic deduction with more powerful rules. Again, this suggests that enhancing the symbolic engine with more powerful tools that IMO contestants are trained to use can improve both the synthetic data and the test-time performance of AlphaGeometry.
Extended Data Fig. 6 | Analysis of AlphaGeometry performance under changes made to its training and testing. a, The effect of reducing training data on AlphaGeometry performance. At 20% of training data, AlphaGeometry still solves 21 problems, outperforming all other baselines. b, Evaluation on a larger set of 231 geometry problems, covering a diverse range of sources outside IMO competitions. The rankings of different machine solvers stay the same as in Table 1, with AlphaGeometry solving almost all problems. c, The effect of reducing beam size during test time on AlphaGeometry performance. At beam size 8, that is, a 64 times reduction from its full setting, AlphaGeometry still solves 21 problems, outperforming all other baselines. d, The effect of reducing search depth on AlphaGeometry performance. At depth 2, AlphaGeometry still solves 21 problems, outperforming all other baselines.
Extended Data Table 1 | List of actions to construct the random premises
These actions include constructions that create new points related to others in a certain way (for example, collinear, incentre/excentre) and constructions that take a number as a parameter.
Extended Data Table 2 | Three examples of algebraic reasoning (AR) in geometry theorem proving, with AR proof steps
between the two tags <AR></AR>
In AlphaGeometry, the engine AR can execute all three examples efficiently, under a unified procedure of Gaussian elimination.
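As an illustration of that unified procedure (ours, not the paper's code), angle-chasing facts can be encoded as rows of a linear system over angle variables; a target equality is then deducible exactly when it reduces to the zero row against the eliminated basis.

```python
# Toy linear engine illustrating AR-style chasing via Gaussian elimination.
# An equation sum(coeffs[i] * x_i) = const is a row [coeffs..., const].
# The real AR engine works over geometric predicates; this is a sketch.

from fractions import Fraction

def _reduce(vec, basis):
    """Reduce `vec` against `basis`, a list of (row, lead-column) pairs."""
    v = vec[:]
    for row, lead in basis:
        if v[lead] != 0:
            f = v[lead]                      # row[lead] is normalized to 1
            v = [a - f * b for a, b in zip(v, row)]
    return v

def add_equation(basis, coeffs, const):
    """Eliminate a new equation into the basis (exact rational arithmetic)."""
    v = _reduce([Fraction(c) for c in coeffs] + [Fraction(const)], basis)
    for i, a in enumerate(v[:-1]):           # last entry is the constant
        if a != 0:
            basis.append(([x / a for x in v], i))   # normalize lead to 1
            return                           # redundant rows are dropped

def implied(equations, target_coeffs, target_const):
    """True if the target equation is a linear consequence of `equations`."""
    basis = []
    for coeffs, const in equations:
        add_equation(basis, coeffs, const)
    t = _reduce([Fraction(c) for c in target_coeffs] + [Fraction(target_const)],
                basis)
    return all(a == 0 for a in t)
```

For example, with x0 = ∠EAH, x1 = ∠EDH and x2 = ∠ECB, the facts x0 = x1 and x1 = x2 imply x0 = x2 by elimination, which is precisely transitive angle chasing.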
Extended Data Table 3 | Examples of auxiliary constructions in four different domains
In these examples, the construction is key to the proof, whereas the remaining proof is relatively more mechanical. In AlphaGeometry, the mechanical portion is efficiently handled by the
symbolic engine DD + AR.
Extended Data Table 4 | A comparison between a geometry proof and an IMO inequality proof through the lens of the
AlphaGeometry framework
We assume AM-GM to be a symbolic engine capable of (1) algebraic rewrites and simplification and (2) applying the inequality rule of arithmetic means–geometric means. With the original
premises, directly applying AM-GM fails to deliver a solution, which is similar to the geometry example, for which DD + AR fails to solve the simple problem. Some correct auxiliary constructions
are necessary for both symbolic engines (DD + AR in the case of geometry and AM-GM in the case of inequality) to succeed, as shown in the last two rows of the table. Note that there are ten
more common inequalities typically used at mathematical olympiads besides AM-GM, just as DD + AR itself encapsulates more than 50 different deduction rules for geometry commonly used at
the olympiads.
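To make the analogy concrete, here is a deliberately bare-bones 'AM-GM engine' of our own invention, not taken from the table: it only knows how to apply a + b ≥ 2√(ab) to chosen term pairs, so success hinges on first constructing the right split of the expression, just as DD + AR hinges on the right auxiliary point.

```python
# Hypothetical minimal 'AM-GM engine': lower-bound a sum of positive terms
# by applying a + b >= 2*sqrt(a*b) to selected index pairs. Which pairing
# (the 'auxiliary construction') is chosen determines whether a useful
# bound is obtained at all.

import math

def am_gm_lower_bound(terms, pairs):
    """Bound sum(terms) from below; unpaired terms pass through unchanged."""
    paired = {i for pair in pairs for i in pair}
    bound = sum(2 * math.sqrt(terms[i] * terms[j]) for i, j in pairs)
    bound += sum(t for k, t in enumerate(terms) if k not in paired)
    return bound
```

For example, bounding x + 4/x requires the auxiliary split into the pair (x, 4/x); the engine then returns the x-independent bound 4, whereas with no pairing it returns only the trivial bound equal to the sum itself.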