
Article

Faster sorting algorithms discovered using deep reinforcement learning

https://doi.org/10.1038/s41586-023-06004-9

Received: 25 July 2022
Accepted: 23 March 2023
Published online: 7 June 2023
Open access

Daniel J. Mankowitz1,3 ✉, Andrea Michi1,3, Anton Zhernov1,3, Marco Gelmi1,3, Marco Selvi1,3, Cosmin Paduraru1,3, Edouard Leurent1,3, Shariq Iqbal1, Jean-Baptiste Lespiau1, Alex Ahern1, Thomas Köppe1, Kevin Millikin1, Stephen Gaffney1, Sophie Elster1, Jackson Broshear1, Chris Gamble1, Kieran Milan1, Robert Tung1, Minjae Hwang2, Taylan Cemgil1, Mohammadamin Barekatain1, Yujia Li1, Amol Mandhane1, Thomas Hubert1, Julian Schrittwieser1, Demis Hassabis1, Pushmeet Kohli1, Martin Riedmiller1, Oriol Vinyals1 & David Silver1

Fundamental algorithms such as sorting or hashing are used trillions of times on any
given day1. As demand for computation grows, it has become critical for these
algorithms to be as performant as possible. Whereas remarkable progress has been
achieved in the past2, making further improvements on the efficiency of these
routines has proved challenging for both human scientists and computational
approaches. Here we show how artificial intelligence can go beyond the current state
of the art by discovering hitherto unknown routines. To realize this, we formulated the
task of finding a better sorting routine as a single-player game. We then trained a new
deep reinforcement learning agent, AlphaDev, to play this game. AlphaDev
discovered small sorting algorithms from scratch that outperformed previously
known human benchmarks. These algorithms have been integrated into the LLVM
standard C++ sort library3. This change to this part of the sort library represents the
replacement of a component with an algorithm that was automatically
discovered using reinforcement learning. We also present results in extra domains,
showcasing the generality of the approach.

Human intuition and know-how have been crucial in improving algorithms. However, many algorithms have reached a stage whereby human experts have not been able to optimize them further, leading to an ever-growing computational bottleneck. The work in classical program synthesis literature, spanning many decades, aims to generate correct programs and/or optimize programs using proxies for latency. These include enumerative search techniques4–7 and stochastic search5,6,8–10 as well as the more recent trend of using deep learning in program synthesis for generating correct programs11–16. Using deep reinforcement learning (DRL), we can take this a step further by generating correct and performant algorithms by optimizing for actual measured latency at the CPU instruction level, by more efficiently searching and considering the space of correct and fast programs compared to previous work.

One of the fundamental questions in computer science is how to sort a sequence17–20. This is taught in elementary computer science classes around the world21,22 and is used ubiquitously by a vast range of applications23–25. Decades of computer science research have focused on discovering and optimizing sorting algorithms26–28. A key component of practical solutions is a small sort over a short sequence of elements; this algorithm is called repeatedly when sorting large arrays that use divide-and-conquer approaches29. In this work, we focus on two types of small sort algorithm: (1) the fixed sort and (2) the variable sort. Fixed sort algorithms sort sequences of a fixed length (for example, sort 3 can only sort sequences of length 3), whereas variable sort algorithms can sort a sequence of varying size (for example, variable sort 5 can sort sequences ranging from one to five elements).

We formulate the problem of discovering new, efficient sorting algorithms as a single-player game that we refer to as AssemblyGame. In this game, the player selects a series of low-level CPU instructions, which we refer to as assembly instructions30, to combine to yield a new and efficient sorting algorithm. This is challenging as the player needs to consider the combinatorial space of assembly instructions to yield an algorithm that is both provably correct and fast. The hardness of the AssemblyGame arises not only from the size of the search space, which is similar to extremely challenging games such as chess (10^120 games)31 and Go (10^700 games)32, but also from the nature of the reward function. A single incorrect instruction in the AssemblyGame can potentially invalidate the entire algorithm, making exploration in this space of games incredibly challenging.

To play the game, we introduce AlphaDev, a learning agent that is trained to search for correct and efficient algorithms. This agent is comprised of two core components, namely (1) a learning algorithm and (2) a representation function. The AlphaDev learning algorithm can incorporate both DRL as well as stochastic search optimization algorithms to play AssemblyGame. The primary learning algorithm in AlphaDev is an extension of AlphaZero33, a well-known DRL algorithm, in which a neural network is trained to guide a search to solve

1DeepMind, London, UK. 2Google, Mountain View, CA, USA. 3These authors contributed equally: Daniel J. Mankowitz, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, Cosmin Paduraru, Edouard Leurent. ✉e-mail: [email protected]

Nature | Vol 618 | 8 June 2023 | 257


[Fig. 1]
a,

void variable_sort_2(int length, int *a)
{
  // Determine the number of elements
  switch (length) {
  case 0:
  case 1:
    // Exit if less than 2 elements
    return;
  case 2: {
    // Below routine sorts 2 elements
    int tmp = a[0];
    a[0] = (a[1] < a[0]) ? a[1] : a[0];
    a[1] = (a[1] < tmp) ? tmp : a[1];
    return;
  }
  }
}

b,

variable_sort_2(int, int*):
  # Determine the number of elements
  cmp $2, %edi
  # Exit if less than 2 elements
  jne Label
  # Below routine sorts 2 elements
  mov (%rsi), %eax
  mov 4(%rsi), %ecx
  cmp %eax, %ecx
  mov %eax, %edx
  cmovl %ecx, %edx
  mov %edx, (%rsi)
  cmovg %ecx, %eax
  mov %eax, 4(%rsi)
Label:
  retq

Fig. 1 | The relationship between C++ and assembly programs. a, A C++ implementation of a variable sort 2 function that sorts any input sequence of up to two
elements. b, The C++ implementation in a is compiled to this equivalent low-level assembly representation.

AssemblyGame. The representation function is interchangeable and captures the underlying structure of assembly programs. The primary AlphaDev representation is based on Transformers34.

Using AlphaDev, we have discovered fixed and variable sort algorithms from scratch that are both new and more efficient than the state-of-the-art human benchmarks. The fixed sort solutions for sort 3, sort 4 and sort 5 discovered by AlphaDev have been integrated into the standard sort function in the LLVM standard C++ library3. This library is used by several million users including universities and numerous international companies35. In addition, we analyse the new algorithm discoveries, compare AlphaDev to stochastic search optimization approaches and apply AlphaDev to further domains to showcase the generality of the approach.

Representing algorithms as low-level CPU instructions
When compiling algorithms to machine code from a high level language such as C++ (for example, the sorting function in Fig. 1a), the algorithm is first compiled into assembly (Fig. 1b). The assembler then converts the assembly program into executable machine code. In this work, we optimize algorithms at the assembly level30. In a typical assembly program, the values are copied from memory into registers, manipulated between registers and then written back to memory. The set of assembly instructions supported depends on the processor architecture. For the purposes of this work, we focus on a subset of assembly instructions supported by the x86 processor architecture using the AT&T syntax36. Each instruction is of the format Opcode⟨OperandA, OperandB⟩. An example instruction is mov<A,B>, which is defined as move a value from source (A) to destination (B). Further instruction definitions such as compare (cmp<A,B>), conditional move (cmovX<A,B>) and jump (jX<A>) can be found in Extended Data Table 1. In the example in Fig. 1b, %eax, %ecx, %edx, %edi correspond to four different register locations and (%rsi), 4(%rsi) correspond to two different memory locations. The symbol $2 is a placeholder for a constant value, which corresponds to the length of the vector in this example. We use the terms assembly program and assembly algorithm interchangeably in this work. This is because AlphaDev builds an assembly program from scratch, from an initially unordered set of instructions, each time it plays AssemblyGame, defining a new and efficient algorithm.

DRL for discovering faster algorithms
In this section, we formulate optimizing algorithms at the CPU instruction level as a reinforcement learning (RL) problem37, in which the environment is modelled as a single-player game that we refer to as AssemblyGame. Each state in this game is defined as a vector St = ⟨Pt, Zt⟩ where Pt is a representation of the algorithm generated thus far in the game and Zt represents the state of memory and registers after executing the current algorithm on a set of predefined inputs. As seen in Fig. 2a, at timestep t, the player receives the current state St and executes an action at. This involves appending a legal assembly instruction (for example, mov<A,B>) to the current algorithm generated thus far. A reward rt is received that comprises both a measure of algorithm correctness and latency. Algorithm correctness (Fig. 2b) involves inputting a set of N test sequences into the current algorithm Pt to generate N outputs. These outputs are then compared to the expected outputs and a correctness reward rt is computed. Latency rewards can be generated by either (1) penalizing the agent for increasing the length of the algorithm (when length and latency are highly correlated) that we refer to as the algorithm length reward, or (2) measuring the actual latency of the algorithm. The game is executed for a limited number of steps, after which the game is terminated. Winning the game corresponds to generating a correct, low-latency algorithm using assembly instructions. Losing the game corresponds to generating an incorrect algorithm or a correct but inefficient algorithm.

Table 1 | AlphaDev performance when optimizing for algorithm length and latency

(a) Algorithm   AlphaDev length   Human benchmark length
    Sort 3      17                18
    Sort 4      28                28
    Sort 5      42                46
    VarSort3    21                33
    VarSort4    37                66
    VarSort5    63                115
    VarInt      27                31

(b) Algorithm    AlphaDev latency ± (lower, upper)   Human benchmark latency ± (lower, upper)
    VarSort3     236,498 ± (235,898, 236,887)        246,040 ± (245,331, 246,470)
    VarSort4     279,339 ± (278,791, 279,851)        294,963 ± (294,514, 295,618)
    VarSort5     312,079 ± (311,515, 312,787)        331,198 ± (330,717, 331,850)
    VarInt       97,184 ± (96,885, 97,847)           295,358 ± (293,923, 296,297)
    Competitive  75,973 ± (75,420, 76,638)           86,056 ± (85,630, 86,913)

a, AlphaDev performance, compared to the human benchmarks, when optimizing for algorithm length. AlphaDev discovers algorithms from scratch that match or improve on the human benchmarks in each case. b, AlphaDev performance, compared to the human benchmarks, when optimizing directly for latency. In this setup, AlphaDev discovers algorithms that have significantly lower latency than the human benchmarks in each case. The confidence intervals are represented as latency ± (lower, upper), in which latency corresponds to the fifth percentile of latency measurements across 100 different machines. Lower and upper refer to the bounds of the 95% confidence interval for this percentile.


[Fig. 2 | Panel a: AlphaDev at state St selects an action at (for example, mov<Register0,Memory1>), which is appended to the algorithm. Panel b: test inputs A, B, C are run through the algorithm; the outputs A′, D′, C′ are compared with the expected outputs A′, B′, C′ to compute the reward rt.]
Fig. 2 | The AssemblyGame and algorithm correctness computation. a, The AssemblyGame is played by AlphaDev, which receives as input the current assembly algorithm generated thus far St and plays the game by selecting an action to execute. In this example, the action is a mov<Register0,Memory1> assembly instruction, which is appended to the current algorithm. The agent receives a reward that is a function of the algorithm's correctness, discussed in b, as well as the algorithm's latency. The game is won by the player discovering a low latency, correct algorithm. b, The program correctness and latency computations are used to compute the reward rt. In this example, test sequences are input to the algorithm; for example, in the case of sorting three elements, test inputs comprise all sequences of unsorted elements of length 3. For each sequence, the algorithm output is compared to the expected output (in the case of sorting, the expected output is the sorted elements). In this example, the output D′ does not match the expected output B′ and the algorithm is therefore incorrect.

We refer to the agent that plays this single-player game as AlphaDev. The agent's primary learning algorithm is an extension of the AlphaZero agent32 and guides a Monte Carlo tree search (MCTS) planning procedure using a deep neural network33,38. The input to the neural network is the state St and the output is a policy and value prediction. The policy prediction is a distribution over actions and the value function is a prediction of the cumulative returns R that the agent should expect to receive from the current state St. During a game, the agent receives as input the current state St. The agent then executes an MCTS procedure and uses this to select the next action to take. The generated games are then used to update the network's parameters, enabling the agent to learn.

It is critical that AlphaDev has a representation39,40 capable of representing complex algorithmic structures to efficiently explore the space of instructions. To achieve this, we introduce the AlphaDev representation network (Extended Data Fig. 1a). This network comprises two components, namely (1) a transformer encoder network that provides the agent with a representation of the algorithm structure, and (2) the CPU state encoder network that helps the agent predict how the algorithm affects the dynamics of memory and registers. The CPU state encoder network comprises a multilayer perceptron that receives as input the state of each register and memory location for a given set of inputs. These networks each output embeddings that are combined to yield the AlphaDev state representation.

Transformer encoder
Transformers are natural text encoders and have had much success with language models recently14,34,41. As such, this motivated us to adapt the standard transformer to model assembly instructions. We developed and incorporated a transformer encoder, our adaptation of the MultiQuery transformer encoder42, into the AlphaDev representation network to represent the assembly instructions. Each assembly instruction's Opcode and corresponding Operands are converted to one-hot encodings and concatenated to form the raw input sequence. This is fed through a multilayer transformer encoder, which maps it to corresponding embedding vectors (see Extended Data Fig. 1b for an illustration).

Latency value functions
Latency is an important reward signal that is used to guide the agent in discovering performant algorithms. To better estimate latency, we implemented a dual value function setup, whereby AlphaDev has two value function heads: one predicting algorithm correctness and the second predicting algorithm latency. The latency head is used to directly predict the latency of a given program by using the program's actual computed latency as a Monte Carlo target for AlphaDev during training. This dual-head approach achieved substantially better results than the vanilla, single head value function setup when optimizing for real latency.

Results
Discovering faster sort algorithms
We trained the AlphaDev agent from scratch to generate a range of fixed sort and variable sort algorithms that are both correct and achieve lower latency than the state-of-the-art human benchmarks.

Fixed sorting algorithms
We considered three fundamental algorithms: sort 3, sort 4 and sort 5. The state-of-the-art human benchmarks for these algorithms are sorting networks43 as they generate efficient, conditional branchless assembly code. This means that all instructions are executed sequentially and there is no branching involved. Improving on these algorithms is challenging as they are already highly optimized. As seen in Table 1a, AlphaDev is able to find algorithms with fewer instructions than the human benchmarks for sort 3 and sort 5 and matches the state-of-the-art performance on sort 4. These shorter algorithms do indeed lead to lower latency as the algorithm length and latency are correlated for the conditional branchless case; see Appendix B in Supplementary Information for more details. We also explored scaling to slightly larger sorts using a variant of AlphaDev. We managed to save three instructions on sort 6, two instructions on sort 7 and one instruction on sort 8, which provides a promising basis for future work. See Appendix C in Supplementary Information for an overview of the approach.

Variable sorting algorithms
We considered three variable sorting algorithms: VarSort3, VarSort4 and VarSort5. The human benchmark in each case is defined as an algorithm that, for a given input length, calls the corresponding sorting network. In this case, branching is required, which greatly increases the complexity of the problem as the agent needs to (1) determine how many subalgorithms it needs to construct and (2) build the body


[Fig. 3 | Panels a–f: sorting-network diagrams (a, d) and assembly pseudocode before and after the AlphaDev swap move (b, c) and the AlphaDev copy move (e, f).]
Fig. 3 | Sorting networks and algorithmic improvements discovered by AlphaDev. a, An optimal classic sorting network for three inputs. The circled comparators have been improved by AlphaDev. See the AlphaDev swap move for more details. b,c, The assembly pseudocode before applying the AlphaDev swap move (b) and after applying the AlphaDev swap move (c), resulting in the removal of a single instruction. d, An optimal classic sorting network comparator configuration that has been improved by AlphaDev. See the AlphaDev copy move for more details. e,f, The assembly pseudocode before applying the AlphaDev copy move (e) and after applying the AlphaDev copy move (f), resulting in the removal of a single instruction.

of the main algorithm in parallel. The agent may also need to call subalgorithms from other subalgorithms. In this case, optimizing for length leads to significantly shorter algorithms compared to the human benchmarks as seen in Table 1a. However, owing to the complexities introduced by branching, latency and length are not always correlated; see Supplementary Information for more details. As such, we implemented a procedure that measures the actual latency of the programs by taking the fifth percentile of latency measurements across 100 different machines, with computed confidence intervals44, and optimize this metric. See Methods for the full benchmarking setup. When optimizing for latency, the agent improves significantly on the human benchmarks in each case as seen in Table 1b.

New algorithm discoveries
The solutions discovered by AlphaDev include new and exciting algorithmic discoveries that lead to more efficient performance. In the fixed sort setting, we found that AlphaDev discovered two interesting sequences of instructions that, when applied to a sorting network algorithm, reduce the algorithm by one assembly instruction each time. We refer to each sequence of instructions as (1) the AlphaDev swap move and (2) the AlphaDev copy move respectively.

AlphaDev swap move
Figure 3a presents an optimal sorting network for three elements (see Methods for an overview of sorting networks). We will explain how AlphaDev has improved the circled network segment. There are many variants of this structure that are found in sorting networks of various sizes, and the same argument applies in each case. The circled part of the network (last two comparators) can be seen as a sequence of instructions that takes an input sequence ⟨A, B, C⟩ and transforms each input as shown in Table 2a (left). However, a comparator on wires B and C precedes this operator and therefore input sequences where B ≤ C are guaranteed. This means that it is enough to compute min(A, B) as the first output instead of min(A, B, C) as shown in Table 2a (right). The pseudocode difference between Fig. 3b,c demonstrates how the AlphaDev swap move saves one instruction each time it is applied.

AlphaDev copy move
Figure 3d presents a sorting network configuration, consisting of three comparators, that is applied across four wires. This configuration is found in a sort 8 sorting network and corresponds to an operator taking four inputs ⟨A, B, C, D⟩ and transforming them into four outputs


[Fig. 4 | Panel a: flow diagram of the VarSort4 human benchmark (dispatch on length to sort 4, sort 3 or sort 2, then return). Panel b: flow diagram of AlphaDev's VarSort4 (length = 2: sort 2; length = 3: sort 3; length > 3: sort 3, then a simplified sort 4 given that the first three elements are sorted; return).]
Fig. 4 | Fundamentally different algorithms discovered by AlphaDev. a, A flow diagram of the variable sort 4 (VarSort4) human benchmark algorithm. In this algorithm, a sequence of unsorted numbers are input into the algorithm. If the sequence length is four, three or two numbers, then the corresponding sort 4, sort 3 or sort 2 sorting network is called that sorts the resulting sequence. The result is then returned and output by the function. b, The VarSort4 algorithm discovered by AlphaDev. This algorithm also receives sequences of length four, three or two numbers as input. In this case, if the length is two, then it calls the sort 2 sorting network and returns. If the length is three then it calls sort 3 to sort the first three numbers and returns. If, however, the length is greater than three, then it calls sort 3, followed by a simplified sort 4 routine that sorts the remaining unsorted number. It is this part of the routine that results in significant latency savings.

as seen in Table 2b (on the left). One can show that as part of sort 8, the input that flows into the operator satisfies the following inequality: D ≥ min(A, C). This means that the operator can be improved by applying the AlphaDev copy move that is defined in Table 2b (on the right), resulting in one instruction less than the original operator. The code difference between the original operator and the code after applying the AlphaDev copy move is visualized in Fig. 3e,f, respectively.

New variable sort algorithms
The VarSort4 algorithm discovered by AlphaDev is particularly interesting. The flow diagram for the human benchmark algorithm and AlphaDev can be seen in Fig. 4a,b, respectively. The human benchmark algorithm determines the length of the input vector, and then calls the corresponding sorting network to sort the elements. The AlphaDev solution has a completely different approach as seen in Fig. 4b. If the length of the input vector is strictly greater than 2, then sort 3 is immediately called, resulting in the first three elements being sorted. If the vector is greater than three elements, then a simplified sort 4 algorithm is called that sorts the remaining unsorted elements in the input vector. It is this simplified part of the routine that yields significant gains in terms of algorithmic length and latency.

Table 2 | Analysis of the AlphaDev swap and copy moves

(a) Input   Original output         AlphaDev swap move
    A       min(A, B, C)            min(A, B)
    B       max(min(A, C), B)       max(min(A, C), B)
    C       max(A, C)               max(A, C)

(b) Input   Original output         AlphaDev copy move
    A       min(A, B, C, D)         min(A, B, C, D)
    B       max(B, min(A, C, D))    max(B, min(A, C))
    C       max(C, min(A, D))       max(C, min(A, D))
    D       max(A, D)               max(A, D)

a, Left shows the transformation applied to inputs A, B and C in a classic sorting network when applying the circled operator in Fig. 3a. Right shows the AlphaDev swap move transformation applied in place of the circled operator. Note the new transformation in bold that saves a single instruction each time it is applied. b, Left shows the transformation applied to inputs A, B, C and D according to the sorting network configuration in Fig. 3d. Right shows the AlphaDev copy move transformation applied to this sorting network configuration. The transformation in bold indicates the change made by the copy move, saving an instruction each time it is applied.

Stochastic search optimization approaches
It is important to understand the advantages and limitations of RL compared to other approaches for program optimization. As such, we implemented a state-of-the-art stochastic superoptimization approach8, adapted it to the sort setting and used it as the learning algorithm in AlphaDev. We refer to this variant as AlphaDev-S (see Methods for more details). We run this algorithm with at least the same amount of resources and wall-clock time as AlphaDev. AlphaDev-S requires a prohibitive amount of time to optimize directly for latency as latency needs to be computed after every mutation. As such, AlphaDev-S optimizes for a latency proxy, namely algorithm length and, then, at the end of training, we search through all correct programs generated by AlphaDev-S and benchmark each one to find the lowest latency solution. In general, we find that AlphaDev consistently outperforms AlphaDev-S when learning from scratch without previous knowledge. In addition, as the size of the program increases, AlphaDev explores orders of magnitude fewer programs (12 million programs in the worst case) compared to AlphaDev-S (31 trillion programs in the worst case). This may be because AlphaDev is able to better explore the space of algorithms compared to the breadth-first stochastic search procedure that gets stuck more easily into local optima; see Methods for an overview of this exploration hypothesis. In addition, AlphaDev never evaluates latency during search as it uses the latency value function predictions and, because of this, only needs to compute actual measured latency on less than 0.002% of generated programs. When incorporating previous knowledge into AlphaDev-S, such as warm starting the learning algorithm with a near-optimal solution, AlphaDev-S is more computationally efficient for sort 3, sort 4 and sort 5 (branchless assembly algorithms) and also generates competitive low-latency algorithms to that of AlphaDev in each case. However, for algorithms that require branching (if–else statements), in which algorithm length


and latency are not well correlated, AlphaDev discovers lower latency solutions than AlphaDev-S, even when warm starting this algorithm with a near-optimal solution. See Methods for an in-depth analysis of these algorithms.

Generalization to additional domains
To test the generality of AlphaDev, we train the agent on a set of additional domains. These include a protocol buffer deserialization subroutine called VarInt, presented below, and a competitive coding problem (see Appendix D in Supplementary Information for more details). The competitive coding domain latency performance is reported in Table 1b.

Protocol Buffer is Google's open-source data format used to serialize structured data45. This format is commonly used in cases in which performance or network load is of primary concern. The VarInt algorithm46 is a key component in both the serialization and deserialization processes. We trained the AlphaDev agent as in variable sort to optimize the VarInt deserialization function with respect to correctness and measured latency. For correctness, we reward the agent for correctly deserializing each input. We use a set of 80 inputs and corresponding outputs that cover common protobuf use cases. AlphaDev learns an optimized VarInt deserialization function and manages to significantly outperform the human benchmark for single valued inputs. Our agent discovers a branchless solution that is both shorter (Table 1a) and roughly three times faster than the human benchmark (Table 1b). In doing so, the agent also discovered a new VarInt assignment move in which AlphaDev learns to combine two operations into a single instruction leading to latency savings. See Appendix D.1 in Supplementary Information for a full overview of this move. This is a strong indication that AlphaDev is capable of generalizing to optimize non-trivial, real-world algorithms.

Libc++ sort patch
The sort 3, sort 4 and sort 5 algorithms in the LLVM libc++ standard sorting library are called many times by larger sorting algorithms and are therefore fundamental components of the library. We reverse engineered the low-level assembly sorting algorithms discovered by AlphaDev for sort 3, sort 4 and sort 5 to C++ and discovered that our sort implementations led to improvements of up to 70% for sequences of a length of five and roughly 1.7% for sequences exceeding 250,000 elements. These improvements are for the uint32, uint64 and float data types for ARMv8, Intel Skylake and AMD Zen 2 CPU architectures;

functions49 define function correctness by the number of hashing collisions. Therefore, in this case, AlphaDev can optimize for minimizing collisions as well as latency. AlphaDev can also, in theory, optimize complicated logic components within the body of large, impressive functions. We hope that AlphaDev can provide interesting insights and inspire new approaches in both the artificial intelligence and program synthesis communities.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-023-06004-9.

1. Amazon. Amazon S3—two trillion objects, 1.1 million requests/second. AWS https://aws.amazon.com/blogs/aws/amazon-s3-two-trillion-objects-11-million-requests-second/ (2013).
2. Cormen, T. H. et al. Introduction to Algorithms (MIT Press, 2022).
3. Gelmi, M. Introduce branchless sorting functions for sort3, sort4 and sort5. LLVM.org https://reviews.llvm.org/D118029 (2022).
4. Bansal, S. & Aiken, A. Automatic generation of peephole superoptimizers. ACM SIGARCH Comput. Arch. News 34, 394–403 (2006).
5. Alur, R. et al. Syntax-Guided Synthesis (IEEE, 2013).
6. Phothilimthana, P. M. et al. Scaling up superoptimization. In Proc. Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems 297–310 (ACM, 2016).
7. Barthe, G. et al. From relational verification to SIMD loop synthesis. In Proc. of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 123–134 (ACM, 2013).
8. Schkufza, E., Sharma, R. & Aiken, A. Stochastic superoptimization. ACM SIGPLAN Notices 48, 305–315 (2013).
9. Bunel, R. et al. Learning to superoptimize programs. In Proc. International Conference on Learning Representations (ICLR, 2016).
10. Phothilimthana, P. M. et al. Chlorophyll: synthesis-aided compiler for low-power spatial architectures. ACM SIGPLAN Notices 49, 396–407 (2014).
11. Vinyals, O. et al. Grammar as a foreign language. Adv. Neural Inform. Proc. Syst. 28, 2773–2781 (2015).
12. Chen, X., Liu, C. & Song, D. Towards synthesizing complex programs from input-output examples. In Proc. International Conference on Learning Representations (ICLR, 2018).
13. Devlin, J. et al. Robustfill: neural program learning under noisy i/o. In Proc. International Conference on Machine Learning 990–998 (PMLR, 2017).
14. Li, Y. et al. Competition-level code generation with AlphaCode. Science 378, 1092–1097 (2022).
15. Pearce, H. et al. Can codex and other large language models help us fix security bugs? Preprint at https://arxiv.org/abs/2112.02125 (2021).
16. Chen, M. et al. Evaluating large language models trained on code. Preprint at https://
see Appendix E in Supplementary Information for the full performance arxiv.org/abs/2107.03374 (2021).
17. Bingmann, T., Marianczuk, J. & Sanders, P. Engineering faster sorters for small sets of
tables. The performance improvements are due to both the branch- items. Software: Pract. Exper. 51, 965–1004 (2021).
less conditional assembly generated by AlphaDev as well as the new 18. Levcopoulos, C. & Petersson, O. Splitsort: an adaptive sorting algorithm. Inform. Proc.
AlphaDev swap move. For sort 5, we used a 43 length algorithm dis- Lett. 39, 205–211 (1991).
19. Helman, D. R., Bader, D. A. & JáJá, J. A randomized parallel sorting algorithm with an
covered by AlphaDev, as it led to a more efficient C++ implementation. experimental study. J. Parallel Distrib. Comput. 52, 1–23 (1998).
These algorithms were sent for review and have officially been included 20. Goodrich, M. T. Randomized shellsort: a simple oblivious sorting algorithm. In Proc.
in the libc++ standard sorting library3. It is the first change to these of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms 1262–1277
(ACM, 2010).
sub-routines in over a decade. This is also the first time that any compo- 21. Mehlhorn, K., Sanders, P. & Sanders, P. Algorithms and Data Structures: The Basic Toolbox
nent in this sort library has been replaced by an algorithm that has been Vol. 55. (Springer, 2008).
automatically discovered using reinforcement learning. We estimate 22. Knebl, H. Algorithms and Data Structures (Springer, 2020).
23. Karatzoglou, A., Baltrunas, L. & Shi, Y. Learning to rank for recommender systems. In Proc.
that these routines are being called trillions of times every day1,35,47. of the 7th ACM Conference on Recommender Systems 493–494 (ACM, 2013).
24. Yang, J. Y., Zhang, B. & Mao, Y. Study on Information Retrieval Sorting Algorithm in
Network-BasedManufacturing Environment. In Applied Mechanics and Materials Vol. 484,
183–186 (Trans Tech Publishing, 2014).
Discussion 25. Krallmann, J., Schwiegelshohn, U. & Yahyapour, R. On the design and evaluation of job
AlphaDev discovers new, state-of-the-art sorting algorithms from schedulingalgorithms. In Workshop on Job Scheduling Strategies for Parallel Processing
scratch that have been incorporated into the LLVM C++ library, used 17–42 (Springer, 1999).
26. White, S. K., Martinez, T. & Rudolph, G. Generating a novel sort algorithm using
by millions of developers and applications around the world23–25. Both Reinforcement Programming. In Proc. IEEE Congress on Evolutionary Computation 1–8
AlphaDev and stochastic search are powerful algorithms. An inter- (IEEE, 2010).
esting direction for future research is to investigate combining these 27. Srivastava, S., Gulwani, S. & Foster, J. S. From program verification to program synthesis.
In Proc. of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of
algorithms together to realize the complementary advantages of both Programming Languages 313–326 (ACM, 2010).
approaches. 28. Ansel, J. et al. Petabricks: a language and compiler for algorithmic choice. ACM Sigplan
Notices 44, 38–49 (2009).
It is important to note that AlphaDev can, in theory, generalize to
29. Smith, D. R. The design of divide and conquer algorithms. Sci. Comput. Program. 5, 37–58
functions that do not require exhaustive verification of test cases. (1985).
For example, hashing functions48 as well as cryptographic hashing 30. Irvine, K. R. et al. Assembly Language for Intel-Based Computers (Prentice Hall, 2003).

262 | Nature | Vol 618 | 8 June 2023


31. Shannon, C. E. XXII. Programming a computer for playing chess. London, Edinb. Dublin Philos. Mag. J. Sci. 41, 256–275 (1950).
32. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
33. Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).
34. Vaswani, A. et al. Attention is all you need. Adv. Neural Inform. Proc. Syst. 30, 5999–6009 (2017).
35. LLVM. LLVM users https://llvm.org/Users.html (LLVM, 2022).
36. Bartlett, J. Learn to Program with Assembly 271–273 (Apress, 2021).
37. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction 2nd edn (MIT Press, 2018).
38. Schrittwieser, J. et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2020).
39. Maillard, O.-A., Ryabko, D. & Munos, R. Selecting the state-representation in reinforcement learning. Adv. Neural Inform. Proc. Syst. 24, 2627–2635 (2011).
40. Qian, R. et al. Spatiotemporal contrastive video representation learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 6964–6974 (IEEE, 2021).
41. Brown, T. et al. Language models are few-shot learners. Adv. Neural Inform. Proc. Syst. 33, 1877–1901 (2020).
42. Shazeer, N. Fast transformer decoding: one write-head is all you need. Preprint at https://arxiv.org/abs/1911.02150 (2019).
43. Bundala, D. & Závodny, J. Optimal sorting networks. In Proc. International Conference on Language and Automata Theory and Applications 236–247 (Springer, 2014).
44. Hahn, G. J. & Meeker, W. Q. Statistical Intervals: A Guide for Practitioners Vol. 92 (John Wiley & Sons, 2011).
45. Google. Protocol buffers, version 0.2.5; https://developers.google.com/protocol-buffers (2022).
46. Google. VarInt protocol buffer serialization and deserialization, version 0.2.5; https://developers.google.com/protocol-buffers/docs/encoding (2022).
47. Potvin, R. & Levenberg, J. Why Google stores billions of lines of code in a single repository. Commun. ACM 59, 78–87 (2016).
48. Berman, I. et al. Multi-collision resistant hash functions and their applications. In Proc. Annual International Conference on the Theory and Applications of Cryptographic Techniques 133–161 (Springer, 2018).
49. Damgård, I. B. Collision free hash functions and public key signature schemes. In Workshop on the Theory and Application of Cryptographic Techniques 203–216 (Springer, 1987).

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2023



Methods

Background
AlphaZero. AlphaZero33 is an RL algorithm that leverages MCTS as a policy improvement operator. It consists of (1) a representation network f^rep that outputs a latent representation ht of the state St; and (2) a prediction network f^pred that predicts the expected return (the value) v̂t and a policy (that is, a distribution over the action space) π̂t from a given latent state. The algorithm uses the true dynamics and reward when planning. MuZero38 is a model-based variant of AlphaZero that has the same representation and prediction networks, but also learns a model of the dynamics and predicts rewards, which it uses for planning. Specifically, it learns a dynamics network f^dyn that predicts the next latent state ht^(k+1) and reward r̂t^(k+1) resulting from a transition. Note that the subscript t denotes timesteps in the real environment and the superscript k represents timesteps in the model.

ht = f^rep(St)    (1)

ht^(k+1), r̂t^(k+1) = f^dyn(ht^k, at^k)    (2)

v̂t, π̂t = f^pred(ht)    (3)

On reaching a new state, AlphaZero proceeds by first encoding the state into a latent representation with the representation network. Then, the true dynamics or dynamics network (for MuZero) as well as the prediction network f^pred(ht) are used to simulate several trajectories that fill out a search tree, by sampling state transitions. At each node, the actions are selected using an optimistic strategy called the predictor upper confidence tree bound32, meant to balance exploration (trying new actions) and exploitation (progressing further down the subtree of the current estimate of the best action). This strategy starts out by following the predicted policy π̂t closely, and gradually shifts towards maximizing the predicted value function. Ultimately, an action is recommended by sampling from the root node with probability proportional to its visit count during MCTS. The predicted policy is then trained to match the visit counts of the MCTS policy in an attempt to distil the search procedure into a policy such that subsequent iterations of MCTS will disregard nodes that are not promising.

Sorting networks. Sorting networks are very efficient as their structures can be parallelized on modern CPU architectures. They therefore tend to achieve faster runtime performance, especially on small sorts, compared to popular and efficient base case algorithms such as insertion sort17,43,50. A sorting network43 consists of two types of item called comparators (vertical lines) and wires (horizontal lines) (Extended Data Fig. 2a). Each wire carries a value from left to right. When two wires intersect at a comparator, the values on the two wires are compared. If the value of the bottom wire is smaller than the value of the top wire, then the values are swapped between wires as seen in Extended Data Fig. 2b. A programmatic implementation of a sorting network consists of executing these swaps on particular pairs of elements from the input sequence in a particular order.

Action pruning rules
We pruned the action space by removing some program invariances (for example, the order of register allocation) and illegal instructions (for example, comparing two memory locations). This helps reduce the size of the action space and increases the convergence rate. For our experiments, we used the following rules:
(1) Memory locations are always read in incremental order.
(2) Registers are allocated in incremental order.
(3) We cannot compare or conditionally move to a memory location (illegal).
(4) We can read and write to each memory location only once.
(5) We cannot use non-initialized registers (illegal).
(6) Do not perform consecutive compare instructions.

Training regime. We train AlphaDev on a Tensor Processing Unit (TPU) v.3, with a total batch size of 1,024 per TPU core. We use up to 16 TPU cores and train for 1 million iterations. On the actor side, the games are played on standalone TPU v.4, and we use up to 512 actors. In practice, across all tasks, training takes, in the worst case, 2 days to converge.

AlphaDev-S. It is important to understand the advantages and limitations of RL compared to other possible approaches for program optimization. As such, we implemented a state-of-the-art stochastic superoptimization approach8 and incorporated it into AlphaDev as the learning algorithm to optimize sorting functions. We refer to this adapted version as AlphaDev-S. Our re-implementation has been specifically optimized for the sorting domain. This includes implementing the algorithm to run with our assembly environment, defining a correctness and performance loss function specific to sorting and running extensive hyperparameter sweeps to identify the best variant. The cost function used for AlphaDev-S is c = correctness + α × performance, where correctness corresponds to computing the number of incorrect input sequence elements that are still unsorted, performance corresponds to the algorithm length reward and α is a weight trading off the two cost functions. We are unable to optimize directly for latency as this slows down the learning algorithm considerably, making learning infeasible. It should be noted that this function has been adapted to support the same set of assembly instructions used by AlphaDev as well as prune the same set of incorrect or illegal actions. It also uses the same program correctness computation module (Fig. 2b) to compute the correctness term. AlphaDev-S is then executed by first proposing a transformation to the program stored in the buffer (which may be empty or initialized with an already sorted program). The correctness and performance terms are then computed using the program correctness module and algorithm length, respectively. If the cost is lower than the current best cost, the new program is accepted with high probability, otherwise it is rejected. We will now discuss the correctness cost function and transform weights in more detail.

Correctness cost. For the correctness cost function, we implemented three types of cost function. The first one is defined as the percentage of incorrectly placed items: (P − PCt)/P, where P is the total number of items to place and PCt is the number of correctly placed items at timestep t. The second variant is the square root of this equation. The final cost function takes the square root of the difference P − PCt, and this is what yielded the best performance.

Program transformations. We enabled several program transformations such as adding an instruction to increase the size of the program (Add Transform), swapping two instructions (Swap Transform), randomly changing an Opcode for an instruction (Opcode Transform), randomly sampling an Operand for a chosen instruction (Operand Transform) and randomly sampling an Opcode and its corresponding Operands (Instruction Transform). It is possible to influence the sampling of these transforms to encourage some to be sampled more or less frequently. We optimized the weights for sampling transforms by running an extensive hyperparameter sweep.
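To make the cost function and program transformations described above concrete, the following is a minimal, illustrative sketch of an AlphaDev-S-style search loop. It is a toy stand-in rather than the actual system: programs are represented as lists of compare-swap index pairs (as in a sorting network) instead of x86 assembly, only three of the transforms are included with uniform sampling weights, and the acceptance rule is a bare-bones version of "accept lower-cost programs with high probability". The cost follows c = correctness + α × performance, using the sqrt(P − PCt) correctness variant that worked best above.

```python
import math
import random

ALPHA = 0.1  # weight trading off correctness against program length


def run(program, seq):
    """Execute a toy program: a list of compare-swap pairs applied left to right."""
    seq = list(seq)
    for i, j in program:
        if seq[i] > seq[j]:
            seq[i], seq[j] = seq[j], seq[i]
    return seq


def cost(program, tests):
    """c = correctness + alpha * performance, with sqrt(P - PC_t) correctness."""
    p = sum(len(out) for _, out in tests)          # total items to place
    pc = sum(a == b for inp, out in tests          # correctly placed items
             for a, b in zip(run(program, inp), out))
    return math.sqrt(p - pc) + ALPHA * len(program)


def propose(program, n):
    """Apply one randomly chosen transform: Add, Swap or Operand."""
    prog = list(program)
    move = random.choice(['add', 'swap', 'operand'])
    if move == 'add' or len(prog) < 2:
        # Add Transform: insert a new instruction at a random position
        prog.insert(random.randrange(len(prog) + 1),
                    tuple(sorted(random.sample(range(n), 2))))
    elif move == 'swap':
        # Swap Transform: exchange two existing instructions
        a, b = random.sample(range(len(prog)), 2)
        prog[a], prog[b] = prog[b], prog[a]
    else:
        # Operand Transform: resample the operands of one instruction
        prog[random.randrange(len(prog))] = tuple(sorted(random.sample(range(n), 2)))
    return prog


def search(tests, n, steps=20_000, seed=0):
    random.seed(seed)
    current, cur_cost = [], cost([], tests)
    best, best_cost = current, cur_cost
    for _ in range(steps):
        cand = propose(current, n)
        c = cost(cand, tests)
        # lower-cost proposals are accepted; worse ones only occasionally
        if c < cur_cost or random.random() < 0.02:
            current, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
    return best


# All permutations of three elements serve as the correctness tests.
tests = [(inp, sorted(inp)) for inp in
         [(2, 1, 0), (0, 2, 1), (1, 0, 2), (2, 0, 1), (1, 2, 0), (0, 1, 2)]]
program = search(tests, n=3)
```

With enough steps this typically recovers a short three-comparator network for three-element inputs. The real AlphaDev-S additionally tunes the transform-sampling weights by hyperparameter sweep and shares AlphaDev's program correctness computation module.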
Investigative studies for AlphaDev variants
We now present a set of investigative studies that help to better understand the advantages and limitations of the DRL and the stochastic search learning algorithms used in AlphaDev. We compare AlphaDev to AlphaDev-S. We implemented two variants of AlphaDev-S: (1) Cold Start (AlphaDev-S-CS) and (2) Warm Start (AlphaDev-S-WS). AlphaDev-S-CS uses no previous information and has to generate a program from an empty program buffer. AlphaDev-S-WS's buffer is warm started with a correct sorting program (for example, an optimal sorting network assembly program) and it edits the program to optimize it further. We compared the variants with AlphaDev in both the individual and variable sort algorithm setups.

Because AlphaDev always learns from scratch with no previous knowledge, the direct comparison would be to the cold start stochastic search version: AlphaDev-S-CS. However, as initial near-optimal programs may sometimes be available, we also compare AlphaDev to the warm start stochastic search version: AlphaDev-S-WS.

It should be noted that the stochastic search variants are unable to optimize directly for latency, as this would make learning infeasible because of computational efficiency. As such, our AlphaDev-S variants optimize for algorithm length. Then, at the end of training, we iterate through the set of generated programs for AlphaDev-S across varying lengths and identify the program with the lowest latency.

In each case, the stochastic search algorithms (AlphaDev-S) are run using at least the same computational resources and wall-clock time as AlphaDev.

Fixed sort. We first examine the performance of the various approaches for the fixed sort algorithms. In this case, all algorithmic variants optimize for algorithm length, as algorithm length and latency are highly correlated in the conditional branchless setting (see Supplementary Information for more details).

In the cold start setting, AlphaDev-S-CS is unable to find the optimal programs in each case, as seen in Extended Data Table 2a. In addition, AlphaDev-S-CS explores orders of magnitude more programs than AlphaDev, as shown in Extended Data Table 2b. In the warm start setting, AlphaDev-S is warm started with a near-optimal sorted program and is able to match the performance of AlphaDev in each case, as shown in Extended Data Table 2a. It is more computationally efficient than AlphaDev, as shown in Extended Data Table 2c, but explores orders of magnitude more programs for sort 3 and sort 5, as shown in Extended Data Table 2b. It can be argued that AlphaDev-S-WS has a substantial advantage in this scenario as it is provided with an initial near-optimal program. We will show in the Variable sort section that when the algorithms become more complicated and branching is introduced, warm starting the learning algorithm with a near-optimal program is not enough and can cause it to get stuck in suboptimal solutions.

Brute-force approach. We also used a brute-force approach to prove that no program shorter than 17 instructions exists for sort 3. We had to enumerate roughly 10^32 programs and, even with pruning heuristics, it took more than 3 days to prove this hypothesis. For sort 4 and above this approach is infeasible.

Latency benchmarking suite. The length of a program is only a proxy for the performance of an algorithm. As we introduce branching structures, the length and latency of a program are not well correlated. Therefore, we run the programs on actual machines and measure their latency. Microbenchmarking is very challenging given the numerous noise sources that could affect the measurements. This is especially true when running on shared machines, where there could be interference from other processes. Our approach is to have a separate benchmarking service, replicated on separate machines, so that we can quickly perform many measurements in a controlled environment under different conditions. The system works as follows:
(1) The RL agent processes 1,000 measurements across the machines using the replicated service.
(2) For each measurement, the service runs the given sorting algorithm over 10,000 random inputs (for example, for sort 3 this would be 3 × 10,000 = 30,000 random integers).
(3) We measure the time taken using a CPU performance counter (CPU_CLK_UNHALTED.CORE).
We then take the fifth percentile as our final measurement, because we assume that most noise sources are one-sided (for example, cache misses, pre-emptions and so on). During training, we process the measurements across ten machines for computational efficiency. After training, we benchmark AlphaDev's solution against the baseline solutions and process the measurements across 100 machines for more accuracy and noise reduction. For each benchmark, we compute confidence intervals using the distribution-free two-sided confidence interval for a quantile tabular method44.

Variable sort. When optimizing directly for latency, AlphaDev outperforms AlphaDev-S-WS on VarSort3, VarSort4 and VarSort5, as seen in Extended Data Table 3a. AlphaDev-S-CS fails to find a solution in each case. In the cases of VarSort4 and VarSort5, program length and latency are not correlated (see Supplementary Information for more details). This indicates that when program length cannot be used as a proxy for performance, AlphaDev is able to find lower latency solutions compared to AlphaDev-S. This is even the case when the stochastic search is warm started with a near-optimal program. In addition, AlphaDev converges to the optimal solution after exploring a maximum of 12M programs, as seen in Extended Data Table 3b. This is orders of magnitude lower than that of AlphaDev-S-CS and AlphaDev-S-WS, respectively (31 trillion programs in the worst case).

Exploration hypothesis
We proposed that AlphaDev-S struggles to discover programs when learning from scratch and gets stuck in local optima when warm started because of its limited exploration capabilities as a result of the stochastic search procedure. Extended Data Fig. 3 shows two-dimensional t-distributed stochastic neighbour embedding (t-SNE) projections51 of AlphaDev and AlphaDev-S's assembly algorithms discovered during their respective training procedures for VarSort5. The features used in the projection include correctness, latency, algorithm length and a histogram count of the instructions used per algorithm. Extended Data Fig. 3a indicates the regions in algorithm space explored by AlphaDev, AlphaDev-S-CS and AlphaDev-S-WS, respectively, whereas Extended Data Fig. 3b superimposes algorithm correctness onto each point in the t-SNE projection, in which the colour indicates the correctness of each discovered algorithm, ranging from incorrect algorithms (purple) to correct algorithms (yellow). The AlphaDev-S variants both cover a densely packed circular region around their initial seed, which highlights the breadth-first nature of their stochastic search procedure. This illustrates that AlphaDev-S-CS fails to navigate through the space of incorrect algorithms in a reasonable amount of time and discover correct algorithms when learning from scratch. A similar argument applies to AlphaDev-S-WS whereby, when optimizing from an already correct but suboptimal expert demonstration, the algorithm is biased towards exploring its vicinity and struggles to escape this local maximum. By contrast, AlphaDev has more diverse algorithm space coverage, as the long-term value function is a guiding signal for discovering new and interesting parts of algorithm space. As seen in Extended Data Fig. 3b, it is capable of escaping the space of incorrect algorithms to discover a new space of correct algorithms, highlighting the exploration advantages afforded by AlphaDev.
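The fifth-percentile aggregation and the distribution-free confidence interval used by the latency benchmarking suite above can be sketched as follows. This is an illustrative sketch with simulated timings (the real suite reads the CPU_CLK_UNHALTED.CORE counter across replicated machines); the interval follows the standard binomial order-statistic construction that underlies tabular methods such as that of Hahn and Meeker44.

```python
import math
import random


def fifth_percentile(samples):
    # Noise (cache misses, pre-emptions) is assumed one-sided, so a low
    # order statistic is used as the latency estimate.
    s = sorted(samples)
    return s[int(0.05 * (len(s) - 1))]


def quantile_ci(samples, q=0.05, conf=0.95):
    """Distribution-free two-sided CI for the q-quantile.

    The pair of order statistics (s[l-1], s[u-1]) covers the true quantile
    with probability sum_{k=l}^{u-1} C(n,k) q^k (1-q)^(n-k); the window of
    ranks is widened around n*q until that binomial mass reaches `conf`.
    """
    s = sorted(samples)
    n = len(s)
    pmf = [math.comb(n, k) * q**k * (1 - q)**(n - k) for k in range(n + 1)]
    l = max(1, int(n * q))
    u = min(n, l + 1)
    while sum(pmf[l:u]) < conf and (l > 1 or u < n):
        if l > 1:
            l -= 1
        if u < n:
            u += 1
    return s[l - 1], s[u - 1]


# Simulate 1,000 measurements of a nominal 200-cycle routine with
# one-sided exponential noise spikes.
random.seed(0)
samples = [200 + random.expovariate(1 / 20) for _ in range(1000)]
estimate = fifth_percentile(samples)
low, high = quantile_ci(samples)
assert low <= estimate <= high
```

Because the noise is one-sided, the fifth percentile sits close to the noise-free latency while the mean is inflated by the spikes, which is the motivation given above for aggregating with a low percentile rather than an average.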
program space, especially as the size and complexity of the program which are used as part of the state space. By contrast, our approach
increases. only focuses on training a single RL architecture, taking advantage
Second, stochastic search techniques circumvent comprehensive of MCTS search and powerful state representations. Shypula et al.64
enumeration by relying on sampling mechanisms such as Markov create a supervised assembly dataset and use it to train a Transformer
chain Monte Carlo sampling5,6,8,9. Rajeev Alur et al.5 define a correct- model for mapping unoptimized to optimized code, followed by an RL
ness specification, provided by a logical formula that uses symbols stage for improving the solution quality. Our method does not require
from a background theory. The goal is to then find an implementa- a supervised dataset or two separate training and finetuning stages,
tion expression such that logical formula defining the specification and optimizes everything end-to-end using RL and search instead.
is valid. The idea is to iteratively add test cases and then search and Chen et al.65 define their own domain specific language and perform
expand the program to solve the given test cases. They optimize for input–output program synthesis that better uses the intermediate
correctness on problems from the book Hacker’s delight54. Phitch- program representation to guide the synthesis routine. They show
aya Mangpo Phothilimthana et al.6 introduce the LENS algorithm that that this can be incorporated with RL, using the setup of Rudy Bunel
is based on running enumerative, stochastic and symbolic search et al.66 and improve the correctness of generated functions. They do
in parallel, while relying on handcrafted pruning rules. This setup is not, however, optimize for program length or latency.
capable of optimizing up to 21 instructions, and cannot optimize for
latency nor support branching. Another algorithm8 is based on Markov Input–output examples for program synthesis. A large body of work
chain Monte Carlo rejection sampling and applies transformations to addresses the problem of learning programs from input–output pairs.
programs in assembly using a loss function that is a function of cor- One type of approach learns a neural network for matching inputs to
rectness and performance. Many of these approaches are prone to outputs directly11,13,67,68. This approach is difficult to integrate into exist-
getting stuck in local minima and may also struggle as the size and/ ing libraries and can struggle to generalize to previously unseen inputs,
or complexity of the program increases. In addition, incorporating although there has been some encouraging recent progress using graph
actual, measured latency into these approaches are either infeasible or representations69. Another type of approach is to perform a search in
prohibitively expensive. program space, guided by a learned model12,70–72. For instance, Chen
Third, symbolic search approaches can also be implemented to opti- et al.70 use a model that predicts the next program token on the basis of
mize assembly programs. These include SAT solvers55, SMT solvers5,6 a partial program and the input–output pairs. This bears some similari-
and Mixed Integer Programs (MIPs)56,57. However, these approaches ties to how search is guided in our approach: the learned policy prior
suffer from scaling issues. For example, classical solvers require a prob- in AlphaZero is a model for predicting the next token, learned on the
lem to be translated into a certain canonical form. It usually requires basis of a combination of a partial program and that program’s effects
an expert in the said solvers and a substantial amount of time to find on the inputs. However, we are interested in finding correct and efficient
an efficient formulation. In addition, for any new modification of the programs, which we achieve by further learning a value function for
problem, this has to be repeated. Classical solvers are also hard to paral- approximating the expected latency of partial programs, and using
lelize and thus, it is challenging to leverage more hardware to speed up AlphaZero to incorporate this value function into the search process.
the solving process. Another symbolic search algorithm is Cholorphyll10
that implements a multi-phase approach. It first requires as input a Deep learning for code generation. There are also several deep learn-
source program with partition annotations that specify where code ing approaches that use large languages models to generate code. These
and data reside. Then, a layout synthesizer maps program fragments approaches vary in their uses from transpilation, code refactoring and
onto physical cores to minimize computational costs. The code is then explaining code15 to generating human-level competitive code using a
separated into per-core program fragments and the program frag- natural language description14. That particular work aims to generate
ments are compiled into machine code. At this point, a superoptimizer correct code, but does not focus on generating low-latency solutions.
optimizes each of these fragments.
Sort-based program optimization. There are several program synthe-
SIMD optimization. Various approaches58–60 have also been applied to sorting functions that run in the single instruction, multiple data (SIMD)61 setup. This setup is capable of parallelizing instruction execution, but is not supported at present in popular libraries such as LLVM's libc++ std::sort library. One example is that of Gilles Barthe et al.7, who propose a methodology for optimizing programs by automatically vectorizing loops with SIMD instructions. They do this by introducing a framework for verifying the correctness of transformations to a program and performing a search-based procedure using the said transformations. Their framework can discover SIMD looping structures of up to nine instructions in 0.12 s, which corresponds to a minimum 2× speed-up.

RL approaches for program synthesis. There are also several studies using RL for program optimization. Kevin Ellis et al.62 learn a policy and value function to write and evaluate code, as well as performing a Monte Carlo-style search strategy during inference. This work requires a pretraining step and aims to generate correct programs that satisfy a predefined specification. The approach is successfully applied to computer-aided design and string editing programs. SuperSonic63 uses an RL meta-optimizer to select between different RL architectures, using a Multi-Armed Bandit policy search to find a state representation, reward function and RL algorithm that is optimal for the current task. This requires keeping track of many RL algorithms and architectures.

There have also been program synthesis studies that have tackled sorting algorithms. For example, White et al.26 use RL for learning sorting functions. Their work uses several heuristics and a domain-specific language to yield a sorting algorithm called reinforcement programming sort. Srivastava et al.27 encode program synthesis as a verification problem. Specifically, they represent a synthesis task as a tuple consisting of the functional expression, the domains and guards appearing in the synthesized program and the resource constraints. The idea is that, given a prespecified resource constraint, their synthesizer produces a program that meets the predefined specification to ensure correctness. They apply this to discover merge sort and quick sort. Jason Ansel et al.28 take as input predefined algorithms (for example, insertion sort, merge sort and quick sort) and then determine when to select these algorithms for execution using their autotuner function. They do so by defining a language that contains rules and transforms that dictate how the algorithms are selected and where they are executed.

Data availability
The data used to train the system were generated synthetically according to the procedures explained in the paper. The algorithms discovered by AlphaDev for the copy and swap operators are presented in the main paper. We have also released the discovered AlphaDev assembly implementations for sort 3–8 as well as VarSort3, 4 and 5 on Github at https://
github.com/deepmind/alphadev. We have included exhaustive tests to ensure that each implementation is correct. In addition, Appendix G in Supplementary Information contains a list of additional, correct sorting algorithms discovered by AlphaDev for sort 3, sort 4 and sort 5. The performance of the sort 3, sort 4 and sort 5 algorithms on the official LLVM benchmarking suite for three different CPU architectures as well as floats, int32 and int64 data types is detailed in Appendix E in the Supplementary Information. In addition, the AlphaDev sort 3, sort 4 and sort 5 implementations can be found in the LLVM libc++ standard sorting library3.

Code availability
We have also released pseudocode at https://github.com/deepmind/alphadev that includes the environment, the full actor and training loops as well as the core MCTS algorithm. In addition, we include our actual JAX implementation of our policy, value and representation networks that enable the architectures to be reproduced. Finally, we have a config file containing the hyperparameter definitions to be used with the agent.

50. Hwang, M. Sort, Bitset (GitHub, 2021).
51. Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
52. Gulwani, S. et al. Synthesis of loop-free programs. ACM SIGPLAN Notices 46, 62–73 (2011).
53. Sasnauskas, R. et al. Souper: a synthesizing superoptimizer. Preprint at https://arxiv.org/abs/1711.04422 (2017).
54. Warren, H. S. Hacker's Delight (Pearson Education, 2013).
55. Hamadi, Y., Jabbour, S. & Sais, L. ManySAT: a parallel SAT solver. J. Satisfiability, Boolean Model. Comput. 6, 245–262 (2010).
56. Wolsey, L. A. Mixed integer programming. In Wiley Encyclopedia of Computer Science and Engineering 1–10 (Wiley, 2007).
57. Nair, V. et al. Solving mixed integer programs using neural networks. Preprint at https://arxiv.org/abs/2012.13349 (2020).
58. Inoue, H. et al. AA-sort: a new parallel sorting algorithm for multi-core SIMD processors. In Proc. International Conference on Parallel Architecture and Compilation Techniques (PACT 2007) 189–198 (IEEE, 2007).
59. Yin, Z. et al. Efficient parallel sort on AVX-512-based multi-core and many-core architectures. In Proc. IEEE 21st International Conference on High Performance Computing and Communications 168–176 (IEEE, 2019).
60. Blacher, M. et al. Vectorized and performance-portable Quicksort. Preprint at https://arxiv.org/abs/2205.05982 (2022).
61. Wikipedia. Single instruction, multiple data https://en.m.wikipedia.org/wiki/SIMD (2022).
62. Ellis, K. et al. Write, execute, assess: program synthesis with a REPL. Adv. Neural Inform. Proc. Syst. 32, 9137–9146 (2019).
63. Wang, H. et al. Automating reinforcement learning architecture design for code optimization. In Proc. 31st ACM SIGPLAN International Conference on Compiler Construction 129–143 (ACM, 2022).
64. Shypula, A. G. et al. Learning to superoptimize real-world programs. Preprint at https://arxiv.org/abs/2109.13498 (2022).
65. Chen, X., Liu, C. & Song, D. Execution-guided neural program synthesis. In Proc. International Conference on Learning Representations (ICLR, 2018).
66. Bunel, R. et al. Leveraging grammar and reinforcement learning for neural program synthesis. In Proc. International Conference on Learning Representations (ICLR, 2018).
67. Aharoni, R. & Goldberg, Y. Towards string-to-tree neural machine translation. In Proc. 55th Annual Meeting of the Association for Computational Linguistics 132–140 (ACL, 2017).
68. Dong, L. & Lapata, M. Language to logical form with neural attention. In Proc. 54th Annual Meeting of the Association for Computational Linguistics 33–43 (ACL, 2016).
69. Ibarz, B. et al. A generalist neural algorithmic learner. In Proc. Learning on Graphs Conference Vol. 198, 2:1–2:23 (PMLR, 2022).
70. Chen, X., Song, D. & Tian, Y. Latent execution for neural program synthesis beyond domain-specific languages. Adv. Neural Inform. Proc. Syst. 34, 22196–22208 (2021).
71. Parisotto, E. et al. Neuro-symbolic program synthesis. Preprint at https://arxiv.org/abs/1611.01855 (2016).
72. Ellis, K., Solar-Lezama, A. & Tenenbaum, J. Sampling for Bayesian program learning. Adv. Neural Inform. Proc. Syst. 29, 1297–1305 (2016).

Acknowledgements We thank P. Kurylowicz, N. Anderson and Z. Ahmed for assistance coordinating the research; L. Dionne and N. Klauser for patiently reviewing our LLVM code; and N. Vaish, D. Gove, D. Kutenin and A. Fawzi for their helpful advice during the course of the project. We also thank our colleagues at DeepMind for their encouragement and support.

Author contributions D.J.M., A.Michi and A.Z. conceived the idea and led the research. A.Michi, D.J.M., A.Z., M.G., M.S., C.P., E.L., S.I. and A.Mandhane developed the neural network architecture and training. J.-B.L., C.P., M.G., D.J.M. and E.L. developed the baseline. M.G., A.Z., D.J.M., M.H., A.A., T.K. and K.Millikin analysed the generated algorithms and helped with the sort patch. D.J.M., A.Michi, A.Z., S.G., S.E., J.B., R.T., C.G. and K.Milan managed the research. A.Michi, M.G. and M.S. led the technical platform. A.Mandhane, T.H., Y.L., J.S., T.C., M.B., P.K., M.R., D.S., O.V. and D.H. contributed technical advice and ideas. D.J.M. and A.Z. conceived the project. D.J.M., C.P., E.L., A.Michi, M.G., A.Z., P.K. and M.S. wrote the paper.

Competing interests D.J.M., A.Michi, A.Z., M.G., M.S., C.P., E.L., S.I., A.Mandhane, P.K., M.R., D.S. and O.V. are planning to file a patent application relating to subject matter contained in this paper in the name of DeepMind Technologies Limited. The remaining authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41586-023-06004-9.
Correspondence and requests for materials should be addressed to Daniel J. Mankowitz.
Peer review information Nature thanks Zheng Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Reprints and permissions information is available at https://www.nature.com/reprints.
Extended Data Fig. 1 | The AlphaDev representation network architecture. (a) The AlphaDev representation network comprises a Transformer Encoder network that receives as input the assembly algorithm generated thus far. It also contains a CPU State Encoder network that receives as input the current state of memory and registers. The exact architecture and hyperparameters can be found in the Supplementary Information, Appendix A. (b) Before inputting instructions into the Transformer Encoder network, each program instruction's opcode and operands are converted to one-hot encodings and concatenated. The resulting encoding is then fed into the Transformer Encoder network.
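The one-hot encoding step described in (b) can be sketched in a few lines. The opcode vocabulary, operand count and operand range below are illustrative assumptions; the actual values used by AlphaDev are specified in the Supplementary Information.

```python
# Sketch of the per-instruction encoding in (b): the opcode and each operand
# are one-hot encoded and the encodings are concatenated before being fed to
# the Transformer Encoder. All sizes here are illustrative, not AlphaDev's.
OPCODES = ["mov", "cmp", "cmovl", "cmovg"]  # assumed toy vocabulary
OPERAND_VALUES = 8                          # assumed registers/locations per operand

def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v

def encode_instruction(opcode, operands):
    """Concatenate one-hot encodings of the opcode and each operand."""
    enc = one_hot(OPCODES.index(opcode), len(OPCODES))
    for operand in operands:
        enc = enc + one_hot(operand, OPERAND_VALUES)
    return enc

# A two-operand instruction such as "cmp R0, R1" becomes a vector of
# length 4 + 2 * 8 = 20 with exactly three entries set to 1.
vec = encode_instruction("cmp", [0, 1])
```

Each encoded instruction would then be embedded and passed through the Transformer Encoder; that part is omitted here.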
Extended Data Fig. 2 | An example sorting network43. (a) The horizontal lines are called wires and the vertical lines are called comparators. (b) An initially unsorted sequence of values is input into the sorting network on the left-hand side. At various stages two wires encounter a comparator. If the value at the top of the comparator is smaller than the value at the bottom of the comparator, the numbers switch wires. An optimal sorting network places comparators in specific positions so as to sort any sequence of unsorted values using the minimum number of comparators.
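A comparator can be modelled as a conditional swap, so the behaviour described in this caption translates directly into code. The sketch below uses the convention that the smaller value ends up on the first wire and hard-codes a textbook optimal three-wire network; neither the swap convention nor the comparator positions are taken from the figure.

```python
from itertools import permutations

# A sorting network is a fixed list of comparators (wire pairs). It is
# data-oblivious: the same comparisons are made for every input, which is
# what makes such networks attractive for branchless implementations.
SORT3_NETWORK = [(0, 1), (0, 2), (1, 2)]  # a standard optimal 3-wire network

def apply_network(values, network):
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:          # comparator: route the smaller value to wire i
            v[i], v[j] = v[j], v[i]
    return v

# Three comparators suffice to sort every permutation of three values.
assert all(apply_network(p, SORT3_NETWORK) == [1, 2, 3]
           for p in permutations([1, 2, 3]))
```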
Extended Data Fig. 3 | Hypothesis for improved exploration using AlphaDev. (a) A 2D t-SNE51 projection indicating the regions explored by AlphaDev (blue) compared to AlphaDev-S. (b) The same 2D t-SNE projection as in (a) with algorithm correctness superimposed onto each point, from incorrect programs (purple) to correct programs (yellow). As seen in the figure, AlphaDev-S struggles to move out of local optima, whereas AlphaDev is able to explore from the space of incorrect programs to the space of correct programs.
Extended Data Table 1 | Additional Assembly instructions
This table contains a list of additional x86 assembly instructions, using AT&T syntax, and their corresponding descriptions.
Extended Data Table 2 | Comparison of AlphaDev and AlphaDev-S for fixed sort
(a) Presents the shortest programs found by each approach. Note that AlphaDev-S-CS is
unable to discover a sorting function when training from scratch. AlphaDev-S-WS, which is
initialized with a near-optimal program, is able to match the performance of AlphaDev, which
discovers the optimal programs from scratch. (b) Indicates the number of programs explored
by each approach to find the optimal solution. Note that AlphaDev-S-CS explores orders
of magnitude more programs for each sort algorithm. For sort 3 and sort 5 AlphaDev-S-WS
explores orders of magnitude more programs than AlphaDev to find the optimal solution.
(c) The approximate wall clock time to generate the shortest program for each sort length.
AlphaDev-S-WS is more computationally efficient than AlphaDev for branchless sort. However, as will be shown in Extended Data Table 3, when branching is introduced, AlphaDev
outperforms AlphaDev-S-WS, which tends to get stuck in locally optimal solutions.
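The branching versus branchless distinction above is worth making concrete: a branchless sort replaces data-dependent jumps with conditional-move-style operations. The sketch below models this in Python, with min/max standing in for a compare instruction followed by conditional moves; it illustrates the idea and is not the AlphaDev-generated program.

```python
from itertools import permutations

def compare_exchange(a, b):
    # min/max model a branchless compare-exchange: no jump is taken on the
    # comparison outcome, unlike an if/else in a branching sort.
    return min(a, b), max(a, b)

def sort3_branchless(a, b, c):
    """Sort three values with a fixed sequence of three compare-exchanges."""
    b, c = compare_exchange(b, c)
    a, c = compare_exchange(a, c)
    a, b = compare_exchange(a, b)
    return a, b, c

# The same instruction sequence sorts every input ordering.
assert all(sort3_branchless(*p) == (1, 2, 3) for p in permutations((1, 2, 3)))
```

Because the sequence of operations is fixed, such code avoids branch mispredictions, which is one reason the branchless variants in this table are compared separately.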
Extended Data Table 3 | Comparison of AlphaDev and AlphaDev-S on variable sort
(a) Presents the latency results for the programs discovered by each approach. The reported latency corresponds to the 5th percentile of latencies measured across 100 machines. The ± [Lower,
Upper] reports the lower and upper confidence intervals respectively. In this setting, AlphaDev optimizes directly for real, measured latency. Note that AlphaDev outperforms each approach
and AlphaDev-S-CS is unable to find a solution in each case. (b) In the variable sort setting, both AlphaDev-S variants explore orders of magnitude more programs compared to AlphaDev.
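A note on the statistic in (a): reporting a low percentile of latencies measured across many machines suppresses noise from background interference, since contention only ever makes a measurement slower. The sketch below shows a nearest-rank percentile estimator over synthetic timings; both the data and the exact estimator are illustrative, not the paper's benchmarking harness.

```python
import random

def percentile(samples, q):
    """Nearest-rank q-th percentile of a list of measurements."""
    s = sorted(samples)
    k = max(0, int(round(len(s) * q / 100.0)) - 1)
    return s[k]

# Synthetic per-machine latencies in nanoseconds: a base cost plus
# machine-specific interference that only ever adds time.
random.seed(0)
latencies = [240.0 + random.expovariate(1 / 25.0) for _ in range(100)]

p5 = percentile(latencies, 5)    # robust low-noise latency, as in the table
p50 = percentile(latencies, 50)  # the median is inflated by interference
```

Confidence intervals such as the ± [Lower, Upper] values in the table could then be obtained by resampling over the per-machine measurements.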