DeepMind - Faster Sorting Algorithms Discovered Using Deep Reinforcement Learning
https://doi.org/10.1038/s41586-023-06004-9

Received: 25 July 2022 | Accepted: 23 March 2023 | Published online: 7 June 2023 | Open access

Daniel J. Mankowitz1,3 ✉, Andrea Michi1,3, Anton Zhernov1,3, Marco Gelmi1,3, Marco Selvi1,3, Cosmin Paduraru1,3, Edouard Leurent1,3, Shariq Iqbal1, Jean-Baptiste Lespiau1, Alex Ahern1, Thomas Köppe1, Kevin Millikin1, Stephen Gaffney1, Sophie Elster1, Jackson Broshear1, Chris Gamble1, Kieran Milan1, Robert Tung1, Minjae Hwang2, Taylan Cemgil1, Mohammadamin Barekatain1, Yujia Li1, Amol Mandhane1, Thomas Hubert1, Julian Schrittwieser1, Demis Hassabis1, Pushmeet Kohli1, Martin Riedmiller1, Oriol Vinyals1 & David Silver1
Fundamental algorithms such as sorting or hashing are used trillions of times on any
given day1. As demand for computation grows, it has become critical for these
algorithms to be as performant as possible. Whereas remarkable progress has been
achieved in the past2, making further improvements on the efficiency of these
routines has proved challenging for both human scientists and computational
approaches. Here we show how artificial intelligence can go beyond the current state
of the art by discovering hitherto unknown routines. To realize this, we formulated the
task of finding a better sorting routine as a single-player game. We then trained a new
deep reinforcement learning agent, AlphaDev, to play this game. AlphaDev
discovered small sorting algorithms from scratch that outperformed previously
known human benchmarks. These algorithms have been integrated into the LLVM
standard C++ sort library3. This change to the sort library replaces a component with an algorithm that was automatically discovered using reinforcement learning. We also present results in additional domains, showcasing the generality of the approach.
Human intuition and know-how have been crucial in improving algorithms. However, many algorithms have reached a stage whereby human experts have not been able to optimize them further, leading to an ever-growing computational bottleneck. The work in classical program synthesis literature, spanning many decades, aims to generate correct programs and/or optimize programs using proxies for latency. These include enumerative search techniques4–7 and stochastic search5,6,8–10 as well as the more recent trend of using deep learning in program synthesis for generating correct programs11–16. Using deep reinforcement learning (DRL), we can take this a step further by generating correct and performant algorithms by optimizing for actual measured latency at the CPU instruction level, by more efficiently searching and considering the space of correct and fast programs compared to previous work.

One of the fundamental questions in computer science is how to sort a sequence17–20. This is taught in elementary computer science classes around the world21,22 and is used ubiquitously by a vast range of applications23–25. Decades of computer science research have focused on discovering and optimizing sorting algorithms26–28. A key component of practical solutions is a small sort over a short sequence of elements; this algorithm is called repeatedly when sorting large arrays that use divide-and-conquer approaches29. In this work, we focus on two types of small sort algorithm: (1) the fixed sort and (2) the variable sort. Fixed sort algorithms sort sequences of a fixed length (for example, sort 3 can only sort sequences of length 3), whereas variable sort algorithms can sort a sequence of varying size (for example, variable sort 5 can sort sequences ranging from one to five elements).

We formulate the problem of discovering new, efficient sorting algorithms as a single-player game that we refer to as AssemblyGame. In this game, the player selects a series of low-level CPU instructions, which we refer to as assembly instructions30, to combine to yield a new and efficient sorting algorithm. This is challenging as the player needs to consider the combinatorial space of assembly instructions to yield an algorithm that is both provably correct and fast. The hardness of the AssemblyGame arises not only from the size of the search space, which is similar to extremely challenging games such as chess (10^120 games)31 and Go (10^700 games)32, but also from the nature of the reward function. A single incorrect instruction in the AssemblyGame can potentially invalidate the entire algorithm, making exploration in this space of games incredibly challenging.

To play the game, we introduce AlphaDev, a learning agent that is trained to search for correct and efficient algorithms. This agent is comprised of two core components, namely (1) a learning algorithm and (2) a representation function. The AlphaDev learning algorithm can incorporate both DRL as well as stochastic search optimization algorithms to play AssemblyGame. The primary learning algorithm in AlphaDev is an extension of AlphaZero33, a well-known DRL algorithm, in which a neural network is trained to guide a search to solve AssemblyGame.
1DeepMind, London, UK. 2Google, Mountain View, CA, USA. 3These authors contributed equally: Daniel J. Mankowitz, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, Cosmin Paduraru, Edouard Leurent. ✉e-mail: [email protected]
Fig. 1 | The relationship between C++ and assembly programs. a, A C++ implementation of a variable sort 2 function that sorts any input sequence of up to two
elements. b, The C++ implementation in a is compiled to this equivalent low-level assembly representation.
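Since the figure itself is not reproduced here, the following is a minimal C++ sketch of what a variable sort 2 function of the kind described in Fig. 1a might look like; the function name and exact form are assumptions, not the libc++ code.

```cpp
#include <cstddef>
#include <utility>  // std::swap

// Illustrative sketch only: sorts an input buffer of up to two elements in
// place, as described for Fig. 1a. The real implementation may differ.
void variable_sort_2(int* buffer, size_t length) {
  if (length != 2) {
    return;  // sequences of length 0 or 1 are already sorted
  }
  if (buffer[0] > buffer[1]) {
    std::swap(buffer[0], buffer[1]);
  }
}
```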
The representation function is interchangeable and captures the underlying structure of assembly programs. The primary AlphaDev representation is based on Transformers34.

Using AlphaDev, we have discovered fixed and variable sort algorithms from scratch that are both new and more efficient than the state-of-the-art human benchmarks. The fixed sort solutions for sort 3, sort 4 and sort 5 discovered by AlphaDev have been integrated into the standard sort function in the LLVM standard C++ library3. This library is used by several million users including universities and numerous international companies35. In addition, we analyse the new algorithm discoveries, compare AlphaDev to stochastic search optimization approaches and apply AlphaDev to further domains to showcase the generality of the approach.

Representing algorithms as low-level CPU instructions

When compiling algorithms to machine code from a high level language such as C++ (for example, the sorting function in Fig. 1a), the algorithm is first compiled into assembly (Fig. 1b). The assembler then converts the assembly program into executable machine code. In this work, we optimize algorithms at the assembly level30. In a typical assembly program, the values are copied from memory into registers, manipulated between registers and then written back to memory. The set of assembly instructions supported depends on the processor architecture. For the purposes of this work, we focus on a subset of assembly instructions supported by the x86 processor architecture using the AT&T syntax36. Each instruction is of the format Opcode⟨OperandA, OperandB⟩. An example instruction is mov<A,B>, which is defined as move a value from source (A) to destination (B). Further instruction definitions such as compare (cmp<A,B>), conditional move (cmovX<A,B>) and jump (jX<A>) can be found in Extended Data Table 1. In the example in Fig. 1b, %eax, %ecx, %edx, %edi correspond to four different register locations and (%rsi), 4(%rsi) correspond to two different memory locations. The symbol $2 is a placeholder for a constant value, which corresponds to the length of the vector in this example. We use the terms assembly program and assembly algorithm interchangeably in this work. This is because AlphaDev builds an assembly program from scratch, from an initially unordered set of instructions, each time it plays AssemblyGame, defining a new and efficient algorithm.

DRL for discovering faster algorithms

In this section, we formulate optimizing algorithms at the CPU instruction level as a reinforcement learning (RL) problem37, in which the environment is modelled as a single-player game that we refer to as AssemblyGame. Each state in this game is defined as a vector St = ⟨Pt, Zt⟩ where Pt is a representation of the algorithm generated thus far in the game and Zt represents the state of memory and registers after executing the current algorithm on a set of predefined inputs. As seen in Fig. 2a, at timestep t, the player receives the current state St and executes an action at. This involves appending a legal assembly instruction (for example, mov<A,B>) to the current algorithm generated thus far. A reward rt is received that comprises both a measure of algorithm correctness and latency. Algorithm correctness (Fig. 2b) involves inputting a set of N test sequences into the current algorithm Pt to generate N outputs. These outputs are then compared to the expected outputs and a correctness reward rt is computed. Latency rewards can be generated by either (1) penalizing the agent for increasing the length of the algorithm (when length and latency are highly correlated), which we refer to as the algorithm length reward, or (2) measuring the actual latency of the algorithm. The game is executed for a limited number of steps, after which the game is terminated. Winning the game corresponds to generating a correct, low-latency algorithm using assembly instructions. Losing the game corresponds to generating an incorrect algorithm or a correct but inefficient algorithm.

Table 1 | AlphaDev performance when optimizing for algorithm length and latency

(a) Algorithm    AlphaDev (length)    Human benchmarks (length)
Sort 3           17                   18
Sort 4           28                   28
Sort 5           42                   46
VarSort3         21                   33
VarSort4         37                   66
VarSort5         63                   115
VarInt           27                   31

(b) Algorithm    AlphaDev latency ± (lower, upper)    Human benchmarks latency ± (lower, upper)
VarSort3         236,498 ± (235,898, 236,887)         246,040 ± (245,331, 246,470)
VarSort4         279,339 ± (278,791, 279,851)         294,963 ± (294,514, 295,618)
VarSort5         312,079 ± (311,515, 312,787)         331,198 ± (330,717, 331,850)
VarInt           97,184 ± (96,885, 97,847)            295,358 ± (293,923, 296,297)
Competitive      75,973 ± (75,420, 76,638)            86,056 ± (85,630, 86,913)

a, AlphaDev performance, compared to the human benchmarks, when optimizing for algorithm length. AlphaDev discovers algorithms from scratch that match or improve on the human benchmarks in each case. b, AlphaDev performance, compared to the human benchmarks, when optimizing directly for latency. In this setup, AlphaDev discovers algorithms that have significantly lower latency than the human benchmarks in each case. The confidence intervals are represented as latency ± (lower, upper), in which latency corresponds to the fifth percentile of latency measurements across 100 different machines. Lower and upper refer to the bounds of the 95% confidence interval for this percentile.
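As a simplified illustration of the latency statistic just described (the fifth percentile of measurements across machines), the sketch below computes the percentile with a nearest-rank rule; the paper's exact estimator and confidence-interval procedure are described in its Methods, so this should be read as an assumption-laden approximation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified sketch: the latency reported in Table 1b is the fifth percentile
// of latency measurements gathered across many machines. This uses a
// nearest-rank rule; it is not the exact estimator used in the paper.
double fifth_percentile(std::vector<double> latencies) {
  if (latencies.empty()) return 0.0;  // no measurements available
  std::sort(latencies.begin(), latencies.end());
  const size_t rank = static_cast<size_t>(0.05 * (latencies.size() - 1));
  return latencies[rank];
}
```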
Fig. 2 | The AssemblyGame and algorithm correctness computation. a, The AssemblyGame is played by AlphaDev, which receives as input the current assembly algorithm generated thus far St and plays the game by selecting an action to execute. In this example, the action is a mov<Register0,Memory1> assembly instruction, which is appended to the current algorithm. The agent receives a reward that is a function of the algorithm's correctness, discussed in b, as well as the algorithm's latency. The game is won by the player discovering a low latency, correct algorithm. b, The program correctness and latency computations are used to compute the reward rt. In this example, test sequences are input to the algorithm; for example, in the case of sorting three elements, test inputs comprise all sequences of unsorted elements of length 3. For each sequence, the algorithm output is compared to the expected output (in the case of sorting, the expected output is the sorted elements). In this example, the output D′ does not match the expected output B′ and the algorithm is therefore incorrect.
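To make the correctness computation in Fig. 2b concrete, here is a minimal sketch (not the AlphaDev code) that runs a candidate sort-3 program on every ordering of three values and scores the fraction it sorts correctly.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Minimal sketch of the Fig. 2b correctness check for sort 3: run the
// candidate on all orderings of three values and compare each output with
// the sorted sequence. A reward of 1.0 means the program is fully correct.
double correctness_reward(
    const std::function<std::vector<int>(const std::vector<int>&)>& candidate) {
  std::vector<int> values = {1, 2, 3};  // start sorted so next_permutation covers all 6 orderings
  int correct = 0;
  int total = 0;
  do {
    std::vector<int> expected = values;
    std::sort(expected.begin(), expected.end());  // expected (sorted) output
    if (candidate(values) == expected) {
      ++correct;
    }
    ++total;
  } while (std::next_permutation(values.begin(), values.end()));
  return static_cast<double>(correct) / total;
}
```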
We refer to the agent that plays this single-player game as AlphaDev. The agent's primary learning algorithm is an extension of the AlphaZero agent32 and guides a Monte Carlo tree search (MCTS) planning procedure using a deep neural network33,38. The input to the neural network is the state St and the output is a policy and value prediction. The policy prediction is a distribution over actions and the value function is a prediction of the cumulative returns R that the agent should expect to receive from the current state St. During a game, the agent receives as input the current state St. The agent then executes an MCTS procedure and uses this to select the next action to take. The generated games are then used to update the network's parameters, enabling the agent to learn.

It is critical that AlphaDev has a representation39,40 capable of representing complex algorithmic structures to efficiently explore the space of instructions. To achieve this, we introduce the AlphaDev representation network (Extended Data Fig. 1a). This network comprises two components, namely (1) a transformer encoder network that provides the agent with a representation of the algorithm structure, and (2) the CPU state encoder network that helps the agent predict how the algorithm affects the dynamics of memory and registers. The CPU state encoder network comprises a multilayer perceptron that receives as input the state of each register and memory location for a given set of inputs. These networks each output embeddings that are combined to yield the AlphaDev state representation.

Transformer encoder

Transformers are natural text encoders and have had much success with language models recently14,34,41. As such, this motivated us to adapt the standard transformer to model assembly instructions. We developed and incorporated a transformer encoder, our adaptation of the MultiQuery transformer encoder42, into the AlphaDev representation network to represent the assembly instructions. Each assembly instruction's Opcode and corresponding Operands are converted to one-hot encodings and concatenated to form the raw input sequence. This is fed through a multilayer transformer encoder, which maps it to corresponding embedding vectors (see Extended Data Fig. 1b for an illustration).

Latency value functions

Latency is an important reward signal that is used to guide the agent in discovering performant algorithms. To better estimate latency, we implemented a dual value function setup, whereby AlphaDev has two value function heads: one predicting algorithm correctness and the second predicting algorithm latency. The latency head is used to directly predict the latency of a given program by using the program's actual computed latency as a Monte Carlo target for AlphaDev during training. This dual-head approach achieved substantially better results than the vanilla, single head value function setup when optimizing for real latency.

Results

Discovering faster sort algorithms

We trained the AlphaDev agent from scratch to generate a range of fixed sort and variable sort algorithms that are both correct and achieve lower latency than the state-of-the-art human benchmarks.

Fixed sorting algorithms

We considered three fundamental algorithms: sort 3, sort 4 and sort 5. The state-of-the-art human benchmarks for these algorithms are sorting networks43 as they generate efficient, conditional branchless assembly code. This means that all instructions are executed sequentially and there is no branching involved. Improving on these algorithms is challenging as they are already highly optimized. As seen in Table 1a, AlphaDev is able to find algorithms with fewer instructions than the human benchmarks for sort 3 and sort 5 and matches the state-of-the-art performance on sort 4. These shorter algorithms do indeed lead to lower latency as the algorithm length and latency are correlated for the conditional branchless case; see Appendix B in Supplementary Information for more details. We also explored scaling to slightly larger sorts using a variant of AlphaDev. We managed to save three instructions on sort 6, two instructions on sort 7 and one instruction on sort 8, which provides a promising basis for future work. See Appendix C in Supplementary Information for an overview of the approach.

Variable sorting algorithms

We considered three variable sorting algorithms: VarSort3, VarSort4 and VarSort5. The human benchmark in each case is defined as an algorithm that, for a given input length, calls the corresponding sorting network. In this case, branching is required, which greatly increases the complexity of the problem as the agent needs to (1) determine how many subalgorithms it needs to construct and (2) build the body of the main algorithm in parallel.
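As a rough C++ rendering of the variable sort human benchmark just described (hypothetical helper names; the benchmarks themselves are branchless assembly sorting networks), VarSort3 simply branches on the runtime length and calls the fixed sorting network of that size.

```cpp
#include <algorithm>  // std::min, std::max
#include <cstddef>

// Branchless compare-exchange on two adjacent elements (illustrative).
inline void sort2(int* v) {
  int a = v[0], b = v[1];
  v[0] = std::min(a, b);
  v[1] = std::max(a, b);
}

// Three-element sorting network: compare-exchange (B,C), (A,B), (B,C).
inline void sort3(int* v) {
  sort2(v + 1);
  sort2(v);
  sort2(v + 1);
}

// Sketch of the VarSort3 human benchmark described above: dispatch on the
// input length and call the fixed sorting network of the matching size.
void var_sort_3(int* v, size_t length) {
  if (length == 2) {
    sort2(v);
  } else if (length == 3) {
    sort3(v);
  }
  // lengths 0 and 1 are already sorted
}
```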
Fig. 3 | Sorting networks and algorithmic improvements discovered by AlphaDev. a, An optimal classic sorting network for three inputs. The circled comparators have been improved by AlphaDev. See the AlphaDev swap move for more details. b,c, The assembly pseudocode before applying the AlphaDev swap move (b) and after applying the AlphaDev swap move (c), resulting in the removal of a single instruction. d, An optimal classic sorting network comparator configuration that has been improved by AlphaDev. See the AlphaDev copy move for more details. e,f, The assembly pseudocode before applying the AlphaDev copy move (e) and after applying the AlphaDev copy move (f), resulting in the removal of a single instruction.
The agent may also need to call subalgorithms from other subalgorithms. In this case, optimizing for length leads to significantly shorter algorithms compared to the human benchmarks as seen in Table 1a. However, owing to the complexities introduced by branching, latency and length are not always correlated; see Supplementary Information for more details. As such, we implemented a procedure that measures the actual latency of the programs by taking the fifth percentile of latency measurements across 100 different machines, with computed confidence intervals44, and optimize this metric. See Methods for the full benchmarking setup. When optimizing for latency, the agent improves significantly on the human benchmarks in each case as seen in Table 1b.

New algorithm discoveries

The solutions discovered by AlphaDev include new and exciting algorithmic discoveries that lead to more efficient performance. In the fixed sort setting, we found that AlphaDev discovered two interesting sequences of instructions that, when applied to a sorting network algorithm, reduce the algorithm by one assembly instruction each time. We refer to each sequence of instructions as (1) the AlphaDev swap move and (2) the AlphaDev copy move respectively.

AlphaDev swap move

Figure 3a presents an optimal sorting network for three elements (see Methods for an overview of sorting networks). We will explain how AlphaDev has improved the circled network segment. There are many variants of this structure that are found in sorting networks of various sizes, and the same argument applies in each case. The circled part of the network (last two comparators) can be seen as a sequence of instructions that takes an input sequence ⟨A, B, C⟩ and transforms each input as shown in Table 2a (left). However, a comparator on wires B and C precedes this operator and therefore input sequences where B ≤ C are guaranteed. This means that it is enough to compute min(A, B) as the first output instead of min(A, B, C) as shown in Table 2a (right). The pseudocode difference between Fig. 3b,c demonstrates how the AlphaDev swap move saves one instruction each time it is applied.
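A plain C++ rendering of the Table 2a transformation may help make the saving concrete; this is purely illustrative (AlphaDev operates on assembly, as in Fig. 3b,c, and the function names here are invented).

```cpp
#include <algorithm>

// Illustrative rendering of Table 2a. A comparator on wires B and C precedes
// this fragment, so b <= c is guaranteed on entry. The original operator
// computes min(a, b, c) for the first output; the AlphaDev swap move computes
// min(a, b) instead, which is equal under the precondition and saves one
// assembly instruction.
void sort3_tail_original(int& a, int& b, int& c) {
  const int A = a, B = b, C = c;
  a = std::min({A, B, C});
  b = std::max(std::min(A, C), B);
  c = std::max(A, C);
}

void sort3_tail_swap_move(int& a, int& b, int& c) {
  const int A = a, B = b, C = c;
  a = std::min(A, B);  // min(A, B, C) == min(A, B) because B <= C
  b = std::max(std::min(A, C), B);
  c = std::max(A, C);
}
```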
AlphaDev copy move

Figure 3d presents a sorting network configuration, consisting of three comparators, that is applied across four wires. This configuration is found in a sort 8 sorting network and corresponds to an operator taking four inputs ⟨A, B, C, D⟩ and transforming them into four outputs as seen in Table 2b (on the left).
Fig. 4 | Fundamentally different algorithms discovered by AlphaDev. a, A flow diagram of the variable sort 4 (VarSort4) human benchmark algorithm. In this algorithm, a sequence of unsorted numbers are input into the algorithm. If the sequence length is four, three or two numbers, then the corresponding sort 4, sort 3 or sort 2 sorting network is called that sorts the resulting sequence. The result is then returned and output by the function. b, The VarSort4 algorithm discovered by AlphaDev. This algorithm also receives sequences of length four, three or two numbers as input. In this case, if the length is two, then it calls the sort 2 sorting network and returns. If the length is three then it calls sort 3 to sort the first three numbers and returns. If, however, the length is greater than three, then it calls sort 3, followed by a simplified sort 4 routine that sorts the remaining unsorted number. It is this part of the routine that results in significant latency savings.
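The control flow in Fig. 4b can be sketched in C++ roughly as follows; this is illustrative only (AlphaDev's actual output is assembly), and sort2/sort3 stand for the sorting-network helpers sketched earlier.

```cpp
#include <cstddef>

void sort2(int* v);  // two-element sorting network (sketched earlier)
void sort3(int* v);  // three-element sorting network (sketched earlier)

// Rough C++ rendering of the VarSort4 structure in Fig. 4b: for more than
// three elements, sort the first three with sort 3, then a simplified
// "sort 4" places the one remaining unsorted element.
void var_sort_4_alphadev_style(int* v, size_t length) {
  if (length < 2) return;
  if (length == 2) { sort2(v); return; }
  sort3(v);                       // first three elements are now sorted
  if (length == 3) return;
  int x = v[3];                   // insert the remaining element into place
  size_t i = 3;
  while (i > 0 && v[i - 1] > x) {
    v[i] = v[i - 1];
    --i;
  }
  v[i] = x;
}
```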
One can show that as part of sort 8, the input that flows into the operator satisfies the following inequality: D ≥ min(A, C). This means that the operator can be improved by applying the AlphaDev copy move that is defined in Table 2b (on the right), resulting in one instruction less than the original operator. The code difference between the original operator and the code after applying the AlphaDev copy move is visualized in Fig. 3e,f, respectively.

Table 2 | Analysis of the AlphaDev swap and copy moves

(a) Input    Original output          AlphaDev swap move
A            min(A, B, C)             min(A, B)
B            max(min(A, C), B)        max(min(A, C), B)
C            max(A, C)                max(A, C)

(b) Input    Original output          AlphaDev copy move
A            min(A, B, C, D)          min(A, B, C, D)
B            max(B, min(A, C, D))     max(B, min(A, C))
C            max(C, min(A, D))        max(C, min(A, D))
D            max(A, D)                max(A, D)

a, Left shows the transformation applied to inputs A, B and C in a classic sorting network when applying the circled operator in Fig. 3a. Right shows the AlphaDev swap move transformation applied in place of the circled operator. Note the new transformation in bold that saves a single instruction each time it is applied. b, Left shows the transformation applied to inputs A, B, C and D according to the sorting network configuration in Fig. 3d. Right shows the AlphaDev copy move transformation applied to this sorting network configuration. The transformation in bold indicates the change made by the copy move, saving an instruction each time it is applied.

New variable sort algorithms

The VarSort4 algorithm discovered by AlphaDev is particularly interesting. The flow diagram for the human benchmark algorithm and AlphaDev can be seen in Fig. 4a,b, respectively. The human benchmark algorithm determines the length of the input vector, and then calls the corresponding sorting network to sort the elements. The AlphaDev solution has a completely different approach as seen in Fig. 4b. If the length of the input vector is strictly greater than 2, then sort 3 is immediately called, resulting in the first three elements being sorted. If the vector is greater than three elements, then a simplified sort 4 algorithm is called that sorts the remaining unsorted elements in the input vector. It is this simplified part of the routine that yields significant gains in terms of algorithmic length and latency.

Stochastic search optimization approaches

It is important to understand the advantages and limitations of RL compared to other approaches for program optimization. As such, we implemented a state-of-the-art stochastic superoptimization approach8, adapted it to the sort setting and used it as the learning algorithm in AlphaDev. We refer to this variant as AlphaDev-S (see Methods for more details). We run this algorithm with at least the same amount of resources and wall-clock time as AlphaDev. AlphaDev-S requires a prohibitive amount of time to optimize directly for latency as latency needs to be computed after every mutation. As such, AlphaDev-S optimizes for a latency proxy, namely algorithm length and, then, at the end of training, we search through all correct programs generated by AlphaDev-S and benchmark each one to find the lowest latency solution. In general, we find that AlphaDev consistently outperforms AlphaDev-S when learning from scratch without previous knowledge. In addition, as the size of the program increases, AlphaDev explores orders of magnitude fewer programs (12 million programs in the worst case) compared to AlphaDev-S (31 trillion programs in the worst case). This may be because AlphaDev is able to better explore the space of algorithms compared to the breadth-first stochastic search procedure that gets stuck more easily into local optima; see Methods for an overview of this exploration hypothesis. In addition, AlphaDev never evaluates latency during search as it uses the latency value function predictions and, because of this, only needs to compute actual measured latency on less than 0.002% of generated programs. When incorporating previous knowledge into AlphaDev-S, such as warm starting the learning algorithm with a near-optimal solution, AlphaDev-S is more computationally efficient for sort 3, sort 4 and sort 5 (branchless assembly algorithms) and also generates competitive low-latency algorithms to that of AlphaDev in each case. However, for algorithms that require branching (if–else statements), in which algorithm length
Extended Data Fig. 1 | The AlphaDev representation network architecture. (a) The AlphaDev representation network comprises a Transformer Encoder network that receives as input the assembly algorithm generated thus far. It also contains a CPU State Encoder network that receives as input the current state of memory and registers. The exact architecture and hyperparameters can be found in the Supplementary Information, Appendix A. (b) Before inputting instructions into the Transformer Encoder network, each program instruction's opcode and operands are converted to one-hot encodings and concatenated. The resulting encoding is then fed into the Transformer Encoder network.
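A minimal sketch of that encoding step, with made-up vocabulary sizes and layout (the real ones are given in Supplementary Information, Appendix A):

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of Extended Data Fig. 1b: the instruction's opcode and
// its two operands are each one-hot encoded and the vectors are concatenated.
// Vocabulary sizes here are arbitrary placeholders.
std::vector<float> encode_instruction(size_t opcode, size_t operand_a, size_t operand_b,
                                      size_t num_opcodes = 16, size_t num_operands = 32) {
  std::vector<float> encoding(num_opcodes + 2 * num_operands, 0.0f);
  encoding[opcode] = 1.0f;                                   // one-hot opcode
  encoding[num_opcodes + operand_a] = 1.0f;                  // one-hot operand A
  encoding[num_opcodes + num_operands + operand_b] = 1.0f;   // one-hot operand B
  return encoding;
}
```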
Extended Data Fig. 2 | An example sorting network43. (a) The horizontal lines are called wires and the vertical lines are called comparators. (b) An initially unsorted sequence of values are input into the sorting network on the left hand side. At various stages two wires encounter a comparator. If the value at the top of the comparator is smaller than the value at the bottom of the comparator, the numbers switch wires. An optimal sorting network places comparators in specific positions so as to sort any sequence of unsorted values using the minimum number of comparators.
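For reference, a sorting network can be executed by applying its comparators in order; a minimal sketch under one common compare-exchange convention (the figure's own wire convention is the one described in the caption above):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative only: apply a sorting network, given as an ordered list of
// comparator wire pairs, to an array. For three wires, the comparators
// {1,2}, {0,1}, {1,2} sort any input in ascending order.
void apply_sorting_network(int* v,
                           const std::vector<std::pair<size_t, size_t>>& comparators) {
  for (const auto& c : comparators) {
    if (v[c.first] > v[c.second]) {
      std::swap(v[c.first], v[c.second]);  // exchange the values on the two wires
    }
  }
}
```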
Extended Data Fig. 3 | Hypothesis for improved exploration using AlphaDev. (a) A 2D t-SNE51 projection indicating the regions explored by AlphaDev (blue) compared to AlphaDev-S. (b) The same 2D t-SNE projection as in (a) with algorithm correctness superimposed onto each point, from incorrect programs (purple) to correct programs (yellow). As seen in the figure, AlphaDev-S struggles to move out of local optima whereas AlphaDev is able to explore from the space of incorrect programs to the space of correct programs.
Extended Data Table 1 | Additional Assembly instructions
This table contains a list of additional x86 assembly instructions using AT&T syntax and their corresponding descriptions.
Extended Data Table 2 | Comparison of AlphaDev and AlphaDev-S for fixed sort
(a) Presents the shortest programs found by each approach. Note that AlphaDev-S-CS is
unable to discover a sorting function when training from scratch. AlphaDev-S-WS, which is
initialized with a near-optimal program, is able to match the performance of AlphaDev, which
discovers the optimal programs from scratch. (b) Indicates the number of programs explored
by each approach to find the optimal solution. Note that AlphaDev-S-CS explores orders
of magnitude more programs for each sort algorithm. For sort 3 and sort 5 AlphaDev-S-WS
explores orders of magnitude more programs than AlphaDev to find the optimal solution.
(c) The approximate wall clock time to generate the shortest program for each sort length.
AlphaDev-S-WS is more computationally efficient than AlphaDev for branchless sort. How-
ever, as will be shown in Extended Data Table 3, when branching is introduced, AlphaDev
outperforms AlphaDev-S-WS, which tends to get stuck in locally optimal solutions.
Extended Data Table 3 | Comparison of AlphaDev and AlphaDev-S on variable sort
(a) Presents the latency results for the programs discovered by each approach. The reported latency corresponds to the 5th percentile of latencies measured across 100 machines. The ± [Lower,
Upper] reports the lower and upper confidence intervals respectively. In this setting, AlphaDev optimizes directly for real, measured latency. Note that AlphaDev outperforms each approach
and AlphaDev-S-CS is unable to find a solution in each case. (b) In the variable sort setting, both AlphaDev-S variants explore orders of magnitude more programs compared to AlphaDev.