Reinforcement Learning For Combinatorial Optimization: A Survey

Keywords: reinforcement learning, operations research, combinatorial optimization, value-based methods, policy-based methods

Many traditional algorithms for solving combinatorial optimization problems involve using hand-crafted heuristics that sequentially construct a solution. Such heuristics are designed by domain experts and may often be suboptimal due to the hard nature of the problems. Reinforcement learning (RL) proposes a good alternative to automate the search of these heuristics by training an agent in a supervised or self-supervised manner. In this survey, we explore the recent advancements of applying RL frameworks to hard combinatorial problems. Our survey provides the necessary background for the operations research and machine learning communities and showcases the works that are moving the field forward. We juxtapose recently proposed RL methods, laying out the timeline of the improvements for each problem, and we make a comparison with traditional algorithms, indicating that RL models can become a promising direction for solving combinatorial problems.
1. Introduction
Optimization problems are concerned with finding an optimal configuration or "value" among different possibilities, and they naturally fall into one of two buckets: problems with continuous variables and problems with discrete variables. For example, finding a solution to a convex programming problem is a continuous optimization problem, while finding the shortest path among all paths in a graph is a discrete optimization problem. Sometimes the line between the two cannot be drawn that easily. For example, the linear programming task in the continuous space can be regarded as a discrete combinatorial problem because its solution lies in a finite set of vertices of the convex polytope, as demonstrated by Dantzig's algorithm [Dantzig and Thapa, 1997]. Conventionally, optimization problems in the discrete space are called combinatorial optimization (CO) problems and typically have different types of solutions compared to the ones in the continuous space. One can formulate a CO problem as follows:
Definition 1. Let 𝑉 be a set of elements and 𝑓 ∶ 𝑉 → ℝ be a cost function. A combinatorial optimization problem aims to find the optimal value of the function 𝑓 and any corresponding optimal element that achieves that optimal value on the domain 𝑉.
Typically the set 𝑉 is finite, in which case there is a global optimum, and, hence, a trivial solution exists for any
CO problem by comparing the values of all elements 𝑣 ∈ 𝑉. Note that Definition 1 also includes the case of decision problems, when the solution is binary (or, more generally, multi-class), by associating a higher cost with the wrong answer than with the right one. One common example of a combinatorial problem is the Travelling Salesman Problem (TSP). The goal is to provide the shortest route that visits each vertex and returns to the initial endpoint, or, in other words, to find a Hamiltonian circuit 𝐻 with minimal length in a fully-connected weighted graph. In this case, the set of elements is defined by all Hamiltonian circuits, i.e. 𝑉 = {all Hamiltonian circuits}, and the cost associated with each Hamiltonian circuit is the sum of the weights 𝑤(𝑒) of the edges 𝑒 on the circuit, i.e. 𝑓(𝐻) = ∑_{𝑒∈𝐻} 𝑤(𝑒). Another
example of a CO problem is the Mixed-Integer Linear Program (MILP), for which the objective is to minimize 𝑐^⊤𝑥 for a given vector 𝑐 ∈ ℝ^𝑑 such that the vector 𝑥 ∈ ℤ^𝑑 satisfies the constraints 𝐴𝑥 ≤ 𝑏 for given parameters 𝐴 and 𝑏.
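As a small illustration of Definition 1 and the TSP example above, the following sketch evaluates the cost function 𝑓(𝐻) for a tour and enumerates the finite domain 𝑉 by brute force; the helper names are illustrative.

```python
import itertools

def tour_length(tour, w):
    """Cost f(H): sum of the edge weights along a Hamiltonian circuit.
    tour is a permutation of 0..n-1, w is an n x n weight matrix."""
    n = len(tour)
    return sum(w[tour[i]][tour[(i + 1) % n]] for i in range(n))

def brute_force_tsp(w):
    """The trivial solution from Definition 1: compare all elements of V."""
    return min(itertools.permutations(range(len(w))),
               key=lambda t: tour_length(t, w))
```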
Many CO problems are NP-hard and do not have an efficient polynomial-time solution. As a result, many algorithms that solve these problems either approximately or heuristically have been designed. One of the emerging trends of recent years is to solve CO problems by training a machine learning (ML) algorithm. For example, we can
train an ML algorithm on a dataset of already solved TSP instances to decide which node to move to next for new TSP instances. A particular branch of ML that we consider in this survey is called reinforcement learning (RL), which, for a given CO problem, defines an environment and an agent that acts in the environment, constructing a solution.
In order to apply RL to CO, the problem is modeled as a sequential decision-making process, where the agent
interacts with the environment by performing a sequence of actions in order to find a solution. Markov Decision
Process (MDP) provides a widely used mathematical framework for modeling this type of problems [Bellman, 1957].
Definition 2. MDP can be defined as a tuple 𝑀 = ⟨𝑆, 𝐴, 𝑅, 𝑇 , 𝛾, 𝐻⟩, where
• 𝑆 - state space, 𝐬_𝑡 ∈ 𝑆. The state space for combinatorial optimization problems in this survey is typically defined in one of two ways. One group of approaches constructs solutions incrementally and defines it as the set of partial solutions to the problem (e.g. a partially constructed path for TSP). The other group of methods starts with a suboptimal solution to a problem and iteratively improves it (e.g. a suboptimal tour for TSP).
• 𝐴 - action space, 𝐚_𝑡 ∈ 𝐴. Actions represent either an addition to a partial solution or a change to a complete solution (e.g. changing the order of nodes in a tour for TSP);
• 𝑅 - reward function, a mapping from states and actions to real numbers, 𝑅 ∶ 𝑆 × 𝐴 → ℝ. Rewards indicate how the action chosen in a particular state improves or worsens a solution to the problem (e.g. the tour length for TSP);
• 𝑇 - transition function 𝑇(𝐬_{𝑡+1} | 𝐬_𝑡, 𝐚_𝑡) that governs the transition dynamics from one state to another in response to an action. In the combinatorial optimization setting, the transition dynamics are usually deterministic and known in advance;
• 𝛾 - scalar discount factor, 0 < 𝛾 ≤ 1. Discount factor encourages the agent to account more for short-term
rewards;
• 𝐻 - horizon, which defines the length of an episode, where an episode is a sequence {𝑠_0, 𝑎_0, 𝑠_1, 𝑎_1, 𝑠_2, … , 𝑠_𝐻}.
For methods that construct solutions incrementally, the episode length is defined naturally by the number of actions performed until a solution is found. For iterative methods, some artificial stopping criterion is introduced.
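To make Definition 2 concrete, here is a minimal, hypothetical environment for the constructive TSP formulation (partial tours as states, the next node as the action, and the negative added edge weight as the reward); it is an illustrative sketch rather than the environment of any particular surveyed paper.

```python
class TSPConstructionEnv:
    """Toy MDP for constructive TSP: states are partial tours, actions append
    an unvisited node, and rewards are negative incremental edge weights."""

    def __init__(self, w):
        self.w = w                      # n x n weight matrix (list of lists)
        self.n = len(w)

    def reset(self):
        self.tour = [0]                 # start the partial solution at node 0
        return tuple(self.tour)         # initial state s_0

    def step(self, node):
        prev = self.tour[-1]
        self.tour.append(node)
        reward = -self.w[prev][node]    # negative length of the added edge
        done = len(self.tour) == self.n # episode ends when the tour is complete
        if done:                        # close the cycle on the final step
            reward -= self.w[node][self.tour[0]]
        return tuple(self.tour), reward, done
```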
The goal of an agent acting in a Markov Decision Process is to find a policy function 𝜋(𝑠) that maps states to actions. Solving the MDP means finding the optimal policy that maximizes the expected cumulative discounted sum of rewards:

𝜋^∗ = argmax_𝜋 𝔼[ ∑_{𝑡=0}^{𝐻} 𝛾^𝑡 𝑅(𝑠_𝑡, 𝑎_𝑡) ].    (1)
Once an MDP has been defined for a CO problem, we need to decide how the agent will search for the optimal policy 𝜋^∗. Broadly, there are two types of RL algorithms:
• Value-based methods first compute the action-value function 𝑄^𝜋(𝑠, 𝑎) as the expected reward of a policy 𝜋 given a state 𝑠 and taking an action 𝑎. The agent's policy then corresponds to picking the action that maximizes 𝑄^𝜋(𝑠, 𝑎) for a given state. The main difference between value-based approaches lies in how 𝑄^𝜋(𝑠, 𝑎) is estimated accurately and efficiently.
• Policy-based methods directly model the agent's policy as a parametric function 𝜋_𝜃(𝑠). By collecting the previous decisions that the agent made in the environment, also known as experience, we can optimize the parameters 𝜃 by maximizing the objective in Equation (1). The main difference between policy-based methods lies in the optimization approaches for finding the function 𝜋_𝜃(𝑠) that maximizes the expected sum of rewards.
As can be seen, RL algorithms depend on functions that take as input the states of the MDP and output action values or actions. States represent some information about the problem, such as the given graph or the current tour of TSP, while Q-values or actions are numbers. Therefore, an RL algorithm has to include an encoder, i.e., a function that encodes a state into a numerical vector. Many encoders have been proposed for CO problems, including recurrent neural networks, graph neural networks, attention-based networks, and multi-layer perceptrons.
Figure 1: Solving a CO problem with the RL approach requires formulating MDP. The environment is defined by a
particular instance of CO problem (e.g. Max-Cut problem). States are encoded with a neural network model (e.g.
every node has a vector representation encoded by a graph neural network). The agent is driven by an RL algorithm
(e.g. Monte-Carlo Tree Search) and makes decisions that move the environment to the next state (e.g. removing a
vertex from a solution set).
To sum up, the pipeline for solving a CO problem with RL is presented in Figure 1. A CO problem is first reformulated in terms of an MDP, i.e., we define the states, actions, and rewards for a given problem. We then define an encoder of the states, i.e. a parametric function that encodes the input states and outputs a numerical vector (Q-values or probabilities of each action). The next step is the actual RL algorithm that determines how the agent learns the parameters of the encoder and makes decisions for a given MDP. After the agent has selected an action, the environment moves to a new state and the agent receives a reward for the action it has taken. The process then repeats from the new state within the allocated time budget. Once the parameters of the model have been trained, the agent is capable of searching for solutions to unseen instances of the problem.
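Schematically, the Figure 1 loop can be written as follows; `env`, `encoder`, and `agent` are placeholder objects that stand for the three components of the pipeline.

```python
def run_episode(env, encoder, agent, train=True):
    """One pass of the Figure 1 pipeline: encode the state, act, observe the reward."""
    state, done, trajectory = env.reset(), False, []
    while not done:
        h = encoder(state)                   # state -> numerical vector
        action = agent.act(h)                # Q-values or action probabilities -> action
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    if train:
        agent.update(trajectory)             # the RL algorithm adjusts the parameters
    return trajectory
```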
Our work is motivated by the recent success in the application of the techniques and methods of the RL field
to solve CO problems. Although many practical combinatorial optimization problems can, in principle, be solved by reinforcement learning algorithms, and relevant literature exists in the operations research community, we will focus on RL approaches for CO problems. This survey covers the most recent papers that show how reinforcement learning algorithms can be applied to reformulate and solve some of the canonical optimization problems, such as the Travelling Salesman Problem (TSP), the Maximum Cut (Max-Cut) problem, the Maximum Independent Set (MIS), the Minimum Vertex Cover (MVC), and the Bin Packing Problem (BPP).
Related work. Some recent surveys also describe the intersection of machine learning and combinatorial optimization. A comprehensive survey by [Bengio et al., 2020] summarizes the approaches that solve CO problems from the perspective of general ML, and the authors discuss possible ways of combining ML heuristics with existing off-the-shelf solvers. Moreover, the work by [Zhou et al., 2018], which is devoted to the description and possible applications of GNNs, describes the progress on formulating CO problems from the GNN perspective in one of its sections. Finally, the more recent surveys by [Vesselinova et al., 2020] and [Guo et al., 2019] describe the latest ML approaches to solving CO tasks, in addition to possible applications of such methods. We note that our survey is complementary to the existing ones, as we focus on RL approaches, provide the necessary background and a classification of the RL models, and make a comparison between different RL methods and existing solutions.
Paper organization. The remainder of this survey is organized as follows. In section 2, we provide a necessary
background including the formulation of CO problems, different encoders, and RL algorithms that are used for solving
CO with RL. In section 3 we provide a classification of the existing RL-CO methods based on the popular design
choices such as the type of RL algorithm. In section 4 we describe the recent RL approaches for the specific CO
problems, providing the details about the formulated MDPs as well as their influence on other works. In section 5
we make a comparison between the RL-CO works and the existing traditional approaches. We conclude and provide
future directions in section 6.
2. Background
In this section, we provide definitions of combinatorial problems, state-of-the-art algorithms and heuristics that
solve these problems. We also describe machine learning models that encode states of CO problems for an RL agent.
Finally, we categorize popular RL algorithms that have been employed recently for solving CO problems.
where 𝐜 ∈ ℝ𝑛 is the objective coefficient vector, 𝐀 ∈ ℝ𝑚×𝑛 is the constraint coefficient matrix, 𝐛 ∈ ℝ𝑚 is the
constraint vector, and 𝑝 ≤ 𝑛 is the number of integer variables.
Next, we provide formulations of the combinatorial optimization problems, their time complexity, and the state-
of-the-art algorithms for solving them.
Definition 4 (Traveling Salesman Problem (TSP)). Given a complete weighted graph 𝐺 = (𝑉 , 𝐸), find a tour of
minimum total weight, i.e. a cycle of minimum length that visits each node of the graph exactly once.
TSP is a canonical example of a combinatorial optimization problem, which has found applications in planning,
data clustering, genome sequencing, etc. [Applegate et al., 2006]. TSP problem is NP-hard [Papadimitriou and Stei-
glitz, 1998], and many exact, heuristic, and approximation algorithms have been developed, in order to solve it. The
best known exact algorithm is the Held–Karp algorithm [Held and Karp, 1962]. Published in 1962, it solves the
problem in time 𝑂(𝑛^2 2^𝑛), which has not been improved in the general setting since then. TSP can be formulated as
a MILP instance [Dantzig et al., 1954; Miller et al., 1960], which allows one to apply MILP solvers, such as Gurobi
[Gurobi Optimization, 2020], in order to find the exact or approximate solutions to TSP. Among them, Concorde
[Applegate et al., 2006] is a specialized TSP solver that uses a combination of cutting-plane algorithms with a branch-
and-bound approach. Similarly, an extension of the Lin-Kernighan-Helsgaun TSP solver (LKH3) [Helsgaun, 2017],
which improves the Lin-Kernighan algorithm [Lin and Kernighan, 1973], is a tour improvement method that iteratively
decides which edges to rewire to decrease the tour length. More generic solvers that avoid local optima exist such as
OR-Tools [Perron and Furnon, 2019] that tackle vehicle routing problems through local search algorithms and meta-
heuristics. In addition to solvers, many heuristic algorithms have been developed, such as Christofides-Serdyukov al-
gorithm [Christofides, 1976; van Bevern and Slugina, 2020], the Lin-Kernighan-Helsgaun heuristic [Helsgaun, 2000],
2-OPT local search [Mersmann et al., 2012]. [Applegate et al., 2006] provides an extensive overview of various
approaches to TSP.
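As a reference point for the improvement heuristics discussed later, here is a compact sketch of 2-OPT local search; it reuses the hypothetical `tour_length` helper from the earlier sketch.

```python
def two_opt(tour, w):
    """2-OPT: keep reversing a segment of the tour while that shortens it."""
    best = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(candidate, w) < tour_length(best, w):
                    best, improved = candidate, True
    return best
```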
Definition 5 (Maximum Cut Problem (Max-Cut)). Given a graph 𝐺 = (𝑉, 𝐸), find a subset of vertices 𝑆 ⊂ 𝑉 that maximizes the cut 𝐶(𝑆, 𝐺) = ∑_{𝑖∈𝑆, 𝑗∈𝑉⧵𝑆} 𝑤_{𝑖𝑗}, where 𝑤_{𝑖𝑗} ∈ 𝑊 is the weight of the edge connecting vertices 𝑖 and 𝑗.
Max-Cut solutions have found numerous applications in real-life problems including protein folding [Perdomo-
Ortiz et al., 2012], financial portfolio management [Elsokkary et al., 2017], and finding the ground state of the Ising
Hamiltonian in physics [Barahona, 1982]. Max-Cut is an NP-complete problem [Karp, 1972], and, hence, does not
have a known polynomial-time algorithm. Approximation algorithms exist for Max-Cut, including deterministic 0.5-
approximation [Mitzenmacher and Upfal, 2005; Gonzalez, 2007] and randomized 0.878-approximation [Goemans
and Williamson, 1995]. Industrial solvers can be used to find a solution by applying branch-and-bound routines. In particular, the Max-Cut problem can be transformed into a quadratic unconstrained binary optimization problem and solved by CPLEX [CPLEX, 1987], which completes within an hour for graph instances with hundreds of vertices [Barrett et al., 2020]. For larger instances, several heuristics based on the simulated annealing technique have been proposed that can scale to graphs with thousands of vertices [Yamamoto et al., 2017; Tiunov et al., 2019; Leleu et al., 2019].
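The deterministic 0.5-approximation mentioned above can be sketched as a simple local search over vertex flips; assume nonnegative edge weights stored in a dict `w` keyed by a `frozenset` of the two endpoints (names and data layout are illustrative).

```python
def local_search_maxcut(nodes, w):
    """0.5-approximation: flip a vertex to the other side of the cut
    whenever that strictly increases the cut value."""
    S = set()                                         # one side of the cut

    def flip_gain(v):
        same = sum(w.get(frozenset((v, u)), 0) for u in nodes
                   if u != v and (u in S) == (v in S))
        other = sum(w.get(frozenset((v, u)), 0) for u in nodes
                    if u != v and (u in S) != (v in S))
        return same - other                           # cut edges gained minus lost

    improved = True
    while improved:
        improved = False
        for v in nodes:
            if flip_gain(v) > 0:
                S.symmetric_difference_update({v})    # move v to the other side
                improved = True
    return S
```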
Definition 6 (Bin Packing Problem (BPP)). Given a set 𝐼 of items, a size 𝑠(𝑖) ∈ ℤ⁺ for each 𝑖 ∈ 𝐼, and a positive integer bin capacity 𝐵, find a partition of 𝐼 into disjoint sets 𝐼_1, … , 𝐼_𝐾 such that the sum of the sizes of the items in each 𝐼_𝑗 is less than or equal to 𝐵 and 𝐾 is as small as possible.
There are other variants of BPP such as 2D, 3D packing, packing with various surface area, packing by weights,
and others [Wu et al., 2010]. This CO problem has found its applications in many domains such as resource optimiza-
tion, logistics, and circuit design [Kellerer et al., 2004]. BPP is an NP-complete problem with many approximation
algorithms proposed in the literature. First-fit decreasing (FFD) and best-fit decreasing (BFD) are two simple approximation algorithms that first sort the items in decreasing order of their sizes and then assign each item to the first (for FFD) or the fullest (for BFD) bin that it fits into. Both FFD and BFD run in 𝑂(𝑛 log 𝑛) time and have an 11∕9 asymptotic performance guarantee [Korte et al., 2012]. Among exact approaches, one of the first attempts has been the
Martello-Toth algorithm that works under the branch-and-bound paradigm [Martello and Toth, 1990a,b]. In addition,
several recent improvements have been proposed [Schreiber and Korf, 2013; Korf, 2003] which can run on instances
with hundreds of items. Alternatively, BPP can be formulated as a MILP instance [Wu et al., 2010; Chen et al., 1995]
and solved using standard MILP solvers such as Gurobi [Gurobi Optimization, 2020] or CPLEX [CPLEX, 1987].
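A minimal sketch of the first-fit decreasing heuristic described above (best-fit decreasing differs only in choosing the fullest feasible bin instead of the first one):

```python
def first_fit_decreasing(sizes, capacity):
    """FFD: sort items by decreasing size, put each into the first bin it fits."""
    bins = []                                # each bin is a list of item sizes
    for s in sorted(sizes, reverse=True):
        for b in bins:
            if sum(b) + s <= capacity:
                b.append(s)
                break
        else:
            bins.append([s])                 # no existing bin fits: open a new one
    return bins
```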
Definition 7 (Minimum Vertex Cover (MVC)). Given a graph 𝐺 = (𝑉, 𝐸), find a subset of nodes 𝑆 ⊂ 𝑉 such that every edge is covered, i.e. (𝑢, 𝑣) ∈ 𝐸 ⟹ 𝑢 ∈ 𝑆 or 𝑣 ∈ 𝑆, and |𝑆| is minimized.
Vertex cover optimization is a fundamental problem with applications to computational biochemistry [Lancia et al., 2001] and computer network security [Filiol et al., 2007]. There is a naïve approximation algorithm with a factor of 2, which works by adding both endpoints of an arbitrary edge to the solution and then removing these endpoints from the graph [Papadimitriou and Steiglitz, 1998]. A better approximation algorithm with a factor of 2 − Θ(1∕√log |𝑉|) is known [Karakostas, 2009], although it has been shown that MVC cannot be approximated within a factor of √2 − 𝜀 for any 𝜀 > 0 [Dinur and Safra, 2005; Subhash et al., 2018]. The problem can be formulated as an integer linear program (ILP) by minimizing ∑_{𝑣∈𝑉} 𝑐(𝑣)𝑥_𝑣, where 𝑥_𝑣 ∈ {0, 1} denotes whether a node 𝑣 with weight 𝑐(𝑣) is in the solution set, subject to 𝑥_𝑢 + 𝑥_𝑣 ≥ 1 for every edge (𝑢, 𝑣) ∈ 𝐸. Solvers such as CPLEX [CPLEX, 1987] or Gurobi [Gurobi Optimization, 2020] can be used to solve the ILP formulations with hundreds of thousands of nodes [Akiba and Iwata, 2016].
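The naïve factor-2 approximation described above admits a very short sketch; `edges` is an illustrative list of vertex pairs.

```python
def vertex_cover_2approx(edges):
    """2-approximation: take both endpoints of an arbitrary uncovered edge,
    skip edges that are already covered, and repeat."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:    # edge still uncovered
            cover.update((u, v))
    return cover
```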
Definition 8 (Maximum Independent Set (MIS)). Given a graph 𝐺 = (𝑉, 𝐸), find a subset of vertices 𝑆 ⊂ 𝑉 such that no two vertices in 𝑆 are connected by an edge of 𝐸, and |𝑆| is maximized.
MIS is a popular CO problem with applications in classification theory, molecular docking, recommendations, and
more [Feo et al., 1994; Gardiner et al., 2000; Agrawal et al., 1996]. As such the approaches of finding the solutions
for this problem have received a lot of attention from the academic community. It is easy to see that the complement
of an independent set in a graph 𝐺 is a vertex cover in 𝐺 and a clique in a complement graph 𝐺, ̄ hence, the solutions
to a minimum vertex cover in 𝐺 or a maximum clique in 𝐺 can be applied to solve the MIS problem. The running
time of the brute-force algorithm is 𝑂(𝑛2 2𝑛 ), which has been improved by [Tarjan and Trojanowski, 1977] to 𝑂(2𝑛∕3 ),
and recently to the best known bound 𝑂(1.1996𝑛 ) with polynomial space [Xiao and Nagamochi, 2017]. To cope with
medium and large instances of MIS several local search and evolutionary algorithms have been proposed. The local
search algorithms maintain a solution set, which is iteratively updated by adding and removing nodes that improve
the current objective value [Andrade et al., 2008; Katayama et al., 2005; Hansen et al., 2004; Pullan and Hoos, 2006].
In contrast, the evolutionary algorithms maintain several independent sets at the current iterations which are then
merged or pruned based on some fitness criteria [Lamm et al., 2015; Borisovsky and Zavolovskaya, 2003; Back and
Khuri, 1994]. Hybrid approaches exist that combine the evolutionary algorithms with the local search, capable to
solve instances with hundreds of thousands of vertices [Lamm et al., 2016].
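For contrast with the exact and evolutionary algorithms above, a common greedy minimum-degree baseline for MIS can be sketched as follows (not one of the cited methods; `adjacency` maps each vertex to its set of neighbours).

```python
def greedy_mis(adjacency):
    """Greedy baseline: repeatedly pick a minimum-degree vertex and
    remove it together with its neighbours."""
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    independent = set()
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))        # minimum remaining degree
        independent.add(v)
        removed = adj[v] | {v}
        adj = {u: nbrs - removed for u, nbrs in adj.items() if u not in removed}
    return independent
```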
In order to approach the outlined problems with reinforcement learning, we must represent the graphs, involved
in the problems, as vectors that can be further provided as an input to a machine learning algorithm. Next, we discuss
different approaches for learning the representations of these problems.
2.2. Encoders
In order to process the input structure 𝑆 (e.g. graphs) of CO problems, we must present a mapping from 𝑆 to a
𝑑-dimensional space ℝ𝑑 . We call such a mapping an encoder as it encodes the original input space. The encoders
vary depending on the particular type of the space 𝑆 but there are some common architectures that researchers have
developed over the last years to solve CO problems.
The first frequently used architecture is a recurrent neural network (RNN). RNNs can operate on sequential data, encoding each element of the sequence into a vector. In particular, the RNN is composed of a block of parameters that takes as input the current element of the sequence and the previous output of the RNN block, and outputs a vector that is passed to the next element of the sequence. For example, in the case of TSP, one can encode a tour of TSP by applying an RNN to the current node (e.g. initially represented by a constant 𝑑-dimensional vector) and the output of the RNN on the previous node of the tour. One can stack multiple blocks of RNNs together, making the neural network deep. Popular choices of RNN blocks are the Long Short-Term Memory (LSTM) unit [Hochreiter and Schmidhuber, 1997] and the Gated Recurrent Unit (GRU) [Cho et al., 2014], which tackle the vanishing gradient problem [Goodfellow et al., 2016].

Figure 2: The scheme for a recurrent neural network (RNN). Each box represents an encoding function. Each element in the sequence is encoded using its initial representation and the output of the model at the previous step. RNN parameters are shared across all elements of the sequence.
One of the fundamental limitations of RNN models is related to the modeling of the long-range dependencies:
as the model takes the output of the last time-step it may “forget” the information from the previous elements of
the sequence. Attention models fix this by forming a connection not just to the last input element, but to all input
elements. Hence, the output of the attention model depends on the current element of the sequence and all previous
elements of the sequence. In particular, similarity scores (e.g. dot product) are computed between the input element
and each of the previous elements, and these scores are used to determine the weights of the importance of each of the
previous elements to the current element. Attention models have recently achieved superior performance on language modeling tasks (e.g. language translation) [Vaswani et al., 2017] and have been applied to solving CO problems (e.g. for incrementally building a tour for TSP).
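A minimal NumPy sketch of the (scaled) dot-product similarity scoring just described; `query`, `keys`, and `values` are illustrative names for one element's representation and the representations of all elements.

```python
import numpy as np

def attention(query, keys, values):
    """Scaled dot-product attention: similarity scores between the query and
    every element become softmax weights over the elements' value vectors."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # one score per element
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                            # weighted combination
```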
Note that the attention model relies on modeling dependencies between each pair of elements in the input structure, which can be inefficient if there are only a few relevant dependencies. One simple extension of the attention model is a pointer network (PN) [Vinyals et al., 2015]. Instead of using the weights among all pairs for computation of the influence of each input element, the pointer networks use the weights to select a single input element that will be used for encoding. For example, in Figure 3 the element "A" has the highest similarity to the element "B", and, therefore, it is used for computation of the representation of element "B" (unlike the attention model, in which case the elements "C" and "D" are also used).

Figure 3: The scheme for a pointer network. Element "B" in the sequence first computes similarity scores to all other elements. Next we encode the representation of "B" using the element with maximum value ("A" in this case, dashed). This process is then repeated for other elements in the sequence.

Although these models are general enough to be applied to various spaces 𝑆 (e.g. points for TSP), many CO problems studied in this paper are associated with the
• Model-based methods focus on the environments, where transition functions are known or can be learned, and
can be utilized by the algorithm when making decisions. This group includes Monte-Carlo Tree Search (MCTS)
algorithms such as AlphaZero [Silver et al., 2016] and MuZero [Schrittwieser et al., 2019].
• Model-free methods do not rely on the availability of the transition functions of the environment and utilize
solely the experience collected by the agent.
Furthermore, model-free methods can be split into two big families of RL algorithms – policy-based and value-
based methods. This partition is motivated by the way of deriving a solution of an MDP. In the case of policy-based
methods, a policy is approximated directly, while value-based methods focus on approximating a value function, which
is a measure of the quality of the policy for some state-action pair in the given environment. Additionally, there are
RL algorithms that combine policy-based methods with value-based methods. Methods that utilize such a training procedure are called actor-critic methods [Sutton et al., 2000; Mnih et al., 2016]. The basic principle behind these algorithms is for the critic model to approximate the value function and for the actor model to approximate the policy. To do this, the actor and the critic usually rely on the policy-based and value-based RL methods mentioned above. This way, the critic provides a measure of how good the action taken by the actor has been, which allows the learnable parameters to be adjusted appropriately for the next training step.
Next, we formally describe the value-based, policy-based, and MCTS approaches and the corresponding RL
algorithms that have been used to solve CO problems.
The notation 𝑉 𝜋 here and in the following sections means that the value function 𝑉 is defined with respect to the
policy 𝜋. It is also important to note, that the value of a terminal state in the case of a finite MDP equals 0.
At the same time, it can be more convenient to think of the value function as the function depending not only on
the state but also on the action.
Definition 10. Action-value function 𝑄(𝑠, 𝑎) is the expectation of the future discounted rewards, when starting from
the state 𝑠, taking the action 𝑎 and then following some policy 𝜋:
𝑄^𝜋(𝑠, 𝑎) = 𝔼[ ∑_{𝑡=0}^{∞} 𝛾^𝑡 𝑟(𝑠_𝑡, 𝑎_𝑡) | 𝜋, 𝑠_0 = 𝑠, 𝑎_0 = 𝑎 ].    (3)
It is also clear that 𝑉^𝜋(𝑠) can be interpreted in terms of 𝑄^𝜋(𝑠, 𝑎) as 𝑉^𝜋(𝑠) = 𝔼_{𝑎∼𝜋}[𝑄^𝜋(𝑠, 𝑎)].
From the definition of a value function comes a very important recursive property, representing the relationship
between the value of the state 𝑉 𝜋 (𝑠) and the values of the possible following states 𝑉 𝜋 (𝑠′ ), which lies at the foundation
of many value-based RL methods. This property can be expressed as an equation, called the Bellman equation
[Bellman, 1952]:
𝑉^𝜋(𝑠) = 𝑟(𝑠) + 𝛾 ∑_{𝑠′} 𝑇(𝑠, 𝜋(𝑠), 𝑠′) 𝑉^𝜋(𝑠′).    (4)
The Bellman equation can be also rewritten in terms of the action-value function 𝑄𝜋 (𝑠, 𝑎) in the following way:
𝑄^𝜋(𝑠, 𝑎) = 𝑟(𝑠, 𝑎) + 𝛾 ∑_{𝑠′} 𝑇(𝑠, 𝑎, 𝑠′) max_{𝑎′} 𝑄^𝜋(𝑠′, 𝑎′).    (5)
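When the transition function is known and the state and action spaces are small, equation (5) can be solved by simple fixed-point iteration. The sketch below is an illustrative tabular version; the `states`, `actions`, `T`, and `r` arguments are hypothetical stand-ins for a concrete MDP.

```python
def q_value_iteration(states, actions, T, r, gamma=0.99, n_iters=1000):
    """Tabular fixed-point iteration of the Bellman equation (5).

    T(s, a, s2) -- transition probability, r(s, a) -- reward function.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        Q = {(s, a): r(s, a) + gamma * sum(
                 T(s, a, s2) * max(Q[(s2, a2)] for a2 in actions)
                 for s2 in states)
             for s in states for a in actions}
    return Q
```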
At the beginning of this section, we stated that the goal of all RL tasks is to find a policy that accumulates as much reward as possible. This means that one policy is better than (or equal to) another if the expected return of this policy is greater than the one achieved by the other policy: 𝜋′ ≥ 𝜋. Moreover, by the definition of a value function, we can claim that 𝜋′ ≥ 𝜋 if and only if 𝑉^{𝜋′}(𝑠) ≥ 𝑉^𝜋(𝑠) for all states 𝑠 ∈ 𝑆.
Knowing this relationship between policies, we can state that there is a policy that is better or equal to all the other
possible policies. This policy is called an optimal policy 𝜋 ∗ . Evidently, the optimality of the action-value and value
functions is closely connected to the optimality of the policy they follow. This way, the value function of an MDP is
called optimal if it is the maximum of the value functions across all policies: 𝑉^∗(𝑠) = max_𝜋 𝑉^𝜋(𝑠). Similarly, the optimal action-value function 𝑄^∗(𝑠, 𝑎) is defined as 𝑄^∗(𝑠, 𝑎) = max_𝜋 𝑄^𝜋(𝑠, 𝑎).
Given the Bellman equations (4) and (5), one can derive the optimal policy if the action-value or value functions
are known. In the case of a value function 𝑉 ∗ (𝑠), one can find optimal actions by doing the greedy one-step search:
picking the actions that correspond to the maximum value 𝑉 ∗ (𝑠) in the state 𝑠 computed by the Bellman equation (4).
On the other hand, in the case of the action-value function, a one-step search is not needed. For each state 𝑠 we can easily find the action 𝑎 that maximizes the action-value function, as we only need to compute 𝑄^∗(𝑠, 𝑎). This way, in contrast to the value function, we do not need any information about the rewards and values in the following states 𝑠′.
Therefore, in the case of value-based methods, in order to find the optimal policy, we need to find the optimal value functions. Notably, it is possible to solve the Bellman equation explicitly, i.e. to find the optimal value function, but only when the transition function is known. In practice, this is rarely the case, so we need methods that approximate the solution of the Bellman equation.
• Q-learning. One of the popular representatives of the approximate value-based methods is Q-learning [Watkins
and Dayan, 1992] and its deep variant Deep Q-learning [Mnih et al., 2015]. In Q-learning, the action-value
function 𝑄(𝑠, 𝑎) is iteratively updated by learning from the collected experiences of the current policy. It has
been shown in [Sutton, 1988, Theorem 3] that the function updated by such a rule converges to the optimal
value function.
• DQN. With the rise of Deep Learning, neural networks (NNs) have proven to achieve state-of-the-art results
on various datasets by learning useful function approximations through the high-dimensional inputs. This led
researchers to explore the potential of NNs’ approximations of the Q-functions. Deep Q-networks (DQN) [Mnih
et al., 2015] can learn the policies directly using end-to-end reinforcement learning. The network approximates
Q-values for each action depending on the current input state. In order to stabilize the training process, authors
have used the following formulation of the loss function:
𝐿(𝜃_𝑖) = 𝔼_{(𝑠,𝑎,𝑟,𝑠′)∼𝐷}[ ( 𝑟 + 𝛾 max_{𝑎′} 𝑄_{𝜃_𝑖^−}(𝑠′, 𝑎′) − 𝑄_{𝜃_𝑖}(𝑠, 𝑎) )^2 ],    (6)

where 𝐷 is a replay memory buffer used to store (𝑠, 𝑎, 𝑟, 𝑠′) transitions. Equation (6) is the mean-squared error between the current approximation of the Q-function and the bootstrapped target value 𝑟 + 𝛾 max_{𝑎′} 𝑄_{𝜃_𝑖^−}(𝑠′, 𝑎′) (a minimal sketch of this loss is given after this list).
The training of DQN has been shown to be more stable, and, consequently, DQN has been effective for many
RL problems, including RL-CO problems.
• Policy gradient. In order to optimize (1) with respect to the policy parameters 𝜃, policy gradient theorem
[Sutton et al., 2000] can be applied to estimate the gradients of the policy function in the following form:
∇_𝜃 𝐽(𝜋_𝜃) = 𝔼_{𝜋_𝜃}[ ∑_{𝑡=0}^{𝐻} ∇_𝜃 log 𝜋_𝜃(𝑎_𝑡 | 𝑠_𝑡) 𝐴̂(𝑠_𝑡, 𝑎_𝑡) ],    (7)
Although this approach introduces bias into the gradient estimates, it often reduces the variance even further. Moreover, actor-critic methods can be applied to online and continual learning, as they no longer rely on Monte-Carlo rollouts, i.e. unrolling the trajectory to a terminal state (a sketch of the corresponding surrogate loss is given after this list).
• PPO/DDPG. Further development of this group of reinforcement learning algorithms has resulted in the ap-
pearance of several more advanced methods such as Proximal Policy Optimization (PPO) [Schulman et al.,
2017], that performs policy updates with constraints in the policy space, or Deep Deterministic Policy Gradient
(DDPG) [Lillicrap et al., 2016], an actor-critic algorithm that attempts to learn a parametric state-action value
function 𝑄𝜙 (𝑠, 𝑎), corresponding to the current policy, and use it to compute the bootstrapped return estimate.
• MCTS. The algorithm follows the general procedure of Monte Carlo Tree Search (MCTS) [Browne et al.,
2012] consisting of selection, expansion, roll-out and backup steps (Figure 5). However, instead of evaluating
leaf nodes in a tree by making a rollout step, a neural network 𝑓𝜃 is used to provide a policy 𝑃 (𝑠, ∗) and state-
value estimates 𝑉 (𝑠) for the new node in the tree. The nodes in the tree refer to states 𝑠, and edges refer to
actions 𝑎. During the selection phase we start at a root state, 𝑠0 , and keep selecting next states that maximize
the upper confidence bound:
UCB(𝑠, 𝑎) = 𝑄(𝑠, 𝑎) + 𝑐 ⋅ 𝑃(𝑠, 𝑎) ⋅ √(∑_{𝑎′} 𝑁(𝑠, 𝑎′)) ∕ (1 + 𝑁(𝑠, 𝑎)),    (9)
Figure 5: Three steps of the Monte Carlo Tree Search (MCTS). Starting the simulation from the root node the select
step picks the node that maximizes the upper confidence bound. When previously unseen node is expanded, the policy
𝑃 (𝑠, ∗) and the state-value function 𝑉 (𝑠) are evaluated at this node, and the action-value 𝑄(𝑠, 𝑎) and the counter 𝑁(𝑠, 𝑎)
are initialized to 0. Then 𝑉 (𝑠) estimates are propagated back along the path of the current simulation to update 𝑄(𝑠, 𝑎)
and 𝑁(𝑠, 𝑎).
When a previously unseen node in the search tree is encountered, the policy 𝑃(𝑠, ∗) and the state-value estimate 𝑉(𝑠) are evaluated for this node. After that, the 𝑉(𝑠) estimate is propagated back along the search tree, updating the 𝑄(𝑠, 𝑎) and 𝑁(𝑠, 𝑎) values. After a number of search iterations, we select the next action from the root state according to the improved policy:
𝜋(𝑎 | 𝑠_0) = 𝑁(𝑠_0, 𝑎) ∕ ∑_{𝑎′} 𝑁(𝑠_0, 𝑎′).    (10)
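To make the value-based and policy-based updates above concrete, here is a minimal PyTorch sketch of the DQN loss in equation (6); the `q_net`/`target_net` modules and the batch layout are illustrative placeholders, not the architectures used in the surveyed papers.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean-squared TD error of equation (6) for a replay-buffer batch."""
    s, a, r, s_next, done = batch                          # tensors sampled from D
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_theta_i(s, a)
    with torch.no_grad():                                  # target uses frozen parameters
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)
```

Similarly, a hypothetical REINFORCE-style surrogate loss whose gradient matches the estimator in equation (7) when the advantage is approximated by a return minus a baseline:

```python
def reinforce_loss(log_probs, returns, baseline=0.0):
    """Policy-gradient surrogate: its gradient follows equation (7)."""
    advantages = returns - baseline                  # estimate of A_hat(s_t, a_t)
    return -(log_probs * advantages.detach()).sum()  # minimizing this ascends the objective
```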
3. Taxonomy of RL for CO
A full taxonomy of RL methods for CO can be challenging to construct because of the orthogonality of the ways in which the given works can be classified. In this section we list all the taxonomy groups that are used in this survey.
One straightforward way of dividing the RL approaches in the CO field is by the family of RL methods used to find the solution of the given problem. As shown in Figure 4, it is possible to split the RL methods either by the first level of Figure 4 (i.e. into model-based and model-free methods) or by the second level (i.e. policy-based methods, value-based methods, and methods using the Monte-Carlo approach) (section 2.3). In addition, another
division is possible by the type of encoders used for representing the states of the MDP (section 2.2). This division is
much more granular than the other ones discussed in this section, as can be seen from the works surveyed in the next
section.
Another way to aggregate the existing RL approaches is based on how RL is integrated into the given CO problem, i.e. whether the RL agent is searching for a solution to the CO problem itself or is facilitating the inner workings of existing off-the-shelf solvers.
• In principal learning an agent makes the direct decision that constitutes a part of the solution or the complete
solution of the problem and does not require the feedback from the off-the-shelf solver. For example, in TSP
the agent can be parameterized by a neural network that incrementally builds a path from a set of vertices and
then receives the reward in the form of the length of the constructed path, which is used to update the policy of
the agent.
• Alternatively, one can learn the RL agent's policy jointly with already existing solvers so that it can improve some of the metrics for a particular problem. For example, in MILP a commonly used approach is the
Branch & Bound method, which at every step selects a branching rule on the node of the tree. This can have
a significant impact on the overall size of the tree and, hence, the running time of the algorithm. A branching
rule is a heuristic that typically requires either some domain expertise or a hyperparameter tuning procedure.
However, a parameterized RL agent can learn to imitate the policy of the node selection by receiving rewards
proportional to the running time.
Another dimension along which the RL approaches can be divided is the way the solution is searched for by the learned heuristics. In this regard, methods can be divided into those learning construction heuristics and those learning improvement heuristics.
• Methods that learn construction heuristics build solutions incrementally, using the learned policy to choose each element to add to a partial solution (see the sketch after this list).
• The second group of methods starts from some arbitrary solution and learns a policy that improves it iteratively. This approach tries to address a problem commonly encountered when learning construction heuristics, namely, the need for extra procedures, such as beam search or sampling, to find a good solution.
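Schematically, the two families correspond to the following loops; `policy`, `problem`, and their methods are hypothetical placeholders.

```python
def construct_solution(policy, problem):
    """Construction heuristic: extend a partial solution one element at a time."""
    partial = []
    while not problem.is_complete(partial):
        partial.append(policy.choose_next(partial, problem))
    return partial

def improve_solution(policy, problem, solution, steps=100):
    """Improvement heuristic: start from a full solution and rewrite it iteratively."""
    best = solution
    for _ in range(steps):
        solution = policy.rewrite(solution, problem)   # e.g. a 2-opt-style move
        if problem.cost(solution) < problem.cost(best):
            best = solution
    return best
```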
4. RL for CO
In this section we survey existing RL approaches to solve CO problems that include Traveling Salesman Prob-
lem (Definition 4), Maximum Cut Problem (Definition 5), Bin Packing Problem (Definition 6), Minimum Vertex
Cover Problem (Definition 7), and Maximum Independent Set (Definition 8). These problems have received the most
attention from the research community and we juxtapose the approaches for all considered problems.
the one used by [Bello et al., 2017], except for the reward function 𝑟(𝑠, 𝑎). The reward, in this case, is defined as
the difference in the cost functions after transitioning from the state 𝑠 to the state 𝑠′ when taking some action 𝑎:
𝑟(𝑠, 𝑎) = 𝑐(ℎ(𝑠′ ), 𝐺) − 𝑐(ℎ(𝑠), 𝐺), where ℎ is the graph embedding function of the partial solutions 𝑠 and 𝑠′ , 𝐺 is
the whole graph, and 𝑐 is the cost function. Because the weighted variant of TSP is solved, the authors define the cost function 𝑐(ℎ(𝑠), 𝐺) as the negative weighted tour length. The work implements S2V [Dai et al., 2016] for encoding the partial solutions and DQN as the RL algorithm of choice for updating the network's parameters.
Another work, by [Nazari et al., 2018], motivated by [Bello et al., 2017], concentrates on solving the Vehicle Routing Problem (VRP), which is a generalization of TSP. However, the approach suggested in [Bello et al., 2017] cannot be applied directly to VRP due to the problem's dynamic nature, i.e. the demand at a node becomes zero once the node has been visited, whereas that approach embeds the sequential and static nature of the input. The authors of [Nazari et al.,
2018] extend the previous methods used for solving TSP to circumvent this problem and find the solutions to VRP
and its stochastic variant. Specifically, similarly to [Bello et al., 2017], in [Nazari et al., 2018] approach the state 𝑠
represents the embedding of the current solution as a vector of tuples, one value of which is the coordinates of the
customer’s location and the other is the customer’s demand at the current time step. An action 𝑎 is picking a node,
which the vehicle will visit next in its route. The reward is also similar to the one used for TSP: it is the negative
total route length, which is given to the agent only after all customers’ demands are satisfied, which is the terminal
state of the MDP. The authors of [Nazari et al., 2018] also suggest improving the Pointer Network used by [Bello et al., 2017]. To do that, the encoder is simplified by replacing the LSTM unit with 1-d convolutional embedding layers so that the model is invariant to the input sequence order and, consequently, able to handle the dynamic state changes. The policy is then learned with the REINFORCE algorithm for TSP and VRP, and with A3C for the stochastic VRP.
Similarly to [Nazari et al., 2018], the work by [Deudon et al., 2018] uses the same approach as the one by [Bello
et al., 2017], while changing the encoder-decoder network architecture. This way, while the MDP is the same as in
[Bello et al., 2017], instead of including the LSTM units, the GNN encoder architecture is based solely on the attention
mechanisms so that the input is encoded as a set and not as a sequence. The decoder, however, stays the same as in the
Pointer Network case. Additionally, the authors have looked into combining a solution provided by the reinforcement
learning agent with the 2-Opt heuristic [Croes, 1958], in order to further improve the inference results. The REINFORCE algorithm with a critic baseline is used to update the parameters of the described encoder-decoder network.
Parallel to [Deudon et al., 2018], inspired by the transformer architecture of [Vaswani et al., 2017], a construction
heuristic learning approach by [Kool et al., 2019] has been proposed in order to solve TSP, two variants of VRP
(Capacitated VRP and Split Delivery VRP), Orienteering Problem (OP), Prize Collecting TSP (PCTSP) and Stochastic
PCTSP (SPCTSP). In this work, the authors have implemented a similar encoder-decoder architecture as the authors
of [Deudon et al., 2018], i.e. the Transformer-like attention-based encoder, while the decoder is similar to that of the Pointer Network. However, the authors found that slightly changing the training procedure and using a simple rollout
baseline instead of the one learned by a critic yields better performance. The MDP formulation, in this case, is also
similar to the one used by [Deudon et al., 2018], and, consequently, the one by [Bello et al., 2017].
One specific construction heuristic approach has been proposed in [Emami and Ranka, 2018]. The authors have
designed a novel policy gradient method, Sinkhorn Policy Gradient (SPG), specifically for the class of combinatorial
optimization problems, which involves permutations. This approach yields a different MDP formulation. Here, in
contrast with the case when the solution is constructed sequentially, the state space consists of instances of combina-
torial problems of a particular size. The action space, in this case, consists of outputting a permutation matrix, which, applied to the original graph, produces the solution tour. The reward function is the negated sum of the Euclidean distances
between each stop along the tour. Finally, using a special Sinkhorn layer on the output of the feed-forward neural
network with GRUs to produce continuous and differentiable relaxations of permutation matrices, authors have been
able to train actor-critic algorithms similar to Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al., 2016].
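The Sinkhorn layer mentioned above can be approximated by iterated row and column normalization of a positive matrix (a standard Sinkhorn-Knopp sketch; the temperature and iteration count here are arbitrary choices, not those of the cited work).

```python
import numpy as np

def sinkhorn(logits, n_iters=20, tau=1.0):
    """Relaxation of a permutation matrix: alternately normalize the rows and
    columns of exp(logits / tau) so it approaches a doubly stochastic matrix."""
    x = np.exp(logits / tau)
    for _ in range(n_iters):
        x = x / x.sum(axis=1, keepdims=True)   # row normalization
        x = x / x.sum(axis=0, keepdims=True)   # column normalization
    return x
```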
The work by [Cappart et al., 2020] combines two approaches to solving the traveling salesman problem with time
windows, namely the RL approach and the constraint programming (CP) one, in order to learn branching strategies.
In order to encode the CO problems, the authors bring up a dynamic programming formulation, that acts as a bridge
between both techniques and can be exposed both as an MDP and a CP problem. A state 𝑠 is a vector, consisting
of three values: the set of remaining cities that still have to be visited, the last city that has been visited, and the
current time. An action 𝑎 corresponds to choosing a city. The reward 𝑟(𝑠, 𝑎) corresponds to the negative travel
time between two cities. This MDP can then be transformed into a dynamic programming model. DQN and PPO
algorithms have been trained for the MDP formulation to select the efficient branching policies for different CP search
strategies (branch-and-bound, iterative limited discrepancy search, and restart-based search), and have been used to
solve challenging CO problems.
The work by [Drori et al., 2020] differs from the previous works, which tailor their approaches to individual
problems. In contrast, this work provides a general framework for model-free reinforcement learning using a GNN
representation that adapts to different problem classes by changing the reward. This framework models problems by using the edge-to-vertex line graph and formulates them as a single-player game. The MDPs for TSP
and VRP are the same as in [Bello et al., 2017]. Instead of using a full-featured Neural MCTS, [Drori et al., 2020]
represents a policy as a GIN encoder with an attention-based decoder, learning it during the tree-search procedure.
[Lu et al., 2020] suggests learning the improvement heuristics in a hierarchical manner for capacitated VRP as a part of the joint approach. The authors have designed an intrinsic MDP, which incorporates not only the features of the current solutions but also the running history. A state 𝑠_𝑖 includes the free capacity of the route containing a customer 𝑖, its location, the location of the node 𝑖− visited before 𝑖, the location of the node 𝑖+ visited after 𝑖, the distance from 𝑖− to 𝑖, the distance from 𝑖 to 𝑖+, the distance from 𝑖− to 𝑖+, the action taken ℎ steps before, and the effect of 𝑎_{𝑡−ℎ}. The action
consists of choosing between two groups of operators, that change the current solution, for example by applying the
2-Opt heuristic, which removes two edges and reconnects their endpoints. Concretely, these two operator groups
are improvement operators, that are chosen according to a learned policy, or perturbation operators in the case of
reaching a local minima. The authors have experimented with the reward functions, and have chosen the two most
successful ones: +1∕ − 1 reward for each time the solution improves/does not give any gains, and the advantage
reward, which takes the initial solution’s total distance as the baseline, and constitutes the difference between this
baseline and the distance of the subsequent solutions as the reward at each time step. The policy is parameterized by
the Graph Attention Network and is trained with the REINFORCE algorithm.
The final work we cover in this section is by [Chen and Tian, 2019], who propose solving VRP and online job scheduling problems by learning improvement heuristics. The algorithm rewrites different
parts of the solution until convergence instead of constructing the solution in the sequential order. The state space is
represented as a set of all solutions to the problem, while the action set consists of regions, i.e. nodes in the graph, and
their corresponding rewriting rules. The reward, in this case, is the difference in the costs of the current and previous
solutions. The authors use an LSTM encoder, specific to each of the covered problems and train region-picking and
rule-picking policies jointly by applying the Q-Actor-Critic algorithm.
graphs with up to 1000-1200 nodes, achieving very good approximation ratios to exact solutions.
[Barrett et al., 2020] improved on the work of [Khalil et al., 2017] in terms of the approximation ratio as well as
the generalization by proposing an ECO-DQN algorithm. The algorithm kept the general framework of S2V-DQN but
introduced several modifications. The agent was allowed to remove vertices from the partially constructed solution
to better explore the solution space. The reward function was modified to provide a normalized incremental reward
for finding a solution better than seen in the episode so far, as well as give small rewards for finding a locally optimal
solution that had not yet been seen during the episode. In addition, there were no penalties for decreasing cut value.
The input of the state encoder was modified to account for changes in the reward structure. Since in this setting the
agent had been able to explore indefinitely, the episode length was set to 2|𝑉 |. Moreover, the authors allowed the
algorithm to start from an arbitrary state, which could be useful by combining this approach with other methods, e.g.
heuristics. This method showed better approximation ratios than S2V-DQN, as well as better generalization ability.
[Cappart et al., 2019] devised a joint approach to the Max-Cut problem by incorporating reinforcement learning into the Decision Diagrams (DD) framework [Bergman et al., 2016] to learn a constructive heuristic. The integration of reinforcement learning made it possible to provide tighter objective function bounds for the DD solution by
learning heuristics for variable ordering. They have formulated the problem as an MDP, where the state space, 𝑆,
is represented as a set of ordered sequences of selected variables along with partially constructed DDs. The action
space, 𝐴, consists of variables, that are not yet selected. The transition function, 𝑇 , adds variables to the selected
variables set and to the DD. The reward function is designed to tighten the bounds of the DD and is encoded as the
relative upper and lower bounds improvements after the addition of the variable to the set. The training was performed
on generated random graphs with the algorithm and state encoding of [Khalil et al., 2017] described above. The authors showed that their approach outperformed several ordering heuristics and generalized well to larger graph instances, but did not report any comparison to other reinforcement learning-based methods.
Another joint method proposed by [Tang et al., 2020] combined a reinforcement learning framework with the
cutting plane method. Specifically, in order to learn improvement heuristics for choosing Gomory cutting planes, which are frequently used in Branch-and-Cut solvers, an efficient MDP formulation was developed. The state space, 𝑆, includes the original linear constraints and the cuts added so far. Solving the linear relaxation produces the action space, 𝐴, of Gomory cuts that can be added to the problem. After that, the transition function, 𝑇, adds the
chosen cuts to the problem that results in a new state. The reward function is defined as a difference between the
objective function of two consecutive linear problem solutions. The policy gradient algorithm was used to select new Gomory cuts, and the state was encoded with an LSTM network (to account for a variable number of variables) along with an attention-based mechanism (to account for a variable number of constraints). The algorithm was trained on generated graph instances using evolution strategies and has been shown to improve the efficiency of the cuts, the integrality gaps, and the generalization, compared to the commonly used heuristics for choosing Gomory cuts. Also, the approach was shown to be beneficial in combination with the branching strategy in experiments with the Branch-and-Cut algorithm.
[Abe et al., 2019] proposed to use a graph neural network along with a Neural MCTS approach to learn construction heuristics. The MDP formulation defines the state space, 𝑆, as a set of partial graphs from which nodes
can be removed and colored in one of two colors representing two subsets. The action space, 𝐴, represents sets of
nodes still left in the graph and their available colors. The transition function, 𝑇 , colors the selected node of the graph
and removes it along with the adjacent edges. The neighboring nodes that are left keep a counter of how many nodes of the adjacent color have been removed. When a new node is removed, the number of the earlier
removed neighboring nodes of the opposite color is provided as the incremental reward signal, 𝑅 (the number of edges
that were included in the cut set). Several GNNs were compared as graph encoders, with GIN [Xu et al., 2018] shown to perform the best. Also, a training procedure similar to AlphaGo Zero was employed, with a modification to accommodate a numeric rather than a win/lose outcome. The experiments were performed on a vast variety of generated and real-world graphs. The extensive comparison of the method with several heuristics and with the previously described S2V-DQN [Khalil et al., 2017] showed superior performance as well as better generalization ability to larger graphs, yet the authors did not report any comparison with exact methods.
[Gu and Yang, 2020] applied the Pointer Network [Vinyals et al., 2015] along with the Actor-Critic algorithm
similar to [Bello et al., 2017] to iteratively construct a solution. The MDP formulation defines the state, 𝑆, as a
symmetric matrix, 𝑄, the values of which are the edge weights between nodes (0 for the disconnected nodes). Columns
of this matrix are fed to the Pointer Network, which sequentially outputs the actions, 𝐴, in the form of pointers to input
vectors along with a special end-of-sequence symbol "EOS". The resulting sequence of nodes separated by the "EOS"
symbol represents a solution to the problem, from which the reward is calculated. The authors conducted experiments on simulated graphs with up to 300 nodes and reported fairly good approximation ratios but, unfortunately, did not compare with previous works or known heuristics.
the proposed method achieves a smaller bin gap ratio for the problems of size up to 30 items.
[Cai et al., 2019] has taken the joint approach to solving a 1D bin packing problem by combining proximal policy
optimization (PPO) with the simulated annealing (SA) heuristic algorithm. PPO is used to learn the improvement
heuristic to build an initial starting solution for SA, which in turn, after finding a good solution in a limited number
of iterations, calculates the reward function, 𝑅, as the difference in costs between the initial and final solutions and
passes it to the PPO agent. The action space, 𝐴, is represented by a set of changes of the bins between two items,
e.g. a perturbation to the current solution. The state space, 𝑆, is described with a set of assignments of items to bins.
The work has shown that the combination of RL and the heuristics can find better solutions than these algorithms in isolation, but has not provided any comparison with known heuristics or other algorithms.
the initial graph 𝐺0 ; an action 𝑎 ∈ 𝐴 be a selection of one node of a graph in the current state; a transition function
𝑇 (𝑠, 𝑎, 𝑠′ ) be the function returning the next state, corresponding to the graph where edges covered by the action 𝑎 and
its adjacent nodes are deleted; and, finally, a reward function 𝑟(𝑠, 𝑎) be a constant and equal to 1. For encoders, the
authors proposed to apply a GIN, to account for the variable size of a state representation in a search tree. [Abe et al.,
2019] uses a model-based algorithm, namely AlphaGo Zero, to update the parameters of the GIN network.
The latest work, by [Ahn et al., 2020], modifies the MDP formulation by applying a label to each node, i.e. each node can either be included in the solution, excluded from the solution, or have the determination of its label deferred.
This way, a state 𝑠 ∈ 𝑆 becomes a vector, the size of which is equal to the number of nodes in the graph, and which
consists of the labels that have been given to each node at the current time step. The initial state 𝑠0 is a vector with all
labels set to being deferred. An action 𝑎 ∈ 𝐴 is a vector with new label assignments for the next state of only currently
deferred nodes. To maintain the independence of the solution set, the transition function 𝑇 (𝑠, 𝑎) is set to consist of two
phases: the update phase and the clean-up phase. The first phase represents the naive assignment of the labels by applying the action 𝑎, which leads to the intermediate state 𝑠̂. In the clean-up phase, the authors modify the intermediate state 𝑠̂ in such a way that the included nodes are only adjacent to the excluded ones. Finally, the reward function 𝑟(𝑠, 𝑎)
is equal to the increase in the cardinality of included vertices between the current state 𝑠′ and the previous state 𝑠.
The authors propose to use the Graph Convolutional Network encoder and PPO method with the rollback procedure
to learn the optimal Deep Auto-Deferring Policy (ADP), which outputs the improvement heuristic to solve the MIS
problem.
5. Comparison
In this section, we will partially compare the results achieved by the works presented in this survey. Concretely,
we have distinguished the two most frequently mentioned problems, namely, Travelling Salesman Problem (TSP) and
Capacitated Vehicle Routing Problem (CVRP). The average tour lengths for these problems, reported in the works [Lu
et al., 2020; Kool et al., 2019; Lodi et al., 2002; Bello et al., 2017; Emami and Ranka, 2018; Ma et al., 2020; Nazari
et al., 2018; Chen and Tian, 2019], are shown in Tables 6, 7.
The presented results have been achieved on Erdős–Rényi (ER) graphs of various sizes, namely, with 20, 50, and 100
nodes for TSP, and 10, 20, 50, and 100 nodes for CVRP. In the case of CVRP, we also specify the capacity of the
vehicle (Cap.), which varies from 10 to 50. In addition, we include the results achieved by the OR-Tools solver
[Perron and Furnon, 2019] and the LK-H heuristic algorithm [Helsgaun, 2017] as baseline solutions.
Best performing methods. It is clear from the presented tables that the best performing methods for TSP are [Kool
et al., 2019] and [Bello et al., 2017], while for CVRP it is [Lu et al., 2020]. These algorithms perform on par with the
baselines and in some cases demonstrate better results. Moreover, [Lu et al., 2020] achieves the best performance
among all the compared methods, even for instances with smaller vehicle capacities.
Focus on smaller graphs. Throughout our analysis, we have found that most of the articles focus on testing the
CO-RL methods on graphs with 20, 50, or 100 nodes. At the same time, [Ma et al., 2020] presents results for bigger
graphs with 250, 500, 750, and 1000 nodes for the TSP. This is likely because finding the optimal solution becomes
much more computationally demanding as the graph size grows, even for commercial solvers. The comparison of
the reported results with the baselines further supports this observation: for TSP, almost all of the methods outperform
OR-Tools on smaller graphs, while for bigger graphs this is no longer the case. Consequently, scaling to larger
instances can be a promising direction for further research.
Non-overlapping problems. Although many works focused on creating well-performing RL-based solvers have
emerged, the CO problems covered in these articles rarely coincide, which makes a fair comparison a much harder
task. We are convinced that further analysis should focus on unifying the results from different sources and, hence,
on identifying the most promising directions for research.
Running times. One of the main advantages of using machine learning and reinforcement learning algorithms to solve
CO problems is the considerable reduction in running time compared to metaheuristic algorithms and solvers.
However, it is still hard to compare the running times of different works, as they can vary significantly depending on
the implementation and the hardware used for experimentation. For these reasons, we do not attempt an exact
comparison of the times achieved by different RL-CO works. Still, we can note that some works, such as [Nazari et
al., 2018], [Chen and Tian, 2019], and [Lu et al., 2020], claim to have outperformed the classic heuristic algorithms.
Concretely, the authors of [Nazari et al., 2018] show that, for larger problems, their framework is faster than the
randomized heuristics, and that its running time grows more slowly with the complexity of the CO problem than that
of the Clarke-Wright [Clarke and Wright, 1964] and Sweep [Wren and Holliday, 1972] heuristics. [Chen and Tian,
2019] claim that their approach outperforms the expression simplification component of the Z3 solver [De Moura
and Bjørner, 2008] in terms of both the objective metrics and time efficiency. Finally, although the exact training
times are not given in the article, the authors of [Lu et al., 2020] note that the running time of their algorithm is much
smaller than that of LK-H. In addition, while also acknowledging the difficulty of comparing times across works,
[Kool et al., 2019] claim that their algorithm runs ten times faster than that of [Bello et al., 2017].
Table 7
The average tour lengths comparison for Capacitated VRP for ER graphs with the number of nodes 𝑁 equal to
10, 20, 50, 100. Cap. represents the capacity of the vehicle for CVRP.
Generalization to other problems. In Section 5, we have formulated one of the main problems of the current state of
the RL-CO field, which is the limited number of experimental comparisons. Indeed, the group of CO problems is
vast, and the current approaches often have to be implemented for a concrete set of problems. The RL field, however,
has already made some steps towards generalizing learned policies to unseen problems (for example, [Groshev et al.,
2018]). In the case of CO, these unseen problems can be instances of the same problem of a different size, problem
instances with different distributions, or even problems from another group of CO problems. We believe that,
although this direction is challenging, it is extremely promising for the future development of the RL-CO field.
Improving the solution quality. Many of the works reviewed in this survey have demonstrated superior performance
compared to commercial solvers. Moreover, some of them have matched the quality of the optimal solutions or of
the solutions produced by heuristic algorithms. However, these results hold only for the less complex versions of
the CO problems, for example, instances with smaller numbers of nodes. This leaves room for further improvement
of the current algorithms in terms of objective quality. One possible way forward is a tighter integration of classical
CO algorithms with RL approaches, for example, by using imitation learning as in [Hester et al., 2018].
Filling the gaps. One of the ways to classify RL-CO approaches, which we have mentioned previously, is by
grouping them into joint and constructive methods. Tables 1, 2, 3, 4, and 5 contain these labels for each of the
reviewed articles, and from them we can identify some unexplored approaches for each of the CO problems. For
instance, Table 3 shows that no algorithm that is both joint and constructive has been published for the Bin Packing
problem. The same logic applies to the Minimum Vertex Cover problem (Table 3), for which there are no approaches
of the joint-constructive or joint-nonconstructive type. Exploring these algorithmic possibilities can provide not only
new methods but also useful insights into the effectiveness of these approaches.
In conclusion, we see the field of RL for CO problems as a very promising direction for CO research because of its
effectiveness in terms of solution quality, its capacity to outperform existing algorithms, and its substantial running
time gains compared to classical heuristic approaches.
References
K. Abe, Z. Xu, I. Sato, and M. Sugiyama. Solving np-hard problems on graphs with extended alphago zero, 2019.
R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A. I. Verkamo, et al. Fast discovery of association rules. Advances in knowledge discovery and
data mining, 12(1):307–328, 1996.
S. Ahn, Y. Seo, and J. Shin. Deep auto-deferring policy for combinatorial optimization, 2020. URL https://openreview.net/forum?
id=Hkexw1BtDr.
T. Akiba and Y. Iwata. Branch-and-reduce exponential/fpt algorithms in practice: A case study of vertex cover. Theoretical Computer Science,
609:211–225, 2016.
D. V. Andrade, M. G. Resende, and R. F. Werneck. Fast local search for the maximum independent set problem. In International Workshop on
Experimental and Efficient Algorithms, pages 220–234. Springer, 2008.
T. Anthony, Z. Tian, and D. Barber. Thinking fast and slow with deep learning and tree search. In Proceedings of the 31st International Con-
ference on Neural Information Processing Systems, NIPS’17, page 5366–5376, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN
9781510860964.
D. L. Applegate, R. E. Bixby, V. Chvatal, and W. J. Cook. The traveling salesman problem: a computational study. Princeton university press,
2006.
T. Back and S. Khuri. An evolutionary heuristic for the maximum independent set problem. In Proceedings of the First IEEE Conference on
Evolutionary Computation. IEEE World Congress on Computational Intelligence, pages 531–535. IEEE, 1994.
F. Barahona. On the computational complexity of ising spin glass models. Journal of Physics A: Mathematical and General, 15(10):3241, 1982.
ISSN 13616447. doi: 10.1088/0305-4470/15/10/028.
T. D. Barrett, W. R. Clements, J. N. Foerster, and A. Lvovsky. Exploratory combinatorial optimization with reinforcement learning. In Proceedings
of the 34th National Conference on Artificial Intelligence, AAAI, pages 3243–3250, 2020. doi: 10.1609/aaai.v34i04.5723.
R. Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8),
1952. ISSN 0027-8424. doi: 10.1073/pnas.38.8.716.
R. Bellman. A markovian decision process. Indiana Univ. Math. J., 6:679–684, 1957. ISSN 0022-2518.
I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. In Workshop Proceedings
of the 5th International Conference on Learning Representations, ICLR, 2017.
Y. Bengio, A. Lodi, and A. Prouvost. Machine learning for combinatorial optimization: A methodological tour d’horizon. European Journal of
Operational Research, Aug. 2020. ISSN 03772217. doi: 10.1016/j.ejor.2020.07.063.
D. Bergman, A. A. Cire, W.-J. v. Hoeve, and J. Hooker. Decision Diagrams for Optimization. Springer Publishing Company, Incorporated, 1st
edition, 2016. ISBN 3319428470.
P. A. Borisovsky and M. S. Zavolovskaya. Experimental comparison of two evolutionary algorithms for the independent set problem. In Workshops
on Applications of Evolutionary Computation, pages 154–164. Springer, 2003.
C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. Cowling, P. Rohlfshagen, S. Tavener, D. Perez Liebana, S. Samothrakis, and S. Colton. A
survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 2012. ISSN 1943068X. doi:
10.1109/TCIAIG.2012.2186810.
Q. Cai, W. Hang, A. Mirhoseini, G. Tucker, J. Wang, and W. Wei. Reinforcement learning driven heuristic optimization. In Proceedings of
Workshop on Deep Reinforcement Learning for Knowledge Discovery, DRL4KDD, 2019.
Q. Cappart, E. Goutierre, D. Bergman, and L.-M. Rousseau. Improving optimization bounds using machine learning: Decision diagrams meet
deep reinforcement learning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, AAAI, 2019. ISBN 9781577358091. doi:
10.1609/aaai.v33i01.33011443.
Q. Cappart, T. Moisan, L.-M. Rousseau, I. Prémont-Schwarz, and A. Cire. Combining reinforcement learning and constraint programming for
combinatorial optimization. arXiv preprint arXiv:2006.01610, 2020.
C. Chen, S.-M. Lee, and Q. Shen. An analytical model for the container loading problem. European Journal of Operational Research, 80(1):68–76,
1995.
X. Chen and Y. Tian. Learning to perform local rewriting for combinatorial optimization. In Proceedings of the 33rd Conference on Advances in
Neural Information Processing Systems, NeurIPS, pages 6281–6292, 2019.
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn
encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
N. Christofides. Worst-case analysis of a new heuristic for the travelling salesman problem. Technical report, Carnegie-Mellon Univ Pittsburgh Pa
Management Sciences Research Group, 1976.
G. Clarke and J. W. Wright. Scheduling of Vehicles from a Central Depot to a Number of Delivery Points. Operations Research, 12(4):568–581,
1964. ISSN 0030-364X. doi: 10.1287/opre.12.4.568.
CPLEX. IBM ILOG CPLEX optimization studio. Version, 12:1987–2018, 1987.
A. Croes. A method for solving traveling salesman problems. Operations Research, 6:791–812, 1958. ISSN 0030-364X. doi: 10.1287/opre.6.6.791.
H. Dai, B. Dai, and L. Song. Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd International
Conference on Machine Learning, ICML, 2016. ISBN 9781510829008.
G. Dantzig, R. Fulkerson, and S. Johnson. Solution of a large-scale traveling-salesman problem. Journal of the operations research society of
America, 2(4):393–410, 1954.
G. B. Dantzig and M. N. Thapa. Linear programming 1: Introduction. Springer International Publishing, New York, NY, 1997. doi: https:
//doi.org/10.1007/b97672.
L. De Moura and N. Bjørner. Z3: An efficient smt solver. In International conference on Tools and Algorithms for the Construction and Analysis
of Systems, pages 337–340. Springer, 2008. ISBN 3540787992. doi: 10.1007/978-3-540-78800-3_24.
M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L.-M. Rousseau. Learning heuristics for the TSP by policy gradient. In Lecture Notes in Com-
puter Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018. ISBN 9783319930305.
doi: 10.1007/978-3-319-93031-2_12.
I. Dinur and S. Safra. On the hardness of approximating minimum vertex cover. Annals of Mathematics, pages 439–485, 2005. ISSN 0003486X.
doi: 10.4007/annals.2005.162.439.
I. Drori, A. Kharkar, W. R. Sickinger, B. Kates, Q. Ma, S. Ge, E. Dolev, B. Dietrich, D. P. Williamson, and M. Udell. Learning to solve combinatorial
optimization problems on real-world graphs in linear time, 2020.
L. Duan, H. Hu, Y. Qian, Y. Gong, X. Zhang, J. Wei, and Y. Xu. A multi-task selected learning approach for solving 3d flexible bin packing
problem. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS, page 1386–1394,
2019. ISBN 9781510892002.
N. Elsokkary, F. S. Khan, D. La Torre, T. S. Humble, and J. Gottlieb. Financial portfolio management using d-wave quantum optimizer: The case
of abu dhabi securities exchange. Technical report, Oak Ridge National Lab.(ORNL), Oak Ridge, TN (United States), 2017.
P. Emami and S. Ranka. Learning permutations with sinkhorn policy gradient, 2018.
T. A. Feo, M. G. Resende, and S. H. Smith. A greedy randomized adaptive search procedure for maximum independent set. Operations Research,
42(5):860–878, 1994.
E. Filiol, E. Franc, A. Gubbioli, B. Moquet, and G. Roblot. Combinatorial optimisation of worm propagation on an unknown network. International
Journal of Computer Science, 2(2):124–130, 2007.
E. J. Gardiner, P. Willett, and P. J. Artymiuk. Graph-theoretic techniques for macromolecular docking. Journal of Chemical Information and
Computer Sciences, 40(2):273–279, 2000.
A. Gleixner, L. Eifler, T. Gally, G. Gamrath, P. Gemander, R. L. Gottwald, G. Hendel, C. Hojny, T. Koch, M. Miltenberger, B. Müller, M. E. Pfetsch,
C. Puchert, D. Rehfeldt, F. Schlösser, F. Serrano, Y. Shinano, J. M. Viernickel, S. Vigerske, D. Weninger, J. T. Witt, and J. Witzig. The SCIP
Optimization Suite 5.0. Technical report, Optimization Online, December 2017. URL http://www.optimization-online.org/DB_
HTML/2017/12/6385.html.
M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite
programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995. ISSN 1557735X. doi: 10.1145/227683.227684.
T. F. Gonzalez. Handbook of approximation algorithms and metaheuristics. CRC Press, 2007. ISBN 9781420010749. doi: 10.1201/
9781420010749.
I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.
E. Groshev, M. Goldstein, A. Tamar, S. Srivastava, and P. Abbeel. Learning generalized reactive policies using deep neural networks. In Proceedings
of the 28th International Conference on Automated Planning and Scheduling, ICAPS, pages 408–416, 2018.
S. Gu and Y. Yang. A deep learning algorithm for the max-cut problem based on pointer network structure with supervised learning and reinforce-
ment learning strategies. Mathematics, 8(2):298, Feb 2020. ISSN 2227-7390. doi: 10.3390/math8020298. URL http://dx.doi.org/
10.3390/math8020298.
T. Guo, C. Han, S. Tang, and M. Ding. Solving combinatorial problems with machine learning methods. In Nonlinear Combinatorial Optimization,
pages 207–229. Springer International Publishing, Cham, 2019. ISBN 978-3-030-16194-1. doi: 10.1007/978-3-030-16194-1_9.
L. Gurobi Optimization. Gurobi optimizer reference manual, 2020. URL http://www.gurobi.com.
P. Hansen, N. Mladenović, and D. Urošević. Variable neighborhood search for the maximum clique. Discrete Applied Mathematics, 145(1):
117–125, 2004.
M. Held and R. M. Karp. A dynamic programming approach to sequencing problems. Journal of the Society for Industrial and Applied mathematics,
10(1):196–210, 1962.
K. Helsgaun. An effective implementation of the lin–kernighan traveling salesman heuristic. European Journal of Operational Research, 126(1):
106–130, 2000.
K. Helsgaun. An extension of the lin-kernighan-helsgaun tsp solver for constrained traveling salesman and vehicle routing problems. Technical
report, Roskilde University, 2017.
T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. Deep q-learning from
demonstrations. In Proceedings of the 32nd Conference on Artificial Intelligence, AAAI, 2018. ISBN 9781577358008.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
H. Hu, X. Zhang, X. Yan, L. Wang, and Y. Xu. Solving a new 3d bin packing problem with deep reinforcement learning method, 2017.
G. Karakostas. A better approximation ratio for the vertex cover problem. ACM Transactions on Algorithms (TALG), 5(4):1–8, 2009.
R. M. Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85–103. Springer, 1972. doi: 10.1007/
978-1-4684-2001-2_9.
K. Katayama, A. Hamamoto, and H. Narihisa. An effective local search for the maximum clique problem. Information Processing Letters, 95(5):
503–511, 2005.
H. Kellerer, U. Pferschy, and D. Pisinger. Multidimensional knapsack problems. In Knapsack problems, pages 235–283. Springer, 2004.
E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. In Proceedings of the 31st
Conference on Advances in Neural Information Processing Systems, NeurIPS, 2017.
T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Represen-
tations (ICLR), 2017.
W. Kool, H. van Hoof, and M. Welling. Attention, learn to solve routing problems! In Proceedings of the 7th International Conference on Learning
Representations, ICLR, 2019.
R. E. Korf. An improved algorithm for optimal bin packing. In IJCAI, volume 3, pages 1252–1258. Citeseer, 2003.
B. Korte, J. Vygen, B. Korte, and J. Vygen. Combinatorial optimization, volume 2. Springer, 2012.
S. Lamm, P. Sanders, and C. Schulz. Graph partitioning for independent sets. In International Symposium on Experimental Algorithms, pages
68–81. Springer, 2015.
S. Lamm, P. Sanders, C. Schulz, D. Strash, and R. F. Werneck. Finding near-optimal independent sets at scale. In 2016 Proceedings of the
Eighteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 138–150. SIAM, 2016.
G. Lancia, V. Bafna, S. Istrail, R. Lippert, and R. Schwartz. Snps problems, complexity, and algorithms. In European symposium on algorithms,
pages 182–193. Springer, 2001.
A. Laterre, Y. Fu, M. K. Jabri, A.-S. Cohen, D. Kas, K. Hajjar, T. S. Dahl, A. Kerkeni, and K. Beguir. Ranked reward: Enabling self-play
reinforcement learning for combinatorial optimization, 2018.
T. Leleu, Y. Yamamoto, P. L. McMahon, and K. Aihara. Destabilization of local minima in analog spin systems by correction of amplitude
heterogeneity. Physical review letters, 122(4), 2019. ISSN 10797114. doi: 10.1103/PhysRevLett.122.040607.
D. Li, C. Ren, Z. Gu, Y. Wang, and F. Lau. Solving packing problems by conditional query learning, 2020. URL https://openreview.
net/forum?id=BkgTwRNtPB.
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning.
In Proceedings of the 4th International Conference on Learning Representations, ICLR, 2016.
S. Lin and B. W. Kernighan. An effective heuristic algorithm for the traveling-salesman problem. Operations research, 21(2):498–516, 1973.
A. Lodi, S. Martello, and D. Vigo. Heuristic algorithms for the three-dimensional bin packing problem. European Journal of Operational Research,
141(2):410–420, 2002.
H. Lu, X. Zhang, and S. Yang. A learning-based iterative method for solving vehicle routing problems. In International Conference on Learning
Representations, 2020. URL https://openreview.net/forum?id=BJe1334YDH.
Q. Ma, S. Ge, D. He, D. Thaker, and I. Drori. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. In
AAAI Workshop on Deep Learning on Graphs: Methodologies and Applications, AAAI, 2020.
A. Makhorin. Glpk (gnu linear programming kit), 2012. URL http://www.gnu.org/software/glpk/glpk.html.
S. Manchanda, A. Mittal, A. Dhawan, S. Medya, S. Ranu, and A. K. Singh. Learning heuristics over large graphs via deep reinforcement learning.
CoRR, abs/1903.03332, 2019. URL http://arxiv.org/abs/1903.03332.
S. Martello and P. Toth. Bin-packing problem. Knapsack problems: Algorithms and computer implementations, pages 221–245, 1990a.
S. Martello and P. Toth. Lower bounds and reduction procedures for the bin packing problem. Discrete applied mathematics, 28(1):59–70, 1990b.
O. Mersmann, B. Bischl, J. Bossek, H. Trautmann, M. Wagner, and F. Neumann. Local search and the traveling salesman problem: A feature-based
characterization of problem hardness. In International Conference on Learning and Intelligent Optimization, pages 115–129. Springer, 2012.
C. E. Miller, A. W. Tucker, and R. A. Zemlin. Integer programming formulation of traveling salesman problems. Journal of the ACM (JACM), 7
(4):326–329, 1960.
M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press,
USA, 2005. ISBN 9780521835402.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 2015. ISSN 14764687. doi: 10.1038/nature14236.
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement
learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, volume 48 of ICML, page
1928–1937, 2016. ISBN 9781510829008.
M. Nazari, A. Oroojlooy, L. Snyder, and M. Takác. Reinforcement learning for solving the vehicle routing problem. In Proceedings of the 32nd
Conference on Advances in Neural Information Processing Systems, NeurIPS, pages 9839–9849, 2018.
C. H. Papadimitriou and K. Steiglitz. Combinatorial optimization: algorithms and complexity. Courier Corporation, 1998. ISBN 9780486402581.
A. Perdomo-Ortiz, N. Dickson, M. Drew-Brook, G. Rose, and A. Aspuru-Guzik. Finding low-energy conformations of lattice protein models by
quantum annealing. Scientific reports, 2:571, 2012. ISSN 20452322. doi: 10.1038/srep00571.
L. Perron and V. Furnon. Or-tools, 2019. URL https://developers.google.com/optimization/.
W. Pullan and H. H. Hoos. Dynamic local search for the maximum clique problem. Journal of Artificial Intelligence Research, 25:159–185, 2006.
L. Schrage. Linear, Integer, and Quadratic Programming with LINDO: User’s Manual, 1986.
E. L. Schreiber and R. E. Korf. Improved bin completion for optimal bin packing and number partitioning. In Twenty-Third International Joint
Conference on Artificial Intelligence, 2013.
J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and
D. Silver. Mastering atari, go, chess and shogi by planning with a learned model, 2019.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017.
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot,
et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. ISSN 14764687. doi:
10.1038/nature16961.
J. Song, R. Lanka, Y. Yue, and M. Ono. Co-training for policy learning. In Proceedings of the 35th Conference on Uncertainty in Artificial
Intelligence, UAI 2019, volume 115 of Proceedings of Machine Learning Research, pages 1191–1201, Tel Aviv, Israel, 22–25 Jul 2020. PMLR.
K. Subhash, D. Minzer, and M. Safra. Pseudorandom sets in grassmann graph have near-perfect expansion. In 2018 IEEE 59th Annual Symposium
on Foundations of Computer Science (FOCS), pages 592–601. IEEE, 2018.
R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 1988. ISSN 15730565. doi: 10.1023/A:
1022633531479.
R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In
Advances in neural information processing systems, pages 1057–1063, 2000. ISBN 0262194503.
Y. Tang, S. Agrawal, and Y. Faenza. Reinforcement learning for integer programming: Learning to cut. In Proceedings of the International
Conference on Machine Learning, ICML, pages 1483–1492, 2020.
R. E. Tarjan and A. E. Trojanowski. Finding a maximum independent set. SIAM Journal on Computing, 6(3):537–546, 1977.
The Sage Developers. SageMath, the Sage Mathematics Software System (Version 9.0.0), 2020. URL https://www.sagemath.org.
E. S. Tiunov, A. E. Ulanov, and A. Lvovsky. Annealing by simulating the coherent ising machine. Optics express, 27(7):10288–10295, 2019. ISSN