Reinforcement Learning For Combinatorial Optimization: A Survey

Keywords: reinforcement learning, operations research, combinatorial optimization, value-based methods, policy-based methods

Many traditional algorithms for solving combinatorial optimization problems involve using hand-crafted heuristics that sequentially construct a solution. Such heuristics are designed by domain experts and may often be suboptimal due to the hard nature of the problems. Reinforcement learning (RL) proposes a good alternative to automate the search of these heuristics by training an agent in a supervised or self-supervised manner. In this survey, we explore the recent advancements of applying RL frameworks to hard combinatorial problems. Our survey provides the necessary background for the operations research and machine learning communities and showcases the works that are moving the field forward. We juxtapose recently proposed RL methods, laying out the timeline of the improvements for each problem, and we make a comparison with traditional algorithms, indicating that RL models can become a promising direction for solving combinatorial problems.
1. Introduction
Optimization problems are concerned with finding an optimal configuration or "value" among different possibilities, and they naturally fall into one of two buckets: problems with continuous variables and problems with discrete variables. For example, finding a solution to a convex programming problem is a continuous optimization problem, while finding the shortest path among all paths in a graph is a discrete optimization problem. Sometimes the line between the two cannot be drawn that easily. For example, the linear programming task in the continuous space can be regarded as a discrete combinatorial problem because its solution lies in a finite set of vertices of the convex polytope, as demonstrated by Dantzig's algorithm [Dantzig and Thapa, 1997]. Conventionally, optimization problems in the discrete space are called combinatorial optimization (CO) problems and typically have different types of solutions compared to the ones in the continuous space. One can formulate a CO problem as follows:
Definition 1. Let 𝑉 be a set of elements and 𝑓 ∶ 𝑉 → ℝ be a cost function. A combinatorial optimization problem aims to find the optimal value of the function 𝑓 and any corresponding optimal element that achieves that optimal value on the domain 𝑉.
Typically the set 𝑉 is finite, in which case there is a global optimum, and, hence, a trivial solution exists for any
CO problem by comparing the values of all elements 𝑣 ∈ 𝑉. Note that Definition 1 also includes the case of decision problems, when the solution is binary (or, more generally, multi-class), by associating a higher cost with the wrong answer than with the right one. One common example of a combinatorial problem is the Travelling Salesman Problem (TSP). The goal is to provide the shortest route that visits each vertex and returns to the initial endpoint, or, in other words, to find a Hamiltonian circuit 𝐻 with minimal length in a fully-connected weighted graph. In this case, the set of elements is defined by all Hamiltonian circuits, i.e. 𝑉 = {all Hamiltonian circuits}, and the cost associated with each Hamiltonian circuit is the sum of the weights 𝑤(𝑒) of the edges 𝑒 on the circuit, i.e. 𝑓(𝐻) = ∑_{𝑒∈𝐻} 𝑤(𝑒). Another
example of a CO problem is the Mixed-Integer Linear Program (MILP), for which the objective is to minimize 𝑐^⊤𝑥 for a given vector 𝑐 ∈ ℝ^𝑑 such that the vector 𝑥 ∈ ℤ^𝑑 satisfies the constraints 𝐴𝑥 ≤ 𝑏 for given parameters 𝐴 and 𝑏.
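As a small illustration of Definition 1 and the TSP example above, the following sketch evaluates the cost function 𝑓(𝐻) for a tour and enumerates the finite domain 𝑉 by brute force; the helper names are illustrative.

```python
import itertools

def tour_length(tour, w):
    """Cost f(H): sum of the edge weights along a Hamiltonian circuit.
    tour is a permutation of 0..n-1, w is an n x n weight matrix."""
    n = len(tour)
    return sum(w[tour[i]][tour[(i + 1) % n]] for i in range(n))

def brute_force_tsp(w):
    """The trivial solution from Definition 1: compare all elements of V."""
    return min(itertools.permutations(range(len(w))),
               key=lambda t: tour_length(t, w))
```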
Many CO problems are NP-hard and do not have an efficient polynomial-time solution. As a result, many algorithms that solve these problems either approximately or heuristically have been designed. One of the emerging trends of recent years is to solve CO problems by training a machine learning (ML) algorithm. For example, we can
train an ML algorithm on a dataset of already solved TSP instances to decide which node to move to next for new TSP instances. A particular branch of ML that we consider in this survey is called reinforcement learning (RL), which, for a given CO problem, defines an environment and an agent that acts in the environment, constructing a solution.
In order to apply RL to CO, the problem is modeled as a sequential decision-making process, where the agent
interacts with the environment by performing a sequence of actions in order to find a solution. Markov Decision
Process (MDP) provides a widely used mathematical framework for modeling this type of problems [Bellman, 1957].
Definition 2. MDP can be defined as a tuple 𝑀 = ⟨𝑆, 𝐴, 𝑅, 𝑇 , 𝛾, 𝐻⟩, where
• 𝑆 - state space, 𝐬_𝑡 ∈ 𝑆. The state space for combinatorial optimization problems in this survey is typically defined in one of two ways. One group of approaches constructs solutions incrementally and defines it as the set of partial solutions to the problem (e.g. a partially constructed path for TSP). The other group of methods starts with a suboptimal solution to a problem and iteratively improves it (e.g. a suboptimal tour for TSP).
• 𝐴 - action space, 𝐚_𝑡 ∈ 𝐴. Actions represent either an addition to a partial solution or a change to a complete solution (e.g. changing the order of nodes in a tour for TSP);
• 𝑅 - reward function, a mapping from states and actions to real numbers, 𝑅 ∶ 𝑆 × 𝐴 → ℝ. Rewards indicate how the action chosen in a particular state improves or worsens a solution to the problem (e.g. the tour length for TSP);
• 𝑇 - transition function 𝑇(𝐬_{𝑡+1} | 𝐬_𝑡, 𝐚_𝑡) that governs the transition dynamics from one state to another in response to an action. In the combinatorial optimization setting, the transition dynamics are usually deterministic and known in advance;
• 𝛾 - scalar discount factor, 0 < 𝛾 ≤ 1. Discount factor encourages the agent to account more for short-term
rewards;
• 𝐻 - horizon, which defines the length of an episode, where an episode is a sequence {𝑠_0, 𝑎_0, 𝑠_1, 𝑎_1, 𝑠_2, … , 𝑠_𝐻}.
For methods that construct solutions incrementally, the episode length is defined naturally by the number of actions performed until a solution is found. For iterative methods, some artificial stopping criterion is introduced.
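To make Definition 2 concrete, here is a minimal, hypothetical environment for the constructive TSP formulation (partial tours as states, the next node as the action, and the negative added edge weight as the reward); it is an illustrative sketch rather than the environment of any particular surveyed paper.

```python
class TSPConstructionEnv:
    """Toy MDP for constructive TSP: states are partial tours, actions append
    an unvisited node, and rewards are negative incremental edge weights."""

    def __init__(self, w):
        self.w = w                      # n x n weight matrix (list of lists)
        self.n = len(w)

    def reset(self):
        self.tour = [0]                 # start the partial solution at node 0
        return tuple(self.tour)         # initial state s_0

    def step(self, node):
        prev = self.tour[-1]
        self.tour.append(node)
        reward = -self.w[prev][node]    # negative length of the added edge
        done = len(self.tour) == self.n # episode ends when the tour is complete
        if done:                        # close the cycle on the final step
            reward -= self.w[node][self.tour[0]]
        return tuple(self.tour), reward, done
```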
The goal of an agent acting in a Markov Decision Process is to find a policy function 𝜋(𝑠) that maps states to actions. Solving the MDP means finding the optimal policy that maximizes the expected cumulative discounted sum of rewards:

𝜋^∗ = argmax_𝜋 𝔼[ ∑_{𝑡=0}^{𝐻} 𝛾^𝑡 𝑅(𝑠_𝑡, 𝑎_𝑡) ].    (1)
Once an MDP has been defined for a CO problem, we need to decide how the agent will search for the optimal policy 𝜋^∗. Broadly, there are two types of RL algorithms:
• Value-based methods first compute the action-value function 𝑄^𝜋(𝑠, 𝑎) as the expected reward of a policy 𝜋 given a state 𝑠 and taking an action 𝑎. The agent's policy then corresponds to picking the action that maximizes 𝑄^𝜋(𝑠, 𝑎) for a given state. The main difference between value-based approaches lies in how 𝑄^𝜋(𝑠, 𝑎) is estimated accurately and efficiently.
• Policy-based methods directly model the agent's policy as a parametric function 𝜋_𝜃(𝑠). By collecting the previous decisions that the agent made in the environment, also known as experience, we can optimize the parameters 𝜃 by maximizing the objective in Equation (1). The main difference between policy-based methods lies in the optimization approaches for finding the function 𝜋_𝜃(𝑠) that maximizes the expected sum of rewards.
As can be seen, RL algorithms depend on functions that take as input the states of the MDP and output action values or actions. States represent some information about the problem, such as the given graph or the current tour of TSP, while Q-values or actions are numbers. Therefore, an RL algorithm has to include an encoder, i.e., a function that encodes a state into a numerical vector. Many encoders have been proposed for CO problems, including recurrent neural networks, graph neural networks, attention-based networks, and multi-layer perceptrons.
Figure 1: Solving a CO problem with the RL approach requires formulating MDP. The environment is defined by a
particular instance of CO problem (e.g. Max-Cut problem). States are encoded with a neural network model (e.g.
every node has a vector representation encoded by a graph neural network). The agent is driven by an RL algorithm
(e.g. Monte-Carlo Tree Search) and makes decisions that move the environment to the next state (e.g. removing a
vertex from a solution set).
To sum up, the pipeline for solving a CO problem with RL is presented in Figure 1. A CO problem is first reformulated in terms of an MDP, i.e., we define the states, actions, and rewards for a given problem. We then define an encoder of the states, i.e. a parametric function that encodes the input states and outputs a numerical vector (Q-values or probabilities of each action). The next step is the actual RL algorithm that determines how the agent learns the parameters of the encoder and makes decisions for a given MDP. After the agent has selected an action, the environment moves to a new state and the agent receives a reward for the action it has taken. The process then repeats from the new state within the allocated time budget. Once the parameters of the model have been trained, the agent is capable of searching for solutions to unseen instances of the problem.
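Schematically, the Figure 1 loop can be written as follows; `env`, `encoder`, and `agent` are placeholder objects that stand for the three components of the pipeline.

```python
def run_episode(env, encoder, agent, train=True):
    """One pass of the Figure 1 pipeline: encode the state, act, observe the reward."""
    state, done, trajectory = env.reset(), False, []
    while not done:
        h = encoder(state)                   # state -> numerical vector
        action = agent.act(h)                # Q-values or action probabilities -> action
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    if train:
        agent.update(trajectory)             # the RL algorithm adjusts the parameters
    return trajectory
```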
Our work is motivated by the recent success in the application of the techniques and methods of the RL field
to solve CO problems. Although many practical combinatorial optimization problems can, in principle, be solved by reinforcement learning algorithms, and relevant literature exists in the operations research community, we will focus on RL approaches for CO problems. This survey covers the most recent papers that show how reinforcement learning algorithms can be applied to reformulate and solve some of the canonical optimization problems, such as the Travelling Salesman Problem (TSP), the Maximum Cut (Max-Cut) problem, the Maximum Independent Set (MIS), the Minimum Vertex Cover (MVC), and the Bin Packing Problem (BPP).
Related work. Some recent surveys also describe the intersection of machine learning and combinatorial optimization. A comprehensive survey by [Bengio et al., 2020] summarizes the approaches that solve CO problems from the perspective of general ML, and the authors discuss possible ways of combining ML heuristics with existing off-the-shelf solvers. Moreover, the work by [Zhou et al., 2018], which is devoted to the description and possible applications of GNNs, describes the progress on formulating CO problems from the GNN perspective in one of its sections. Finally, the more recent surveys by [Vesselinova et al., 2020] and [Guo et al., 2019] describe the latest ML approaches to solving CO tasks, in addition to possible applications of such methods. We note that our survey is complementary to the existing ones, as we focus on RL approaches, provide the necessary background and a classification of the RL models, and make a comparison between different RL methods and existing solutions.
Paper organization. The remainder of this survey is organized as follows. In section 2, we provide a necessary
background including the formulation of CO problems, different encoders, and RL algorithms that are used for solving
CO with RL. In section 3 we provide a classification of the existing RL-CO methods based on the popular design
choices such as the type of RL algorithm. In section 4 we describe the recent RL approaches for the specific CO
problems, providing the details about the formulated MDPs as well as their influence on other works. In section 5
we make a comparison between the RL-CO works and the existing traditional approaches. We conclude and provide
future directions in section 6.
2. Background
In this section, we provide definitions of combinatorial problems, state-of-the-art algorithms and heuristics that
solve these problems. We also describe machine learning models that encode states of CO problems for an RL agent.
Finally, we categorize popular RL algorithms that have been employed recently for solving CO problems.
where 𝐜 ∈ ℝ𝑛 is the objective coefficient vector, 𝐀 ∈ ℝ𝑚×𝑛 is the constraint coefficient matrix, 𝐛 ∈ ℝ𝑚 is the
constraint vector, and 𝑝 ≤ 𝑛 is the number of integer variables.
Next, we provide formulations of the combinatorial optimization problems, their time complexity, and the state-
of-the-art algorithms for solving them.
Definition 4 (Traveling Salesman Problem (TSP)). Given a complete weighted graph 𝐺 = (𝑉 , 𝐸), find a tour of
minimum total weight, i.e. a cycle of minimum length that visits each node of the graph exactly once.
TSP is a canonical example of a combinatorial optimization problem, which has found applications in planning,
data clustering, genome sequencing, etc. [Applegate et al., 2006]. TSP problem is NP-hard [Papadimitriou and Stei-
glitz, 1998], and many exact, heuristic, and approximation algorithms have been developed, in order to solve it. The
best known exact algorithm is the Held–Karp algorithm [Held and Karp, 1962]. Published in 1962, it solves the
problem in time 𝑂(𝑛^2 2^𝑛), which has not been improved in the general setting since then. TSP can be formulated as
a MILP instance [Dantzig et al., 1954; Miller et al., 1960], which allows one to apply MILP solvers, such as Gurobi
[Gurobi Optimization, 2020], in order to find the exact or approximate solutions to TSP. Among them, Concorde
[Applegate et al., 2006] is a specialized TSP solver that uses a combination of cutting-plane algorithms with a branch-
and-bound approach. Similarly, an extension of the Lin-Kernighan-Helsgaun TSP solver (LKH3) [Helsgaun, 2017],
which improves the Lin-Kernighan algorithm [Lin and Kernighan, 1973], is a tour improvement method that iteratively
decides which edges to rewire to decrease the tour length. More generic solvers that avoid local optima exist such as
OR-Tools [Perron and Furnon, 2019] that tackle vehicle routing problems through local search algorithms and meta-
heuristics. In addition to solvers, many heuristic algorithms have been developed, such as Christofides-Serdyukov al-
gorithm [Christofides, 1976; van Bevern and Slugina, 2020], the Lin-Kernighan-Helsgaun heuristic [Helsgaun, 2000],
2-OPT local search [Mersmann et al., 2012]. [Applegate et al., 2006] provides an extensive overview of various
approaches to TSP.
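As a reference point for the improvement heuristics discussed later, here is a compact sketch of 2-OPT local search; it reuses the hypothetical `tour_length` helper from the earlier sketch.

```python
def two_opt(tour, w):
    """2-OPT: keep reversing a segment of the tour while that shortens it."""
    best = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(candidate, w) < tour_length(best, w):
                    best, improved = candidate, True
    return best
```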
Definition 5 (Maximum Cut Problem (Max-Cut)). Given a graph 𝐺 = (𝑉, 𝐸), find a subset of vertices 𝑆 ⊂ 𝑉 that maximizes the cut 𝐶(𝑆, 𝐺) = ∑_{𝑖∈𝑆, 𝑗∈𝑉⧵𝑆} 𝑤_{𝑖𝑗}, where 𝑤_{𝑖𝑗} ∈ 𝑊 is the weight of the edge connecting vertices 𝑖 and 𝑗.
Max-Cut solutions have found numerous applications in real-life problems including protein folding [Perdomo-
Ortiz et al., 2012], financial portfolio management [Elsokkary et al., 2017], and finding the ground state of the Ising
Hamiltonian in physics [Barahona, 1982]. Max-Cut is an NP-complete problem [Karp, 1972], and, hence, does not
have a known polynomial-time algorithm. Approximation algorithms exist for Max-Cut, including deterministic 0.5-
approximation [Mitzenmacher and Upfal, 2005; Gonzalez, 2007] and randomized 0.878-approximation [Goemans
and Williamson, 1995]. Industrial solvers can be used to find a solution by applying branch-and-bound routines. In particular, the Max-Cut problem can be transformed into a quadratic unconstrained binary optimization problem and solved by CPLEX [CPLEX, 1987], which completes within an hour for graph instances with hundreds of vertices [Barrett et al., 2020]. For larger instances, several heuristics based on the simulated annealing technique have been proposed that can scale to graphs with thousands of vertices [Yamamoto et al., 2017; Tiunov et al., 2019; Leleu et al., 2019].
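The deterministic 0.5-approximation mentioned above can be sketched as a simple local search over vertex flips; assume nonnegative edge weights stored in a dict `w` keyed by a `frozenset` of the two endpoints (names and data layout are illustrative).

```python
def local_search_maxcut(nodes, w):
    """0.5-approximation: flip a vertex to the other side of the cut
    whenever that strictly increases the cut value."""
    S = set()                                         # one side of the cut

    def flip_gain(v):
        same = sum(w.get(frozenset((v, u)), 0) for u in nodes
                   if u != v and (u in S) == (v in S))
        other = sum(w.get(frozenset((v, u)), 0) for u in nodes
                    if u != v and (u in S) != (v in S))
        return same - other                           # cut edges gained minus lost

    improved = True
    while improved:
        improved = False
        for v in nodes:
            if flip_gain(v) > 0:
                S.symmetric_difference_update({v})    # move v to the other side
                improved = True
    return S
```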
Definition 6 (Bin Packing Problem (BPP)). Given a set 𝐼 of items, a size 𝑠(𝑖) ∈ ℤ⁺ for each 𝑖 ∈ 𝐼, and a positive integer bin capacity 𝐵, find a partition of 𝐼 into disjoint sets 𝐼_1, … , 𝐼_𝐾 such that the sum of the sizes of the items in each 𝐼_𝑗 is less than or equal to 𝐵 and 𝐾 is as small as possible.
There are other variants of BPP such as 2D, 3D packing, packing with various surface area, packing by weights,
and others [Wu et al., 2010]. This CO problem has found its applications in many domains such as resource optimiza-
tion, logistics, and circuit design [Kellerer et al., 2004]. BPP is an NP-complete problem with many approximation
algorithms proposed in the literature. First-fit decreasing (FFD) and best-fit decreasing (BFD) are two simple approximation algorithms that first sort the items in decreasing order of their sizes and then assign each item to the first (for FFD) or the fullest (for BFD) bin that it fits into. Both FFD and BFD run in 𝑂(𝑛 log 𝑛) time and have an 11∕9 asymptotic performance guarantee [Korte et al., 2012]. Among exact approaches, one of the first attempts has been the
Martello-Toth algorithm that works under the branch-and-bound paradigm [Martello and Toth, 1990a,b]. In addition,
several recent improvements have been proposed [Schreiber and Korf, 2013; Korf, 2003] which can run on instances
with hundreds of items. Alternatively, BPP can be formulated as a MILP instance [Wu et al., 2010; Chen et al., 1995]
and solved using standard MILP solvers such as Gurobi [Gurobi Optimization, 2020] or CPLEX [CPLEX, 1987].
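A minimal sketch of the first-fit decreasing heuristic described above (best-fit decreasing differs only in choosing the fullest feasible bin instead of the first one):

```python
def first_fit_decreasing(sizes, capacity):
    """FFD: sort items by decreasing size, put each into the first bin it fits."""
    bins = []                                # each bin is a list of item sizes
    for s in sorted(sizes, reverse=True):
        for b in bins:
            if sum(b) + s <= capacity:
                b.append(s)
                break
        else:
            bins.append([s])                 # no existing bin fits: open a new one
    return bins
```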
Definition 7 (Minimum Vertex Cover (MVC)). Given a graph 𝐺 = (𝑉, 𝐸), find a subset of nodes 𝑆 ⊂ 𝑉 such that every edge is covered, i.e. (𝑢, 𝑣) ∈ 𝐸 ⟹ 𝑢 ∈ 𝑆 or 𝑣 ∈ 𝑆, and |𝑆| is minimized.
Vertex cover optimization is a fundamental problem with applications to computational biochemistry [Lancia et al., 2001] and computer network security [Filiol et al., 2007]. There is a naïve approximation algorithm with a factor of 2, which works by adding both endpoints of an arbitrary edge to the solution and then removing these endpoints from the graph [Papadimitriou and Steiglitz, 1998]. A better approximation algorithm with a factor of 2 − Θ(1∕√log |𝑉|) is known [Karakostas, 2009], although it has been shown that MVC cannot be approximated within a factor of √2 − 𝜀 for any 𝜀 > 0 [Dinur and Safra, 2005; Subhash et al., 2018]. The problem can be formulated as an integer linear program (ILP) by minimizing ∑_{𝑣∈𝑉} 𝑐(𝑣)𝑥_𝑣, where 𝑥_𝑣 ∈ {0, 1} denotes whether a node 𝑣 with weight 𝑐(𝑣) is in the solution set, subject to 𝑥_𝑢 + 𝑥_𝑣 ≥ 1 for every edge (𝑢, 𝑣) ∈ 𝐸. Solvers such as CPLEX [CPLEX, 1987] or Gurobi [Gurobi Optimization, 2020] can be used to solve the ILP formulations with hundreds of thousands of nodes [Akiba and Iwata, 2016].
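The naïve factor-2 approximation described above admits a very short sketch; `edges` is an illustrative list of vertex pairs.

```python
def vertex_cover_2approx(edges):
    """2-approximation: take both endpoints of an arbitrary uncovered edge,
    skip edges that are already covered, and repeat."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:    # edge still uncovered
            cover.update((u, v))
    return cover
```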
Definition 8 (Maximum Independent Set (MIS)). Given a graph 𝐺 = (𝑉, 𝐸), find a subset of vertices 𝑆 ⊂ 𝑉 such that no two vertices in 𝑆 are connected by an edge of 𝐸, and |𝑆| is maximized.
MIS is a popular CO problem with applications in classification theory, molecular docking, recommendations, and
more [Feo et al., 1994; Gardiner et al., 2000; Agrawal et al., 1996]. As such the approaches of finding the solutions
for this problem have received a lot of attention from the academic community. It is easy to see that the complement
of an independent set in a graph 𝐺 is a vertex cover in 𝐺 and a clique in a complement graph 𝐺, ̄ hence, the solutions
to a minimum vertex cover in 𝐺 or a maximum clique in 𝐺 can be applied to solve the MIS problem. The running
time of the brute-force algorithm is 𝑂(𝑛2 2𝑛 ), which has been improved by [Tarjan and Trojanowski, 1977] to 𝑂(2𝑛∕3 ),
and recently to the best known bound 𝑂(1.1996𝑛 ) with polynomial space [Xiao and Nagamochi, 2017]. To cope with
medium and large instances of MIS several local search and evolutionary algorithms have been proposed. The local
search algorithms maintain a solution set, which is iteratively updated by adding and removing nodes that improve
the current objective value [Andrade et al., 2008; Katayama et al., 2005; Hansen et al., 2004; Pullan and Hoos, 2006].
In contrast, the evolutionary algorithms maintain several independent sets at the current iterations which are then
merged or pruned based on some fitness criteria [Lamm et al., 2015; Borisovsky and Zavolovskaya, 2003; Back and
Khuri, 1994]. Hybrid approaches exist that combine the evolutionary algorithms with the local search, capable to
solve instances with hundreds of thousands of vertices [Lamm et al., 2016].
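For contrast with the exact and evolutionary algorithms above, a common greedy minimum-degree baseline for MIS can be sketched as follows (not one of the cited methods; `adjacency` maps each vertex to its set of neighbours).

```python
def greedy_mis(adjacency):
    """Greedy baseline: repeatedly pick a minimum-degree vertex and
    remove it together with its neighbours."""
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    independent = set()
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))        # minimum remaining degree
        independent.add(v)
        removed = adj[v] | {v}
        adj = {u: nbrs - removed for u, nbrs in adj.items() if u not in removed}
    return independent
```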
In order to approach the outlined problems with reinforcement learning, we must represent the graphs, involved
in the problems, as vectors that can be further provided as an input to a machine learning algorithm. Next, we discuss
different approaches for learning the representations of these problems.
2.2. Encoders
In order to process the input structure 𝑆 (e.g. graphs) of CO problems, we must present a mapping from 𝑆 to a
𝑑-dimensional space ℝ𝑑 . We call such a mapping an encoder as it encodes the original input space. The encoders
vary depending on the particular type of the space 𝑆 but there are some common architectures that researchers have
developed over the last years to solve CO problems.
The first frequently used architecture is a recurrent neural network (RNN). RNNs can operate on sequential data, encoding each element of the sequence into a vector. In particular, the RNN is composed of a block of parameters that takes as input the current element of the sequence and the previous output of the RNN block, and outputs a vector that is passed to the next element of the sequence. For example, in the case of TSP, one can encode a tour of TSP by applying an RNN to the current node (e.g. initially represented by a constant 𝑑-dimensional vector) and the output of the RNN on the previous node of the tour. One can stack multiple blocks of RNNs together, making the neural network deep. Popular choices of RNN blocks are the Long Short-Term Memory (LSTM) unit [Hochreiter and Schmidhuber, 1997] and the Gated Recurrent Unit (GRU) [Cho et al., 2014], which tackle the vanishing gradient problem [Goodfellow et al., 2016].

Figure 2: The scheme for a recurrent neural network (RNN). Each box represents an encoding function. Each element in the sequence is encoded using its initial representation and the output of the model at the previous step. RNN parameters are shared across all elements of the sequence.
One of the fundamental limitations of RNN models is related to the modeling of the long-range dependencies:
as the model takes the output of the last time-step it may “forget” the information from the previous elements of
the sequence. Attention models fix this by forming a connection not just to the last input element, but to all input
elements. Hence, the output of the attention model depends on the current element of the sequence and all previous
elements of the sequence. In particular, similarity scores (e.g. dot product) are computed between the input element
and each of the previous elements, and these scores are used to determine the weights of the importance of each of the
previous elements to the current element. Attention models have recently achieved superior performance on language modeling tasks (e.g. language translation) [Vaswani et al., 2017] and have been applied to solving CO problems (e.g. for incrementally building a tour for TSP).
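A minimal NumPy sketch of the (scaled) dot-product similarity scoring just described; `query`, `keys`, and `values` are illustrative names for one element's representation and the representations of all elements.

```python
import numpy as np

def attention(query, keys, values):
    """Scaled dot-product attention: similarity scores between the query and
    every element become softmax weights over the elements' value vectors."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # one score per element
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                            # weighted combination
```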
Note that the attention model relies on modeling dependencies between each pair of elements in the input structure, which can be inefficient if there are only a few relevant dependencies. One simple extension of the attention model is a pointer network (PN) [Vinyals et al., 2015]. Instead of using the weights among all pairs for computation of the influence of each input element, the pointer networks use the weights to select a single input element that will be used for encoding. For example, in Figure 3 the element "A" has the highest similarity to the element "B", and, therefore, it is used for computation of the representation of element "B" (unlike the attention model, in which case the elements "C" and "D" are also used).

Figure 3: The scheme for a pointer network. Element "B" in the sequence first computes similarity scores to all other elements. Next we encode the representation of "B" using the element with maximum value ("A" in this case, dashed). This process is then repeated for other elements in the sequence.

Although these models are general enough to be applied to various spaces 𝑆 (e.g. points for TSP), many CO problems studied in this paper are associated with the
• Model-based methods focus on the environments, where transition functions are known or can be learned, and
can be utilized by the algorithm when making decisions. This group includes Monte-Carlo Tree Search (MCTS)
algorithms such as AlphaZero [Silver et al., 2016] and MuZero [Schrittwieser et al., 2019].
• Model-free methods do not rely on the availability of the transition functions of the environment and utilize
solely the experience collected by the agent.
Furthermore, model-free methods can be split into two big families of RL algorithms – policy-based and value-
based methods. This partition is motivated by the way of deriving a solution of an MDP. In the case of policy-based
methods, a policy is approximated directly, while value-based methods focus on approximating a value function, which
is a measure of the quality of the policy for some state-action pair in the given environment. Additionally, there are
RL algorithms that combine policy-based methods with value-based methods. Methods that utilize such a training procedure are called actor-critic methods [Sutton et al., 2000; Mnih et al., 2016]. The basic principle behind these algorithms is for the critic model to approximate the value function and for the actor model to approximate the policy. To do this, the actor and the critic usually rely on the policy-based and value-based RL methods mentioned above. This way, the critic provides a measure of how good the action taken by the actor has been, which allows the learnable parameters to be adjusted appropriately for the next training step.
Next, we formally describe the value-based, policy-based, and MCTS approaches and the corresponding RL
algorithms that have been used to solve CO problems.
The notation 𝑉 𝜋 here and in the following sections means that the value function 𝑉 is defined with respect to the
policy 𝜋. It is also important to note, that the value of a terminal state in the case of a finite MDP equals 0.
At the same time, it can be more convenient to think of the value function as the function depending not only on
the state but also on the action.
Definition 10. Action-value function 𝑄(𝑠, 𝑎) is the expectation of the future discounted rewards, when starting from
the state 𝑠, taking the action 𝑎 and then following some policy 𝜋:
𝑄^𝜋(𝑠, 𝑎) = 𝔼[ ∑_{𝑡=0}^{∞} 𝛾^𝑡 𝑟(𝑠_𝑡, 𝑎_𝑡) | 𝜋, 𝑠_0 = 𝑠, 𝑎_0 = 𝑎 ].    (3)
It is also clear that 𝑉^𝜋(𝑠) can be interpreted in terms of 𝑄^𝜋(𝑠, 𝑎) as 𝑉^𝜋(𝑠) = 𝔼_{𝑎∼𝜋}[𝑄^𝜋(𝑠, 𝑎)].
From the definition of a value function comes a very important recursive property, representing the relationship
between the value of the state 𝑉 𝜋 (𝑠) and the values of the possible following states 𝑉 𝜋 (𝑠′ ), which lies at the foundation
of many value-based RL methods. This property can be expressed as an equation, called the Bellman equation
[Bellman, 1952]:
𝑉^𝜋(𝑠) = 𝑟(𝑠) + 𝛾 ∑_{𝑠′} 𝑇(𝑠, 𝜋(𝑠), 𝑠′) 𝑉^𝜋(𝑠′).    (4)
The Bellman equation can be also rewritten in terms of the action-value function 𝑄𝜋 (𝑠, 𝑎) in the following way:
𝑄^𝜋(𝑠, 𝑎) = 𝑟(𝑠, 𝑎) + 𝛾 ∑_{𝑠′} 𝑇(𝑠, 𝑎, 𝑠′) max_{𝑎′} 𝑄^𝜋(𝑠′, 𝑎′).    (5)
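When the transition function is known and the state and action spaces are small, equation (5) can be solved by simple fixed-point iteration. The sketch below is an illustrative tabular version; the `states`, `actions`, `T`, and `r` arguments are hypothetical stand-ins for a concrete MDP.

```python
def q_value_iteration(states, actions, T, r, gamma=0.99, n_iters=1000):
    """Tabular fixed-point iteration of the Bellman equation (5).

    T(s, a, s2) -- transition probability, r(s, a) -- reward function.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        Q = {(s, a): r(s, a) + gamma * sum(
                 T(s, a, s2) * max(Q[(s2, a2)] for a2 in actions)
                 for s2 in states)
             for s in states for a in actions}
    return Q
```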
At the beginning of this section, we stated that the goal of all RL tasks is to find a policy that accumulates as much reward as possible. This means that one policy is better than (or equal to) another if the expected return of this policy is greater than the one achieved by the other policy: 𝜋′ ≥ 𝜋. Moreover, by the definition of a value function, we can claim that 𝜋′ ≥ 𝜋 if and only if 𝑉^{𝜋′}(𝑠) ≥ 𝑉^𝜋(𝑠) for all states 𝑠 ∈ 𝑆.
Knowing this relationship between policies, we can state that there is a policy that is better or equal to all the other
possible policies. This policy is called an optimal policy 𝜋 ∗ . Evidently, the optimality of the action-value and value
functions is closely connected to the optimality of the policy they follow. This way, the value function of an MDP is
called optimal if it is the maximum of the value functions across all policies: 𝑉^∗(𝑠) = max_𝜋 𝑉^𝜋(𝑠). Similarly, the optimal action-value function 𝑄^∗(𝑠, 𝑎) is defined as 𝑄^∗(𝑠, 𝑎) = max_𝜋 𝑄^𝜋(𝑠, 𝑎).
Given the Bellman equations (4) and (5), one can derive the optimal policy if the action-value or value functions
are known. In the case of a value function 𝑉 ∗ (𝑠), one can find optimal actions by doing the greedy one-step search:
picking the actions that correspond to the maximum value 𝑉 ∗ (𝑠) in the state 𝑠 computed by the Bellman equation (4).
On the other hand, in the case of the action-value function, a one-step search is not needed. For each state 𝑠 we can easily find the action 𝑎 that maximizes the action-value function, as we only need to compute 𝑄^∗(𝑠, 𝑎). This way, in contrast to the value function, we do not need any information about the rewards and values in the following states 𝑠′.
Therefore, in the case of value-based methods, in order to find the optimal policy, we need to find the optimal value functions. Notably, it is possible to solve the Bellman equation explicitly, i.e. to find the optimal value function, but only when the transition function is known. In practice, this is rarely the case, so we need methods that approximate the solution of the Bellman equation.
• Q-learning. One of the popular representatives of the approximate value-based methods is Q-learning [Watkins
and Dayan, 1992] and its deep variant Deep Q-learning [Mnih et al., 2015]. In Q-learning, the action-value
function 𝑄(𝑠, 𝑎) is iteratively updated by learning from the collected experiences of the current policy. It has
been shown in [Sutton, 1988, Theorem 3] that the function updated by such a rule converges to the optimal
value function.
• DQN. With the rise of Deep Learning, neural networks (NNs) have proven to achieve state-of-the-art results
on various datasets by learning useful function approximations through the high-dimensional inputs. This led
researchers to explore the potential of NNs’ approximations of the Q-functions. Deep Q-networks (DQN) [Mnih
et al., 2015] can learn the policies directly using end-to-end reinforcement learning. The network approximates
Q-values for each action depending on the current input state. In order to stabilize the training process, authors
have used the following formulation of the loss function:
𝐿(𝜃_𝑖) = 𝔼_{(𝑠,𝑎,𝑟,𝑠′)∼𝐷}[ ( 𝑟 + 𝛾 max_{𝑎′} 𝑄_{𝜃_𝑖^−}(𝑠′, 𝑎′) − 𝑄_{𝜃_𝑖}(𝑠, 𝑎) )^2 ],    (6)

where 𝐷 is a replay memory buffer used to store (𝑠, 𝑎, 𝑟, 𝑠′) transitions. Equation (6) is the mean-squared error between the current approximation of the Q-function and the bootstrapped target value 𝑟 + 𝛾 max_{𝑎′} 𝑄_{𝜃_𝑖^−}(𝑠′, 𝑎′) (a minimal sketch of this loss is given after this list).
The training of DQN has been shown to be more stable, and, consequently, DQN has been effective for many
RL problems, including RL-CO problems.
• Policy gradient. In order to optimize (1) with respect to the policy parameters 𝜃, policy gradient theorem
[Sutton et al., 2000] can be applied to estimate the gradients of the policy function in the following form:
∇_𝜃 𝐽(𝜋_𝜃) = 𝔼_{𝜋_𝜃}[ ∑_{𝑡=0}^{𝐻} ∇_𝜃 log 𝜋_𝜃(𝑎_𝑡 | 𝑠_𝑡) 𝐴̂(𝑠_𝑡, 𝑎_𝑡) ],    (7)
Although this approach introduces bias into the gradient estimates, it often reduces the variance even further. Moreover, actor-critic methods can be applied to online and continual learning, as they no longer rely on Monte-Carlo rollouts, i.e. unrolling the trajectory to a terminal state (a sketch of the corresponding surrogate loss is given after this list).
• PPO/DDPG. Further development of this group of reinforcement learning algorithms has resulted in the ap-
pearance of several more advanced methods such as Proximal Policy Optimization (PPO) [Schulman et al.,
2017], that performs policy updates with constraints in the policy space, or Deep Deterministic Policy Gradient
(DDPG) [Lillicrap et al., 2016], an actor-critic algorithm that attempts to learn a parametric state-action value
function 𝑄𝜙 (𝑠, 𝑎), corresponding to the current policy, and use it to compute the bootstrapped return estimate.
• MCTS. The algorithm follows the general procedure of Monte Carlo Tree Search (MCTS) [Browne et al.,
2012] consisting of selection, expansion, roll-out and backup steps (Figure 5). However, instead of evaluating
leaf nodes in a tree by making a rollout step, a neural network 𝑓𝜃 is used to provide a policy 𝑃 (𝑠, ∗) and state-
value estimates 𝑉 (𝑠) for the new node in the tree. The nodes in the tree refer to states 𝑠, and edges refer to
actions 𝑎. During the selection phase we start at a root state, 𝑠0 , and keep selecting next states that maximize
the upper confidence bound:
UCB(𝑠, 𝑎) = 𝑄(𝑠, 𝑎) + 𝑐 ⋅ 𝑃(𝑠, 𝑎) ⋅ √(∑_{𝑎′} 𝑁(𝑠, 𝑎′)) ∕ (1 + 𝑁(𝑠, 𝑎)),    (9)
Figure 5: Three steps of the Monte Carlo Tree Search (MCTS). Starting the simulation from the root node the select
step picks the node that maximizes the upper confidence bound. When previously unseen node is expanded, the policy
𝑃 (𝑠, ∗) and the state-value function 𝑉 (𝑠) are evaluated at this node, and the action-value 𝑄(𝑠, 𝑎) and the counter 𝑁(𝑠, 𝑎)
are initialized to 0. Then 𝑉 (𝑠) estimates are propagated back along the path of the current simulation to update 𝑄(𝑠, 𝑎)
and 𝑁(𝑠, 𝑎).
When a previously unseen node in the search tree is encountered, the policy 𝑃(𝑠, ∗) and the state-value estimate 𝑉(𝑠) are evaluated for this node. After that, the 𝑉(𝑠) estimate is propagated back along the search tree, updating the 𝑄(𝑠, 𝑎) and 𝑁(𝑠, 𝑎) values. After a number of search iterations, we select the next action from the root state according to the improved policy:
𝜋(𝑎 | 𝑠_0) = 𝑁(𝑠_0, 𝑎) ∕ ∑_{𝑎′} 𝑁(𝑠_0, 𝑎′).    (10)
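To make the value-based and policy-based updates above concrete, here is a minimal PyTorch sketch of the DQN loss in equation (6); the `q_net`/`target_net` modules and the batch layout are illustrative placeholders, not the architectures used in the surveyed papers.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean-squared TD error of equation (6) for a replay-buffer batch."""
    s, a, r, s_next, done = batch                          # tensors sampled from D
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_theta_i(s, a)
    with torch.no_grad():                                  # target uses frozen parameters
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)
```

Similarly, a hypothetical REINFORCE-style surrogate loss whose gradient matches the estimator in equation (7) when the advantage is approximated by a return minus a baseline:

```python
def reinforce_loss(log_probs, returns, baseline=0.0):
    """Policy-gradient surrogate: its gradient follows equation (7)."""
    advantages = returns - baseline                  # estimate of A_hat(s_t, a_t)
    return -(log_probs * advantages.detach()).sum()  # minimizing this ascends the objective
```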
3. Taxonomy of RL for CO
A full taxonomy of RL methods for CO can be challenging to construct because of the orthogonality of the ways in which the given works can be classified. In this section we list all the taxonomy groups that are used in this survey.
One straightforward way of dividing the RL approaches in the CO field is by the family of RL methods used to find the solution of the given problem. As shown in Figure 4, it is possible to split the RL methods either by the first level of Figure 4 (i.e. into model-based and model-free methods) or by the second level (i.e. policy-based methods, value-based methods, and methods using the Monte-Carlo approach) (section 2.3). In addition, another
division is possible by the type of encoders used for representing the states of the MDP (section 2.2). This division is
much more granular than the other ones discussed in this section, as can be seen from the works surveyed in the next
section.
Another way to aggregate the existing RL approaches is based on how RL is integrated into the given CO problem, i.e. whether the RL agent is searching for a solution to the CO problem itself or is facilitating the inner workings of existing off-the-shelf solvers.
• In principal learning an agent makes the direct decision that constitutes a part of the solution or the complete
solution of the problem and does not require the feedback from the off-the-shelf solver. For example, in TSP
the agent can be parameterized by a neural network that incrementally builds a path from a set of vertices and
then receives the reward in the form of the length of the constructed path, which is used to update the policy of
the agent.
• Alternatively, one can learn the RL agent's policy jointly with already existing solvers so that it can improve some of the metrics for a particular problem. For example, in MILP a commonly used approach is the
Branch & Bound method, which at every step selects a branching rule on the node of the tree. This can have
a significant impact on the overall size of the tree and, hence, the running time of the algorithm. A branching
rule is a heuristic that typically requires either some domain expertise or a hyperparameter tuning procedure.
However, a parameterized RL agent can learn to imitate the policy of the node selection by receiving rewards
proportional to the running time.
Another dimension along which the RL approaches can be divided is the way the solution is searched for by the learned heuristics. In this regard, methods can be divided into those learning construction heuristics and those learning improvement heuristics.
• Methods that learn construction heuristics build solutions incrementally, using the learned policy to choose each element to add to a partial solution (see the sketch after this list).
• The second group of methods starts from some arbitrary solution and learns a policy that improves it iteratively. This approach tries to address a problem commonly encountered when learning construction heuristics, namely, the need for extra procedures, such as beam search or sampling, to find a good solution.
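Schematically, the two families correspond to the following loops; `policy`, `problem`, and their methods are hypothetical placeholders.

```python
def construct_solution(policy, problem):
    """Construction heuristic: extend a partial solution one element at a time."""
    partial = []
    while not problem.is_complete(partial):
        partial.append(policy.choose_next(partial, problem))
    return partial

def improve_solution(policy, problem, solution, steps=100):
    """Improvement heuristic: start from a full solution and rewrite it iteratively."""
    best = solution
    for _ in range(steps):
        solution = policy.rewrite(solution, problem)   # e.g. a 2-opt-style move
        if problem.cost(solution) < problem.cost(best):
            best = solution
    return best
```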
4. RL for CO
In this section we survey existing RL approaches to solve CO problems that include Traveling Salesman Prob-
lem (Definition 4), Maximum Cut Problem (Definition 5), Bin Packing Problem (Definition 6), Minimum Vertex
Cover Problem (Definition 7), and Maximum Independent Set (Definition 8). These problems have received the most
attention from the research community and we juxtapose the approaches for all considered problems.
the one used by [Bello et al., 2017], except for the reward function 𝑟(𝑠, 𝑎). The reward, in this case, is defined as
the difference in the cost functions after transitioning from the state 𝑠 to the state 𝑠′ when taking some action 𝑎:
𝑟(𝑠, 𝑎) = 𝑐(ℎ(𝑠′ ), 𝐺) − 𝑐(ℎ(𝑠), 𝐺), where ℎ is the graph embedding function of the partial solutions 𝑠 and 𝑠′ , 𝐺 is
the whole graph, and 𝑐 is the cost function. Because the weighted variant of TSP is solved, the authors define the cost function 𝑐(ℎ(𝑠), 𝐺) as the negative weighted tour length. The work implements S2V [Dai et al., 2016] for encoding the partial solutions and DQN as the RL algorithm of choice for updating the network's parameters.
Another work, by [Nazari et al., 2018], motivated by [Bello et al., 2017], concentrates on solving the Vehicle Routing Problem (VRP), which is a generalization of TSP. However, the approach suggested in [Bello et al., 2017] cannot be applied directly to VRP due to the problem's dynamic nature, i.e. the demand at a node becomes zero once the node has been visited, whereas that approach embeds the sequential and static nature of the input. The authors of [Nazari et al.,
2018] extend the previous methods used for solving TSP to circumvent this problem and find the solutions to VRP
and its stochastic variant. Specifically, similarly to [Bello et al., 2017], in [Nazari et al., 2018] approach the state 𝑠
represents the embedding of the current solution as a vector of tuples, one value of which is the coordinates of the
customer’s location and the other is the customer’s demand at the current time step. An action 𝑎 is picking a node,
which the vehicle will visit next in its route. The reward is also similar to the one used for TSP: it is the negative
total route length, which is given to the agent only after all customers’ demands are satisfied, which is the terminal
state of the MDP. The authors of [Nazari et al., 2018] also suggest improving the Pointer Network used by [Bello et al., 2017]. To do that, the encoder is simplified by replacing the LSTM unit with 1-d convolutional embedding layers so that the model is invariant to the input sequence order and, consequently, able to handle the dynamic state changes. The policy is then learned with the REINFORCE algorithm for TSP and VRP, and with A3C for the stochastic VRP.
Similarly to [Nazari et al., 2018], the work by [Deudon et al., 2018] uses the same approach as the one by [Bello
et al., 2017], while changing the encoder-decoder network architecture. This way, while the MDP is the same as in
[Bello et al., 2017], instead of including the LSTM units, the GNN encoder architecture is based solely on the attention
mechanisms so that the input is encoded as a set and not as a sequence. The decoder, however, stays the same as in the
Pointer Network case. Additionally, the authors have looked into combining a solution provided by the reinforcement
learning agent with the 2-Opt heuristic [Croes, 1958], in order to further improve the inference results. The REINFORCE algorithm with a critic baseline is used to update the parameters of the described encoder-decoder network.
Parallel to [Deudon et al., 2018], inspired by the transformer architecture of [Vaswani et al., 2017], a construction
heuristic learning approach by [Kool et al., 2019] has been proposed in order to solve TSP, two variants of VRP
(Capacitated VRP and Split Delivery VRP), Orienteering Problem (OP), Prize Collecting TSP (PCTSP) and Stochastic
PCTSP (SPCTSP). In this work, the authors have implemented a similar encoder-decoder architecture as the authors
of [Deudon et al., 2018], i.e. the Transformer-like attention-based encoder, while the decoder is similar to that of the Pointer Network. However, the authors found that slightly changing the training procedure and using a simple rollout
baseline instead of the one learned by a critic yields better performance. The MDP formulation, in this case, is also
similar to the one used by [Deudon et al., 2018], and, consequently, the one by [Bello et al., 2017].
One specific construction heuristic approach has been proposed in [Emami and Ranka, 2018]. The authors have
designed a novel policy gradient method, Sinkhorn Policy Gradient (SPG), specifically for the class of combinatorial
optimization problems, which involves permutations. This approach yields a different MDP formulation. Here, in
contrast with the case when the solution is constructed sequentially, the state space consists of instances of combina-
torial problems of a particular size. The action space, in this case, consists of outputting a permutation matrix, which, applied to the original graph, produces the solution tour. The reward function is the negated sum of the Euclidean distances
between each stop along the tour. Finally, using a special Sinkhorn layer on the output of the feed-forward neural
network with GRUs to produce continuous and differentiable relaxations of permutation matrices, authors have been
able to train actor-critic algorithms similar to Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al., 2016].
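The Sinkhorn layer mentioned above can be approximated by iterated row and column normalization of a positive matrix (a standard Sinkhorn-Knopp sketch; the temperature and iteration count here are arbitrary choices, not those of the cited work).

```python
import numpy as np

def sinkhorn(logits, n_iters=20, tau=1.0):
    """Relaxation of a permutation matrix: alternately normalize the rows and
    columns of exp(logits / tau) so it approaches a doubly stochastic matrix."""
    x = np.exp(logits / tau)
    for _ in range(n_iters):
        x = x / x.sum(axis=1, keepdims=True)   # row normalization
        x = x / x.sum(axis=0, keepdims=True)   # column normalization
    return x
```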
The work by [Cappart et al., 2020] combines two approaches to solving the traveling salesman problem with time
windows, namely the RL approach and the constraint programming (CP) one, in order to learn branching strategies.
In order to encode the CO problems, the authors bring up a dynamic programming formulation, that acts as a bridge
between both techniques and can be exposed both as an MDP and a CP problem. A state 𝑠 is a vector, consisting
of three values: the set of remaining cities that still have to be visited, the last city that has been visited, and the
current time. An action 𝑎 corresponds to choosing a city. The reward 𝑟(𝑠, 𝑎) corresponds to the negative travel
time between two cities. This MDP can then be transformed into a dynamic programming model. DQN and PPO
algorithms have been trained for the MDP formulation to select the efficient branching policies for different CP search
strategies (branch-and-bound, iterative limited discrepancy search, and restart-based search), and have been used to
solve challenging CO problems.
The work by [Drori et al., 2020] differs from the previous works, which tailor their approaches to individual
problems. In contrast, this work provides a general framework for model-free reinforcement learning using a GNN
representation that adapts to different problem classes by changing the reward. This framework models problems by using the edge-to-vertex line graph and formulates them as a single-player game. The MDPs for TSP
and VRP are the same as in [Bello et al., 2017]. Instead of using a full-featured Neural MCTS, [Drori et al., 2020]
represents a policy as a GIN encoder with an attention-based decoder, learning it during the tree-search procedure.
[Lu et al., 2020] suggests learning the improvement heuristics in a hierarchical manner for capacitated VRP as a part of the joint approach. The authors have designed an intrinsic MDP, which incorporates not only the features of the current solutions but also the running history. A state 𝑠_𝑖 includes the free capacity of the route containing a customer 𝑖, its location, the location of the node 𝑖− visited before 𝑖, the location of the node 𝑖+ visited after 𝑖, the distance from 𝑖− to 𝑖, the distance from 𝑖 to 𝑖+, the distance from 𝑖− to 𝑖+, the action taken ℎ steps before, and the effect of 𝑎_{𝑡−ℎ}. The action
consists of choosing between two groups of operators, that change the current solution, for example by applying the
2-Opt heuristic, which removes two edges and reconnects their endpoints. Concretely, these two operator groups
are improvement operators, that are chosen according to a learned policy, or perturbation operators in the case of
reaching a local minima. The authors have experimented with the reward functions, and have chosen the two most
successful ones: +1∕ − 1 reward for each time the solution improves/does not give any gains, and the advantage
reward, which takes the initial solution’s total distance as the baseline, and constitutes the difference between this
baseline and the distance of the subsequent solutions as the reward at each time step. The policy is parameterized by
the Graph Attention Network and is trained with the REINFORCE algorithm.
The final work we cover in this section is by [Chen and Tian, 2019], who propose solving VRP and online job scheduling problems by learning improvement heuristics. The algorithm rewrites different
parts of the solution until convergence instead of constructing the solution in the sequential order. The state space is
represented as a set of all solutions to the problem, while the action set consists of regions, i.e. nodes in the graph, and
their corresponding rewriting rules. The reward, in this case, is the difference in the costs of the current and previous
solutions. The authors use an LSTM encoder, specific to each of the covered problems and train region-picking and
rule-picking policies jointly by applying the Q-Actor-Critic algorithm.
graphs with up to 1000-1200 nodes, achieving very good approximation ratios to exact solutions.
[Barrett et al., 2020] improved on the work of [Khalil et al., 2017] in terms of the approximation ratio as well as
the generalization by proposing an ECO-DQN algorithm. The algorithm kept the general framework of S2V-DQN but
introduced several modifications. The agent was allowed to remove vertices from the partially constructed solution
to better explore the solution space. The reward function was modified to provide a normalized incremental reward
for finding a solution better than seen in the episode so far, as well as give small rewards for finding a locally optimal
solution that had not yet been seen during the episode. In addition, there were no penalties for decreasing cut value.
The input of the state encoder was modified to account for changes in the reward structure. Since in this setting the
agent had been able to explore indefinitely, the episode length was set to 2|𝑉 |. Moreover, the authors allowed the
algorithm to start from an arbitrary state, which could be useful by combining this approach with other methods, e.g.
heuristics. This method showed better approximation ratios than S2V-DQN, as well as better generalization ability.
[Cappart et al., 2019] devised a joint approach to the Max-Cut problem by incorporating reinforcement learning into the Decision Diagrams (DD) framework [Bergman et al., 2016] to learn a constructive heuristic. The integration of reinforcement learning made it possible to provide tighter objective function bounds for the DD solution by
learning heuristics for variable ordering. They have formulated the problem as an MDP, where the state space, 𝑆,
is represented as a set of ordered sequences of selected variables along with partially constructed DDs. The action
space, 𝐴, consists of variables, that are not yet selected. The transition function, 𝑇 , adds variables to the selected
variables set and to the DD. The reward function is designed to tighten the bounds of the DD and is encoded as the
relative upper and lower bounds improvements after the addition of the variable to the set. The training was performed
on generated random graphs with the algorithm and state encoding of [Khalil et al., 2017] described above. The authors showed that their approach outperformed several ordering heuristics and generalized well to larger graph instances, but did not report any comparison to other reinforcement learning-based methods.
Another joint method proposed by [Tang et al., 2020] combined a reinforcement learning framework with the
cutting plane method. Specifically, in order to learn improvement heuristics for choosing Gomory cutting planes, which are frequently used in Branch-and-Cut solvers, an efficient MDP formulation was developed. The state space, 𝑆, includes the original linear constraints and the cuts added so far. Solving the linear relaxation produces the action space, 𝐴, of Gomory cuts that can be added to the problem. After that, the transition function, 𝑇, adds the
chosen cuts to the problem that results in a new state. The reward function is defined as a difference between the
objective function of two consecutive linear problem solutions. The policy gradient algorithm was used to select new Gomory cuts, and the state was encoded with an LSTM network (to account for a variable number of variables) along with an attention-based mechanism (to account for a variable number of constraints). The algorithm was trained on generated graph instances using evolution strategies and has been shown to improve the efficiency of the cuts, the integrality gaps, and the generalization, compared to the commonly used heuristics for choosing Gomory cuts. Also, the approach was shown to be beneficial in combination with the branching strategy in experiments with the Branch-and-Cut algorithm.
[Abe et al., 2019] proposed to use a graph neural network along with a Neural MCTS approach to learn construction heuristics. The MDP formulation defines the state space, 𝑆, as a set of partial graphs from which nodes
can be removed and colored in one of two colors representing two subsets. The action space, 𝐴, represents sets of
nodes still left in the graph and their available colors. The transition function, 𝑇 , colors the selected node of the graph
and removes it along with the adjacent edges. The neighboring nodes that are left keep a counter of how many nodes of the adjacent color have been removed. When a new node is removed, the number of the earlier
removed neighboring nodes of the opposite color is provided as the incremental reward signal, 𝑅 (the number of edges
that were included in the cut set). Several GNNs were compared as graph encoders, with GIN [Xu et al., 2018] shown to perform the best. Also, a training procedure similar to AlphaGo Zero was employed, with a modification to accommodate a numeric rather than a win/lose outcome. The experiments were performed on a vast variety of generated and real-world graphs. The extensive comparison of the method with several heuristics and with the previously described S2V-DQN [Khalil et al., 2017] showed superior performance as well as better generalization ability to larger graphs, yet the authors did not report any comparison with exact methods.
[Gu and Yang, 2020] applied the Pointer Network [Vinyals et al., 2015] along with the Actor-Critic algorithm
similar to [Bello et al., 2017] to iteratively construct a solution. The MDP formulation defines the state, 𝑆, as a
symmetric matrix, 𝑄, the values of which are the edge weights between nodes (0 for the disconnected nodes). Columns
of this matrix are fed to the Pointer Network, which sequentially outputs the actions, 𝐴, in the form of pointers to input
vectors along with a special end-of-sequence symbol "EOS". The resulting sequence of nodes separated by the "EOS"
symbol represents a solution to the problem, from which the reward is calculated. The authors conducted experiments on simulated graphs with up to 300 nodes and reported fairly good approximation ratios but, unfortunately, did not compare with previous works or known heuristics.
the proposed method achieves a smaller bin gap ratio for the problems of size up to 30 items.
[Cai et al., 2019] has taken the joint approach to solving a 1D bin packing problem by combining proximal policy
optimization (PPO) with the simulated annealing (SA) heuristic algorithm. PPO is used to learn the improvement
heuristic to build an initial starting solution for SA, which in turn, after finding a good solution in a limited number
of iterations, calculates the reward function, 𝑅, as the difference in costs between the initial and final solutions and
passes it to the PPO agent. The action space, 𝐴, is represented by a set of changes of the bins between two items,
e.g. a perturbation to the current solution. The state space, 𝑆, is described with a set of assignments of items to bins.
The work has shown that the combination of RL and the heuristics can find better solutions than these algorithms in isolation, but has not provided any comparison with known heuristics or other algorithms.
the initial graph 𝐺0 ; an action 𝑎 ∈ 𝐴 be a selection of one node of a graph in the current state; a transition function
𝑇 (𝑠, 𝑎, 𝑠′ ) be the function returning the next state, corresponding to the graph where edges covered by the action 𝑎 and
its adjacent nodes are deleted; and, finally, a reward function 𝑟(𝑠, 𝑎) be a constant and equal to 1. For encoders, the
authors proposed to apply a GIN, to account for the variable size of a state representation in a search tree. [Abe et al.,
2019] uses a model-based algorithm, namely AlphaGo Zero, to update the parameters of the GIN network.
The latest work, by [Ahn et al., 2020], modifies the MDP formulation by applying a label to each node, i.e. each node can either be included in the solution, excluded from the solution, or have the determination of its label deferred.
This way, a state 𝑠 ∈ 𝑆 becomes a vector, the size of which is equal to the number of nodes in the graph, and which
consists of the labels that have been given to each node at the current time step. The initial state 𝑠0 is a vector with all
labels set to being deferred. An action 𝑎 ∈ 𝐴 is a vector with new label assignments for the next state of only currently
deferred nodes. To maintain the independence of the solution set, the transition function 𝑇 (𝑠, 𝑎) is set to consist of two
phases: the update phase and the clean-up phase. The first phase represents the naive assignment of the labels by applying the action 𝑎, which leads to the intermediate state 𝑠̂. In the clean-up phase, the authors modify the intermediate state 𝑠̂ in such a way that the included nodes are only adjacent to the excluded ones. Finally, the reward function 𝑟(𝑠, 𝑎)
is equal to the increase in the cardinality of included vertices between the current state 𝑠′ and the previous state 𝑠.
The authors propose to use the Graph Convolutional Network encoder and PPO method with the rollback procedure
to learn the optimal Deep Auto-Deferring Policy (ADP), which outputs the improvement heuristic to solve the MIS
problem.
5. Comparison
In this section, we will partially compare the results achieved by the works presented in this survey. Concretely,
we have distinguished the two most frequently mentioned problems, namely, Travelling Salesman Problem (TSP) and
Capacitated Vehicle Routing Problem (CVRP). The average tour lengths for these problems, reported in the works [Lu
et al., 2020; Kool et al., 2019; Lodi et al., 2002; Bello et al., 2017; Emami and Ranka, 2018; Ma et al., 2020; Nazari
et al., 2018; Chen and Tian, 2019], are shown in Tables 6, 7.
The presented results have been achieved on Erdős–Rényi (ER) graphs of various sizes, namely, with 20, 50, and 100
nodes for TSP, and 10, 20, 50, and 100 nodes for CVRP. In the case of CVRP, we also specify the capacity of the
vehicle (Cap.), which varies from 10 to 50. In addition, we include the results achieved by the OR-Tools solver
[Perron and Furnon, 2019] and the LK-H heuristic algorithm [Helsgaun, 2017] as baseline solutions.
Best performing methods. It is clear from the presented tables that the best performing methods for TSP are [Kool
et al., 2019] and [Bello et al., 2017], while for CVRP it is [Lu et al., 2020]. These algorithms perform on par with the
baselines and in some cases demonstrate better results. Moreover, [Lu et al., 2020] achieves the best performance
among all the compared methods, even for instances with smaller vehicle capacities.
Focus on smaller graphs. Throughout our analysis, we have found that most of the articles focus on testing the
CO-RL methods on graphs with 20, 50, or 100 nodes. At the same time, [Ma et al., 2020] presents results for bigger
graphs with 250, 500, 750, and 1000 nodes for the TSP. This is likely because finding the optimal solution becomes
much more computationally demanding as the graph size grows, even for commercial solvers. The comparison of
the reported results with the baselines further supports this observation: for TSP, almost all of the methods outperform
OR-Tools on smaller graphs, while for bigger graphs this is no longer the case. Consequently, scaling to larger
instances can be a promising direction for further research.
Non-overlapping problems. Although many works focused on creating well-performing RL-based solvers have
emerged, the CO problems covered in these articles rarely coincide, which makes a fair comparison a much harder
task. We are convinced that further analysis should focus on unifying the results from different sources and, hence,
on identifying the most promising directions for research.
Running times. One of the main advantages of using machine learning and reinforcement learning algorithms to solve
CO problems is the considerable reduction in running time compared to metaheuristic algorithms and solvers.
However, it is still hard to compare the running times of different works, as they can vary significantly depending on
the implementation and the hardware used for experimentation. For these reasons, we do not attempt an exact
comparison of the times achieved by different RL-CO works. Still, we can note that some works, such as [Nazari et
al., 2018], [Chen and Tian, 2019], and [Lu et al., 2020], claim to have outperformed the classic heuristic algorithms.
Concretely, the authors of [Nazari et al., 2018] show that, for larger problems, their framework is faster than the
randomized heuristics, and that its running time grows more slowly with the complexity of the CO problem than that
of the Clarke-Wright [Clarke and Wright, 1964] and Sweep [Wren and Holliday, 1972] heuristics. [Chen and Tian,
2019] claim that their approach outperforms the expression simplification component of the Z3 solver [De Moura
and Bjørner, 2008] in terms of both the objective metrics and time efficiency. Finally, although the exact training
times are not given in the article, the authors of [Lu et al., 2020] note that the running time of their algorithm is much
smaller than that of LK-H. In addition, while also acknowledging the difficulty of comparing times across works,
[Kool et al., 2019] claim that their algorithm runs ten times faster than that of [Bello et al., 2017].
Table 7
The average tour lengths comparison for Capacitated VRP for ER graphs with the number of nodes 𝑁 equal to
10, 20, 50, 100. Cap. represents the capacity of the vehicle for CVRP.
Generalization to other problems. In Section 5, we have formulated one of the main problems of the current state of
the RL-CO field, which is the limited number of experimental comparisons. Indeed, the group of CO problems is
vast, and the current approaches often have to be implemented for a concrete set of problems. The RL field, however,
has already made some steps towards generalizing learned policies to unseen problems (for example, [Groshev et al.,
2018]). In the case of CO, these unseen problems can be instances of the same problem of a different size, problem
instances with different distributions, or even problems from another group of CO problems. We believe that,
although this direction is challenging, it is extremely promising for the future development of the RL-CO field.
Improving the solution quality. Many of the works reviewed in this survey have demonstrated superior performance
compared to commercial solvers. Moreover, some of them have matched the quality of the optimal solutions or of
the solutions produced by heuristic algorithms. However, these results hold only for the less complex versions of
the CO problems, for example, instances with smaller numbers of nodes. This leaves room for further improvement
of the current algorithms in terms of objective quality. One possible way forward is a tighter integration of classical
CO algorithms with RL approaches, for example, by using imitation learning as in [Hester et al., 2018].
Filling the gaps. One of the ways to classify RL-CO approaches, which we have mentioned previously, is by
grouping them into joint and constructive methods. Tables 1, 2, 3, 4, and 5 contain these labels for each of the
reviewed articles, and from them we can identify some unexplored approaches for each of the CO problems. For
instance, Table 3 shows that no algorithm that is both joint and constructive has been published for the Bin Packing
problem. The same logic applies to the Minimum Vertex Cover problem (Table 3), for which there are no approaches
of the joint-constructive or joint-nonconstructive type. Exploring these algorithmic possibilities can provide not only
new methods but also useful insights into the effectiveness of these approaches.
In conclusion, we see the field of RL for CO problems as a very promising direction for CO research because of its
effectiveness in terms of solution quality, its capacity to outperform existing algorithms, and its substantial running
time gains compared to classical heuristic approaches.
References
K. Abe, Z. Xu, I. Sato, and M. Sugiyama. Solving np-hard problems on graphs with extended alphago zero, 2019.
R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A. I. Verkamo, et al. Fast discovery of association rules. Advances in knowledge discovery and
data mining, 12(1):307–328, 1996.
S. Ahn, Y. Seo, and J. Shin. Deep auto-deferring policy for combinatorial optimization, 2020. URL https://openreview.net/forum?
id=Hkexw1BtDr.
T. Akiba and Y. Iwata. Branch-and-reduce exponential/fpt algorithms in practice: A case study of vertex cover. Theoretical Computer Science,
609:211–225, 2016.
D. V. Andrade, M. G. Resende, and R. F. Werneck. Fast local search for the maximum independent set problem. In International Workshop on
Experimental and Efficient Algorithms, pages 220–234. Springer, 2008.
T. Anthony, Z. Tian, and D. Barber. Thinking fast and slow with deep learning and tree search. In Proceedings of the 31st International Con-
ference on Neural Information Processing Systems, NIPS’17, page 5366–5376, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN
9781510860964.
D. L. Applegate, R. E. Bixby, V. Chvatal, and W. J. Cook. The traveling salesman problem: a computational study. Princeton university press,
2006.
T. Back and S. Khuri. An evolutionary heuristic for the maximum independent set problem. In Proceedings of the First IEEE Conference on
Evolutionary Computation. IEEE World Congress on Computational Intelligence, pages 531–535. IEEE, 1994.
F. Barahona. On the computational complexity of ising spin glass models. Journal of Physics A: Mathematical and General, 15(10):3241, 1982.
ISSN 13616447. doi: 10.1088/0305-4470/15/10/028.
T. D. Barrett, W. R. Clements, J. N. Foerster, and A. Lvovsky. Exploratory combinatorial optimization with reinforcement learning. In Proceedings
of the 34th National Conference on Artificial Intelligence, AAAI, pages 3243–3250, 2020. doi: 10.1609/aaai.v34i04.5723.
R. Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8),
1952. ISSN 0027-8424. doi: 10.1073/pnas.38.8.716.
R. Bellman. A markovian decision process. Indiana Univ. Math. J., 6:679–684, 1957. ISSN 0022-2518.
I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. In Workshop Proceedings
of the 5th International Conference on Learning Representations, ICLR, 2017.
Y. Bengio, A. Lodi, and A. Prouvost. Machine learning for combinatorial optimization: A methodological tour d’horizon. European Journal of
Operational Research, Aug. 2020. ISSN 03772217. doi: 10.1016/j.ejor.2020.07.063.
D. Bergman, A. A. Cire, W.-J. v. Hoeve, and J. Hooker. Decision Diagrams for Optimization. Springer Publishing Company, Incorporated, 1st
edition, 2016. ISBN 3319428470.
P. A. Borisovsky and M. S. Zavolovskaya. Experimental comparison of two evolutionary algorithms for the independent set problem. In Workshops
on Applications of Evolutionary Computation, pages 154–164. Springer, 2003.
C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. Cowling, P. Rohlfshagen, S. Tavener, D. Perez Liebana, S. Samothrakis, and S. Colton. A
survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 2012. ISSN 1943068X. doi:
10.1109/TCIAIG.2012.2186810.
Q. Cai, W. Hang, A. Mirhoseini, G. Tucker, J. Wang, and W. Wei. Reinforcement learning driven heuristic optimization. In Proceedings of
Workshop on Deep Reinforcement Learning for Knowledge Discovery, DRL4KDD, 2019.
Q. Cappart, E. Goutierre, D. Bergman, and L.-M. Rousseau. Improving optimization bounds using machine learning: Decision diagrams meet
deep reinforcement learning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, AAAI, 2019. ISBN 9781577358091. doi:
10.1609/aaai.v33i01.33011443.
Q. Cappart, T. Moisan, L.-M. Rousseau, I. Prémont-Schwarz, and A. Cire. Combining reinforcement learning and constraint programming for
combinatorial optimization. arXiv preprint arXiv:2006.01610, 2020.
C. Chen, S.-M. Lee, and Q. Shen. An analytical model for the container loading problem. European Journal of Operational Research, 80(1):68–76,
1995.
X. Chen and Y. Tian. Learning to perform local rewriting for combinatorial optimization. In Proceedings of the 33rd Conference on Advances in
Neural Information Processing Systems, NeurIPS, pages 6281–6292, 2019.
K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn
encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
N. Christofides. Worst-case analysis of a new heuristic for the travelling salesman problem. Technical report, Carnegie-Mellon Univ Pittsburgh Pa
Management Sciences Research Group, 1976.
G. Clarke and J. W. Wright. Scheduling of Vehicles from a Central Depot to a Number of Delivery Points. Operations Research, 12(4):568–581,
1964. ISSN 0030-364X. doi: 10.1287/opre.12.4.568.
CPLEX. IBM ILOG CPLEX optimization studio. Version, 12:1987–2018, 1987.
A. Croes. A method for solving traveling salesman problems. Operations Research, 6:791–812, 1958. ISSN 0030-364X. doi: 10.1287/opre.6.6.791.
H. Dai, B. Dai, and L. Song. Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd International
Conference on Machine Learning, ICML, 2016. ISBN 9781510829008.
G. Dantzig, R. Fulkerson, and S. Johnson. Solution of a large-scale traveling-salesman problem. Journal of the operations research society of
America, 2(4):393–410, 1954.
G. B. Dantzig and M. N. Thapa. Linear programming 1: Introduction. Springer International Publishing, New York, NY, 1997. doi: https:
//doi.org/10.1007/b97672.
L. De Moura and N. Bjørner. Z3: An efficient smt solver. In International conference on Tools and Algorithms for the Construction and Analysis
of Systems, pages 337–340. Springer, 2008. ISBN 3540787992. doi: 10.1007/978-3-540-78800-3_24.
M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L.-M. Rousseau. Learning heuristics for the TSP by policy gradient. In Lecture Notes in Com-
puter Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018. ISBN 9783319930305.
doi: 10.1007/978-3-319-93031-2_12.
I. Dinur and S. Safra. On the hardness of approximating minimum vertex cover. Annals of Mathematics, pages 439–485, 2005. ISSN 0003486X.
doi: 10.4007/annals.2005.162.439.
I. Drori, A. Kharkar, W. R. Sickinger, B. Kates, Q. Ma, S. Ge, E. Dolev, B. Dietrich, D. P. Williamson, and M. Udell. Learning to solve combinatorial
optimization problems on real-world graphs in linear time, 2020.
L. Duan, H. Hu, Y. Qian, Y. Gong, X. Zhang, J. Wei, and Y. Xu. A multi-task selected learning approach for solving 3d flexible bin packing
problem. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS, page 1386–1394,
2019. ISBN 9781510892002.
N. Elsokkary, F. S. Khan, D. La Torre, T. S. Humble, and J. Gottlieb. Financial portfolio management using d-wave quantum optimizer: The case
of abu dhabi securities exchange. Technical report, Oak Ridge National Lab.(ORNL), Oak Ridge, TN (United States), 2017.
P. Emami and S. Ranka. Learning permutations with sinkhorn policy gradient, 2018.
T. A. Feo, M. G. Resende, and S. H. Smith. A greedy randomized adaptive search procedure for maximum independent set. Operations Research,
42(5):860–878, 1994.
E. Filiol, E. Franc, A. Gubbioli, B. Moquet, and G. Roblot. Combinatorial optimisation of worm propagation on an unknown network. International
Journal of Computer Science, 2(2):124–130, 2007.
E. J. Gardiner, P. Willett, and P. J. Artymiuk. Graph-theoretic techniques for macromolecular docking. Journal of Chemical Information and
Computer Sciences, 40(2):273–279, 2000.
A. Gleixner, L. Eifler, T. Gally, G. Gamrath, P. Gemander, R. L. Gottwald, G. Hendel, C. Hojny, T. Koch, M. Miltenberger, B. Müller, M. E. Pfetsch,
C. Puchert, D. Rehfeldt, F. Schlösser, F. Serrano, Y. Shinano, J. M. Viernickel, S. Vigerske, D. Weninger, J. T. Witt, and J. Witzig. The SCIP
Optimization Suite 5.0. Technical report, Optimization Online, December 2017. URL http://www.optimization-online.org/DB_
HTML/2017/12/6385.html.
M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite
programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995. ISSN 1557735X. doi: 10.1145/227683.227684.
T. F. Gonzalez. Handbook of approximation algorithms and metaheuristics. CRC Press, 2007. ISBN 9781420010749. doi: 10.1201/
9781420010749.
I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.
E. Groshev, M. Goldstein, A. Tamar, S. Srivastava, and P. Abbeel. Learning generalized reactive policies using deep neural networks. In Proceedings
of the 28th International Conference on Automated Planning and Scheduling, ICAPS, pages 408–416, 2018.
S. Gu and Y. Yang. A deep learning algorithm for the max-cut problem based on pointer network structure with supervised learning and reinforce-
ment learning strategies. Mathematics, 8(2):298, Feb 2020. ISSN 2227-7390. doi: 10.3390/math8020298. URL http://dx.doi.org/
10.3390/math8020298.
T. Guo, C. Han, S. Tang, and M. Ding. Solving combinatorial problems with machine learning methods. In Nonlinear Combinatorial Optimization,
pages 207–229. Springer International Publishing, Cham, 2019. ISBN 978-3-030-16194-1. doi: 10.1007/978-3-030-16194-1_9.
L. Gurobi Optimization. Gurobi optimizer reference manual, 2020. URL http://www.gurobi.com.
P. Hansen, N. Mladenović, and D. Urošević. Variable neighborhood search for the maximum clique. Discrete Applied Mathematics, 145(1):
117–125, 2004.
M. Held and R. M. Karp. A dynamic programming approach to sequencing problems. Journal of the Society for Industrial and Applied mathematics,
10(1):196–210, 1962.
K. Helsgaun. An effective implementation of the lin–kernighan traveling salesman heuristic. European Journal of Operational Research, 126(1):
106–130, 2000.
K. Helsgaun. An extension of the lin-kernighan-helsgaun tsp solver for constrained traveling salesman and vehicle routing problems. Technical
report, Roskilde University, 2017.
T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. Deep q-learning from
demonstrations. In Proceedings of the 32nd Conference on Artificial Intelligence, AAAI, 2018. ISBN 9781577358008.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
H. Hu, X. Zhang, X. Yan, L. Wang, and Y. Xu. Solving a new 3d bin packing problem with deep reinforcement learning method, 2017.
G. Karakostas. A better approximation ratio for the vertex cover problem. ACM Transactions on Algorithms (TALG), 5(4):1–8, 2009.
R. M. Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85–103. Springer, 1972. doi: 10.1007/
978-1-4684-2001-2_9.
K. Katayama, A. Hamamoto, and H. Narihisa. An effective local search for the maximum clique problem. Information Processing Letters, 95(5):
503–511, 2005.
H. Kellerer, U. Pferschy, and D. Pisinger. Multidimensional knapsack problems. In Knapsack problems, pages 235–283. Springer, 2004.
E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. In Proceedings of the 31st
Conference on Advances in Neural Information Processing Systems, NeurIPS, 2017.
T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Represen-
tations (ICLR), 2017.
W. Kool, H. van Hoof, and M. Welling. Attention, learn to solve routing problems! In Proceedings of the 7th International Conference on Learning
Representations, ICLR, 2019.
R. E. Korf. An improved algorithm for optimal bin packing. In IJCAI, volume 3, pages 1252–1258. Citeseer, 2003.
B. Korte, J. Vygen, B. Korte, and J. Vygen. Combinatorial optimization, volume 2. Springer, 2012.
S. Lamm, P. Sanders, and C. Schulz. Graph partitioning for independent sets. In International Symposium on Experimental Algorithms, pages
68–81. Springer, 2015.
S. Lamm, P. Sanders, C. Schulz, D. Strash, and R. F. Werneck. Finding near-optimal independent sets at scale. In 2016 Proceedings of the
Eighteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 138–150. SIAM, 2016.
G. Lancia, V. Bafna, S. Istrail, R. Lippert, and R. Schwartz. Snps problems, complexity, and algorithms. In European symposium on algorithms,
pages 182–193. Springer, 2001.
A. Laterre, Y. Fu, M. K. Jabri, A.-S. Cohen, D. Kas, K. Hajjar, T. S. Dahl, A. Kerkeni, and K. Beguir. Ranked reward: Enabling self-play
reinforcement learning for combinatorial optimization, 2018.
T. Leleu, Y. Yamamoto, P. L. McMahon, and K. Aihara. Destabilization of local minima in analog spin systems by correction of amplitude
heterogeneity. Physical review letters, 122(4), 2019. ISSN 10797114. doi: 10.1103/PhysRevLett.122.040607.
D. Li, C. Ren, Z. Gu, Y. Wang, and F. Lau. Solving packing problems by conditional query learning, 2020. URL https://openreview.
net/forum?id=BkgTwRNtPB.
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning.
In Proceedings of the 4th International Conference on Learning Representations, ICLR, 2016.
S. Lin and B. W. Kernighan. An effective heuristic algorithm for the traveling-salesman problem. Operations research, 21(2):498–516, 1973.
A. Lodi, S. Martello, and D. Vigo. Heuristic algorithms for the three-dimensional bin packing problem. European Journal of Operational Research,
141(2):410–420, 2002.
H. Lu, X. Zhang, and S. Yang. A learning-based iterative method for solving vehicle routing problems. In International Conference on Learning
Representations, 2020. URL https://openreview.net/forum?id=BJe1334YDH.
Q. Ma, S. Ge, D. He, D. Thaker, and I. Drori. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. In
AAAI Workshop on Deep Learning on Graphs: Methodologies and Applications, AAAI, 2020.
A. Makhorin. Glpk (gnu linear programming kit), 2012. URL http://www.gnu.org/software/glpk/glpk.html.
S. Manchanda, A. Mittal, A. Dhawan, S. Medya, S. Ranu, and A. K. Singh. Learning heuristics over large graphs via deep reinforcement learning.
CoRR, abs/1903.03332, 2019. URL http://arxiv.org/abs/1903.03332.
S. Martello and P. Toth. Bin-packing problem. Knapsack problems: Algorithms and computer implementations, pages 221–245, 1990a.
S. Martello and P. Toth. Lower bounds and reduction procedures for the bin packing problem. Discrete applied mathematics, 28(1):59–70, 1990b.
O. Mersmann, B. Bischl, J. Bossek, H. Trautmann, M. Wagner, and F. Neumann. Local search and the traveling salesman problem: A feature-based
characterization of problem hardness. In International Conference on Learning and Intelligent Optimization, pages 115–129. Springer, 2012.
C. E. Miller, A. W. Tucker, and R. A. Zemlin. Integer programming formulation of traveling salesman problems. Journal of the ACM (JACM), 7
(4):326–329, 1960.
M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press,
USA, 2005. ISBN 9780521835402.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 2015. ISSN 14764687. doi: 10.1038/nature14236.
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement
learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, volume 48 of ICML, page
1928–1937, 2016. ISBN 9781510829008.
M. Nazari, A. Oroojlooy, L. Snyder, and M. Takác. Reinforcement learning for solving the vehicle routing problem. In Proceedings of the 32nd
Conference on Advances in Neural Information Processing Systems, NeurIPS, pages 9839–9849, 2018.
C. H. Papadimitriou and K. Steiglitz. Combinatorial optimization: algorithms and complexity. Courier Corporation, 1998. ISBN 9780486402581.
A. Perdomo-Ortiz, N. Dickson, M. Drew-Brook, G. Rose, and A. Aspuru-Guzik. Finding low-energy conformations of lattice protein models by
quantum annealing. Scientific reports, 2:571, 2012. ISSN 20452322. doi: 10.1038/srep00571.
L. Perron and V. Furnon. Or-tools, 2019. URL https://developers.google.com/optimization/.
W. Pullan and H. H. Hoos. Dynamic local search for the maximum clique problem. Journal of Artificial Intelligence Research, 25:159–185, 2006.
L. Schrage. Linear, Integer, and Quadratic Programming with LINDO: User’s Manual, 1986.
E. L. Schreiber and R. E. Korf. Improved bin completion for optimal bin packing and number partitioning. In Twenty-Third International Joint
Conference on Artificial Intelligence, 2013.
J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and
D. Silver. Mastering atari, go, chess and shogi by planning with a learned model, 2019.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017.
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot,
et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. ISSN 14764687. doi:
10.1038/nature16961.
J. Song, R. Lanka, Y. Yue, and M. Ono. Co-training for policy learning. In Proceedings of the 35th Conference on Uncertainty in Artificial
Intelligence, UAI 2019, volume 115 of Proceedings of Machine Learning Research, pages 1191–1201, Tel Aviv, Israel, 22–25 Jul 2020. PMLR.
K. Subhash, D. Minzer, and M. Safra. Pseudorandom sets in grassmann graph have near-perfect expansion. In 2018 IEEE 59th Annual Symposium
on Foundations of Computer Science (FOCS), pages 592–601. IEEE, 2018.
R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 1988. ISSN 15730565. doi: 10.1023/A:
1022633531479.
R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In
Advances in neural information processing systems, pages 1057–1063, 2000. ISBN 0262194503.
Y. Tang, S. Agrawal, and Y. Faenza. Reinforcement learning for integer programming: Learning to cut. In Proceedings of the International
Conference on Machine Learning, ICML, pages 1483–1492, 2020.
R. E. Tarjan and A. E. Trojanowski. Finding a maximum independent set. SIAM Journal on Computing, 6(3):537–546, 1977.
The Sage Developers. SageMath, the Sage Mathematics Software System (Version 9.0.0), 2020. URL https://www.sagemath.org.
E. S. Tiunov, A. E. Ulanov, and A. Lvovsky. Annealing by simulating the coherent ising machine. Optics express, 27(7):10288–10295, 2019. ISSN