The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Combining Reinforcement Learning and Constraint Programming for Combinatorial Optimization
Quentin Cappart (1, 2), Thierry Moisan (2), Louis-Martin Rousseau (1), Isabeau Prémont-Schwarz (2), Andre A. Cire (3)
(1) Ecole Polytechnique de Montréal, Montreal, Canada
(2) Element AI, Montreal, Canada
(3) University of Toronto Scarborough, Toronto, Canada
{quentin.cappart, louis-martin.rousseau}@polymtl.ca
[email protected]
[email protected]
[email protected]
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Combinatorial optimization has found applications in numerous fields, from aerospace to transportation planning and economics. The goal is to find an optimal solution among a finite set of possibilities. The well-known challenge one faces with combinatorial optimization is the state-space explosion problem: the number of possibilities grows exponentially with the problem size, which makes solving intractable for large problems. In recent years, deep reinforcement learning (DRL) has shown its promise for designing good heuristics dedicated to solving NP-hard combinatorial optimization problems. However, current approaches have an important shortcoming: they only provide an approximate solution with no systematic way to improve it or to prove optimality. In another context, constraint programming (CP) is a generic tool to solve combinatorial optimization problems. Based on a complete search procedure, it will always find the optimal solution if we allow a large enough execution time. A critical design choice, which makes CP non-trivial to use in practice, is the branching decision, directing how the search space is explored. In this work, we propose a general and hybrid approach, based on DRL and CP, for solving combinatorial optimization problems. The core of our approach is based on a dynamic programming formulation that acts as a bridge between both techniques. We experimentally show that our solver is efficient at solving three challenging problems: the traveling salesman problem with time windows, the 4-moments portfolio optimization problem, and the 0-1 knapsack problem. The results obtained show that the framework introduced outperforms the stand-alone RL and CP solutions, while being competitive with industrial solvers.

Introduction

The design of efficient algorithms for solving NP-hard problems, such as combinatorial optimization problems (COPs), has long been an active field of research (Wolsey and Nemhauser 1999). Broadly speaking, there exist two main families of approaches for solving COPs, each of them having pros and cons. On the one hand, exact algorithms are based on a complete and clever enumeration of the solution space (Lawler and Wood 1966; Rossi, Van Beek, and Walsh 2006). Such algorithms will eventually find the optimal solution, but they may be prohibitive for solving large instances because of the exponential increase of the execution time. That being said, well-designed exact algorithms can nevertheless be used to obtain sub-optimal solutions by interrupting the search before its termination. This flexibility makes exact methods appealing and practical, and as such they constitute the core of modern optimization solvers such as CPLEX (Cplex 2009), Gurobi (Optimization 2014), or Gecode (Schulte, Lagerkvist, and Tack 2006). It is the case of constraint programming (CP) (Rossi, Van Beek, and Walsh 2006), which has the additional asset of being a generic tool that can be used to solve a large variety of COPs, whereas mixed integer programming (MIP) (Bénichou et al. 1971) solvers only deal with linear problems and limited non-linear cases. A critical design choice in CP is the branching strategy, i.e., directing how the search space must be explored. Naturally, well-designed heuristics are more likely to discover promising solutions, whereas bad heuristics may bring the search into a fruitless subpart of the solution space. In general, the choice of an appropriate branching strategy is non-trivial and the design of such strategies is a hot topic in the CP community (Palmieri, Régin, and Schaus 2016; Fages and Prud'Homme 2017; Laborie 2018).

On the other hand, heuristic algorithms (Aarts and Lenstra 2003; Gendreau and Potvin 2005) are incomplete methods that can compute solutions efficiently, but are not able to prove the optimality of a solution. They also often require substantial problem-specific knowledge for building them. In recent years, deep reinforcement learning (DRL) (Sutton, Barto et al. 1998; Arulkumaran et al. 2017) has shown its promise to obtain high-quality approximate solutions to some NP-hard COPs (Bello et al. 2016; Khalil et al. 2017; Deudon et al. 2018; Kool, Van Hoof, and Welling 2018). Once a model has been trained, the execution time is typically negligible in practice. The good results obtained suggest that DRL is a promising new tool for efficiently finding good approximate solutions to NP-hard problems, provided that (1) we know the distribution of problem instances and (2) we have enough instances sampled from this distribution for training the model. Nonetheless, current methods have shortcomings. Firstly, they are mainly dedicated to solving a specific problem,
such as the travelling salesman problem (TSP), with the noteworthy exceptions of Khalil et al. (2017), who tackle three other graph-based problems, and of Kool, Van Hoof, and Welling (2018), who target routing problems. Secondly, they are only designed to act as a constructive heuristic, and come with no systematic way to improve the solutions obtained, unlike complete methods such as CP.

As both exact approaches and learning-based heuristics have strengths and weaknesses, a natural question arises: How can we leverage these strengths together in order to build a better tool to solve combinatorial optimization problems? In this work, we show that it can be successfully done by the combination of reinforcement learning and constraint programming, using dynamic programming as a bridge between both techniques. Dynamic programming (DP) (Bellman 1966), which has found successful applications in many fields (Godfrey and Powell 2002; Topaloglou, Vladimirou, and Zenios 2008; Tang, Mu, and He 2017; Ghasempour and Heydecker 2019), is an important technique for modelling COPs. In its simplest form, DP consists in breaking a problem into sub-problems that are linked together through a recursive formulation (i.e., the well-known Bellman equation). The main issue with exact DP is the so-called curse of dimensionality: the number of generated sub-problems grows exponentially, to the point that it becomes infeasible to store all of them in memory.

This paper proposes a generic and complete solver, based on DRL and CP, in order to solve COPs that can be modelled using DP. Our detailed contributions are as follows: (1) A new encoding able to express a DP model of a COP as an RL environment and a CP model; (2) The use of two standard RL training procedures, deep Q-learning and proximal policy optimization, for learning an appropriate CP branching strategy. The training is done using randomly generated instances sampled from a similar distribution to those we want to solve; (3) The integration of the learned branching strategies into three CP search strategies, namely branch-and-bound, iterative limited discrepancy search, and restart-based search; (4) Promising results on three challenging COPs, namely the travelling salesman problem with time windows, the 4-moments portfolio optimization problem, and the 0-1 knapsack problem; (5) The open-source release of our code and models, in order to ease future research in this field (1).

(1) https://github.com/qcappart/hybrid-cp-rl-solver

In general, as there is no underlying hypothesis such as linearity or convexity, a DP cannot be trivially encoded and solved by standard integer programming techniques (Bergman and Cire 2018). It is one of the reasons that drove us to consider CP for the encoding. The next section presents the hybrid solving process that we designed. Then, experiments on the two case studies are carried out. Finally, a discussion on the current limitations of the approach and the next research opportunities is proposed.

A Unifying Representation Combining Learning and Searching

Because of the state-space explosion, solving NP-hard COPs remains a challenge. In this paper, we propose a generic and complete solver, based on DRL and CP, in order to solve COPs that can be modelled using DP. This section describes the complete architecture of the framework we propose. A high-level picture of the architecture is shown in Figure 1. It is divided into three parts: the learning phase, the solving phase, and the unifying representation, acting as a bridge between the two phases. Each part contains several components. Green blocks and arrows represent the original contributions of this work and blue blocks correspond to known algorithms that we adapted for our framework.

[Figure 1: Overview of our framework for solving COPs. Panels: learning phase (training instances, randomly generated; reinforcement learning with environment and agent), unifying representation (combinatorial optimization problem; DP model; dominance pruning rules), solving phase (evaluated instances; constraint programming with model and search; solution). A value-selection heuristic links the agent to the search.]

Dynamic Programming Model

Dynamic programming (DP) (Bellman 1966) is a technique combining both mathematical modeling and computer programming for solving complex optimization problems, such as NP-hard problems. In its simplest form, it consists in breaking a problem into sub-problems and linking them through a recursive formulation. The initial problem is then solved recursively, and the optimal values of the decision variables are recovered successively by tracking back the information already computed. Let us consider a general COP $\mathcal{Q} : \{\max f(x) : x \in X \subseteq \mathbb{Z}^n\}$, where $x_i$ with $i \in \{1..n\}$ are $n$ discrete variables that must be assigned in order to maximize a function $f(x)$. In the DP terminology, and assuming a fixed variable ordering where a decision has to be taken at each stage, the decision variables of $\mathcal{Q}$ are referred to as the controls ($x_i$). They take values from their domain $D(x_i)$, and enforce a transition ($T : S \times X \to S$) from a state ($s_i$) to another one ($s_{i+1}$), where $S$ is the set of states. The initial state ($s_1$) is known and a transition is done at each stage ($i \in \{1, \dots, n\}$) until all the variables have been assigned. Besides, a reward ($R : S \times X \to \mathbb{R}$) is induced after each transition. Finally, a DP model can also contain validity conditions ($V : S \times X \to \{0, 1\}$) and dominance rules ($P : S \times X \to \{0, 1\}$) that restrict the set of feasible actions. The difference between both is that validity conditions are mandatory to ensure the correctness of the DP model ($V(s, x) = 0 \Leftrightarrow T(s, x) = \bot$) whereas the dominance rules are only used for efficiency purposes ($P(s, x) = 0 \Rightarrow T(s, x) = \bot$), where $\Leftrightarrow$, $\Rightarrow$, and $\bot$ represent the equivalence, the implication, and the unfeasible state, respectively. A DP model for a COP can then be modelled as a tuple $\langle S, X, T, R, V, P \rangle$. The problem can be solved recursively using the Bellman equation, where $g_i : S \to \mathbb{R}$ is a state-value function representing the optimal reward of being at state $s_i$ at stage $i$:

    $g_i(s_i) = \max \{ R(s_i, x_i) + g_{i+1}(T(s_i, x_i)) \}$    (1)

This applies $\forall i \in \{1..n\}$ and such that $T(s_i, x_i) \neq \bot$. The reward is equal to zero for the final state ($g_{n+1}(s_{n+1}) = 0$) and is backtracked until $g_1(s_1)$ has been computed. This last value gives the optimal cost of $\mathcal{Q}$. Then, by tracing the values assigned to the variables $x_i$, the optimal solution is recovered. Unfortunately, DP suffers from the well-known curse of dimensionality, which prevents its use when dealing with problems involving large state/control spaces. A partial solution to this problem is to prune dominated actions ($P(s, x) = 0$). An action is dominated if it is valid according to the recursive formulation, but either (1) is strictly worse than another action, or (2) cannot lead to a feasible solution. In practice, pruning such dominated actions can have a huge impact on the size of the search space, but identifying them is not trivial, as assessing those two conditions precisely is problem-dependent. Besides, even after pruning the dominated actions, the size of the state-action space may still be too large to be completely explored in practice.

RL Encoding

An introduction to reinforcement learning is proposed in appendices. Note that all the sets used to define an RL environment are written in a sans-serif font ($\mathsf{S}$, $\mathsf{A}$, $\mathsf{T}$, $\mathsf{R}$) to distinguish them from their DP counterparts. Encoding the DP formulation into an RL environment requires defining, adequately, the set of states, the set of actions, the transition function, and the reward function, as the tuple $\langle \mathsf{S}, \mathsf{A}, \mathsf{T}, \mathsf{R} \rangle$, from the DP model $\langle S, X, T, R, V, P \rangle$ and a specific instance $\mathcal{Q}_p$ of the COP that we are considering. The initial state of the RL environment corresponds to the first stage of the DP model, where no variable has been assigned yet.

State. For each stage $i$ of the DP model, we define the RL state $\mathsf{s}_i$ as the pair $(\mathcal{Q}_p, s_i)$, where $s_i \in S$ is the DP state at the same stage $i$, and $\mathcal{Q}_p$ is the problem instance we are considering. Note that the second part of the state ($s_i$) is dynamic, as it depends on the current stage $i$ in the DP model, or equivalently, on the current time-step of the RL episode, whereas the first part ($\mathcal{Q}_p$) is static, as it remains the same for the whole episode. In practice, each state is embedded into a tensor of features, as it serves as input of a neural network.

Action. Given a state $s_i$ from the DP model at stage $i$ and its control $x_i$, an action $a_i \in \mathsf{A}$ at a state $\mathsf{s}_i$ has a one-to-one relationship with the control $x_i$. The action $a_i$ can be done if and only if $x_i$ is valid under the DP model. The idea is to allow only actions that are consistent with regard to the DP model, the validity conditions, and the eventual dominance conditions. Formally, the set of feasible actions $\mathsf{A}_i$ at stage $i$ is as follows:

    $\mathsf{A}_i = \{ v_i \mid v_i \in D(x_i) \wedge V(s_i, v_i) = 1 \wedge P(s_i, v_i) = 1 \}$    (2)

Transition. The RL transition $\mathsf{T}$ gives the state $\mathsf{s}_{i+1}$ from $\mathsf{s}_i$ and $a_i$ in the same way as the transition function $T$ of the DP model gives a state $s_{i+1}$ from a previous state $s_i$ and a control value $v_i$. Formally, we have the deterministic transition:

    $\mathsf{s}_{i+1} = \mathsf{T}(\mathsf{s}_i, a_i) = (\mathcal{Q}_p, T(s_i, a_i)) = (\mathcal{Q}_p, T(s_i, v_i))$    (3)

Reward. An initial idea for designing the RL reward function $\mathsf{R}$ is to use the reward function $R$ of the DP model using the current state $s_i$ and the action $a_i$ that has been selected. However, performing a sequence of actions in a DP subject to validity conditions can lead to a state with no solutions, which must be avoided. Such a situation happens when a state with no action is reached whereas at least one control $x \in X$ has not been assigned to a value $v$. Finding a feasible solution first must then be prioritized over maximizing the DP reward, and this is not captured by this simple form of the RL reward. Based on this, two properties must be satisfied in order to ensure that the reward will drive the RL agent to the optimal solution of the COP: (1) the reward collected through an episode $e_1$ must be lower than the reward of an episode $e_2$ if the COP solution of $e_1$ is worse than the one obtained with $e_2$, and (2) the total reward collected through an episode giving an unfeasible solution must be lower than the reward of any episode giving a feasible solution. A formal definition of these properties is proposed in the supplementary material. By doing so, we ensure that the RL agent has an incentive to find, first, feasible solutions (i.e., maximizing the first term is more rewarding), and, then, the best ones (i.e., maximizing the second term). The reward we designed is as follows: $\mathsf{R}(\mathsf{s}, a) = \rho \times (1 + |\mathrm{UB}(\mathcal{Q}_p)| + R(s, a))$, where $\mathrm{UB}(\mathcal{Q}_p)$ corresponds to an upper bound of the objective value that can be reached for the COP $\mathcal{Q}_p$. The term $1 + |\mathrm{UB}(\mathcal{Q}_p)|$ is a constant factor that gives a strict upper bound on the reward of any solution of the DP and drives the agent to progress into a feasible solution first. For the travelling salesman problem with time windows, this bound can be, for instance, the maximum distance that can be traveled in a complete tour (computed in $O(1)$). This term is required in order to prioritize the fact that we want a feasible solution first. The absolute value ensures that the term is positive and is used to negate the effect of negative rewards that may lead the agent to stop the episode as soon as possible. The second term $R(s, a)$ then forces the agent to find the best feasible solution. Finally, a scaling factor $\rho \in \mathbb{R}$ can also be added in order to compress the space of rewards into a smaller interval near zero. Note that for DP models having only feasible solutions, the first term can be omitted.

Learning Algorithm

We implemented two different agents, one based on a value-based method (DQN) and a second one based on policy gradient (PPO). In both cases, the agent is used to parametrize the weight vector ($w$) of a neural network giving either the Q-values (DQN), or the policy probabilities (PPO). The training is done using randomly generated instances sampled from a similar distribution to those we want to solve. It is important to mention that this learning procedure makes the assumption that we have a generator able to create random instances
($\mathcal{Q}_p$) that follow the same distribution as the ones we want to tackle, or a sufficient number of similar instances from past data. Such an assumption is common in the vast majority of works tackling NP-hard problems using ML (Khalil et al. 2017; Kool, Van Hoof, and Welling 2018; Cappart et al. 2019), and, despite being strong, has nonetheless a practical interest when repeatedly solving similar instances of the same problem (e.g., package shipping by retailers).

Neural Network Architecture

In order to ensure the genericity and the efficiency of the framework, we have two requirements for designing the neural network architecture: it must (1) be able to handle instances of the same COP that have a different number of variables (i.e., be able to operate on non-fixed dimensional feature vectors) and (2) be invariant to input permutations. In other words, encoding variables $x_1$, $x_2$, and $x_3$ should produce the same prediction as encoding $x_3$, $x_1$, and $x_2$. A first option is to embed the variables into a set transformer architecture (Lee et al. 2018), which satisfies these two requirements. Besides, many COPs also have a natural graph structure that can be exploited. For this reason, we also considered another embedding based on graph attention networks (GAT) (Veličković et al. 2017). The embedding, obtained using either GAT or set transformer, can then be used as an input of a feed-forward network to get a prediction. Case studies will show a practical application of both architectures. For the DQN network, the dimension of the last layer output corresponds to the total number of actions for the COP, and this layer outputs an estimation of the Q-values for each of them. The output is then masked in order to remove the unfeasible actions. Concerning PPO, distinct networks for the actor and the critic are built. The last layer of the critic outputs only a single value. The actor is similar to the DQN case, but a softmax selection is used after the last layer in order to obtain the probability of selecting each action.

CP Encoding

An introduction to constraint programming is proposed in appendices. Note that the teletype font is used to refer to CP notations. This section describes how a DP formulation can be encoded in a CP model. Modeling a problem using CP consists in defining the tuple $\langle \mathtt{X}, \mathtt{D}, \mathtt{C}, \mathtt{O} \rangle$ where $\mathtt{X}$ is the set of variables, $\mathtt{D}(\mathtt{X})$ is the set of domains, $\mathtt{C}$ is the set of constraints, and $\mathtt{O}$ is the objective function. Let us consider the DP formulation $\langle S, X, T, R, V, P \rangle$, with $n$ the number of stages.

Variables and domains. We make a distinction between the decision variables, on which the search is performed, and the auxiliary variables that are linked to the decision variables, but that are not branched on during the search. The encoding involves two variables per stage: (1) $\mathtt{x}^s_i \in \mathtt{X}$ is an auxiliary variable representing the current state at stage $i$, whereas (2) $\mathtt{x}^a_i \in \mathtt{X}$ is a decision variable representing the action done at this state, similarly to the regular decomposition (Pesant 2004). Besides, a last auxiliary variable is considered for the stage $n + 1$, which represents the final state of the system. In the optimal solution, the variables thus indicate the best state that can be reached at each stage, and the best action to select as well.

Constraints. The constraints of our encoding have two purposes. Firstly, they must ensure the consistency of the DP formulation. It is done (1) by setting the initial state to a value (e.g., $\epsilon$), (2) by linking the state of each stage to the previous one through the transition function ($T$), and finally (3) by enforcing each transition to be valid, in the sense that it can only generate a feasible state of the system. Secondly, other constraints are added in order to remove dominated actions and the subsequent states. In the CP terminology, such constraints are called redundant constraints: they do not change the semantics of the model, but speed up the search. The constraints inferred by our encoding are as follows, where $\mathtt{validityCondition}$ and $\mathtt{dominanceCondition}$ are both Boolean functions detecting non-valid transitions and dominated actions, respectively.

    $\mathtt{x}^s_1 = \epsilon$    (4)
    $\forall i \in \{1, \dots, n\} : \mathtt{x}^s_{i+1} = T(\mathtt{x}^s_i, \mathtt{x}^a_i)$    (5)
    $\forall i \in \{1, \dots, n\} : \mathtt{validityCondition}(\mathtt{x}^s_i, \mathtt{x}^a_i)$    (6)
    $\forall i \in \{1, \dots, n\} : \mathtt{dominanceCondition}(\mathtt{x}^s_i, \mathtt{x}^a_i)$    (7)

Setting the initial state is done in Eq. (4), enforcing the transition function in Eq. (5), keeping only the valid transitions in Eq. (6), and pruning the dominated states in Eq. (7).

Objective function. The goal is to maximize the accumulated sum of rewards generated through the transitions ($R : S \times A \to \mathbb{R}$) during the $n$ stages: $\max_{\mathtt{x}^a} \left( \sum_{i=1}^{n} R(\mathtt{x}^s_i, \mathtt{x}^a_i) \right)$. Note that the optimization and branching selection is done only on the decision variables ($\mathtt{x}^a$).

Search Strategy

From a single DP formulation, we are able to (1) build an RL environment dedicated to learning the best actions to perform, and (2) state a CP model of the same problem (Figure 1). This consistency is at the heart of the framework. This section shows how the knowledge learned during the training phase can be transferred into the CP search. We considered three standard CP-specific search strategies: depth-first branch-and-bound search (BaB) and iterative limited discrepancy search (ILDS), which are able to leverage knowledge learned with a value-based method such as DQN, and restart-based search (RBS), working together with policy gradient methods. The remainder of this section presents how to plug a learned heuristic inside these three search strategies.

Depth-First Branch-and-Bound Search with DQN

This search works in a depth-first fashion. When a feasible solution has been found, a new constraint ensuring that the next solution has to be better than the current one is added. In case of an unfeasible solution due to an empty domain being reached, the search backtracks to the previous decision. With this procedure, and provided that the whole search space has been explored, the last solution found is then proven to be optimal.
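As a concrete illustration of the DP formulation (states, controls, transition, reward, validity condition) from which both the RL environment and the CP model are derived, the following minimal Python sketch writes these components for a toy 0-1 knapsack instance and solves it with the Bellman recursion of Eq. (1). The instance data and the function names (`transition`, `reward`, `valid`, `g`) are illustrative assumptions and are not taken from the released code. The state is assumed to be the load accumulated so far, stages are 0-indexed, and no dominance rule is used.

```python
from functools import lru_cache

# Toy 0-1 knapsack instance (illustrative values, not from the paper).
WEIGHTS = [2, 3, 4]
PROFITS = [3, 4, 6]
CAPACITY = 5
N = len(WEIGHTS)


def transition(stage: int, load: int, x: int) -> int:
    """T: next state (accumulated load) after applying control x at this stage."""
    return load + x * WEIGHTS[stage]


def reward(stage: int, x: int) -> int:
    """R: profit collected by control x at this stage."""
    return x * PROFITS[stage]


def valid(stage: int, load: int, x: int) -> bool:
    """V: a control is valid only if the capacity is not exceeded."""
    return load + x * WEIGHTS[stage] <= CAPACITY


@lru_cache(maxsize=None)
def g(stage: int, load: int) -> int:
    """Bellman recursion: best total reward obtainable from (stage, load) onwards."""
    if stage == N:  # final state: nothing left to collect
        return 0
    # x = 0 is always valid here, so the max is never taken over an empty set.
    return max(
        reward(stage, x) + g(stage + 1, transition(stage, load, x))
        for x in (0, 1)
        if valid(stage, load, x)
    )


optimal_value = g(0, 0)
```

Memoizing `g` on the pair `(stage, load)` is what exact DP amounts to; the number of such pairs is precisely what becomes too large to store on realistic instances.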

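A depth-first branch-and-bound guided by a value-selection heuristic can be sketched in plain Python as follows. This is not the Gecode-based implementation of the paper: the toy knapsack data, the optimistic bound (sum of the profits of the undecided items), and `q_values`, a hand-written stand-in for a trained state-action value function, are all assumptions made for illustration. The cache stores the scores computed for each visited state so that they are not recomputed.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy 0-1 knapsack instance (illustrative values, not from the paper).
WEIGHTS = [2, 3, 4, 5]
PROFITS = [3, 4, 6, 7]
CAPACITY = 8
N = len(WEIGHTS)

State = Tuple[int, int]  # (stage, accumulated load)
Heuristic = Callable[[State], Dict[int, float]]


def q_values(state: State) -> Dict[int, float]:
    """Hypothetical stand-in for a learned Q function: prefers taking the item."""
    stage, _ = state
    return {0: 0.0, 1: float(PROFITS[stage])}


def bab_search(q: Heuristic) -> Tuple[float, Optional[Tuple[int, ...]]]:
    """Depth-first branch-and-bound; values are tried in decreasing score order."""
    # Optimistic bound: total profit of the items not decided yet.
    remaining = [0] * (N + 1)
    for i in range(N - 1, -1, -1):
        remaining[i] = remaining[i + 1] + PROFITS[i]

    best_value: float = float("-inf")
    best_assignment: Optional[Tuple[int, ...]] = None
    cache: Dict[State, Dict[int, float]] = {}

    def dfs(stage: int, load: int, value: int, assignment: List[int]) -> None:
        nonlocal best_value, best_assignment
        if stage == N:
            if value > best_value:  # new incumbent: later solutions must beat it
                best_value, best_assignment = value, tuple(assignment)
            return
        if value + remaining[stage] <= best_value:
            return  # bound: this subtree cannot strictly improve the incumbent
        state = (stage, load)
        if state not in cache:  # reuse scores of states already seen
            cache[state] = q(state)
        scores = cache[state]
        feasible = [x for x in (0, 1) if load + x * WEIGHTS[stage] <= CAPACITY]
        for x in sorted(feasible, key=lambda a: scores[a], reverse=True):
            assignment.append(x)
            dfs(stage + 1, load + x * WEIGHTS[stage], value + x * PROFITS[stage], assignment)
            assignment.pop()  # backtrack to the previous decision

    dfs(0, 0, 0, [])
    return best_value, best_assignment


value, assignment = bab_search(q_values)
```

Because the search is complete, the heuristic only changes the order in which solutions are found, not the final answer: a poor heuristic yields the same optimum after exploring more nodes.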
This search requires a good heuristic for the value-selection. This can be achieved by a value-based RL agent, such as DQN. After the training, the agent gives a parametrized state-action value function $\hat{Q}(s, a, w)$, and a greedy policy can be used for the value-selection heuristic, which is intended to be of a high, albeit non-optimal, quality. The variable ordering must follow the same order as the DP model in order to keep the consistency between both encodings. As highlighted in other works (Cappart et al. 2019), an appropriate variable ordering has an important impact when solving DPs. However, such an analysis goes beyond the scope of this work.

Algorithm 1: BaB-DQN Search Procedure.
    ▷ Pre: Q_p is a COP having a DP formulation.
    ▷ Pre: w is a trained weight vector.
    ⟨X, D, C, O⟩ := CPEncoding(Q_p)
    K := ∅
    Ψ := BaB-search(⟨X, D, C, O⟩)
    while Ψ is not completed do
        s := encodeStateRL(Ψ)
        x := takeFirstNonAssignedVar(X)
        if s ∈ K then
            v := peek(K, s)
        else
            v := argmax_{u ∈ D(x)} Q̂(s, u, w)
        end
        K := K ∪ {⟨s, v⟩}
        branchAndUpdate(Ψ, x, v)
    end
    return bestSolution(Ψ)

The complete search procedure (BaB-DQN) is presented in Algorithm 1, taking as input a COP $\mathcal{Q}_p$, and a pre-trained model with the weight vector $w$. First, the optimization problem $\mathcal{Q}_p$ is encoded into a CP model. Then, a new BaB-search $\Psi$ is initialized and executed on the generated CP model. While the search is not completed, an RL state $s$ is obtained from the current CP state (encodeStateRL). The first non-assigned variable $x_i$ of the DP is selected and is assigned to the value maximizing the state-action value function $\hat{Q}(s, a, w)$. All the search mechanisms inherent to a CP solver but not related to our contribution (propagation, backtracking, etc.) are abstracted in the branchAndUpdate function. Finally, the best solution found during the search is returned. We enriched this procedure with a cache mechanism ($K$). During the search, it happens that similar states are reached more than once (Chu, de La Banda, and Stuckey 2010). In order to avoid recomputing the Q-values, one can store the Q-values related to a state already computed and reuse them if the state is reached again. In the worst case, all the action-value combinations have to be tested. This gives the upper bound $O(d^m)$, where $m$ is the number of actions of the DP model and $d$ the maximal domain size. Note that this bound is standard in a CP solver. As the algorithm is based on DFS, the worst-case space complexity is $O(d \times m + |K|)$, where $|K|$ is the cache size.

Iterative Limited Discrepancy Search with DQN

Iterative limited discrepancy search (ILDS) (Harvey and Ginsberg 1995) is a search strategy commonly used when we have a good prior on the quality of the value-selection heuristic used for driving the search. The idea is to restrict the number of decisions deviating from the heuristic choices (i.e., discrepancies) by a threshold. By doing so, the search will explore a subset of solutions that are likely to be good according to the heuristic, while giving a chance to reconsider the heuristic selection, which may be sub-optimal. This mechanism is often enriched with a procedure that iteratively increases the number of discrepancies allowed once a level has been fully explored.

As ILDS requires a good heuristic for the value-selection, it is complementary with a value-based RL agent, such as DQN. After the training, the agent gives a parametrized state-action value function $\hat{Q}(s, a, w)$, and the greedy policy $\mathrm{argmax}_a \hat{Q}(s, a, w)$ can be used for the value-selection heuristic, which is intended to be of a high, albeit non-optimal, quality. The variable ordering must follow the same order as the DP model in order to keep the consistency between both encodings.

Algorithm 2: ILDS-DQN Search Procedure.
    ▷ Pre: Q_p is a COP having a DP formulation.
    ▷ Pre: w is a trained weight vector.
    ▷ Pre: I is the threshold of the iterative LDS.
    ⟨X, D, C, O⟩ := CPEncoding(Q_p)
    c* := −∞, K := ∅
    for i from 0 to I do
        Ψ := LDS-search(⟨X, D, C, O⟩, i)
        while Ψ is not completed do
            s := encodeStateRL(Ψ)
            x := takeFirstNonAssignedVar(X)
            if s ∈ K then
                v := peek(K, s)
            else
                v := argmax_{u ∈ D(x)} Q̂(s, u, w)
            end
            K := K ∪ {⟨s, v⟩}
            branchAndUpdate(Ψ, x, v)
        end
        c* := max(c*, bestSolution(Ψ))
    end
    return c*

The complete search procedure we designed (ILDS-DQN) is presented in Algorithm 2, taking as input a COP $\mathcal{Q}_p$, a pre-trained model with the weight vector $w$, and an iteration threshold $I$ for the ILDS. First, the optimization problem $\mathcal{Q}_p$ is encoded into a CP model. Then, for each number $i \in \{0, \dots, I\}$ of discrepancies allowed, a new search $\Psi$ is initialized and executed on $\mathcal{Q}_p$. While the search is not completed, an RL state $s$ is obtained from the current CP state (encodeStateRL). The first non-assigned variable $x_i$ of the DP is selected and is assigned to the value maximizing the state-action value function $\hat{Q}(s, a, w)$. All the search mechanisms inherent to a CP solver but not related to our contribution (propagation, backtracking, etc.) are abstracted in the branchAndUpdate function. Finally, the best solution found during the search is returned. The cache mechanism ($K$) introduced for the BaB search is reused. The worst-case bounds are the same as for BaB-DQN presented in the main manuscript: $O(d^m)$ for the time complexity, and $O(d \times m + |K|)$ for the space complexity, where $m$ is the number of actions of the DP model, $d$ is the maximal domain size, and $|K|$ is the cache size.

Restart-Based Search with PPO

Restart-based search (RBS) is another search strategy, which involves multiple restarts to enforce a suitable level of exploration. The idea is to execute the search, to stop it when a given threshold is reached (i.e., execution time, number of nodes explored, number of failures, etc.), and to restart it. Such a procedure works only if the search has some randomness in it, or if new information is added along the search runs. Otherwise, the exploration will only consider similar sub-trees. A popular design choice is to schedule the restarts on the Luby sequence (Luby, Sinclair, and Zuckerman 1993), using the number of failures for the threshold, and branch-and-bound for creating the search tree.

The sequence starts with a threshold of 1. Each next part of the sequence repeats the entire previous sequence and then appends the last value of the previous sequence doubled (1, 1, 2, 1, 1, 2, 4, ...). The sequence can also be scaled with a factor $\sigma$, multiplying each element. As controlled randomness is a key component of this search, it can naturally be used with a policy $\pi(s, w)$ parametrized with a policy gradient algorithm. By doing so, the heuristic randomly selects a value among the feasible ones, according to the probability distribution of the policy through a softmax function. It is also possible to control the exploration level by tuning the softmax function with a standard Boltzmann temperature $\tau$. The complete search process is depicted in Algorithm 3. Note that the cache mechanism is reused in order to store the vector of action probabilities for a given state. The worst-case bounds are the same as for BaB-DQN presented in the main manuscript: $O(d^m)$ for the time complexity, and $O(d \times m + |K|)$ for the space complexity, where $m$ is the number of actions of the DP model, $d$ is the maximal domain size, and $|K|$ is the cache size.

Algorithm 3: RBS-PPO Search Procedure.
    ▷ Pre: Q_p is a COP having a DP formulation.
    ▷ Pre: w is a trained weight vector.
    ▷ Pre: I is the number of restarts to do.
    ▷ Pre: σ is the Luby scale factor.
    ▷ Pre: τ is the softmax temperature.
    ⟨X, D, C, O⟩ := CPEncoding(Q_p)
    c* := −∞, K := ∅
    for i from 0 to I do
        L := Luby(σ, i)
        Ψ := BaB-search(⟨X, D, C, O⟩, L)
        while Ψ is not completed do
            s := encodeStateRL(Ψ)
            x := takeFirstNonAssignedVar(X)
            if s ∈ K then
                p := peek(K, s)
            else
                p := π(s, w)
            end
            K := K ∪ {⟨s, p⟩}
            v ∼_{D(x)} softmaxSelection(p, τ)
            branchAndUpdate(Ψ, x, v)
        end
        c* := max(c*, bestSolution(Ψ))
    end
    return c*

Experimental Results

The goal of the experiments is to evaluate the efficiency of the framework for computing solutions of challenging COPs having a DP formulation. To do so, comparisons of our three learning-based search procedures (BaB-DQN, ILDS-DQN, RBS-PPO) with a standard CP formulation (CP-model), stand-alone RL algorithms (DQN, PPO), and industrial solvers are performed. Three NP-hard problems are considered in the main manuscript: the travelling salesman problem with time [...] to ensure reproducibility, the implementation, the models, the results, and the hyper-parameters used are released with the permissive MIT open-source license. Algorithms used for training have been implemented in Python, and PyTorch (Paszke et al. 2019) is used for designing the neural networks. The DGL library (Wang et al. 2019) is used for implementing graph embedding, and SetTransformer (Lee et al. 2018) for set embedding. The CP solver used is Gecode (Schulte, Lagerkvist, and Tack 2006), which has the benefit of being open-source and of offering a lot of freedom for designing new search procedures. As Gecode is implemented in C++, an interoperability interface with Python code is required. It is done using Pybind11 (Jakob, Rhinelander, and Moldovan 2017). Training time is limited to 48 hours, memory consumption to 32 GB, and 1 GPU (Tesla V100-SXM2-32GB) is used per model. Models are trained with a single run. A new model is recorded after every 100 episodes of the RL algorithm, and the model achieving the best average reward on a validation set of 100 instances, generated in the same way as for the training, is selected. The final evaluation is done on 100 other instances (still randomly generated in the same manner) using an Intel Xeon E5-2650 CPU with 32 GB of RAM and a time limit of 60 minutes. Detailed information about the hyper-parameters tested and selected is proposed in the supplementary material.
windows (TSPTW), involving non-linear constraints, and the
4-moments portfolio optimization problem (PORT), which Travelling Salesman Problem with Time Windows
has a non-linear objective, and the 0-1 knapsack problem Detailed information about this case study (TSPTW) and
(KNAP). In order to ease the future research in this field and the baselines used for comparison is proposed in supplemen-

3682
tary material. In short, OR-Tools is an industrial solver of 50, 100 and 200 with three kinds of weight/profit correla-
developed by Google, PPO uses a beam-search decoding of tions - easy, medium, and hard) are summarized in Table 4.
width 64, and CP-nearest solves the DP formulation with For each approach, the optimality gap (i.e., the ratio with the
CP, but without the learning part. A nearest insertion heuristic optimal solution) is proposed. First, it is important to note
is used for the value-selection instead. Results are summa- that an integer programming solver, as COIN-OR (Saltzman
rized in Table 1. First of all, we can observe that OR-Tools, 2002), is far more efficient than CP for solving the knapsack
CP-model, and DQN are significantly outperformed by the problem, which was already known. For all the instances
hybrid approaches. Good results are nevertheless achieved tested, COIN-OR has been able to find the optimal solution
by CP-nearest, and PPO. We observe that the former and to prove it. No other methods have been able to prove
is better to prove optimality, whereas the latter is better to optimality for all of the instances of any configuration. We
discover feasible solutions. However, when the size of in- observe that RBS-PPO? has good performances, and outper-
stances increases, both methods have more difficulties to forms the RL and CP approaches. Methods based on DQN
solve the problem and are also outperformed by the hybrid seems to have more difficulties to handle large instances,
methods, which are both efficient to find solutions and to unless they are strongly correlated.
prove optimality. Among the hybrid approaches, we observe
that DQN-based searches give the best results, both in finding Discussion and Limitations
solutions and in proving optimality. First of all, let us highlight that this work is not the first
We also note that caching the predictions is useful. Indeed, one attempting to use ML for guiding the decision process
the learned heuristics are costly to use, as the execution time of combinatorial optimization solvers (He, Daume III, and
to finish the search is larger when the cache is disabled. For Eisner 2014). According to the survey and taxonomy of (Ben-
comparison, the average execution time of a value-selection gio, Lodi, and Prouvost 2018), this kind of approach belongs
without caching is 34 milliseconds for BaB-DQN (100 cities), to the third class (Machine learning alongside optimization
and goes down to 0.16 milliseconds when caching is enabled. algorithms) of ML approaches for solving COPs. It is for
For CP-nearest, the average time is 0.004 milliseconds. instance the case of (Gasse et al. 2019), which propose to aug-
It is interesting to see that, even being significantly slower ment branch-and-bound procedures using imitation learning.
than the heuristic, the hybrid approach is able to give the best However, their approach requires supervised learning and is
results. only limited to (integer) linear problems. The differences we
have with this work are that (1) we focus on COPs modelled
4-Moments Portfolio Optimization (PORT) as a DP, and (2) the training is entirely based on RL. Thanks
Detailed information about this case study (Atamtürk and to CP, the framework can solve a large range of problems, as
Narayanan 2008; Bergman and Cire 2018) is proposed in the TSPTW, involving non-linear combinatorial constraints,
supplementary material. In short, Knitro and APOPT are or the portfolio optimization problem, involving a non-linear
two general non-linear solvers. Given that the problem is objective function. Another limitation of imitation learning is
non-convex, these solvers are not able to prove optimality as that it requires the solver to be able to find a least a feasible
they may be blocked in local optima. The results are sum- solution for collecting data, which can be challenging for
marized in Tables 2 and 3. When optimality is not proved, some problems as the TSPTW. Thanks to the use of rein-
hybrid methods are run until the timeout. Let us first consider forcement learning, our framework does not suffer from this
the continuous case (Table 2). For the smallest instances, restriction.
we observe that BaB-DQN? , ILDS-DQN? , and CP-model Besides its expressiveness, and in contrast to most of the re-
achieve the best results, although only BaB-DQN? has been lated works solving the problem end-to-end (Bello et al. 2016;
able to prove optimality for all the instances. For larger contin- Kool, Van Hoof, and Welling 2018; Deudon et al. 2018; Joshi,
uous instances, the non-linear solvers achieve the best results, Laurent, and Bresson 2019), our approach is able to deal with
but are nevertheless closely followed by RBS-PPO? . When problems where finding a feasible solution is difficult and is
the coefficients of variables are floored (Table 3), the objec- able to provide optimality proofs. This was considered by
tive function is not continuous anymore, making the problem (Bengio, Lodi, and Prouvost 2018) as an important challenge
harder for non-linear solvers, which often exploit information in learning-based methods for combinatorial optimization.
from derivatives for the solving process. Such a variant is not Note also that compared to failure-driven explanation-based
supported by APOPT. Interestingly, the hybrid approaches do learning (Kambhampati 1998), hybridation with ant colony
not suffer from this limitation, as no assumption on the DP optimization (Meyer 2008; Khichane, Albert, and Solnon
formulation is done beforehand. Indeed, ILDS-DQN? and 2010; Di Gaspero, Rendl, and Urli 2013), and related mech-
BaB-DQN? achieve the best results for the smallest instances anisms (Katsirelos and Bacchus 2005; Xia and Yap 2018),
and RBS-PPO? for the larger ones. where learning is used to improve the search of the solving
process for a specific instance, the knowledge learned by
0-1 Knapsack Problem (KNAP) our approach can be used to solve new instances. The clos-
Detailed information about this case study is proposed in est related work we identified is the approach of (Antuori
supplementary material. In short, COIN-OR is a integer pro- et al. 2020) that has been developed in parallel by another
gramming solver, and three types of instances, that differ team independently. Reinforcement learning is also lever-
from the correlation between the weight and the profit of aged for directing the search of a constraint programming
each item, are considered (Pisinger 2005). The results (size solver. However, this last approach is restricted to a realistic

3683
Approaches                          20 cities                    50 cities                    100 cities
Type                    Name        Success  Opt.  Gap   Time    Success  Opt.  Gap   Time    Success  Opt.  Gap   Time
Constraint programming  OR-Tools    100      0     0     <1      0        0     -     t.o.    0        0     -     t.o.
                        CP-model    100      100   0     <1      0        0     -     t.o.    0        0     -     t.o.
                        CP-nearest  100      100   0     <1      99       99    -     6       0        0     -     t.o.
Reinforcement learning  DQN         100      0     1.91  <1      0        0     -     <1      0        0     -     <1
                        PPO         100      0     0.13  <1      100      0     0.86  5       21       0     -     46
Hybrid (no cache)       BaB-DQN     100      100   0     <1      100      99    0     2       100      52    0.06  20
                        ILDS-DQN    100      100   0     <1      100      100   0     2       100      53    0.06  39
                        RBS-PPO     100      100   0     <1      100      80    0.02  12      100      0     0.18  t.o.
Hybrid (with cache)     BaB-DQN*    100      100   0     <1      100      100   0     <1      100      91    0     15
                        ILDS-DQN*   100      100   0     <1      100      100   0     1       100      90    0     15
                        RBS-PPO*    100      100   0     <1      100      99    0     2       100      11    0.04  32

Table 1: Results for TSPTW. Methods marked with * use caching. Success reports the number of instances where at least one solution has been found (among 100), Opt. reports the number of instances where optimality has been proven (among 100), Gap reports the average gap with the best solution found by any method (in %, and only including the instances having only successes), and Time reports the average execution time to complete the search (in minutes, and only including the instances where the search has been completed; when the search has been completed for no instance, t.o., for timeout, is indicated).
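
To make the ingredients behind the RBS-PPO rows and the cached (*) variants more concrete, here is a minimal Python sketch of the three mechanisms described for restart-based search: the Luby restart schedule scaled by σ, value selection through a softmax with Boltzmann temperature τ, and a prediction cache playing the role of K. It is an illustration under assumptions, not the released implementation: the names `luby`, `softmax_selection`, `CachedPolicy` and `restart_thresholds` are hypothetical, the sequence is indexed from 1, states are assumed hashable, and the policy output p is treated as a vector of scores.

```python
import math
import random
from typing import Callable, Dict, Hashable, List, Sequence


def luby(i: int) -> int:
    """i-th term (1-indexed) of the Luby sequence 1, 1, 2, 1, 1, 2, 4, ..."""
    k = 1
    while (1 << k) - 1 < i:              # smallest k such that 2^k - 1 >= i
        k += 1
    if (1 << k) - 1 == i:                # last position of a block: value 2^(k-1)
        return 1 << (k - 1)
    return luby(i - (1 << (k - 1)) + 1)  # otherwise the earlier prefix repeats


def restart_thresholds(sigma: int, n_restarts: int) -> List[int]:
    """Failure thresholds sigma * Luby(i) for i = 1..n_restarts (assumed scaling)."""
    return [sigma * luby(i) for i in range(1, n_restarts + 1)]


def softmax_selection(p: Sequence[float], feasible: Sequence[int],
                      tau: float, rng: random.Random) -> int:
    """Sample a feasible value with probability proportional to exp(p[v] / tau)."""
    top = max(p[v] for v in feasible)    # shift scores for numerical stability
    weights = [math.exp((p[v] - top) / tau) for v in feasible]
    return rng.choices(list(feasible), weights=weights, k=1)[0]


class CachedPolicy:
    """Memoises policy outputs per state so repeated states skip the network."""

    def __init__(self, policy: Callable[[Hashable], List[float]]) -> None:
        self.policy = policy
        self.cache: Dict[Hashable, List[float]] = {}
        self.network_calls = 0           # counts actual (uncached) evaluations

    def __call__(self, state: Hashable) -> List[float]:
        if state not in self.cache:
            self.network_calls += 1
            self.cache[state] = self.policy(state)
        return self.cache[state]
```

A low temperature τ concentrates the selection on the highest-scored feasible value, while a high τ pushes it toward uniform sampling, which is how the exploration level can be tuned between restarts.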

Approaches                          20 items                50 items                100 items
Type                    Name        Sol.      Opt.  Time    Sol.      Opt.  Time    Sol.      Opt.  Time
Non-linear solver       KNITRO      343.79    0     <1      1128.92   0     <1      2683.55   0     <1
                        APOPT       342.62    0     <1      1127.71   0     <1      2678.48   0     <1
Constraint programming  CP-model    356.49    98    8       1028.82   0     t.o.    2562.59   0     t.o.
Reinforcement learning  DQN         306.71    0     <1      879.68    0     <1      2568.31   0     <1
                        PPO         344.95    0     <1      1123.18   0     <1      2662.88   0     <1
Hybrid (with cache)     BaB-DQN*    356.49    100   <1      1047.13   0     t.o.    2634.33   0     t.o.
                        ILDS-DQN*   356.49    100   <1      1067.20   0     t.o.    2639.18   0     t.o.
                        RBS-PPO*    356.35    0     t.o.    1126.09   0     t.o.    2674.96   0     t.o.

Table 2: Results for PORT (continuous coefficients). Best results are highlighted, Sol. reports the best average objective profit reached, Opt. reports the number of instances where optimality has been proven (among 100), and Time reports the average execution time to complete the search (in minutes, and only including the instances where the search has been completed; when the search has been completed for no instance, t.o., for timeout, is indicated).
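
The way a row of this table is aggregated from per-instance runs can be sketched as follows. This is a hypothetical helper, not the authors' evaluation script: the `Run` record and the `summarize` function are illustrative names, and Sol. is assumed here to be the mean over instances of the best objective found on each instance.

```python
from typing import Dict, List, NamedTuple, Optional, Union


class Run(NamedTuple):
    best_objective: float      # best profit found on this instance
    proved_optimal: bool       # True if the search proved optimality
    minutes: Optional[float]   # time to complete the search, None on timeout


def summarize(runs: List[Run]) -> Dict[str, Union[float, int, str]]:
    """Aggregate per-instance runs into the Sol., Opt. and Time columns."""
    # Time only averages over the instances whose search was completed.
    completed = [r.minutes for r in runs if r.minutes is not None]
    return {
        "Sol.": sum(r.best_objective for r in runs) / len(runs),
        "Opt.": sum(r.proved_optimal for r in runs),
        "Time": sum(completed) / len(completed) if completed else "t.o.",
    }
```

Averaging Time only over completed searches explains why a method can show a short Time while proving optimality on fewer than 100 instances.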

Approaches                          20 items                50 items                100 items
Type                    Name        Sol.      Opt.  Time    Sol.      Opt.  Time    Sol.      Opt.  Time
Non-linear solver       KNITRO      211.60    0     <1      1039.25   0     <1      2635.15   0     <1
                        APOPT       -         -     -       -         -     -       -         -     -
Constraint programming  CP-model    359.81    100   t.o.    1040.30   0     t.o.    2575.64   0     t.o.
Reinforcement learning  DQN         309.17    0     <1      882.17    0     <1      2570.81   0     <1
                        PPO         347.85    0     <1      1126.06   0     <1      2665.68   0     <1
Hybrid (with cache)     BaB-DQN*    359.81    100   <1      1067.37   0     t.o.    2641.22   0     t.o.
                        ILDS-DQN*   359.81    100   <1      1084.21   0     t.o.    2652.53   0     t.o.
                        RBS-PPO*    359.69    0     t.o.    1129.53   0     t.o.    2679.57   0     t.o.

Table 3: Results for PORT (discrete coefficients). Best results are highlighted, Sol. reports the best average objective profit reached, Opt. reports the number of instances where optimality has been proven (among 100), and Time reports the average execution time to complete the search (in minutes, and only including the instances where the search has been completed; when the search has been completed for no instance, t.o., for timeout, is indicated).

transportation problem. In our work, we proposed a generic approach that can be used for a larger range of problems thanks to the DP formulation, but without considering realistic instances. We therefore think that the ideas of both works are complementary.

In most situations, experiments show that our approach can obtain more and better solutions than the other methods with a smaller execution time. However, they also highlighted that resorting to a neural network prediction is an expensive operation to perform inside a solver, as it has to be called
Approaches                          50 items                100 items               200 items
Type                    Name        Easy    Medium  Hard    Easy    Medium  Hard    Easy    Medium  Hard
Integer programming     COIN-OR     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
Constraint programming  CP-model    0.30    2.35    2.77    16.58   5.99    5.44    26.49   7.420   6.19
Reinforcement learning  DQN         2.11    1.97    1.36    2.08    4.87    1.32    35.88   8.98    5.99
                        PPO         0.08    0.27    0.21    0.16    0.42    0.14    0.37    0.80    0.80
Hybrid (with cache)     BaB-DQN*    0.02    0.01    0.00    0.44    1.73    0.60    4.20    7.84    0.00
                        ILDS-DQN*   0.03    0.05    0.01    0.38    2.90    0.35    30.33   7.80    4.91
                        RBS-PPO*    0.01    0.01    0.00    0.01    0.12    0.04    0.11    0.90    0.28

Table 4: Results for KNAP. The best results after COIN-OR are highlighted, and the average optimality gap is reported. A timeout is always reached for the hybrids and the standard CP method.
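
As an illustration of how an average optimality gap of this kind can be computed, the sketch below uses one common definition for a maximisation problem: the relative difference to the optimal value, in percent. The text only describes the gap as a ratio with the optimal solution, so both the exact formula and the percentage scale are assumptions, and the function names are hypothetical.

```python
from typing import Sequence


def optimality_gap(found: float, optimal: float) -> float:
    """Relative gap in percent for a maximisation problem (assumed definition)."""
    return 100.0 * (optimal - found) / optimal


def average_gap(found: Sequence[float], optimal: Sequence[float]) -> float:
    """Mean gap over paired (found, optimal) profits, one pair per instance."""
    gaps = [optimality_gap(f, o) for f, o in zip(found, optimal)]
    return sum(gaps) / len(gaps)
```

Under this definition, a method that always reaches the optimal profit, as COIN-OR does here, gets a gap of 0.00 in every column.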

numerous times during the solving process. It is currently a bottleneck, especially if we would like to consider larger instances. This is why caching, despite being a simple mechanism, is important. Another possibility is to reduce the complexity of the neural network by compressing its knowledge, which can for instance be done using knowledge distillation (Hinton, Vinyals, and Dean 2015) or by building a more compact equivalent network (Serra, Kumar, and Ramalingam 2020). Note that the Pybind11 binding between the Python and C++ code is also a source of inefficiency. Another solution would be to implement the whole framework in a single programming language that is efficient and expressive enough. Although not considered in this paper, it is worth mentioning that variable ordering also plays an important role in the efficiency of CP solvers. Learning a good variable ordering is another promising direction but raises additional challenges, such as a correct design of the reward.
Only three case studies are considered, but the approach proposed can be easily extended to other COPs that can be modeled as a DP. Many COPs have an underlying graph structure, and can then be represented by a GNN (Khalil et al. 2017), and the set architecture is also general for modelling COPs as it can take an arbitrary number of variables as input. DP encodings are also pervasive in the optimization literature and, similar to integer programming, have been traditionally used to model a wide range of problem classes (Godfrey and Powell 2002; Topaloglou, Vladimirou, and Zenios 2008; Tang, Mu, and He 2017).
An important assumption is that we need a generator able to create random instances that follow the same distribution as the ones we want to solve, or enough historical data of the same distribution, in order to train the models. This can hardly be achieved for some real-world problems where the amount of available data may be limited. Analyzing how this assumption can be relaxed is an interesting and important direction for future work.

Conclusion
The goal of combinatorial optimization is to find an optimal solution among a finite set of possibilities. There are many practical and industrial applications of COPs, and efficiently solving them directly results in a better utilization of resources and a reduction of costs. However, since the number of possibilities grows exponentially with the problem size, solving is often intractable for large instances. In this paper, we propose a hybrid approach, based on both deep reinforcement learning and constraint programming, for solving COPs that can be formulated as a dynamic program. To do so, we introduced an encoding able to express a DP model as a reinforcement learning environment and a constraint programming model. Then, the learning part can be carried out with reinforcement learning, and the solving part with constraint programming. The experiments carried out on the travelling salesman problem with time windows, the 4-moments portfolio optimization, and the 0-1 knapsack problem show that this framework is competitive with standard approaches and industrial solvers for instances up to 100 variables. These results suggest that the framework may be a promising new avenue for solving challenging combinatorial optimization problems. In future work, we plan to tackle industrial problems with realistic instances in order to assess the applicability of the approach for real-world problems.

References
Aarts, E.; and Lenstra, J. K. 2003. Local search in combinatorial optimization. Princeton University Press.
Antuori, V.; Hébrard, E.; Huguet, M.-J.; Essodaigui, S.; and Nguyen, A. 2020. Leveraging Reinforcement Learning, Constraint Programming and Local Search: A Case Study in Car Manufacturing. In International Conference on Principles and Practice of Constraint Programming, 657–672. Springer.
Arulkumaran, K.; Deisenroth, M. P.; Brundage, M.; and Bharath, A. A. 2017. A Brief Survey of Deep Reinforcement Learning. CoRR abs/1708.05866. URL http://arxiv.org/abs/1708.05866.
Atamtürk, A.; and Narayanan, V. 2008. Polymatroids and mean-risk minimization in discrete optimization. Operations Research Letters 36(5): 618–622.
Bellman, R. 1966. Dynamic programming. Science 153(3731): 34–37.
Bello, I.; Pham, H.; Le, Q. V.; Norouzi, M.; and Bengio, S. 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940.
Bengio, Y.; Lodi, A.; and Prouvost, A. 2018. Machine Learning for Combinatorial Optimization: a Methodological Tour d’Horizon. arXiv preprint arXiv:1811.06128.
Bénichou, M.; Gauthier, J.-M.; Girodet, P.; Hentges, G.; Ribière, G.; and Vincent, O. 1971. Experiments in mixed-integer linear programming. Mathematical Programming 1(1): 76–94.
Bergman, D.; and Cire, A. A. 2018. Discrete nonlinear optimization by state-space decompositions. Management Science 64(10): 4700–4720.
Cappart, Q.; Goutierre, E.; Bergman, D.; and Rousseau, L.-M. 2019. Improving optimization bounds using machine learning: Decision diagrams meet deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 1443–1451.
Chu, G.; de La Banda, M. G.; and Stuckey, P. J. 2010. Automatically exploiting subproblem equivalence in constraint programming. In International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, 71–86. Springer.
Cplex, I. I. 2009. V12. 1: User’s Manual for CPLEX. International Business Machines Corporation 46(53): 157.
Deudon, M.; Cournut, P.; Lacoste, A.; Adulyasak, Y.; and Rousseau, L.-M. 2018. Learning heuristics for the tsp by policy gradient. In International conference on the integration of constraint programming, artificial intelligence, and operations research, 170–181. Springer.
Di Gaspero, L.; Rendl, A.; and Urli, T. 2013. A hybrid ACO+CP for balancing bicycle sharing systems. In International Workshop on Hybrid Metaheuristics, 198–212. Springer.
Fages, J.-G.; and Prud’Homme, C. 2017. Making the first solution good! In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), 1073–1077. IEEE.
Gasse, M.; Chételat, D.; Ferroni, N.; Charlin, L.; and Lodi, A. 2019. Exact combinatorial optimization with graph convolutional neural networks. In Advances in Neural Information Processing Systems, 15554–15566.
Gendreau, M.; and Potvin, J.-Y. 2005. Metaheuristics in combinatorial optimization. Annals of Operations Research 140(1): 189–213.
Ghasempour, T.; and Heydecker, B. 2019. Adaptive railway traffic control using approximate dynamic programming. Transportation Research Part C: Emerging Technologies.
Godfrey, G. A.; and Powell, W. B. 2002. An adaptive dynamic programming algorithm for dynamic fleet management, I: Single period travel times. Transportation Science 36(1): 21–39.
Harvey, W. D.; and Ginsberg, M. L. 1995. Limited discrepancy search. In IJCAI (1), 607–615.
He, H.; Daume III, H.; and Eisner, J. M. 2014. Learning to Search in Branch and Bound Algorithms. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27, 3293–3301. Curran Associates, Inc.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Jakob, W.; Rhinelander, J.; and Moldovan, D. 2017. pybind11–Seamless operability between C++ 11 and Python.
Joshi, C.; Laurent, T.; and Bresson, X. 2019. An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint arXiv:1906.01227.
Kambhampati, S. 1998. On the relations between intelligent backtracking and failure-driven explanation-based learning in constraint satisfaction and planning. Artificial Intelligence 105(1-2): 161–208.
Katsirelos, G.; and Bacchus, F. 2005. Generalized nogoods in CSPs. In AAAI, volume 5, 390–396.
Khalil, E.; Dai, H.; Zhang, Y.; Dilkina, B.; and Song, L. 2017. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, 6348–6358.
Khichane, M.; Albert, P.; and Solnon, C. 2010. Strong combination of ant colony optimization with constraint programming optimization. In International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, 232–245. Springer.
Kool, W.; Van Hoof, H.; and Welling, M. 2018. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475.
Laborie, P. 2018. Objective landscapes for constraint programming. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, 387–402. Springer.
Lawler, E. L.; and Wood, D. E. 1966. Branch-and-bound methods: A survey. Operations research 14(4): 699–719.
Lee, J.; Lee, Y.; Kim, J.; Kosiorek, A. R.; Choi, S.; and Teh, Y. W. 2018. Set transformer: A framework for attention-based permutation-invariant neural networks. arXiv preprint arXiv:1810.00825.
Luby, M.; Sinclair, A.; and Zuckerman, D. 1993. Optimal speedup of Las Vegas algorithms. Information Processing Letters 47(4): 173–180.
Meyer, B. 2008. Hybrids of constructive metaheuristics and constraint programming: A case study with aco. In Hybrid Metaheuristics, 151–183. Springer.
Optimization, G. 2014. Inc., “Gurobi optimizer reference manual,” 2015.
Palmieri, A.; Régin, J.-C.; and Schaus, P. 2016. Parallel strategies selection. In International Conference on Principles and Practice of Constraint Programming, 388–404. Springer.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 8024–8035.
Pesant, G. 2004. A regular language membership constraint for finite sequences of variables. In International conference on principles and practice of constraint programming, 482–495. Springer.
Pisinger, D. 2005. Where are the hard knapsack problems? Computers & Operations Research 32(9): 2271–2284.
Rossi, F.; Van Beek, P.; and Walsh, T. 2006. Handbook of constraint programming. Elsevier.
Saltzman, M. J. 2002. COIN-OR: an open-source library for optimization. In Programming languages and systems in computational economics and finance, 3–32. Springer.
Schulte, C.; Lagerkvist, M.; and Tack, G. 2006. Gecode. Software download and online material at the website: http://www.gecode.org 11–13.
Serra, T.; Kumar, A.; and Ramalingam, S. 2020. Lossless Compression of Deep Neural Networks. arXiv preprint arXiv:2001.00218.
Sutton, R. S.; Barto, A. G.; et al. 1998. Introduction to reinforcement learning, volume 135. MIT press Cambridge.
Tang, Y.; Mu, C.; and He, H. 2017. Near-space aerospace vehicles attitude control based on adaptive dynamic programming and sliding mode control. In 2017 International Joint Conference on Neural Networks (IJCNN), 1347–1353. IEEE.
Topaloglou, N.; Vladimirou, H.; and Zenios, S. A. 2008. A dynamic stochastic programming model for international portfolio management. European Journal of Operational Research 185(3): 1501–1524.
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.
Wang, M.; Yu, L.; Zheng, D.; Gan, Q.; Gai, Y.; Ye, Z.; Li, M.; Zhou, J.; Huang, Q.; Ma, C.; et al. 2019. Deep graph library: Towards efficient and scalable deep learning on graphs. arXiv preprint arXiv:1909.01315.
Wolsey, L. A.; and Nemhauser, G. L. 1999. Integer and combinatorial optimization, volume 55. John Wiley & Sons.
Xia, W.; and Yap, R. H. 2018. Learning robust search strategies using a bandit-based approach. arXiv preprint arXiv:1805.03876.
