The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Deep Neural Network Approximated Dynamic Programming for Combinatorial Optimization

Shenghe Xu,1 Shivendra S. Panwar,1 Murali Kodialam,2 T.V. Lakshman2
1 Department of Electrical and Computer Engineering, NYU Tandon School of Engineering, Brooklyn, NY
2 Nokia Bell Labs, Crawford Hill, NJ
{shenghexu, panwar}@nyu.edu, {murali.kodialam, tv.lakshman}@nokia-bell-labs.com

Abstract

In this paper, we propose a general framework for combining deep neural networks (DNNs) with dynamic programming to solve combinatorial optimization problems. For problems that can be broken into smaller subproblems and solved by dynamic programming, we train a set of neural networks to replace the value or policy functions at each decision step. Two variants of the neural network approximated dynamic programming (NDP) method are proposed: in the value-based NDP method, the networks learn to estimate the value of each choice at the corresponding step, while in the policy-based NDP method the DNNs only estimate the best decision at each step. The training procedure of NDP starts from the smallest problem size, and a new DNN for the next size is trained to cooperate with the previous DNNs. After all the DNNs are trained, the networks are fine-tuned together to further improve overall performance. We test NDP on the linear sum assignment problem, the traveling salesman problem and the talent scheduling problem. Experimental results show that NDP can achieve considerable computation time reduction on hard problems with reasonable performance loss. In general, NDP can be applied to reducible combinatorial optimization problems for the purpose of computation time reduction.

Introduction

Dynamic programming (DP) is a widely used method for solving various optimization problems (Bellman 1966). For a problem that can be reduced to sub-problems with similar structures, each corresponding to a stage of decision making, DP finds the optimal solution for each sub-problem and thereby achieves the global optimal solution. Since the problem is broken into sub-problems, DP can efficiently reduce the search space compared with naive exhaustive search over all possible combinations.

However, for many combinatorial optimization problems, the size of the search space grows exponentially, or by factorial order, in the problem size. Even if problems are broken into sub-problems to reduce the search space, the complexity of algorithms using DP can still be high for NP-hard problems. For example, for the traveling salesman problem (TSP), simple exhaustive search has time complexity O(n!). The Bellman-Held-Karp algorithm (Bellman 1962; Held and Karp 1962) based on DP achieves time complexity O(2^n n^2) and space complexity O(2^n n). This exponentially growing complexity can still be prohibitive for large problem sizes, making it unsuitable for time or memory critical applications.

Methods based on neural networks (NNs) have been proposed as solutions to combinatorial problems for decades (Looi 1992; Smith 1999). Recent advancements in deep neural networks (DNNs) have led to more efficient schemes using novel network architectures and new training procedures (Bello et al. 2016; Khalil et al. 2017; Yang et al. 2018). However, many of the previously proposed methods focus on specific classes of problems, such as graph based problems (Khalil et al. 2017) and routing problems (Kool, van Hoof, and Welling 2018), or they rely on specific network architectures and specific reinforcement learning training procedures (Bello et al. 2016). Though a NN based dynamic programming method was proposed in (Yang et al. 2018), it requires training on each testing instance of the problem, which is impractical for time critical tasks.

In this paper, we propose a deep neural network approximated dynamic programming approach to solve general combinatorial optimization problems. The main contributions of this paper are as follows:

• We propose a general framework, called neural network approximated dynamic programming (NDP), that replaces the policy or value function calculation with NNs. The framework is simple and robust and can be combined with different NN architectures.

• An unsupervised training procedure is proposed for NDP. It consists of a pre-training step and a fine-tuning step. Robust performance improvement is achieved during the training procedure.

• Experimental results on the linear sum assignment problem (LSAP) and the TSP show that, compared with previous methods, NDP can be an alternative that achieves a balance between computation time and solution quality. When applied to the talent scheduling problem, NDP is able to achieve a considerable reduction of computation time, with a reasonable gap from the optimal solution.
Related Work

Recent developments in deep neural networks have enabled machine learning based methods to achieve state-of-the-art results in various tasks (LeCun, Bengio, and Hinton 2015; He et al. 2016). By adopting various techniques and NN architectures, several methods have also been developed for combinatorial optimization problems and achieved close to optimal results (Bello et al. 2016; Khalil et al. 2017; Yang et al. 2018; Kool, van Hoof, and Welling 2018).

In (Bello et al. 2016), a pointer network (Vinyals, Fortunato, and Jaitly 2015) based method was proposed to solve the traveling salesman problem (TSP) and the knapsack problem. To overcome the difficulty of generating high-quality labeled data for NP-hard problems, the pointer network was trained with reinforcement learning, using an actor-critic procedure. Combined with an active search process at the testing stage, the pointer network based method was able to achieve near optimal solutions with less computation time than previous methods.

An algorithm called S2V-DQN using graph embedding networks was proposed in (Khalil et al. 2017). The algorithm focuses on solving combinatorial optimization problems with graph structures, especially TSP. S2V-DQN uses a structure2vec network (Dai, Dai, and Song 2016) to represent information in a policy, and a DQN (Mnih et al. 2015) is then trained to provide a greedy policy on the representation. This method is able to achieve close to optimal solutions, with the ability to generalize to problems with sizes over 1000.

To solve routing related problems, an attention model (AM) based method was proposed in (Kool, van Hoof, and Welling 2018). The attention model contains an encoder that produces an embedding of the context of the problem, and a decoder that produces the solution sequentially. For TSP, the attention based method achieves solutions that are closer to the optimum than S2V-DQN and the method in (Bello et al. 2016).

In (Yang et al. 2018), the authors proposed a method called neural network dynamic programming (NNDP) to boost the performance of DP with NNs. For each instance of TSP, their method trains a new set of parameters with a solution reconstruction process that samples solutions; the neural network is trained to estimate the quality of each solution. The main difference between our approach and NNDP is that our approach does not require training on test samples, and instead of training one NN, we train a series of NNs, one for each problem size. Note that NNDP still needs to run the NN multiple times for a multi-step DP problem, and with the training time required on each instance, NNDP may be unsuitable for time-critical tasks.

A classifier based approach was proposed in (Lee et al. 2018) to solve the LSAP. The LSAP requires a solution that assigns n jobs to n people so as to maximize reward or minimize cost. In their approach, a single classifier is trained for each person to get a suitable job assignment for that person. Inevitably, some jobs may be assigned to more than one person, and a greedy heuristic is used to resolve such assignment collisions.

To overcome the curse of dimensionality, neuro-dynamic programming was proposed for space and computation time reduction of the original DP (Bertsekas and Tsitsiklis 1996). However, previous works on neuro-dynamic programming mainly focus on using simple approximation functions, such as linear functions or polynomial regression, to approximate value functions (Powell 2007). Most of the works on neuro-dynamic programming also focus on stochastic control problems (Bertsekas and Tsitsiklis 1996; Van Roy et al. 1997; Lam, Lee, and Tang 2007).

Different from the previous works mentioned above, our method is proposed for general combinatorial optimization problems that can be solved by traditional DP. Unlike the method in (Bello et al. 2016), NDP does not require sophisticated network structures like the pointer network or the attention model in (Kool, van Hoof, and Welling 2018). NDP does not require a graph structure in the problem like S2V-DQN (Khalil et al. 2017), which may not exist in many problems that can be solved by DP. Unlike the method proposed in (Yang et al. 2018), which requires training on each of the testing samples for at least 1 second, in this paper we assume that only the distribution of the problem instances is known. The training set and testing set are generated according to the known distribution with different random seeds, and NDP is trained on the training set without further training (Yang et al. 2018) or searching (Bello et al. 2016) on new instances of the problem. The method proposed in (Lee et al. 2018) requires training n problem-size specific classifiers for a size n problem, whereas in NDP each DNN is responsible for solving a sub-problem of a different size, and DNNs for smaller problem sizes are reused for training and testing on problems with larger sizes. In addition, a certain amount of optimal solutions is needed to train the classifier based approach, which may be time consuming or even infeasible for hard problems. Unlike previously proposed neuro-dynamic programming methods (Bertsekas and Tsitsiklis 1996), in this paper we focus on using DNNs for value or policy function approximation, and we apply NDP to general combinatorial optimization problems instead of stochastic control problems.

There may be a concern that, with multiple NNs, NDP could be more time consuming than other methods. However, except for the classifier based approach in (Lee et al. 2018), the other methods all work in a step by step way: for a problem with n steps the NNs are used n times. So the main extra cost introduced by NDP is the space for storing the NNs in memory, which is usually abundant in modern computers.

We emphasize that in this paper the main contribution is not to gain performance improvement on previously well studied problems. We focus on developing a simple and general framework to speed up DP for combinatorial optimization problems. In addition, the flexibility of this framework allows it to be combined with more powerful network architectures or training procedures for better performance, or with simpler network structures for lower complexity. For general problems, especially problems without an existing high quality solver or heuristic, such as talent scheduling, NDP is a simple and efficient approach for obtaining a close to optimal solution with reduced computation time.
Dynamic Programming and Approximation Methods

Dynamic programming was developed for a class of optimization problems that can be converted into a process of making a decision in several steps. The optimal solution of the overall problem should be obtainable by making optimal choices at each step; this is called the principle of optimality (Bellman and others 1954). By dividing the problem into smaller subproblems, DP can effectively reduce the search space of combinatorial optimization problems. For a given state s, which corresponds to a subproblem, and an action from the feasible set of actions a ∈ A(s), the Bellman equation for a reward maximization problem can be written as

    V(s) = \max_{a \in A(s)} ( R(s, a) + V(s') ),    (1)

where s' is the next state reached after choosing action a, R(s, a) is the reward obtained by choosing action a, and V(s) is the value function for state s. The Bellman equation can be solved by backward induction, computing the value functions for smaller problems and obtaining the final value function step by step. However, for problems with a large state space, backward induction may be time consuming or even infeasible. Several techniques have been proposed to address this curse of dimensionality (Bertsekas and Tsitsiklis 1995; 1996; Powell 2007; Buşoniu, De Schutter, and Babuška 2010; Mes and Rivera 2017; Yang et al. 2018). Many approaches approximate the value function, including basis functions (Buşoniu, De Schutter, and Babuška 2010), linear models, polynomial regression (Powell 2007) and DNNs as proposed in (Yang et al. 2018; van Heeswijk and La Poutré 2019). As far as we know, apart from (Yang et al. 2018; van Heeswijk and La Poutré 2019), this is the only other work that uses DNNs for function approximation in DP. As previously stated, different from (Yang et al. 2018), our approach requires no training on testing samples and uses different DNNs for different decision steps instead of one DNN for all steps. (van Heeswijk and La Poutré 2019) focuses more on value function approximation; they only studied the nomadic trucker problem, under a Markov decision process setting with smaller problem sizes, which is closer to a reinforcement learning approach, and they used a single NN for a fixed problem size, which is different from this paper.

Dynamic programming can be used to solve optimization problems in two ways. In the value function based variant, the value function is solved for the states, and the policy is then chosen by maximizing the value function. The other approach is to derive the optimal policy for each state directly and perform the optimal action at each state.

Training Procedure

We propose a two phase training procedure for NDP. In the first phase, the DNNs are pre-trained for only a few iterations with data generated from given distributions. In the second phase, all the DNNs are fine-tuned together, and the DNNs for smaller problem sizes use data generated by the policies of the DNNs from earlier steps. The pre-training procedure helps the networks converge faster to suitable policies; the fine-tuning procedure further improves the performance of the trained policy.

Training for Value Approximation

For value function approximation, training starts from the DNN for the smallest problem size, i.e., the states in the last decision step. In the case of value based NDP, for the smallest subproblem P_1 with size N_1 and A_1 possible actions, a DNN denoted by G_1 is trained to minimize the mean squared error (MSE) of its estimates of the value function:

    MSE = \frac{1}{T} \sum_{s_1 \in S_1} \sum_{a \in A(s_1)} ( G_1(s_1, \theta_1, a) - R(s_1, a) )^2,    (2)

where T is the total number of combinations of states and actions, S_1 is the set of all possible states of subproblem P_1, \theta_1 denotes the coefficients of G_1, and R(s_1, a) is the instantaneous reward obtained by choosing action a in state s_1. Since P_1 is the final decision step, there is no state transition, so the value can be computed directly from the action in a given state. Then, for a following DNN corresponding to the subproblem with size N_n and A_n feasible actions, DNN G_n is trained to minimize

    MSE = \frac{1}{T} \sum_{s_n \in S_n} \sum_{a \in A(s_n)} ( G_n(s_n, \theta_n, a) - R(s_n, a) - V^G_{n-1}(s_{n-1}) )^2,    (3)

where S_n is the set of possible states and V^G_{n-1}(s_{n-1}) is the value obtained by following the policy generated by the previously trained DNNs,

    V^G_n(s_n) = \sum_{i=1}^{n} R(s_i, a^G_i),    (4)

where

    a^G_i = \arg\max_a G_i(s_i, \theta_i, a),    (5)

and s_i is the state of subproblem i reached by following the policy generated by the DNNs. Since the number of possible states can be infinite, it is infeasible to calculate the MSE over all states. We follow the common procedure of performing gradient descent on mini-batches of data to train the NNs.
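To make Eqs. (4)-(5) concrete, the following is a minimal PyTorch-style sketch of how the rollout value V^G can be computed with the already-trained smaller-size DNNs. The `reward_fn` and `transition_fn` callables stand in for the problem-specific reward and state transition; they, and the tensor conventions, are assumptions of this sketch rather than details from the paper.

```python
import torch

def rollout_value(state, nets, reward_fn, transition_fn):
    """Compute V^G(s) of Eq. (4): follow the greedy action of each previously
    trained DNN (Eq. (5)) down to the smallest sub-problem and accumulate the
    instantaneous rewards.

    state:         tensor encoding the current sub-problem.
    nets:          previously trained DNNs, assumed in eval() mode, ordered from
                   the current size down to the smallest, e.g. [G_{n-1}, ..., G_1].
    reward_fn:     reward_fn(state, a) -> float, the instantaneous reward R(s, a).
    transition_fn: transition_fn(state, a) -> the next, smaller sub-problem state.
    """
    total = 0.0
    with torch.no_grad():
        for net in nets:
            scores = net(state.unsqueeze(0)).squeeze(0)  # G_i(s_i, theta_i, a) for all a
            a = int(scores.argmax())                     # greedy action of Eq. (5)
            total += reward_fn(state, a)
            state = transition_fn(state, a)
    return total
```

For cost-minimization problems such as TSP, the same sketch applies with costs negated or with argmin in place of argmax.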
The pre-training procedure is shown in Algorithm 1, where M is the number of steps in the problem and E is the number of epochs used to train each DNN.

In the second phase, the DNNs are fine-tuned together. The intuition behind this phase is that the distribution of the states induced by a given policy may differ from the distribution of the states used in the pre-training phase, and it is hard to estimate this distribution since it also changes as the policy is updated. To help the DNNs better approximate the value functions, in the fine-tuning phase the states for subproblem k are obtained by following the policies generated by G_{k+1}, ..., G_M while solving the overall problem of M steps. The fine-tuning procedure is shown in Algorithm 2.

Algorithm 1 Pre-training Process of NDP
 1: for i = 1; i < M; i++ do
 2:   for j = 1; j < E; j++ do
 3:     Generate B batches of states for problem i from a given distribution.
 4:     Calculate V^G_{i-1}(s_{i-1}).
 5:     for k = 1; k < B; k++ do
 6:       Update θ_i with data batch k in s_i.
 7:     end for
 8:   end for
 9: end for

Algorithm 2 Fine-Tuning Process of NDP
 1: for i = 1; i < E; i++ do
 2:   Generate B batches of states for problem M.
 3:   for j = 1; j < M; j++ do
 4:     Obtain B batches of data for s_j for problem j, following the policy given by the previous DNNs.
 5:     Calculate V^G_{j-1}(s_{j-1}).
 6:     for k = 1; k < B; k++ do
 7:       Update θ_j with data batch k in s_j.
 8:     end for
 9:   end for
10: end for
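The inner update of Algorithm 1 for the value-based variant can be sketched as follows. This is an illustrative sketch under assumed tensor shapes, with the problem-specific reward and transition passed in as callables; it reuses the hypothetical `rollout_value` helper from the previous sketch and is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def value_pretrain_step(G_n, optimizer, states, num_actions,
                        reward_fn, transition_fn, prev_nets):
    """One gradient update on G_n (line 6 of Algorithm 1, value-based NDP).

    states:      (batch, state_dim) tensor of size-n sub-problem states.
    num_actions: number of feasible actions A_n (assumed fixed per size).
    prev_nets:   previously trained DNNs [G_{n-1}, ..., G_1], used to roll out
                 V^G_{n-1}(s_{n-1}) as in Eqs. (3)-(5).
    """
    with torch.no_grad():                      # build targets R(s, a) + V^G_{n-1}(s')
        targets = torch.empty(states.shape[0], num_actions)
        for b in range(states.shape[0]):
            for a in range(num_actions):
                s_next = transition_fn(states[b], a)
                targets[b, a] = reward_fn(states[b], a) + rollout_value(
                    s_next, prev_nets, reward_fn, transition_fn)
    loss = F.mse_loss(G_n(states), targets)    # the MSE of Eq. (3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```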
Training for Policy Approximation

As in traditional DP, the DNNs in NDP can also be trained to directly provide the policy at each decision step. Instead of training the DNNs to estimate the value of each action in a given state, the DNNs are trained to directly estimate the best action to be taken in that state. In this case the DNNs are used to classify the best action, and since this can be seen as a classification task, a cross entropy loss is used for training the DNNs.
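A minimal sketch of such a policy update is shown below. How the "best action" labels are produced is not spelled out above; one natural choice, consistent with the value-based variant, is the one-step lookahead argmax of R(s, a) + V^G_{n-1}(s') computed with the previously trained DNNs. That choice, and the tensor shapes, are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def policy_pretrain_step(G_n, optimizer, states, best_actions):
    """One gradient update for policy-based NDP: classify the best action.

    states:       (batch, state_dim) tensor of size-n sub-problem states.
    best_actions: (batch,) tensor of action indices used as class labels
                  (assumed to come from a one-step lookahead as noted above).
    """
    logits = G_n(states)                           # one logit per feasible action
    loss = F.cross_entropy(logits, best_actions)   # classification of the best action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```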
The Two Phase Training Procedure

The two phase training procedure is adopted mainly for three reasons. First, to stabilize training, pre-training is used to help the DNNs get close to a good local optimum. Second, if problems of various sizes need to be solved, pre-training the DNNs for different problem sizes and saving the models for later fine-tuning helps save overall training time. Finally, the fine-tuning phase trains the DNNs on a more accurate distribution of data. Ideally, if the DNNs could make optimal decisions at each step, the pre-training procedure alone would be sufficient for obtaining the optimal policy. However, the DNNs' limited capacity and sensitivity to the data distribution require further training with data closer to the real distribution.

The two phase training procedure is partially motivated by (Hinton and Salakhutdinov 2006); however, in their case the terms pre-training and fine-tuning refer to the layers of a single NN, whereas in this paper we use the terms for a series of DNNs that are trained and used sequentially. In experiments, we find that the pre-training process helps the DNNs converge to efficient solutions in fewer training steps. On the other hand, the training process of NDP is also similar to multi-agent reinforcement learning (Buşoniu, Babuška, and De Schutter 2010), but with a clearly defined problem structure, at each training step the agents are able to obtain feedback for all possible actions instead of performing a single action as in the reinforcement learning setting. Similar to fixing the policy with a target network in the training of DQN (Mnih et al. 2015), the coefficients of each agent are updated for several consecutive batches while the coefficients of the other DNNs are kept fixed, which helps stabilize training with gradient descent. This is also confirmed by our experiments: updating each DNN for several batches while keeping the other DNNs fixed achieves more robust testing performance than updating all the DNNs simultaneously.

Solving Optimization Problems with Neural Network Approximated Dynamic Programming

In this section, we describe how NDP is applied to the LSAP, TSP and the talent scheduling problem.

The Linear Sum Assignment Problem

We first start with the LSAP. This problem can be solved optimally with the Hungarian algorithm with complexity O(n^3) (Kuhn 1955). Meanwhile, it is also reducible to subproblems and thus can be solved by DP. In the LSAP, n jobs have to be assigned to n people in an optimal way. The reward of assigning job i to person j is c_ij, and the objective is to maximize the sum of the rewards under the constraint that each job is assigned to one and only one person.

We formulate the DP solution to this problem as a multi-stage decision process. For a subproblem of size n, a decision assigning the first job to a person has to be made. After assigning the job to person k, by removing the rewards c_{1,1...n} and c_{1...n,k}, the problem is transitioned into a subproblem of size n - 1.

The value function for the subproblem of size 2 can be written directly in matrix multiplication form, so training of the DNNs starts from subproblems of size 3.
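As a small illustration of the state transition just described (a sketch of the formulation with 0-based indices, not the authors' code):

```python
import numpy as np

def lsap_transition(C, k):
    """Assign the first remaining job to person k: collect the reward C[0, k],
    then drop that job's row and that person's column, which yields the
    size-(n-1) LSAP sub-problem described above."""
    reward = C[0, k]
    C_next = np.delete(np.delete(C, 0, axis=0), k, axis=1)
    return reward, C_next
```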
The Traveling Salesman Problem

The traveling salesman problem (TSP) is one of the most studied NP-hard problems, and various heuristic methods have been proposed to solve it (Applegate et al. 2006). In TSP, an agent has to find a route that visits all given cities exactly once and returns to the starting city. The objective is to minimize the total distance traveled along the route.

We formulate the subproblems as finding the shortest route from a given starting city to a given ending city that visits all cities exactly once. d_ij denotes the distance from city i to city j. The state of each subproblem can be represented by a matrix, and for convenience we assume the starting city has index 1 while the ending city has index n. A state transition removes d_{1,1...n} and d_{1...n,1} and assigns index one to the chosen city. By assigning the same city to index one and index n, the requirement of returning to the start can be enforced at the first decision step.
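The sketch below mirrors this formulation with 0-based indices (position 0 is the current city, the last position is the fixed ending city). The `choose_next` callable stands in for whatever decision rule is used, for example the greedy choice of a size-matched DNN, and is an assumption of the sketch.

```python
import numpy as np

def tsp_transition(D, k):
    """One TSP sub-problem transition: travel from the current city (index 0)
    to city k, then re-index so that k becomes the new current city and the
    old current city is dropped. Returns the step cost and the reduced matrix."""
    cost = D[0, k]
    keep = [k] + [i for i in range(1, D.shape[0]) if i != k]
    return cost, D[np.ix_(keep, keep)]

def rollout_tour_cost(D, choose_next):
    """Total tour cost when choose_next(state) returns the next city index
    (0 < k < n-1) for every intermediate sub-problem; the last leg to the
    ending city is added once only it remains."""
    total = 0.0
    while D.shape[0] > 2:
        k = choose_next(D)
        step, D = tsp_transition(D, k)
        total += step
    return total + D[0, 1]
```

With `choose_next` implemented as the greedy choice of the size-matched DNN, this corresponds to the step-by-step decoding NDP performs at test time.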

The Talent Scheduling Problem

The talent scheduling problem is also an NP-hard problem that can be solved by DP (Garcia de la Banda, Stuckey, and Chu 2011; Qin et al. 2016). In this problem, a suitable schedule for shooting a number of movie scenes has to be found. Each scene may involve a number of actors and last for a number of days. Each actor incurs a certain cost per day and has to be paid for the whole duration from the first scene the actor is involved in to the last, including the days on which scenes without the actor are scheduled.

The subproblems are formulated as follows: given a number of scenes, the actors that are currently on hold, and the cost of each actor, find the next best scene to be scheduled. A state transition removes the scheduled scene and updates the list of actors that are on hold. At the first decision stage there is no actor on hold.

Experimental Setup

LSAP. For LSAP we test the performance of NDP with problem sizes up to 20. The performance of NDP is compared with the Hungarian method implementation in SciPy (Jones et al. 2016). A simple greedy heuristic, in which at each step the assignment with the highest reward among all remaining jobs and people is chosen, is used as a baseline. For uniformly generated rewards with 20 jobs, compared with the Hungarian method, the greedy method achieves a performance gap of 5.02% and NDP-policy achieves a performance gap of 3.14%. So instead of using uniformly generated rewards, for which the greedy baseline can easily get close to the optimal solution, we focus on a scenario where there is no obvious simple heuristic: the rewards are generated from a Beta distribution, with α = 0.07 and β = 0.17 for evaluation purposes.

TSP. For TSP we follow the same practice as in (Kool, van Hoof, and Welling 2018). The cities are generated uniformly from the unit square and Euclidean distance is used as the cost. The same test dataset and baseline methods as in (Kool, van Hoof, and Welling 2018) are used for evaluation.

The talent scheduling problem. As mentioned in (Cheng, Diamond, and Lin 1993), the talent scheduling problem with each actor required for two scenes and a universal daily wage of one is already NP-hard. However, we still assume the actors have random, uniformly distributed daily wages. Since the durations of most of the test cases in (Garcia de la Banda, Stuckey, and Chu 2011) are all ones, we focus on the scenario where all scenes have the same duration. Instead of the integer wages used in previous works (Qin et al. 2016; Garcia de la Banda, Stuckey, and Chu 2011), we use floating point wage values generated from a uniform distribution to train the DNNs, which is beneficial for training. For testing we still use integer wages, so that the previous methods can be applied directly to obtain the optimal solution. When generating scenes, for each actor we randomly choose the number of scenes the actor appears in, from 2 up to the maximum, and randomly allocate those scenes. Currently there is no well-known heuristic baseline for the talent scheduling problem. We therefore propose a heuristic similar in spirit to the principle of selecting equivalent scenes first mentioned in (Garcia de la Banda, Stuckey, and Chu 2011): we define the waiting cost as the cost of the actors waiting on site, and at each step the scene with the least waiting cost is selected. We denote this method least waiting cost (LWC). For the problem size, we select 20 actors and 20, 25 and 30 scenes. The actor costs are random integers from 1 to 100 for evaluation, and floating point numbers from 1 to 100 for training.
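A sketch of one reading of the LWC rule is given below. The exact bookkeeping, i.e. which actors count as "waiting on site" for a candidate scene, is an assumption of this sketch, and scenes are assumed to have unit duration as in the experiments.

```python
import numpy as np

def lwc_schedule(appears, cost):
    """Greedy least-waiting-cost (LWC) baseline.

    appears: (num_actors, num_scenes) boolean matrix, True if actor i is
             required in scene j.
    cost:    (num_actors,) per-day cost of each actor.
    Returns the order in which scenes are shot.
    """
    num_scenes = appears.shape[1]
    shot = np.zeros(num_scenes, dtype=bool)
    order = []
    for _ in range(num_scenes):
        started = (appears & shot).any(axis=1)      # actor has already appeared
        unfinished = (appears & ~shot).any(axis=1)  # actor still has scenes left
        on_hold = started & unfinished              # actors currently kept on site
        best, best_wait = None, None
        for j in np.flatnonzero(~shot):
            # waiting cost: on-site actors paid for scene j without acting in it
            wait = cost[on_hold & ~appears[:, j]].sum()
            if best is None or wait < best_wait:
                best, best_wait = int(j), wait
        order.append(best)
        shot[best] = True
    return order
```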
Comparison of computation time. Since our method can be run in parallel on batches of instances, we evaluate computation time in the same way as (Kool, van Hoof, and Welling 2018). For all experiments we report the computation time for 10,000 problems for TSP and LSAP, and 1,000 problems for talent scheduling. Experiments are conducted on a server with one P100 GPU and two Xeon Silver 4114 CPUs. Baseline methods with only a CPU implementation are tested on the same server using all cores of the CPUs in parallel. We use a batch size of 10,000 for both the attention model (AM) of (Kool, van Hoof, and Welling 2018) and NDP.

[Figure 1: Network Architecture for TSP. Four fully connected layers: fc1, fc2 and fc3, each followed by ReLU and batch normalization, then fc4.]

Network architectures. For all problems, simple fully connected DNNs are used in NDP. Figure 1 shows the network architecture used for TSP: each fully connected layer is followed by a ReLU activation and a batch normalization (BN) layer (Ioffe and Szegedy 2015). Using more advanced network architectures such as ResNet (He et al. 2016) can further improve performance; however, in this paper we do not focus on finding the best parameters for the NNs. ReLU is used as the activation function for all the DNNs.

For LSAP with a subproblem of n jobs, DNNs with one hidden layer of size 8n are used, and batch normalization is used for policy-based NDP. Using a larger hidden layer of size 16n reduces the performance gap by less than 0.5 percentage points but increases computation time.

For TSP with 20 cities, DNNs with three hidden layers of sizes 2n^2, 4n^2 and 16n are used for each subproblem. For TSP with 50 cities, smaller DNNs with hidden layers of sizes 8n, 4n and 2n are used for each subproblem to reduce training and testing time. Batch normalization is used for both value based and policy based NDP.
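A minimal PyTorch sketch of such a network for a size-n TSP subproblem is shown below, using the hidden sizes quoted for the 20-city case. The input/output conventions (a flattened n x n distance matrix in, one score per candidate next city out) are assumptions of this sketch rather than details taken from the paper.

```python
import torch
import torch.nn as nn

def make_tsp_net(n):
    """Fully connected network in the style of Figure 1 for a size-n TSP
    sub-problem, with hidden sizes 2n^2, 4n^2 and 16n; each hidden layer is
    followed by ReLU and batch normalization. fc4 maps to one output per
    candidate next city (an assumption; infeasible cities can be masked out)."""
    dims = [n * n, 2 * n * n, 4 * n * n, 16 * n]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.BatchNorm1d(d_out)]
    layers.append(nn.Linear(dims[-1], n))
    return nn.Sequential(*layers)

# Training uses Adam with learning rate 0.001 and batch size 100, as described
# in the training settings below.
net = make_tsp_net(20)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
```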

For the talent scheduling problem with 20 actors and n scenes, DNNs with two hidden layers of sizes 1200 and 4n are used. The cost of each actor can be represented as a vector c, with c_i the cost of actor i. Instead of representing the scenes as a binary matrix O with o_ij = 1 indicating that actor i is in scene j (and zero otherwise), we set o_ij = c_i if actor i is in scene j, and concatenate the matrix with c and a binary vector indicating the actors waiting on site. Batch normalization is only used for policy based NDP.

For the DNNs used in NDP, it is generally beneficial to select a relatively large first hidden layer. However, to achieve a balance between computation time and the performance gap, for large problems such as TSP with 50 cities we use smaller DNNs. While performance varies with the choice of parameters, in our experiments the performance of the DNNs always improves with training time until they converge.
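To make the talent-scheduling input encoding described above concrete, here is a small sketch with assumed array conventions (not the authors' code):

```python
import numpy as np

def encode_talent_state(appears, cost, on_site):
    """Build the DNN input for a talent-scheduling sub-problem: the actor-scene
    matrix weighted by each actor's cost (o_ij = c_i if actor i is in scene j,
    else 0), concatenated with the cost vector and a binary on-site indicator,
    and flattened into a single feature vector.

    appears: (num_actors, num_scenes) boolean matrix of the remaining scenes.
    cost:    (num_actors,) per-day cost of each actor.
    on_site: (num_actors,) boolean vector of actors currently waiting on site.
    """
    weighted = appears.astype(float) * cost[:, None]
    return np.concatenate([weighted.ravel(), cost, on_site.astype(float)])
```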
Training settings. All NDP models are implemented in PyTorch (Paszke et al. 2017) and the DNNs are trained with the Adam optimizer (Kingma and Ba 2014). A learning rate of 0.001 and a batch size of 100 are used for both pre-training and fine-tuning.

For pre-training, all the DNNs for LSAP and TSP are trained with 3,000,000 samples generated on the fly, each sample being used only once. For talent scheduling, since ensuring that the scenes in each problem are all different is time consuming, 20,000,000 samples are generated and each sample is used three times for pre-training.

For LSAP with 20 jobs, results after 1000 epochs of fine-tuning are selected for comparison. In fine-tuning, each epoch consists of training all the DNNs, each with 100,000 samples of data. It takes approximately one hour to pre-train the DNNs from problem size 3 to 20, while in the fine-tuning phase, with NDP-policy, each epoch takes about 100s; fine-tuning all the DNNs for 1000 epochs takes about 28 hours on one GPU. For LSAP with 50 jobs, results after 65 epochs of fine-tuning are used for comparison; fine-tuning for 65 epochs takes around 30 hours on a single GPU.

For TSP with 20 cities, we report results for NDP-policy and NDP-value after 1500 epochs of fine-tuning, which takes about 70 hours on one GPU. For TSP with 50 cities, the smaller DNNs converge after 100 epochs of fine-tuning, which takes about 75 hours for NDP-policy and 55 hours for NDP-value on a single GPU.

For the talent scheduling problem, since the DNNs are larger, fine-tuning takes longer. For problems with 20 actors and 20 scenes, each epoch of fine-tuning takes around 400s; with 25 scenes, about 800s; and with 30 scenes, about 1200s. We fine-tune the models for different numbers of epochs. With a fixed dataset for training, the DNNs converge within fewer epochs, but possibly to policies further from the optimal ones. For the 20 scene case, we show results for NDP-policy fine-tuned for 600 epochs and NDP-value fine-tuned for 400 epochs. For the 25 scene case we fine-tune the DNNs for 150 epochs, and for the 30 scene case the DNNs for both NDP-value and NDP-policy are fine-tuned for 65 epochs.

Figure 2 shows the change of the validation performance gap during the fine-tuning phase of NDP. We plot the results of the first 600 epochs of fine-tuning for all three problems. All curves are for NDP-policy; for LSAP and TSP the problem size is 20, and for talent scheduling the setting is 20 actors with 20 scenes. The figure shows that in the fine-tuning phase the performance gaps are consistently reduced during training. For TSP the DNNs converge after about 1500 epochs of fine-tuning, while for talent scheduling, training for another 300 epochs only reduces the performance gap by 0.2%. The relatively high initial performance gap for the talent scheduling problem may be caused by a difference in the data distribution: in the pre-training phase the actors on site are generated randomly, while in the fine-tuning phase the distribution of on-site actors may differ because it is induced by the previous policy of the DNNs.

[Figure 2: Validation Accuracy in Fine-tuning Phase. Performance gap versus epoch (first 600 epochs) for LSAP, TSP and talent scheduling, all with NDP-policy.]

Performance Evaluation

We evaluate the performance of NDP in terms of the solutions' performance gaps from the optimal or best known ones, and in terms of computation time.

For LSAP, the Hungarian method is used as the benchmark and the comparison results are shown in Table 1. The simple greedy heuristic, in which at each step the assignment with the highest reward among all remaining jobs and people is chosen, is used as a baseline. The NDP based methods achieve a considerable computation time reduction, due to the parallel operation on the GPU. Both NDP methods achieve about a three percent performance gap from the optimal solution in the 20 job case. For problems with 50 jobs, due to a diversity gain, making a sub-optimal decision in one step has less impact on the overall performance, so the non-optimal methods achieve solutions closer to optimal; the NDP methods still achieve better performance than the greedy method.

For TSP we compare performance on problem sizes 20 and 50; results are shown in Table 2. We include results for Gurobi since, according to (Kool, van Hoof, and Welling 2018), it achieves the best solution in the least amount of time. We also include results for AM from (Kool, van Hoof, and Welling 2018); as far as we know it is the NN based method with solutions closest to the optimum for TSP. Random insertion (RI) and farthest insertion (FI) are also included for comparison.

Due to the simple architectures of the DNNs in NDP, both policy based and value based NDP obtain solutions faster than AM, but the NDP based methods perform worse in terms of the performance gap from the best solution. In terms of solution quality, policy based NDP performs better than random insertion but worse than farthest insertion in the 20 city case. For problem size 50, the NDP methods suffer a larger performance gap but run with less computation time.

For the talent scheduling problem, according to (Qin et al. 2016) the enhanced branch and bound (EBB) method is so far the fastest method that achieves the optimal solution, and we include results generated by the C++ implementation provided by the authors. Even though EBB is run on the problems in parallel on all cores of the CPUs, its computation time is still high for problems with 30 scenes. Note that when EBB solves each problem sequentially, it takes on average 72ms, 102ms and 172ms per problem with 20, 25 and 30 scenes, respectively. For the talent scheduling problem, policy based NDP achieves much better performance than value based NDP; this may be because, due to the limited capacity of the DNNs, the value functions could not be approximated sufficiently well. Overall, the performance gaps of the NDP methods are higher for the talent scheduling problem. This may be because the dataset used for training consists of fixed samples, while for the other problems each update of the DNNs is performed with newly generated random data. The experimental results are shown in Table 3.

Table 1: Performance Comparison on LSAP

  20 Jobs
  Method       Reward   Gap      Time
  Hungarian    19.77    0.00%    1.47s
  Greedy       17.49    11.55%   0.28s
  NDP-Policy   19.07    3.47%    0.01s
  NDP-Value    19.16    3.07%    0.01s

  50 Jobs
  Method       Reward   Gap      Time
  Hungarian    49.99    0.00%    7.19s
  Greedy       47.25    5.49%    1.27s
  NDP-Policy   49.12    1.77%    0.10s
  NDP-Value    49.27    1.45%    0.10s

Table 2: Performance Comparison on TSP

  20 Cities
  Method              Cost   Gap      Time
  Gurobi              3.84   0.00%    5.42s
  Random Insertion    4.00   4.36%    0.37s
  Farthest Insertion  3.93   2.36%    0.58s
  AM                  3.84   0.08%    0.31s
  NDP-Policy          3.93   2.54%    0.06s
  NDP-Value           3.98   3.84%    0.06s

  50 Cities
  Method              Cost   Gap      Time
  Gurobi              5.70   0.00%    74.84s
  Random Insertion    6.13   7.65%    1.23s
  Farthest Insertion  6.01   5.53%    1.82s
  AM                  5.80   1.76%    1.38s
  NDP-Policy          6.38   12.02%   0.14s
  NDP-Value           7.10   24.69%   0.14s

Table 3: Performance Comparison on Talent Scheduling

  20 Scenes
  Method       Cost       Gap      Time
  EBB          13660.32   0.00%    8.00s
  LWC          16420.23   20.20%   0.08s
  NDP-Policy   14231.52   4.18%    0.08s
  NDP-Value    15818.16   15.80%   0.07s

  25 Scenes
  Method       Cost       Gap      Time
  EBB          17278.15   0.00%    49.28s
  LWC          21142.65   22.37%   0.08s
  NDP-Policy   18599.89   7.65%    0.13s
  NDP-Value    20675.90   19.67%   0.08s

  30 Scenes
  Method       Cost       Gap      Time
  EBB          20639.77   0.00%    1030.80s
  LWC          25677.81   24.41%   0.10s
  NDP-Policy   23659.64   14.63%   0.15s
  NDP-Value    25132.72   21.77%   0.10s

Conclusions and Future Work

In this paper, we proposed a deep neural network based dynamic programming approximation method to solve combinatorial problems. Experimental results show that the proposed method achieves considerable computation time reduction, with less than 5% loss on TSP with 20 cities and on LSAP with 20 jobs. When applied to the talent scheduling problem, tested on 1000 problems with 25 scenes and 20 actors, NDP-policy achieves solutions within a 10% gap from the optimal cost within 200ms, while the current best solver takes almost 50s to obtain the solutions. NDP can be a promising method to reduce the computation time of traditional DP for time critical applications. With its unsupervised training procedure, it can also be an alternative for relatively large problems that DP cannot solve in feasible time.

Meanwhile, there are still several open research directions for NDP. Training a series of DNNs may be time consuming, especially in the fine-tuning stage; simply increasing the learning rate does not lead to faster training and may cause instability. Hopefully, with the development of hardware and faster DNN training techniques, this problem can be mitigated; alternatively, a more efficient fine-tuning procedure for NDP may be found. For policy based NDP, we found that using more advanced network architectures such as ResNet (He et al. 2016) can further reduce the performance gap from the optimal value; however, training such networks is more time consuming. One possible solution is to share some of the parameters, such as the convolution kernels in ResNet. We leave the work of finding more suitable network architectures and parameter sharing methods for future research.

References

Applegate, D. L.; Bixby, R. E.; Chvatal, V.; and Cook, W. J. 2006. The Traveling Salesman Problem: A Computational Study. Princeton University Press.

Bellman, R., et al. 1954. The theory of dynamic programming. Bulletin of the American Mathematical Society 60(6):503–515.

Bellman, R. 1962. Dynamic programming treatment of the travelling salesman problem. Journal of the ACM (JACM) 9(1):61–63.

Bellman, R. 1966. Dynamic programming. Science 153(3731):34–37.

Bello, I.; Pham, H.; Le, Q. V.; Norouzi, M.; and Bengio, S. 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940.

Bertsekas, D. P., and Tsitsiklis, J. N. 1995. Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, 560–564. IEEE.

Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.

Buşoniu, L.; Babuška, R.; and De Schutter, B. 2010. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1. Springer. 183–221.

Buşoniu, L.; De Schutter, B.; and Babuška, R. 2010. Approximate dynamic programming and reinforcement learning. In Interactive Collaborative Information Systems. Springer. 3–44.

Cheng, T.; Diamond, J.; and Lin, B. 1993. Optimal scheduling in film production to minimize talent hold cost. Journal of Optimization Theory and Applications 79(3):479–492.

Dai, H.; Dai, B.; and Song, L. 2016. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, 2702–2711.

Garcia de la Banda, M.; Stuckey, P. J.; and Chu, G. 2011. Solving talent scheduling with dynamic programming. INFORMS Journal on Computing 23(1):120–137.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Held, M., and Karp, R. M. 1962. A dynamic programming approach to sequencing problems. Journal of the Society for Industrial and Applied Mathematics 10(1):196–210.

Hinton, G. E., and Salakhutdinov, R. R. 2006. Reducing the dimensionality of data with neural networks. Science 313(5786):504–507.

Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Jones, E.; Oliphant, T.; Peterson, P.; et al. 2016. SciPy: Open source scientific tools for Python, 2001.

Khalil, E.; Dai, H.; Zhang, Y.; Dilkina, B.; and Song, L. 2017. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, 6348–6358.

Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kool, W.; van Hoof, H.; and Welling, M. 2018. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475.

Kuhn, H. W. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2):83–97.

Lam, S.-W.; Lee, L.-H.; and Tang, L.-C. 2007. An approximate dynamic programming approach for the empty container allocation problem. Transportation Research Part C: Emerging Technologies 15(4):265–277.

LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436.

Lee, M.; Xiong, Y.; Yu, G.; and Li, G. Y. 2018. Deep neural networks for linear sum assignment problems. IEEE Wireless Communications Letters 7(6):962–965.

Looi, C.-K. 1992. Neural network methods in combinatorial optimization. Computers & Operations Research 19(3-4):191–208.

Mes, M. R., and Rivera, A. P. 2017. Approximate dynamic programming by practical examples. In Markov Decision Processes in Practice. Springer. 63–101.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Powell, W. B. 2007. Approximate Dynamic Programming: Solving the Curses of Dimensionality, volume 703. John Wiley & Sons.

Qin, H.; Zhang, Z.; Lim, A.; and Liang, X. 2016. An enhanced branch-and-bound algorithm for the talent scheduling problem. European Journal of Operational Research 250(2):412–426.

Smith, K. A. 1999. Neural networks for combinatorial optimization: a review of more than a decade of research. INFORMS Journal on Computing 11(1):15–34.

van Heeswijk, W., and La Poutré, H. 2019. Approximate dynamic programming with neural networks in linear discrete action spaces. arXiv preprint arXiv:1902.09855.

Van Roy, B.; Bertsekas, D. P.; Lee, Y.; and Tsitsiklis, J. N. 1997. A neuro-dynamic programming approach to retailer inventory management. In Proceedings of the 36th IEEE Conference on Decision and Control, volume 4, 4052–4057. IEEE.

Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In Advances in Neural Information Processing Systems, 2692–2700.

Yang, F.; Jin, T.; Liu, T.-Y.; Sun, X.; and Zhang, J. 2018. Boosting dynamic programming with neural networks for solving NP-hard problems. In Asian Conference on Machine Learning, 726–739.
