A graph convolutional encoder and multi-head attention decoder network for TSP via reinforcement learning
1. Introduction

Combinatorial Optimization (CO) problems have long attracted widespread attention in applied mathematics and operations research, and they arise in many real-life industries such as manufacturing, supply chain management, urban transportation, and lately drone routing (Davendra, 2010; MirHassani and Habibi, 2013; Huang et al., 2020; Tran et al., 2020). Although a wide range of research papers present new approaches to this field, it remains a challenge to obtain satisfactory results because of the NP-hardness of these CO problems, especially in practical application scenarios (Paschos, 2014). The Traveling Salesman Problem (TSP) is among the most extensively solved CO problems in practice and has been studied widely thanks to its simple problem description (Hromkovič, 2013; Osaba et al., 2020). The state-of-the-art methodologies for TSP can be classified into exact methods, approximation methods, and heuristic methods, which either require too much computing time or are not mathematically well defined. Exact methods can find the optimal solution with a theoretical guarantee, e.g., the branch-and-bound (B&B) framework (Wang et al., 2012; Subramanyam and Gounaris, 2016; Kinable et al., 2017), but they perform poorly on large-scale routing problems because of their exponential complexity in the worst case. For some specific problems, approximation methods (Williamson and Shmoys, 2011) can find suboptimal solutions with provable worst-case guarantees in polynomial time, but they may still have poor approximation ratios (Rego et al., 2011). Heuristics can find satisfactory results within reasonable computational time, but they lack a theoretical guarantee on solution quality, require substantial trial-and-error, and depend heavily on the intuition and experience of human experts to improve solution quality (Khan and Maiti, 2019; Pandiri and Singh, 2019; Ebadinezhad, 2020; Al-Gaphari et al., 2021; Saji and Barkatou, 2021).

To make a better trade-off between solution quality and solving time for TSP, learning-based methods have been investigated and achieve performance competitive with the above non-learning-based methods (Bengio et al., 2021). The first challenge is to introduce a learning-based algorithm for TSP. Vinyals et al. (2015b), for the first time, proposed a Ptr-Net for TSP built on Recurrent Neural Networks (RNNs), which is trained by supervised learning and achieved a significant improvement in computing time over non-learning-based methods. However, it is hard to obtain the label data when the
number of nodes becomes large. To ease the difficulty of training the model without label data, Bello et al. (2016) further applied reinforcement learning to train the Ptr-Net for TSP. Although RL pretraining updates the model parameters with the actor–critic algorithm (Mnih et al., 2015), it fails to utilize the graph-structured features of TSP instances, which can be embedded in node representations for different downstream tasks by graph embedding or network embedding techniques (Goyal and Ferrara, 2018). As a result, the pretrained model cannot make full use of node features.

What is more, the same or similar actions would be taken according to the policy at decoding steps. Multiple decoders can generate different subsequences, which contributes to better complete solutions. However, existing learning-based methods, e.g., AM (Kool et al., 2018), S2V-DQN (Dai et al., 2017), and Ptr-Net (Vinyals et al., 2015b), neglect to pursue more sequences when executing the decoding strategy. Joshi et al. (2019) proposed a model that can output TSP solutions in one shot, but it needs additional procedures, such as beam search, to generate reasonable solutions once training is finished. Those additional procedures search for the optimal solution based on the same pretrained model, which also reduces the diversity of the solution space.

To tackle the issues and limitations above, we propose a novel graph convolutional encoder and multi-head attention decoder network (GCE-MAD Net) that extracts hierarchical features from the original TSP graph input and decodes multiple sequences to increase the diversity of the solution space. The encoder is based on a GCN that takes node and edge features as input. The node features are 2-dimensional city coordinates, and the edge features are binary elements indicating whether any two cities are connected. The outputs of the encoder are then passed into a decoder based on the attention mechanism (Vaswani et al., 2017) to predict the probability distribution over unselected nodes. The multiple decoders scheme is utilized to increase the diversity of the solution space at each timestep; specifically, each decoder generates a TSP solution and the optimal solution is selected from those solutions. The entire encoder–decoder network is trained by an improved REINFORCE (Williams, 1992) algorithm.

We propose a novel graph convolutional encoder and multi-head attention decoder network (GCE-MAD Net) for TSP. The contributions of this work are as follows:

(1) We propose a graph convolutional network as an encoder to aggregate the neighbor features of each node. The node and edge features affect each other: the relative weight between two neighboring nodes is computed from the edge features, and the edge features are updated through the two connected node features. Furthermore, shallow features from the original graph input are preserved through residual blocks, so that each node can aggregate features from all graph convolutional layers.

(2) We propose a multiple decoders strategy that can generate several complete sequences at once. The probability distribution for selecting the next node is calculated by a multi-head attention mechanism-based decoder, which takes the graph features, the first selected node features, and the last selected node features as input. Furthermore, the multiple decoders scheme selects several nodes at each time step to produce several complete sequences, which increases the diversity of the solution space.

(3) A tailored deep reinforcement learning-based algorithm is designed to train the GCE-MAD Net. During training, the baseline is fixed until a stronger baseline appears. The new baseline is the minimal cost among the solutions generated by the multiple decoders scheme at each epoch. This baseline update policy ensures that GCE-MAD Net always improves over itself.

(4) Our experiments show that the GCE-MAD Net is efficient and has stronger generalization ability than state-of-the-art learning-based algorithms.

This paper is organized as follows. Related work on tackling CO problems with machine learning-based algorithms is discussed in Section 2. Section 3 gives the definition of TSP and the Markov Decision Process of the GCE-MAD Net. Section 4 introduces the proposed GCE-MAD Net in detail. Section 5 introduces the training method of the proposed deep reinforcement learning algorithm. We evaluate GCE-MAD Net on TSP instances and compare it against state-of-the-art machine learning methods, the optimal solver CPLEX, and traditional heuristics in Section 6. Finally, conclusions and prospects are given in Section 7.

2. Related work

Existing methods for TSP mainly include exact algorithms, approximation methods, and heuristic methods. Those methods can be regarded as model-based methods (Wu et al., 2019; Ali et al., 2020; Al-Gaphari et al., 2021; Kanna et al., 2021; Saji and Barkatou, 2021; Wang and Han, 2021), which need to build a MIP model first and only fit a specific instance. The details of those model-based methods for TSP are summarized in Davendra (2010). Here, we focus on the emerging learning-based methods (Bengio et al., 2021; Li et al., 2021; Talbi, 2021), which have achieved dramatic advantages over conventional model-based methods in solution quality and computing time.

The recent success of applying deep learning to CO problems can be traced back to the pointer network (Ptr-Net) (Vinyals et al., 2015b), which treated three challenging CO problems as sequence-to-sequence problems and, by using a pointer, overcame the drawback that the output length depends on the input. Ptr-Net is a variant of the attention mechanism and uses attention as a probability distribution, called a 'pointer', to select an element of the input sequence as the output. Ptr-Net is trained by supervised learning, and the ground-truth output permutations are given by the Concorde solver. Because Ptr-Net is sensitive to the quality of label data, which is expensive to obtain, Bello et al. (2016) created an actor–critic reinforcement learning algorithm in which Ptr-Net is the actor network and three other network modules constitute the critic network. Although the Ptr-Net architecture can learn good solutions for CO problems, it does not reflect the graph structure of CO problems. The original network (Vinyals et al., 2015b) was designed for NLP, not for TSP, so it neglects the permutation invariance of the input cities. Nazari (2018) presented a permutation-invariant encoder to let the network learn input order invariance. Kool et al. (2018) did not use positional encoding in the Transformer (Vaswani et al., 2017), producing node embeddings that are invariant to the input order.

Considering that a TSP graph has no unique representation, Graph Neural Networks (GNNs) have the potential to play the role of encoder because of their permutation invariance and sparsity awareness (Wu et al., 2020; Zhou, 2020). Dai et al. (2017) encode CO problems with a structure2vec graph embedding network and construct solutions incrementally. Replacing the structure2vec graph embedding model, graph convolutional networks (GCNs) (Duvenaud et al., 2015; Defferrard et al., 2016; Gehring et al., 2017; Marcheggiani and Titov, 2017; Chen et al., 2018; Li et al., 2018) play an important role in encoding node representations for estimating the likelihood that a node is part of the optimal solution. Deudon et al. (2018) took a graph attention network as the encoder to aggregate the neighbor features of each node and utilized the same pointer network for selecting the node inserted into the subtour. The Sinkhorn Policy Gradient (SPG) algorithm was proposed to learn policies on permutation matrices, where one Sinkhorn layer following a GRU (Cho et al., 2014) was used to produce continuous relaxations of permutation matrices.

Different from the above deep reinforcement learning methods, some works learned the parameters of the GCN encoder to generate node embeddings for downstream tasks, i.e., a large number of ground-truth output permutations were needed in advance to optimize the parameters. The graph learning network (GLN) directly learned the patterns of generated TSP solutions; in a certain sense, this model was trained with ground-truth tours (Nammouchi et al., 2020). Joshi et al. (2019) took a GCN as the encoder to aggregate neighbor features
and utilized an MLP to output, in one shot, a heatmap of the possible connected edges of each node. Although these two methods were trained in a supervised way, additional strategies for searching the optimal solution, e.g., beam search, still increase the computing time.

The above learning-based methods have achieved competitive performance on solving TSP, but most of them seldom reflect hierarchical features from the original graph input. Motivated by this, a residual graph convolutional network is used to extract and fuse features from all layers. Furthermore, a single decoder in the existing learning-based methods easily makes similar decisions when choosing the next node, so we design a multiple decoders scheme to obtain various probability distributions at each time step, which improves solution quality and enhances generalization ability.

3. Problem definition

In this paper, we focus on solving any random instance $s$ of the symmetric two-dimensional Euclidean TSP, formulated as an undirected graph $G(V,E)$. In the graph, $V = \{1, 2, \dots, n\}$ (with $|V| = n$) represents the set of nodes, and $E = \{e_{11}, e_{12}, \dots, e_{nn}\}$ denotes the set of edges, where $e_{ij}$ encodes the relationship between nodes $i$ and $j$. $\mathbf{x}_i \in \mathbb{R}^2$ is a vector representing the coordinates of node $i$. Given the coordinates of all nodes in the graph, $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_n)$, we wish to find the optimal permutation $\boldsymbol{\pi} = (\pi_1, \dots, \pi_n)$ with minimal tour length $R$. The elements $\pi_t \in V$ of the permutation $\boldsymbol{\pi}$, selected at each time step $t \in \{1, \dots, n\}$, give the visiting order of the nodes in the graph. A feasible permutation $\boldsymbol{\pi}$ must satisfy two conditions: (1) each node is visited exactly once; (2) no node is repeated, i.e., $\pi_t \neq \pi_{t'}$, $\forall t \neq t'$. The tour length of a feasible permutation $\boldsymbol{\pi}$ is defined as Eq. (1).

$R(\boldsymbol{\pi}) = \|\mathbf{x}_{\pi_1} - \mathbf{x}_{\pi_n}\|_2 + \sum_{t=2}^{n} \|\mathbf{x}_{\pi_t} - \mathbf{x}_{\pi_{t-1}}\|_2$    (1)

Given a random TSP instance $s$, a stochastic policy $p_{\boldsymbol{\theta}}(\boldsymbol{\pi} \mid s)$ used to generate a permutation $\boldsymbol{\pi} = (\pi_1, \dots, \pi_n)$ is defined as:

$p_{\boldsymbol{\theta}}(\boldsymbol{\pi} \mid s) = \prod_{t=1}^{n} p_{\boldsymbol{\theta}}(\pi_t \mid s, \boldsymbol{\pi}_{1:t-1})$    (2)

here, $\boldsymbol{\theta}$ represents the parameters to be learned.

The iterative process of selecting nodes is modeled as the following Markov Decision Process (MDP).

(1) Observation $\boldsymbol{\pi}_{1:t}$ represents the generated subtour $\boldsymbol{\pi}_{1:t} = (\pi_1, \dots, \pi_t)$ at time step $t \in \{1, \dots, n\}$.

(2) Action $a_t$ defines one node to be inserted in the subtour at time step $t \in \{1, \dots, n\}$.

(3) Transition function $l(\boldsymbol{\pi}_{1:t}, a_t)$ converts observation $\boldsymbol{\pi}_{1:t}$ to $\boldsymbol{\pi}_{1:t+1}$, i.e., $\boldsymbol{\pi}_{1:t+1} = l(\boldsymbol{\pi}_{1:t}, a_t)$.

(4) The reward function is defined as Eq. (3).

$r_t = r(\boldsymbol{\pi}_{1:t}, a_t, \boldsymbol{\pi}_{1:t+1}) = R(\boldsymbol{\pi}_{1:t})$    (3)

4. Proposed GCE-MAD Net

The details of the proposed GCE-MAD Net are explained in the following subsections in terms of the encoder architecture and the decoder architecture. As visualized in Fig. 1, the architecture of the GCE-MAD Net follows the so-called encoder–decoder paradigm. In the figure, the GCN block consists of several graph convolutional layers, which are stacked on top of each other; the attention-based decoder then computes the compatibility of the query with all nodes, which is used to obtain the probability distribution for picking the next unvisited node via the softmax function. Notably, the multiple decoder strategy in this paper allows several nodes to be selected for different permutations at a time, and the optimal solution comes from those permutations.

4.1. The encoder

Our encoder is based on a GCN, which exploits neural network operations over graph-structured data. For TSP instances, the input of the encoder includes two parts: node and edge features. The node feature $\mathbf{x}_i \in [0,1]^2$ is a vector representing the 2-dimensional coordinates of the $i$th node. Meanwhile, the edge feature is a binary element indicating whether nodes $i$ and $j$ are connected, defined by Eq. (4).

$e_{ij} = \begin{cases} 1, & \text{if node } i \text{ connects with node } j \\ 0, & \text{otherwise} \end{cases}$    (4)

Because pairwise computation over all nodes is intractable when generalizing the model to large-scale problem instances, k-nearest neighbors is adopted to make the input graph sparse. Specifically, the number of neighbors of each node, $N_e$, is computed by Eq. (5), which diffuses information with the same number of message steps across different graph sizes.

$N_e = n \times k\%$    (5)

where $k$ is a hyperparameter that is a multiple of 10.

The input node and edge features are first embedded into $h$-dimensional features through two fully connected layers; this operation is called 'primitive embedding' and is represented by Eqs. (6) and (7).

$\mathbf{h}^0_i = \mathbf{A}_0 \mathbf{x}_i + \mathbf{b}_0, \quad \forall i \in \{1, \dots, n\}$    (6)

$\mathbf{e}^0_{ij} = \mathbf{A}_1 e_{ij} + \mathbf{b}_1, \quad \forall j \in \{1, \dots, n\}$    (7)

where $\mathbf{A}_0 \in \mathbb{R}^{h \times 2}$ and $\mathbf{A}_1 \in \mathbb{R}^{h}$ represent learnable weight parameters, and $\mathbf{b}_0 \in \mathbb{R}^h$ and $\mathbf{b}_1 \in \mathbb{R}^h$ are the bias parameters. The primitive node representation $\mathbf{h}^0_i$ and edge representation $\mathbf{e}^0_{ij}$ are passed into the first graph convolutional layer of the encoder. In the remaining section, $\mathbf{h}^l_i$ and $\mathbf{e}^l_{ij}$ denote the node and edge representations of graph convolutional layer $l \in \{1, \dots, L\}$ of the encoder, respectively, and both representations are alternately updated as in Fig. 2.

Fig. 2 describes a single graph convolutional layer that updates the node representations. The graph convolutional layer is regarded as a message passing process in which information is passed from a node to one of its neighbors with a certain probability. The probabilities are computed from the edge representations and sum to one over all neighbors of node $i$. A residual connection is also applied to memorize information over each graph convolutional layer (Bresson and Laurent, 2017). As a result, the node states of the next graph convolutional layer are derived by Eqs. (8) and (9). $\mathbf{h}^0_i$ and $\mathbf{e}^0_{ij}$ are the output of the 'primitive embedding', calculated by Eqs. (6) and (7).

$\mathbf{h}^l_i = \mathbf{h}^{l-1}_i + \mathrm{ReLU}\Big(\mathrm{BN}\Big(\mathbf{W}^{l-1}_1 \mathbf{h}^{l-1}_i + \sum_{j \in \mathcal{N}(i)} \boldsymbol{\eta}^{l-1}_{ij} \odot \mathbf{W}^{l-1}_2 \mathbf{h}^{l-1}_j\Big)\Big), \qquad \boldsymbol{\eta}^{l-1}_{ij} = \frac{\sigma(\mathbf{e}^{l-1}_{ij})}{\sum_{j' \in \mathcal{N}(i)} \sigma(\mathbf{e}^{l-1}_{ij'})}$    (8)
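To make the gated aggregation of Eq. (8) concrete, the sketch below implements one such graph convolutional layer in PyTorch, together with a helper that builds the sparse k-nearest-neighbor graph of Eq. (5). The class, function and tensor names, shapes and defaults are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

def knn_adjacency(coords, k_percent=40):
    """Sparse input graph of Eq. (5): each node keeps N_e = n * k% nearest neighbors.
    coords: (n, 2) city coordinates; returns an (n, n) 0/1 adjacency mask."""
    n = coords.size(0)
    num_neighbors = max(1, int(n * k_percent / 100))
    dist = torch.cdist(coords, coords)                                 # pairwise distances
    knn = dist.topk(num_neighbors + 1, largest=False).indices[:, 1:]   # drop the node itself
    adj = torch.zeros(n, n)
    adj.scatter_(1, knn, 1.0)
    return adj

class ResidualGatedGCNLayer(nn.Module):
    """One node update of Eq. (8): edge-gated neighbor aggregation with batch
    normalization, ReLU and a residual connection (Bresson and Laurent, 2017)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W1 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W2 = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.bn = nn.BatchNorm1d(hidden_dim)

    def forward(self, h, e, adj):
        # h: (n, hidden) node features h_i^{l-1}; e: (n, n, hidden) edge features
        # e_ij^{l-1}; adj: (n, n) neighborhood mask from knn_adjacency.
        gate = torch.sigmoid(e) * adj.unsqueeze(-1)               # sigma(e_ij) restricted to N(i)
        eta = gate / (gate.sum(dim=1, keepdim=True) + 1e-10)      # normalized edge gates eta_ij
        agg = torch.einsum('ijd,jd->id', eta, self.W2(h))         # sum_j eta_ij (.) W2 h_j
        return h + torch.relu(self.bn(self.W1(h) + agg))          # residual update h_i^l
```

Stacking $L$ such layers on top of the primitive embedding of Eqs. (6) and (7) reproduces the encoder structure described next.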
The encoder adopts more than one graph convolutional layer, and its structure is depicted in Fig. 3. First, the primitive embedding module receives the node features $\mathbf{x}_i \in \mathbb{R}^2$ and edge features $e_{ij} \in \mathbb{R}$ and projects them from the low dimension to dimension $h$. Next, those $h$-dimensional representations are passed to the graph convolutional layers to update the node representations. Specifically, the $l$th graph convolutional layer takes the output of the $(l-1)$th graph convolutional layer as input. Finally, the output of the $L$th graph convolutional layer is the output of the encoder, i.e., the updated node representations $\mathbf{h}^L_i$ of dimension $h$.

4.2. The decoder

The multiple decoders scheme consists of more than one decoder with identical structures but unshared parameters, and the decoding procedure of a single decoder follows Kool et al. (2018). Let $m \in \{1, \dots, M\}$ be the index of the decoder (the superscript $m$ in the following denotes the decoder index); every decoder constructs a solution $\boldsymbol{\pi}^m$ sequentially. To produce the probabilities of visiting each valid node at timestep $t \in \{1, \dots, T\}$ efficiently, a tailored context node is designed to gather messages from the node representations $\mathbf{h}^L_i$. The context vector $\mathbf{c}^m_t$ can be viewed as the embedding of the tailored context node and is defined formally as follows:

$\mathbf{c}^m_t = \begin{cases} \mathrm{concat}(\bar{\mathbf{h}}, \mathbf{v}^m_1, \mathbf{v}^m_2), & t = 1 \\ \mathrm{concat}(\bar{\mathbf{h}}, \mathbf{h}^L_{\pi^m_{t-1}}, \mathbf{h}^L_{\pi^m_1}), & t > 1 \end{cases}$    (10)

here $\mathrm{concat}(\cdot)$ is the horizontal concatenation operator. The context vector $\mathbf{c}^m_t \in \mathbb{R}^{3h}$ is calculated by concatenating the mean of the node embeddings $\bar{\mathbf{h}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{h}^L_i$, the embedding of the last selected node $\mathbf{h}^L_{\pi^m_{t-1}}$, and the embedding of the first selected node $\mathbf{h}^L_{\pi^m_1}$. For $t = 1$, we use learned parameters $\mathbf{v}^m_1 \in \mathbb{R}^h$ and $\mathbf{v}^m_2 \in \mathbb{R}^h$ as input placeholders for each decoder.

Following the multi-head attention mechanism (Vaswani et al., 2017), a query and a set of key–value pairs are needed to map the output.

$\mathbf{q}^m_t,\ \mathbf{k}^m_i,\ \mathbf{v}^m_i = \mathbf{W}^m_Q \mathbf{c}^m_t,\ \mathbf{W}^m_K \mathbf{h}^L_i,\ \mathbf{W}^m_V \mathbf{h}^L_i$    (11)

here, the parameter $\mathbf{W}^m_Q$ is an $\mathbb{R}^{k \times 3h}$ matrix and $\mathbf{W}^m_K$ is an $\mathbb{R}^{k \times h}$ matrix, where $k = h/F$ and $F$ is the number of heads; $\mathbf{W}^m_V$ is an $\mathbb{R}^{v \times h}$ matrix. In this work, the keys $\mathbf{k}^m_i \in \mathbb{R}^k$ and values $\mathbf{v}^m_i \in \mathbb{R}^v$ remain unchanged during the decoding period, so the keys and values of each decoder are computed by Eq. (11) only once, whereas a single query $\mathbf{q}^m_t$ is computed from the context vector at every timestep $t$.

The compatibility $u^m_{j,t} \in \mathbb{R}$ of the query $\mathbf{q}^m_t$ with all nodes is computed according to Eq. (12). For TSP, a node is masked once it has been visited, i.e., $u^m_{j,t} = -\infty$ at time step $t$:

$u^m_{j,t} = \begin{cases} \dfrac{(\mathbf{q}^m_t)^{\mathrm{T}} \mathbf{k}^m_j}{\sqrt{k}}, & \text{if } j \neq \pi^m_{t'},\ \forall t' < t \\ -\infty, & \text{otherwise} \end{cases}$    (12)

Next, we obtain the vector $\mathbf{a}^m_t$ through a convex combination of the messages $\mathbf{v}^m_j$ at timestep $t$:

$\mathbf{a}^m_t = \sum_j \mathrm{softmax}(u^m_{j,t})\, \mathbf{v}^m_j$    (13)

As we use the multi-head attention mechanism, we denote the above result vector by $\mathbf{a}^{m,f}_t$ for $f \in \{1, 2, \dots, F\}$. Using a learnable matrix $\mathbf{W}^{m,f}_O \in \mathbb{R}^{h \times v}$, it can be projected back to a vector $\mathbf{c}^{m*}_t \in \mathbb{R}^h$. The final multi-head attention value for the context node is calculated as Eq. (14).

$\mathbf{c}^{m*}_t = \sum_{f=1}^{F} \mathbf{W}^{m,f}_O \mathbf{a}^{m,f}_t$    (14)

Finally, to compute the probability distribution, a layer with a single attention head is added on top of the multi-head layer. Following Bello et al. (2016), we clip the compatibilities within $[-D, D]$ using tanh:

$u^m_{j,t} = \begin{cases} D \cdot \tanh\!\Big(\dfrac{(\mathbf{r}^m_t)^{\mathrm{T}} \mathbf{g}^m_j}{\sqrt{k}}\Big), & \text{if } j \neq \pi^m_{t'},\ \forall t' < t \\ -\infty, & \text{otherwise} \end{cases}$    (15)

$\mathbf{r}^m_t,\ \mathbf{g}^m_j = \mathbf{W}^m_R \mathbf{c}^{m*}_t,\ \mathbf{W}^m_G \mathbf{h}^L_j$    (16)

here, the parameters $\mathbf{W}^m_R$ and $\mathbf{W}^m_G$ are $\mathbb{R}^{h \times h}$ matrices because a single head is used, i.e., $k = h/F$ with $F = 1$.
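As an illustration of this final selection step, the sketch below computes the clipped, masked compatibilities of Eqs. (15) and (16) and turns them into selection probabilities for one decoder. Tensor names, shapes and the default clipping value are assumptions for illustration, not the authors' reference code.

```python
import math
import torch

def next_node_distribution(c_star, node_emb, W_R, W_G, visited_mask, D=10.0):
    """Single-head pointer layer of one decoder (Eqs. (15) and (16)).
    c_star: (B, h) glimpse c_t^{m*}; node_emb: (B, n, h) encoder outputs h_i^L;
    W_R, W_G: (h, h) learnable matrices; visited_mask: (B, n) True for visited nodes."""
    h_dim = c_star.size(-1)
    r = c_star @ W_R.T                                   # r_t = W_R c_t^{m*},  (B, h)
    g = node_emb @ W_G.T                                 # g_j = W_G h_j^L,     (B, n, h)
    u = torch.einsum('bh,bnh->bn', r, g) / math.sqrt(h_dim)
    u = D * torch.tanh(u)                                # clip compatibilities to [-D, D]
    u = u.masked_fill(visited_mask, float('-inf'))       # mask nodes already in the subtour
    return torch.softmax(u, dim=-1)                      # probability over valid nodes
```

The softmax over the masked, clipped compatibilities is exactly the probability vector discussed next.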
The compatibilities computed through Eq. (15), interpreted as unnormalized log-probabilities, are used to compute the probability vector $\mathbf{p}^m$ via the softmax function. The elements of $\mathbf{p}^m$ are defined by Eq. (17).

$p^m_{i,t} = p_{\theta}(\pi^m_t = i \mid s, \boldsymbol{\pi}^m_{1:t-1}) = \mathrm{softmax}(u^m_{i,t})$    (17)

The process by which a single decoder picks nodes is described in Fig. 4. At each time step $t$, the multi-head attention layer takes the context vector $\mathbf{c}^m_t$ and the node embeddings $\mathbf{h}^L_i$ as input and outputs the final multi-head attention value $\mathbf{c}^{m*}_t$ for the context node. Then, the single-head attention layer takes the vector $\mathbf{c}^{m*}_t$ and the node embeddings $\mathbf{h}^L_i$ as input to generate the compatibilities. The probability distribution for selecting the next valid node is derived by Eq. (17).

Fig. 5 shows the structure of the decoder; the 'Decoder' module in it has been described in Fig. 4. Each decoder can generate a permutation, and the final solution is selected from those permutations.

5. Training method

For a random input instance $s$, each decoder individually samples a trajectory $\boldsymbol{\pi}^m$ to obtain a separate REINFORCE loss with the same greedy rollout baseline. For training the model, we define the sum of the expected costs $R(\boldsymbol{\pi}^m)$ (tour length computed as Eq. (1)) as the loss function presented by Eq. (18).

$\mathrm{loss}(\boldsymbol{\theta} \mid s) = \sum_m \mathbb{E}_{p_{\theta}(\boldsymbol{\pi}^m \mid s)}\big[R(\boldsymbol{\pi}^m)\big]$    (18)

where the parameters $\boldsymbol{\theta}$ are optimized by gradient descent using the REINFORCE algorithm with rollout baseline $b(s)$.

$\nabla_{\boldsymbol{\theta}} \mathrm{loss}(\boldsymbol{\theta} \mid s) = \sum_m \mathbb{E}_{p_{\theta}(\boldsymbol{\pi}^m \mid s)}\big[(R(\boldsymbol{\pi}^m) - b(s))\, \nabla_{\boldsymbol{\theta}} \log p_{\theta}(\boldsymbol{\pi}^m \mid s)\big]$    (19)

The model with the best set of parameters among the previous epochs is used as the baseline model, which greedily decodes the result used as the baseline $b(s)$. The value of the baseline $b(s)$ in this work is the minimal cost defined as Eq. (20).

$b(s) = \min_m R\big(\boldsymbol{\pi}^m = (\pi^m_1, \dots, \pi^m_n) \mid s\big)$    (20)

$\pi^m_t = \arg\max_i\, p_{\theta}\big(\pi^m_t = i \mid s, \boldsymbol{\pi}^m_{1:t-1}\big)$    (21)

All steps are listed in Algorithm 1.
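A compact sketch of the resulting policy-gradient update is given below: each of the $M$ decoders samples a tour, and all decoders share the baseline $b(s)$ of Eq. (20), the minimum greedy tour length produced by the current baseline model. Shapes and names are illustrative assumptions.

```python
import torch

def reinforce_loss(tour_lengths, log_probs, baseline_lengths):
    """Policy-gradient surrogate for Eqs. (18)-(20).
    tour_lengths: (B, M) sampled costs R(pi^m); log_probs: (B, M) summed
    log p_theta(pi^m | s); baseline_lengths: (B, M) greedy costs of the baseline
    model, reduced to b(s) by taking the minimum over the M decoders."""
    b = baseline_lengths.min(dim=1, keepdim=True).values      # b(s), Eq. (20)
    advantage = (tour_lengths - b).detach()                   # R(pi^m) - b(s)
    return (advantage * log_probs).sum(dim=1).mean()          # sum over decoders, mean over batch
```

Minimizing this surrogate by gradient descent reproduces the gradient of Eq. (19), since the baseline term does not depend on the parameters.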
6. Experiments

Controlled experiments are designed to probe the performance of the GCE-MAD Net on solving 2D Euclidean TSP instances.

6.1. Evaluation metrics

Five typical metrics are used to evaluate the performance of GCE-MAD Net; they are defined as follows.

(1) Average predicted tour length. The average predicted tour length over the validation and test instances, computed as Eq. (22).

$\bar{R} = \frac{1}{N} \sum_{i=1}^{N} R_i$    (22)

(2) Win rate. The win rate $r_{win}$ is the proportion of winning cases over the total number of tested cases $N$, as defined in Eq. (23). A case is considered won when its tour length computed by a method is shorter than the tour length computed by the CPLEX.

$r_{win} = N_{win} / N$    (23)

where $N_{win}$ is the number of tested winning cases.
(3) Fail rate. The fail rate $r_{fail}$ is the proportion of failed cases over the total number of tested cases $N$, as defined in Eq. (24). A case is considered failed when its tour length computed by a method is longer than the tour length computed by the CPLEX.

$r_{fail} = N_{fail} / N$    (24)

where $N_{fail}$ is the number of tested failing cases.

Table 1
The details of the data set (columns: Data, Name, Number).
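For concreteness, the metrics above reduce to a few lines of code. The optimality gap reported in Tables 2–4 is assumed here to be measured relative to the CPLEX tour length, which is consistent with the negative gaps reported when a method beats CPLEX; the function and variable names are illustrative.

```python
def evaluation_metrics(pred_lengths, cplex_lengths):
    """Average tour length, win rate, fail rate (Eqs. (22)-(24)) and assumed
    optimality gap; inputs are tour lengths over the same N test instances."""
    n = len(pred_lengths)
    avg_length = sum(pred_lengths) / n                                       # Eq. (22)
    win_rate = sum(p < c for p, c in zip(pred_lengths, cplex_lengths)) / n   # Eq. (23)
    fail_rate = sum(p > c for p, c in zip(pred_lengths, cplex_lengths)) / n  # Eq. (24)
    gap = sum((p - c) / c for p, c in zip(pred_lengths, cplex_lengths)) / n  # assumed gap definition
    return avg_length, win_rate, fail_rate, gap
```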
Table 2
Comparison on instances with 20 customers.

Type                      Method              R̄            r_win    r_fail   Gap      t (s)
MIP                       CPLEX               3.89 ± 0.32   –        –        0        18.12
Heuristic                 Nearest neighbor    4.52 ± 0.03   2.81%    97.19%   16.13%   2.78
Heuristic                 Nearest insertion   4.34 ± 0.02   2.03%    97.97%   12.03%   0
Heuristic                 Random insertion    4.02 ± 0.02   24.77%   75.23%   3.32%    0
Heuristic                 Farthest insertion  3.94 ± 0.02   39.22%   60.78%   1.30%    0
Learning-based algorithm  AM                  3.86 ± 0.31   59.69%   26.40%   −0.71%   0.07
Learning-based algorithm  GCN                 3.93 ± 0.33   31.32%   61.88%   1.24%    0.09
Learning-based algorithm  GLN-TSP             3.85          –        –        –        –
Ours                      GCE-MAD Net         3.85 ± 0.31   71.41%   12.73%   −0.91%   0.23
Table 3
Comparison on instances with 50 customers.

Type                      Method              R̄            r_win    r_fail   Gap      t (s)
MIP                       CPLEX               5.76 ± 0.27   –        –        0.00%    2.78
Heuristic                 Nearest neighbor    7.00 ± 0.03   0.08%    99.92%   21.50%   0
Heuristic                 Nearest insertion   6.78 ± 0.02   0        100%     17.84%   0
Heuristic                 Random insertion    6.13 ± 0.02   2.9%     97.1%    6.48%    0
Heuristic                 Farthest insertion  6.01 ± 0.02   9.92%    90.08%   4.36%    0.17
Learning-based algorithm  AM                  5.78 ± 0.28   39.38%   59.3%    0.35%    0.25
Learning-based algorithm  GCN                 6.08 ± 0.32   19.92%   80.08%   4.86%    –
Learning-based algorithm  GLN-TSP             5.85          –        –        –        0.70
Ours                      GCE-MAD Net         5.73 ± 0.27   58.75%   40.31%   −0.40%   2.78
Table 4
Comparison on instances with 100 customers.

Type                      Method              R̄            r_win    r_fail   Gap       t (s)
MIP                       CPLEX               9.01 ± 0.73   –        –        0.00%     121.17
Heuristic                 Nearest neighbor    9.67 ± 0.03   21.48%   78.52%   7.94%     2.68
Heuristic                 Nearest insertion   9.45 ± 0.02   23.59%   76.41%   5.54%     0.01
Heuristic                 Random insertion    8.52 ± 0.02   73.52%   26.48%   −4.93%    0
Heuristic                 Farthest insertion  8.35 ± 0.02   81.80%   18.20%   −6.77%    0
Learning-based algorithm  AM                  8.14 ± 0.28   93.60%   6.40%    −9.19%    0.26
Learning-based algorithm  GCN                 8.78 ± 0.34   57.81%   42.19%   −1.98%    0.60
Learning-based algorithm  GLN-TSP             –             –        –        –         –
Ours                      GCE-MAD Net         8.04 ± 0.27   96.33%   3.77%    −10.27%   0.28
To analyze the sensitivity of the GCE-MAD Net to its hyperparameters (the embedding dimension, the number of graph convolutional layers, the number of decoders, and the number of attention heads), we change those parameters in our model. As we aim to find the optimal solution with minimal cost (tour length), the line plots should always present a downward trend.

From Fig. 6(a), we find that the embedding dimension in the encoder can greatly influence our model. Useful information about customer features may be neglected in a low-dimensional space, which explains the improvement with an increased embedding dimension. However, little changes when the dimension is increased from 128 to 256, because the node and edge features of TSP are easy to represent.

Similarly, we assess the effect of the number of embedding layers in the encoder, as shown in Fig. 6(b). The improvement with more layers reflects that GCNs aggregate and pass messages by stacking layers.

Next, we retrain our model with the number of decoders set to 1, 2, 3, 4 and 5. The performance curves are depicted in Fig. 6(c). Obviously, better solutions can be found with more decoders, especially when going from 1 to 2 decoders.

Finally, the evaluation curves for multiple attention heads are plotted in Fig. 6(d). More attention heads allow the context node to receive more types of messages from all nodes. Through those messages, we can select the next node to insert into the optimal solution, so more heads improve solution quality.

We compare the GCE-MAD Net against classical heuristics and the commonly-used solver CPLEX. The details of these approaches are introduced as follows.

(1) Nearest neighbor

The nearest neighbor heuristic initializes a path with a single random node (we always start with the first node in the input). In each iteration, the next node selected is the one nearest to the end node of the current partial path. The node selected in this iteration becomes the new end node. Finally, after all nodes are selected, the end node is connected to the start node to form a tour.

(2) Nearest/random/farthest insertion

The insertion heuristics initialize a tour with two nodes. In each iteration, one node is selected and inserted into the tour according to some rule; different rules generate different insertion heuristics. Let $S$ be the set of nodes in the tour and $d_{ij}$ the distance from node $i$ to node $j$. Nearest insertion inserts the node $i$ that is nearest to any node in the tour:

$i^* = \arg\min_{i \notin S} \min_{j \in S} d_{ij}$    (26)

Farthest insertion inserts the node $i$ whose distance to the nearest node $j$ in the tour is maximal:

$i^* = \arg\max_{i \notin S} \min_{j \in S} d_{ij}$    (27)
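The two baseline constructions just described can be sketched as follows; the starting-node choice, tie-breaking and function names are assumptions for illustration.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest_neighbor_tour(coords):
    """Grow a path from node 0 by repeatedly appending the nearest unvisited node;
    the tour is closed by returning to the start node at the end."""
    unvisited = set(range(1, len(coords)))
    tour = [0]
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda j: dist(coords[last], coords[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def select_insertion_node(coords, tour, farthest=False):
    """Node-selection rules of Eqs. (26) and (27): pick the node outside the tour
    whose distance to its nearest tour node is minimal (nearest insertion) or
    maximal (farthest insertion)."""
    in_tour = set(tour)
    outside = [i for i in range(len(coords)) if i not in in_tour]
    key = lambda i: min(dist(coords[i], coords[j]) for j in tour)
    return max(outside, key=key) if farthest else min(outside, key=key)
```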
For a fair comparison, we take the pretrained models from Kool et al. (2018) and GCN (Joshi et al., 2019), and both models have the same 100 trial runs. The results of GLN-TSP are the same as in the original paper (Nammouchi et al., 2020). As shown in Tables 2–4, the results on 20 and 50 customers are obtained with Model_TSP20 and Model_TSP50, respectively. Note that our result on 100 customers is also obtained with Model_TSP50. As we can see, our GCE-MAD Net obtains the minimal average tour length on the test data among the classical heuristics and learning-based algorithms. It also outperforms the CPLEX, especially on instances with 100 customers.

To compare the performance of the heuristic methods and the CPLEX on the 1,280 test instances more clearly, in addition to the optimality gap we also report the win rate $r_{win}$ and the fail rate $r_{fail}$. When solving TSP20_Testing and TSP50_Testing, our GCE-MAD Net finds better solutions than the CPLEX on over 50% of the instances. It is notable that the GCE-MAD Net outperforms the CPLEX on nearly 100% of the TSP100_Testing instances, which demonstrates that our GCE-MAD Net keeps a competitive performance as the problem complexity increases.

Fig. 7 shows how the farthest insertion, the AM (Kool et al., 2018), and the GCE-MAD Net compare with the CPLEX on TSP20_Testing, TSP50_Testing, and TSP100_Testing. The coordinates of a blue point represent the tour length computed by the comparison method and the tour length computed by the CPLEX. If a blue point is under the red line ($y = x$), the CPLEX finds a better solution than the comparison method; the other situations can be understood in the same way. Fig. 7(a), (b), and (c) compare the three methods with the CPLEX on TSP20_Testing; as we can see, our GCE-MAD Net finds better solutions than the CPLEX in most cases, and only a few blue points are under the red line. Fig. 7(d), (e), and (f) compare the three methods with the CPLEX on TSP50_Testing; the farthest insertion behaves badly because, as Fig. 7(d) shows, almost all blue points are under the red line. In contrast, the two learning-based methods perform well, and the GCE-MAD Net outperforms the AM (Kool et al., 2018). Fig. 7(g), (h), and (i) compare the three methods with the CPLEX on TSP100_Testing. The CPLEX performs worse than the three comparison methods in most cases, which also demonstrates that the traditional solver CPLEX is powerless on large-scale problems and that our GCE-MAD Net is a good alternative.

6.7. Impact of graph density

In this paper, we use the fixed graph diameter strategy defined by Eq. (5) to realize different graph densities for different graph sizes, i.e., the number of neighbors of each node varies with the graph size. Fig. 8 shows how the performance changes with the graph density. The pretrained Model_TSP20 and Model_TSP50 with different graph densities are tested on three hold-out datasets: TSP20_Testing, TSP50_Testing, and TSP100_Testing. From Fig. 8, we can see that for both the pretrained Model_TSP20 and Model_TSP50, the performance is better with a sparse 40%-nearest-neighbor graph than with the full graph, i.e., too few or too many neighbors decreases the performance of the pretrained model.

6.8. Ablation study

In order to show the efficiency of the GCE-MAD Net, two integral parts of this network need to be tested: (1) the residual block in the encoder, and (2) the multi-decoder strategy.
Fig. 7. The comparison of three methods (the farthest insertion, the AM (Kool et al., 2018), and our GCE-MAD Net) with the CPLEX on TSP20_Testing (Fig. 7 (a), (b), and (c)),
TSP50_Testing (Fig. 7 (d), (e), and (f)), and TSP100_Testing (Fig. 7 (g), (h), and (i)).
Table 5
Ablation investigation of the residual block and the multi-decoder strategy.

Model                              1st             2nd             3rd             4th
Residual block                     ✗               ✓               ✗               ✓
Multi-decoder                      ✗               ✗               ✓               ✓
Average tour length (20 nodes)     3.867 ± 0.315   3.865 ± 0.314   3.854 ± 0.310   3.850 ± 0.309
Average tour length (50 nodes)     5.941 ± 0.316   5.929 ± 0.313   5.852 ± 0.297   5.832 ± 0.292
Average tour length (100 nodes)    9.055 ± 0.392   9.028 ± 0.415   8.641 ± 0.333   8.620 ± 0.336

Three models, each with the corresponding part absent, are trained, and their performance on the same test dataset is compared with the complete model (the GCE-MAD Net). Details of those four models are listed in Table 5. It is notable that the four networks have the same graph diameter $k = 40\%$ and are pretrained on TSP20, then tested on TSP20, TSP50 and TSP100. The baselines for those problem sizes are obtained by the 1st model, without residual block and multi-decoder, and are very poor (the average tour lengths are 3.867 ± 0.315, 5.941 ± 0.316 and 9.055 ± 0.392, respectively). Then, either the residual block or the multi-decoder is added to the baseline, resulting in 3.865 ± 0.314, 5.929 ± 0.313, 9.028 ± 0.415 and 3.854 ± 0.310, 5.852 ± 0.297, 8.641 ± 0.333, respectively (the 2nd and 3rd models in Table 5), which validates that each component can efficiently improve the performance of the baseline. Finally, we add both components to the baseline, resulting in 3.850 ± 0.309, 5.832 ± 0.292 and 8.620 ± 0.336 (the 4th model in Table 5). It can be concluded that the two components together perform better than either one alone. We also visualize random samples of the three problem sizes in Fig. 9. The 4th model (the GCE-MAD Net) performs the best, which demonstrates the effectiveness and benefits of the residual block and the multi-decoder strategy.

6.9. Generalization ability of GCE-MAD Net

In the real world, delivery requests may come from more than 50 customers, but training on larger graphs from scratch is intractable and sample inefficient, so it is very important that a small-scale pretrained model generalizes well to large-scale problem instances. We demonstrate the generalization ability of the proposed model in Fig. 10. A hold-out test set of 124,160 TSP instances consists of 1280 instances each of TSP3, TSP4, . . . , TSP100. Model_TSP20 and Model_TSP50 are two pretrained models, trained on graphs of 20 and 50 nodes respectively, which are used for evaluation on instances of variable size (from 3 to 100). The optimality gaps on those instances are shown in Fig. 10. As it is difficult for CPLEX to find the best solution for large-scale instances, the gap oscillates back and forth for graph sizes from 70 to 100. When the graph size is from 4 to 40, Model_TSP20 and Model_TSP50 both outperform the CPLEX. Starting from a graph size of 40 nodes, Model_TSP50 still outperforms the CPLEX. In general, we can conclude that the GCE-MAD Net pretrained on a larger graph size achieves better generalization ability.

Fig. 10. The generalization performance.

7. Conclusion and future direction

Experiments show that the proposed GCE-MAD Net can find optimal solutions in very little time, even with variable numbers of customers.

In future research, extending the model to solve very large-scale TSP with a huge number of customers is of great interest. A more challenging task is to propose deep reinforcement learning-based algorithms to solve more realistic problems, e.g., TSP with time windows, dynamic demands, and so on.

CRediT authorship contribution statement

Jia Luo: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Visualization, Writing – original draft. Chaofeng Li: Conceptualization, Validation, Supervision, Project administration, Formal analysis, Writing – review & editing. Qinqin Fan: Formal analysis, Writing – review & editing. Yuxin Liu: Investigation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Fig. 11. Random TSP instances of variable sizes are solved by CPLEX (left column) and our model (right column).
References

Al-Gaphari, G.H., Al-Amry, R., Al-Nuzaili, A.S., 2021. Discrete crow-inspired algorithms for traveling salesman problem. Eng. Appl. Artif. Intell. 97, 104006.
Ali, I.M., Essam, D., Kasmarik, K., 2020. A novel design of differential evolution for solving discrete traveling salesman problems. Swarm Evol. Comput. 52, 100607.
Bello, I., Pham, H., Le, Q.V., Norouzi, M., Bengio, S., 2016. Neural combinatorial optimization with reinforcement learning. In: International Conference on Learning Representations. San Juan.
Bengio, Y., Lodi, A., Prouvost, A., 2021. Machine learning for combinatorial optimization: a methodological tour d'Horizon. European J. Oper. Res. 290, 405–421.
Bresson, X., Laurent, T., 2017. Residual gated graph ConvNets. arXiv preprint arXiv:1711.07553.
Chen, J., Ma, T., Xiao, C., 2018. FastGCN: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247.
Cho, K., Gulcehre, B.v.M.C., Bahdanau, D., Schwenk, F.B.H., Bengio, Y., 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: EMNLP.
Dai, H., Khalil, E.B., Zhang, Y., Dilkina, B., Song, L., 2017. Learning combinatorial optimization algorithms over graphs. In: Advances in Neural Information Processing Systems, vol. 30. Long Beach, CA, pp. 6348–6358.
Davendra, D., 2010. Traveling Salesman Problem: Theory and Applications. BoD–Books on Demand.
Defferrard, M., Bresson, X., Vandergheynst, P., 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in Neural Information Processing Systems, vol. 29. Barcelona, Spain, pp. 3844–3852.
Deudon, M., Cournut, P., Lacoste, A., Adulyasak, Y., Rousseau, L., 2018. Learning heuristics for the TSP by policy gradient. In: International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research. Delft, The Netherlands, pp. 170–181.
Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., Adams, R.P., 2015. Convolutional networks on graphs for learning molecular fingerprints. In: Advances in Neural Information Processing Systems, vol. 28. pp. 2224–2232.
Ebadinezhad, S., 2020. DEACO: Adopting dynamic evaporation strategy to enhance ACO algorithm for the traveling salesman problem. Eng. Appl. Artif. Intell. 92, 103649.
Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N., 2017. Convolutional sequence to sequence learning. In: International Conference on Machine Learning, pp. 1243–1252.
Goyal, P., Ferrara, E., 2018. Graph embedding techniques, applications, and performance: A survey. Knowl.-Based Syst. 151, 78–94.
Hromkovič, J., 2013. Algorithmics for Hard Problems: Introduction to Combinatorial Optimization, Randomization, Approximation, and Heuristics. Springer Science & Business Media.
Huang, H., Savkin, A.V., Huang, C., 2020. A new parcel delivery system with drones and a public train. J. Intell. Robot. Syst. 100 (3), 31341–31354.
Joshi, C.K., Laurent, T., Bresson, X., 2019. An efficient graph convolutional network technique for the travelling salesman problem. arXiv:1906.01227.
Kanna, S.K.R., Sivakumar, K., Lingaraj, N., 2021. Development of deer hunting linked earthworm optimization algorithm for solving large scale traveling salesman problem. Knowl.-Based Syst. 227, 107199.
Khan, I., Maiti, M.K., 2019. A swap sequence based artificial bee colony algorithm for traveling salesman problem. Swarm Evol. Comput. 44, 428–438.
Kinable, J., Smeulders, B., Delcour, E., Spieksma, F.C.R., 2017. Exact algorithms for the equitable traveling salesman problem. European J. Oper. Res. 261 (2), 475–485.
Kool, W., Hoof, H.V., Welling, M., 2018. Attention, learn to solve routing problems! In: International Conference on Learning Representations. Vancouver, BC.
Li, Q., Han, Z., Wu, X.-M., 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In: Thirty-Second AAAI Conference on Artificial Intelligence.
Li, W., Wang, G.-G., Gandomi, A., 2021. A survey of learning-based intelligent optimization algorithms. In: Archives of Computational Methods in Engineering. pp. 1–19.
Marcheggiani, D., Titov, I., 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In: EMNLP.
MirHassani, S.A., Habibi, F., 2013. Solution approaches to the course timetabling problem. 39 (2), pp. 133–149.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., 2015. Human-level control through deep reinforcement learning. Nature 518 (7540), 529–533.
Nammouchi, A., Ghazzai, H., Massoud, Y., 2020. A generative graph method to solve the travelling salesman problem. In: IEEE 63rd International Midwest Symposium on Circuits and Systems. pp. 89–92.
Nazari, M., 2018. Reinforcement learning for solving the vehicle routing problem. In: Advances in Neural Information Processing Systems. pp. 9839–9849.
Osaba, E., Yang, X.-S., Ser, J. Del, 2020. Traveling Salesman Problem: A Perspective Review of Recent Research and New Results with Bio-Inspired Metaheuristics. pp. 135–164.
Pandiri, V., Singh, A., 2019. An artificial bee colony algorithm with variable degree of perturbation for the generalized covering traveling salesman problem. Appl. Soft Comput. 78, 481–495.
Paschos, V.T., 2014. Applications of Combinatorial Optimization, vol. 3. John Wiley & Sons.
Rego, C., Gamboa, D., Glover, F., Osterman, C., 2011. Traveling salesman problem heuristics: leading methods, implementations and latest advances. European J. Oper. Res. 211 (3), 427–441.
Saji, Y., Barkatou, M., 2021. A discrete bat algorithm based on Lévy flights for Euclidean traveling salesman problem. Expert Syst. Appl. 172, 114639.
Subramanyam, A., Gounaris, C.E., 2016. A branch-and-cut framework for the consistent traveling salesman problem. European J. Oper. Res. 248 (2), 384–395.
Talbi, E.-G., 2021. Machine learning into metaheuristics: A survey and taxonomy. ACM Comput. Surv. 54 (6), 1–32.
Tran, D.-D., Vafaeipour, M., Baghdadi, M. El., Barrero, R., Mierlo, J. Van., Hegazy, O., 2020. Thorough state-of-the-art analysis of electric and hybrid vehicle powertrains: Topologies and integrated energy management strategies. Renew. Sustain. Energy Rev. 119, 109596.
Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems. Long Beach, CA, pp. 5998–6008.
Vinyals, O., Fortunato, M., Jaitly, N., 2015b. Pointer networks. In: Advances in Neural Information Processing Systems, vol. 28. Montréal, Canada, pp. 2692–2700.
Wang, Y., Han, Z., 2021. Ant colony optimization for traveling salesman problem based on parameters optimization. Appl. Soft Comput. 107, 107439.
Wang, Z., Zhang, Y., Zhou, W., Liu, H., 2012. Solving traveling salesman problem in the Adleman–Lipton model. Appl. Math. Comput. 219 (4), 2267–2270.
Williams, R.J., 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8 (3), 229–256.
Williamson, D.P., Shmoys, D.B., 2011. The Design of Approximation Algorithms. Cambridge University Press.
Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Philip, S.Y., 2020. A comprehensive survey on graph neural networks. 32 (1), pp. 4–24.
Wu, J., Muren, Zhou, L., Du, Z., Lv, Y., 2019. Mixed steepest descent algorithm for the traveling salesman problem and application in air logistics. Transp. Res. E 126, 87–102.
Zhou, J., 2020. Graph neural networks: A review of methods and applications. AI Open 1, 57–81.