25982-Article Text-30045-1-2-20230626

The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)
Pointerformer: Deep Reinforced Multi-Pointer Transformer

for the Traveling Salesman Problem
Yan Jin1 , Yuandong Ding1 , Xuanhao Pan1 , Kun He1,3 * , Li Zhao2 , Tao Qin2 , Lei Song2 , Jiang Bian2
1
School of Computer Science, Huazhong University of Science and Technology, China
2
Microsoft Research Asia
3
HopcroftCenter on Computing Science, Huazhong University of Science and Technology, China
{jinyan, yuandong, xhpan, brooklet60 }@hust.edu.cn, {lizo, taoqin, lei.song, jiang.bian}@microsoft.com
Abstract Due to both of its theoretical and practical importance,

TSP has attracted a great number of research efforts in the
Traveling Salesman Problem (TSP), as a classic routing op- past decades that attempted to address it using either exact
timization problem originally arising in the domain of trans- or heuristic algorithms. In fact, the NP-hardness nature of
portation and logistics, has become a critical task in broader
domains, such as manufacturing and biology. Recently, Deep
TSP makes it computationally intractable to leverage exact
Reinforcement Learning (DRL) has been increasingly em- algorithms to find the optimal solutions over a large-scale
ployed to solve TSP due to its high inference efficiency. Nev- TSP, since the corresponding computation complexity in-
ertheless, most of existing end-to-end DRL algorithms only creases exponentially with respect to the number of nodes.
perform well on small TSP instances and can hardly generalize Hence, facing most real-world TSP applications, heuristic al-
to large scale because of the drastically soaring memory con- gorithms are usually adopted to obtain near-optimal solutions.
sumption and computation time along with the enlarging prob- However, to ensure achieving high-quality solutions, a few
lem scale. In this paper, we propose a novel end-to-end DRL heuristic algorithms are designed to further rely on fine-tuned
approach, referred to as Pointerformer, based on multi-pointer search strategies, which may significantly increase the time
Transformer. Particularly, Pointerformer adopts both reversible complexity for solving large-scale TSP.
residual network in the encoder and multi-pointer network in
the decoder to effectively contain memory consumption of Recently, there have been a soaring number of studies try-
the encoder-decoder architecture. To further improve the per- ing to solve TSP using Deep Learning (DL) algorithms with
formance of TSP solutions, Pointerformer employs a feature either Supervised Learning (SL) or Reinforcement Learning
augmentation method to explore the symmetries of TSP at (RL) (Vinyals, Fortunato, and Jaitly 2015; Nowak et al. 2017;
both training and inference stages as well as an enhanced Kool, van Hoof, and Welling 2018; Kwon et al. 2020; Zheng
context embedding approach to include more comprehensive et al. 2021; Fu, Qiu, and Zha 2021; Jiang et al. 2022; Kwon
context information in the query. Extensive experiments on et al. 2021; Ma et al. 2021; Kim, Park et al. 2021). Depending
a randomly generated benchmark and a public benchmark on the specific ways to construct solutions, DL algorithms
have shown that, while achieving comparative results on most
small-scale TSP instances as state-of-the-art DRL approaches
can be roughly divided into two main categories: search-
do, Pointerformer can also well generalize to large-scale TSPs. based DL (d O Costa et al. 2020; Fu, Qiu, and Zha 2021; Ma
et al. 2021) and end-to-end DL (Kool, van Hoof, and Welling
2018; Kwon et al. 2020, 2021; Jiang et al. 2022; Kim, Park
Introduction et al. 2021). By incorporating heuristic search operators with
learning-based policy, search-based DL can solve larger-scale
The Traveling Salesman Problem (TSP) is a well-known com- TSP instances. However, they usually suffer from two major
binatorial optimization problem. It can be stated as follows: limitations. The first one lies in the inference efficiency is-
given a set of cities/nodes, a salesman departing from one city sue, meaning that the search component usually takes a long
needs to traverse all other cities exactly once and finally re- time to terminate to obtain high quality solutions. Moreover,
turns to the start city. The objective of TSP is to find the short- the obtained solution performance is very sensitive to the
est route for the salesman. In addition to its well-recognized selection of search operators which is highly dependent on
theoretical importance as a classic combinatorial optimiza- sophisticated domain knowledge. In contrast, end-to-end DL
tion problem, TSP also has a wide range of real-world ap- algorithms are very efficient in generating solutions and bear
plications, such as drilling of printed circuit boards (Alkaya much lower dependencies on domain knowledge. Therefore,
and Duman 2013), X-Ray crystallography (Bland and Shall- end-to-end DL algorithms are more suitable for many emerg-
cross 1989), warehouse order picking (Madani, Batta, and ing TSP application scenarios, such as on-call routing (Ghiani
Karwan 2020), transport routes optimization (Hacizade and et al. 2003) and ride hailing service (Xu et al. 2018), that
Kaya 2018), and many others (Matai, Singh, and Lal 2010). require to generate solutions in almost real-time.
* Corresponding author. Compared with supervised learning relying on the optimal
Copyright © 2023, Association for the Advancement of Artificial solution as the learning labels, which is usually unknown
Intelligence (www.aaai.org). All rights reserved. when facing the large-scale TSP, RL yields the advantage
8132
since it can be applied to attain near-optimal solutions with- Traditional TSP algorithms. TSP is one of the most typ-
out requiring the existence of ground truth. Therefore, most ical combinatorial optimization problems, and numerous al-
recent studies tend to apply the Deep Reinforcement Learning gorithms have been proposed for solving TSP over the past
(DRL) approach to solve the large-scale TSP. Nevertheless, decades. Traditional TSP algorithms can be classified into
most of existing end-to-end DRL algorithms only perform three categories, i.e., exact algorithms, approximate algo-
well on small TSP instances (no more than 100 nodes) and rithms and heuristic algorithms. Concorde (Applegate et al.
are hard to scale to larger instances. This is mainly due to the 2007) is one of the fastest exact solvers. It models TSP as
drastically soaring memory consumption and computation a mixed-integer programming problem, and then adopts a
time along with the increasing nodes. branch and cut algorithm (Padberg and Rinaldi 1991) to
In this paper, we propose a novel scalable DRL method search the solution. Christofides et al., (Christofides 1976)
based on multi-pointer Transformer, denoted as Pointer- proposed an approximation algorithm, and the approximation
former, aiming to solve TSP in an end-to-end manner. ratio of 1.5 is achieved by constructing the minimum span-
While following the classical encoder-decoder architec- ning tree and the minimum perfect matching of the graph.
ture (Vaswani et al. 2017), this new approach adopts re- LKH-3 (Helsgaun 2017) is one of the SOTA heuristics, which
versible residual network (Gomez et al. 2017; Kitaev, Kaiser, uses the k-opt operators to search in the solution space, with
and Levskaya 2019) instead of the standard residual network the guidance of an α−measure based on a variant of mini-
in the encoder to significantly reduce memory consumption. mum spanning tree. Among these traditional algorithms, the
Furthermore, instead of employing the memory-consuming heuristics are the most widely used algorithms in practice,
self-attention module as in (Kool, van Hoof, and Welling yet they are still time-consuming and difficult to be extended
2018; Kwon et al. 2020), we propose a multi-pointer net- to other problems.
work in the decoder to sequentially generate the next node Besides of these traditional algorithms, there are also
according to a given query. Besides addressing the issues of works that attempt to utilize the power of machine learning
memory consumption, Pointerformer contains delicate de- and reinforcement learning techniques. Earlier machine learn-
sign to further improve the model effectiveness. Particularly, ing approaches include the Hopfield neural network (Hopfield
to improve the effectiveness of obtained solutions, Pointer- and Tank 1985) and self-organising feature maps (Angeniol,
former employs a feature augmentation method to explore the Vaubois, and Le Texier 1988). There are several works like
symmetries of TSP at both training and inference stages as Ant-Q (Gambardella and Dorigo 1995) and Q-ACS (Sun, Tat-
well as an enhanced context embedding approach to include sumi, and Zhao 2001) that combined reinforcement learning
more comprehensive context information in the query. with ant colony algorithm, and Liu and Zeng (Liu and Zeng
To demonstrate the effectiveness of Pointerformer, we con- 2009) used reinforcement learning to improve the mutation
ducted extensive experiments on two datasets, including ran- of a successful genetic algorithm called EAX-GA (Nagata
domly generated instances and widely used public bench- 2006). It is worth mentioning that a recent work, called VSR-
marks. Experimental results have shown that Pointerformer LKH (Zheng et al. 2021), defined a novel Q-value based on
not only achieves comparative results on small-scale TSP in- reinforcement learning to replace the α−value used by the
stances as State-Of-The-Art (SOTA) DRL approaches do, but LKH algorithm, and achieved a better performance on TSP.
also can generalize to large-scale TSPs. More importantly, DL-based TSP algorithms. DL-based TSP algorithms
while being trained on randomly generated instances, our are mainly proposed in recent years, according to the way
approach can achieve much better performance on instances the solution is generated, they can be classified into two
with different distributions, indicating a better generalization. categories: end-to-end methods and search-based methods.
Our main contributions can be summarized as follows. End-to-end methods create a solution from the scratch
• We propose an effective end-to-end DRL algorithm without (Bello et al. 2016; Dai et al. 2017; Kim, Park et al. 2021;
relying on any hand-crafted heuristic operators, which is the Kool, van Hoof, and Welling 2018; Kwon et al. 2020; Nazari
first end-to-end DRL approach that can scale to TSP instances et al. 2018; Vinyals, Fortunato, and Jaitly 2015). Vinyals et
with up to 500 nodes to the best of our knowledge. al., (Vinyals, Fortunato, and Jaitly 2015) proposed a Pointer
• Our algorithm applies an auto-regressive decoder with a NetWork to solve TSP with supervised learning. Bello et
proposed multi-pointer network to generate solutions sequen- al., (Bello et al. 2016) then used RL to train a PtrNet model
tially without relying on any search components. Compared to minimize the length of solutions. This method achieves
with existing search-based DRL algorithms, we can achieve better performance and has stronger generalization and scal-
comparable solutions while the inference time is reduced by ability. To deal with both static and dynamic information,
almost an order of magnitude. Nazari (Nazari et al. 2018) improved PtrNet, which is more
• Besides scalability, extensive experiments also show that effective than many traditional methods. Dai et al., (Dai
our approach can generalize well to instances that have varied et al. 2017) proposed Structure2Vec which encodes partial
distributions without re-training. solutions and predicts the next node. The Q-learning method
is used to train the whole policy model. Attention Model
in (Kool, van Hoof, and Welling 2018) adopts the Trans-
Related Work former (Vaswani et al. 2017) architecture and the model is
Here we highlight a few of the best traditional algorithms for trained through the REINFORCE algorithm with a greedy
solving TSP, and then focus on presenting the DL algorithms roll-out baseline. It shows the efficiency of Transformer in
that are more related to our work. solving TSP. Then Kwon et al., proposed POMO (Kwon et al.
8133
2020) using REINFORCE algorithm with a shared baseline. the negative cost of the newly added edge. For each problem
It leverages the existence of multiple optimal solutions of a instance s, our goal is to maximize the expected cumulative
combinatorial optimization problem. Currently, end-to-end reward defined as follows:
methods perform well on TSP instances with nodes less than
100, but due to the complexity of the model and the low sam- J(θ | s) = Eτ ∼pθ (τ |s) R(τ ) (2)
pling efficiency of reinforcement learning, it is hard to extend
them to a larger scale. where R(τ ) = −L(τ ) and pθ (τ | s) = ΠN | s, τ [:
i=1 πθ (τ[i]
Search-based methods start from a feasible solution and i)).
learn how to constantly improve the solution (Chen and Tian According to the policy gradient theorem (Sutton et al.
2019; d O Costa et al. 2020; Fu, Qiu, and Zha 2021; Joshi, 2000), we can calculate the derivative of the objective func-
Laurent, and Bresson 2019; Kool et al. 2022). The improve- tion to update the model using many existing policy gradient
ment is often achieved by integrating with heuristic operators. algorithms.
For instance, Chen et al., proposed NeuRewriter (Chen and ∇θ J(θ | s) = Epθ (τ |s) [∇θ log pθ (τ | s)R(τ )] (3)
Tian 2019), which rewrites local components through region-
pick and rule-pick. They trained the model with Advantage
Actor-Critic, and the reduced cost per iteration is used as The Pointerformer Approach
its reward. Two approaches (Joshi, Laurent, and Bresson The proposed Pointerformer is an end-to-end DRL algorithm
2019; Kool et al. 2022) used supervised learning to generate based on multi-pointer transformer which combines a trans-
the heat maps of the given graphs, and then employed dy- former encoder and an auto-regressive decoder. The general
namic programming and beam search to find near-optimal framework of Pointerformer is illustrated in Figure 1.
solutions respectively. There is another method using Monte In principle, Pointerformer applies multiple attention lay-
Carlo tree search (MCTS) to improve the solution such as ers that consist of multi-head self-attention and feed-forward
Att-GCRN+MCTS (Fu, Qiu, and Zha 2021). They first train layers to encode the input nodes for obtaining an embedding
a model to generate heat maps for guiding MCTS on small- of each node. Then, a multi-pointer network with a single
scale instances by SL, based on which heat maps of larger head attention is employed to decode sequentially accord-
TSP instances were then constructed by graph sampling, ing to a query composed of an enhanced context embedding.
graph converting and heat maps merging. Finally, MCTS Here, the enhanced context embedding contains not only
is used to search for solutions based on these heat maps. information about the instance itself and nodes that are to
However, performance of such approaches highly depends be visited, but also information about nodes that have been
on the number of iterations or search, which is usually time- visited. The solution is generated by choosing a node at each
consuming and hinders their applications in time sensitive step according to the probability distribution given by the
tasks. decoder, where all the visited nodes are masked so that their
probability is 0. Finally, the proposed Pointerformer is trained
Problem Formulation with a modified REINFORCE algorithm, which is based on a
While there are many varieties of TSP problems, we focus on shared baseline for policy gradients while unifying the mean
the classic two-dimensional Euclidean TSP in this paper. Let and variance of a batch of instances. In the following subsec-
G(V, E) denote an undirected full connection graph, where tions, we describe the key components of Pointerformer.
V = {vi | 1 ≤ i ≤ N } represents all N cities/nodes and
E = {eij | 1 ≤ i, j ≤ N } is the set of all edges. Let Reversible Residual Network Based Encoder
cost(i, j) be the cost of moving from vi to vj , which equates The encoder is an important ingredient for the Pointerformer
the Euclidean distance between vi and vj . We further assume architecture. As we mentioned before, the resource consumed
depot ∈ V denoting the depot city, from which the salesman by the original Transformer (Vaswani et al. 2017) increases
starts the trip and will go back in the end. A route is defined dramatically as the length of the input sequence increases,
as a sequence of cities. A route is feasible if and only if it which equates the number of nodes in TSP. Therefore, we
starts from and ends at depot while traverses all other cities adopt a Transformer without positional encoding but includ-
exactly once. Given a route τ , its total cost, denoted by L(τ ), ing a reversible residual network, in order to scale to large
can be calculated by Eq. (1), where τ[i] denotes the i-th node TSP instances. To our knowledge, the reversible residual net-
on τ and N = |τ | is the length of τ . work has not been introduced into the DRL approaches of
combinatorial optimization problems before.
N −1
X In the classic two-dimensional Euclidean TSP setting, each
L(τ ) = cost(τ[N ] , τ[1] ) + cost(τ[i] , τ[i+1] ) (1) node is solely denoted by its coordinates (x, y). To obtain
i=1
a robust embedding for each node, we propose a feature
A solution τ of TSP can be generated sequentially by augmentation mechanism such that each node is denoted
selecting the next node from all nodes that are to be visited by (x, y, η), where η = atanh xy . Furthermore, inspired by
until returning to the depot. This can be seen as a Markov the data augmentation in POMO (Kwon et al. 2020) that
decision process. The decision of each step can be modeled by generates 8 equivalent instances of each instance by flipping
a deep neural network parameterized by θ: πθ (τ[i] | s, τ [: i)), and rotating its underlying graph, we finally use them on the
where s denotes a TSP instance and τ [: i) is the partial route defined feature to obtain 24 features for each node. These
on τ before the i-th step. The reward of each step is defined as features will be the input of the initial embedding layer.
8134
Figure 1: The overall architecture of Pointerformer. First, multiple attention layers are applied to encode the nodes of the input
TSP instance. Next, a multi-pointer network is used to sequentially decode the solution by a query composed of an enhanced
embedding.
After the initial embedding layer, nodes will go through but memory-consuming, mainly due to the attention module
the encoder with multiple residual layers, each of which used in the query. To alleviate this, we improve our decoder
is constituted by a multi-head self-attention (MHA) sub- by integrating the following distinguishing features.
layer and a feed-forward (FF) sub-layer. Here, we employ Enhanced Context Embedding. Recall that a route τ of
the reversible residual network (Gomez et al. 2017; Kitaev, TSP is composed of a sequence of nodes on it. We propose an
Kaiser, and Levskaya 2019) to save memory consumption. effective and enhanced context embedding that contains the
Different from residual networks where activation values of following information hτ[1] , hτ[t] , hg , and hτ , where t = |τ |
all residual layers need to be stored in order to calculate the is used to denote the length of τ :
derivations during back-propagation, in reversible residual • hτ[1] , embedding of the first node on τ : A static information
networks, MHA and FF maintain a pair of input and output that is the embedding of depot;
embedding features (X1 , X2 ) and (Y1 , Y2 ) so that derivations • hτ[t] , embedding of the last node on τ : A dynamic infor-
can be calculated directly. Below we illustrate the details in mation that is updated according to the current route;
Eq. (4) and (5):
• hg , graph embedding: To encode the whole TSP instance,
Y1 = X1 + MHA(X2 ), which is the summation of embeddings of all nodes in the
(4) PN
instance: hg = i=1 henc enc
Y2 = X2 + FF(Y1 ). i , where hi is the embedding of
the i-th node obtained by the encoder;
Obviously, the input embedding features (X1 , X2 ) can be • hτ , embedding of τ : To encode the current partial route,
calculated from the output embeddings (Y1 , Y2 ) easily during which is the summation of embeddings of all nodes on τ
back-propagation: Pt−1
hτ = i=1 henc τ[i] .
X2 = Y2 − FF(Y1 ), The enhanced context embedding is used as a query qt ,
(5) which is computed by qt = N1 (hg + hτ ) + hτ[t−1] + hτ[1] .
X1 = Y1 − MHA(X2 ).
Since the graph embedding is able to reflect different graph
Note that the deeper of the residual network, the more structures while information about depot and the last visited
memory the reversible residual network can save. In our node is crucial for selecting future nodes, we include such
work, we apply MHA and FF of six layers, we can observe information to guide the decoder similar as in the previous
dramatic reduction of memory consumption without affecting DRL algorithms (Kool, van Hoof, and Welling 2018; Kwon
the performance. et al. 2020). Additionally, we also utilize hτ in our decoder
which is ignored in previous solutions. The motivation is
Multi-pointer Network Based Decoder that even with the same first and last nodes, two routes may
The decoder is an auto-regressive process that is to sequen- cause different distributions over nodes that are to be visited.
tially generate a feasible route for each TSP instance. A As shown in our experiments, such information is crucial,
context embedding is used to represent the current state, and particularly for instances from practical applications. Notice
is used as a query to interact with embeddings of nodes that that we normalize the graph embedding and the current partial
are to be selected. The context embedding is updated con- route embedding by dividing the total number of nodes N .
stantly as more nodes are selected until a feasible route is A Multi-pointer Network. At each step, the above en-
obtained. The auto-regressive decoder is generally very fast hanced context embedding is used to interact with all nodes
8135
that are to be visited to output a probability distribution over
them. We devise a multi-pointer network to better utilize the ∇θ J(θ) ≈
context embedding. More specifically, we linearly project the !
B X N
R τij − µ (τi )

queries qt and keys kj (embedding of the j-th node given by 1 X
∇θ log pθ τij | s
the encoder) to dk dimensions by using H different linear B × N i=1 j=1 σ (τi )
projections for each of them. For each projection, we are able
N
to obtain an interaction between the√query and node j via 1 X j
µ (τi ) = R τi
a dot operator and normalization by dk . The final interac- N j=1
tion is simply evaluated by an average operator over all H
PH (q W q )T (k W k ) N
interactions, namely, PN = H1 h=0 t h√d j h . 1 X j 2
σ (τi ) = R τi − µ (τi )
k
N j=1
We further minus PN by the cost between the last node
(7)
i of the partial route and node j to obtain the interaction
score between i and j: scoreij = PN − cost(i, j). By doing
so, we encourage the approach to start from a good policy Experiments
that is always selecting the nearest node as the next one to To evaluate the efficiency of Pointerformer, we compare its
visit. Comparing to starting from a random policy, this will performance with SOTA DRL approaches. We train and test
accelerate our training procedure considerably. Pointerformer on randomly generated instances, and verify
Similar to (Bello et al. 2016), the probability is obtained its generalization on a public benchmark.
by Eq. (6), where we clip the score with tanh and mask
all visited nodes. Here, C is a coefficient that controls the Benchmark Instances
range of values. The larger the value of C is, the smaller of • TSP random: Uniformly sample a certain number of
the entropy, hence it can be seen as a parameter to control nodes from the unit square of [0, 1]2 . It includes five sets
the trade-off between exploitation and exploration during of TSP instances with N = 20, 50, 100, 200, 500. Same as
training. We will show via ablation studies that the value of in Att-GCRN+MCTS (Fu, Qiu, and Zha 2021), for TSP in-
C has a significant impact on performance. stances with N ≤ 100, we sample 10,000 instances for each
set, while for larger instances with N ≥ 200, the set size
is 128. The same benchmark is also widely adopted to tes-

C · tanh (scoreij ) node j is to be visited
uij = (6) tify existing DRL approaches except that they only consider
−∞ otherwise
instances with N ≤ 100;
• TSPLIB: A well-known TSP library (Reinelt 1991) that
Finally, we are able to compute the output probability contains 100 instances with various node distributions. These
vector p using a softmax function. instances come from practical applications with size rang-
ing from 14 to 85,900. In our experiment, we consider all
instances with no more than 1,002 nodes.
A Modified REINFORCE Algorithm
Baselines
We train our Pointerformer model by using the REINFORCE The following SOTA DL algorithms are considered as our
algorithm (Williams 1992), whose baseline applies diverse baselines.
greedy roll-outs of all instances for policy gradient. Inspired End-to-end DL algorithms:
by POMO (Kwon et al. 2020), our decoder also starts from • AM (Kool, van Hoof, and Welling 2018): A model based
N different nodes for each TSP instance with N nodes. By on attention layer is trained using the REINFORCE algorithm
taking each node as the depot, foreach TSP instance i, we can with a deterministic greedy roll-out baseline. AM can achieve
sample N feasible routes τi = τi1 , τi2 , . . . , τiN by Monte good performance on small-scale TSP instances;
Carlo sampling method. Therefore, given a batch containing
• POMO (Kwon et al. 2020): To reduce the variance of ad-
B TSP instances, we can obtain B × N routes, which can
vantage estimation, POMO improves the algorithm in AM
be used to train our policy according to Eq. (3). However,
such that it generates N trajectories for each instance with
directly applying REINFORCE will cause the algorithm hard
N nodes and uses data augmentation to improve the quality
to converge because of high variance of costs among different
of solutions during validation;
instances. In order to alleviate such a problem, we further
use a variance-consistent normalization mechanism before • AM+LCP (Kim, Park et al. 2021): It proposes a training
training, which can increase the speed of convergence while paradigm for solving TSP called termed learning collabora-
also stabilizes the training. More details can be found in tive policy. It distinguishes policy seeder and policy reviser,
Eq. (7), where µ(τi ) and σ(τi ) are the mean and variance which focus on exploration and exploitation, respectively.
of the N trajectories of instance i, respectively. One can Search-based DL algorithms:
R(τij )−µ(τi ) • DRL+2opt (d O Costa et al. 2020): DRL+2opt guides the
easily observe that σ(τi ) is an unbiased estimation of search of 2-opt operator through DRL. The combination of re-
the TSP objective function, which eliminates the effect of inforcement learning and heuristic search operator constantly
different rewards among different instances. improve solutions to achieve good results.
8136
• Att-GCN+MCTS (Fu, Qiu, and Zha 2021): It trains a model
to generate heat maps for guiding MCTS on small-scale
instances by supervised learning, based on which heat maps
of larger instances are then constructed by graph sampling,
graph converting and heat maps merging. Finally, MCTS is
used to search for solutions based on the heat maps.
Hyper-Parameters
In our experiments, we only use instances from TSP random
to train various models corresponding to instances with dif-
ferent nodes. During each training epoch, 100,000 instances Figure 2: Comparison of memory consumption between
are randomly sampled. To train models for instances of size Pointerformer and POMO. Along with the enlarging problem
N ≤ 200, we use a single GPU V100 (16G) with batch size size, the memory consumption of POMO increases sharply,
B = 64, while for other cases the models are trained on four while our model increases gradually.
GPUs V100 (32G) with batch size B = 32. Adam is used as
the optimizer for all models with a learning rate η = 10−4
and a weight decay ω = 10−6 . We use 6 layers in the encoder attain better results on TSP instances with 200 nodes in less
(nt = 6) and let dk = 128 and H = 8 of multi-pointer in the time. We should mention that results of Att-GCN+MCTS
decoder. The number of heads is 8 in the MHA layer. When are taken directly from (Fu, Qiu, and Zha 2021), where the
evaluating on TSP random, the batch size B is 128 for in- search component is implemented in C++ and runs in a CPU
stances with N ≤ 200, while B = 64 for other cases. Our with 8 cores in parallel.
algorithm is implemented based on PyTorch (Paszke et al. To the best of our knowledge, Pointerformer is the first
2019), the trained models and the related data are publicly end-to-end DRL algorithm that can scale to TSP instances
available. 1 with more than 100 nodes while still achieve comparable
results as search-based DRL approaches, but in shorter time.
Experimental Results In order to evaluate the generalization of the proposed
Pointerformer, we apply Model100 directly to the TSPLIB
To show the effectiveness of Pointerformer, we first train instances, similar for the baseline algorithms AM, POMO,
models with different number of nodes, denoted by ModelN and DRL+2opt. Note that we do not compare with Att-
with N = 20, 50, 100, 200, and 500, respectively. For train- GCN+MCTS and AM+LCP here, since we have not fig-
ing ModelN , random instances of size N are sampled from ured out how to extend Att-GCN+MCTS to non-random set-
TSP random using parameters as stated in the above section. ting, while the implementation of AM+LCP is not publicly
We have conducted the experiment on TSP random and available. To further verify the importance of scalability, we
a further study of generalization on TSPLIB, in all of which also apply Model200 to these instances, which are unavail-
we observe advantages of Pointerformer over others. For the able for the baselines due to lack of scalability. We see that
group of TSP random benchmark, the results are shown in Model200 has better generalization comparing to Model100,
Table 1, from which we can see that Pointerformer has the particularly for large-scale instances. Table 2 summarizes the
best trade-off between efficiency and optimality compared to results of Pointerformer in comparison with the three base-
others. Pointerformer can achieve results of relatively small lines on instances from TSPLIB, where we classify instances
gaps to the optimal solutions that are achieved by the exact in TSPLIB into three groups according to their sizes, i.e.,
algorithm Concorde, denoted by OPT. More importantly, one TSPLIB1∼100, TSPLIB101∼500, and TSPLIB501∼1002.
easily observes that Pointerformer can scale to TSP instances From the results, we can see that POMO performs the best
with up to 500 nodes while other DRL algorithms except Att- on instances with no more than 100 nodes and the second
GCN+MCTS quickly run out of memory for TSP instances best on instances between 101 to 500 nodes. While Pointer-
with N > 100 (indicated by - in Table 1). In Fig. 2, we also former (Model100) performs the best on instances between
compare memory consumption of our model with the SOTA 101 to 500 nodes and the second best on the other two groups.
DRL approach POMO trained on instances of different size. One notices that most instances of second group are around
One easily observes that along with the enlarging problem 100 nodes, so Pointerformer (Model100) has the best perfor-
size, the memory consumption of POMO increases sharply, mance and POMO has the second best performance for them.
while our model increases gradually. Note that since the Pointerformer (Model200) and Pointerformer (Model100)
architecture of POMO is most similar with ours, it is more perform the best and the second best on instances with more
fair to use POMO for comparison of memory consumption than 500 nodes, indicating that our model generalizes best to
when comparing to other DRL models. Comparing to search- large-scale instances.
based approach, the solutions obtained by Pointerformer may
be slightly worse than Att-GCN+MCTS on TSP instances Ablation Studies
with 500 nodes. However, we can accelerate the computing
time by up to 6 times (5.9m to 59.35s). In particular, we can In this section, we present some ablation studies that explain
some important choices of our approach.
1 To assess the influence of some key components to
https://github.com/Learning4Optimization-
HUST/Pointerformer the performance of Pointerformer, we carry out an addi-
8137
Method TSP random20 TSP random50 TSP random100 TSP random200 TSP random500
Len Gap Time Len Gap Time Len Gap Time Len Gap Time Len Gap Time
(%) (%) (%) (%) (%)
OPT 3.83 5.69 7.76 10.72 16.55
AM 3.83 0.06 5.22s 5.72 0.49 12.76m 7.94 23.20 32.72m - - - - - -
POMO 3.83 0.00 36.86s 5.69 0.02 1.15m 7.77 0.16 2.17m - - - - - -
AM+LCP 3.84 0.00 30.00m 5.70 0.02 6.89h 7.81 0.54 11.94h - - - - - -
DRL+2opt 3.83 0.00 3.33h 5.70 0.12 4.62m 7.82 0.78 6.57h - - - - - -
Att-GCN+MCTS 3.83 0.00 1.6m 5.69 0.01 7.90m 7.76 0.04 15m 10.81 0.88 2.5m 16.97 2.54 5.9m
Pointerformer 3.83 0.00 5.82s 5.69 0.02 11.63s 7.77 0.16 52.34s 10.79 0.68 5.54s 17.14 3.56 59.35s
TSP20, TSP50 and TSP100: 10,000 instances; TSP200 and TSP500: 128 instances.
Table 1: Comparison results on instances from TSP random.
Method TSPLIB1∼100 TSPLIB101∼500 TSP501∼1002

Len Gap Time Len Gap Time Len Gap Time
(%) (%) (%)
OPT 19454.17 40842.43 62427.71
AM 22283.67 15.36 0.23s 72137.93 78.18 0.86s 140664.29 139.02 5.79s
POMO 19628.67 1.20 1.41s 43652.77 6.99 1.55s 82162.29 26.93 3.49s
DRL+2opt 19916.50 2.43 15.20m 46651.40 13.85 27.92m 82797.71 42.57 1.24h
Pointerformer (Model100) 19728.50 1.33 0.20s 42963.20 5.43 0.46s 75081.43 18.65 5.14s
Pointerformer (Model200) 20135.00 2.91 0.20s 43810.67 8.37 0.46s 73915.57 18.20 5.14s
Table 2: Comparison results on practical instances from TSPLIB.
Algorithm Len Gap 40
Pointerformer 10.793 0.68%

Number…of…best…solutions
35
w.o. feature augmentation 10.813 0.87%
w.o. enhanced context embedding 11.013 2.73% 30
w.o. multi-pointer network 10.797 0.72% 25
20
Table 3: Ablations of three key elements of Pointerformer on
TSP random200. 15
10
5
tional ablation study to compare Pointerformer and its four
variants on instances from TSP random with 200 nodes 0
Pointerformer w.o.…feature w.o.…multi-w.o.…enhanced…
(TSP random200). The results are summarized in Table 3. augmentation …pointer… …context…
The first variant only uses the coordinates of each node as …network …embedding
inputs without any feature augmentation (denoted by w.o. fea-
ture augmentation in the table). The second variant removes Figure 3: Ablations of three key elements of Pointerformer
the embedding of the current partial route from the context on TSP random200.
embedding (denoted by w.o. enhanced context embedding).
And the third variant does not use the multi-pointer net-
works, denoted by w.o. multi-pointer network. From Table 3,
it is clear that Pointerformer achieves the best performance (TSPs). By integrating feature augmentation, reversible resid-
comparing to all the variants, which indicates all components ual network, and enhanced context embedding with the well-
play positive roles to our algorithm. Furthermore, we apply known Transformer architecture, Pointerformer can achieve
these models directly test the instances from TSPLIB and comparable results as SOTA algorithms do but using less
provide their comparisons in Figure 3. Pointerformer with all resources (time or memory). While being memory-efficient,
components outperforms the three variants, indicating that Pointerformer can be scaled to handle TSP instances with 500
these components are also important for the generalization nodes, that existing end-to-end DRL approaches could not
of Pointerformer. solve. More importantly, we show via extensive experiments
on well-known TSP instances with different distributions that
our approach has better generalization. For future work, we
Conclusion will explore how to extend our approach to address the more
In this paper, we propose an end-to-end DRL approach complicated problem of vehicle routing and other combinato-
called Pointerformer to solve the traveling salesman problems rial optimization problems.
8138
Acknowledgements Hacizade, U.; and Kaya, I. 2018. GA Based Traveling
This work is supported by National Natural Science Founda- Salesman Problem Solution and its Application to Transport
tion (U22B2017) and MSRA Collaborative Research 2022 Routes Optimization. IFAC-PapersOnLine, 51(30): 620–625.
(100338928). Helsgaun, K. 2017. An extension of the Lin-Kernighan-
Helsgaun TSP solver for constrained traveling salesman and
References vehicle routing problems. Roskilde: Roskilde University.
Alkaya, A. F.; and Duman, E. 2013. Application of sequence- Hopfield, J. J.; and Tank, D. W. 1985. “Neural” computation
dependent traveling salesman problem in printed circuit of decisions in optimization problems. Biological cybernetics,
board assembly. IEEE Transactions on Components, Packag- 52(3): 141–152.
ing and Manufacturing Technology, 3(6): 1063–1076. Jiang, Y.; Wu, Y.; Cao, Z.; and Zhang, J. 2022. Learning to
Angeniol, B.; Vaubois, G. D. L. C.; and Le Texier, J.-Y. 1988. Solve Routing Problems via Distributionally Robust Opti-
Self-organizing feature maps and the travelling salesman mization. arXiv preprint arXiv:2202.07241.
problem. Neural Networks, 1(4): 289–293. Joshi, C. K.; Laurent, T.; and Bresson, X. 2019. An effi-
Applegate, D. L.; Bixby, R. E.; Chvátal, V.; and Cook, W. J. cient graph convolutional network technique for the travelling
2007. The Traveling Salesman Problem: a Computational salesman problem. arXiv preprint arXiv:1906.01227.
Study. Princeton Series in Applied Mathematics. Kim, M.; Park, J.; et al. 2021. Learning Collaborative Policies
Bello, I.; Pham, H.; Le, Q. V.; Norouzi, M.; and Bengio, S. to Solve NP-hard Routing Problems. Advances in Neural
2016. Neural combinatorial optimization with reinforcement Information Processing Systems, 34.
learning. arXiv preprint arXiv:1611.09940. Kitaev, N.; Kaiser, L.; and Levskaya, A. 2019. Reformer:
Bland, R. G.; and Shallcross, D. F. 1989. Large travelling The Efficient Transformer. In International Conference on
salesman problems arising from experiments in X-ray crystal- Learning Representations.
lography: A preliminary report on computation. Operations
Kool, W.; van Hoof, H.; Gromicho, J.; and Welling, M. 2022.
Research Letters, 8(3): 125–128.
Deep policy dynamic programming for vehicle routing prob-
Chen, X.; and Tian, Y. 2019. Learning to perform local lems. In International Conference on Integration of Con-
rewriting for combinatorial optimization. In Proceedings of straint Programming, Artificial Intelligence, and Operations
the 33rd International Conference on Neural Information Research, 190–213. Springer.
Processing Systems, 6281–6292.
Kool, W.; van Hoof, H.; and Welling, M. 2018. Attention,
Christofides, N. 1976. Worst-case analysis of a new heuris- Learn to Solve Routing Problems! In International Confer-
tic for the travelling salesman problem. Technical report, ence on Learning Representations.
Carnegie-Mellon Univ Pittsburgh Pa Management Sciences
Research Group. Kwon, Y. D.; Choo, J.; Kim, B.; Yoon, I.; Gwon, Y.; and Min,
S. 2020. POMO: Policy optimization with multiple optima
d O Costa, P. R.; Rhuggenaath, J.; Zhang, Y.; and Akcay, for reinforcement learning. Advances in Neural Information
A. 2020. Learning 2-opt Heuristics for the Traveling Sales- Processing Systems, 2020-December.
man Problem via Deep Reinforcement Learning. In Asian
Conference on Machine Learning, 465–480. PMLR. Kwon, Y.-D.; Choo, J.; Yoon, I.; Park, M.; Park, D.; and
Gwon, Y. 2021. Matrix encoding networks for neural com-
Dai, H.; Khalil, E. B.; Zhang, Y.; Dilkina, B.; and Song, L.
binatorial optimization. Advances in Neural Information
2017. Learning Combinatorial Optimization Algorithms over
Processing Systems, 34: 5138–5149.
Graphs. Advances in Neural Information Processing Systems,
30: 6348–6358. Liu, F.; and Zeng, G. 2009. Study of genetic algorithm with
reinforcement learning to solve the TSP. Expert Systems with
Fu, Z.-H.; Qiu, K.-B.; and Zha, H. 2021. Generalize a Small
Applications, 36(3): 6995–7001.
Pre-trained Model to Arbitrarily Large TSP Instances. Pro-
ceedings of the AAAI Conference on Artificial Intelligence, Ma, Y.; Li, J.; Cao, Z.; Song, W.; Zhang, L.; Chen, Z.; and
35(8): 7474–7482. Tang, J. 2021. Learning to iteratively solve routing prob-
Gambardella, L. M.; and Dorigo, M. 1995. Ant-Q: A rein- lems with dual-aspect collaborative transformer. Advances in
forcement learning approach to the traveling salesman prob- Neural Information Processing Systems, 34: 11096–11107.
lem. In Machine learning proceedings 1995, 252–260. Else- Madani, A.; Batta, R.; and Karwan, M. 2020. The balancing
vier. traveling salesman problem: application to warehouse order
Ghiani, G.; Guerriero, F.; Laporte, G.; and Musmanno, R. picking. Top.
2003. Real-time vehicle routing: Solution concepts, algo- Matai, R.; Singh, S.; and Lal, M. 2010. Traveling Salesman
rithms and parallel computing strategies. European Journal Problem: an Overview of Applications, Formulations, and
of Operational Research, 151(1): 1–11. Solution Approaches. Traveling Salesman Problem, Theory
Gomez, A. N.; Ren, M.; Urtasun, R.; and Grosse, R. B. 2017. and Applications.
The reversible residual network: Backpropagation without Nagata, Y. 2006. Fast EAX algorithm considering popula-
storing activations. In Proceedings of the 31st International tion diversity for traveling salesman problems. In European
Conference on Neural Information Processing Systems, 2211– Conference on Evolutionary Computation in Combinatorial
2221. Optimization, 171–182. Springer.
8139
Nazari, M.; Oroojlooy, A.; Takáč, M.; and Snyder, L. V. 2018.
Reinforcement learning for solving the vehicle routing prob-
lem. In Proceedings of the 32nd International Conference on
Neural Information Processing Systems, 9861–9871.
Nowak, A.; Villar, S.; Bandeira, A. S.; and Bruna, J. 2017.
A Note on Learning Algorithms for Quadratic Assign-
ment with Graph Neural Networks. ArXiv e-prints, 1706:
arXiv:1706.07450.
Padberg, M.; and Rinaldi, G. 1991. A branch-and-cut algo-
rithm for the resolution of large-scale symmetric traveling
salesman problems. SIAM review, 33(1): 60–100.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.;
Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.;
Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.;
Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.;
and Chintala, S. 2019. PyTorch: An Imperative Style, High-
Performance Deep Learning Library. In Advances in Neural
Information Processing Systems 32, 8024–8035. Curran As-
sociates, Inc.
Reinelt, G. 1991. TSPLIB—A traveling salesman problem
library. ORSA journal on computing, 3(4): 376–384.
Sun, R.; Tatsumi, S.; and Zhao, G. 2001. Multiagent re-
inforcement learning method with an improved ant colony
system. In 2001 IEEE International Conference on Systems,
Man and Cybernetics. e-Systems and e-Man for Cybernetics
in Cyberspace (Cat. No. 01CH37236), volume 3, 1612–1617.
IEEE.
Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour,
Y. 2000. Policy gradient methods for reinforcement learn-
ing with function approximation. In Advances in neural
information processing systems, 1057–1063.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.;
Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention
is all you need. In Advances in neural information processing
systems, 5998–6008.
Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer
Networks. In Cortes, C.; Lawrence, N.; Lee, D.; Sugiyama,
M.; and Garnett, R., eds., Advances in Neural Information
Processing Systems, volume 28. Curran Associates, Inc.
Williams, R. J. 1992. Simple Statistical Gradient-Following
Algorithms for Connectionist Reinforcement Learning. In
Reinforcement Learning, 5–32. Springer.
Xu, Z.; Li, Z.; Guan, Q.; Zhang, D.; Li, Q.; Nan, J.; Liu,
C.; Bian, W.; and Ye, J. 2018. Large-scale order dispatch in
on-demand ride-hailing platforms: A learning and planning
approach. Proceedings of the ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 905–
913.
Zheng, J.; He, K.; Zhou, J.; Jin, Y.; and Li, C.-M. 2021.
Combining Reinforcement Learning with Lin-Kernighan-
Helsgaun Algorithm for the Traveling Salesman Problem. In
Proceedings of the AAAI Conference on Artificial Intelligence,
12445–12452.
8140

25982-Article Text-30045-1-2-20230626

Uploaded by

Copyright:

Available Formats

25982-Article Text-30045-1-2-20230626

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

25982-Article Text-30045-1-2-20230626

Uploaded by

Copyright:

Available Formats

The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)

Pointerformer: Deep Reinforced Multi-Pointer Transformer

Abstract Due to both of its theoretical and practical importance,

Table 1: Comparison results on instances from TSP random.

Method TSPLIB1∼100 TSPLIB101∼500 TSP501∼1002

Table 2: Comparison results on practical instances from TSPLIB.

Algorithm Len Gap 40

Pointerformer 10.793 0.68%

w.o. multi-pointer network 10.797 0.72% 25

You might also like