Learning Sampling Dictionaries for Efficient and Generalizable Robot Motion Planning with Transformers


Jacob J. Johnson† , Ahmed H. Qureshi‡ , and Michael C. Yip†

Abstract— Motion planning is integral to robotics applications such as autonomous driving, surgical robots, and industrial manipulators. Existing planning methods lack scalability to higher-dimensional spaces, while recent learning-based planners have shown promise in accelerating sampling-based motion planners (SMP) but lack generalizability to out-of-distribution environments. To address this, we present a novel approach, Vector Quantized-Motion Planning Transformers (VQ-MPT), that overcomes the key generalization and scaling drawbacks of previous learning-based methods. VQ-MPT consists of two stages. Stage 1 is a Vector Quantized-Variational AutoEncoder model that learns to represent the planning space using a finite number of sampling distributions, and stage 2 is an Auto-Regressive model that constructs a sampling region for SMPs by selecting from the learned sampling distribution sets. By splitting large planning spaces into discrete sets and selectively choosing the sampling regions, our planner pairs well with out-of-the-box SMPs, generating near-optimal paths faster than without VQ-MPT's aid. It is generalizable in that it can be applied to systems of varying complexities, from 2D planar to 14D bi-manual robots, with diverse environment representations, including costmaps and point clouds. Trained VQ-MPT models generalize to environments unseen during training and achieve higher success rates than previous methods. Videos and code are available at https://sites.google.com/ucsd.edu/vq-mpt/home.
Fig. 1. VQ-MPT can efficiently split high-dimensional planning spaces into discrete sets of distributions. Each distribution is represented using a latent variable called a code or dictionary value. Given a planning problem, the model selects a subset of codes and samples from the associated distributions to construct the trajectory. By sampling efficiently, VQ-MPT reduces planning times by 2-6× compared to previous planners.

I. INTRODUCTION

Sampling-based motion planners use randomly sampled points to generate a tree-based collision-free path between start and goal locations [1], [2]. However, random sampling is inefficient [3] for goal-directed tasks, particularly when the search space spans a high number of dimensions. Since sampling-based motion planners (SMPs) are a fundamental component of numerous autonomous systems [4], [5], improving the efficiency and generalizability of the underlying planners enables these systems to handle more complex tasks that involve intricate sequences of planning, improves task execution, and reduces the need to retrain planners for different environments. While SMPs effectively generate a trajectory, they face several challenges in improving sampling efficiency. As the dimensionality of the configuration space increases, the "curse of dimensionality" makes sampling more difficult and time-consuming. Efficiently exploring high-dimensional spaces to find feasible paths is a significant challenge. These planners must also be able to reliably solve for different environments without the need for reconfiguring planner parameters. Most of these planners are probabilistically complete, i.e., the planner will find a path if a trajectory exists, given enough time. But finding a trajectory that is optimal, like the shortest path, is also a challenge. Numerous works have been proposed that address some of these challenges.

For efficient sampling, prior works have reduced the search spaces through hand-crafted heuristics or parametric functions, decreasing planning time. The current state-of-the-art motion planners leverage goal-directed heuristics; Informed-RRT∗ (IRRT∗) [6] and Batch Informed Trees (BIT∗) [7] search for a path in an ellipsoidal region between the start and goal location. In [8], [9], Artificial Potential Fields (APF) guide random samples toward regions with an optimal solution. Sampling-based A∗ [1] extends the A∗ search algorithm to sampling-based planning and uses heuristics to sample from selected vertices. But for higher-dimensional spaces, sampling with these heuristics still leaves many samples unused for constructing a trajectory.

† J. J. Johnson and M. C. Yip are with the Electrical and Computer Engineering Department at University of California San Diego, La Jolla, CA, USA {jjj025, yip}@eng.ucsd.edu
‡ A. H. Qureshi is with the Department of Computer Science at Purdue University, West Lafayette, IN, USA [email protected]
On the other hand, learning-based methods leverage data from prior planning runs to accelerate planning in similar environments [10], [11], [12], [13]. Motion Planning Networks (MPNet) [14] was the first neural planner to generate the full motion planning solution through a recurrent sampling of its networks, given the current and goal position of the robot as well as the environment representation. MPNet considerably reduces planning time for higher dimensions, but these models do not generalize to larger environment representations [15]. Other neural planners [16], [17] have also explored using neural networks for planning.

Transformer models are an ideal candidate for solving the planning problem because of their ability to make long-horizon connections [18]. Advances in large language models, such as BERT [19] and GPT [20], have inspired similar efforts in solving planning tasks using transformer models [21], [22]. These models make better control decisions in robotic quadrupedal walking tasks by attending to proprioceptive and visual sensor data [23]. Although these works support the possibility of using transformer models for decision-making, it is difficult to interpret the policy's future control actions and provide any form of guarantee for the underlying planner. Other works [15], [24] only solve for planar manipulators and 2D mobile robots because, inherently, their network models follow those used in image understanding in 2D discrete spaces. Since these models have to discretize the entire planning space, extending these methods to higher-dimensional, continuous planning spaces would exponentially increase training and memory costs. Furthermore, these planners require the space in which the path is constructed (planning space) to overlap with the space in which the environment is represented (task space). For example, for a 14-degree-of-freedom bi-manual robot arm setup, the environment is represented using point clouds, which live in R3, while the planning space is R14. How these methods apply to environments with disjoint planning and task spaces is unclear.

In this work, we propose VQ-MPT, a scalable transformer-based model that accelerates SMPs by narrowing the sampling space. VQ-MPT uses a Vector Quantized (VQ) model to discretize the planning space. VQ models are generative models with an encoder-decoder architecture similar to Variational AutoEncoder (VAE) models but with the latent dimension represented as a collection of learnable vectors referred to as dictionaries. A transformer model selects a subset of these learned vectors to generate the search region for the given planning problem. We describe in this paper how the VQ approach can be used in the context of motion planning, leading to the following major advantages:

1) Reduces planning times by 2-6× compared to traditional planning algorithms such as BIT∗ and by 3-6× compared to learned planners such as MPNet.
2) Scales to 14-dimensional planning spaces without compromising planning performance.
3) Learns efficient quantization of high-dimensional planning spaces without increasing the dictionary size.
4) Generalizes to unseen in-distribution and out-of-distribution environments more successfully than learned planners such as MPNet.

II. BACKGROUND

A. Problem Definition

Consider the planning space defined by X ⊆ Rn. We define a subspace Xfree ⊂ X such that all states in Xfree do not collide with any obstacle in the environment and are considered valid configurations. The objective of the motion planner is to generate a sequence of states Q = {q1, q2, . . . , qns} for a given start state (q1) and a goal region (Xgoal) such that qi ∈ Xfree, ∀i ∈ {1, 2, . . . , ns}, the edge connecting qi and qi+1 is also in Xfree, i.e., (1 − α)qi + αqi+1 ∈ Xfree, ∀α ∈ [0, 1], and qns ∈ Xgoal. The sequence of states is often referred to as a trajectory or path. In this work, we are interested in a novel learning-based approach to promote efficient sampling in X for generating a valid, optimized trajectory.
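The edge condition above is typically verified by discretizing α and collision-checking the interpolated states. Below is a minimal Python sketch of such a check; the function names and the number of checks are illustrative assumptions, not from the paper:

```python
import numpy as np

def edge_valid(q_a, q_b, state_valid, num_checks=20):
    """Check that (1 - alpha) * q_a + alpha * q_b lies in X_free for
    alpha in [0, 1], by testing interpolated states at fixed steps.
    `state_valid` is any single-state collision checker."""
    for alpha in np.linspace(0.0, 1.0, num_checks):
        if not state_valid((1.0 - alpha) * q_a + alpha * q_b):
            return False
    return True
```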
B. Vector Quantized Models

The VQ-VAE model has been shown to compress high-dimensional spaces such as images and audio without the posterior collapse observed in VAE models [25]. We utilize a VQ-VAE in a similar manner to compress the robot planning space X. The VQ model encodes an input q ∈ Rn using a function f to a latent space Z, which is quantized to a set of learned vectors ZQ = {ẑ1, ẑ2, . . . , ẑN}. The vectors in ZQ are often called codes or dictionary values in the literature. The function g decodes the closest vector in ZQ to f(q) back to the input space. The parameters of f and g and the set of vectors in ZQ are estimated using self-supervised learning by minimizing the following error:

$$\mathcal{L} = \mathcal{L}_{recon} + \|\mathrm{sg}[f(q)] - \hat{z}\| + \beta \|f(q) - \mathrm{sg}[\hat{z}]\| \quad (1)$$

where ẑ is the quantized vector and sg[·] stands for the stop-gradient operator [25], which has zero partial derivatives, i.e., ∇sg(x) = 0, preventing the operand from being updated during training. Lrecon is the main autoencoder reconstruction loss (we define ours later in Eqn. 3). The second term is used to update the latent vectors in ZQ while keeping the encoder output constant, and the last term, called the commitment loss, updates the encoder function while keeping the latent vectors constant. This prevents the output of the encoder from drifting away from the current set of latent vectors. Yu et al. [26] proposed two further improvements in representing the codes that improve the training stability, code usage, and reconstruction quality of VQ-VAE models for images.

1) Factorized Codes: The output from the encoder function is linearly projected to a lower-dimensional space. For example, if the encoder output is a 1024-d vector, it is projected to an 8-d vector. The authors in [26] show that using a lower-dimensional space improves code usage and reconstruction quality.

2) Normalized Codes: Each factorized code ẑi is l2-normalized, mapping all the dictionary values onto a hypersphere. This improves the training stability and reconstruction quality of the model.
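To make Eqn. 1 concrete, here is a minimal PyTorch sketch of the quantization step and the two sg[·] terms, using the straight-through gradient trick from [25]. This is an illustrative sketch with assumed shapes, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def quantize_and_loss(z_e, codebook, beta=0.25):
    """z_e: (B, d) encoder outputs f(q); codebook: (N, d) vectors in Z_Q."""
    dists = torch.cdist(z_e, codebook)        # (B, N) distances to all codes
    idx = dists.argmin(dim=1)                 # index of the nearest code
    z_q = codebook[idx]                       # quantized vectors
    # Second term of Eqn. 1: move codes toward (frozen) encoder outputs.
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    # Commitment term: keep the encoder output near its (frozen) code.
    commitment_loss = F.mse_loss(z_e, z_q.detach())
    # Straight-through estimator: gradients pass from z_q back to z_e.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, codebook_loss + beta * commitment_loss
```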

Fig. 2. An outline of the model architecture of VQ-MPT. Stage 1 (Left) is a Vector Quantizer that learns a set of latent dictionary values that can be mapped
to a distribution in the planning space. By encoding the planning space to discrete distributions, we can plan for high-dimensional robot systems. Stage 2
(Right) is the Auto-Regressive (AR) model that sequentially predicts the sampling regions for a given environment and a start and goal configuration. The
cross-attention model transduces the start and goal embeddings given the environment embedding generated using a feature extractor. The output from the
AR Transformer is mapped to a distribution in the planning space using the decoder model from Stage 1.

C. Transformer Models

Transformer models are transduction models that consist of self-attention [27] and fully connected layers. They have been shown to efficiently model sequence data for language and image tasks [18], [28], making them an ideal encoder model. The self-attention layer is a Scaled Dot-product Attention [18] that takes three matrices - query (Q ∈ Rns×dq), value (V ∈ Rns×dv), and key (K ∈ Rns×dq) - to generate the attention output:

$$\mathrm{Atten}(Q, K, V) = \mathrm{softmax}\left(\gamma^{-1} Q K^T\right) V \quad (2)$$

where ns is the sequence length, dq is the dimension of the query space, dv is the dimension of the key and value space, and γ = √dv is a scaling factor. Rather than applying a single attention function, these models linearly project the query, key, and value vectors multiple times using different learned weights; this is called the multi-headed attention model. It enables the model to attend to different features present in the data. The final output is a linear combination of individual attention values evaluated on each projected set. The pooled output is passed through deep residual multilayer perceptron (MLP) networks. In [29], the authors introduce the Prenorm-Transformer, where the inputs to the attention and MLP layers are normalized, as this makes training the model more stable.
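For reference, a single-head sketch of Eqn. 2 in PyTorch; shapes follow the definitions above, and the multi-head projections and Prenorm wrapper are omitted:

```python
import torch

def attention(Q, K, V):
    """Scaled dot-product attention (Eqn. 2).
    Q, K: (ns, dq); V: (ns, dv)."""
    gamma = V.shape[-1] ** 0.5                        # gamma = sqrt(dv)
    weights = torch.softmax(Q @ K.T / gamma, dim=-1)  # (ns, ns)
    return weights @ V                                # (ns, dv)
```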
III. VECTOR QUANTIZED-MOTION PLANNING TRANSFORMERS

The VQ pipelines in image generation [30], [26] consist of a quantization stage and a prediction stage. We adapt this pipeline for sequence generation and represent the planning space as a collection of distributions (Fig. 2). Below, we describe the two stages and the objectives used for training.

A. Stage 1: Vector Quantizer

The first stage learns to represent the planning space using a set of distributions. It does not take any sensor data such as costmaps or point clouds. We use a VQ model similar to VQ-VAE [25] with a transformer network as the encoder and propose a maximum likelihood-based reconstruction loss to learn the set of distributions. The encoder network takes in a trajectory, Q = {q1, q2, . . . , qns}, and outputs a set of latent vectors, Z = {z1, z2, . . . , zns}. The decoder model, an MLP, maps the quantized encoder output to a sequence of parameterized distributions, {P(· ; θ1), P(· ; θ2), . . . , P(· ; θns)}, in the planning space. We define our reconstruction loss as follows:

$$\mathcal{L}_{recon} = -\sum_{j=1}^{n_s} \log P(q_j; \theta_j) - \lambda \sum_{j=1}^{n_s} \mathbb{E}_{q \sim \mathcal{X}}\left[-\log P(q; \theta_j)\right] \quad (3)$$

where λ is a scaling constant. The first term maximizes the likelihood of observing the input trajectory, while the second term maximizes the differential entropy. The entropy term prevents the distribution from overfitting to each batch of data, because a small batch size does not cover the entire planning space. In the following paragraphs, we provide further details of our models.
The encoder model transforms each state in the trajectory into an efficient representation by learning patterns in the sequence. Each input state qj is linearly projected to a latent space Rd, and a fixed position embedding [18] is added to the projected output. The resulting vector is passed through multiple blocks of the Prenorm-Transformer described in Section II-C to obtain the set Z. Each latent vector zj ∈ Z is quantized to a vector from the set ZQ = {ẑ1, ẑ2, . . . , ẑN} using the function zq(·) defined by:

$$z_q(z) = \hat{z}_i \quad \text{where} \quad i = \underset{k \in \{1, \ldots, N\}}{\arg\min} \|z - \hat{z}_k\| \quad (4)$$

where ẑi is the quantized vector corresponding to qi. We prepend and append the transduced set with static encodings zs and zg to indicate the start and end of the sequence, respectively. Hence the robot trajectory Q is transduced to Ẑ = {zs, zq(z1), zq(z2), . . . , zq(zns), zg}.

The decoder model maps each quantized vector, zq(zi), to the parameterized distribution P(· ; θi). We choose the output distribution to be Gaussian, but any parametric distribution, such as Gaussian Mixture Models, Exponential distributions, or Uniform distributions, can be chosen. The decoder model outputs the mean and the covariance matrix of the Gaussian distribution (N(µ, Σ)); hence each is a function of the dictionary value zq(zj), ∀j ∈ {1, . . . , ns}, and they are represented by µ(zq(zj)) and Σ(zq(zj)), respectively. We will refer to these variables as µj and Σj for simplicity.

To ensure that the covariance matrix always remains positive definite during training, we decompose Σj using the Cholesky decomposition as in previous works [31], [32]:

$$\Sigma_j = L_j D_j L_j^T \quad (5)$$

where Lj is a lower triangular matrix with ones along the diagonal, and Dj is a diagonal matrix with positive values. The output from the penultimate MLP layer is passed through separate linear layers to obtain µj and Lj, while for Dj, it is passed through a linear and a soft-plus layer [33] to ensure its values are positive. Using the soft-plus layer improves the stability of training the model.
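A short PyTorch sketch of assembling Eqn. 5 from unconstrained network outputs, with the soft-plus positivity constraint; the variable names and shapes are assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def build_covariance(raw_L, raw_d):
    """Return a positive definite Sigma = L D L^T (Eqn. 5).
    raw_L: (n, n) unconstrained matrix; raw_d: (n,) unconstrained vector."""
    n = raw_d.shape[0]
    # Unit lower-triangular factor: strictly lower part plus ones on the diagonal.
    L = torch.tril(raw_L, diagonal=-1) + torch.eye(n)
    # Soft-plus keeps the diagonal of D strictly positive.
    D = torch.diag(F.softplus(raw_d))
    return L @ D @ L.T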
B. Stage 2: Auto-Regressive (AR) Prediction

The second stage generates sampling regions by predicting indexes from the dictionary set ZQ for a given planning problem and sensor data. It comprises two models - a cross-attention model to embed start and goal pairs and the environment embedding into latent vectors (M), and a Transformer-based Auto-Regressive (AR) model to predict the dictionary indexes, H = {h1, h2, . . . , hnh}. Both models are trained end-to-end by reducing the cross-entropy loss using trajectories from an RRT∗ planner:

$$\mathcal{L}_{CE} = \mathbb{E}\left[-\sum_{j=1}^{n_h}\sum_{i=1}^{N+1} \delta_i(h_j) \log\left(\pi(h_j = i \mid \hat{z}_{h_1}, \cdots, \hat{z}_{h_{j-1}}, M)\right)\right] \quad (6)$$

where δi(·) is the Kronecker delta function, π(·) is the output of the AR model, ẑhi corresponds to the latent dictionary vector associated with the ground truth index hi, and the expectation is over multiple trajectories. We provide more details of the models in the following section.

The environment representation (i.e., costmap or point cloud data) is passed through a feature extractor to construct the environment encodings E = {e1, e2, . . . , ene}, where ei ∈ Rd. The feature extractor reduces the dimensionality of the environment representation and captures local environment structures as latent variables, using convolutional layers for costmaps and set-abstraction layers for point clouds. The start and goal states (qs and qg) are projected to the start and goal embeddings (Es ∈ Rd and Eg ∈ Rd) using an MLP network. The cross-attention model is a Prenorm-Transformer model that uses the environment embedding, E, and the start and goal embeddings, {Es, Eg}, to generate latent vectors M. The cross-attention model learns a feature embedding that fuses the given start and goal pair with the given planning environment. It uses the vectors in E as key-value pairs, and Es and Eg as query vectors, to generate M.

We use an AR Transformer model, π(·), to predict the dictionary indexes H. A Transformer-based AR model was chosen because of its ability to make long-horizon connections. For each index hj, the model outputs a probability distribution over ZQ ∪ {zg} given the dictionary values of previous predictions {ẑh1, ẑh2, . . . , ẑhj−1} and the planning context M:

$$\pi(h_j = i \mid \hat{z}_{h_1}, \ldots, \hat{z}_{h_{j-1}}, M) = p_i \quad \text{where} \quad \sum_{i=1}^{N+1} p_i = 1 \quad (7)$$

Using the learned decoder from Stage 1, we can convert each of the predicted dictionary values, ẑhj, into a Gaussian distribution (N(µhj, Σhj)) in the planning space.

C. Generating Distributions for Sampling

With stage 1, we have efficiently split the planning space into a discrete set of distributions represented using a set of latent vectors, and with stage 2, we have provided a means to select a subset of distributions from the dictionary. Given a new planning problem, we use the trained Stage 2 models to generate a sequence of dictionary indexes H = {h1, . . . , hnh}. Since each index can take N values, we pick the sequence H that maximizes the following probability:

$$P(h_1, \ldots, h_{n_h} \mid M) = \prod_{i=1}^{n_h} \pi(h_i \mid h_1, \ldots, h_{i-1}, M) \quad (8)$$

where hnh is the goal index and π is the probability from Eqn. 7. We apply a beam-search algorithm to optimize Eqn. 8, as done before in language modeling tasks [19].

The decoder model from Stage 1 is used to generate a set of distributions, P, from the dictionary values, {ẑh1, ẑh2, . . . , ẑhnh−1}, corresponding to the predicted indexes {h1, h2, . . . , hnh−1}. We define this set as a Gaussian Mixture Model (GMM) with uniform mixing coefficients:

$$\mathcal{P}(q) = \frac{1}{n_h - 1}\sum_{i=1}^{n_h - 1} \mathcal{N}\left(\mu(\hat{z}_{h_i}), \Sigma(\hat{z}_{h_i})\right) \quad (9)$$

An example of this distribution is shown in Fig. 3 for a 2D robot.
Algorithm 1: VQMPTPlanner(qs, qg, P, K, b)
1  τ ← {qs};
2  for k ← 0 to K do
3      qrand ← SAMPLE(P);
4      qnear ← NEAREST(qrand, τ);
5      if CONNECT(qrand, qnear) then
6          τ ← τ ∪ {qrand};
7      end
8      if rand() > b then
9          qgn ← NEAREST(qg, τ);
10         if CONNECT(qgn, qg) then
11             τ ← τ ∪ {qg};
12             break;
13         end
14     end
15 end
16 SIMPLIFY(τ);
17 return τ;
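A Python rendering of Algorithm 1 under stated assumptions: `sample` draws one state from P (Eqn. 9) and `connect(a, b)` returns True when the edge between a and b is collision-free. All names are illustrative, and the SIMPLIFY step is omitted:

```python
import numpy as np

def vqmpt_planner(q_s, q_g, sample, connect, K=500, b=0.5, rng=None):
    """Sketch of Algorithm 1: grow a tree from samples of the learned
    distribution and periodically try to connect the tree to the goal."""
    rng = rng or np.random.default_rng()
    tree = [np.asarray(q_s)]
    for _ in range(K):
        q_rand = sample()
        # Nearest existing vertex by Euclidean distance.
        q_near = min(tree, key=lambda q: np.linalg.norm(q - q_rand))
        if connect(q_near, q_rand):
            tree.append(q_rand)
        if rng.random() > b:              # attempt a goal connection
            q_gn = min(tree, key=lambda q: np.linalg.norm(q - q_g))
            if connect(q_gn, q_g):
                tree.append(np.asarray(q_g))
                break
    return tree
```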

Fig. 3. A trajectory (black) planned using VQ-MPT for the 2D robot and the corresponding GMM used for sampling. Each ellipse represents the distribution encoded by the dictionary values. The shaded region represents the 2 standard deviation confidence interval. The dictionary values can encode the planning space using a finite number of vectors.

TABLE I
Model and environment parameters for each robot

Robot | Environment Representation | d   | Dictionary Keys | dk  | dv
2D    | Costmap                    | 512 | 1024            | 512 | 256
7D    | Point Cloud                | 512 | 2048            | 512 | 256
14D   | Point Cloud                | 512 | 2048            | 512 | 256

D. Planning

Any SMP can generate a trajectory by sampling from the distribution given in Eqn. 9. We use Algorithm 1 to generate a path using these samples. The VQMPTPlanner function takes the start and goal states (qs and qg), the number of samples to generate (K), and a threshold value (b) for sampling the goal state, and returns a valid trajectory. This function is a modified RRT algorithm where, instead of CONNECT extending the current node by a small range, it checks if a valid path exists between the current and sampled node.

IV. EXPERIMENTS

We evaluated our framework on three environments - a 2D point robot, a 7D Franka Panda Arm, and a 14D bimanual setup. Our experiments compare the use of VQ-MPT coupled with RRT (Algorithm 1) against traditional and learning-based planners on a diverse set of planning problems. All planners were implemented using the Open Motion Planning Library (OMPL) [34].

A. Setup

We trained a separate VQ-MPT model for each robot system and chose feature extractors based on environment representations. For costmaps, we used the Fully Convolutional Network (FCN) as in [15], while for point cloud data, we used two layers of set abstraction proposed in PointNet++ [35]. We chose these architectures because they are agnostic to the environment size and can generate latent embeddings for larger-sized costmaps or point clouds. The same transformer model architecture was used for the Stage 1 encoder, the cross-attention network, and the AR model. Each transformer model consisted of 3 attention layers with 3 attention heads each. Table I details the latent vector dimensions and the dictionary size used for each robot. A larger key size was used for the 7D and 14D robots because of the larger planning space. We observed that increasing the dictionary size further did not reduce the reconstruction loss.

All models were trained using data collected from simulation. We collected two sets of trajectories.

1) Trajectories without obstacles: This set consisted of trajectories in an environment without obstacles and was used to train Stage 1 of the model. These trajectories were free from any form of self-collision and covered the whole planning space of the planner. For each robot, we collected 2000 trajectories of this type.

2) Trajectories with obstacles: This set consisted of valid trajectories collected from environments where obstacles were placed randomly in the scene. It was used to train Stage 2 of the model. For each robot, we collected 10 trajectories for each of 2000 randomly generated environments.

We trained Stages 1 and 2 using the Adam optimizer [36] with β1 = 0.9, β2 = 0.98, and ϵ = 10−9, and a scheduled learning rate from [18].
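The corresponding optimizer setup, sketched in PyTorch with the warmup-based schedule of [18]; the warmup length and the model below are placeholders, since the paper does not specify them:

```python
import torch

def noam_lr(step, d_model=512, warmup=4000):
    """Learning-rate schedule from [18]: linear warmup, then decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

model = torch.nn.Linear(512, 512)  # placeholder for the VQ-MPT networks
opt = torch.optim.Adam(model.parameters(), lr=1.0,
                       betas=(0.9, 0.98), eps=1e-9)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=noam_lr)
```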
B. Results - Unseen In-Distribution Environments

We compared our framework against traditional and learning-based SMP algorithms for each robot system on trajectories from 500 different environments. To quantify planning performance, we measured three metrics: planning time - the time it takes for the planner to generate a valid trajectory; vertices - the number of collision-free vertices required to find the trajectory; and accuracy - the percentage of planning problems solved before a given cutoff time. We chose to measure vertices because checking the validity of a vertex imposes a significant cost on most SMPs [37].
Fig. 4. Plots of planning time and percentage of paths successfully planned on in-distribution environments for the 2D (Left), 7D (Center), and 14D
(Right) robots. VQ-MPT can solve problems faster than other SMP planners by reducing the planning space and scales to higher dimensional problems.

TABLE II
Comparing accuracy and mean planning time and vertices in in-distribution environments

Robot | Metric     | RRT∗   | RRT∗ (50%) | IRRT∗  | IRRT∗ (50%) | BIT*   | BIT* (50%) | MPNet  | VQ-MPT
2D    | Accuracy   | 94.8%  | ·          | 97.4%  | ·           | 96.0%  | ·          | 92.35% | 97.6%
2D    | Time (sec) | 1.588  | ·          | 0.244  | ·           | 0.297  | ·          | 0.296  | 0.147
2D    | Vertices   | 1195   | ·          | 195    | ·           | 457    | ·          | 63     | 306
7D    | Accuracy   | 52.80% | 95.20%     | 89.0%  | 94.80%      | 72.20% | 97.40%     | 94.2%  | 97.4%
7D    | Time (sec) | 49.35  | 10.51      | 54     | 15.03       | 7.58   | 5.26       | 5.18   | 0.929
7D    | Vertices   | 683    | 149        | 63     | 71          | 826    | 640        | 147    | 45
14D   | Accuracy   | 11.80% | 32.00%     | 21.80% | 40.40%      | 30.80% | 43.40%     | 92.20% | 99.20%
14D   | Time (sec) | 1.80   | 15.03      | 52.84  | 29.16       | 9.56   | 39.09      | 17.46  | 2.62
14D   | Vertices   | 9      | 94         | 45     | 77          | 384    | 2021       | 117    | 18

Fig. 5. Sample paths planned by the VQ-MPT planner for different robot systems - (Left) 2D robot, (Center) 7D robot, and (Right) 14D robot - on in-distribution environments. The red and green colors represent the start and goal states of the robot, respectively. Given an environment with crowded obstacles, VQ-MPT can sample efficiently from learned distributions to find a trajectory.

Fig. 6. Snapshots of a trajectory planned using VQ-MPT on the physical Panda robot arm for a given start and goal pose in a shelf environment. On the top-right of each image, we show the point cloud data captured using Azure Kinect cameras. We used markerless camera-to-robot pose estimation to localize the captured point cloud in the robot's reference frame. VQ-MPT can generalize to real-world sensor data without additional training or fine-tuning.

Since optimal planners do not have termination conditions, for fair comparisons, we stopped planning when the constructed trajectory, {q1, q2, . . . , qn}, satisfied the following condition:

$$\sum_{i=0}^{n-1} \|q_{i+1} - q_i\|_2 \le (1 + \epsilon) \sum_{j=0}^{m-1} \|q^*_{j+1} - q^*_j\|_2 \quad (10)$$

where Q∗ = {q∗1, . . . , q∗m} is the path planned by VQ-MPT and ϵ ≥ 0 is a user-defined threshold. If VQ-MPT could not generate a path for the trajectory, we used a path from RRT∗ running for 300 seconds (s) to generate Q∗. For optimal planners like RRT∗, IRRT∗, and BIT∗, we used ϵ = 0.1 and ϵ = 0.5. In our tables, planners that used ϵ = 0.5 are reported as 'X (50%)', where X is the planner. The planning time reported for VQ-MPT also includes the time taken for model inference. All results are summarized in Table II, and the percentage of planning problems solved vs. planning time is shown in Fig. 4.

We first tested our framework on a simple 2D robot. An example of the path planned by the VQ-MPT framework is shown in Fig. 5 (Left). The cutoff time was set to 20 seconds.
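The stopping condition in Eqn. 10 amounts to comparing accumulated path lengths; a small sketch with illustrative names:

```python
import numpy as np

def path_length(path):
    """Sum of Euclidean edge lengths along a path of states."""
    return sum(np.linalg.norm(b - a) for a, b in zip(path[:-1], path[1:]))

def within_tolerance(path, ref_path, eps=0.1):
    """Eqn. 10: path is at most (1 + eps) times longer than Q*."""
    return path_length(path) <= (1.0 + eps) * path_length(ref_path)
```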
TABLE III
Comparing accuracy and mean planning time and vertices in out-of-distribution environments

Robot     | Metric     | RRT∗   | RRT∗ (50%) | IRRT∗  | IRRT∗ (50%) | BIT*   | BIT* (50%) | RRT    | MPNet  | VQ-MPT
7D        | Accuracy   | 8.60%  | 66.60%     | 44.60% | 59.20%      | 37.80% | 88.60%     | 84.20% | 53.20% | 92.20%
7D        | Time (sec) | 107.75 | 22.75      | 55.12  | 23.94       | 75.32  | 11.86      | 8.88   | 10.14  | 3.24
7D        | Vertices   | 1338   | 279        | 215    | 72          | 5147   | 896        | 477    | 310    | 306
14D       | Accuracy   | 6.00%  | 18.60%     | 10.60% | 17.80%      | 12.20% | 30.00%     | 75.00% | 80.40% | 98.60%
14D       | Time (sec) | 4.92   | 7.61       | 20.72  | 10.57       | 30.07  | 40.58      | 19.75  | 23.91  | 6.21
14D       | Vertices   | 39     | 67         | 20     | 34          | 1673   | 2889       | 179    | 104    | 70
7D (Real) | Accuracy   | ·      | ·          | 100%   | ·           | 100%   | ·          | 100%   | 30%    | 100%
7D (Real) | Time (sec) | ·      | ·          | 30.68  | ·           | 26.42  | ·          | 1.69   | 2.23   | 1.17
7D (Real) | Vertices   | ·      | ·          | 607    | ·           | 2852   | ·          | 21     | 7      | 34

Fig. 7. Plots of planning time and percentage of paths successfully planned for the 7D (Left) and 14D (Right) robots on environments different from
ones used for training. VQ-MPT can reduce the planning space in unseen environments, enabling efficient planning in challenging environments.

VQ-MPT showed efficient sampling of points in the planning space and found trajectories faster than traditional planners.

VQ-MPT can also use 3D environment representations such as point clouds to generate sampling regions. We evaluated the framework on a 7D Panda robot arm with a point cloud environment representation. The dictionary encodings can capture diverse sets of valid configurations in 7D space (Fig. 2). An example of a trajectory planned by the VQ-MPT framework is shown in Fig. 5 (Center). The cutoff time was set to 100 s. The VQ-MPT planner generates a trajectory nearly 5× faster, with fewer vertices, than the next most accurate planner. MPNet performs poorly compared to VQ-MPT. The rigid feature encoding of MPNet potentially prevents it from generalizing to larger point cloud environments. VQ-MPT, in contrast, learns to identify suitable regions to sample in the joint space using point cloud data of different sizes.

We also tested the framework on a bi-manual Panda arm setup with 14D. An example of a VQ-MPT trajectory is shown in Fig. 5 (Right). Stage 1 captures the planning space with the same 2048 dictionary values used in the 7D Panda experiment. The cutoff time was 250 s. While BIT* performed relatively well compared to traditional planners for the 2D and 7D problems, its performance and accuracy decreased due to the high-dimensional planning space. Since Stage 1 of the VQ-MPT framework encodes self-collision-free regions, it is easier for the planner to generate feasible trajectories in Stage 2, resulting in faster trajectory generation with fewer vertices.

C. Results - Out-of-Distribution Environments

Our next set of experiments evaluated VQ-MPT's performance for the 7D and 14D robots in environments very different from the training environments. We tested our framework on different planning scenes resembling real-world scenarios (Fig. 1). We tested the model for each robot on 500 and 10 start and goal locations for simulation and real-world environments, respectively. The cutoff time for each planner was set at 100 s. The results of the experiments are summarized in Table III, and the plot of the percentage of paths solved across planning time is given in Fig. 7. Higher-dimensional 7D and 14D spaces are challenging, and the environment is even more challenging because of the goal location inside the shelf, since it reduces the number of feasible trajectories in the same way a narrow passage eliminates feasible trajectories for mobile robots [38]. Even non-optimal planners like RRT solve only 75-91% of trajectories. Existing optimal SMP planners cannot achieve the same accuracy as VQ-MPT, even after relaxing path-length constraints.

To evaluate the performance of VQ-MPT on physical sensor data, we tested a trained model in a real-world environment (Fig. 6). The environment was represented using point cloud data from Azure Kinect sensors, and collision checking was done using the octomap collision checker from MoveIt.1 The camera-to-robot base transform was estimated using a markerless pose estimation technique [39]. Our results show that the model can plan trajectories faster than RRT with the same accuracy. We observed that VQ-MPT trajectories are also shorter than RRT trajectories, which can be clearly seen in some of the attached videos. This experiment shows that VQ-MPT models can also generalize well to physical sensor data without further training or fine-tuning.

1 https://moveit.ros.org/
Such generalization will benefit the larger robotics community, since other researchers can use trained models in diverse settings without collecting new data or fine-tuning the model.

V. CONCLUSION

VQ-MPT can plan near-optimal paths in a fraction of the time required by traditional planners, scales to higher-dimensional planning spaces, and achieves better generalizability than previous learning-based planners. Our approach will be beneficial for planning for multi-arm robot systems like the ABB YuMi and Intuitive's da Vinci® Surgical System. It is also helpful for applications where generating nodes and edges for SMPs is computationally expensive, such as constrained motion planning [40]. Future work will extend VQ-MPT to these applications.

REFERENCES

[1] S. M. LaValle and J. J. Kuffner, "Randomized kinodynamic planning," The International Journal of Robotics Research, 2001.
[2] L. Kavraki, P. Svestka, J.-C. Latombe, and M. Overmars, "Probabilistic roadmaps for path planning in high-dimensional configuration spaces," IEEE Trans. on Robotics and Auto., 1996.
[3] D. Hsu, T. Jiang, J. Reif, and Z. Sun, "The bridge test for sampling narrow passages with probabilistic roadmap planners," in IEEE Int. Conf. on Robotics and Auto., 2003.
[4] Z.-Y. Chiu, F. Richter, E. K. Funk, R. K. Orosco, and M. C. Yip, "Bimanual regrasping for suture needles using reinforcement learning for rapid motion planning," in IEEE Int. Conf. on Robotics and Auto., 2021.
[5] R. Alterovitz, K. Goldberg, and A. Okamura, "Planning for steerable bevel-tip needle insertion through 2d soft tissue with obstacles," in Proceedings of the IEEE Int. Conf. on Robotics and Auto., 2005.
[6] J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, "Informed RRT*: Optimal sampling-based path planning focused via direct sampling of an admissible ellipsoidal heuristic," in Int. Conf. on Intelligent Robots and Systems, 2014.
[7] ——, "Batch informed trees (BIT*): Sampling-based optimal planning via the heuristically guided search of implicit random geometric graphs," in IEEE Int. Conf. on Robotics and Auto., 2015.
[8] A. H. Qureshi and Y. Ayaz, "Potential functions based sampling heuristic for optimal path planning," Autonomous Robots, 2016.
[9] Z. Tahir, A. H. Qureshi, Y. Ayaz, and R. Nawaz, "Potentially guided bidirectionalized rrt* for fast optimal path planning in cluttered environments," Robotics and Autonomous Systems, 2018.
[10] P. Lehner and A. Albu-Schäffer, "The repetition roadmap for repetitive constrained motion planning," IEEE Robot. and Autom. Letters, 2018.
[11] C. Chamzas, Z. Kingston, C. Quintero-Peña, A. Shrivastava, and L. E. Kavraki, "Learning sampling distributions using local 3d workspace decompositions for motion planning in high dimensions," in IEEE Int. Conf. on Robotics and Auto., 2021.
[12] B. Ichter and M. Pavone, "Robot motion planning in learned latent spaces," IEEE Robotics and Auto. Letters, 2019.
[13] R. Kumar, A. Mandalika, S. Choudhury, and S. Srinivasa, "Lego: Leveraging experience in roadmap generation for sampling-based planning," in Int. Conf. on Intelligent Robots and Systems, 2019.
[14] A. H. Qureshi, Y. Miao, A. Simeonov, and M. C. Yip, "Motion planning networks: Bridging the gap between learning-based and classical motion planners," IEEE Trans. on Robotics, 2020.
[15] J. J. Johnson, U. S. Kalra, A. Bhatia, L. Li, A. H. Qureshi, and M. C. Yip, "Motion planning transformers: A motion planning framework for mobile robots," 2021.
[16] B. Chen, B. Dai, Q. Lin, G. Ye, H. Liu, and L. Song, "Learning to plan in high dimensions via neural exploration-exploitation trees," in Int. Conf. on Learning Representations, 2020.
[17] C. Yu and S. Gao, "Reducing collision checking for sampling-based motion planning using graph neural networks," in Advances in Neural Information Processing Systems, 2021.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017.
[19] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
[20] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," in Advances in Neural Information Processing Systems, 2020.
[21] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, "Decision transformer: Reinforcement learning via sequence modeling," in Advances in Neural Information Processing Systems, 2021.
[22] M. Janner, Q. Li, and S. Levine, "Offline reinforcement learning as one big sequence modeling problem," in Advances in Neural Information Processing Systems, 2021.
[23] R. Yang, M. Zhang, N. Hansen, H. Xu, and X. Wang, "Learning vision-guided quadrupedal locomotion end-to-end with cross-modal transformers," in Int. Conf. on Learning Representations, 2022.
[24] D. S. Chaplot, D. Pathak, and J. Malik, "Differentiable spatial planning using transformers," in Int. Conf. on Machine Learning, 2021.
[25] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," in Advances in Neural Information Processing Systems, 2017.
[26] J. Yu, X. Li, J. Y. Koh, H. Zhang, R. Pang, J. Qin, A. Ku, Y. Xu, J. Baldridge, and Y. Wu, "Vector-quantized image modeling with improved VQGAN," in Int. Conf. on Learning Representations, 2022.
[27] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A structured self-attentive sentence embedding," in Int. Conf. on Learning Representations, 2017.
[28] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in Int. Conf. on Learning Representations, 2021.
[29] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu, "On layer normalization in the transformer architecture," in Int. Conf. on Machine Learning, 2020.
[30] A. Razavi, A. van den Oord, and O. Vinyals, "Generating diverse high-fidelity images with vq-vae-2," in Advances in Neural Information Processing Systems, 2019.
[31] H. Hu and G. Kantor, "Parametric covariance prediction for heteroscedastic noise," in Int. Conf. on Intelligent Robots and Systems (IROS), 2015.
[32] K. Liu, K. Ok, W. Vega-Brown, and N. Roy, "Deep inference for covariance estimation: Learning gaussian noise models for state estimation," in IEEE Int. Conf. on Robotics and Auto. (ICRA), 2018, pp. 1436–1443.
[33] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia, "Incorporating second-order functional knowledge for better option pricing," in Advances in Neural Information Processing Systems, 2000.
[34] I. A. Şucan, M. Moll, and L. E. Kavraki, "The Open Motion Planning Library," IEEE Robotics & Auto. Magazine, 2012.
[35] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "Pointnet++: Deep hierarchical feature learning on point sets in a metric space," in Advances in Neural Information Processing Systems, 2017.
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Int. Conf. on Learning Representations, 2015.
[37] N. Das and M. Yip, "Learning-based proxy collision detection for robot motion planning applications," IEEE Trans. on Robotics, 2020.
[38] J. Borenstein and Y. Koren, "The vector field histogram - fast obstacle avoidance for mobile robots," IEEE Trans. on Robotics and Auto., 1991.
[39] J. Lu, F. Richter, and M. C. Yip, "Markerless camera-to-robot pose estimation via self-supervised sim-to-real transfer," 2023.
[40] J. J. Johnson and M. C. Yip, "Chance-constrained motion planning using modeled distance-to-collision functions," in Int. Conf. on Auto. Science and Engineering (CASE), 2021.
