Learning Sampling Dictionaries For Efficient and Generalizable Robot Motion Planning With Transformers
Fig. 2. An outline of the model architecture of VQ-MPT. Stage 1 (Left) is a Vector Quantizer that learns a set of latent dictionary values that can be mapped
to a distribution in the planning space. By encoding the planning space to discrete distributions, we can plan for high-dimensional robot systems. Stage 2
(Right) is the Auto-Regressive (AR) model that sequentially predicts the sampling regions for a given environment and a start and goal configuration. The
cross-attention model transduces the start and goal embeddings given the environment embedding generated using a feature extractor. The output from the
AR Transformer is mapped to a distribution in the planning space using the decoder model from Stage 1.
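The AR Transformer described in the caption emits dictionary indexes one step at a time, so decoding reduces to searching over index sequences. The following is a minimal, generic beam-search sketch over such indexes; `log_prob_fn`, `goal_index`, and all other names are illustrative stand-ins for the trained model, not the paper's implementation:

```python
import numpy as np

def beam_search(log_prob_fn, goal_index, beam_width=5, max_len=10):
    """Generic beam search over discrete dictionary indexes.

    log_prob_fn(prefix) -> 1-D array of log-probabilities over the
    possible next indexes; decoding stops when the special goal
    index is emitted.
    """
    beams = [((), 0.0)]                     # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            logp = log_prob_fn(prefix)
            # keep only the beam_width best continuations of this prefix
            for idx in np.argsort(logp)[-beam_width:]:
                candidates.append((prefix + (int(idx),), score + logp[idx]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            # sequences that reached the goal index are complete
            (finished if prefix[-1] == goal_index else beams).append((prefix, score))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[1])[0]  # highest-probability sequence
```

A toy model whose first step favors index 0 and whose later steps favor the goal index 2 yields the sequence `(0, 2)` under this search.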
where $L_j$ is a lower-triangular matrix with ones along the diagonal, and $D_j$ is a diagonal matrix with positive values. The output from the penultimate MLP layer is passed through separate linear layers to obtain $\mu_j$ and $L_j$, while for $D_j$ it is passed through a linear and soft-plus layer [33] to ensure the values are positive. Using the soft-plus layer improves the stability of training the model.

B. Stage 2: Auto-Regressive (AR) Prediction

The second stage generates sampling regions by predicting indexes from the dictionary set $Z_Q$ for a given planning problem and sensor data. It comprises two models: a cross-attention model to embed the start and goal pairs and the environment embedding into latent vectors ($M$), and a Transformer-based Auto-Regressive (AR) model to predict the dictionary indexes, $H = \{h_1, h_2, \ldots, h_{n_h}\}$. Both models are trained end-to-end by minimizing the cross-entropy loss using trajectories from an RRT* planner:

$$L_{CE} = \mathbb{E}\left[-\sum_{j=1}^{n_h}\sum_{i=1}^{N+1} \delta_i(h_j)\,\log\left(\pi(h_j = i \mid \hat{z}_{h_1}, \cdots, \hat{z}_{h_{j-1}}, M)\right)\right] \quad (6)$$

where $\delta_i(\cdot)$ is the Kronecker delta function, $\pi(\cdot)$ is the output of the AR model, $\hat{z}_{h_i}$ corresponds to the latent dictionary vector associated with the ground-truth index $h_i$, and the expectation is over multiple trajectories. We provide more details of the models in the following section.

C. Generating Distributions for Sampling

With Stage 1, we have efficiently split the planning space into a discrete set of distributions represented using a set of latent vectors, and with Stage 2, we have provided a means to select a subset of distributions from the dictionary. Given a new planning problem, we use the trained Stage 2 models to generate a sequence of dictionary indexes $H = \{h_1, \ldots, h_{n_h}\}$. Since each index can take $N$ values, we pick the sequence $H$ that maximizes the following probability:

$$P(h_1, \ldots, h_{n_h} \mid M) = \prod_{i=1}^{n_h} \pi(h_i \mid h_1, \ldots, h_{i-1}, M) \quad (8)$$

where $h_{n_h}$ is the goal index and $\pi$ is the probability from Eqn. 7. We apply a beam-search algorithm to optimize Eqn. 8, as done previously in language-modeling tasks [19].

The decoder model from Stage 1 is used to generate a set of distributions, $\mathcal{P}$, from the dictionary values $\{\hat{z}_{h_1}, \hat{z}_{h_2}, \ldots, \hat{z}_{h_{n_h - 1}}\}$ corresponding to the predicted indexes $\{h_1, h_2, \ldots, h_{n_h - 1}\}$. We define this set as a Gaussian Mixture Model (GMM) with uniform mixing coefficients:

$$\mathcal{P}(q) = \frac{1}{n_h - 1}\sum_{i=1}^{n_h - 1} \mathcal{N}\left(\mu(\hat{z}_{h_i}), \Sigma(\hat{z}_{h_i})\right) \quad (9)$$

An example of this distribution is in Fig. 3 for a 2D robot.
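The pieces above fit together in a short sketch: each decoded latent yields a mean and a covariance, the covariance is assembled from a unit lower-triangular $L$ and a diagonal $D$ whose entries a soft-plus keeps positive, and configurations are drawn from the uniform-mixture GMM of Eqn. 9. This is a minimal illustration under the assumption that the covariance is assembled as $\Sigma = L D L^\top$; function and parameter names are hypothetical:

```python
import numpy as np

def softplus(x):
    # soft-plus keeps the diagonal entries of D strictly positive,
    # which stabilizes training
    return np.log1p(np.exp(x))

def covariance_from_params(l_params, d_params, dim):
    """Assemble Sigma = L D L^T from raw decoder outputs.

    l_params: entries strictly below the diagonal of L (unit lower-triangular)
    d_params: raw values mapped through soft-plus onto the diagonal of D
    """
    L = np.eye(dim)
    L[np.tril_indices(dim, k=-1)] = l_params
    D = np.diag(softplus(d_params))
    return L @ D @ L.T

def sample_gmm(mus, sigmas, rng):
    """Draw one configuration from the uniform-weight GMM of Eqn. 9."""
    i = rng.integers(len(mus))          # uniform mixing coefficients
    return rng.multivariate_normal(mus[i], sigmas[i])
```

Because $L$ has a unit diagonal and $D$ is strictly positive, the assembled $\Sigma$ is symmetric positive-definite by construction, so every mixture component is a valid Gaussian.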
Algorithm 1: VQMPTPlanner(qs, qg, P, K, b)
  τ ← {qs}
  for k ← 0 to K do
    qrand ← SAMPLE(P)
    qnear ← NEAREST(qrand, τ)
    if CONNECT(qrand, qnear) then
      τ ← τ ∪ {qrand}
    end
    if rand() > b then
      qgn ← NEAREST(qg, τ)
      if CONNECT(qgn, qg) then
        τ ← τ ∪ {qg}
        break
      end
    end
  end
  SIMPLIFY(τ)
  return τ
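Algorithm 1 can be sketched in Python as follows. `sample`, `nearest`, and `connect` stand in for SAMPLE, NEAREST, and CONNECT and are assumed to be problem-specific callables supplied by the caller (the GMM sampler, a nearest-vertex query, and a collision-checked local planner); this is an illustration, not the paper's implementation:

```python
import numpy as np

def vq_mpt_planner(q_s, q_g, sample, nearest, connect, K=500, b=0.5, rng=None):
    """Sketch of Algorithm 1: grow a tree from q_s using samples drawn
    from the learned GMM P, periodically attempting to connect the goal."""
    if rng is None:
        rng = np.random.default_rng()
    tau = [q_s]                                 # tree rooted at the start
    for _ in range(K):
        q_rand = sample()
        q_near = nearest(q_rand, tau)
        if connect(q_rand, q_near):             # grow the tree if reachable
            tau.append(q_rand)
        if rng.random() > b:                    # occasionally try the goal
            q_gn = nearest(q_g, tau)
            if connect(q_gn, q_g):
                tau.append(q_g)
                break
    # SIMPLIFY(tau) would shortcut redundant vertices before returning
    return tau
```

With an obstacle-free `connect` that always succeeds and `b = 0`, the goal is connected on the first iteration and the loop exits early.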
TABLE II
COMPARING ACCURACY, MEAN PLANNING TIME, AND VERTICES IN IN-DISTRIBUTION ENVIRONMENTS

Robot | Metric     | RRT*   | RRT* (50%) | IRRT*  | IRRT* (50%) | BIT*   | BIT* (50%) | MPNet  | VQ-MPT
------|------------|--------|------------|--------|-------------|--------|------------|--------|-------
2D    | Accuracy   | 94.8%  | ·          | 97.4%  | ·           | 96.0%  | ·          | 92.35% | 97.6%
2D    | Time (sec) | 1.588  | ·          | 0.244  | ·           | 0.297  | ·          | 0.296  | 0.147
2D    | Vertices   | 1195   | ·          | 195    | ·           | 457    | ·          | 63     | 306
7D    | Accuracy   | 52.80% | 95.20%     | 89.0%  | 94.80%      | 72.20% | 97.40%     | 94.2%  | 97.4%
7D    | Time (sec) | 49.35  | 10.51      | 54     | 15.03       | 7.58   | 5.26       | 5.18   | 0.929
7D    | Vertices   | 683    | 149        | 63     | 71          | 826    | 640        | 147    | 45
14D   | Accuracy   | 11.80% | 32.00%     | 21.80% | 40.40%      | 30.80% | 43.40%     | 92.20% | 99.20%
14D   | Time (sec) | 1.80   | 15.03      | 52.84  | 29.16       | 9.56   | 39.09      | 17.46  | 2.62
14D   | Vertices   | 9      | 94         | 45     | 77          | 384    | 2021       | 117    | 18
Fig. 5. Sample paths planned by the VQ-MPT planner for different robot systems (Left) 2D robot, (Center) 7D robot, and (Right) 14D robot in in-distribution environments. Red and green represent the start and goal states of the robot, respectively. Even in environments with crowded obstacles, VQ-MPT samples efficiently from the learned distributions to find a trajectory.
Fig. 6. Snapshots of a trajectory planned using VQ-MPT on a physical Panda robot arm for a given start and goal pose in a shelf environment. At the top-right of each image, we show the point cloud data captured using Azure Kinect cameras. We used markerless camera-to-robot pose estimation to localize the captured point cloud in the robot's reference frame. VQ-MPT generalizes to real-world sensor data without additional training or fine-tuning.
trajectory, $\{q_1, q_2, \ldots, q_n\}$, satisfied the following condition:

$$\sum_{i=0}^{n-1} \|q_{i+1} - q_i\|_2 \;\leq\; (1 + \epsilon)\sum_{j=0}^{m-1} \|q^*_{j+1} - q^*_j\|_2 \quad (10)$$

where $Q^* = \{q^*_1, \ldots, q^*_n\}$ is the path planned by VQ-MPT and $\epsilon \geq 0$ is a user-defined threshold. If VQ-MPT could not generate a path for the trajectory, we used a path from RRT* running for 300 seconds (s) to generate $Q^*$. For optimal planners like RRT*, IRRT*, and BIT*, we used $\epsilon = 0.1$ and $\epsilon = 0.5$. In our tables, planners that used $\epsilon = 0.5$ are reported as 'X (50%)', where X is the planner. The planning time reported for VQ-MPT also includes the time taken for model inference. All results are summarized in Table II, and the percentage of planning problems solved vs. planning time is shown in Figure 4.

We first tested our framework on a simple 2D robot. An example of the path planned by the VQ-MPT framework is shown in Fig. 5 (Left). The cutoff time set was 20 seconds. VQ-MPT showed efficient sampling of points in the planning
TABLE III
COMPARING ACCURACY, MEAN PLANNING TIME, AND VERTICES IN OUT-OF-DISTRIBUTION ENVIRONMENTS

Robot     | Metric     | RRT*   | RRT* (50%) | IRRT*  | IRRT* (50%) | BIT*   | BIT* (50%) | RRT    | MPNet  | VQ-MPT
----------|------------|--------|------------|--------|-------------|--------|------------|--------|--------|-------
7D        | Accuracy   | 8.60%  | 66.60%     | 44.60% | 59.20%      | 37.80% | 88.60%     | 84.20% | 53.20% | 92.20%
7D        | Time (sec) | 107.75 | 22.75      | 55.12  | 23.94       | 75.32  | 11.86      | 8.88   | 10.14  | 3.24
7D        | Vertices   | 1338   | 279        | 215    | 72          | 5147   | 896        | 477    | 310    | 306
14D       | Accuracy   | 6.00%  | 18.60%     | 10.60% | 17.80%      | 12.20% | 30.00%     | 75.00% | 80.40% | 98.60%
14D       | Time (sec) | 4.92   | 7.61       | 20.72  | 10.57       | 30.07  | 40.58      | 19.75  | 23.91  | 6.21
14D       | Vertices   | 39     | 67         | 20     | 34          | 1673   | 2889       | 179    | 104    | 70
7D (Real) | Accuracy   | ·      | ·          | 100%   | ·           | 100%   | ·          | 100%   | 30%    | 100%
7D (Real) | Time (sec) | ·      | ·          | 30.68  | ·           | 26.42  | ·          | 1.69   | 2.23   | 1.17
7D (Real) | Vertices   | ·      | ·          | 607    | ·           | 2852   | ·          | 21     | 7      | 34
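The success criterion of Eqn. 10 behind the accuracy figures in Tables II and III is simple to state in code: a planner's path counts as accurate when its length is within a factor $(1 + \epsilon)$ of the reference path $Q^*$. A minimal helper, with illustrative names:

```python
import numpy as np

def path_length(path):
    # total Euclidean arc length of a piecewise-linear path
    return float(sum(np.linalg.norm(b - a) for a, b in zip(path, path[1:])))

def is_accurate(path, reference, eps=0.1):
    """Eqn. 10: accept `path` if it is at most (1 + eps) times as long
    as the reference path Q* (from VQ-MPT, or RRT* run for 300 s)."""
    return path_length(path) <= (1.0 + eps) * path_length(reference)
```

For example, with eps = 0.1 a straight unit-length path passes against a unit-length reference, while a detour of length ~1.414 fails.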
Fig. 7. Plots of planning time and percentage of paths successfully planned for the 7D (Left) and 14D (Right) robots in environments different from the ones used for training. VQ-MPT can reduce the planning space in unseen environments, enabling efficient planning in challenging settings.
space and found trajectories faster than traditional planners.

VQ-MPT can also use 3D environment representations such as point clouds to generate sampling regions. We evaluated the framework on a 7D Panda robot arm with a point cloud environment representation. The dictionary encodings can capture diverse sets of valid configurations in 7D space (Fig. 2). An example of the trajectory planned by the VQ-MPT framework is shown in Fig. 5 (Center). The cutoff time set was 100 s. The VQ-MPT planner generates a trajectory nearly 5× faster with fewer vertices than the next most accurate planner. MPNet performs poorly compared to VQ-MPT; its rigid feature encoding potentially prevents it from generalizing to larger point cloud environments. VQ-MPT, in contrast, learns to identify suitable regions to sample in the joint space using point cloud data of different sizes.

We also tested the framework on a bi-manual Panda arm setup with 14D. An example of a VQ-MPT trajectory is shown in Fig. 5 (Right). Stage 1 captures the planning space with the same 2048 dictionary values used in the 7D Panda experiment. The cutoff time was 250 s. While BIT* performed relatively well compared to traditional planners for the 2D and 7D problems, its performance and accuracy decreased due to the high-dimensional planning space. Since Stage 1 of the VQ-MPT framework encodes self-collision-free regions, it is easier for the planner to generate feasible trajectories in Stage 2, resulting in faster trajectory generation with fewer vertices.

C. Results - Out-of-Distribution Environments

Our next set of experiments evaluated VQ-MPT's performance for the 7D and 14D robots in environments very different from the training environments. We test our framework on different planning scenes resembling real-world scenarios (Fig. 1). We test the model for each robot on 500 and 10 start and goal locations for simulation and real-world environments, respectively. The cutoff time for each planner was set at 100 s. The results of the experiments are summarized in Table III, and the plot of the percentage of paths solved across planning time is given in Fig. 7. The higher dimensional 7D and 14D spaces are challenging. The environment is even more challenging because of the goal location inside the shelf, since it reduces the number of feasible trajectories in the same way a narrow passage eliminates feasible trajectories for mobile robots [38]. Even non-optimal planners like RRT solve only 75-91% of trajectories. Existing optimal SMP planners cannot achieve the same accuracy as VQ-MPT even after relaxing path length constraints.

To evaluate the performance of VQ-MPT on physical sensor data, we tested a trained model in a real-world environment (Fig. 6). The environment was represented using point cloud data from Azure Kinect sensors, and collision checking was done using the octomap collision checker from MoveIt¹. The camera-to-robot-base transform was estimated using a markerless pose estimation technique [39]. Our results show that the model can plan trajectories faster than RRT with the same accuracy. We observed that VQ-MPT trajectories are also shorter than RRT trajectories, which can be clearly seen in some of the attached videos. This experiment shows that VQ-MPT models can also generalize well to physical sensor data without further training or fine-

¹ https://moveit.ros.org/
tuning. Such generalization will benefit the larger robotics community, since other researchers can use trained models in diverse settings without collecting new data or fine-tuning the model.

V. CONCLUSION

VQ-MPT can plan near-optimal paths in a fraction of the time required by traditional planners, scales to higher-dimensional planning spaces, and achieves better generalizability than previous learning-based planners. Our approach will be beneficial for planning multi-arm robot systems like the ABB YuMi and Intuitive's da Vinci® Surgical System. It is also helpful for applications where generating nodes and edges for SMPs is computationally expensive, such as constrained motion planning [40]. Future work will extend VQ-MPT to these applications.

REFERENCES

[1] S. M. LaValle and J. J. Kuffner, "Randomized kinodynamic planning," The International Journal of Robotics Research, 2001.
[2] L. Kavraki, P. Svestka, J.-C. Latombe, and M. Overmars, "Probabilistic roadmaps for path planning in high-dimensional configuration spaces," IEEE Trans. on Robotics and Auto., 1996.
[3] D. Hsu, T. Jiang, J. Reif, and Z. Sun, "The bridge test for sampling narrow passages with probabilistic roadmap planners," in IEEE Int. Conf. on Robotics and Auto., 2003.
[4] Z.-Y. Chiu, F. Richter, E. K. Funk, R. K. Orosco, and M. C. Yip, "Bimanual regrasping for suture needles using reinforcement learning for rapid motion planning," in IEEE Int. Conf. on Robotics and Auto., 2021.
[5] R. Alterovitz, K. Goldberg, and A. Okamura, "Planning for steerable bevel-tip needle insertion through 2D soft tissue with obstacles," in IEEE Int. Conf. on Robotics and Auto., 2005.
[6] J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, "Informed RRT*: Optimal sampling-based path planning focused via direct sampling of an admissible ellipsoidal heuristic," in Int. Conf. on Intelligent Robots and Systems, 2014.
[7] ——, "Batch informed trees (BIT*): Sampling-based optimal planning via the heuristically guided search of implicit random geometric graphs," in IEEE Int. Conf. on Robotics and Auto., 2015.
[8] A. H. Qureshi and Y. Ayaz, "Potential functions based sampling heuristic for optimal path planning," Autonomous Robots, 2016.
[9] Z. Tahir, A. H. Qureshi, Y. Ayaz, and R. Nawaz, "Potentially guided bidirectionalized RRT* for fast optimal path planning in cluttered environments," Robotics and Autonomous Systems, 2018.
[10] P. Lehner and A. Albu-Schäffer, "The repetition roadmap for repetitive constrained motion planning," IEEE Robot. and Autom. Letters, 2018.
[11] C. Chamzas, Z. Kingston, C. Quintero-Peña, A. Shrivastava, and L. E. Kavraki, "Learning sampling distributions using local 3D workspace decompositions for motion planning in high dimensions," in IEEE Int. Conf. on Robotics and Auto., 2021.
[12] B. Ichter and M. Pavone, "Robot motion planning in learned latent spaces," IEEE Robotics and Auto. Letters, 2019.
[13] R. Kumar, A. Mandalika, S. Choudhury, and S. Srinivasa, "LEGO: Leveraging experience in roadmap generation for sampling-based planning," in Int. Conf. on Intelligent Robots and Systems, 2019.
[14] A. H. Qureshi, Y. Miao, A. Simeonov, and M. C. Yip, "Motion planning networks: Bridging the gap between learning-based and classical motion planners," IEEE Trans. on Robotics, 2020.
[15] J. J. Johnson, U. S. Kalra, A. Bhatia, L. Li, A. H. Qureshi, and M. C. Yip, "Motion planning transformers: A motion planning framework for mobile robots," 2021.
[16] B. Chen, B. Dai, Q. Lin, G. Ye, H. Liu, and L. Song, "Learning to plan in high dimensions via neural exploration-exploitation trees," in Int. Conf. on Learning Representations, 2020.
[17] C. Yu and S. Gao, "Reducing collision checking for sampling-based motion planning using graph neural networks," in Advances in Neural Information Processing Systems, 2021.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017.
[19] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
[20] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," in Advances in Neural Information Processing Systems, 2020.
[21] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, "Decision transformer: Reinforcement learning via sequence modeling," in Advances in Neural Information Processing Systems, 2021.
[22] M. Janner, Q. Li, and S. Levine, "Offline reinforcement learning as one big sequence modeling problem," in Advances in Neural Information Processing Systems, 2021.
[23] R. Yang, M. Zhang, N. Hansen, H. Xu, and X. Wang, "Learning vision-guided quadrupedal locomotion end-to-end with cross-modal transformers," in Int. Conf. on Learning Representations, 2022.
[24] D. S. Chaplot, D. Pathak, and J. Malik, "Differentiable spatial planning using transformers," in Int. Conf. on Machine Learning, 2021.
[25] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," in Advances in Neural Information Processing Systems, 2017.
[26] J. Yu, X. Li, J. Y. Koh, H. Zhang, R. Pang, J. Qin, A. Ku, Y. Xu, J. Baldridge, and Y. Wu, "Vector-quantized image modeling with improved VQGAN," in Int. Conf. on Learning Representations, 2022.
[27] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A structured self-attentive sentence embedding," in Int. Conf. on Learning Representations, 2017.
[28] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in Int. Conf. on Learning Representations, 2021.
[29] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu, "On layer normalization in the transformer architecture," in Int. Conf. on Machine Learning, 2020.
[30] A. Razavi, A. van den Oord, and O. Vinyals, "Generating diverse high-fidelity images with VQ-VAE-2," in Advances in Neural Information Processing Systems, 2019.
[31] H. Hu and G. Kantor, "Parametric covariance prediction for heteroscedastic noise," in Int. Conf. on Intelligent Robots and Systems (IROS), 2015.
[32] K. Liu, K. Ok, W. Vega-Brown, and N. Roy, "Deep inference for covariance estimation: Learning Gaussian noise models for state estimation," in IEEE Int. Conf. on Robotics and Auto. (ICRA), 2018, pp. 1436–1443.
[33] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia, "Incorporating second-order functional knowledge for better option pricing," in Advances in Neural Information Processing Systems, 2000.
[34] I. A. Şucan, M. Moll, and L. E. Kavraki, "The Open Motion Planning Library," IEEE Robotics & Auto. Magazine, 2012.
[35] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," in Advances in Neural Information Processing Systems, 2017.
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Int. Conf. on Learning Representations, 2015.
[37] N. Das and M. Yip, "Learning-based proxy collision detection for robot motion planning applications," IEEE Trans. on Robotics, 2020.
[38] J. Borenstein and Y. Koren, "The vector field histogram - fast obstacle avoidance for mobile robots," IEEE Trans. on Robotics and Auto., 1991.
[39] J. Lu, F. Richter, and M. C. Yip, "Markerless camera-to-robot pose estimation via self-supervised sim-to-real transfer," 2023.
[40] J. J. Johnson and M. C. Yip, "Chance-constrained motion planning using modeled distance-to-collision functions," in Int. Conf. on Auto. Science and Engineering (CASE), 2021.