Trajectron Paper
1 Introduction
Predicting the future behavior of humans is a necessary part of developing safe
human-interactive autonomous systems. Humans can naturally navigate through
many social interaction scenarios because they have an intrinsic “theory of
mind,” which is the capacity to reason about other people’s actions in terms
of their mental states [14]. As a result, imbuing autonomous systems with this
capability could enable more informed decision making and proactive actions
to be taken in the presence of other intelligent agents, e.g., in human-robot in-
teraction scenarios. Figure 1 illustrates a scenario where predicting the intent
of other agents may inform an autonomous vehicle’s path planning and deci-
sion making. Indeed, multi-agent behavior prediction has already become a core
* Equal contribution.
† Work done as a visiting student in the Autonomous Systems Lab.
T. Salzmann*, B. Ivanovic*, et al.
Fig. 1. Exemplary road scene depicting pedestrians crossing a road in front of a vehicle
which may continue straight or turn right. The graph representation of the scene is
shown on the ground, where each agent and their interactions are represented as nodes
and edges, visualized as white circles and dashed black lines, respectively. Arrows depict
potential future agent velocities, with colors representing different high-level future
behavior modes.
integrators) and past trajectory data (i.e., no considerations are made for added
environmental information, if available).
In this work we present Trajectron++, an open and extensible approach built
upon the Trajectron [20] framework which produces dynamically-feasible trajec-
tory forecasts from heterogeneous input data for multiple interacting agents of
distinct semantic types. Our key contributions are twofold: First, we show how
to effectively incorporate high-dimensional data through the lens of encoding
semantic maps. Second, we propose a general method of incorporating dynam-
ics constraints into learning-based methods for multi-agent trajectory forecast-
ing. Trajectron++ is designed to be tightly integrated with downstream robotic
modules, with the ability to produce trajectories that are optionally conditioned
on future ego-agent motion plans. We present experimental results on a vari-
ety of datasets, which collectively demonstrate that Trajectron++ outperforms
an extensive selection of state-of-the-art deterministic and generative trajectory
prediction methods, in some cases achieving 60% lower average prediction error.
2 Related Work
Deterministic Regressors. Many earlier works in human trajectory forecast-
ing were deterministic regression models. One of the earliest, the Social Forces
model [16], models humans as physical objects affected by Newtonian forces
(e.g., with attractors at goals and repulsors at other agents). Since then, many
approaches have been applied to the problem of trajectory forecasting, formu-
lating it as a time-series regression problem and applying methods like Gaussian
Process Regression (GPR) [38,48], Inverse Reinforcement Learning (IRL) [32],
and Recurrent Neural Networks (RNNs) [1,34,47] to good effect. An excellent
review of such methods can be found in [40].
Generative, Probabilistic Approaches. Recently, generative approaches
have emerged as state-of-the-art trajectory forecasting methods due to recent ad-
vancements in deep generative models [44,12]. Notably, they have caused a shift
from focusing on predicting the single best trajectory to producing a distribution
of potential future trajectories. This is advantageous in autonomous systems as
full distribution information is more useful for downstream tasks, e.g., motion
planning and decision making where information such as variance can be used to
make safer decisions. Most works in this category use a deep recurrent backbone
architecture with a latent variable model, such as a Conditional Variational Au-
toencoder (CVAE) [44], to explicitly encode multimodality [31,21,11,42,20,39],
or a Generative Adversarial Network (GAN) [12] to implicitly do so [13,41,28,53].
Common to both approach styles is the need to produce position distributions.
GAN-based models can directly produce these and CVAE-based recurrent mod-
els usually rely on a bivariate Gaussian Mixture Model (GMM) to output po-
sition distributions. However, both of these output structures make it difficult
to enforce dynamics constraints, e.g., non-holonomic constraints such as those
arising from no side-slip conditions. Of these, the Trajectron [20] and MATF [53]
are the best-performing CVAE-based and GAN-based models, respectively, on
standard pedestrian trajectory forecasting benchmarks [37,33].
Accounting for Dynamics and Heterogeneous Data. There are few
works that account for dynamics or make use of data modalities outside of
prior trajectory information. This is mainly because standard trajectory fore-
casting benchmarks seldom include any other information, a fact that will surely
change following the recent release of autonomous vehicle-based datasets with
rich multi-sensor data [50,6,9,26]. As for dynamics, current methods almost ex-
clusively reason about positional information. This does not capture dynamical
constraints, however, which might lead to predictions in position space that are
unrealizable by the underlying control variables (e.g., a car moving sideways).
Table 1 provides a detailed breakdown of recent state-of-the-art approaches and
their consideration of these desiderata.
3 Problem Formulation
[Figure: architecture diagram. Recoverable labels: Map CNN; Robot Future
encoder (LSTM, output e_R); Node Future encoder (LSTM, output e_y);
q_φ(z | x, y, M, y_R); decoder inputs [e_x, R; z; y(t)] and [e_x, R; z; ŷ(t+1)].
Legend: ∫ Dynamics Integration; Dense Layer; Random Sampling; + Concatenation;
Offline Training; Online Inference; Both.]
4 Trajectron++
1 All of our source code, trained models, and data can be found online at
https://github.com/StanfordASL/Trajectron-plus-plus.
a car looks much farther ahead on the road than a pedestrian does while walking
on the sidewalk.
Modeling Agent History. Once a graph of the scene is constructed, the
model needs to encode a node’s current state, its history, and how it is
influenced by its neighboring nodes. To encode the observed history of the
modeled agent, their current and previous states are fed into a Long Short-Term
Memory (LSTM) network [19] with 32 hidden dimensions. Since we are interested in
modeling trajectories, the inputs
$x = s_{1,\dots,N(t)}^{(t-H:t)} \in \mathbb{R}^{(H+1) \times N(t) \times D}$
are the current and previous D-dimensional states of the modeled agents. These
are typically positions and velocities, which can be easily estimated online.
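As a rough illustration of how such a state history might be assembled for one agent, the sketch below estimates velocities from tracked positions by backward finite differences. The function name `build_state_history`, the choice D = 4 (x, y, vx, vy), and the handling of the first timestep are assumptions made for this example, not details taken from the paper.

```python
def build_state_history(positions, dt):
    """Return a list of [x, y, vx, vy] states from a list of (x, y) positions.

    Velocities are backward finite differences; the first step reuses the
    difference between steps 0 and 1 so that every timestep has a full state.
    """
    if len(positions) < 2:
        raise ValueError("need at least two positions to estimate velocity")
    states = []
    for k, (x, y) in enumerate(positions):
        j = max(k, 1)  # step 0 borrows the difference between steps 0 and 1
        vx = (positions[j][0] - positions[j - 1][0]) / dt
        vy = (positions[j][1] - positions[j - 1][1]) / dt
        states.append([x, y, vx, vy])
    return states


# H = 3 past steps plus the current one, sampled at dt = 0.5 s (illustrative values).
history = build_state_history(
    [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.5, 0.5)], dt=0.5
)
```

Here `history` has H + 1 = 4 rows of D = 4 values each; stacking such histories over the N(t) agents in a scene would give the (H+1) × N(t) × D input described above.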
Ideally, agent models should be chosen to best match their semantic class
$S_i$. For example, one would usually model vehicles on the road using a bicycle
model [27,35]. However, estimating the bicycle model parameters of another ve-
hicle from online observations is very difficult as it requires estimation of the
vehicle’s center of mass, wheelbase, and front wheel steer angle. As a result, in
this work pedestrians are modeled as single integrators and wheeled vehicles are
modeled as dynamically-extended unicycles [29], enabling us to account for key
non-holonomic constraints (e.g., no side-slip constraints) [35] without requiring
complex online parameter estimation procedures – we will show through exper-
iments that such a simplified model is already quite impactful on improving
prediction accuracy. While the dynamically-extended unicycle model serves as
an important representative example, we note that our approach can also be
generalized to other dynamics models, provided their parameters can either be
assumed or quickly estimated online.
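To make the no side-slip property concrete, here is a minimal sketch of a dynamically-extended unicycle rolled out from a sequence of controls. It uses the standard state (x, y, heading, speed) and controls (yaw rate, acceleration); the forward-Euler step and the names `step_unicycle` and `rollout` are simplifying assumptions for this example and are not necessarily the integration scheme used by the authors.

```python
import math


def step_unicycle(state, control, dt):
    """One forward-Euler step of a dynamically-extended unicycle.

    state   = (x, y, heading, speed)
    control = (yaw_rate, acceleration)
    """
    x, y, heading, speed = state
    yaw_rate, accel = control
    return (
        x + speed * math.cos(heading) * dt,  # motion is always along the heading,
        y + speed * math.sin(heading) * dt,  # so there is no sideways slip
        heading + yaw_rate * dt,
        speed + accel * dt,
    )


def rollout(state, controls, dt):
    """Integrate a list of controls and return every visited state."""
    trajectory = [state]
    for control in controls:
        state = step_unicycle(state, control, dt)
        trajectory.append(state)
    return trajectory


# Driving straight along +x at 2 m/s for four steps of 0.5 s.
traj = rollout((0.0, 0.0, 0.0, 2.0), [(0.0, 0.0)] * 4, dt=0.5)
```

Because positions are only ever updated along the current heading, any sequence of controls yields a trajectory that a unicycle could actually execute, which is the point of predicting controls rather than positions.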
Encoding Agent Interactions. To model neighboring agents’ influence on
the modeled agent, Trajectron++ encodes graph edges in two steps. First, edge
information is aggregated from neighboring agents of the same semantic class. In
this work, an element-wise sum is used as the aggregation operation. We choose
to combine features in this way rather than with concatenation or an average to
handle a variable number of neighboring nodes with a fixed architecture while
preserving count information [3,21,22]. These aggregated states are then fed into
an LSTM with 8 hidden dimensions whose weights are shared across all edge in-
stances of the same type, e.g., all Pedestrian-Bus edge LSTMs share the same
weights. Then, the encodings from all edge types that connect to the modeled
node are aggregated to obtain one “influence” representation vector, represent-
ing the effect that all neighboring nodes have. For this, an additive attention
module is used [2]. Finally, the node history and edge influence encodings are
concatenated to produce a single node representation vector, $e_x$.
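The argument for summing rather than averaging can be seen in a small sketch. The helper names `aggregate_sum` and `aggregate_mean` and the two-dimensional states are hypothetical; the example covers only the aggregation step, not the edge LSTM or the attention module.

```python
def aggregate_sum(neighbor_states):
    """Element-wise sum over a variable-length list of equal-size state vectors."""
    if not neighbor_states:
        raise ValueError("need at least one neighbor state")
    return [sum(values) for values in zip(*neighbor_states)]


def aggregate_mean(neighbor_states):
    """Element-wise mean, shown only for comparison with the sum."""
    total = aggregate_sum(neighbor_states)
    return [value / len(neighbor_states) for value in total]


one_neighbor = [[1.0, 2.0]]
two_neighbors = [[1.0, 2.0], [1.0, 2.0]]

sum_one, sum_two = aggregate_sum(one_neighbor), aggregate_sum(two_neighbors)
mean_one, mean_two = aggregate_mean(one_neighbor), aggregate_mean(two_neighbors)
```

Both operations return a fixed-size vector regardless of how many neighbors there are, but only the sum distinguishes one neighbor from two identical ones; the mean gives the same result in both cases.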
Incorporating Heterogeneous Data. Modern sensor suites are able to
produce much more information than just tracked trajectories of other agents.
Notably, HD maps are used by many real-world systems to aid localization as well
as inform navigation. Depending on sensor availability and sophistication, maps
can range in fidelity from simple binary obstacle maps, i.e., $M \in \{0, 1\}^{H \times W \times 1}$,
to HD semantic maps, e.g., $M \in \{0, 1\}^{H \times W \times L}$ where each layer $1 \le \ell \le L$
corresponds to an area with semantic type (e.g., “driveable area,” “road block,”
Trajectron++: Dynamically-Feasible Trajectory Forecasting 7
ing methods which directly output positions, our approach is uniquely able to
guarantee that its trajectory samples are dynamically feasible by integrating an
agent’s dynamics with the predicted controls.
Output Configurations. Based on the desired use case, Trajectron++ can
produce many different outputs. The main four are outlined below.
1. Most Likely (ML): The model’s deterministic and most-likely single out-
put. The high-level latent behavior mode and output trajectory are the modes
of their respective distributions, where
3. Full: The model’s full sampled output, where z and y are sampled
sequentially according to
$z \sim p_\theta(z \mid x), \quad y \sim p_\psi(y \mid x, z).$ (3)
4. Distribution: Due to the use of a discrete latent variable and Gaussian
output structure, the model can provide an analytic output distribution by
directly computing
$p(y \mid x) = \sum_{z \in Z} p_\psi(y \mid x, z) \, p_\theta(z \mid x)$.
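The output configurations above can be sketched with a toy model that has a discrete latent and Gaussian outputs. To keep the example short it uses a one-dimensional Gaussian per latent value instead of the bivariate trajectory distributions in the paper; the numbers in `p_z` and `components` and all function names are illustrative assumptions.

```python
import math
import random


def gaussian_pdf(y, mean, std):
    """Density of a 1-D Gaussian; stands in for the model's output distribution."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def analytic_density(y, p_z, components):
    """Distribution output: p(y | x) = sum_z p(y | x, z) p(z | x)."""
    return sum(w * gaussian_pdf(y, m, s) for w, (m, s) in zip(p_z, components))


def sample_full(p_z, components, rng):
    """Full output: draw z ~ p(z | x), then y ~ p(y | x, z)."""
    z = rng.choices(range(len(p_z)), weights=p_z, k=1)[0]
    mean, std = components[z]
    return z, rng.gauss(mean, std)


def most_likely(p_z, components):
    """Most Likely output: modal z, then the mode (mean) of its Gaussian."""
    z = max(range(len(p_z)), key=lambda k: p_z[k])
    return z, components[z][0]


p_z = [0.7, 0.3]                        # p(z | x) over two behavior modes
components = [(0.0, 1.0), (4.0, 0.5)]   # (mean, std) of p(y | x, z) for each z

ml_mode, ml_output = most_likely(p_z, components)
sampled_mode, sampled_output = sample_full(p_z, components, random.Random(0))
density_at_zero = analytic_density(0.0, p_z, components)
```

Because z takes finitely many values, the sum in `analytic_density` is exact, with no sampling needed; this is what makes the Distribution output available in closed form.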
Training the Model. We adopt the InfoVAE [52] objective function, and
modify it to use discrete latent states in a conditional formulation (since the
model uses a CVAE). Formally, we aim to solve
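$$\max_{\phi,\theta,\psi} \sum_{i=1}^{N} \mathbb{E}_{z \sim q_\phi(\cdot \mid x_i, y_i)} \big[ \log p_\psi(y_i \mid x_i, z) \big] - \beta D_{KL}\big( q_\phi(z \mid x_i, y_i) \,\|\, p_\theta(z \mid x_i) \big) + \alpha I_q(x; z),$$ (4)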
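For a discrete latent, each term of this objective can be evaluated without sampling. The sketch below is one possible reading, under stated assumptions: the expectation and KL terms are summed over examples, and the mutual information is estimated over the batch as H(E[q]) − E[H(q)]. That estimator, the grouping of terms, and all function names are assumptions for illustration rather than the authors' implementation.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0.0)


def kl_categorical(q, p):
    """D_KL(q || p) for two categorical distributions over the same support."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0.0)


def mutual_information(q_batch):
    """Batch estimate of I_q(x; z): entropy of the mean minus mean entropy."""
    n = len(q_batch)
    marginal = [sum(col) / n for col in zip(*q_batch)]
    return entropy(marginal) - sum(entropy(q) for q in q_batch) / n


def objective(q_batch, p_batch, log_lik_batch, alpha, beta):
    """Sum of E_q[log p(y | x, z)] - beta * KL(q || p), plus alpha * I_q(x; z).

    log_lik_batch[i][k] holds log p(y_i | x_i, z = k).
    """
    total = 0.0
    for q, p, log_lik in zip(q_batch, p_batch, log_lik_batch):
        expected_ll = sum(qk * ll for qk, ll in zip(q, log_lik))
        total += expected_ll - beta * kl_categorical(q, p)
    return total + alpha * mutual_information(q_batch)


# One example whose posterior equals its prior: the KL and MI terms both vanish.
value = objective(
    q_batch=[[0.5, 0.5]],
    p_batch=[[0.5, 0.5]],
    log_lik_batch=[[-1.0, -3.0]],
    alpha=1.0,
    beta=1.0,
)
```

Note that `kl_categorical` assumes the prior assigns non-zero probability wherever the posterior does; a practical implementation would clamp or smooth the probabilities first.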
5 Experiments
2.5 Hz (∆t = 0.4s). In total, there are 5 sets of data, 4 unique scenes, and 1536
unique pedestrians. They are a standard benchmark in the field, containing chal-
lenging behaviors such as couples walking together, groups crossing each other,
and groups forming and dispersing. However, they only contain pedestrians, so
we also evaluate on the recently-released nuScenes dataset. It is a large-scale
dataset for autonomous driving with 1000 scenes in Boston and Singapore. Each
scene is annotated at 2 Hz (∆t = 0.5s) and is 20s long, containing up to 23
semantic object classes as well as HD semantic maps with 11 annotated layers.
Trajectron++ was implemented in PyTorch [36] on a desktop computer run-
ning Ubuntu 18.04 containing an AMD Ryzen 1800X CPU and two NVIDIA
GTX 1080 Ti GPUs. We trained the model for 100 epochs (∼ 3 hours) on the
pedestrian datasets and 12 epochs (∼ 8 hours) on the nuScenes dataset.
Evaluation Metrics. As in prior work [1,13,20,41,28,53], our method for
trajectory forecasting is evaluated with the following four error metrics:
1. Average Displacement Error (ADE): Mean $\ell_2$ distance between the ground
truth and predicted trajectories.
2. Final Displacement Error (FDE): $\ell_2$ distance between the predicted final
position and the ground truth final position at the prediction horizon T.
3. Kernel Density Estimate-based Negative Log Likelihood (KDE NLL): Mean
NLL of the ground truth trajectory under a distribution created by fitting a
kernel density estimate on trajectory samples [20,45].
4. Best-of-N (BoN): The minimum ADE and FDE from N randomly-sampled
trajectories.
We compare our method to an extensive set of state-of-the-art
deterministic and generative approaches.
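The displacement-based metrics above are simple to compute; the following sketch shows ADE, FDE, and Best-of-N for trajectories given as lists of (x, y) points. The function names are hypothetical, and taking the minimum ADE and the minimum FDE independently over the samples is an assumed convention.

```python
import math


def displacement_errors(pred, truth):
    """Per-timestep l2 distances between two equal-length lists of (x, y)."""
    if len(pred) != len(truth):
        raise ValueError("trajectories must have the same length")
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]


def ade(pred, truth):
    """Average Displacement Error: mean l2 distance over all timesteps."""
    errors = displacement_errors(pred, truth)
    return sum(errors) / len(errors)


def fde(pred, truth):
    """Final Displacement Error: l2 distance at the last timestep."""
    return displacement_errors(pred, truth)[-1]


def best_of_n(samples, truth):
    """Minimum ADE and minimum FDE over N sampled trajectories."""
    return min(ade(s, truth) for s in samples), min(fde(s, truth) for s in samples)


truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
samples = [
    [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],  # constant 1 m lateral offset
    [(1.0, 0.0), (2.0, 0.0), (3.0, 4.0)],  # exact until a 4 m miss at the end
]
bon_ade, bon_fde = best_of_n(samples, truth)
```

The second sample illustrates why both metrics are reported: it matches the ground truth for two of three steps yet has a large final error, so its ADE and FDE rank it very differently.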
Deterministic Baselines. Our method is compared against the following
deterministic baselines: (1) Linear : A linear regressor with parameters estimated
by minimizing least square error. (2) LSTM : An LSTM network with only agent
history information. (3) Social LSTM [1]: Each agent is modeled with an LSTM
and nearby agents’ hidden states are pooled at each timestep using a proposed
social pooling operation. (4) Social Attention [47]: Same as [1], but all other
agents’ hidden states are incorporated via a proposed social attention operation.
Generative Baselines. On the ETH and UCY datasets, our method is com-
pared against the following generative baselines: (1) S-GAN [13]: Each agent is
modeled with an LSTM-GAN, which is an LSTM encoder-decoder whose out-
puts are the generator of a GAN. The generated trajectories are then evalu-
ated against the ground truth trajectories with a discriminator. (2) SoPhie [41]:
An LSTM-GAN with the addition of a proposed physical and social attention
module. (3) MATF [53]: An LSTM-GAN model that leverages CNNs to fuse
agent relationships and encode environmental information. (4) Trajectron [20]:
An LSTM-CVAE encoder-decoder which is explicitly constructed to match the
spatiotemporal structure of the scene. Its scene abstraction is similar to ours,
but uses undirected edges.
On the nuScenes dataset, the following methods are also compared against:
(5) Convolutional Social Pooling (CSP) [11]: An LSTM-based approach which
explicitly considers a fixed number of movement classes and predicts which of
Table 2. (a) Our model’s deterministic Most Likely output outperforms other deter-
ministic methods on displacement error metrics, even if it was not originally trained to
do so. (b) Our model’s probabilistic Full output significantly outperforms other meth-
ods, yielding accurate predictions even in a small number of samples. Lower is better.
Bold indicates best.
those the modeled agent is likely to take. (6) CAR-Net [42]: An LSTM-based
approach which encodes scene context with visual attention. (7) SpAGNN [7]: A
CNN encodes raw LIDAR and semantic map data to produce object detections,
from which a Graph Neural Network (GNN) produces probabilistic, interaction-
aware trajectories.
Evaluation Methodology. For the ETH and UCY datasets, a leave-one-out
strategy is used for evaluation, similar to previous works [1,13,20,28,41,53], where
the model is trained on four datasets and evaluated on the held-out fifth. An
observation length of 8 timesteps (3.2s) and a prediction horizon of 12 timesteps
(4.8s) are used for evaluation. For the nuScenes dataset, we split off 15% of the
train set for hyperparameter tuning and test on the provided validation set.
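The leave-one-out protocol and the timestep arithmetic can be sketched as follows. The dataset names are placeholders, since the five ETH/UCY subsets are not named in this section, and `leave_one_out_splits` is a hypothetical helper.

```python
def leave_one_out_splits(dataset_names):
    """Yield (train_names, held_out_name) pairs, one per dataset."""
    for held_out in dataset_names:
        train = [name for name in dataset_names if name != held_out]
        yield train, held_out


# Placeholder names standing in for the five pedestrian datasets.
DATASETS = ["set_a", "set_b", "set_c", "set_d", "set_e"]
DT = 0.4  # seconds between frames at 2.5 Hz
OBSERVED_STEPS, PREDICTED_STEPS = 8, 12

splits = list(leave_one_out_splits(DATASETS))
observed_seconds = OBSERVED_STEPS * DT    # about 3.2 s of history
predicted_seconds = PREDICTED_STEPS * DT  # about 4.8 s of prediction horizon
```

Each of the five splits trains on four datasets and evaluates on the remaining one, so every dataset serves as the held-out set exactly once.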
Throughout the following, we report the performance of Trajectron++ in
multiple configurations. Specifically, Ours refers to the base model using only
node and edge encoding, trained to predict agent velocities and Euler integrating
velocity to produce positions; Ours+∫ is the base model with dynamics
integration, trained to predict control actions and integrating the agent’s
dynamics with the control actions to produce positions; Ours+∫, M additionally
includes the map encoding CNN; and Ours+∫, M, y_R adds the robot future encoder.
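For the base configuration, turning predicted velocities into positions amounts to a forward-Euler single integrator. The sketch below shows that step in isolation; the function name and the sample velocities are assumptions for illustration.

```python
def integrate_velocities(start, velocities, dt):
    """Forward-Euler single integrator: p[k+1] = p[k] + v[k] * dt."""
    x, y = start
    positions = []
    for vx, vy in velocities:
        x += vx * dt
        y += vy * dt
        positions.append((x, y))
    return positions


# Three predicted velocity steps at dt = 0.5 s, starting from the origin.
future = integrate_velocities(
    (0.0, 0.0), [(1.0, 0.0), (1.0, 0.0), (0.0, 2.0)], dt=0.5
)
```

A single integrator places no constraint on the direction of motion, which suits pedestrians; the ∫ configurations replace this step with integration of the agent's own dynamics driven by predicted controls.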
Our approach is first evaluated on the ETH [37] and UCY [33] Pedestrian
Datasets, against deterministic methods on standard trajectory forecasting met-
Table 3. Mean KDE-based NLL for each dataset. Lower is better. 2000 trajectories
were sampled per model at each prediction timestep. Bold indicates the best values.
Table 4. [nuScenes] (a): Vehicle-only FDE across time for Trajectron++ compared
to that of other single-trajectory and probabilistic approaches. Bold indicates best.
(b): Pedestrian-only FDE and KDE NLL across time for Trajectron++.
(a) Vehicle-only
                          FDE (m)
Method               @1s    @2s    @3s    @4s
Const. Velocity      0.32   0.89   1.70   2.73
S-LSTM* [1,7]        0.47   -      1.61   -
CSP* [11,7]          0.46   -      1.50   -
CAR-Net* [42,7]      0.38   -      1.35   -
SpAGNN* [7]          0.36   -      1.23   -
Ours (ML)            0.18   0.57   1.25   2.24
Ours+∫, M (ML)       0.07   0.45   1.14   2.20

(b) Pedestrian-only
                          KDE NLL                       FDE (m)
Method               @1s    @2s    @3s    @4s     @1s    @2s    @3s    @4s
Ours (ML)            −2.69  −2.46  −1.76  −1.09   0.03   0.17   0.37   0.60
Ours+∫, M (ML)       −5.58  −3.96  −2.77  −1.89   0.01   0.17   0.37   0.62

* We subtracted 22-24cm from these reported values (their detection/tracking error [7]), as we do
not use a detector/tracker. This is done to establish a fair comparison.
Legend: ∫ = Integration via Dynamics, M = Map Encoding, y_R = Robot Future Encoding.
Since other methods use a detection/tracking module (whereas ours does not),
to establish a fair comparison we subtracted other methods’ detection and track-
ing error from their reported values. The dynamics integration scheme and map
encoding yield a noticeable improvement with vehicles, as their dynamically-
extended unicycle dynamics now differ from the single integrator assumption
made by the base model. Note that our method was only trained to predict 3s
into the future, thus its performance at 4s also provides a measure of its ca-
pability to generalize beyond its training configuration. Other methods do not
report values at 2s and 4s. As can be seen, Trajectron++ outperforms existing
approaches without facing a sharp degradation in performance after 3s. Our
approach’s performance on pedestrians is reported in Table 4 (b), where the
inclusion of HD maps and dynamics integration improves performance, as it
does on the pedestrian datasets.
Ablation Study. To develop an understanding of which model components
influence performance, a comprehensive ablation study is performed in Table 5.
As can be seen in the first row, even the base model’s deterministic ML output
performs strongly relative to current state-of-the-art approaches for vehicle tra-
jectory forecasting [7]. Adding the dynamics integration scheme yields a drastic
reduction in NLL as well as FDE at all prediction horizons. There is also an as-
sociated slight increase in the frequency of road boundary-violating predictions.
This is a consequence of training in position (as opposed to velocity) space, which
yields more variability in the corresponding predictions. Additionally including
map encoding maintains prediction accuracy while reducing the frequency of
boundary-violating predictions.
The effect of conditioning on the ego-vehicle’s future motion plan is also stud-
ied, with results summarized in Table 5 (b). As one would expect, providing the
model with future motion plans of the ego-vehicle yields significant reductions
in error and road boundary violations. This use-case is common throughout au-
[Figure: panels (a) Ours, (b) +∫, (c) +∫, M. Legend: Ours (ML), Ground Truth.
Map layers: road_divider, lane_divider, drivable_area, road_segment, lane,
ped_crossing, walkway, stop_line.]
6 Conclusion
In this work, we present Trajectron++, a generative multi-agent trajectory fore-
casting approach which uniquely addresses our desiderata for an open, generally-
applicable, and extensible framework. It can incorporate heterogeneous data
beyond prior trajectory information and is able to produce future-conditional
predictions that respect dynamics constraints, all while producing full probabil-
ity distributions, which are especially useful in downstream robotic tasks such
as motion planning, decision making, and control. It achieves state-of-the-art
prediction performance in a variety of metrics on standard and new real-world
multi-agent human behavior datasets.
Acknowledgment. This work was supported in part by the Ford-Stanford
Alliance. This article solely reflects the opinions and conclusions of its authors.
References
1. Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.:
Social LSTM: Human trajectory prediction in crowded spaces. In: IEEE Conf. on
Computer Vision and Pattern Recognition (2016) 3, 5, 9, 10, 11, 12
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning
to align and translate. In: Int. Conf. on Learning Representations (2015) 6
3. Battaglia, P.W., Pascanu, R., Lai, M., Rezende, D., Kavukcuoglu, K.: Interaction
networks for learning about objects, relations and physics. In: Conf. on Neural
Information Processing Systems (2016) 6
4. Bowman, S.R., Vilnis, L., Vinyals, O., Dai, A.M., Jozefowicz, R., Bengio, S.: Gen-
erating sentences from a continuous space. In: Proc. Annual Meeting of the Asso-
ciation for Computational Linguistics (2015)
5. Britz, D., Goldie, A., Luong, M.T., Le, Q.V.: Massive exploration of neural ma-
chine translation architectures. In: Proc. of Conf. on Empirical Methods in Natural
Language Processing. pp. 1442–1451 (2017) 7
6. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A.,
Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous
driving (2019) 4, 8, 12
7. Casas, S., Gulino, C., Liao, R., Urtasun, R.: SpAGNN: Spatially-aware graph neu-
ral networks for relational behavior forecasting from sensor data (2019) 3, 10, 12,
13
8. Casas, S., Luo, W., Urtasun, R.: IntentNet: Learning to predict intention from raw
sensor data. In: Conf. on Robot Learning. pp. 947–956 (2018) 3
9. Chang, M.F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D.,
Carr, P., Lucey, S., Ramanan, D., Hays, J.: Argoverse: 3d tracking and forecasting
with rich maps. In: IEEE Conf. on Computer Vision and Pattern Recognition
(2019) 4
10. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk,
H., Bengio, Y.: Learning phrase representations using rnn encoder-decoder for
statistical machine translation. In: Proc. of Conf. on Empirical Methods in Natural
Language Processing. pp. 1724–1734 (2014) 7
11. Deo, M.F., Trivedi, J.: Multi-modal trajectory prediction of surrounding vehicles
with maneuver based lstms. In: IEEE Intelligent Vehicles Symposium (2018) 4, 9,
12
12. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Conf. on Neural
Information Processing Systems (2014) 3, 4
13. Gupta, A., Johnson, J., Li, F., Savarese, S., Alahi, A.: Social GAN: Socially accept-
able trajectories with generative adversarial networks. In: IEEE Conf. on Computer
Vision and Pattern Recognition (2018) 4, 5, 9, 10, 11
14. Gweon, H., Saxe, R.: Developmental cognitive neuroscience of theory of mind.
In: Neural Circuit Development and Function in the Brain, chap. 20, pp.
367–377. Academic Press (2013). https://doi.org/10.1016/B978-0-12-397267-5.00057-1,
http://www.sciencedirect.com/science/article/pii/B9780123972675000571 1
15. Hallac, D., Leskovec, J., Boyd, S.: Network lasso: Clustering and optimization in
large graphs. In: ACM Int. Conf. on Knowledge Discovery and Data Mining (2015)
16. Helbing, D., Molnár, P.: Social force model for pedestrian dynamics. Physical Re-
view E 51(5), 4282–4286 (1995) 3
17. Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed,
S., Lerchner, A.: beta-VAE: Learning basic visual concepts with a constrained
variational framework. In: Int. Conf. on Learning Representations (2017)
18. Ho, J., Ermon, S.: Multiple futures prediction. In: Conf. on Neural Information
Processing Systems (2019) 3
19. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation
(1997) 6
20. Ivanovic, B., Pavone, M.: The Trajectron: Probabilistic multi-agent trajectory
modeling with dynamic spatiotemporal graphs. In: IEEE Int. Conf. on Computer
Vision (2019) 2, 3, 4, 5, 9, 10, 11
21. Ivanovic, B., Schmerling, E., Leung, K., Pavone, M.: Generative modeling of mul-
timodal multi-human behavior. In: IEEE/RSJ Int. Conf. on Intelligent Robots &
Systems (2018) 4, 5, 6
22. Jain, A., Zamir, A.R., Savarese, S., Saxena, A.: Structural-RNN: Deep learning on
spatio-temporal graphs. In: IEEE Conf. on Computer Vision and Pattern Recog-
nition (2016) 5, 6
23. Jain, A., Casas, S., Liao, R., Xiong, Y., Feng, S., Segal, S., Urtasun, R.: Discrete
residual flow for probabilistic pedestrian behavior prediction. In: Conf. on Robot
Learning (2019) 3
24. Jang, E., Gu, S., Poole, B.: Categorical reparameterization with Gumbel-Softmax.
In: Int. Conf. on Learning Representations (2017) 8
25. Kalman, R.E.: A new approach to linear filtering and prediction problems. ASME
Journal of Basic Engineering 82, 35–45 (1960) 7
26. Kesten, R., Usman, M., Houston, J., Pandya, T., Nadhamuni, K., Ferreira, A.,
Yuan, M., Low, B., Jain, A., Ondruska, P., Omari, S., Shah, S., Kulkarni, A.,
Kazakova, A., Tao, C., Platinsky, L., Jiang, W., Shet, V.: Lyft Level 5 AV Dataset
2019. https://level5.lyft.com/dataset/ (2019) 4
27. Kong, J., Pfeifer, M., Schildbach, G., Borrelli, F.: Kinematic and dynamic vehi-
cle models for autonomous driving control design. In: IEEE Intelligent Vehicles
Symposium (2015) 2, 6
28. Kosaraju, V., Sadeghian, A., Martín-Martín, R., Reid, I., Rezatofighi, S.H.,
Savarese, S.: Social-BiGAT: Multimodal trajectory forecasting using bicycle-GAN
and graph attention networks. In: Conf. on Neural Information Processing Systems
(2019) 3, 4, 9, 10
29. LaValle, S.M.: Better unicycle models. In: Planning Algorithms, pp. 743–743. Cam-
bridge Univ. Press (2006) 6
30. LaValle, S.M.: A simple unicycle. In: Planning Algorithms, pp. 729–730. Cambridge
Univ. Press (2006)
31. Lee, N., Choi, W., Vernaza, P., Choy, C.B., Torr, P.H.S., Chandraker, M.: DESIRE:
distant future prediction in dynamic scenes with interacting agents. In: IEEE Conf.
on Computer Vision and Pattern Recognition (2017) 3, 4
32. Lee, N., Kitani, K.M.: Predicting wide receiver trajectories in American football.
In: IEEE Winter Conf. on Applications of Computer Vision (2016) 3
33. Lerner, A., Chrysanthou, Y., Lischinski, D.: Crowds by example. Computer Graph-
ics Forum 26(3), 655–664 (2007) 4, 8, 10
34. Morton, J., Wheeler, T.A., Kochenderfer, M.J.: Analysis of recurrent neural net-
works for probabilistic modeling of driver behavior. IEEE Transactions on Pattern
Analysis & Machine Intelligence 18(5), 1289–1298 (2017) 3
35. Paden, B., Čáp, M., Yong, S.Z., Yershov, D., Frazzoli, E.: A survey of motion
planning and control techniques for self-driving urban vehicles. IEEE Transactions
on Intelligent Vehicles 1(1), 33–55 (2016) 2, 6
36. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In:
Conf. on Neural Information Processing Systems - Autodiff Workshop (2017) 9
37. Pellegrini, S., Ess, A., Schindler, K., Gool, L.v.: You’ll never walk alone: Modeling
social behavior for multi-target tracking. In: IEEE Int. Conf. on Computer Vision
(2009) 4, 8, 10
38. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning
(Adaptive Computation and Machine Learning). MIT Press, first edn. (2006) 3
39. Rhinehart, N., McAllister, R., Kitani, K., Levine, S.: PRECOG: Prediction con-
ditioned on goals in visual multi-agent settings. In: IEEE Int. Conf. on Computer
Vision (2019) 3, 4
40. Rudenko, A., Palmieri, L., Herman, M., Kitani, K.M., Gavrila, D.M., Arras,
K.O.: Human motion trajectory prediction: A survey (2019), Available at
https://arxiv.org/abs/1905.06113 3
41. Sadeghian, A., Kosaraju, V., Sadeghian, A., Hirose, N., Rezatofighi, S.H., Savarese,
S.: SoPhie: An attentive GAN for predicting paths compliant to social and physical
constraints. In: IEEE Conf. on Computer Vision and Pattern Recognition (2019)
4, 9, 10, 12
42. Sadeghian, A., Legros, F., Voisin, M., Vesel, R., Alahi, A., Savarese, S.: CAR-Net:
Clairvoyant attentive recurrent network. In: European Conf. on Computer Vision
(2018) 4, 10, 12
43. Schöller, C., Aravantinos, V., Lay, F., Knoll, A.: What the constant velocity model
can teach us about pedestrian motion prediction. IEEE Robotics and Automation
Letters (2020)
44. Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep
conditional generative models. In: Conf. on Neural Information Processing Systems
(2015) 3, 4, 7
45. Thiede, L.A., Brahma, P.P.: Analyzing the variety loss in the context of proba-
bilistic trajectory prediction. In: IEEE Int. Conf. on Computer Vision (2019) 9,
11
46. Thrun, S., Burgard, W., Fox, D.: The extended Kalman filter. In: Probabilistic
Robotics, pp. 54–64. MIT Press (2005) 7
47. Vemula, A., Muelling, K., Oh, J.: Social attention: Modeling attention in human
crowds. In: Proc. IEEE Conf. on Robotics and Automation (2018) 3, 5, 9, 10, 11
48. Wang, J.M., Fleet, D.J., Hertzmann, A.: Gaussian process dynamical models for
human motion. IEEE Transactions on Pattern Analysis & Machine Intelligence
30(2), 283–298 (2008) 3
49. Waymo: Safety report (2018), Available at https://waymo.com/safety/.
Retrieved on November 9, 2019 2
50. Waymo: Waymo Open Dataset: An autonomous driving dataset.
https://waymo.com/open/ (2019) 4
51. Zeng, W., Luo, W., Suo, S., Sadat, A., Yang, B., Casas, S., Urtasun, R.: End-to-
end interpretable neural motion planner. In: IEEE Conf. on Computer Vision and
Pattern Recognition (2019) 3
52. Zhao, S., Song, J., Ermon, S.: InfoVAE: Balancing learning and inference in vari-
ational autoencoders. In: Proc. AAAI Conf. on Artificial Intelligence (2019) 8
53. Zhao, T., Xu, Y., Monfort, M., Choi, W., Baker, C., Zhao, Y., Wang, Y., Wu, Y.N.:
Multi-agent tensor fusion for contextual trajectory prediction. In: IEEE Conf. on
Computer Vision and Pattern Recognition (2019) 3, 4, 9, 10, 12