0532 Exploring Dynamic Context For Multi-Path Trajectory Prediction
I. INTRODUCTION
Intelligent autonomous systems, such as robots and autonomous vehicles, have a high demand for the ability to accurately perceive, understand and predict the future behavior of humans for effective and safe deployments in our real-world environment. For example, an autonomous agent will adjust its moving path according to the possible locations of other agents to prevent obstructions or collisions. However, it is challenging to predict the future location of an agent because it is not deterministic: (1) an agent may change its mind during the movement, (2) other agents' behaviors will affect its next step (e.g., to avoid collisions), and (3) the influence from other agents is dynamic. Therefore, it is more beneficial to predict a set of potential trajectories adaptive to the dynamic interactions between agents than to predict a deterministic one. In this work, we seek to explore the dynamic context between agents in traffic scenarios to predict multiple possible trajectories for each agent in the short future (12 steps) by observing their trajectories (8 steps), as showcased in Fig. 1.
Specifically, the main contributions of this work are as follows: (1) It provides a novel framework to predict trajectories of heterogeneous agents (pedestrians, bicycles, vehicles, etc.) in various traffic situations, i.e., 20 different shared spaces and four intersections with mixed traffic. (2) Self-attention modules are integrated into our framework to explore the dynamic context among agents. (3) A set of possible trajectories for each agent is predicted conditioned on its observed trajectory and the learned dynamic context using a CVAE [1, 2] module. Extensive experiments are conducted on two benchmarks, the popular Trajnet challenge [3] and the new large-scale inD [4], to validate the effectiveness of DCENet for trajectory forecasting. To judge the effectiveness of each proposed module, we conduct additional ablation studies. An overview of our framework is depicted in Fig. 2.
∗ Equal contribution, names in alphabetical order.
1 Institute of Cartography and Geoinformatics, Leibniz University Hannover, Germany, {cheng, sester}@ikg.uni-hannover.de
2 Institute of Information Processing, Leibniz University Hannover, Germany, {lastname}@tnt.uni-hannover.de
3 Scene Understanding Group, University of Twente, The Netherlands, [email protected]
This work is supported by the German Research Foundation (DFG) through the Research Training Group SocialCars (GRK 1931) and Germany's Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).
II. RELATED WORK
Trajectory Prediction. Forecasting human trajectory has been researched for decades. In the early stages, many classic approaches are widely applied such as linear regression and Kalman filter [5], Gaussian processes [6] and Markov decision processes [7, 8]. These traditional methods heavily rely on the quality of manually designed features, which cannot work reliably in a real-world environment of complex
[Fig. 2 diagram. Block labels: Observation, Ground Truth, Dynamic maps, Conv1D, Conv2D, FC, Self-attention, GApool, LSTM, µ, σ², z, Predictions. Legend: Only in training; Both in training and inference; Concatenation.]
Fig. 2: The pipeline for the proposed method. The Encoder Y and Encoder X are identical in structure.
spatial-temporal dynamics and are poor at scaling up for dealing with a large amount of data. In recent years, many artificial intelligence (AI) technologies have been boosted by the cutting-edge deep learning technologies [9], including human trajectory prediction [10]–[16]. The deep learning models, especially Recurrent Neural Networks (RNNs) with Long Short-Term Memories (LSTMs), show great power in modeling complex social interactions between agents for collision avoidance and exploiting the time dependency for predicting futures [17]. The Social LSTM network [10] explores the interactions between pedestrians by connecting neighboring LSTMs in the social pooling layer and predicts trajectories for multiple pedestrians. Zhang et al. [13] propose the States Refinement LSTM (SR-LSTM) model that aligns all the agents together and refines the state of each agent through a message-passing framework. Chandra et al. [18] combine LSTM and Convolutional Neural Network (CNN) to model the interactions between heterogeneous road agents. However, many works have pointed out the limited capability of LSTMs in modeling human-human interactions [19, 20]. Hence, the attention module [21] is incorporated in LSTMs to learn the spatial-temporal context of trajectories between pedestrians in [12, 22, 23]. Recently, the Transformer structure [24] has shown its power in context learning and sequential prediction [25, 26]. In this paper, we adopt the self-attention module to encode the dynamic interactions between agents. The recent work [27] seeks to utilize the Transformer structure to predict trajectory instead of LSTMs. Our work differs from it in two essential ways: (1) we use the generic self-attention module rather than the Deep Bidirectional Transformers (BERT) [25], which is a heavy stacked Transformer structure and is pre-trained on large-scale datasets, and (2) our framework is a generative model.
Multi-path Trajectory Prediction. Many approaches have been proposed to predict a socially compliant set of possible trajectories for an agent [11, 28]–[33]. Generative Adversarial Nets (GAN) [34] and CVAE [1, 2] are the most popular generative models used for this task. In [11] a trajectory sampler named Social GAN is proposed that considers the social effects of all agents. The generator is trained to predict a set of trajectories for each agent against a recurrent discriminator. In [12] social and physical attention mechanisms are implemented in the GAN sampler to predict paths for each agent. In [28], multiple plausible prediction samples are generated by a CVAE-based RNN encoder-decoder conditioned on observations. Katyal et al. [33] propose to predict the intent of the target agent using a Bayesian approach as a condition of their CVAE-based LSTM encoder-decoder to help generate multiple paths. Meanwhile, they introduce an LSTM discriminator to train the framework in an adversarial way. Salzmann et al. [35] propose a CVAE-based model using spatial-temporal graphs to predict pedestrian and car trajectories. In [36], scene context and the interactions between individual and group agents are accounted for as a condition in a CVAE-based framework to sample multiple trajectories. [37] applies a determinantal point process to increase the diversity sampling of a CVAE-based model for 2D and 3D motion prediction using synthetic data. Some other works treat the multi-path trajectory prediction problem as the estimation of a multimodal distribution. Cui et al. [32] propose to model the multimodality of vehicle movement prediction with Deep Convolutional Networks. In [30], first, the multimodal distributions are predicted with an evolving strategy by combining the Winner-Takes-All loss [38]. Then, the samples from the first stage fit a distribution for trajectory prediction. Cheng et al. [39] propose AMENet that only employs the self-attention mechanism [24] for learning agent-to-agent interaction. In comparison, DCENet adopts a two-stream architecture [40, 41] of attention modules, with respective streams dedicated to learning the spatial and temporal contexts explicitly.
III. METHOD
A. Problem Formulation
Trajectory prediction is defined as sequentially predicting the future positions $\hat{Y}_i = \{\hat{\mathbf{y}}_i^{T+1}, \cdots, \hat{\mathbf{y}}_i^{T'}\}$ of target agent $i$ by observing its trajectory $X_i = \{\mathbf{x}_i^1, \cdots, \mathbf{x}_i^T\}$, where $\mathbf{x}_i^t = (x_i^t, y_i^t)$ are the coordinates at the $t$-th step and $1 \le t \le T$. Similarly, $\hat{\mathbf{y}}_i^{t'} = (x_i^{t'}, y_i^{t'})$ are the coordinates at the $t'$-th step and $T < t' \le T'$. $T$ is the length of the observed trajectory and $T'$ is the total length of the observed and predicted trajectory in discrete time steps. $\hat{Y}_i$ should be as close to the corresponding ground truth $Y_i$ as possible. The problem of multi-path trajectory prediction can be formulated as predicting a set of trajectories $\hat{\mathbf{Y}}_i = \{\hat{Y}_{i,1}, \cdots, \hat{Y}_{i,N}\}$ by observing $X_i$ for agent $i$, where $N$ is the total number of predicted trajectories.
B. Dynamic Maps
To model the interactions among agents, we first create dynamic maps for each agent that consist of the orientation,
speed and position layers of its intermediate environment. These dynamic maps are different from the ones in [41] that are designed for modeling map rasterization and traffic lights. Centered on the target agent, a map is defined as a rectangular area of size $W \times H$ and divided into grid cells. First, referring to the target agent $i$, the neighboring agents $N(i)$ are mapped into the closest grid $\text{cells}^t_{w \times h}$ according to their relative position as well as the cells reached by their anticipated relative offset (speed) in the $x$ and $y$ directions:
$$\text{cells}_w^t = x_j^t - x_i^t + (\Delta x_j^t - \Delta x_i^t), \qquad \text{cells}_h^t = y_j^t - y_i^t + (\Delta y_j^t - \Delta y_i^t), \tag{1}$$
where $w \le W$, $h \le H$, $j \in N(i)$ and $j \ne i$. The orientation layer $O$ stores the heading direction, which is defined as the angle $\vartheta_j$ in the Euclidean plane and calculated in radians by $\vartheta_j = \operatorname{arctan2}(\Delta y_j^t, \Delta x_j^t)$. $(\Delta y_j^t, \Delta x_j^t)$ is the offset of the position from the $t$-th step to the next one for neighboring agent $j$. The angle is then converted to degrees in $[0, 360)$. Similarly, the speed layer $S$ stores the travel speed and the position layer $P$ stores the position using a binary flag in the cells mapped above. Last, a Min-Max normalization scheme is applied layer-wise, see Fig. 1. The map should cover a large vicinity area. Empirically we found $32 \times 32\,\text{m}^2$ a proper setting considering both the coverage and the computational cost. The cell size is set to $1 \times 1\,\text{m}^2$ as a balance to avoid the overlap of multiple agents in one cell based on the distribution of the experimental data, which is also supported by the preservation of personal space [42].
C. Encoder Network
The spatial-temporal context from both the observation time and prediction time is encoded by Encoder X and Y, respectively. Both encoders have the same two-stream structure: both streams consist of stacked self-attention layers; as illustrated in Fig. 2, one stream is followed by a global average pooling (GApool), while the other one is followed by an LSTM module. The upper stream is trained to learn motion information from the observed trajectory, whose input is the location vector of the observed trajectory of the target agent $X_i = \{\mathbf{x}_i^1, \cdots, \mathbf{x}_i^T\} \in \mathbb{R}^{T \times 2}$. The lower stream is trained to explore dynamic interactions among agents from the dynamic maps, denoted as $DM = \{O, S, P\} \in \mathbb{R}^{T \times H \times W \times 3}$ (discussed in Sec. III-B). For simplicity, we take the upper stream for illustration. To get a sparse high-dimensional representation, $X_i$ is first passed to a 1D convolution layer (Conv1D) and a fully connected (FC) layer. Each of them is followed by a ReLU non-linear activation. We denote this operation as $\pi(X_i)$. A self-attention layer takes as input the Query ($Q$), Key ($K$) and Value ($V$) and outputs a weighted sum of the value vectors. The weight assigned to each value is calculated as the dot-product of the query with the corresponding key:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right)V, \tag{2}$$
where $\sqrt{d_k}$ is the scaling factor, $d_k$ is the dimension of the vector $K$ and $\mathsf{T}$ is the transpose operation. This operation is also called scaled dot-product attention [24]. The $Q$, $K$ and $V$ are obtained by three separate linear transformations:
$$Q = \pi(X)W_Q, \quad K = \pi(X)W_K, \quad V = \pi(X)W_V, \tag{3}$$
where $W_Q, W_K, W_V \in \mathbb{R}^{d_\pi \times d_k}$ are the trainable parameters and $d_\pi$ is the dimension of $\pi(X)$.
Because the self-attention module takes all inputs at the same time, position encodings are added to the $Q$, $K$ and $V$ at the bottom of each self-attention layer to encode the temporal information. The sine and cosine functions of different frequencies (varying in time here) are the most widely used:
$$p^t = \{p_{t,d}\}_{d=1}^{D}, \qquad p_{t,d} = \begin{cases} \sin\left(\dfrac{t}{10000^{d/D}}\right), & \text{for } d \text{ even;} \\ \cos\left(\dfrac{t}{10000^{d/D}}\right), & \text{for } d \text{ odd,} \end{cases} \tag{4}$$
where $D = d_k$ ensures that the position encodings have the same dimension as the vectors of $Q$, $K$ and $V$.
To attend to different information from different representation subspaces jointly, the multi-head attention [24] strategy is applied as a conventional operation, where a head is an independent scaled dot-product attention module:
$$\text{MultiHead}(Q, K, V) = \text{ConCat}(\text{head}_1, \ldots, \text{head}_h)W_O, \qquad \text{head}_i = \text{Attention}(QW_{Q_i}, KW_{K_i}, VW_{V_i}), \tag{5}$$
where $W_{Q_i}, W_{K_i}, W_{V_i} \in \mathbb{R}^{D \times d_{k_i}}$ are linear transformation parameters as in Eq. (3) and $W_O$ are the linear transformation parameters for aggregating the extracted information from different heads. Note that $d_{k_i} = d_k / h$ and $d_{k_i}$ must be an aliquot part of $d_k$. $h$ is the total number of attention heads and we use two heads in the implementation.
Then the GApool is used to extract the temporal dependencies between steps by taking as input the output of the self-attention module and outputting an encoded representation. The lower stream that exploits the dynamic interactions among agents works in the same way but the spatial dependencies among agents are encoded by the hidden states of an LSTM. Finally, the outputs of these two streams are concatenated and passed to an FC layer for fusion as the encoded information that includes dynamic spatial-temporal context.
D. Multiple Trajectories Prediction
Our method is CVAE-based and predicts multiple trajectories by repeatedly sampling from a learned latent space conditioned on the encoded information. The CVAE is an extension of the VAE [43] by introducing a condition to control the output [2]. Given a set of samples $(\mathbf{X}, \mathbf{Y}) = ((X_1, Y_1), \cdots, (X_m, Y_m))$, it jointly learns a recognition model $q_\phi(z|Y, X)$ of a variational approximation of the true posterior $p_\theta(z|Y, X)$ and a generation model $p_\theta(Y|X, z)$ for predicting the output $Y$ conditioned on the input $X$. $z$ are the stochastic latent variables, $\phi$ and $\theta$ are the respective recognition and generative parameters. The goal is to maximize the conditional log-likelihood: $\log p_\theta(Y|X) = \log \sum_z p_\theta(Y, z|X) = \log \left( \sum_z q_\phi(z|X, Y) \frac{p_\theta(Y|X, z)\, p_\theta(z|X)}{q_\phi(z|X, Y)} \right)$. According to Jensen's inequality [44], the evidence lower bound
can be obtained:
$$\log p_\theta(Y|X) \ge -D_{KL}\left(q_\phi(z|X, Y)\,\|\,p_\theta(z)\right) + \mathbb{E}_{q_\phi(z|X, Y)}\left[\log p_\theta(Y|X, z)\right], \tag{6}$$
where the prior $p_\theta(z|X)$ is relaxed to $p_\theta(z)$ by making the latent variables statistically independent of the input [1, 2]. Here both the approximated posterior $q_\phi(z|X, Y)$ and the prior $p_\theta(z)$ are assumed to be Gaussian distributions for an analytical solution [43]. During training, the Kullback-Leibler divergence $D_{KL}(\cdot)$ acts as a regularizer and pushes the approximated posterior to the prior distribution $p_\theta(z)$. The generation error $\mathbb{E}_{q_\phi(z|X, Y)}(\cdot)$ measures the distance between the generated output and the ground truth. During inference, for a given observation $X_i$, one latent variable $z_i$ is drawn from the prior distribution $p_\theta(z)$, and one of the possible outputs $\hat{Y}_i$ is generated from the distribution $p_\theta(Y_i|X_i, z_i)$. The latent variables $z$ allow for the one-to-many mapping from the condition to the output via multiple sampling. In this work, we model a conditional distribution $p_\theta(Y_n|X)$, where $X$ is the observed trajectory information and $Y_n$ is one of its possible future trajectories.
Training: As shown in Fig. 2, during training, both the observed trajectory $X_i$ and its future trajectory $Y_i$ are encoded by Encoder X and Y (see Sec. III-C), respectively. Then, their encodings are concatenated and passed through two FC layers (each followed by a ReLU activation) for fusion. Then, two side-by-side FC layers are used to estimate the mean $\mu_{z_i}$ and the standard deviation $\sigma_{z_i}$ of the latent variables $z_i$. A trajectory $\hat{Y}_i$ is reconstructed by an LSTM decoder step by step by taking $z_i$ and the encodings of the observation as input. Because the random sampling process of $z_i$ cannot be back-propagated during training, the standard reparameterization trick [43] is adopted to make it differentiable. To minimize the error between the predicted trajectory $\hat{Y}_i$ and the ground truth $Y_i$, the reconstruction loss is defined as the L2 loss (Euclidean distance). Thus, the whole network is trained by minimizing the loss function using the stochastic gradient descent method:
$$L = \|\hat{Y} - Y\|_2 + D_{KL}\left(q_\phi(z|X, Y)\,\|\,\mathcal{N}(0, I)\right). \tag{7}$$
Test: In the test phase, the ground truth of the future trajectory is no longer available and its pathway is removed (color coded in green in Fig. 2). A latent variable $z$ is sampled from the prior distribution $\mathcal{N}(0, I)$ and concatenated with the observation encodings that serve as the condition for the following trained decoder, so that the decoder can predict a trajectory. To predict multiple trajectories, this process (sampling and decoding) is repeated multiple times.
E. Trajectory Ranking
We propose a ranking strategy to select the most-likely predicted trajectory out of the multiple predictions in order to fit the Trajnet challenge setting. We apply a bivariate Gaussian distribution to rank the predicted trajectories $(\hat{Y}_{i,1}, \cdots, \hat{Y}_{i,N})$ for each agent. At step $t'$, all the predicted positions for agent $i$ are stored in $|\hat{X}_i, \hat{Y}_i|^{t'}$. We follow [45] to fit the positions into the probability density function:
$$f(\hat{x}_i, \hat{y}_i)^{t'} = \frac{1}{2\pi \sigma_{\hat{X}_i} \sigma_{\hat{Y}_i} \sqrt{1 - \rho^2}} \exp\left(\frac{-Z}{2(1 - \rho^2)}\right), \qquad Z = \frac{(\hat{x}_i - \mu_{\hat{X}_i})^2}{\sigma_{\hat{X}_i}^2} + \frac{(\hat{y}_i - \mu_{\hat{Y}_i})^2}{\sigma_{\hat{Y}_i}^2} - \frac{2\rho(\hat{x}_i - \mu_{\hat{X}_i})(\hat{y}_i - \mu_{\hat{Y}_i})}{\sigma_{\hat{X}_i} \sigma_{\hat{Y}_i}}, \tag{8}$$
where $\mu$ denotes the mean and $\sigma$ the standard deviation, and $\rho$ is the correlation between $\hat{X}_i$ and $\hat{Y}_i$. A predicted trajectory is scored as the sum of the relative likelihood of all its steps: $S(\hat{Y}_{i,n}) = \sum_{t'=T+1}^{T'} f(\hat{x}_i, \hat{y}_i)^{t'}$. All predicted trajectories are ranked by this score and the one with the highest score is selected as the single-path prediction.
IV. EXPERIMENTS
To evaluate the performance of our proposed method, we compare DCENet with nine of the most influential and recent state-of-the-art models from the Trajnet [3] challenge leader-board for a fair comparison: (1) Linear (off): a simple temporal linear regressor; (2) Social Force [46]: the very high impact rule-based model that implements social force to avoid collisions; (3) S-LSTM [10]: the highly cited LSTM-based model that introduces a social pooling layer for modeling interactions; (4) S-GAN [11]: a GAN-based trajectory predictor; (5) MX-LSTM [47]: an LSTM trajectory predictor that utilizes the head direction of agents; (6) SR-LSTM [13]: an LSTM-based model that refines the hidden states by message passing; (7) RED [19]: an RNN encoder-decoder model that predicts trajectories using only observations; (8) Ind-TF [27]: a Transformer-based trajectory predictor; (9) AMENet [39]: the most recent state-of-the-art on the Trajnet leader-board. We further design a series of ablation studies to analyze the impact of each proposed module, i.e., dynamic maps, transformer and LSTM encoder/decoder: (1) Baseline: an LSTM encoder-decoder only using the observed trajectory as input; (2) DCENet w/o DMs: the stream of encoding dynamic maps is removed from our final model; (3) Trans. En&De: the LSTM encoder-decoder is substituted by the Transformer encoder/decoder [24] in our framework.
A. Datasets
Trajnet [3] is one of the most popular forecasting benchmarks. In Trajnet, 8 consecutive ground-truth locations (3.2 seconds) of each trajectory are for observation and the following 12 steps (4.8 seconds) are required to forecast. Trajnet is a superset of diverse popular benchmark datasets: ETH [48], UCY [49], Stanford Drone Dataset [50], BIWI Hotel [48], and MOT PETS [51]. There is a total of 11448 trajectories from these four subsets covering 38 scenes for training. The test data is from the diverse partitions of them (besides MOT PETS) of the other 20 scenes without ground truth. The Trajnet challenge provides a specific server for online evaluation. It is worth noting that many existing works are evaluated on a subset of Trajnet using their own train/test splits. For the sake of fairness, we only compare DCENet to the works which have shown their performance on the Trajnet challenge leader-board.
TABLE I: Results of different methods on the Trajnet challenge [3]. Models are categorized into deterministic (determ.) and stochastic (stoch.) depending on whether they incorporate a generative module.
Model              Category  Avg. [m]↓  FDE [m]↓  ADE [m]↓
S-LSTM [10]        determ.   1.3865     3.098     0.675
S-GAN [11]         stoch.    1.3340     2.107     0.561
MX-LSTM [47]       determ.   0.8865     1.374     0.399
Linear (off)       determ.   0.8185     1.266     0.371
Social Force [46]  determ.   0.8185     1.266     0.371
SR-LSTM [13]       determ.   0.8155     1.261     0.370
RED [19]           determ.   0.7800     1.201     0.359
Ind-TF [27]        determ.   0.7765     1.197     0.356
AMENet [39]        stoch.    0.7695     1.183     0.356
-----------------------------------------------------------
Baseline           stoch.    0.8045     1.239     0.370
DCENet w/o DMs     stoch.    0.7760     1.195     0.357
Trans. En&De       stoch.    0.7780     1.196     0.360
DCENet             stoch.    0.7660     1.179     0.353
TABLE II: Quantitative results of our model and the comparative models on the inD benchmark measured by ADE/FDE.
Model             S-LSTM     S-GAN      AMENet     DCENet
inD @top10
Intersection-(A)  2.04/4.61  2.84/4.91  0.95/1.94  0.72/1.50
Intersection-(B)  1.21/2.99  1.47/3.04  0.59/1.29  0.50/1.07
Intersection-(C)  1.66/3.89  2.05/4.04  0.74/1.64  0.66/1.40
Intersection-(D)  2.04/4.80  2.52/5.15  0.28/0.60  0.20/0.45
Avg.              1.74/4.07  2.22/4.29  0.64/1.37  0.52/1.23
inD Most-likely
Intersection-(A)  2.29/5.33  3.02/5.30  1.07/2.22  0.96/2.12
Intersection-(B)  1.28/3.19  1.55/3.23  0.65/1.46  0.64/1.41
Intersection-(C)  1.78/4.24  2.22/4.45  0.83/1.87  0.86/1.93
Intersection-(D)  2.17/5.11  2.71/5.64  0.37/0.80  0.28/0.62
Avg.              1.88/4.47  2.38/4.66  0.73/1.59  0.69/1.52
inD was acquired by Bock et al. [4] using drones at four busy intersections in Germany in 2019. The traffic is dominated by vehicles and they interact with pedestrians heavily. The speed difference and confrontation make the trajectory prediction challenging. The data was processed to obtain the same format as Trajnet: 8 steps for observation and the following 12 steps for prediction.
B. Evaluation Metrics
We adopt the most popular evaluation metrics: the mean average displacement error (ADE) and the final displacement error (FDE) to measure the trajectory prediction performance. ADE measures the aligned Euclidean distance from the prediction to its corresponding ground truth trajectory averaged over all steps. The mean value across all the trajectories is reported. FDE measures the Euclidean distance between the last position from the prediction and the corresponding ground truth position. In addition, the most-likely prediction is decided by the ranking method as described in Sec. III-E. Compared with the ground truth (only if it is available), @top10 is the one out of ten predicted trajectories that has the smallest ADE and FDE.
The implementation details of training and testing our methods can be found in our code repository.
C. Results
The experimental results from different methods including our ablative models reported on the Trajnet leader-board are listed in Table I. Without ground truth trajectories, the single-path trajectory prediction was selected by the ranking mechanism. We can see that DCENet reported new state-of-the-art performance and the ablative models also had comparable performances compared to the previous works.
First, compared to the Baseline, both DCENet w/o DMs and Ind-TF had much better results, and DCENet w/o DMs was slightly better than Ind-TF in the average score and FDE but slightly worse in ADE. Considering both models only use observed trajectories as input, it indicates that our method (self-attention + LSTM encoder/decoder) explored a better spatial-temporal context than Transformer. Furthermore, Ind-TF utilizes BERT, a heavily stacked Transformer structure that must be pre-trained on an external large-scale dataset, while DCENet does not require it. The results of DCENet w/o DMs indicate that its superior performance is not because we used more information (dynamic maps).
Second, by the comparison between the Baseline and S-LSTM, our Baseline model was significantly better. The difference between them is that our Baseline is CVAE-based and generates multiple trajectories. It indicates that the future motion of humans is of high uncertainty, and predicting a set of possible trajectories is better than only predicting a single one. It also demonstrates the effectiveness of the trajectory ranking method (Sec. III-E), which was used to select the most-likely trajectory from the multiple predictions. Our Baseline also outperformed S-GAN significantly, which is a generative model for multiple trajectories prediction.
Third, interestingly, Trans. En&De that adopts the Transformer encoder and decoder in our framework did not achieve improved performance compared to DCENet. This phenomenon indicates that our self-attention + LSTM encoder/decoder structure explored better dynamic context between agents than the Transformer encoder/decoder in terms of trajectory prediction. The superior performance of DCENet w/o DMs against Ind-TF has also confirmed that.
Lastly, DCENet outperformed DCENet w/o DMs. It indicates that the dynamic maps helped model the interactions between agents and were useful for trajectory prediction.
Discussion. According to the comparison above, the results indicate: (1) DCENet is effective for predicting accurate trajectories for heterogeneous agents in various real-world traffic scenes, even without modeling interactions explicitly (the Baseline model). (2) The ranking method correctly estimates the multiple predictions and recommends a reliable candidate for the single-path trajectory prediction task. (3) Compared to the Baseline model, DCENet learns interaction via the dynamic maps with the self-attention structure effectively and shows improved performance. (4) Both LSTM and Transformer networks are capable of learning complex sequential patterns but their combination further enhances the performance in terms of trajectory prediction.
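The metrics of Sec. IV-B that underlie Tables I and II can be illustrated with a minimal sketch. It is not the evaluation code of the Trajnet server or of our repository; the function names are hypothetical, trajectories are assumed to be lists of (x, y) tuples in meters, and selecting the @top10 candidate by smallest ADE with FDE as a tie-breaker is an assumption, since the text only states that the candidate has the smallest ADE and FDE.

```python
import math

def ade(pred, truth):
    """Average displacement error: mean Euclidean distance over all steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, truth)) / len(truth)

def fde(pred, truth):
    """Final displacement error: Euclidean distance at the last step."""
    return math.dist(pred[-1], truth[-1])

def best_of_k(preds, truth):
    """Pick the candidate with the smallest ADE (assumption: ties broken by FDE)."""
    return min(preds, key=lambda p: (ade(p, truth), fde(p, truth)))

# Toy example with three future steps and two candidate predictions.
truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
preds = [
    [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],  # constant 1 m lateral offset
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.5)],  # deviates only at the last step
]
best = best_of_k(preds, truth)
print(ade(best, truth), fde(best, truth))
```

In the benchmark setting, the per-trajectory values would then be averaged over all test trajectories.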
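The ranking strategy of Sec. III-E, which produces the single-path predictions discussed above, can be sketched as follows. This is an illustrative sketch only: the text does not specify how the Gaussian parameters are fitted, so estimating them per step from the N predicted positions with population statistics is an assumption, as are the helper names, the small `eps` added to the standard deviations and the clamping of the correlation for numerical stability.

```python
import math

def gaussian_pdf(x, y, mx, my, sx, sy, rho):
    """Bivariate Gaussian density in the form of Eq. (8)."""
    zx, zy = (x - mx) / sx, (y - my) / sy
    z = zx * zx + zy * zy - 2.0 * rho * zx * zy
    norm = 2.0 * math.pi * sx * sy * math.sqrt(1.0 - rho * rho)
    return math.exp(-z / (2.0 * (1.0 - rho * rho))) / norm

def fit_step(points, eps=1e-6):
    """Mean, standard deviation and correlation of the positions at one step."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sx = math.sqrt(sum((p[0] - mx) ** 2 for p in points) / n) + eps
    sy = math.sqrt(sum((p[1] - my) ** 2 for p in points) / n) + eps
    cov = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    rho = max(-0.99, min(0.99, cov / (sx * sy)))  # clamp: stability assumption
    return mx, my, sx, sy, rho

def rank(trajectories):
    """Score each trajectory by the summed per-step density; best index first."""
    scores = [0.0] * len(trajectories)
    for t in range(len(trajectories[0])):
        params = fit_step([traj[t] for traj in trajectories])
        for n, traj in enumerate(trajectories):
            scores[n] += gaussian_pdf(traj[t][0], traj[t][1], *params)
    order = sorted(range(len(trajectories)), key=lambda n: -scores[n])
    return order, scores

# Five two-step candidates: one central path and four symmetric deviations.
candidates = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(1.0, 0.0), (2.0, 0.0)],
    [(-1.0, 0.0), (0.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0)],
    [(0.0, -1.0), (1.0, -1.0)],
]
order, scores = rank(candidates)
print(order[0])  # index of the candidate recommended as the single path
```

The candidate closest to the bulk of the predictions at every step receives the highest score, which matches the intent of recommending one reliable trajectory out of many.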
(a) Trajnet bookstore-3 (b) Trajnet coupa-3 (c) Trajnet deathCircle-0 (d) Trajnet hyang-6
(e) inD Intersection-(A) (f) inD Intersection-(B) (g) inD Intersection-(C) (h) inD Intersection-(D)
Fig. 3: Multi-path trajectory predictions in shared spaces in Trajnet (1st row) and at different intersections in inD (2nd row).
Furthermore, we have tested DCENet on inD [4] to justify its performance and generalization ability. We compare our model with the three most relevant models: S-LSTM for comparing with its occupancy grid mapping for agent-to-agent interaction, S-GAN for its generative module, and AMENet for its CVAE module and LSTM sequential modeling. To guarantee a fair comparison, all the models were trained and tested using the same data. S-LSTM predicts the distributions of the positions [10]. During inference, multiple positions were generated by sampling. Table II lists the performance measured by ADE/FDE. Our model achieved the best performance for the @top10 prediction across all the intersections and reduced the errors by a large margin. Our model also outperformed the other models for the most-likely prediction at three out of four intersections. It only slightly fell behind the AMENet model at Intersection-(C). As anticipated, the most-likely prediction fell behind the @top10 prediction. However, the ranking method was still effective in recommending a reliable candidate in comparison to the other models. The results indicate: (1) Our model is able to generalize on different datasets and maintain superior performance. (2) Predicting multiple paths is more beneficial than predicting a single one for an agent. On the one hand, multiple predictions increase the chances to narrow down the errors. On the other hand, a single prediction may lead to a wrong conclusion, especially if the initial predicted steps deviate from the ground truth, because the errors then accumulate significantly with time. The multiple predictions form an area indicating the potential intent of an agent and the area size reflects the uncertainty of an agent's intent.
The qualitative results are shown in Fig. 3. The first row showcases the scenarios in the Trajnet dataset. Note that the qualitative analysis on Trajnet was carried out on the validation set (an independent subset of the training set) for comparing with the ground truth. Our model accurately predicted two pedestrians walking towards each other at bookstore-3. The shaded areas indicate multiple possible trajectories. It also correctly predicted the static pedestrians in coupa-3, as well as the pedestrians walking in parallel. In deathCircle-0, our model predicted different possible turning angles for the cyclist in the roundabout. In hyang-6, two pedestrians walking closely to each other were predicted correctly. The second row showcases the scenarios in the inD dataset. Our model predicted a fast driving vehicle with a slightly different predicted speed at Intersection-(A). It predicted that a left-turning vehicle may turn at Intersection-(B) with varying turning angle and speed. The model also correctly predicted the interaction at the zebra crossing at Intersection-(C), where the vehicle stops to yield the way to the pedestrian. Similar predictions can be seen for the walking and static pedestrians, as well as the vehicle waiting at the entrance of Intersection-(D). Overall, we can also see that the recommended single path is very close to the corresponding ground truth for each agent.
V. CONCLUSION
In this paper, we proposed a novel framework DCENet for multi-path trajectory prediction for heterogeneous agents in various real-world traffic scenarios. We decompose the learning of dynamic spatial-temporal context into exploiting the dynamic spatial context between agents using self-attention and the LSTM encoder and learning temporal context between steps with the following self-attention and global average pooling. The spatial-temporal context is encoded into a latent space using a CVAE module. Finally, a set of future trajectories for each agent is predicted conditioned on the spatial-temporal context using the trained CVAE module. DCENet was evaluated on the Trajnet challenge benchmark and achieved the new state-of-the-art performance on the leader-board. Its superior performance on the inD benchmark further validated its efficacy and generalization ability. The ablation studies justified the impact of each module in DCENet. In the future, we are interested in extending the method for learning the impact from environment/static context, e.g., space layout and scene deployment, to further enhance the performance of trajectory prediction.
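Similarly, the core of the upper encoder stream in Sec. III-C, i.e., the position encodings of Eq. (4), the scaled dot-product attention of Eq. (2) and the global average pooling, can be sketched in plain Python. The sketch is deliberately reduced: it uses a single head, omits the learned projections of Eq. (3) and Eq. (5), and feeds a made-up feature sequence in place of π(X). Pooling over the time axis is an assumption, and all function names are hypothetical.

```python
import math

def position_encoding(steps, dim):
    """Eq. (4): sine for even d and cosine for odd d, with d = 1..dim."""
    table = []
    for t in range(1, steps + 1):
        row = []
        for d in range(1, dim + 1):
            angle = t / (10000.0 ** (d / dim))
            row.append(math.sin(angle) if d % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Eq. (2): softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

# Stand-in for the embedded observation: 8 steps, 4 feature dimensions.
features = [[0.1 * t, 0.0, 1.0, 0.5] for t in range(8)]
pe = position_encoding(8, 4)
x = [[f + p for f, p in zip(frow, prow)] for frow, prow in zip(features, pe)]
context = attention(x, x, x)                               # self-attention
pooled = [sum(col) / len(context) for col in zip(*context)]  # GApool over time
print(len(context), len(pooled))
```

In the lower stream, the pooling step would be replaced by an LSTM whose hidden state serves as the encoded representation.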
R EFERENCES [27] F. Giuliari, I. Hasan, M. Cristani, and F. Galasso, “Transformer
networks for trajectory forecasting,” in ICPR, 2020.
[1] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi- [28] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chan-
supervised learning with deep generative models,” in NeuIPS, 2014, draker, “Desire: Distant future prediction dynamic scenes with inter-
pp. 3581–3589. acting agents,” in CVPR, 2017, pp. 336–345.
[2] K. Sohn, H. Lee, and X. Yan, “Learning structured output representa- [29] J. Amirian, J.-B. Hayet, and J. Pettré, “Social ways: Learning multi-
tion using deep conditional generative models,” in NeuIPS, 2015, pp. modal distributions of pedestrian trajectories with gans,” in 2019
3483–3491. IEEE/CVF Conference on Computer Vision and Pattern Recognition
[3] A. Sadeghian, V. Kosaraju, A. Gupta, S. Savarese, and A. Alahi, Workshops (CVPRW). IEEE, 2019, pp. 2964–2972.
“Trajnet: Towards a benchmark for human trajectory prediction,” arXiv [30] O. Makansi, E. Ilg, O. Cicek, and T. Brox, “Overcoming limitations
preprint, 2018. of mixture density networks: A sampling and fitting framework for
[4] J. Bock, R. Krajewski, T. Moers, S. Runde, L. Vater, and L. Eckstein, multimodal future prediction,” in CVPR, 2019, pp. 7144–7153.
“The ind dataset: A drone dataset of naturalistic road user trajectories [31] A. Poibrenski, M. Klusch, I. Vozniak, and C. Müller, “M2p3: mul-
at german intersections,” arXiv preprint arXiv:1911.07602, 2019. timodal multi-pedestrian path prediction by self-driving cars with
[5] A. C. Harvey, Forecasting, structural time series models and the egocentric vision,” in Annual ACM Symposium on Applied Computing,
Kalman filter. Cambridge university press, 1990. 2020, pp. 190–197.
[6] M. K. C. Tay and C. Laugier, “Modelling smooth paths using gaussian [32] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K.
processes,” in Field and Service Robotics, 2008, pp. 381–390. Huang, J. Schneider, and N. Djuric, “Multimodal trajectory predictions
[7] D. Makris and T. Ellis, “Spatial and probabilistic modelling of for autonomous driving using deep convolutional networks,” in ICRA,
pedestrian behaviour.” in BMVC, 2002, pp. 1–10. 2019, pp. 2090–2096.
[8] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, “Activity [33] K. D. Katyal, G. D. Hager, and C.-M. Huang, “Intent-aware pedestrian
forecasting,” in ECCV, 2012, pp. 201–214. prediction for adaptive crowd navigation,” in ICRA, 2020, pp. 3277–
[9] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 3283.
521, no. 7553, p. 436, 2015. [34] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
[10] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”
S. Savarese, “Social lstm: Human trajectory prediction crowded in NeuIPS, 2014, pp. 2672–2680.
spaces,” in CVPR, 2016, pp. 961–971. [35] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, “Tra-
[11] A. Gupta, L. Johnson, Justand Fei-Fei, S. Savarese, and A. Alahi, jectron++: Dynamically-feasible trajectory forecasting with heteroge-
“Social gan: Socially acceptable trajectories with generative adversar- neous data,” in ECCV, vol. 12363. Springer, 2020, pp. 683–700.
ial networks,” in CVPR, 2018, pp. 2255–2264. [36] H. Cheng, W. Liao, M. Y. Yang, M. Sester, and B. Rosenhahn,
[12] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, and S. Savarese, “Mcenet: Multi-context encoder network for homogeneous agent tra-
“Sophie: An attentive gan for predicting paths compliant to social and jectory prediction mixed traffic,” in ITSC, 2020.
physical constraints,” in CVPR, 2019, pp. 1349–1358. [37] Y. Yuan and K. M. Kitani, “Diverse trajectory forecasting with
[13] P. Zhang, W. Ouyang, P. Zhang, J. Xue, and N. Zheng, “Sr-lstm: State determinantal point processes,” in ICLR, 2020.
refinement for lstm towards pedestrian trajectory prediction,” in CVPR, [38] A. Guzman-Rivera, D. Batra, and P. Kohli, “Multiple choice learning:
2019, pp. 12 085–12 094. Learning to produce multiple structured outputs,” in NeuIPS, 2012,
[14] N. Mohajerin and M. Rohani, “Multi-step prediction of occupancy grid pp. 1799–1807.
maps with recurrent neural networks,” in CVPR, 2019, pp. 10 600– [39] H. Cheng, W. Liao, M. Y. Yang, B. Rosenhahn, and M. Sester,
10 608. “Amenet: Attentive maps encoder network for trajectory prediction,”
[15] C. Tang and R. R. Salakhutdinov, “Multiple futures prediction,” in ISPRS Journal of Photogrammetry and Remote Sensing, vol. 172, pp.
NeuIPS, 2019, pp. 15 398–15 408. 253–266, 2021.
[16] R. Chandra, U. Bhattacharya, A. Bera, and D. Manocha, “Traphic: [40] K. Simonyan and A. Zisserman, “Two-stream convolutional networks
Trajectory prediction dense and heterogeneous traffic using weighted for action recognition in videos,” in NeuIPSF, 2014, pp. 568–576.
interactions,” in CVPR, 2019, pp. 8483–8492. [41] S. Casas, W. Luo, and R. Urtasun, “Intentnet: Learning to predict
[17] P. Kothari, S. Kreiss, and A. Alahi, “Human trajectory fore- intention from raw sensor data,” in Conference on Robot Learning.
casting crowds: A deep learning perspective,” arXiv preprint PMLR, 2018, pp. 947–956.
arXiv:2007.03639, 2020. [42] C. L. Gérin-Lajoie, Martand Richards and B. J. McFadyen, “The
[18] R. Chandra, T. Guan, S. Panuganti, T. Mittal, U. Bhattacharya, negotiation of stationary and moving obstructions during walking: an-
A. Bera, and D. Manocha, “Forecasting trajectory and behavior of ticipatory locomotor adaptations and preservation of personal space,”
road-agents using spectral clustering in graph-lstms,” IEEE Robotics Motor control, vol. 9, no. 3, pp. 242–269, 2005.
and Automation Letters, vol. 5, no. 3, pp. 4882–4890, 2020. [43] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in
ICLR, 2014.
[19] S. Becker, R. Hug, W. Hübner, and M. Arens, “An evaluation of
[44] J. L. W. V. Jensen et al., “Sur les fonctions convexes et les inégalités
trajectory prediction approaches and notes on the trajnet benchmark,”
entre les valeurs moyennes,” Acta mathematica, vol. 30, pp. 175–193,
arXiv preprint arXiv:1805.07663, 2018.
1906.
[20] S. Becker, R. Hug, W. Hubner, and M. Arens, “Red: A simple but
[45] A. Graves, “Generating sequences with recurrent neural networks,”
effective baseline predictor for the trajnet benchmark,” in ECCV, 2018,
arXiv preprint arXiv:1308.0850, 2013.
pp. 138–153.
[46] D. Helbing and P. Molnar, “Social force model for pedestrian dynam-
[21] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
ics,” Physical review E, vol. 51, no. 5, p. 4282, 1995.
jointly learning to align and translate,” in ICLR, 2015.
[47] I. Hasan, F. Setti, T. Tsesmelis, A. Del Bue, F. Galasso, and
[22] A. Al-Molegi, M. Jabreel, and A. Martinez-Balleste, “Move, attend
M. Cristani, “Mx-lstm: mixing tracklets and vislets to jointly forecast
and predict: An attention-based neural model for people’s movement
trajectories and head poses,” in CVPR, 2018, pp. 6067–6076.
prediction,” Pattern Recognition Letters, vol. 112, pp. 34–40, 2018.
[48] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool, “You’ll never
[23] A. Vemula, K. Muelling, and J. Oh, “Social attention: Modeling walk alone: Modeling social behavior for multi-target tracking,” in
attention human crowds,” in ICRA, 2018, pp. 1–7. ICCV, 2009, pp. 261–268.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. [49] A. Lerner, Y. Chrysanthou, and D. Lischinski, “Crowds by example,”
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in in Computer Graphics Forum, vol. 26, no. 3. Wiley Online Library,
NeuIPS, 2017, pp. 5998–6008. 2007, pp. 655–664.
[25] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training [50] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, “Learning
of deep bidirectional transformers for language understanding,” in social etiquette: Human trajectory understanding crowded scenes,” in
Conference of the North American Chapter of the Association for ECCV, 2016, pp. 549–565.
Computational Linguistics: Human Language Technologies, 2019, pp. [51] J. Ferryman and A. Shahrokni, “Pets2009: Dataset and challenge,”
4171–4186. in International workshop on performance evaluation of tracking and
[26] S. He, W. Liao, H. R. Tavakoli, M. Yang, B. Rosenhahn, and surveillance, 2009, pp. 1–6.
N. Pugeault, “Image captioning through image transformer,” in ACCV,
2020.