In this paper, we pick up the general idea of using contrastive learning to enhance the model's ability on long-tailed data. A new framework is developed called FEND: Future ENhanced Distribution-aware contrastive trajectory prediction, which is a pattern-based contrastive feature learning framework enhanced by future trajectory information. An offline trajectory clustering process and prototypical contrastive learning are introduced for recognizing and separating different trajectory patterns to boost the modeling of tail samples. To deal with the aforementioned problem, the features of trajectories within the same pattern cluster are pulled together, while the features from different pattern clusters are pushed apart. Moreover, a more flexible network structure of the decoder is introduced to exploit the shaped feature embedding space with different pattern clusters. Our contributions can be summarized as follows:
• We propose a future enhanced contrastive feature learning framework for long-tailed trajectory prediction, which can better distinguish tail patterns from head patterns; the different patterns are represented by different cluster prototypes to enhance the modeling of the tailed data.
• We propose a distribution-aware hyper predictor, aiming at providing separated decoder parameters for trajectory inputs with different patterns.
• Experimental results show that our proposed framework can outperform state-of-the-art methods.

Code is available at https://github.com/ynw2021/FEND.

2. Related Work

2.1. Trajectory Prediction

Deep learning has become a mainstream trajectory prediction approach because of its powerful representational ability. Some studies [1, 32, 43, 46, 48] focus on better modeling subtle relationships such as social interactions to make their predictions more precise, and some works [29, 33, 35, 36, 50] aim to produce more diverse trajectory proposals. Strong baselines [30, 38, 40, 47] have also been put forward. Although trajectory prediction methods have become increasingly accurate, the long-tail issue in the task of trajectory prediction has rarely been discussed.

Trajectory prediction approaches based on clustering. Existing methods [5, 42, 44] have used trajectory clustering for trajectory prediction. MultiPath [5] performs Kmeans with the squared distances between trajectories to get anchor trajectory sets. PCCS-Net [42] decouples multimodal trajectory prediction into three steps: feature clustering, cluster selecting, and synthesizing. Memo-Net [44] clusters trajectories in the original coordinates and uses an attention network for better cluster selecting. All existing methods that use trajectory clustering aim at selecting future modalities for trajectory decoders and producing more diverse trajectories, which is different from our goal to distinguish tail patterns from head patterns and optimize the feature embedding space.

Trajectory prediction approaches based on contrastive learning. Contrastive learning [34] is a self-supervised method to improve the representation ability of the network given the similarities between sample pairs, and it has many variants [4, 8, 19, 20] with different ways of selecting positive and negative samples and calculating the contrastive loss. Prototypical Contrastive Learning (PCL) [23] is a variant of contrastive learning that can preserve local smoothness and therefore induce a semantically hierarchical clustered feature space [23]. Contrastive learning has also been incorporated into trajectory prediction. DisDis [7] uses contrastive learning in a CVAE framework to discriminate the latent variable distributions and make the predictions more diverse. ABC+ [12] uses action labels from its datasets and contrasts according to them. Social-NCE [26] uses contrastive learning to push the predictions away from simulated collision cases. None of the above-mentioned methods discusses long-tail prediction. The most relevant work is from Makansi et al. [28], which also tries to solve the long-tail prediction problem with contrastive learning and uses Kalman prediction errors to select positive and negative samples. Makansi et al. [28] push all the tailed samples together in their method. In this work, we not only separate the tails from the heads as the study [28] did, but also recognize the patterns of the tailed samples, since tailed samples can be tailed in different ways, e.g. turning or accelerating, as shown in Fig. 1 and Fig. 3, which further improves the model capabilities.

2.2. Long-tailed Learning

Long-tailed learning aims to improve the performance on tailed samples when faced with unbalanced data. Most existing methods focus on classification tasks. Typical methods apply data resampling [6, 13, 41] or loss reweighting [9, 15, 25] to improve the capability of the network on tailed samples. Recent advances [3, 31] seek a theoretical balance of head-tail performance by adjusting the classification boundaries, whereas these methods cannot be directly used in regression tasks. Very recently, Yang et al. [45] investigated imbalanced regression tasks and proposed a feature distribution smoothing and label distribution smoothing method. But the methodology in [45] needs labels with structured relationships, which is incongruent with trajectory data. In our method, we find structured relationships between trajectories by forming pattern clusters and optimize the feature space accordingly. Besides, we use a hypernetwork [11] as the trajectory decoder to deal with tail samples, utilizing its distribution-aware modeling ability, which, to the best of our knowledge, has not been discussed in long-tail regression.
Figure 2. Illustration of our overall future enhanced distribution-aware contrastive learning framework. Top: Offline Kmeans clustering for
pseudo cluster labels. Bottom: The baseline prediction network with FEND plugged in for prediction. The FEND module contains a PCL
optimization procedure and a hyper decoder.
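Since the clustering step itself (Sec. 3.1.1) is not reproduced in this excerpt, the following is only a rough sketch of what the top row of Figure 2 describes: offline multi-level Kmeans over full (observed plus future) trajectories, using the hierarchy sizes {20, 50, 100} reported in Sec. 4.4. The flattened-coordinate representation and all names are assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def offline_pseudo_labels(past, future, n_clusters_list=(20, 50, 100)):
    """Offline multi-level Kmeans over full trajectories, one label set per hierarchy.

    past:   (N, T_obs, 2) observed coordinates
    future: (N, T_pred, 2) ground-truth future coordinates (available at training time only)
    """
    # Concatenating past and future makes the clusters reflect the full motion pattern,
    # so tail behaviours such as turns or accelerations fall into their own clusters.
    full = np.concatenate([past, future], axis=1).reshape(len(past), -1)
    labels = []
    for k in n_clusters_list:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(full)
        labels.append(km.labels_)   # pseudo cluster label of every trajectory
    return labels                   # later consumed by the PCL loss of Sec. 3.1.2
```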
3.1.2 Prototypical Contrastive Learning

In our method, we have already obtained the cluster labels after the trajectory clustering step. Therefore we use the cluster assignments as pseudo labels for computing prototypes and densities. The original PCL [23] is a self-supervised method with EM steps, and therefore it needs to perform clustering before every training epoch. Our method uses the pseudo labels to reduce the clustering steps and therefore requires less computation than the original PCL. Given pseudo cluster labels, PCL can pull the features of instances belonging to the same cluster together and push the features of instances in different clusters apart, as vanilla contrastive learning does with positive and negative samples.

Implementing the PCL loss. We apply PCL to the features at the bottleneck of the encoder-decoder trajectory prediction network, i.e., after the encoder. Similar to Makansi et al. [28], we add a fully-connected (FC) layer after the encoder and add the PCL loss to its output features. The features before the FC layer are given to the trajectory decoder. We perform a multi-level clustering with M hierarchies when calculating the PCL loss. The PCL loss is as follows:

L_{ProtoNCE} = L_{ins} + L_{proto} ,        (1)

where the first term is an instance-wise contrastive term and the second term is an instance-prototype contrastive term.

Instance-wise term. The first term in Eq. (1) is an instance-wise contrastive term considering the pseudo cluster labels, which can be written as follows:

L_{ins} = - \sum_{i=1}^{r} \frac{1}{N_{poi}} \sum_{i^{+}=1}^{N_{poi}} \log \frac{\exp(v_i \cdot v_{i^{+}} / \tau)}{\sum_{j=1}^{r} \exp(v_i \cdot v_j / \tau)} .        (2)

The instance-wise term helps the instances gather together faster and the algorithm converge faster. v_i and v_{i^{+}} are the feature embeddings of trajectory instance i and positive sample i^{+} after the encoder, respectively, with i^{+} \neq i. N_{poi} is the number of positive samples of i in a batch, and \tau is the contrastive temperature of the instance-wise contrastive term. In Eq. (2), the positive samples i^{+} are the instances from the same cluster as instance i, and the remaining instances in the batch, i.e. those belonging to other clusters, are regarded as negative samples. j denotes an arbitrary sample in the current batch, and r denotes the batch size.

Instance-prototype term. The second term in Eq. (1) is an instance-prototype contrastive term, which can be written as follows:

L_{proto} = - \frac{1}{M} \sum_{i=1}^{r} \sum_{m=1}^{M} \log \frac{\exp(v_i \cdot c_s^m / \phi_s^m)}{\sum_{j=1}^{N_m} \exp(v_i \cdot c_j^m / \phi_j^m)} .        (3)

The prototypes help preserve local smoothness and the formation of clusters with different patterns. In Eq. (3), M is the number of Kmeans clustering hierarchies, c_s^m is the prototype of the cluster to which i belongs, and c_j^m is the prototype of an arbitrary cluster j. A prototype is calculated by averaging all the features in a cluster. N_m denotes the number of clusters in hierarchy m. \phi_j^m denotes the density of a cluster j, which is calculated as below:

\phi = \frac{\sum_{z=1}^{Z} \| v'_z - c \|_2}{Z \log(Z + \alpha)} ,        (4)

where Z is the number of instances in the cluster, and \alpha is a smoothing factor that ensures small clusters do not have an overly large \phi. We set \alpha = 10, the same as [23]. v'_z is the momentum-updated feature of instance z, used to ensure stability.
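For concreteness, the sketch below condenses Eqs. (1)-(4) into PyTorch-style code. It is an illustration only, under simplifying assumptions (prototypes and densities are computed from the current batch rather than over the whole dataset, features are l2-normalized, and positives are taken from the finest hierarchy); the function and variable names are ours, not from the released code.

```python
import math
import torch
import torch.nn.functional as F

def protonce_loss(v, labels_per_level, v_momentum, tau=0.1, alpha=10.0):
    """Sketch of Eqs. (1)-(4).

    v:                (r, d) bottleneck features of the current batch (after the FC head).
    labels_per_level: list of (r,) integer pseudo cluster labels, one per Kmeans hierarchy.
    v_momentum:       (r, d) momentum-updated features v' used for the density of Eq. (4).
    """
    v = F.normalize(v, dim=1)
    r = v.size(0)

    # Instance-wise term, Eq. (2): positives share the finest-level pseudo label.
    lab0 = labels_per_level[0]
    sim = v @ v.t() / tau
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (lab0[:, None] == lab0[None, :]) & ~torch.eye(r, dtype=torch.bool, device=v.device)
    l_ins = -((log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)).sum()

    # Instance-prototype term, Eq. (3), with per-cluster densities from Eq. (4).
    l_proto = 0.0
    for lab in labels_per_level:
        clusters = lab.unique()
        protos, phis = [], []
        for c in clusters:
            idx = lab == c
            proto = F.normalize(v[idx].mean(0), dim=0)
            z = int(idx.sum())
            phi = (v_momentum[idx] - proto).norm(dim=1).sum() / (z * math.log(z + alpha))
            protos.append(proto)
            phis.append(phi)
        protos = torch.stack(protos)               # (N_m, d) cluster prototypes
        phis = torch.stack(phis).clamp(min=1e-3)   # (N_m,) cluster densities
        logits = (v @ protos.t()) / phis[None, :]
        target = torch.searchsorted(clusters, lab)  # index of each sample's own cluster
        l_proto = l_proto + F.cross_entropy(logits, target, reduction="sum")
    l_proto = l_proto / len(labels_per_level)

    return l_ins + l_proto                          # Eq. (1)
```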
3.2. Distribution-Aware Hyper Predictor

Distribution-aware hypernetwork. Intuitively, the head clusters and the tail clusters should be assigned different decoders to reduce their influence on each other. However, there is an insufficient amount of data for the tail samples, and separately training decoders for them would cause severe overfitting. Therefore, we want to transfer common knowledge across the whole dataset while keeping the modeling flexibility of separate decoders. HyperNetworks [11] is an approach that uses a small network, known as a hypernetwork, to generate the weights of the main network, and it naturally suits our demands. The hypernetwork contains the knowledge of all samples, which prevents overfitting. Also, there are separate decoder parameters for head and tail clusters, which make the decoder aware of the distribution of the clustered feature space, so the hyper decoder can predict the tailed clusters differently.

LSTM trajectory decoder. As an example of a hyper predictor, we employ an LSTM as the trajectory decoder, which is commonly used in recent studies [28, 38, 49]. The original formulation of an LSTM is as follows:

i_t = W_{hi} h_{t-1} + W_{xi} x_t + b_i ,
g_t = W_{hg} h_{t-1} + W_{xg} x_t + b_g ,
f_t = W_{hf} h_{t-1} + W_{xf} x_t + b_f ,
o_t = W_{ho} h_{t-1} + W_{xo} x_t + b_o ,        (5)
m_t = \sigma(f_t) \odot m_{t-1} + \sigma(i_t) \odot \psi(g_t) ,
h_t = \sigma(o_t) \odot \psi(m_t) ,

where i, g, f, o are the input gate, update gate, forget gate, and output gate, respectively. W_h \in \mathbb{R}^{N_h \times N_h}, W_x \in \mathbb{R}^{N_h \times N_x}, and b \in \mathbb{R}^{N_h}, where N_h and N_x are the dimensions of the hidden and input states. h_t and m_t are the hidden state and the cell state. \sigma is the sigmoid operator, and \psi is the tanh operator.
The initial x and h are produced by the feature embedding v of the observed trajectory:

x_1 = W_{xv} v + b_x^v ,
h_0 = W_{hv} v + b_h^v ,        (6)

where W_{hv} \in \mathbb{R}^{N_h \times N_v}, W_{xv} \in \mathbb{R}^{N_x \times N_v}, b_h^v \in \mathbb{R}^{N_h}, and b_x^v \in \mathbb{R}^{N_x}.

HyperLSTM. In our implementation, the formulation of an LSTM with a small hypernetwork is as follows:

y_t = LN( d_h^y \odot W_{hy} h_{t-1} + d_x^y \odot W_{xy} x_t + b^y(z_b^y) ) ,        (7)

where

d_h^y(z_h) = W_{hz}^y z_h ,
d_x^y(z_x) = W_{xz}^y z_x ,        (8)
b^y(z_b^y) = W_{bz}^y z_b^y + b_0^y .

In Eq. (7), y stands for one of the four gates {i, g, f, o} of the original LSTM formulation Eq. (5), for brevity. \odot denotes the element-wise product operation and LN(\cdot) denotes layer normalization. The d's and b^y are the weight and bias adjusting vectors produced by the hypernetwork to change the weights and biases of the original LSTM. They are generated from the hypernetwork outputs z as in Eq. (8), where the W's and b_0^y are the weights and biases of linear fully-connected layers. For instance i with input feature v_i, z can be written as:

z_i = f_H(v_i) ,        (9)

where f_H is the hypernetwork mapping function, which should be a shallow network to reduce computation and prevent overfitting.
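To make this concrete, below is a minimal PyTorch sketch of a hyper-modulated LSTM cell in the spirit of Eqs. (6)-(9). The four gates are stacked into one linear map, LayerNorm is applied jointly for brevity, and all layer sizes except the hidden size 128 of the hypernetwork MLP (Sec. 4.4) are placeholder assumptions; this is not the released implementation.

```python
import torch
import torch.nn as nn

class HyperLSTMCell(nn.Module):
    """One decoding step of an LSTM whose gate pre-activations are modulated by a small hypernetwork."""

    def __init__(self, n_x, n_h, n_v, n_z=32):
        super().__init__()
        self.w_h = nn.Linear(n_h, 4 * n_h, bias=False)   # W_hy for y in {i, g, f, o}
        self.w_x = nn.Linear(n_x, 4 * n_h, bias=False)   # W_xy
        # f_H: a shallow MLP with hidden size 128 producing z = (z_h, z_x, z_b), Eq. (9)
        self.hyper = nn.Sequential(nn.Linear(n_v, 128), nn.ReLU(), nn.Linear(128, 3 * n_z))
        self.d_h = nn.Linear(n_z, 4 * n_h, bias=False)   # W_hz^y, Eq. (8)
        self.d_x = nn.Linear(n_z, 4 * n_h, bias=False)   # W_xz^y
        self.b = nn.Linear(n_z, 4 * n_h)                  # W_bz^y z_b + b_0^y
        self.ln = nn.LayerNorm(4 * n_h)                   # LN of Eq. (7), applied jointly here
        self.init_x = nn.Linear(n_v, n_x)                 # Eq. (6)
        self.init_h = nn.Linear(n_v, n_h)

    def init_state(self, v):
        h0 = self.init_h(v)
        return self.init_x(v), h0, torch.zeros_like(h0)   # x_1, h_0, m_0

    def forward(self, v, x, h, m):
        z_h, z_x, z_b = self.hyper(v).chunk(3, dim=-1)
        pre = self.ln(self.d_h(z_h) * self.w_h(h)         # d_h^y * (W_hy h_{t-1})
                      + self.d_x(z_x) * self.w_x(x)       # d_x^y * (W_xy x_t)
                      + self.b(z_b))                       # b^y(z_b^y)
        i, g, f, o = pre.chunk(4, dim=-1)
        m = torch.sigmoid(f) * m + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(m)
        return h, m
```

Because z_i = f_H(v_i) is computed from the same feature v_i that the PCL loss organizes into pattern clusters, samples from different clusters effectively receive different decoder weights while the hypernetwork parameters remain shared across the whole dataset.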
3.3. Loss Reweighting

Our final network loss is as follows:

L = L_{pred} + \lambda L_{ProtoNCE} ,        (10)

where L_{pred} is the loss of the baseline prediction network and \lambda is a coefficient on the PCL loss term. For easy samples that the network has already fitted well, the PCL loss would hardly bring any further benefit to the network optimization. Thus, we make \lambda vary across samples, so that it acts as a gate that shuts off the PCL loss on easy samples. We use the prediction loss L_{pred} of the network after a warm-up training stage to indicate the hardness of a sample, denoted as L'_{pred}. \lambda is determined according to L'_{pred}:

\lambda = a ,   if L'_{pred} > \theta ,
\lambda = 0 ,   if L'_{pred} < \theta ,        (11)

where a is a constant hyperparameter and \theta is the threshold to filter out head samples.
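Read as a training-loop rule, the gate can be sketched as follows (assuming per-sample loss values are available; the threshold \theta = 0.2 and the initial a = 50, which later decays, follow Sec. 4.4 and are not shown decaying here):

```python
import torch

def fend_total_loss(l_pred, l_pred_warmup, l_protonce, a=50.0, theta=0.2):
    """Eqs. (10)-(11): per-sample gate on the PCL loss.

    l_pred:        per-sample prediction loss of the baseline network, shape (r,)
    l_pred_warmup: per-sample prediction loss L'_pred recorded after the warm-up stage, shape (r,)
    l_protonce:    per-sample ProtoNCE loss, shape (r,)
    """
    lam = torch.where(l_pred_warmup > theta,
                      torch.full_like(l_pred_warmup, a),
                      torch.zeros_like(l_pred_warmup))
    return (l_pred + lam * l_protonce).mean()
```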
4. Experiments

4.1. Datasets

We evaluate our proposed method on several widely used public pedestrian datasets, including ETH-UCY, Nuscenes and SDD. ETH-UCY is a pedestrian dataset with rich social interactions. Nuscenes is a large-scale trajectory dataset with both vehicles and pedestrians; in this work, we mainly evaluate the performance of our model on the vehicle category, same as [28]. SDD is another large-scale bird's-eye-view trajectory dataset. We use ETH-UCY and Nuscenes in the same way as our backbone Traj++ EWTA [28], and SDD in the same way as our backbone Y-Net [30].

4.2. Evaluation Metrics

Performance metrics. We use the common metrics for evaluating multimodal trajectory prediction performance: Average Displacement Error (ADE) and Final Displacement Error (FDE), which are commonly used in studies [1, 5, 48]. ADE is the averaged L2 distance between the predicted future and the ground-truth trajectory, while FDE is the L2 distance between the final predicted destination and the ground-truth destination. For evaluating multi-modality, we calculate the minimum ADE and FDE among all the output guesses, denoted as minADE and minFDE, and average them across the dataset.
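As a reference, the best-of-K metrics reduce to a few lines (array shapes and names are assumptions):

```python
import numpy as np

def min_ade_fde(pred, gt):
    """pred: (K, T, 2) multimodal guesses; gt: (T, 2) ground-truth future.
    Returns (minADE, minFDE) for one agent; dataset scores are averages over agents."""
    err = np.linalg.norm(pred - gt[None], axis=-1)   # (K, T) per-step L2 distances
    return err.mean(axis=1).min(), err[:, -1].min()
```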
Tailed test sample selection. In order to evaluate our model on long-tailed data, we need to separate the hard samples from the easy ones for evaluation. Specifically, we use the testing FDEs of the baseline method as the threshold to divide the datasets into seven kinds of samples: the top 1%-5% challenging samples with the largest errors, the rest of the easier samples, as well as all samples in the datasets. In [28], the Kalman predictor's prediction error is utilized for dataset division. Compared with the FDEs of a simple Kalman predictor, the performance of an advanced baseline predictor can better reflect how difficult a sample is for the data-driven network to model, which better reveals the ability of long-tail prediction methods to deal with the hard tailed samples. The Kalman-based divisions are discussed in the supplementary material.
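Concretely, the split can be obtained by ranking the test samples by the baseline's FDE (a sketch; the names are ours):

```python
import numpy as np

def split_by_baseline_fde(baseline_fde, percents=(1, 2, 3, 4, 5)):
    """baseline_fde: (N,) per-sample test FDEs of the baseline predictor.
    Returns index arrays for each difficulty bucket (hardest samples first)."""
    order = np.argsort(-baseline_fde)                 # sort by descending error
    n = len(baseline_fde)
    buckets = {f"top {p}%": order[: max(1, round(n * p / 100))] for p in percents}
    buckets["rest"] = order[round(n * max(percents) / 100):]
    buckets["all"] = order
    return buckets
```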
4.3. Baseline

We use Trajectron++ EWTA (Traj++ EWTA) [28] as the baseline for our framework on ETH-UCY and Nuscenes; it has achieved state-of-the-art results according to [28]. Traj++ EWTA augments Trajectron++ [38] by removing the conditional variational autoencoder parts and using a multi-head decoder trained with the evolving winner-take-all (EWTA) strategy. Another strong baseline we experiment on is Y-Net [30], which uses a U-Net backbone and achieves state-of-the-art results on SDD.
Method Top 1% Top 2% Top 3% Top 4% Top 5% Rest All
Traj++ EWTA [28] 0.98/2.54 0.79/2.07 0.71/1.81 0.65/1.63 0.60/1.50 0.14/0.26 0.17/0.32
Traj++ EWTA+resample [41] 0.90/2.17 0.77/1.90 0.73/1.78 0.66/1.60 0.64/1.52 0.20/0.41 0.23/0.47
Traj++ EWTA+reweighting [9] 0.97/2.47 0.78/2.03 0.68/1.73 0.62/1.55 0.56/1.40 0.14/0.26 0.16/0.32
Traj++ EWTA+LDAM [3] 0.92/2.35 0.76/1.96 0.68/1.71 0.62/1.53 0.57/1.37 0.15/0.27 0.17/0.32
Traj++ EWTA+contrastive [28] 0.92/2.33 0.74/1.91 0.67/1.71 0.60/1.48 0.55/1.32 0.15/0.27 0.17/0.32
Traj++ EWTA+FEND (ours) 0.84/2.13 0.68/1.68 0.61/1.46 0.56/1.30 0.52/1.19 0.15/0.27 0.17/0.32
Table 1. Prediction errors in the format of (minADE/minFDE) in meters on seven kinds of testing samples on the ETH-UCY dataset.
Table 2. Prediction errors in the format of (minADE/minFDE) in meters on seven kinds of testing samples on Nuscenes dataset.
4.4. Implementation Details

We follow the training schedule of Traj++ EWTA and train the network with a batch size of 256 for 100 epochs on ETH-UCY and 5 epochs on Nuscenes in each EWTA stage. The learning rate is initially set to 0.01 and decays exponentially at a rate of 0.001. We use a warm-up of 300 epochs in our final model for ETH-UCY. We set a = 50 as the initial loss factor, same as [28], and a decays to 0.2 after 100 epochs so as not to harm the prediction training process, according to the drop of the EWTA loss. The head sample filter threshold \theta is set to 0.2. For the feature extractor, we use a 1D CNN with 16 output channels and a kernel size of 3, followed by an LSTM with a hidden size of 128. For Kmeans clustering, we use {20, 50, 100} as the cluster numbers to obtain hierarchical clusters. We use a fully-connected multilayer perceptron with a hidden size of 128 as the hypernetwork. To train Y-Net, we follow [22] and average-pool the encoded feature of shape (C, H, W) over the spatial dimensions to get a C-dimensional vector, and perform PCL on it. We set a = 1 and use no warm-up.
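For reference, the feature extractor described above could look roughly as follows; this is a sketch consistent with the stated sizes, and the exact layer arrangement, the projection size, and the names are our assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """1D CNN (16 output channels, kernel size 3) followed by an LSTM with hidden size 128,
    plus the FC head whose output receives the PCL loss (Sec. 3.1.2)."""

    def __init__(self, in_dim=2, hidden=128, proj_dim=64):
        super().__init__()
        self.cnn = nn.Conv1d(in_dim, 16, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(16, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, proj_dim)

    def forward(self, traj):
        # traj: (B, T_obs, 2) observed trajectory coordinates
        x = self.cnn(traj.transpose(1, 2)).transpose(1, 2)   # (B, T_obs, 16)
        _, (h, _) = self.lstm(x)
        feat = h[-1]                      # (B, 128): fed to the hyper decoder
        return feat, self.fc(feat)        # (decoder feature, PCL embedding)
```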
4.5. Comparisons with Others

Quantitative comparisons on Traj++ EWTA on ETH-UCY. To show the effectiveness of our method, we select the state-of-the-art method for long-tail trajectory prediction [28], classical data resampling [41] and loss reweighting [9], and a head-tail performance balancing method [3] for comparison. For the long-tailed classification methods [3, 9, 41], we construct a classification head after the encoder of Traj++ EWTA and use it to classify the trajectories according to the discretization of Kalman filter errors, same as Makansi et al. [28]; the classification loss is trained along with the prediction loss. Table 1 summarizes our experimental results on ETH-UCY using a best-of-20 evaluation [10]. We can see that our method stably outperforms all compared methods on all the top 1%-5% long-tail samples. Specifically, our framework outperforms the second best method, Traj++ EWTA+contrastive [28], by 9.5% on ADE and 8.5% on FDE on the top 1% hardest samples, while keeping the average ADE and FDE nearly stable. Traj++ EWTA+reweighting [9] performs best on the average ADE/FDE, but its performance gains on tailed samples are relatively small. Traj++ EWTA+resampling [41] achieves larger gains on the most tailed samples, but its average ADE/FDE becomes much worse. Unlike simply doing resampling or loss reweighting, the hypernetwork can decouple head samples and tail samples in the parameter space of the decoder and therefore achieves better performance.

Quantitative comparisons on Traj++ EWTA on Nuscenes. Comparison results with the previous best long-tail prediction method [28] on Nuscenes are in Table 2. We find that the resampling operation in the original Traj++ EWTA does not work well with FEND, probably because it causes the hypernetwork to overfit. Despite this, as shown in Table 2, the baseline without resampling achieves both superior long-tail and overall performance with FEND. The performances of Traj++ EWTA and Traj++ EWTA+contrastive on both ETH-UCY and Nuscenes are tested with the pre-trained models provided by [28].

Quantitative comparisons on Y-Net on SDD. We also plug our module into Y-Net; the results are shown in Table 3. We reproduced the results of Y-Net using the officially released code of [30] with 42 as the random seed, since the original method does not fix a seed. The results show that our method achieves performance gains on both the tail samples and the whole dataset.

Qualitative comparison. Figure 3 shows some long-tailed hard-case studies of our method on ETH-UCY. Those cases contain some rare social interactions, and all the future trajectories in them are non-trivial to predict. In all those samples, our method (blue) outperforms the original baseline.
Method Top 1% Top 2% Top 3% Top 4% Top 5% Rest All
Y-Net* [30] 65.82/134.01 51.84/104.37 43.74/88.21 38.68/76.08 34.72/67.46 6.54/8.96 7.93/11.88
Y-Net*+FEND 57.58/108.51 46.33/86.93 39.22/75.02 35.05/66.24 31.27/57.98 6.64/9.24 7.87/11.68
Table 3. Prediction errors in the format of (minADE/minFDE) on seven kinds of testing samples on SDD dataset. * means the results are
reproduced using the official released code of [30].
Components Performance(ADE/FDE)
PCL F H Top 1% Top 2% Top 3% Top 4% Top 5% Rest All
0.98/2.54 0.79/2.07 0.71/1.81 0.65/1.63 0.60/1.50 0.14/0.26 0.17/0.32
✓ 0.96/2.41 0.79/2.03 0.70/1.77 0.62/1.56 0.57/1.41 0.15/0.27 0.17/0.32
✓ ✓ 0.89/2.23 0.72/1.84 0.66/1.61 0.60/1.44 0.55/1.30 0.15/0.27 0.17/0.32
✓ ✓ 0.90/2.28 0.72/1.87 0.65/1.61 0.58/1.43 0.54/1.30 0.15/0.27 0.17/0.32
✓ ✓ ✓ 0.84/2.13 0.68/1.68 0.61/1.46 0.56/1.30 0.52/1.19 0.15/0.27 0.17/0.32
Table 4. Ablation study of different modules in FEND. F means future enhanced clusters, H means the hypernetwork.
Figure 5. TSNE results of (a)Traj++EWTA (b)Traj++ EWTA+contrastive (c)Traj++ EWTA+FEND on the univ scene. The red stars, the
green stars, and the yellow stars represent clusters of three kinds of hard tailed patterns, while the magenta and cyan dots represent clusters
of two kinds of easy head patterns. We can see from the figures that our method forms a more separately clustered feature space.
Figure 6. CDF curve and CDF bars of testing FDEs on ETH-UCY. It can be seen that our method has a shorter tail region.

Parameter sensitivity study. Table 5 shows the parameter sensitivity study of the PCL loss weight a. We can see that setting a = 50 initially is the best choice. Other parameter sensitivity studies are provided in the supplementary material.

Shaped feature embedding space. Figure 5 shows the TSNE results of the feature space of our method and two comparing methods, with two head patterns and three tailed patterns. We can see from the figure that our future enhanced PCL method can decently separate the tail patterns and the head patterns, while there is still some overlap between the heads and the tails in the feature space of Traj++ EWTA and Traj++ EWTA+contrastive. Also, we can see from Fig. 5 that our method can form different clusters for different tailed patterns, while in the feature space of Traj++ EWTA+contrastive, all the samples of the three tail patterns are pushed together, as discussed in Sec. 2.

FDE distribution bars. To illustrate the distribution of the prediction errors across the dataset more clearly, we plot the cumulative distribution function (CDF) curve of FDEs in Figure 6, compared with [28]. The CDF is averaged across the five scenes.

Limitations. The performance on the head samples drops slightly, which can be seen in Figure 6 and Tables 1, 2 and 3. We leave this as future work. In most experiments we use minADE/FDE as the prediction evaluation protocols. There are many better metrics such as the Negative Log-Likelihood (NLL) [2, 16, 17] or those which take scene compliance or socially acceptable prediction into account [18, 21]. The results of another evaluation protocol, NLL, are in the supplementary material.

Discussion about single agent clustering. We use single-agent full trajectory features for clustering, similar to other works using single trajectories to cluster or retrieve [42, 51]. In our experiments we find that the information in single-agent trajectories can already lead to good performance. We believe that including social features into the clustering process is a promising future direction.

5. Conclusion

In this paper, we propose a future enhanced contrastive feature space shaping method and a distribution-aware hyper decoder for long-tailed trajectory prediction. Quantitative and qualitative experimental results show that our method can outperform state-of-the-art long-tail prediction methods on the challenging tailed samples, while maintaining the averaged performance on the whole datasets. Our method can be generally plugged into many strong prediction networks.

Acknowledgement

This work was supported by NSFC Projects (No. 62036008) and STI 2030—Major Projects (No. 2021ZD0201300).
References

[1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 961–971, 2016. 1, 2, 5
[2] Apratim Bhattacharyya, Bernt Schiele, and Mario Fritz. Accurate and diverse sampling of sequences based on a “best of many” sample objective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8485–8493, 2018. 8
[3] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in neural information processing systems, 32, 2019. 2, 6
[4] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020. 2
[5] Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449, 2019. 2, 5
[6] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002. 2
[7] Guangyi Chen, Junlong Li, Nuoxing Zhou, Liangliang Ren, and Jiwen Lu. Personalized trajectory prediction via distribution discrimination. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15580–15589, 2021. 2
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020. 2
[9] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9268–9277, 2019. 2, 6
[10] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2255–2264, 2018. 1, 6
[11] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016. 2, 4
[12] Marah Halawa, Olaf Hellwich, and Pia Bideau. Action-based contrastive learning for trajectory prediction. In European Conference on Computer Vision, pages 143–159. Springer, 2022. 2
[13] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing, pages 878–887. Springer, 2005. 2
[14] John A Hartigan and Manchek A Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979. 3
[15] Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009. 2
[16] Ronny Hug, Wolfgang Hübner, and Michael Arens. Introducing probabilistic bézier curves for n-step sequence prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10162–10169, 2020. 8
[17] Boris Ivanovic and Marco Pavone. The trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2375–2384, 2019. 8
[18] Boris Ivanovic and Marco Pavone. Injecting planning-awareness into prediction and detection evaluation. In 2022 IEEE Intelligent Vehicles Symposium (IV), pages 821–828. IEEE, 2022. 8
[19] Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning. Advances in Neural Information Processing Systems, 33:21798–21809, 2020. 2
[20] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020. 2
[21] Parth Kothari, Sven Kreiss, and Alexandre Alahi. Human trajectory forecasting in crowds: A deep learning perspective. IEEE Transactions on Intelligent Transportation Systems, 23(7):7386–7400, 2021. 8
[22] Mihee Lee, Samuel S Sohn, Seonghyeon Moon, Sejong Yoon, Mubbasir Kapadia, and Vladimir Pavlovic. Muse-vae: multi-scale vae for environment-aware long term trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2221–2230, 2022. 6
[23] Junnan Li, Pan Zhou, Caiming Xiong, and Steven CH Hoi. Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966, 2020. 2, 3, 4
[24] Tianhong Li, Peng Cao, Yuan Yuan, Lijie Fan, Yuzhe Yang, Rogerio S Feris, Piotr Indyk, and Dina Katabi. Targeted supervised contrastive learning for long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6918–6928, 2022. 1
[25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 2
[26] Yuejiang Liu, Qi Yan, and Alexandre Alahi. Social nce: Contrastive learning of socially-aware motion representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15118–15129, 2021. 2
[27] Yuanfu Luo, Panpan Cai, Aniket Bera, David Hsu, Wee Sun Lee, and Dinesh Manocha. Porca: Modeling and planning for autonomous driving among many pedestrians. IEEE Robotics and Automation Letters, 3(4):3418–3425, 2018. 1
[28] Osama Makansi, Özgün Çiçek, Yassine Marrakchi, and Thomas Brox. On exposing the challenging long tail in future prediction of traffic actors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13147–13157, 2021. 1, 2, 4, 5, 6, 8
[29] Osama Makansi, Eddy Ilg, Ozgun Cicek, and Thomas Brox. Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7144–7153, 2019. 2
[30] Karttikeya Mangalam, Yang An, Harshayu Girase, and Jitendra Malik. From goals, waypoints & paths to long term human trajectory forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15233–15242, 2021. 2, 5, 6, 7
[31] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. arXiv preprint arXiv:2007.07314, 2020. 2
[32] Abduallah Mohamed, Kun Qian, Mohamed Elhoseiny, and Christian Claudel. Social-stgcnn: A social spatio-temporal graph convolutional neural network for human trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14424–14432, 2020. 2
[33] Sriram Narayanan, Ramin Moslemi, Francesco Pittaluga, Buyu Liu, and Manmohan Chandraker. Divide-and-conquer for lane-aware diverse trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15799–15808, 2021. 2
[34] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 2
[35] Bo Pang, Tianyang Zhao, Xu Xie, and Ying Nian Wu. Trajectory prediction with latent belief energy-based model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11814–11824, 2021. 2
[36] Tung Phan-Minh, Elena Corina Grigore, Freddy A Boulton, Oscar Beijbom, and Eric M Wolff. Covernet: Multimodal behavior prediction using trajectory sets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14074–14083, 2020. 2
[37] Amir Sadeghian, Vineet Kosaraju, Ali Sadeghian, Noriaki Hirose, Hamid Rezatofighi, and Silvio Savarese. Sophie: An attentive gan for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1349–1358, 2019. 1
[38] Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In European Conference on Computer Vision, pages 683–700. Springer, 2020. 1, 2, 4, 5
[39] Dvir Samuel and Gal Chechik. Distributional robustness loss for long-tail learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9495–9504, 2021. 1
[40] Nasim Shafiee, Taskin Padir, and Ehsan Elhamifar. Introvert: Human trajectory prediction via conditional 3d attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16815–16825, 2021. 2
[41] Li Shen, Zhouchen Lin, and Qingming Huang. Relay backpropagation for effective learning of deep convolutional neural networks. In European conference on computer vision, pages 467–482. Springer, 2016. 2, 6
[42] Jianhua Sun, Yuxuan Li, Hao-Shu Fang, and Cewu Lu. Three steps to multimodal trajectory prediction: Modality clustering, classification and synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13250–13259, 2021. 2, 8
[43] Chenxin Xu, Maosen Li, Zhenyang Ni, Ya Zhang, and Siheng Chen. Groupnet: Multiscale hypergraph neural networks for trajectory prediction with relational reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6498–6507, 2022. 2
[44] Chenxin Xu, Weibo Mao, Wenjun Zhang, and Siheng Chen. Remember intentions: Retrospective-memory-based trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6488–6497, 2022. 2
[45] Yuzhe Yang, Kaiwen Zha, Yingcong Chen, Hao Wang, and Dina Katabi. Delving into deep imbalanced regression. In International Conference on Machine Learning, pages 11842–11851. PMLR, 2021. 2
[46] Cunjun Yu, Xiao Ma, Jiawei Ren, Haiyu Zhao, and Shuai Yi. Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In European Conference on Computer Vision, pages 507–523. Springer, 2020. 2
[47] Ye Yuan, Xinshuo Weng, Yanglan Ou, and Kris M Kitani. Agentformer: Agent-aware transformers for socio-temporal multi-agent forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9813–9823, 2021. 2
[48] Pu Zhang, Wanli Ouyang, Pengfei Zhang, Jianru Xue, and Nanning Zheng. Sr-lstm: State refinement for lstm towards pedestrian trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12085–12094, 2019. 1, 2, 5
[49] Pu Zhang, Jianru Xue, Pengfei Zhang, Nanning Zheng, and Wanli Ouyang. Social-aware pedestrian trajectory prediction via states refinement lstm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. 1, 4
[50] Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Ben Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, et al. Tnt: Target-driven trajectory prediction. In Conference on Robot Learning, pages 895–904. PMLR, 2021. 2
[51] He Zhao and Richard P Wildes. Where are you heading? dynamic trajectory prediction with expert goal examples. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7629–7638, 2021. 8