
GANet: Goal Area Network for Motion Forecasting

Mingkun Wang1, Xinge Zhu2, Changqian Yu3, Wei Li4, Yuexin Ma5,
Ruochun Jin6, Xiaoguang Ren7, Dongchun Ren3, Mingxu Wang8 and Wenjing Yang6∗

Abstract— Predicting the future motion of road participants networks (CNN) to process. Others [17], [1] use vectorized
is crucial for autonomous driving but is extremely challenging due to staggering motion uncertainty. Recently, most motion forecasting methods resort to the goal-based strategy, i.e., predicting endpoints of motion trajectories as conditions to regress the entire trajectories, so that the search space of solutions can be reduced. However, accurate goal coordinates are hard to predict and evaluate. In addition, the point representation of the destination limits the utilization of a rich road context, leading to inaccurate prediction results in many cases. A goal area, i.e., the possible destination area, rather than a goal coordinate, can provide a softer constraint for searching potential trajectories by involving more tolerance and guidance. In view of this, we propose a new goal area-based framework, named Goal Area Network (GANet), for motion forecasting, which models goal areas as preconditions for trajectory prediction and performs more robustly and accurately. Specifically, we propose a GoICrop (Goal Area of Interest) operator to effectively extract semantic lane features in goal areas and model actors' future interactions, which substantially benefits future trajectory estimation. GANet ranks 1st on the leaderboard of the Argoverse Challenge among all public literature (till the paper submission), and its source code will be released.

This work was supported by funding from the National Natural Science Foundation of China (91948303-1) and the National Key R&D Program of China (2021ZD0140301). *Corresponding author.
1 Peking University, [email protected]
2 The Chinese University of Hong Kong, [email protected]
3 Meituan, [email protected], [email protected]
4 Inceptio, [email protected]
5 ShanghaiTech University, [email protected]
6 National University of Defense Technology, [email protected], [email protected]
7 Academy of Military Sciences, rxg [email protected]
8 Fudan University, wang [email protected]

I. INTRODUCTION

As one of the most critical subtasks in autonomous driving, motion forecasting aims to understand and predict the future behaviors of other road participants (called actors). It is essential for the self-driving car to make safe and reasonable decisions in the subsequent planning and control module. The recent emergence of large-scale datasets with high-definition maps (HD maps) and sensor data [10], [23], [24] has boosted the research in motion forecasting. These HD maps provide rich geometric and semantic information, e.g., the map topology, that constrains the vehicle's motion. Meanwhile, actors also follow driving etiquette and interact with each other. Thus, how to effectively incorporate driving context to predict multiple plausible and accurate trajectories becomes the core challenge for motion forecasting.

Some works [22], [16] encode maps and motion trajectories into 2D images and apply convolutional neural networks (CNNs) to process them. Others [17], [1] use vectorized and graph-structured data to represent maps. For instance, LaneGCN [4] applies a multi-stride graph neural network to encode maps. However, since the traveling mode of an actor is highly diverse, the fixed-size stride cannot effectively model distant relevant map features and thus limits the prediction performance (see Figure 3). While most works [15], [22], [16], [4] focus on map encoding and motion history modeling, another family of methods [2], [3], built on goal-based prediction, captures the actor's future intentions explicitly. Specifically, these methods follow a three-stage scheme: first, candidate goals are sampled from the lane centerlines; second, a set of goals is selected by goal prediction; third, trajectories are estimated conditioned on the selected goals. Although these methods have achieved competitive results, there remain two main drawbacks. (1) These methods merely use a limited number of isolated goal coordinates as conditions, which contain limited information and hinder accurate motion forecasting. As goal coordinates at different distances to the road edge carry different information, using a limited number of goal coordinates as conditions constrains the full utilization of the road context. (2) The competitive performance of these methods heavily depends on a well-designed goal space, which may be violated in practice. A well-designed goal space is required for sampling, refining, and scoring candidate goals, due to the difficulty of predicting and evaluating accurate goal coordinates. For example, in Target-driveN Trajectory prediction (TNT) [2], vehicles' candidate goals are sampled from the lane centerlines while pedestrians' candidate goals are sampled from a virtual grid around themselves. However, these methods may fail once these hard-coded candidate goals are violated in the real world.

Compared with accurate goal coordinates, a potential goal area with a relatively richer road context is able to provide more tolerance and better guidance for accurate trajectory prediction through a soft constraint. Also, as the driving history of actors is critical for goal area estimation, we make full use of this clue for accurate localization of goal areas. For example, a fast-moving vehicle's goal area may be far away, while the goal area of a stationary vehicle should be limited around itself.

Motivated by these observations, we propose the Goal Area Network framework (GANet), which predicts potential goal areas as conditions for motion forecasting. As shown in Figure 1, there are three stages in GANet, which are trained in an end-to-end way, and we construct a series of GANet models following this framework. They overcome the shortcomings
of the aforementioned goal-based prediction methods. First, an efficient encoding backbone is adopted to encode motion history and scene context. Then, we predict approximate goals and crop their surrounding goal areas as more robust conditions. Moreover, we introduce a GoICrop operator to explicitly query and aggregate the rich semantic features of lanes in the goal areas. Finally, we perform the final motion forecasting conditioned on motion history, scene context, and the aggregated goal area features. Extensive experiments on the large-scale Argoverse 1 and Argoverse 2 motion forecasting benchmarks demonstrate the effectiveness and generality of our proposed framework, where GANet achieves state-of-the-art performance.

II. RELATED WORK

Interactions. Early motion prediction methods mainly focus on motion and interaction modeling. They attempt to explain actors' complex movements by exploring their potential "interactions." Traditional methods such as Social Force [11] use hand-crafted features and rules to model interactions and constraints. Later, deep learning methods brought significant progress to this task. Social LSTM [12] and SR-LSTM [13] use variants of LSTM to implicitly model interactions. GNN-TP [14] introduces a GNN method for interaction inference and trajectory prediction. The approach of [15] applies multi-head attention to incorporate interaction. mmTransformer [6] applies a transformer architecture to fuse actors' motion histories, maps, and interactions.

HD maps encoding. According to the HD maps' processing manner, methods can be divided into three categories. Rasterization-based methods rasterize the elements of HD maps and actors' motion histories into an image. Then, they use a CNN to extract features and perform coordinate prediction. IntentNet [22] develops a multi-task model with a CNN-based detector to extract features from rasterized maps. MultiPath [16] uses a scene CNN to extract mid-level features and encodes the states of actors and their interactions on a top-down scene representation. However, these 2D-CNN-based methods suffer from low efficiency in extracting features of graph-structured maps. Graph-based methods [1] construct graph-structured representations from HD maps, which preserve the connectivity of lanes. VectorNet [17] encodes map elements and actor trajectories as polylines and then uses a global interaction graph to fuse map and actor features. LaneGCN [4] constructs a lane node graph and proposes a novel graph convolution. Point-cloud-based methods use points to represent actors' trajectories and maps. TPCN [5] takes each actor as an unordered point set and applies a point cloud learning model.

Multimodality. Multi-modal prediction has become an indispensable part of motion forecasting, as it deals with the uncertainty in motion forecasting. Generative methods, such as variational auto-encoders [26] and generative adversarial networks [18], can be used to generate multi-modal predictions. However, each prediction requires an independent sampling and forward pass, which cannot guarantee the diversity of samples. Other methods [19], [16] add some prior knowledge, such as pre-defined or model-based anchor trajectories. mmTransformer [6] designs a region-based training strategy, which ensures that each proposal captures a specific pattern. Recently, goal-based forecasting methods [20] have proven effective. TNT [2] first samples dense goal candidates along the lanes and generates trajectories conditioned on high-scored goals. LaneRCNN [1] regards each lane segment as an anchor. DenseTNT [3] introduces a trajectory prediction model to output a set of trajectories from dense goal candidates. Heatmap-based methods [8] focus on outputting a heatmap to represent the trajectories' future distribution. HOME [7] predicts a future probability distribution heatmap and designs a deterministic sampling algorithm for optimization.

Our method is different from previous works as follows. (1) We give the definition of the goal area and propose a new goal area-based framework. We experimentally verify the effectiveness of modeling goal areas, predicting goal areas, and fusing crucial distant map features slighted by previous methods. These map features provide more robust information than the goal coordinate embedding. (2) We employ a GoICrop operator to extract rich semantic map features in goal areas. It implicitly captures the interactions between maps and trajectories in goal areas and constrains the trajectories to follow driving rules and map topology in a data-driven manner. (3) Since our predicted goal is just a potential destination, we take it as a handle to model agents' future interactions, which is also crucial for collision avoidance.

III. METHOD

This section describes our formulation and the GANet framework in a pipelined manner. An overview of the GANet architecture is shown in Figure 2, and each module in this framework is pluggable.

Formulation. Given a sequence of past observed states a_P = [a_{-T'+1}, a_{-T'+2}, ..., a_0] for an actor, we aim to predict its future states a_F = [a_1, a_2, ..., a_T] up to a fixed time step T. Running in a specific environment, each actor will interact with the static HD map m and the other dynamic actors. Therefore, the probabilistic distribution we want to capture is p(a_F | m, a_P, a_P^O), where a_P^O denotes the other actors' observed states. The output of our model is A_F = {a_F^k}_{k in [0,K-1]} = {(a_1^k, a_2^k, ..., a_T^k)}_{k in [0,K-1]} for each actor, since motion forecasting tasks and subsequent decision modules usually expect us to output a set of trajectories. TNT-like methods approximate the distribution as

\sum_{\tau \in \mathcal{T}(m, a_P, a_P^O)} p(\tau \mid m, a_P, a_P^O)\, p(a_F \mid \tau, m, a_P, a_P^O)    (1)

where \mathcal{T}(m, a_P, a_P^O) is the space of candidate goals depending on the driving context. However, the map space m is large, and the goal space \mathcal{T}(m, a_P, a_P^O) requires careful design.

Some methods expect to accurately predict the actor's motion by extracting good features. For example, LaneGCN [4] tries to approximate p(a_F | m, a_P, a_P^O) by modeling p(a_F | M_{a_0}, a_P, a_P^O), where M_{a_0} denotes the "local" map features related to the actor's state a_0 at the final observed
step t = 0. To extract M_{a_0}, they use a_0 as an anchor to retrieve its surrounding map elements and aggregate their features. We found that not only the "local" map information is important, but the goal area map information is also of great importance for accurate trajectory prediction. So, we reconstruct the probability as:

\sum_{\tau} p(\tau \mid M_{a_0}, a_P, a_P^O)\, p(M_\tau \mid m, \tau)\, p(a_F \mid M_\tau, M_{a_0}, a_P, a_P^O)    (2)

We directly predict possible goals τ based on actors' motion histories and the driving context. Therefore, GANet is genuinely end-to-end, adaptive, and efficient. Then, we apply the predicted goals as anchors to retrieve the map elements in goal areas explicitly and aggregate their map features as M_τ.

Fig. 1. Illustration of the GANet framework, which consists of three stages: (a) Context encoding encodes motion history and scene context; (b) Goal prediction predicts possible goals, and GoICrop retrieves and aggregates goal area map features and models the actors' future interactions; (c) Motion forecasting estimates multiple feasible trajectories and their corresponding confidence scores.

A. Motion history and scene context encoding

As shown in Figure 2, the first stage of motion forecasting is driving context encoding, which extracts actors' motion features and map features. We adopt LaneGCN's [4] backbone to encode motion history and scene context for its outstanding performance. Specifically, we apply a 1D CNN with a Feature Pyramid Network (FPN) to extract actors' motion features. Following [4], we use a multi-scale LaneConv network to encode the vectorized map data, which consists of lane centerlines and their connectivity. We construct a lane node graph from the map data. Finally, a fusion network transfers and aggregates features among actors and lane nodes. After driving context encoding, we obtain a 2D feature matrix X, where each row X_i indicates the feature of the i-th actor, and a 2D matrix Y, where each row Y_i indicates the feature of the i-th lane node. We can also use other methods to encode motion history and scene context; for example, we implement a VectorNet++ method in the ablation study section.
a VectorNet++ method in the ablation study section. regress a middle goal gn,mid for the n-th actor. The loss term
B. Goal prediction for this module is given by:
In stage two, we predict possible goals for the i-th actor N
1 X
based on Xi . We apply intermediate supervision and calculate Lreg mid = reg(gn,mid − a∗n,mid ) (5)
N n=1
the smooth L1 loss between the best-predicted goal and the
ground-truth trajectory’s endpoint to backpropagate, making where a∗n,mid is the ground truth BEV coordinates of the
the predicted goal close to the actual goal as much as n-th actor trajectory’s middle position.
possible. The goal prediction stage serves as a predictive The total loss at the goal prediction stage is:
test to locate goal areas, which is different from goal-based
L1 = α1 Lcls end + β1 Lreg end + ρ1 Lreg mid (6)
methods using the predicted goals as the final predicted
trajectories’ endpoint. In practice, a driver’s driving intent where α1 = 1, β1 = 0.2 and ρ1 = 0.1.
Fig. 2. The GANet M 3 model overview. (a) A feature extracting model encodes and fuses map and motion features. (b) The ”one goal prediction”
module predicts a goal area in the trajectory’s middle position and aggregates its features. (c) The ”three goals predictions” module predicts three goal areas,
aggregates their features, and models the actors’ future interactions. (d) The final prediction stage predicts K trajectories and their confidence scores.

C. GoICrop

We choose the predicted goal with the highest confidence among the E goals as an anchor. This anchor is the approximate destination with the highest probability that the actor may reach, based on its motion history and driving context. Because the actors' motion is highly uncertain, we crop the map within 6 meters of the anchor as the goal area of interest, which relaxes the strict goal prediction requirement. The actual endpoint is more likely to appear in candidate areas than to be hit by scattered endpoint predictions. Moreover, the actor's behavior highly depends on its destination area's context, i.e., the maps and other actors. Although previous works have explored the interactions between actors, the interactions between actors and maps in goal areas and the interactions among actors in the future have received less attention. Thus, we retrieve the lane nodes in goal areas and apply a GoICrop module to aggregate these map node features as follows:

x_i' = \phi_1\big(x_i W_0 + \sum_{j} \phi_2(\mathrm{concat}(x_i W_1, \Delta_{i,j}, y_j) W_2)\big) W_3    (7)

where x_i is the feature of the i-th actor, y_j is the feature of the j-th lane node, W_i is a weight matrix, φ_i is layer normalization followed by a ReLU function, and \Delta_{i,j} = \phi(\mathrm{MLP}(v_i - v_j)), where v_i denotes the anchor's coordinates of the i-th actor and v_j denotes the j-th lane node's coordinates. GoICrop serves as spatial distance-based attention and updates the goal area lane nodes' features back to the actors. We transform x_i with W_1 as a query embedding. The relative distance feature between the anchor of the i-th actor and the j-th lane node is extracted by \Delta_{i,j}. Then, we concatenate the query embedding, the relative distance feature, and the lane node feature. An MLP is employed to transform and encode these features. Finally, the goal area features are aggregated for the i-th actor.

Previous motion forecasting methods usually focus on the interactions in the observation history. However, actors will interact with each other in the future to follow driving etiquette, such as avoiding collisions. Since we have performed predictive goal predictions and obtained possible goals for each actor, our framework can model the actors' future interactions. Hence, we utilize the predicted anchor positions and apply a GoICrop module as in equation 7 to implicitly model actors' future interactions. We consider the other actors whose future anchor's distance from the anchor of the i-th actor is smaller than 100 meters. In this case, y_j in equation 7 denotes the features of the j-th actor, v_i denotes the anchor's coordinates of the i-th actor, and v_j denotes the anchor's coordinates of the j-th actor in \Delta_{i,j} = \phi(\mathrm{MLP}(v_i - v_j)).
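A hypothetical PyTorch sketch of the aggregation in Eq. (7) is shown below. The module structure (nn.Linear weights standing in for W_0–W_3, LayerNorm + ReLU standing in for φ, and a hard radius mask for the 6 m crop) is one assumed reading of the equation, not the exact implementation; shapes and names are illustrative.

```python
# Hypothetical GoICrop sketch for Eq. (7); structural details are assumptions.
import torch
import torch.nn as nn

class GoICrop(nn.Module):
    def __init__(self, dim: int = 128, radius: float = 6.0):
        super().__init__()
        self.radius = radius
        self.w0 = nn.Linear(dim, dim, bias=False)                # x_i W_0
        self.w1 = nn.Linear(dim, dim, bias=False)                # query embedding x_i W_1
        self.dist = nn.Sequential(nn.Linear(2, dim), nn.ReLU())  # Delta_ij = phi(MLP(v_i - v_j))
        self.w2 = nn.Linear(3 * dim, dim, bias=False)            # concat(...) W_2
        self.w3 = nn.Linear(dim, dim, bias=False)                # output projection W_3
        self.phi1 = nn.Sequential(nn.LayerNorm(dim), nn.ReLU())  # phi_1
        self.phi2 = nn.Sequential(nn.LayerNorm(dim), nn.ReLU())  # phi_2

    def forward(self, x, anchors, y, node_xy):
        """x: (N, dim) actor features; anchors: (N, 2) top-scoring predicted goals;
        y: (M, dim) lane-node features; node_xy: (M, 2) lane-node coordinates."""
        rel = anchors[:, None, :] - node_xy[None, :, :]          # (N, M, 2) relative offsets
        in_area = rel.norm(dim=-1) <= self.radius                # goal-area-of-interest mask
        delta = self.dist(rel)                                   # (N, M, dim) distance features
        q = self.w1(x)[:, None, :].expand(-1, y.shape[0], -1)    # (N, M, dim) query embeddings
        msg = self.phi2(self.w2(torch.cat([q, delta, y[None].expand_as(q)], dim=-1)))
        msg = (msg * in_area[..., None]).sum(dim=1)              # sum over in-area nodes j
        return self.w3(self.phi1(self.w0(x) + msg))              # x'_i, shape (N, dim)
```

Under the same assumptions, the future-interaction step described above can reuse this operator by passing the other actors' features as y, their predicted anchors as node_xy, and a 100-meter radius.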
D. Motion estimation and scoring

We take the updated actor features X as input to predict K final future trajectories and their confidence scores in stage three. Specifically, we construct a two-branch multi-modal prediction header similar to the goal prediction stage, with one regression branch estimating the trajectories and one classification branch scoring the trajectories. For each actor, we regress K sequences of BEV coordinates A_{n,F} = {(a_{n,1}^k, a_{n,2}^k, ..., a_{n,T}^k)}_{k in [0,K-1]}, where a_{n,t}^k denotes the n-th actor's future coordinates of the k-th mode at the t-th step. For the classification branch, we output K confidence scores C_{n,cls} = {c_n^k}_{k in [0,K-1]} corresponding to the K modes. We find the positive trajectory of mode k̂, whose endpoint has the minimum Euclidean distance to the ground truth endpoint.

For classification, we use the margin loss L_{cls} similar to the goal prediction stage. For regression, we apply the smooth L1 loss on all predicted steps of the positive trajectories:

L_{reg} = \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \mathrm{reg}(a_{n,t}^{\hat{k}} - a_{n,t}^{*})    (8)

where a_{n,t}^{*} is the n-th actor's ground truth coordinates.

To emphasize the importance of the goal, we add a loss term stressing the penalty at the endpoint:

L_{end} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{reg}(a_{n,end}^{\hat{k}} - a_{n,end}^{*})    (9)

where a_{n,end}^{*} is the n-th actor's ground truth endpoint coordinates and a_{n,end}^{\hat{k}} is the endpoint of the n-th actor's predicted positive trajectory.

The loss function for training at this stage is given by:

L_2 = \alpha_2 L_{cls} + \beta_2 L_{reg} + \rho_2 L_{end}    (10)

where α_2 = 2, β_2 = 1 and ρ_2 = 1.

E. Training

As all the modules are differentiable, we train our model with the loss function:

L = L_1 + L_2    (11)

The parameters are chosen to balance the training process.
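The following is a hypothetical sketch of the stage-three two-branch prediction header. The MLP depths and the 30-step horizon (Argoverse 1: 3 s at 10 Hz) are assumptions for illustration, not the released model definition.

```python
# Hypothetical two-branch motion estimation header; sizes are assumptions.
import torch
import torch.nn as nn

class MotionHead(nn.Module):
    def __init__(self, dim: int = 128, num_modes: int = 6, horizon: int = 30):
        super().__init__()
        self.num_modes, self.horizon = num_modes, horizon
        # Regression branch: K trajectories of T future BEV coordinates each.
        self.reg = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, num_modes * horizon * 2))
        # Classification branch: one confidence score per mode.
        self.cls = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, num_modes))

    def forward(self, x: torch.Tensor):
        """x: (N, dim) actor features after GoICrop; returns trajectories and scores."""
        trajs = self.reg(x).view(-1, self.num_modes, self.horizon, 2)  # (N, K, T, 2)
        scores = self.cls(x).softmax(dim=-1)                           # (N, K)
        return trajs, scores

# Total objective, following Eqs. (6), (10), (11): the per-term losses use the same
# margin / smooth-L1 pattern sketched for the goal-prediction stage.
#   L1 = 1.0 * L_cls_end + 0.2 * L_reg_end + 0.1 * L_reg_mid
#   L2 = 2.0 * L_cls     + 1.0 * L_reg     + 1.0 * L_end
#   L  = L1 + L2
```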
TABLE I
Results on the Argoverse 1 (upper set) and Argoverse 2 (lower set) motion forecasting test sets. A "-" denotes that the result was not reported in the corresponding paper.

Method | b-minFDE (K=6) | MR (K=6) | minFDE (K=6) | minADE (K=6) | minFDE (K=1) | minADE (K=1) | MR (K=1)
LaneRCNN [1] | 2.147 | 0.123 | 1.453 | 0.904 | 3.692 | 1.685 | 0.569
TNT [2] | 2.140 | 0.166 | 1.446 | 0.910 | 4.959 | 2.174 | 0.710
DenseTNT (MR) [3] | 2.076 | 0.103 | 1.381 | 0.911 | 3.696 | 1.703 | 0.599
LaneGCN [4] | 2.059 | 0.163 | 1.364 | 0.868 | 3.779 | 1.706 | 0.591
mmTransformer [6] | 2.033 | 0.154 | 1.338 | 0.844 | 4.003 | 1.774 | 0.618
GOHOME [8] | 1.983 | 0.105 | 1.450 | 0.943 | 3.647 | 1.689 | 0.572
HOME [7] | - | 0.102 | 1.45 | 0.94 | 3.73 | 1.73 | 0.584
DenseTNT (FDE) [3] | 1.976 | 0.126 | 1.282 | 0.882 | 3.632 | 1.679 | 0.584
TPCN [5] | 1.929 | 0.133 | 1.244 | 0.815 | 3.487 | 1.575 | 0.560
GANet (Ours) | 1.790 | 0.118 | 1.161 | 0.806 | 3.455 | 1.592 | 0.550

DirEC | 3.29 | 0.52 | 2.83 | 1.26 | 6.82 | 2.67 | 0.73
drivingfree | 3.03 | 0.49 | 2.58 | 1.17 | 6.26 | 2.47 | 0.72
LGU | 2.77 | 0.37 | 2.15 | 1.05 | 6.91 | 2.77 | 0.73
Autowise.AI (GNA) | 2.45 | 0.29 | 1.82 | 0.91 | 6.27 | 2.47 | 0.71
Timeformer [28] | 2.16 | 0.20 | 1.51 | 0.88 | 4.71 | 1.95 | 0.64
QCNet | 2.14 | 0.24 | 1.58 | 0.76 | 4.79 | 1.89 | 0.63
OPPred w/o Ensemble [31] | 2.03 | 0.180 | 1.389 | 0.733 | 4.70 | 1.84 | 0.615
TENET w/o Ensemble [30] | 2.01 | - | - | - | - | - | -
Polkach (VILaneIter) | 2.00 | 0.19 | 1.39 | 0.71 | 4.74 | 1.82 | 0.61
GANet (Ours) | 1.969 | 0.171 | 1.352 | 0.728 | 4.475 | 1.775 | 0.597

IV. EXPERIMENTS

A. Experimental settings

Dataset. Argoverse 1 [10] is a large-scale motion forecasting dataset, which consists of over 30K real-world driving sequences, split into train, validation, and test sequences without geographical overlap. Each training and validation sequence is 5 seconds long, while each test sequence presents only 2 seconds to the model, and another 3 seconds are withheld for the leaderboard evaluation. Each sequence includes one interesting tracked actor labeled as the "agent." Given an initial 2-second observation, the task is to predict the agent's future coordinates in the next 3 seconds.

Spanning 2,000+ km over six geographically diverse cities, Argoverse 2 [23] is a high-quality motion forecasting dataset in which every scenario is paired with a local map. Each scenario is 11 seconds long. We observe five seconds and predict six seconds for the leaderboard evaluation. Compared to Argoverse 1, the scenarios in Argoverse 2 are approximately twice as long and more diverse.

Metrics. We follow the widely used evaluation metrics [1], [3], [5]. Specifically, MR (miss rate) is the ratio of predictions where none of the predicted K trajectories is within 2.0 meters of the ground truth according to the endpoint's displacement error. Minimum Final Displacement Error (minFDE) is the L2 distance between the endpoint of the best-forecasted trajectory and the ground truth. Minimum Average Displacement Error (minADE) is the average L2 distance between the best-forecasted trajectory and the ground truth. The Argoverse Motion Forecasting leaderboard is ranked by the Brier minimum Final Displacement Error (brier-minFDE6), which adds a probability-related penalty to the endpoint's L2 distance error.

Implementation. We train our model on 2 A100 GPUs using a batch size of 128 with the Adam optimizer for 42 epochs. The initial learning rate is 1 × 10⁻³, decaying to 1 × 10⁻⁴ at 32 epochs.
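As a small illustration of the metrics described above, the sketch below computes the per-sample quantities with NumPy. Selecting the "best" mode by endpoint error and adding a (1 − p)² Brier penalty on that mode's probability are the conventions we assume here; the official Argoverse evaluation code differs in its interface.

```python
# Hypothetical per-sample metric computation (minADE, minFDE, MR, brier-minFDE).
import numpy as np

def forecasting_metrics(pred, prob, gt, miss_threshold: float = 2.0):
    """pred: (K, T, 2) predicted trajectories, prob: (K,) confidences summing to 1,
    gt: (T, 2) ground-truth trajectory. Returns a dict of per-sample metrics."""
    dist = np.linalg.norm(pred - gt[None], axis=-1)     # (K, T) per-step L2 error
    fde = dist[:, -1]                                   # endpoint error per mode
    best = int(fde.argmin())                            # mode with the best endpoint
    min_fde = float(fde[best])
    min_ade = float(dist[best].mean())                  # ADE of the best-endpoint mode
    miss = float(min_fde > miss_threshold)              # MR: no mode within 2.0 m
    brier_min_fde = min_fde + (1.0 - float(prob[best])) ** 2
    return {"minADE": min_ade, "minFDE": min_fde, "MR": miss, "b-minFDE": brier_min_fde}
```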
B. Comparison with State-of-the-art

We compare our approach with state-of-the-art methods. As shown in Table I, our GANet outperforms existing goal-based approaches such as TNT [2], LaneRCNN [1], and DenseTNT [3]. Specifically, we make a detailed comparison with LaneGCN because we adopt its backbone to encode motion history and scene context. Public results on the official motion forecasting challenge leaderboard show that our GANet method significantly beats LaneGCN, with decreases of 28%, 15%, 13% and 9% in MR6, minFDE6, brier-minFDE6, and minFDE1, respectively, which demonstrates the effectiveness of GANet. We also conduct experiments on the Argoverse 2 Motion Forecasting Dataset [23], where GANet is the winning entry that achieves state-of-the-art performance in the CVPR 2022 Argoverse Motion Forecasting Challenge, whose top ten entries are shown in Table I. Since many methods, such as TENET and OPPred, apply model ensembles to boost their performance, we report their results without an ensemble for a fair comparison.

C. Ablation studies

Component study. We perform ablation studies on the validation set to investigate the effectiveness of each component. Taking the LaneGCN model as a baseline, we add the other components progressively. First, to emphasize the motion's temporal modeling, we construct an enhanced version of LaneGCN called LaneGCN++. Specifically, we apply an LSTM network on the FPN's output features and use two identical parallel networks to enhance the motion history encoding.
Fig. 3. Qualitative results on the Argoverse 1 validation set. Lanes are shown in grey, the agent’s past trajectory is in orange, the ground truth future
trajectory is in red, and the predicted six trajectories are in green. The results of different methods are shown in different columns.
TABLE II
Ablation study results on the Argoverse 1 validation set.

Method | minFDE (K=6) | minADE (K=6) | minFDE (K=1) | minADE (K=1)
LaneGCN | 1.080 | 0.710 | 3.010 | 1.359
LaneGCN++ | 1.076 | 0.703 | 2.819 | 1.286
GANet 1 | 0.961 | 0.684 | 2.743 | 1.269
GANet 3 | 0.949 | 0.679 | 2.719 | 1.264
GANet M 3 | 0.934 | 0.673 | 2.707 | 1.259
GANet 2 | 0.971 | 0.689 | 2.756 | 1.280
GANet 6 | 0.966 | 0.683 | 2.784 | 1.289
GANet 9 | 0.967 | 0.685 | 2.759 | 1.282
VectorNet [17] | - | - | 3.67 | 1.66
VectorNet++ | 1.156 | 0.772 | 3.256 | 1.507
GANet 1 | 1.076 | 0.744 | 3.050 | 1.429
GANet M 3 | 1.042 | 0.732 | 3.100 | 1.449

As shown in Table II, LaneGCN++ improves performance on the ADE1 and FDE1 metrics. However, the enhanced, bigger network shows little improvement in multi-modal prediction.

Second, to verify GANet's effectiveness, we adopt LaneGCN++'s backbone and add a "one goal prediction" module to construct the GANet 1 model, which only predicts M = 1 goals. Since we only predict one goal in this model, we omit the classification loss term L_{cls_end} and L_{reg_mid} in L_1. The GANet 1 model outperforms LaneGCN++ dramatically, with more than 10% improvement on minFDE6. In addition, considering the multimodality, we apply a "three goals predictions" module in our GANet 3 model, which performs better. Moreover, we also add a "one goal prediction" module at the trajectory's middle position to aggregate the middle position's map information in GANet M 3, which further improves performance. Our models improve all the metrics compared to LaneGCN++.

Number of goals. We also evaluate the effect of the goal number. Table II shows the model performance under different numbers of goals, where the goal number only has marginal effects on the overall performance.

Backbone. To demonstrate the generality of GANet, we implement a VectorNet++ method as another backbone, whose polyline idea is similar to VectorNet [17]. We construct our GANet models adopting the VectorNet++ backbone. As shown in Table II, the performance improves by 9.9% and 5.2% in minFDE6 and minADE6, respectively, which shows the generality of GANet when adopting different scene context encoding methods.

D. Qualitative results

We visualize the predicted results on the validation set. For challenging sequences, almost all results of the GANet models are more reasonable and smoother, following map constraints better than the outputs of LaneGCN. We show the multi-modal predictions for two cases in Figure 3 and compare GANet with LaneGCN qualitatively. For illustration purposes, we only draw the agent's trajectory for an intuitive check, while other actors are omitted. The first row shows a case where the direction of the lane changes over a long distance. LaneGCN is unaware of this distant change and gives six straight predictions. The GANet 1 model captures this change and generates trajectories that follow the lane topology, while the GANet M 3 model generates smoother trajectories than GANet 1. The second row presents a case where the agent performs a right turn at a complex intersection. Due to the lack of motion history, maps are essential to produce reasonable trajectories. LaneGCN produces divergent, non-traffic-rule-compliant trajectories, while our method produces reasonable trajectories following the lane topology.

V. CONCLUSION

This paper proposes the Goal Area Network (GANet), a new framework for motion forecasting. GANet predicts potential goal areas as conditions for prediction. We design a GoICrop operator to extract and aggregate the rich semantic lane features in goal areas. It implicitly models the interactions between trajectories and maps in the goal area and the interactions between actors in the future in a data-driven manner. Experiments on the Argoverse motion forecasting benchmarks demonstrate GANet's effectiveness.
REFERENCES

[1] Zeng, W., Liang, M., Liao, R., & Urtasun, R. (2021). LaneRCNN: Distributed representations for graph-centric motion forecasting. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 532-539). IEEE.
[2] Zhao, H., Gao, J., Lan, T., Sun, C., Sapp, B., Varadarajan, B., ... & Anguelov, D. (2021, October). TNT: Target-driven Trajectory Prediction. In Conference on Robot Learning (pp. 895-904). PMLR.
[3] Gu, J., Sun, C., & Zhao, H. (2021). DenseTNT: End-to-end trajectory prediction from dense goal sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15303-15312).
[4] Liang, M., Yang, B., Hu, R., Chen, Y., Liao, R., Feng, S., & Urtasun, R. (2020, August). Learning lane graph representations for motion forecasting. In European Conference on Computer Vision (pp. 541-556). Springer, Cham.
[5] Ye, M., Cao, T., & Chen, Q. (2021). TPCN: Temporal point cloud networks for motion forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11318-11327).
[6] Liu, Y., Zhang, J., Fang, L., Jiang, Q., & Zhou, B. (2021). Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7577-7586).
[7] Gilles, T., Sabatini, S., Tsishkou, D., Stanciulescu, B., & Moutarde, F. (2021). HOME: Heatmap output for future motion estimation. In 2021 IEEE International Intelligent Transportation Systems Conference (pp. 500-507). IEEE.
[8] Gilles, T., Sabatini, S., Tsishkou, D., Stanciulescu, B., & Moutarde, F. (2022). GOHOME: Graph-oriented heatmap output for future motion estimation. In 2022 International Conference on Robotics and Automation (pp. 9107-9114). IEEE.
[9] Huang, Z., Mo, X., & Lv, C. (2022). Multi-modal motion prediction with transformer-based neural network for autonomous driving. In 2022 International Conference on Robotics and Automation (pp. 2605-2611). IEEE.
[10] Chang, M. F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., ... & Hays, J. (2019). Argoverse: 3D tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8748-8757).
[11] Helbing, D., & Molnar, P. (1995). Social force model for pedestrian dynamics. Physical Review E, 51(5), 4282.
[12] Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., & Savarese, S. (2016). Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 961-971).
[13] Zhang, P., Ouyang, W., Zhang, P., Xue, J., & Zheng, N. (2019). SR-LSTM: State refinement for LSTM towards pedestrian trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12085-12094).
[14] Wang, M., Shi, D., Guan, N., Zhang, T., Wang, L., & Li, R. (2019). Unsupervised pedestrian trajectory prediction with graph neural networks. In 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (pp. 832-839). IEEE.
[15] Mercat, J., Gilles, T., El Zoghby, N., Sandou, G., Beauvois, D., & Gil, G. P. (2020, May). Multi-head attention for multi-modal joint vehicle motion forecasting. In 2020 IEEE International Conference on Robotics and Automation (pp. 9638-9644). IEEE.
[16] Chai, Y., Sapp, B., Bansal, M., & Anguelov, D. (2019, January). MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. In CoRL.
[17] Gao, J., Sun, C., Zhao, H., Shen, Y., Anguelov, D., Li, C., & Schmid, C. (2020). VectorNet: Encoding HD maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11525-11533).
[18] Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., & Alahi, A. (2018). Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2255-2264).
[19] Phan-Minh, T., Grigore, E. C., Boulton, F. A., Beijbom, O., & Wolff, E. M. (2020). CoverNet: Multimodal behavior prediction using trajectory sets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14074-14083).
[20] Zhang, L., Su, P. H., Hoang, J., Haynes, G. C., & Marchetti-Bowick, M. (2021, October). Map-adaptive goal-based trajectory prediction. In Conference on Robot Learning (pp. 1371-1383). PMLR.
[21] Ngiam, J., Caine, B., Vasudevan, V., Zhang, Z., Chiang, H. T. L., Ling, J., ... & Shlens, J. (2021). Scene Transformer: A unified multi-task model for behavior prediction and planning. arXiv e-prints, arXiv-2106.
[22] Casas, S., Luo, W., & Urtasun, R. (2018, October). IntentNet: Learning to predict intention from raw sensor data. In Conference on Robot Learning (pp. 947-956). PMLR.
[23] Wilson, B., Qi, W., Agarwal, T., Lambert, J., Singh, J., Khandelwal, S., ... & Hays, J. (2021, August). Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[24] Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., ... & Anguelov, D. (2021). Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9710-9719).
[25] Zhou, Z., Ye, L., Wang, J., Wu, K., & Lu, K. (2022). HiVT: Hierarchical vector transformer for multi-agent motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[26] Lee, N., Choi, W., Vernaza, P., Choy, C. B., Torr, P. H., & Chandraker, M. (2017). DESIRE: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 336-345).
[27] Varadarajan, B., Hefny, A., Srivastava, A., Refaat, K. S., Nayakanti, N., Cornman, A., ... & Sapp, B. (2022, May). MultiPath++: Efficient information fusion and trajectory aggregation for behavior prediction. In 2022 International Conference on Robotics and Automation (pp. 7814-7821). IEEE.
[28] Gilles, T., Sabatini, S., Tsishkou, D., Stanciulescu, B., & Moutarde, F. (2021, September). THOMAS: Trajectory heatmap output with learned multi-agent sampling. In International Conference on Learning Representations.
[29] Ye, M., Xu, J., Xu, X., Cao, T., & Chen, Q. (2022). DCMS: Motion forecasting with dual consistency and multi-pseudo-target supervision. arXiv preprint arXiv:2204.05859.
[30] Wang, Y., Zhou, H., Zhang, Z., Feng, C., Lin, H., Gao, C., ... & Zhang, C. (2022). TENET: Transformer encoding network for effective temporal flow on motion prediction. arXiv e-prints, arXiv-2207.
[31] Zhang, C., Sun, H., Chen, C., & Guo, Y. (2022). Technical report for Argoverse2 Challenge 2022 – Motion Forecasting Task. arXiv preprint arXiv:2206.07934.
[32] Lu, Q., et al. (2022). KEMP: Keyframe-based hierarchical end-to-end deep model for long-term trajectory prediction. arXiv preprint arXiv:2205.04624.
