GANet: Goal Area Network for Motion Forecasting
Mingkun Wang1, Xinge Zhu2, Changqian Yu3, Wei Li4, Yuexin Ma5,
Ruochun Jin6, Xiaoguang Ren7, Dongchun Ren3, Mingxu Wang8 and Wenjing Yang6∗
Abstract— Predicting the future motion of road participants is crucial for autonomous driving but is extremely challenging due to staggering motion uncertainty. Recently, most motion forecasting methods resort to the goal-based strategy, i.e., predicting endpoints of motion trajectories as conditions to ...

... networks (CNN) to process. Others [17], [1] use vectorized and graph-structured data to represent maps. For instance, LaneGCN [4] applies a multi-stride graph neural network to encode maps. However, since the traveling mode of an actor is highly diverse, the fixed-size stride cannot effectively ...
... step $t = 0$. To extract $M_{a_0}$, they use $a_0$ as an anchor to retrieve its surrounding map elements and aggregate their features. We found that not only is the "local" map information important, but the map information in the goal area is also of great importance for accurate trajectory prediction. We therefore reconstruct the probability as:

$$\sum_{\tau} p(\tau \mid M_{a_0}, a_P, a_P^O)\; p(M_\tau \mid m, \tau)\; p(a_F \mid M_\tau, M_{a_0}, a_P, a_P^O) \tag{2}$$

We directly predict possible goals $\tau$ based on actors' motion histories and driving context; therefore, GANet is genuinely end-to-end, adaptive, and efficient. Then, we apply the predicted goals as anchors to explicitly retrieve the map elements in the goal areas and aggregate their map features as $M_\tau$.
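To make this anchor-based retrieval concrete, here is a minimal sketch of goal-area feature aggregation, assuming lane-node coordinates and features have already been encoded; the radius, the attention design, and all identifiers are our illustrative assumptions, not the paper's exact GoICrop operator.

```python
import torch
import torch.nn as nn

class GoalAreaPooling(nn.Module):
    """Illustrative sketch (not the paper's exact GoICrop): gather lane-node
    features inside a radius around each predicted goal and aggregate them
    with attention into a goal-area map feature M_tau."""

    def __init__(self, dim: int, radius: float = 10.0):
        super().__init__()
        self.radius = radius  # assumed goal-area radius in meters
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, goal_xy, node_xy, node_feat):
        # goal_xy: (N, 2) predicted goals; node_xy: (L, 2) lane-node BEV
        # coordinates; node_feat: (L, D) lane-node features Y.
        dist = torch.cdist(goal_xy, node_xy)                  # (N, L) distances
        in_area = dist < self.radius                          # nodes in each goal area
        logits = self.score(node_feat).squeeze(-1)            # (L,) per-node weights
        logits = logits.unsqueeze(0).expand_as(in_area)       # (N, L)
        logits = logits.masked_fill(~in_area, float("-inf"))  # drop outside nodes
        attn = torch.nan_to_num(torch.softmax(logits, dim=-1))
        return attn @ node_feat                               # (N, D) feature M_tau
```

Using the goal rather than the current position as the anchor is what lets distant map context, such as lanes around an intersection the actor is approaching, flow into the prediction.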
A. Motion history and scene context encoding

As shown in Figure 2, the first stage of motion forecasting is driving context encoding, which extracts actors' motion features and map features. We adopt LaneGCN's [4] backbone to encode motion history and scene context for its outstanding performance. Specifically, we apply a 1D CNN with a Feature Pyramid Network (FPN) to extract actors' motion features. Following [4], we use a multi-scale LaneConv network to encode the vectorized map data, which consists of lane centerlines and their connectivity, and construct a lane node graph from the map data. Finally, a fusion network transfers and aggregates features among actors and lane nodes. After driving context encoding, we obtain a 2D feature matrix $X$, where each row $X_i$ is the feature of the $i$-th actor, and a 2D matrix $Y$, where each row $Y_i$ is the feature of the $i$-th lane node. Other encoders can also be used for motion history and scene context; for example, we implement a VectorNet++ method in the ablation study section.
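As a reference point for the description above, the following is a simplified sketch of a 1D-CNN motion encoder with two-scale FPN-style fusion; the channel sizes and the reduction to two scales are our assumptions, not LaneGCN's exact architecture.

```python
import torch
import torch.nn as nn

class ActorMotionEncoder(nn.Module):
    """Simplified 1D-CNN encoder over an actor's past trajectory. A real
    FPN fuses several temporal scales; we keep two scales to stay short.
    All channel sizes are illustrative assumptions."""

    def __init__(self, in_dim: int = 3, dim: int = 128):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv1d(in_dim, dim, 3, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(                  # stride 2: coarser scale
            nn.Conv1d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.lateral = nn.Conv1d(dim, dim, 1)        # FPN-style lateral connection

    def forward(self, traj):
        # traj: (N, in_dim, T) past displacements (plus e.g. a padding flag)
        f1 = self.conv1(traj)                        # (N, dim, T) fine scale
        f2 = self.conv2(f1)                          # (N, dim, ~T/2) coarse scale
        f2_up = nn.functional.interpolate(f2, size=f1.size(-1))
        fused = self.lateral(f1) + f2_up             # top-down fusion
        return fused[..., -1]                        # (N, dim): feature at t = 0
```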
B. Goal prediction

In stage two, we predict possible goals for the $i$-th actor based on $X_i$. We apply intermediate supervision, backpropagating the smooth L1 loss between the best-predicted goal and the ground-truth trajectory's endpoint so that the predicted goal is as close to the actual goal as possible. The goal prediction stage serves as a predictive test to locate goal areas; this differs from goal-based methods, which use the predicted goals as the final predicted trajectories' endpoints. In practice, a driver's intent is highly multi-modal: he or she may stop, go ahead, turn left, or turn right when approaching an intersection. We therefore predict multiple goals. Concretely, we construct a goal prediction header with two branches to predict $E$ possible goals $G_{n,end} = \{g_{n,end}^{e}\}_{e \in [0, E-1]}$ and their confidence scores $C_{n,end} = \{c_{n,end}^{e}\}_{e \in [0, E-1]}$, where $g_{n,end}^{e}$ are the coordinates of the $e$-th predicted goal and $c_{n,end}^{e}$ is its confidence for the $n$-th actor.
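A minimal sketch of such a two-branch header follows; the hidden sizes, the plain MLPs standing in for the paper's residual MLPs, and predicting goals as offsets from the actor's current position are all our illustrative assumptions.

```python
import torch
import torch.nn as nn

class GoalHeader(nn.Module):
    """Two-branch goal header: one branch regresses E candidate goal
    offsets, the other scores their confidences. Shapes are illustrative."""

    def __init__(self, dim: int = 128, num_goals: int = 3):
        super().__init__()
        self.num_goals = num_goals
        self.reg = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_goals * 2))
        self.cls = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_goals))

    def forward(self, actor_feat, actor_xy):
        # actor_feat: (N, dim) actor features X; actor_xy: (N, 2) positions
        offsets = self.reg(actor_feat).view(-1, self.num_goals, 2)
        goals = actor_xy.unsqueeze(1) + offsets      # (N, E, 2) goal coordinates
        scores = self.cls(actor_feat)                # (N, E) confidence logits
        return goals, scores
```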
We train this stage with the sum of a classification loss and a regression loss. Given the $E$ predicted goals, we select as positive the goal $\hat{e}$ with the minimum Euclidean distance to the ground-truth trajectory's endpoint. For classification, we use the max-margin loss:

$$L_{cls\_end} = \frac{1}{N(E-1)} \sum_{n=1}^{N} \sum_{e \neq \hat{e}} \max\!\left(0,\; c_{n,end}^{e} + \epsilon - c_{n,end}^{\hat{e}}\right) \tag{3}$$

where $N$ is the total number of actors and $\epsilon = 0.2$ is the margin. The margin loss encourages each goal to capture a specific pattern and pushes the goal closest to the ground truth to have the highest score. For regression, we apply the smooth L1 loss only to the positive goals:

$$L_{reg\_end} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{reg}\!\left(g_{n,end}^{\hat{e}} - a_{n,end}^{*}\right) \tag{4}$$

where $a_{n,end}^{*}$ are the ground-truth BEV coordinates of the $n$-th actor trajectory's endpoint, $\mathrm{reg}(z) = \sum_i d(z_i)$, $z_i$ is the $i$-th element of $z$, and $d(\cdot)$ is the smooth L1 loss.

Additionally, we add a "one goal prediction" module at each trajectory's middle position that aggregates map features to assist both the endpoint goal prediction and the whole trajectory prediction. Similarly, we apply a residual MLP to regress a middle goal $g_{n,mid}$ for the $n$-th actor. The loss term for this module is:

$$L_{reg\_mid} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{reg}\!\left(g_{n,mid} - a_{n,mid}^{*}\right) \tag{5}$$

where $a_{n,mid}^{*}$ are the ground-truth BEV coordinates of the $n$-th actor trajectory's middle position.

The total loss at the goal prediction stage is:

$$L_1 = \alpha_1 L_{cls\_end} + \beta_1 L_{reg\_end} + \rho_1 L_{reg\_mid} \tag{6}$$

where $\alpha_1 = 1$, $\beta_1 = 0.2$, and $\rho_1 = 0.1$.
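Read together with the header sketch above, Eqs. (3), (4), and (6) could be computed as below; this is our hedged sketch, not the authors' released code, and the middle-goal term $\rho_1 L_{reg\_mid}$ of Eq. (5) would be an analogous smooth-L1 term on $g_{n,mid}$.

```python
import torch
import torch.nn.functional as F

def goal_stage_loss(goals, scores, gt_end, margin: float = 0.2,
                    alpha1: float = 1.0, beta1: float = 0.2):
    """Sketch of Eqs. (3)-(4) and their weighted sum from Eq. (6).
    goals: (N, E, 2), scores: (N, E), gt_end: (N, 2) GT endpoints."""
    dist = torch.norm(goals - gt_end.unsqueeze(1), dim=-1)   # (N, E) endpoint dists
    pos = dist.argmin(dim=-1, keepdim=True)                  # ê: closest goal index
    pos_score = scores.gather(1, pos)                        # (N, 1) scores c^ê
    hinge = (scores + margin - pos_score).clamp(min=0)       # max(0, c^e + eps - c^ê)
    hinge = hinge.scatter(1, pos, 0.0)                       # exclude the e = ê term
    N, E = scores.shape
    # Eq. (3); with a single goal (E = 1) the classification term is omitted
    l_cls = hinge.sum() / (N * (E - 1)) if E > 1 else goals.new_zeros(())
    # Eq. (4): smooth-L1 regression on the positive goal only
    pos_goal = goals.gather(1, pos.unsqueeze(-1).expand(-1, -1, 2)).squeeze(1)
    l_reg = F.smooth_l1_loss(pos_goal, gt_end)
    # Eq. (6) without the middle-goal term
    return alpha1 * l_cls + beta1 * l_reg
```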
Fig. 2. Overview of the GANet-M3 model. (a) A feature-extraction model encodes and fuses map and motion features. (b) The "one goal prediction" module predicts a goal area at the trajectory's middle position and aggregates its features. (c) The "three goals predictions" module predicts three goal areas, aggregates their features, and models the actors' future interactions. (d) The final prediction stage predicts K trajectories and their confidence scores.
IV. EXPERIMENTS

A. Experimental settings
Dataset. Argoverse 1 [10] is a large-scale motion forecasting dataset consisting of over 30K real-world driving sequences, split into train, validation, and test sets without geographical overlap. Each training and validation sequence is 5 seconds long, while each test sequence exposes only 2 seconds to the model; the remaining 3 seconds are withheld for the leaderboard evaluation. Each sequence includes one tracked actor of interest labeled as the "agent". Given an initial 2-second observation, the task is to predict the agent's future coordinates over the next 3 seconds.

Spanning 2,000+ km across six geographically diverse cities, Argoverse 2 [23] is a high-quality motion forecasting dataset in which each scenario is paired with a local map. Each scenario is 11 seconds long: models observe five seconds and predict six seconds for the leaderboard evaluation. Compared to Argoverse 1, the scenarios in Argoverse 2 are approximately twice as long and more diverse.
Metrics. We follow the widely used evaluation metrics [1], [3], [5]. Miss Rate (MR) is the ratio of predictions for which none of the K predicted trajectories is within 2.0 meters of the ground truth in terms of endpoint displacement error. Minimum Final Displacement Error (minFDE) is the L2 distance between the endpoint of the best-forecasted trajectory and the ground truth. Minimum Average Displacement Error (minADE) is the average L2 distance between the best-forecasted trajectory and the ground truth. The Argoverse Motion Forecasting leaderboard is ranked by Brier minimum Final Displacement Error (brier-minFDE6), which adds a probability-related penalty to the endpoint's L2 distance error.
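To make these definitions concrete, here is a minimal per-agent sketch (our illustration, not the official evaluation code); the benchmark averages these values over all agents, and the Brier penalty adds $(1 - p)^2$ for the probability $p$ assigned to the best forecast.

```python
import numpy as np

def forecasting_metrics(pred, gt, prob=None, miss_thresh: float = 2.0):
    """Per-agent sketch of the standard Argoverse metrics.
    pred: (K, T, 2) forecast set; gt: (T, 2) ground truth;
    prob: (K,) forecast confidences (needed for brier-minFDE)."""
    fde = np.linalg.norm(pred[:, -1] - gt[-1], axis=-1)      # endpoint errors, (K,)
    best = fde.argmin()                                      # best trajectory by FDE
    min_fde = fde[best]
    min_ade = np.linalg.norm(pred[best] - gt, axis=-1).mean()
    miss = float(min_fde > miss_thresh)                      # MR counts 2 m misses
    metrics = {"minFDE": min_fde, "minADE": min_ade, "MR": miss}
    if prob is not None:                                     # probability penalty
        metrics["brier-minFDE"] = min_fde + (1.0 - prob[best]) ** 2
    return metrics
```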
Implementation. We train our model on 2 A100 GPUs with a batch size of 128, using the Adam optimizer for 42 epochs. The initial learning rate is $1 \times 10^{-3}$, decaying to $1 \times 10^{-4}$ at epoch 32.
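This schedule maps onto a standard PyTorch setup such as the following sketch; `GANet`, `make_loader`, and `model.loss` are hypothetical stand-ins, and only the optimizer choice, batch size, epoch count, and decay step come from the text.

```python
import torch

model = GANet()                                    # hypothetical model class
loader = make_loader(batch_size=128)               # hypothetical DataLoader factory
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# drop the learning rate from 1e-3 to 1e-4 at epoch 32, as stated above
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[32], gamma=0.1)

for epoch in range(42):                            # 42 training epochs
    for batch in loader:
        loss = model.loss(batch)                   # hypothetical loss interface
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()
```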
B. Comparison with State-of-the-art

We compare our approach with state-of-the-art methods. As shown in Table I, GANet outperforms existing goal-based approaches such as TNT [2], LaneRCNN [1], and DenseTNT [3]. We make a detailed comparison with LaneGCN in particular because we adopt its backbone to encode motion history and scene context. Public results on the official motion forecasting challenge leaderboard show that GANet significantly beats LaneGCN, with decreases of 28%, 15%, 13%, and 9% in MR6, minFDE6, brier-minFDE6, and minFDE1, respectively, which demonstrates the effectiveness of GANet. We also conduct experiments on the Argoverse 2 Motion Forecasting Dataset [23]: GANet is the winner of the CVPR 2022 Argoverse Motion Forecasting Challenge, whose top ten entries are shown in Table I. Since many methods, such as TENET and OPPred, apply model ensembles to boost their performance, we report their results without an ensemble for a fair comparison.

Fig. 3. Qualitative results on the Argoverse 1 validation set. Lanes are shown in grey, the agent's past trajectory in orange, the ground-truth future trajectory in red, and the six predicted trajectories in green. The results of different methods are shown in different columns.

C. Ablation studies

Component study. We perform ablation studies on the validation set to investigate the effectiveness of each component. Taking the LaneGCN model as a baseline, we add the other components progressively. First, to emphasize temporal modeling of motion, we construct an enhanced version of LaneGCN called LaneGCN++: we apply an LSTM network to the FPN's output features and use two identical parallel networks to enhance the motion history encoding.
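A hedged sketch of this LaneGCN++ enhancement follows; only the LSTM over FPN outputs and the two identical parallel branches come from the text, while the fusion by summation and all sizes are our assumptions.

```python
import torch
import torch.nn as nn

class LaneGCNPlusPlusMotionEncoder(nn.Module):
    """Sketch of the LaneGCN++ idea described above: an LSTM run over the
    FPN output sequence, with two identical parallel encoders. The sum
    fusion and hidden size are illustrative assumptions."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.LSTM(dim, dim, batch_first=True) for _ in range(2)])

    def forward(self, fpn_feats):
        # fpn_feats: (N, T, dim) per-step features from the CNN/FPN encoder
        outs = []
        for lstm in self.branches:
            out, (h, _) = lstm(fpn_feats)
            outs.append(h[-1])                     # (N, dim) final hidden state
        return torch.stack(outs).sum(dim=0)        # fuse the two parallel branches
```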
TABLE II
ABLATION STUDY RESULTS ON THE ARGOVERSE 1 VALIDATION SET.

Method          minFDE (K=6)  minADE (K=6)  minFDE (K=1)  minADE (K=1)
LaneGCN             1.080         0.710         3.010         1.359
LaneGCN++           1.076         0.703         2.819         1.286
GANet-1             0.961         0.684         2.743         1.269
GANet-3             0.949         0.679         2.719         1.264
GANet-M3            0.934         0.673         2.707         1.259
GANet-2             0.971         0.689         2.756         1.280
GANet-6             0.966         0.683         2.784         1.289
GANet-9             0.967         0.685         2.759         1.282
VectorNet [17]        -             -           3.67          1.66
VectorNet++         1.156         0.772         3.256         1.507
GANet-1             1.076         0.744         3.050         1.429
GANet-M3            1.042         0.732         3.100         1.449
As shown in Table II, LaneGCN++ improves the ADE1 and FDE1 metrics. However, this enhanced, larger network shows little improvement in multi-modal prediction. Second, to verify GANet's effectiveness, we adopt LaneGCN++'s backbone and add a "one goal prediction" module to construct the GANet-1 model, which predicts only M = 1 goal. Since this model predicts a single goal, we omit the classification loss term $L_{cls\_end}$ and $L_{reg\_mid}$ in $L_1$. GANet-1 outperforms LaneGCN++ dramatically, with more than 10% improvement in minFDE6. In addition, to account for multi-modality, we apply a "three goals predictions" module in our GANet-3 model, which performs better still. Moreover, we add a "one goal prediction" module at the trajectory's middle position to aggregate the middle position's map information in GANet-M3, which further improves performance. Our models improve all the metrics compared to LaneGCN++.

Number of goals. We also evaluate the effect of the number of goals. Table II shows model performance under different numbers of goals; the goal number has only marginal effects on overall performance.

Backbone. To demonstrate the generality of GANet, we implement a VectorNet++ method as another backbone, whose polyline idea is similar to VectorNet [17]. We construct our GANet models on the VectorNet++ backbone. As shown in Table II, performance improves by 9.9% and 5.2% in minFDE6 and minADE6, respectively, which shows the generality of GANet across different scene context encoding methods.

D. Qualitative results

We visualize predicted results on the validation set. For challenging sequences, almost all results of the GANet models are more reasonable and smoother in following map constraints than the outputs of LaneGCN. We show the multi-modal predictions for two cases in Figure 3 and compare GANet with LaneGCN qualitatively. For illustration purposes, we only draw the agent's trajectory; other actors are omitted. The first row shows a case where the direction of the lane changes over a long distance. LaneGCN is unaware of this distant change and gives six straight predictions. The GANet-1 model captures this change and generates trajectories that follow the lane topology, while the GANet-M3 model generates smoother trajectories than GANet-1. The second row presents a case where the agent performs a right turn at a complex intersection. Due to the lack of motion history, maps are essential for producing reasonable trajectories. LaneGCN produces divergent trajectories that do not comply with traffic rules, while our method produces reasonable trajectories following the lane topology.

V. CONCLUSION

This paper proposes the Goal Area Network (GANet), a new framework for motion forecasting. GANet predicts potential goal areas as conditions for prediction. We design a GoICrop operator to extract and aggregate the rich semantic lane features in goal areas; it implicitly models the interactions between trajectories and maps in the goal area, as well as future interactions between actors, in a data-driven manner. Experiments on the Argoverse motion forecasting benchmarks demonstrate GANet's effectiveness.
REFERENCES

[1] Zeng, W., Liang, M., Liao, R., & Urtasun, R. (2021). LaneRCNN: Distributed representations for graph-centric motion forecasting. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 532-539). IEEE.
[2] Zhao, H., Gao, J., Lan, T., Sun, C., Sapp, B., Varadarajan, B., ... & Anguelov, D. (2021, October). TNT: Target-driven Trajectory Prediction. In Conference on Robot Learning (pp. 895-904). PMLR.
[3] Gu, J., Sun, C., & Zhao, H. (2021). DenseTNT: End-to-end trajectory prediction from dense goal sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15303-15312).
[4] Liang, M., Yang, B., Hu, R., Chen, Y., Liao, R., Feng, S., & Urtasun, R. (2020, August). Learning lane graph representations for motion forecasting. In European Conference on Computer Vision (pp. 541-556). Springer, Cham.
[5] Ye, M., Cao, T., & Chen, Q. (2021). TPCN: Temporal point cloud networks for motion forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11318-11327).
[6] Liu, Y., Zhang, J., Fang, L., Jiang, Q., & Zhou, B. (2021). Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7577-7586).
[7] Gilles, T., Sabatini, S., Tsishkou, D., Stanciulescu, B., & Moutarde, F. (2021). HOME: Heatmap output for future motion estimation. In 2021 IEEE International Intelligent Transportation Systems Conference (pp. 500-507). IEEE.
[8] Gilles, T., Sabatini, S., Tsishkou, D., Stanciulescu, B., & Moutarde, F. (2022). GOHOME: Graph-oriented heatmap output for future motion estimation. In 2022 International Conference on Robotics and Automation (pp. 9107-9114). IEEE.
[9] Huang, Z., Mo, X., & Lv, C. (2022). Multi-modal motion prediction with transformer-based neural network for autonomous driving. In 2022 International Conference on Robotics and Automation (pp. 2605-2611). IEEE.
[10] Chang, M. F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., ... & Hays, J. (2019). Argoverse: 3D tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8748-8757).
[11] Helbing, D., & Molnar, P. (1995). Social force model for pedestrian dynamics. Physical Review E, 51(5), 4282.
[12] Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., & Savarese, S. (2016). Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 961-971).
[13] Zhang, P., Ouyang, W., Zhang, P., Xue, J., & Zheng, N. (2019). SR-LSTM: State refinement for LSTM towards pedestrian trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12085-12094).
[14] Wang, M., Shi, D., Guan, N., Zhang, T., Wang, L., & Li, R. (2019). Unsupervised pedestrian trajectory prediction with graph neural networks. In 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (pp. 832-839). IEEE.
[15] Mercat, J., Gilles, T., El Zoghby, N., Sandou, G., Beauvois, D., & Gil, G. P. (2020, May). Multi-head attention for multi-modal joint vehicle motion forecasting. In 2020 IEEE International Conference on Robotics and Automation (pp. 9638-9644). IEEE.
[16] Chai, Y., Sapp, B., Bansal, M., & Anguelov, D. (2019, January). MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction. In CoRL.
[17] Gao, J., Sun, C., Zhao, H., Shen, Y., Anguelov, D., Li, C., & Schmid, C. (2020). VectorNet: Encoding HD maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11525-11533).
[18] Gupta, A., Johnson, J., Fei-Fei, L., Savarese, S., & Alahi, A. (2018). Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2255-2264).
[19] Phan-Minh, T., Grigore, E. C., Boulton, F. A., Beijbom, O., & Wolff, E. M. (2020). CoverNet: Multimodal behavior prediction using trajectory sets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14074-14083).
[20] Zhang, L., Su, P. H., Hoang, J., Haynes, G. C., & Marchetti-Bowick, M. (2021, October). Map-Adaptive Goal-Based Trajectory Prediction. In Conference on Robot Learning (pp. 1371-1383). PMLR.
[21] Ngiam, J., Caine, B., Vasudevan, V., Zhang, Z., Chiang, H. T. L., Ling, J., ... & Shlens, J. (2021). Scene Transformer: A unified multi-task model for behavior prediction and planning. arXiv e-prints, arXiv-2106.
[22] Casas, S., Luo, W., & Urtasun, R. (2018, October). IntentNet: Learning to predict intention from raw sensor data. In Conference on Robot Learning (pp. 947-956). PMLR.
[23] Wilson, B., Qi, W., Agarwal, T., Lambert, J., Singh, J., Khandelwal, S., ... & Hays, J. (2021, August). Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[24] Ettinger, S., Cheng, S., Caine, B., Liu, C., Zhao, H., Pradhan, S., ... & Anguelov, D. (2021). Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9710-9719).
[25] Zhou, Z., Ye, L., Wang, J., Wu, K., & Lu, K. (2022). HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[26] Lee, N., Choi, W., Vernaza, P., Choy, C. B., Torr, P. H., & Chandraker, M. (2017). DESIRE: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 336-345).
[27] Varadarajan, B., Hefny, A., Srivastava, A., Refaat, K. S., Nayakanti, N., Cornman, A., ... & Sapp, B. (2022, May). MultiPath++: Efficient information fusion and trajectory aggregation for behavior prediction. In 2022 International Conference on Robotics and Automation (pp. 7814-7821). IEEE.
[28] Gilles, T., Sabatini, S., Tsishkou, D., Stanciulescu, B., & Moutarde, F. (2021, September). THOMAS: Trajectory Heatmap Output with learned Multi-Agent Sampling. In International Conference on Learning Representations.
[29] Ye, M., Xu, J., Xu, X., Cao, T., & Chen, Q. (2022). DCMS: Motion Forecasting with Dual Consistency and Multi-Pseudo-Target Supervision. arXiv preprint arXiv:2204.05859.
[30] Wang, Y., Zhou, H., Zhang, Z., Feng, C., Lin, H., Gao, C., ... & Zhang, C. (2022). TENET: Transformer Encoding Network for Effective Temporal Flow on Motion Prediction. arXiv e-prints, arXiv-2207.
[31] Zhang, C., Sun, H., Chen, C., & Guo, Y. (2022). Technical Report for Argoverse2 Challenge 2022–Motion Forecasting Task. arXiv preprint arXiv:2206.07934.
[32] Lu, Q., et al. (2022). KEMP: Keyframe-Based Hierarchical End-to-End Deep Model for Long-Term Trajectory Prediction. arXiv preprint arXiv:2205.04624.