Figure 2. An overview of our proposed VectorNet. Observed agent trajectories and map features are represented as sequences of vectors, and passed to a local graph network to obtain polyline-level features. These features are then passed to a fully-connected graph to model the higher-order interactions. We compute two types of losses: predicting future trajectories from the node features corresponding to the moving agents, and predicting the node features that have been masked out.
agent dynamics and structured scene context directly from their vectorized form (Figure 1, right). The geographic extent of the road features can be a point, a polygon, or a curve in geographic coordinates. For example, a lane boundary contains multiple control points that build a spline; a crosswalk is a polygon defined by several points; a stop sign is represented by a single point. All these geographic entities can be closely approximated as polylines defined by multiple control points, along with their attributes. Similarly, the dynamics of moving agents can also be approximated by polylines based on their motion trajectories. All these polylines can then be represented as sets of vectors.

We use graph neural networks (GNNs) to incorporate these sets of vectors. We treat each vector as a node in the graph, and set the node features to be the start location and end location of each vector, along with other attributes such as polyline group id and semantic labels. The context information from HD maps, along with the trajectories of other moving agents, is propagated to the target agent node through the GNN. We can then take the output node feature corresponding to the target agent to decode its future trajectories.

Specifically, to learn competitive representations with GNNs, we observe that it is important to constrain the connectivities of the graph based on the spatial and semantic proximity of the nodes. We therefore propose a hierarchical graph architecture, where the vectors belonging to the same polylines with the same semantic labels are connected and embedded into polyline features, and all polylines are then fully connected with each other to exchange information. We implement the local graphs with multi-layer perceptrons, and the global graphs with self-attention [30]. An overview of our approach is shown in Figure 2.

Finally, motivated by the recent success of self-supervised learning from sequential linguistic [11] and visual data [27], we propose an auxiliary graph completion objective in addition to the behavior prediction objective. More specifically, we randomly mask out the input node features belonging to either scene context or agent trajectories, and ask the model to reconstruct the masked features. The intuition is to encourage the graph networks to better capture the interactions between agent dynamics and scene context. In summary, our contributions are:

• We are the first to demonstrate how to directly incorporate vectorized scene context and agent dynamics information for behavior prediction.

• We propose the hierarchical graph network VectorNet and the node completion auxiliary task.

• We evaluate the proposed method on our in-house behavior prediction dataset and the Argoverse dataset, and show that our method achieves on-par or better performance than a competitive rendering baseline, with a 70% model size saving and an order of magnitude reduction in FLOPs. Our method also achieves state-of-the-art performance on Argoverse.

2. Related work

Behavior prediction for autonomous driving. Behavior prediction for moving agents has become increasingly important for autonomous driving applications [7, 9, 19], and high-fidelity maps have been widely used to provide context information. For example, IntentNet [5] proposes to jointly detect vehicles and predict their trajectories from LiDAR points and rendered HD maps. Hong et al. [15] assume that vehicle detections are provided and focus on behavior prediction by encoding entity interactions with ConvNets. Similarly, MultiPath [6] also uses ConvNets as encoder,
but adopts pre-defined trajectory anchors to regress multiple possible future trajectories. PRECOG [23] attempts to capture the future stochasticity with flow-based generative models. Similar to [6, 15, 23], we also assume the agent detections to be provided by an existing perception algorithm. However, unlike these methods, which all use ConvNets to encode rendered road maps, we propose to directly encode vectorized scene context and agent dynamics.

Forecasting multi-agent interactions. Beyond the autonomous driving domain, there is more general interest in predicting the intents of interacting agents, such as pedestrians [2, 13, 24], human activities [28] or sports players [12, 26, 32, 33]. In particular, Social LSTM [2] models the trajectories of individual agents as separate LSTM networks, and aggregates the LSTM hidden states based on the spatial proximity of the agents to model their interactions. Social GAN [13] simplifies the interaction module and proposes an adversarial discriminator to predict diverse futures. Sun et al. [26] combine graph networks [4] with variational RNNs [8] to model diverse interactions. The social interactions can also be inferred from data: Kipf et al. [18] treat such interactions as latent variables, and graph attention networks [16, 31] apply a self-attention mechanism to weight the edges of a pre-defined graph. Our method goes one step further by proposing a unified hierarchical graph network to jointly model the interactions of multiple agents and their interactions with the entities from road maps.

Representation learning for sets of entities. Traditionally, machine perception algorithms have focused on high-dimensional continuous signals, such as images, videos or audio. One exception is 3D perception, where the inputs are usually in the form of unordered point sets given by depth sensors. For example, Qi et al. propose the PointNet model [20] and PointNet++ [21] to apply permutation-invariant operations (e.g. max pooling) on learned point embeddings. Unlike point sets, entities on HD maps and agent trajectories form closed shapes or are directed, and they may also be associated with attribute information. We therefore propose to keep such information by vectorizing the inputs, and to encode the attributes as node features in a graph.

Self-supervised context modeling. Recently, many works in the NLP domain have proposed modeling language context in a self-supervised fashion [11, 22]. Their learned representations achieve significant performance improvements when transferred to downstream tasks. Inspired by these methods, we propose an auxiliary loss for graph representations, which learns to predict the missing node features from their neighbors. The goal is to incentivize the model to better capture the interactions among nodes.

3. VectorNet approach

This section introduces our VectorNet approach. We first describe how to vectorize agent trajectories and HD maps. Next we present the hierarchical graph network, which aggregates local information from individual polylines and then globally over all trajectories and map features. This graph can then be used for behavior prediction.

3.1. Representing trajectories and maps

Most of the annotations from an HD map are in the form of splines (e.g. lanes), closed shapes (e.g. regions of intersections) and points (e.g. traffic lights), with additional attribute information such as the semantic labels of the annotations and their current states (e.g. color of the traffic light, speed limit of the road). For agents, their trajectories are in the form of directed splines with respect to time. All of these elements can be approximated as sequences of vectors: for map features, we pick a starting point and direction, uniformly sample key points from the splines at the same spatial distance, and sequentially connect the neighboring key points into vectors; for trajectories, we can just sample key points with a fixed temporal interval (0.1 second), starting from t = 0, and connect them into vectors. Given small enough spatial or temporal intervals, the resulting polylines serve as close approximations of the original map and trajectories.

Our vectorization process is a one-to-one mapping between the continuous trajectories, the map annotations and the vector set, although the latter is unordered. This allows us to form a graph representation on top of the vector sets, which can be encoded by graph neural networks. More specifically, we treat each vector v_i belonging to a polyline P_j as a node in the graph with node features given by

    v_i = [d_i^s, d_i^e, a_i, j],    (1)

where d_i^s and d_i^e are the coordinates of the start and end points of the vector (d itself can be represented as (x, y) for 2D coordinates, or (x, y, z) for 3D coordinates); a_i corresponds to attribute features, such as object type and timestamps for trajectories, or road feature type and speed limit for lanes; j is the integer id of P_j, indicating v_i ∈ P_j.

To make the input node features invariant to the locations of target agents, we normalize the coordinates of all vectors to be centered around the location of the target agent at its last observed time step. A future work is to share the coordinate centers for all interacting agents, such that their trajectories can be predicted in parallel.

3.2. Constructing the polyline subgraphs

To exploit the spatial and semantic locality of the nodes, we take a hierarchical approach by first constructing subgraphs at the vector level, where all vector nodes belonging to the same polyline are connected with each other. Considering a polyline P with its nodes {v_1, v_2, ..., v_P}, we define a single layer of subgraph propagation operation as

    v_i^(l+1) = ϕ_rel( g_enc(v_i^(l)), ϕ_agg({ g_enc(v_j^(l)) }) ),    (2)
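The vectorization of Eq. (1) and one layer of the subgraph propagation of Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names, the single linear-plus-ReLU stand-in for the MLP g_enc, the 64-dimensional embedding size, and the choice of max pooling for ϕ_agg and concatenation for ϕ_rel are ours.

```python
import numpy as np

def vectorize_polyline(points, attributes, polyline_id):
    """Turn sampled key points of one polyline into vector node features
    [d^s, d^e, a, j] (Eq. 1): start point, end point, attributes, polyline id."""
    vectors = []
    for start, end in zip(points[:-1], points[1:]):
        vectors.append(np.concatenate([start, end, attributes, [polyline_id]]))
    return np.stack(vectors)

def subgraph_layer(nodes, rng):
    """One layer of subgraph propagation (Eq. 2, sketched):
    encode each node (stand-in for g_enc), max-pool over the polyline
    (a permutation-invariant ϕ_agg), and concatenate the pooled feature
    back onto every node (ϕ_rel as concatenation)."""
    d_in, d_out = nodes.shape[1], 64               # 64 is an arbitrary choice here
    w = rng.normal(0.0, 0.1, size=(d_in, d_out))   # stand-in weights for g_enc
    enc = np.maximum(nodes @ w, 0.0)               # linear + ReLU
    pooled = enc.max(axis=0)                       # ϕ_agg: max pooling over nodes
    return np.concatenate([enc, np.broadcast_to(pooled, enc.shape)], axis=1)
```

For example, a three-point polyline yields two vector nodes of dimension 2 + 2 + |a| + 1, and after one subgraph layer every node carries its own encoding plus the shared pooled polyline feature.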
[Figure: computation flow of the polyline subgraph — vector node features are encoded, passed through a permutation-invariant aggregator, and concatenated to form the output node features.]

where {p_i} is the set of polyline node features, GNN(·) corresponds to a single layer of a graph neural network, and A corresponds to the adjacency matrix for the set of polyline nodes.

The adjacency matrix A can be provided as a heuristic, such as using the spatial distances [2] between the nodes. For simplicity, we assume A to be a fully-connected graph. Our graph network is implemented as a self-attention operation [30]:
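The self-attention operation referenced above (the equation itself is truncated in this excerpt) can be sketched as a generic single-head scaled dot-product attention over the polyline node features; with dense attention weights, every polyline attends to every other, realizing the fully-connected adjacency. The names and the 1/√d scaling follow the Transformer formulation [30] and are our assumptions, not a quote of the paper's exact equation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_graph_layer(p, wq, wk, wv):
    """One self-attention layer over polyline node features p (n x d).
    The dense n x n attention matrix plays the role of the
    fully-connected adjacency A among polyline nodes."""
    q, k, v = p @ wq, p @ wk, p @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)
    return attn @ v
```

A single layer maps the n polyline features to n updated features of the same dimensionality, so layers can be stacked.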