IGP2 Environment
Abstract— We propose an integrated prediction and planning system for autonomous driving which uses rational inverse planning to recognise the goals of other vehicles. Goal recognition informs a Monte Carlo Tree Search (MCTS) algorithm to plan optimal maneuvers for the ego vehicle. Inverse planning and MCTS utilise a shared set of defined maneuvers and macro actions to construct plans which are explainable by means of rationality principles. Evaluation in simulations of urban driving scenarios demonstrates the system's ability to robustly recognise the goals of other vehicles, enabling our vehicle to exploit non-trivial opportunities to significantly reduce driving times. In each scenario, we extract intuitive explanations for the predictions which justify the system's decisions.

S.A. is supported by a Royal Society Industry Fellowship. C.B., J.W., B.G. were interns at Five AI with partial financial support from the Royal Society and UKRI. IGP2 code: https://github.com/uoe-agents/IGP2

I. INTRODUCTION

The ability to predict the intentions and driving trajectories of other vehicles is a key problem for autonomous driving [1]. This problem is significantly complicated by the need to make fast and accurate predictions based on limited observation data which originate from coupled multi-agent interactions. To make prediction tractable in such conditions, a standard approach in autonomous driving research is to assume that vehicles use one of a finite number of distinct high-level maneuvers, such as lane-follow, lane-change, turn, stop, etc. [2], [3], [4], [5], [6], [7]. A classifier of some type is used to detect a vehicle's current executed maneuver based on its observed driving trajectory. The limitation in such methods is that they only detect the current maneuver of other vehicles, hence planners using such predictions are effectively limited to the timescales of the detected maneuvers. An alternative approach is to specify a finite set of possible goals for each other vehicle (such as road exit points) and to plan a full trajectory to each goal from the vehicle's observed local state [8], [9], [10]. While this approach can generate longer-term predictions, a limitation is that the generated trajectories must be matched relatively closely by a vehicle in order to yield high-confidence predictions of the vehicle's goals.

Recent methods based on deep learning have shown promising results for trajectory prediction in autonomous driving [11], [12], [13], [14], [15], [16], [17]. Prediction models are trained on large datasets that are becoming available through data gathering campaigns involving sensorised vehicles (e.g. video, lidar, radar). Reliable prediction over several-second horizons remains a hard problem, in part due to the difficulties in capturing the coupled evolution of traffic. In our view, one of the most significant limitations of this class of methods (though see recent progress [18]) is the difficulty in extracting interpretable predictions in a form that is amenable to efficient integration with planning methods that effectively represent multi-dimensional and hierarchical task objectives.

Our starting point is that in order to predict the future maneuvers of a vehicle, we must reason about why – that is, to what end – the vehicle performed its past maneuvers, which will yield clues as to its intended goal [19]. Knowing the goals of other vehicles enables prediction of their future maneuvers and trajectories, which facilitates planning over extended timescales. We show in our work (illustrated in Figure 2) how such reasoning can help to address the problem of overly-conservative autonomous driving [20]. Further, to the extent that our predictions are structured around the interpretation of observed trajectories in terms of high-level maneuvers, the goal recognition process lends itself to intuitive interpretation for the purposes of system analysis and debugging, at a level of detail suggested in Figure 2. As we develop towards making our autonomous systems more trustworthy [21], these notions of interpretation and the ability to justify (explain) the system's decisions are key [22].

To this end, we propose Interpretable Goal-based Prediction and Planning (IGP2) which leverages the computational advantages of using a finite space of maneuvers, but extends the approach to planning and prediction of sequences (i.e., plans) of maneuvers. We achieve this via a novel integration of rational inverse planning [23], [24] to recognise the goals of other vehicles, with Monte Carlo Tree Search (MCTS) [25] to plan optimal maneuvers for the ego vehicle. Inverse planning and MCTS utilise a shared set of defined maneuvers to construct plans which are explainable by means of rationality principles, i.e. plans are optimal with respect to given metrics.

We evaluate IGP2 in simulations of diverse urban driving scenarios, showing that (1) the system robustly recognises the goals of other vehicles, even if significant parts of a vehicle's trajectory are occluded, (2) goal recognition enables our vehicle to exploit opportunities to improve driving efficiency as measured by driving time compared to other prediction baselines, and (3) we are able to extract intuitive explanations for the predictions to justify the system's decisions.

In summary, our contributions are:
• A method for goal recognition and multi-modal trajectory prediction via rational inverse planning.
• Integration of goal recognition with MCTS planning to generate optimised plans for the ego vehicle.
• Evaluation in simulated urban driving scenarios showing accurate goal recognition, improved driving efficiency, and ability to interpret the predictions and ego plans.
condition of the first maneuver in the macro action as well as optional additional conditions. The termination condition of a macro action is given by the termination condition of the last maneuver in the macro action.
C. Velocity Smoothing

To obtain a feasible trajectory across maneuvers for vehicle i, we define a velocity smoothing operation which optimises the target velocities in a given trajectory ŝ^i_{1:n}. Let x̂_t be the longitudinal position on the reference path at ŝ^i_t and v̂_t its target velocity, for 1 ≤ t ≤ n. We define κ : x → v as the piecewise linear interpolation of target velocities between points x̂_t. Given the time elapsed between two time steps, Δt; the maximum velocity and acceleration, v_max / a_max; and setting x_1 = x̂_1, v_1 = v̂_1, we define velocity smoothing as

  min_{x_{2:n}, v_{2:n}}  Σ_{t=1}^{n} ‖v_t − κ(x_t)‖² + λ Σ_{t=1}^{n−1} ‖v_{t+1} − v_t‖²
  s.t.  x_{t+1} = x_t + v_t Δt                                            (4)
        0 < v_t < v_max,  v_t ≤ κ(x_t)
        |v_{t+1} − v_t| < a_max Δt

where λ > 0 is the weight given to the acceleration part of the optimisation objective. Eq. (4) is a nonlinear non-convex optimisation problem which can be solved, e.g., using a primal-dual interior point method (we use IPOPT [28]). From the solution of the problem, (x_{2:n}, v_{2:n}), we interpolate to obtain the achievable velocities at the original points x̂_t.
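As an illustration, the optimisation in Eq. (4) can be prototyped with an off-the-shelf solver. The sketch below is a minimal, simplified version using SciPy's SLSQP rather than IPOPT; the function name, its defaults, and the inputs (x_hat, v_hat from a candidate trajectory) are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def smooth_velocity(x_hat, v_hat, dt=0.1, v_max=15.0, a_max=5.0, lam=10.0):
    """Minimal sketch of the velocity smoothing problem in Eq. (4).

    x_hat, v_hat: target longitudinal positions / velocities of the unsmoothed
    trajectory (x_hat assumed strictly increasing). Returns smoothed (x, v).
    """
    n = len(x_hat)

    def kappa(x):
        # piecewise-linear interpolation of target velocities between points x̂_t
        return np.interp(x, x_hat, v_hat)

    def unpack(z):
        # decision variables are x_2..x_n and v_2..v_n; x_1, v_1 are fixed
        x = np.concatenate(([x_hat[0]], z[:n - 1]))
        v = np.concatenate(([v_hat[0]], z[n - 1:]))
        return x, v

    def objective(z):
        x, v = unpack(z)
        return np.sum((v - kappa(x)) ** 2) + lam * np.sum(np.diff(v) ** 2)

    constraints = [
        # x_{t+1} = x_t + v_t * dt
        {"type": "eq",   "fun": lambda z: (lambda x, v: x[1:] - x[:-1] - v[:-1] * dt)(*unpack(z))},
        # v_t <= kappa(x_t)
        {"type": "ineq", "fun": lambda z: (lambda x, v: kappa(x) - v)(*unpack(z))},
        # |v_{t+1} - v_t| <= a_max * dt
        {"type": "ineq", "fun": lambda z: (lambda x, v: a_max * dt - np.abs(np.diff(v)))(*unpack(z))},
    ]
    # 0 < v_t < v_max approximated with a small positive lower bound
    bounds = [(None, None)] * (n - 1) + [(1e-3, v_max)] * (n - 1)

    z0 = np.concatenate((x_hat[1:], v_hat[1:]))
    res = minimize(objective, z0, method="SLSQP", bounds=bounds, constraints=constraints)
    return unpack(res.x)
```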
D. Goal Recognition

We assume that each non-ego vehicle i seeks to reach one of a finite number of possible goals G^i ∈ 𝒢^i, using plans constructed from our defined macro actions. We use the framework of rational inverse planning [23], [24] to compute a Bayesian posterior distribution over i's goals at time t

  p(G^i | s_{1:t}) ∝ L(s_{1:t} | G^i) p(G^i)                              (5)

where L(s_{1:t} | G^i) is the likelihood of i's observed trajectory assuming its goal is G^i, and p(G^i) specifies the prior probability of G^i.

The likelihood is a function of the reward difference between two plans: the reward r̂ of the optimal trajectory from i's initial observed state s^i_1 to goal G^i after velocity smoothing, and the reward r̄ of the trajectory which follows the observed trajectory until time t and then continues optimally to goal G^i, with smoothing applied only to the trajectory after t. The likelihood is defined as

  L(s_{1:t} | G^i) = exp(β(r̄ − r̂))                                       (6)

where β is a scaling parameter (we use β = 1). This likelihood definition assumes that vehicles drive approximately rationally (i.e., optimally) to achieve their goals while allowing for some deviation. If a goal is infeasible, we set its probability to zero.
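To make the computation concrete, the following minimal sketch (not the paper's implementation) evaluates Eqs. (5)–(6) for a set of candidate goals, assuming the rewards r̂ and r̄ have already been computed for each goal:

```python
import numpy as np

def goal_posterior(r_hat, r_bar, priors, beta=1.0):
    """Compute p(G^i | s_{1:t}) ∝ exp(β(r̄ − r̂)) p(G^i) over candidate goals.

    r_hat[g]: reward r̂ of the optimal smoothed trajectory to goal g.
    r_bar[g]: reward r̄ of the observed trajectory continued optimally to g,
              or None if g is infeasible.
    priors[g]: prior probability p(G^i = g).
    """
    posterior = {}
    for g, prior in priors.items():
        if r_bar.get(g) is None:                              # infeasible goal -> probability zero
            posterior[g] = 0.0
        else:
            likelihood = np.exp(beta * (r_bar[g] - r_hat[g]))  # Eq. (6)
            posterior[g] = likelihood * prior                  # Eq. (5), unnormalised
    total = sum(posterior.values())
    return {g: p / total for g, p in posterior.items()} if total > 0 else posterior
```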
Algorithm 1 shows the pseudo code for our goal recognition algorithm, with further details in the subsections below.

Algorithm 1 Goal recognition algorithm
Input: vehicle i, current maneuver ω^i, observations s_{1:t}
Returns: goal probabilities p(G^i | s_{1:t}, ω^i)
 1: Generate possible goals G^i ∈ 𝒢^i from state s^i_t
 2: Set prior probabilities p(G^i) (e.g. uniform)
 3: for all G^i ∈ 𝒢^i do
 4:   ŝ^i_{1:n} ← A*(ω^i) from ŝ^i_1 = s^i_1 to G^i
 5:   Apply velocity smoothing to ŝ^i_{1:n}
 6:   r̂ ← reward R^i(ŝ^i_{1:n})
 7:   s̄^i_{1:m} ← A*(ω^i) from s̄^i_t to G^i, with s̄^i_{1:t} = s^i_{1:t}
 8:   Apply velocity smoothing to s̄^i_{t+1:m}
 9:   r̄ ← reward R^i(s̄^i_{1:m})
10:   L(s_{1:t} | G^i, ω^i) ← exp(β(r̄ − r̂))
11: Return p(G^i | s_{1:t}, ω^i) ∝ L(s_{1:t} | G^i, ω^i) p(G^i)

1) Goal Generation: A heuristic function is used to generate a set of possible goals 𝒢^i for vehicle i based on its location and context information such as road layout. In our system, we include goals for the visible end of the current road and connecting roads (bounded by the ego vehicle's view region). In addition to such static goals, it is also possible to add dynamic goals which depend on current traffic. For example, in the dense merging scenario shown in Figure 2d, stopping goals are dynamically added to model a vehicle's intention to allow the ego vehicle to merge in front.
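Purely for illustration, a goal-generation heuristic along these lines could be sketched as follows; the road-layout and traffic accessors (visible_lane_ends, is_stopped) are hypothetical names, not part of IGP2:

```python
def generate_goals(vehicle, road_layout, view_region, traffic):
    """Hypothetical sketch of heuristic goal generation.

    Static goals: visible end points of the current road and connecting roads
    within the ego's view region. Dynamic goals: e.g. a stopping goal at a
    vehicle's current position once it has come to rest (dense-merging case).
    """
    goals = list(road_layout.visible_lane_ends(vehicle.lane, view_region))
    if traffic.is_stopped(vehicle):
        goals.append(("stop", vehicle.position))
    return goals
```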
2) Maneuver Detection: Maneuver detection is used to detect the current executed maneuver of a vehicle (at time t), allowing inverse planning to complete the maneuver before planning onward. We assume a module which computes probabilities over current maneuvers, p(ω^i), for each vehicle i. One option is Bayesian changepoint detection (e.g. [29]). The details of maneuver detection are outside the scope of our paper and in our experiments we use a simulated detector (cf. Sec. IV-B). As different current maneuvers may hint at different goals, we perform inverse planning for each possible current maneuver for which p(ω^i) > 0. Thus, each current maneuver produces its associated posterior probabilities over goals, denoted by p(G^i | s_{1:t}, ω^i).
3) Inverse Planning: Inverse planning is done using A* search [30] over macro actions. A* starts after completing the current maneuver ω^i which produces the initial trajectory ŝ_{1:τ}. Each search node q corresponds to a state s ∈ S, with initial node at state ŝ_τ, and macro actions are filtered by their applicability conditions applied to s. A* chooses the next macro action leading to a node q′ which has lowest estimated total cost¹ to goal G^i, given by f(q′) = l(q′) + h(q′). The cost l(q′) to reach node q′ is given by the driving time from i's location in the initial search node to its location in q′, following the trajectories returned by the macro actions leading to q′. A* uses the assumption that all other vehicles not planned for use a constant-velocity lane-following model after their observed trajectories. We do not check for collisions during inverse planning. The cost heuristic h(q′) to estimate remaining cost from q′ to goal G^i is given by the driving time from i's location in q′ to goal via straight line at speed limit. This definition of h(q′) is admissible as per A* theory, which ensures that the search returns an optimal plan. After the optimal plan is found, we extract the complete trajectory ŝ^i_{1:n} from the maneuvers in the plan and initial segment ŝ_{1:τ}.

¹ Here we use the term "cost" in keeping with standard A* terminology and to differentiate from the reward function defined in Sec. II.
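The following is a compact, simplified sketch of A* over macro actions with driving time as cost and the straight-line heuristic; the macro-action interface (applicable, apply) and goal.reached are assumed placeholders, and states are assumed hashable for the visited set:

```python
import heapq
import itertools
import math

def a_star_plan(start_state, goal, macro_actions, speed_limit):
    """Sketch of inverse planning via A* over macro actions (Sec. III-D.3)."""
    def heuristic(state):
        # h(q'): straight-line driving time to goal at the speed limit (admissible)
        return math.dist(state.position, goal.position) / speed_limit

    counter = itertools.count()  # tie-breaker so states are never compared directly
    frontier = [(heuristic(start_state), 0.0, next(counter), start_state, [])]
    visited = set()
    while frontier:
        _, cost, _, state, plan = heapq.heappop(frontier)
        if goal.reached(state):
            return plan, cost                        # optimal macro-action plan, driving time
        if state in visited:
            continue
        visited.add(state)
        for macro in macro_actions:
            if macro.applicable(state):
                next_state, duration = macro.apply(state)
                l = cost + duration                  # cost-so-far l(q')
                f = l + heuristic(next_state)        # f(q') = l(q') + h(q')
                heapq.heappush(frontier, (f, l, next(counter), next_state, plan + [macro]))
    return None, math.inf
```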
4) Trajectory Prediction: Our system predicts multiple plausible trajectories for a given vehicle and goal. This is required because there are situations in which different trajectories may be (near-)optimal but may lead to different predictions which could require different behaviour on the part of the ego vehicle. We run A* search for a fixed amount of time and let it compute a set of plans with associated rewards (up to some fixed number of plans). Any time A* search finds a node that reaches the goal, the corresponding plan is added to the set of plans. Given a set of smoothed trajectories {ŝ^{i,k}_{1:n} | ω^i, G^i}_{k=1..K} to goal G^i with initial maneuver ω^i and associated reward r_k = R^i(ŝ^{i,k}_{1:n}), we compute a distribution over the trajectories via a Boltzmann distribution

  p(ŝ^{i,k}_{1:n}) ∝ exp(γ r_k)                                           (7)

where γ is a scaling parameter (we use γ = 1). Similar to Eq. (6), Eq. (7) encodes the assumption that trajectories which are closer to optimal are more likely.
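Eq. (7) is a softmax over plan rewards; a small sketch follows (the numerical-stability shift is an implementation detail, not from the paper):

```python
import numpy as np

def trajectory_distribution(rewards, gamma=1.0):
    """p(ŝ^{i,k}_{1:n}) ∝ exp(γ r_k) over the plans found by A* (Eq. (7))."""
    r = gamma * np.asarray(rewards, dtype=float)
    weights = np.exp(r - r.max())        # shift for numerical stability
    return weights / weights.sum()

# Example: sample one predicted trajectory index for a (goal, maneuver) pair
probs = trajectory_distribution([-12.3, -12.9, -15.1])
k = np.random.choice(len(probs), p=probs)
```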
E. Ego Vehicle Planning

To compute an optimal plan for the ego vehicle, we use the goal probabilities and predicted trajectories to inform a Monte Carlo Tree Search (MCTS) algorithm [25] (see Algorithm 2).

Algorithm 2 Monte Carlo Tree Search algorithm
Returns: optimal maneuver for ego vehicle ε in state s_t
Perform K simulations:
 1: Search node q.s ← s_t (root node)
 2: Search depth d ← 0
 3: for all i ∈ I \ {ε} do
 4:   Sample current maneuver ω^i ∼ p(ω^i)
 5:   Sample goal G^i ∼ p(G^i | s_{1:t}, ω^i)
 6:   Sample trajectory ŝ^i_{1:n} ∈ {ŝ^{i,k}_{1:n} | ω^i, G^i} with p(ŝ^{i,k}_{1:n})
 7: while d < d_max do
 8:   Select macro action µ for ε applicable in q.s
 9:   ŝ_{τ:ι} ← Simulate µ until it terminates, with non-ego vehicles following their sampled trajectories ŝ^i_{1:n}
10:   r ← ∅
11:   if ego vehicle collides during ŝ_{τ:ι} then
12:     r ← r_coll
13:   else if ŝ^ε_ι achieves ego goal G^ε then
14:     r ← R^ε(ŝ_{t:n})
15:   else if d = d_max − 1 then
16:     r ← r_term
17:   if r ≠ ∅ then
18:     Use (8) to backprop r along search branches (q, µ, q′) that generated the simulation
19:     Start next simulation
20:   q′.s = ŝ_ι; q ← q′; d ← d + 1
Return maneuver for ε in s_t, µ ∈ arg max_µ Q(root, µ)

The algorithm performs a number of closed-loop simulations ŝ_{t:n}, starting in the current state ŝ_t = s_t down to some fixed search depth or until a goal state is reached. At the start of each simulation, for each non-ego vehicle, we first sample a current maneuver, then goal, and then trajectory for the vehicle using the associated probabilities (cf. Section III-D). Each node q in the search tree corresponds to a state s ∈ S and macro actions are filtered by their applicability conditions applied to s. After selecting a macro action µ using some exploration technique (we use UCB1 [31]), the state in the current search node is forward-simulated based on the trajectory generated by the macro action µ and the sampled trajectories of non-ego vehicles, resulting in a partial trajectory ŝ_{τ:ι} and new search node q′ with state ŝ_ι. Forward-simulation of trajectories uses a combination of proportional control and adaptive cruise control (based on IDM [32]) to control a vehicle's acceleration and steering. Termination conditions of maneuvers are monitored in each time step based on the vehicle's observations. Collision checking is performed on ŝ_{τ:ι} to check whether the ego vehicle collided, in which case we set the reward to r ← r_coll which is back-propagated using (8), where r_coll is a method parameter. Otherwise, if the new state ŝ_ι achieves the ego goal G^ε, we compute the reward for back-propagation as r ← R^ε(ŝ_{t:n}). If the search reached its maximum depth d_max without colliding or achieving the goal, we set r ← r_term which can be a constant or based on heuristic reward estimates similar to A* search.

The reward r is back-propagated through search branches (q, µ, q′) that generated the simulation, using a 1-step off-policy update function (similar to Q-learning [33])

  Q(q, µ) ← Q(q, µ) + δ⁻¹ [r − Q(q, µ)]                       if q is a leaf node, else
  Q(q, µ) ← Q(q, µ) + δ⁻¹ [max_{µ′} Q(q′, µ′) − Q(q, µ)]                  (8)

where δ is the number of times that macro action µ has been selected in q. After the simulations are completed, the algorithm selects the best macro action for execution in s_t from the root node, arg max_µ Q(root, µ).
conditions applied to s. After selecting a macro action µ using IV. E VALUATION
some exploration technique (we use UCB1 [31]), the state We evaluate IGP2 in simulations of diverse urban driving
in the current search node is forward-simulated based on the scenarios, showing that: (1) our inverse planning method
trajectory generated by the macro action µ and the sampled robustly recognises the goals of non-ego vehicles; (2) goal
trajectories of non-ego vehicles, resulting in a partial trajectory recognition leads to improved driving efficiency measured by
ŝτ :ι and new search node q 0 with state ŝι . Forward-simulation driving time; and (3) intuitive explanations for the predictions
(a) Scenario 1 (S1) (b) Scenario 2 (S2) (c) Scenario 3 (S3) (d) Scenario 4 (S4)
Fig. 2: IGP2 in 4 test scenarios. Ego vehicle shown in blue. Bar plots show goal probabilities for non-ego vehicles. For each goal, up to two of the most probable predicted trajectories to the goal are shown, with thickness proportional to probability. (a) S1: Ego's goal is the blue goal. Vehicle V1 is on the ego's road; V1 changes from the left to the right lane, biasing the ego's prediction towards the belief that V1 will exit, since a lane change would be irrational if V1's goal was to go east. As exiting will require a significant slowdown, the ego decides to switch lanes to avoid being slowed down too. (b) S2: Ego's goal is the blue goal. Vehicle V1 is approaching the junction from the east and vehicle V2 from the west. As V1 approaches the junction, slows down, and waits to take a turn, the ego's belief that V1 will turn right increases significantly, since it would be irrational to stop if the goal was to turn left or go straight. Since the ego recognised that V1's goal is to go north, it predicts that V1 will wait until V2 has passed, giving the ego an opportunity to enter the road. (c) S3: Ego's goal is the green goal. As V1 changes from the inside to the outside lane of the roundabout and decreases its speed, it significantly biases the ego's prediction towards the belief that V1 will take the south exit, since that is the rational course of action for that goal. This encourages the ego to enter the roundabout while V1 is still in the roundabout. (d) S4: Ego's goal is the purple goal. With two vehicles stopped at the junction at a traffic light, vehicle V1 is approaching them from behind, and vehicle V2 is crossing in the opposite direction. When V1 reaches zero velocity, the goal generation function adds a stopping goal (orange) for V1 at its current position, shifting the goal distribution towards it, since stopping is not rational for the north/west goals. The interpretation is that V1 wants the ego to merge in front of V1, which the ego then does.