Preface
The 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019)
is the flagship international conference in robotics and intelligent systems. It is co-sponsored
by the IEEE, the IEEE Robotics and Automation Society (RAS), the IEEE Industrial Elec-
tronics Society (IES), the Robotics Society of Japan (RSJ), the Society of Instruments and
Control Engineers (SICE), and the New Technology Foundation (NTF). IEEE is a non-profit,
technical professional association of more than 400,000 members in 160 countries. It is a lead-
ing authority in technical areas ranging from computer engineering, biomedical technology and
telecommunications, to electric power, aerospace and consumer electronics, among others.
This volume contains the papers presented at the workshop TCV2019: Towards Cognitive Vehicles: perception, learning and decision making under real-world constraints. Is bio-inspiration helpful?, held on November 8, 2019, at IROS.
Objectives of the workshop:
Autonomous driving is only one of many aspects of intelligence required for future transportation systems. Human-machine interaction in a cognitive vehicle is an intriguing use case
that requires intelligence beyond the state of the art in machine learning, computer vision, and
AI. For safe and convenient human-machine interaction, an intelligent system such as a smart
vehicle needs to be able to perceive its environment and make decisions based on the received
data. Current state-of-the-art approaches to both intelligent perception and decision making
typically rely on machine learning with offline training of neural networks using elaborated
datasets. To enable truly adaptive intelligence, as we know it from biological systems, learning
that supports decision making and perception needs to happen in real time, in an online fashion.
But can such adaptive perceiving, deciding, and learning systems be safe enough to actually be
deployed in an intelligent vehicle?
While biological inspiration has led to some of the most successful approaches in perception and machine learning – deep neural networks – its deployment in real-world, safety-critical settings is still limited. We aim to explore and critically discuss what biological inspiration in perception, learning, and decision making could bring in the future for increasing the intelligence of vehicles and other robotic systems.
Thus, the aim of the workshop is to discuss potential benefits and pitfalls in applying bio-inspired approaches when developing intelligent real-world systems that perceive, interact, learn, and make decisions. We will focus on the application area of intelligent, "cognitive" vehicles and will use an unconventional format: for each of three subtopics we invited 2-4 experts from different schools of thought (for example, traditional machine learning versus brain-inspired learning, conventional approaches to planning and decision making versus cognitive architecture-based approaches, or event-based bio-inspired vision versus conventional machine vision). Each speaker will give a short introductory talk followed by a moderated panel discussion around each topic. Furthermore, we will invite researchers from intelligent robotics and vehicles with a focus on perception, learning and decision making to present their work in posters and short spotlight talks.
The workshop will stimulate discussion of the role of biological inspiration in the develop-
ment of future AI systems in the context of real-world, safety-critical applications of robotic
systems in environments shared with humans.
Topics of interest:
Applications
Perception
• Robust, accountable and scalable perception, with and without neural networks
• Multi-modal perception and sensory integration
• Attention and cognitive control in visual and tactile perception
• Gesture recognition
• Perception for action
Learning
Cognitive Architectures
We would like to thank the Technical Committee for Cognitive Robotics of the IEEE Robotics & Automation Society, NEUROTECH, the BMW Group, and BOSCH for their support.
Table of Contents
Enhancing Object Detection in Adverse Conditions using Thermal Imaging . . . . . . . . . . . . . . . 1
Kshitij Agrawal and Anbumani Subramanian
Exploration for Objects Labelling Guided by Environment Semantics using UAVs . . . . . . . . 4
Reem Ashour, Tarek Taha, Jorge Dias, Lakmal Seneviratne and Nawaf Almousa
Towards game theoretic AV controllers: measuring pedestrian behaviour in Virtual
Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Fanta Camara, Patrick Dickinson, Natasha Merat and Charles W. Fox
MSPRT action selection model for bio-inspired autonomous driving and intention
prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Riccardo Donà, Gastone Pietro Rosati Papini and Giammarco Valenti
A dynamic neural model for endowing intelligent cars with the ability to learn driver
routines: where to go, when to arrive and how long to stay there? . . . . . . . . . . . . . . . . . . . . . . . . 15
Flora Ferreira, Weronika Wojtak, Wolfram Erlhagen, Paulo Vicente, Ankit Patel,
Sérgio Monteiro and Estela Bicho
Towards an Evaluation Methodology for the Environment Perception of Automotive
Sensor Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Maike Hartstern, Viktor Rack and Wilhelm Stork
Risk-Aware Reasoning for Autonomous Vehicles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Majid Khonji, Jorge Dias and Lakmal Seneviratne
Cognitively-inspired episodic imagination for self-driving vehicles . . . . . . . . . . . . . . . . . . . . . . . . . 28
Sara Mahmoud, Henrik Svensson and Serge Thill
A Cognitively Informed Perception Model for Driving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Alice Plebe and Mauro Da Lio
Cognitive Wheelchair: A Personal Mobility Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Mahendran Subramanian, Suhyung Park, Pavel Orlov and Aldo Faisal
A Frontal Cortical Loop For Autonomous Vehicles Using Neuralized Perception-Action
Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
David Windridge and Seyed Ali Ghorashi
Following Social Groups: Socially-Compliant Autonomous Navigation in Dense Crowds . . . 44
Xinjie Yao, Ji Zhang and Jean Oh
Author Index
Agrawal, Kshitij 1
Almousa, Nawaf 4
Ashour, Reem 4
Bicho, Estela 15
Camara, Fanta 7
Da Lio, Mauro 32
Dias, Jorge 4, 23
Dickinson, Patrick 7
Donà, Riccardo 11
Erlhagen, Wolfram 15
Faisal, Aldo 36
Ferreira, Flora 15
Fox, Charles W. 7
Hartstern, Maike 19
Khonji, Majid 23
Mahmoud, Sara 28
Merat, Natasha 7
Monteiro, Sérgio 15
Oh, Jean 44
Orlov, Pavel 36
Park, Suhyung 36
Patel, Ankit 15
Plebe, Alice 32
Rack, Viktor 19
Rosati Papini, Gastone Pietro 11
Seneviratne, Lakmal 4, 23
Stork, Wilhelm 19
Subramanian, Anbumani 1
Subramanian, Mahendran 36
Svensson, Henrik 28
Taha, Tarek 4
Thill, Serge 28
Valenti, Giammarco 11
Vicente, Paulo 15
Windridge, David 40
Wojtak, Weronika 15
Yao, Xinjie 44
Zhang, Ji 44
Keyword Index
3D reconstruction 4
chance constraints 23
Cognitive architectures and neural networks 15
Cognitive architectures for action selection 7
Pedestrian Behaviour 7
Cognitive Vehicle 36
Cognitive Vehicles 40
convergence-divergence zones 32
Cost function 4
deep learning 32
Deep Learning 40
Deep Reinforcement Learning 28
Driver routines 15
Episodic Imagination 28
external vehicle sensors 19
Eye Gaze 36
Game Theory 7
Gaze tracking 36
MSPRT 11
object detection 1
Online learning 15
perception 19
Perception-Action Learning 40
risk-aware planning 23
thermal imaging 1
Utility function 4
variational autoencoder 32
vehicle sensor setup configuration 19
virtual testing 19
WTA 11
Program Committee
Enhancing Object Detection in Adverse Conditions using Thermal
Imaging
Anonymous submission
Abstract— Autonomous driving relies on deriving an understanding of objects and scenes through images. These images are often captured by sensors in the visible spectrum. For improved detection capabilities, we propose the use of thermal sensors to augment the vision capabilities of an autonomous vehicle. In this paper, we present our investigations on the fusion of visible and thermal spectrum images using a publicly available dataset, and use it to analyze the performance of object recognition on other known driving datasets. We present a comparison of object detection in night-time imagery and qualitatively demonstrate that thermal images significantly improve detection accuracies.

I. INTRODUCTION

Object detection is one of the primary components of scene understanding in an autonomous vehicle. The detected objects are used to plan the trajectory of a vehicle. Cameras are used to capture images of the environment, which are then input to a myriad of computer vision tasks, including object detection.

While significant progress has been achieved in using the visible spectrum for object detection algorithms, it poses inherent limitations due to the response of cameras in the visible spectrum. Some of the shortcomings include low dynamic range, slow exposure adjustment and inefficiencies in high-contrast scenes, while also being subject to weather conditions like fog and rain. Bio-inspired vision, like infra-red-based thermal vision, could be an effective tool to compensate for the shortcomings of imagers that operate in the visible spectrum.

Other sensing modalities, like LIDAR-based systems, are sufficient to detect depth in a scene. However, the data may be too coarse to detect objects at further distances and may lack the resolution to classify objects. Thermal imagers, on the other hand, can easily visualize objects that emit infra-red radiation due to their inherent heat. Due to this property, thermal imagers can visualize important participants on the road, like people, cars and animals, at any time of the day. Augmenting the detection of objects with the thermal spectrum could be a good way to enable robust object detection for safety-critical systems like autonomous vehicles.

Object detection methods have progressed significantly over the years, from simple contour-based methods using support vector machines (SVMs) [1-7] to ones using deep classification models [16]–[20] that utilize hierarchical representations of data. Data-driven models are the flavor of the day, dominating the detection benchmarks on large-scale datasets like PASCAL VOC [8] and COCO [9].

There is a large body of work on recognizing and localizing objects in the visible spectrum, for objects like people [13, 14], vehicles [10] and traffic lights. The features extracted from an image can help identify an object in good lighting and normal weather conditions. However, images obtained using camera systems in low-light conditions (night, dusk and dawn) and adverse weather conditions (rain and snow) contain partially illuminated objects, low contrast and low information content. These images are often difficult for object detection algorithms.

The primary contribution of our work is to investigate the behaviour of object detectors in the thermal spectrum in driving scenarios for autonomous navigation. We utilize the FLIR ADAS [11] dataset, which consists of annotated thermal images and time-synchronized visible images. Datasets like KAIST [12] exist for a similar purpose; however, they are limited to annotations of only people.

The next sections are organized as follows: in Section 2 we cover related research; in Section 3 we deal with the datasets, the generation of ground truth for the visible and thermal pairs in the FLIR ADAS dataset, and the setup of our experiment. In Section 4 we present our results, followed by the conclusion in Section 5.

II. RELATED WORK

Object detection consists of the recognition and localization of object boundaries within an image. Early work in the computer vision field focused on building task-based classifiers using specific image properties. In some of the earlier approaches, a sliding window is used to classify parts of an image based on feature pyramids [15]; histograms of oriented gradients (HoG) in combination with an SVM have been used to classify pedestrians [13]; and pools of Haar features [14] have been employed for face detection.

A more generalized form of object detection has evolved over the years due to advances in deep learning. The exhaustive search for classification has been replaced by convolutional classifiers. Object detection models have been proposed that work with relatively good accuracy on the visible spectrum using a) a two-stage system, a classifier connected with a region proposal network, such as RCNN [16], Fast-RCNN [17] and Faster-RCNN [18], or b) a single-stage network with the classification and localization layers in a cohesive space, like YOLO [19] and SSD [20].

Models trained on large-scale datasets are known to perform to quite a good extent. With driving datasets like KITTI [21] and Cityscapes [22], object detection models have been employed to detect pedestrians, cars and bicycles.

Some work has been done on the detection of objects in thermal images [23-26], especially focusing on human detection. Since some of the work has been from static camera,
methods, we are able to translate the annotations to the visible space as well, resulting in about 8000 training and 1247 validation images with a 42-58 split of night vs. day. In the rest of our work we refer to this translated dataset as the FLIR RGB dataset. Fig. 1 shows the translation of bounding boxes from the thermal images to the corresponding registered image in the RGB domain. The input images in the FLIR dataset are uncorrected, and slight radial distortions due to the lens can be seen. The drawback of our technique is that points close to the center can be registered well; however, points radially distant from the center do not align well.
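The annotation translation described above can be sketched as follows. The paper does not publish its actual registration transform, so the homography `H` below is a made-up illustrative one (a slight scale plus shift), and `translate_box` is a hypothetical helper; a real pipeline would estimate the homography from matched thermal/RGB point pairs.

```python
import numpy as np

def translate_box(box, H):
    """Map an axis-aligned box (x1, y1, x2, y2) from the thermal image
    into the RGB image with a 3x3 homography H, then take the
    axis-aligned extent of the four warped corners."""
    x1, y1, x2, y2 = box
    corners = np.array([[x1, y1, 1.0], [x2, y1, 1.0],
                        [x1, y2, 1.0], [x2, y2, 1.0]]).T
    warped = H @ corners               # 3x4 homogeneous corner matrix
    warped = warped[:2] / warped[2]    # perspective divide
    return tuple(float(v) for v in
                 (warped[0].min(), warped[1].min(),
                  warped[0].max(), warped[1].max()))

# Hypothetical thermal-to-RGB registration: 5% scale and a small shift.
H = np.array([[1.05, 0.00, 12.0],
              [0.00, 1.05, -8.0],
              [0.00, 0.00,  1.0]])
rgb_box = translate_box((100, 100, 200, 180), H)   # ≈ (117, 97, 222, 181)
```

The radial misalignment noted in the text is exactly what a single global homography cannot capture: it models a planar projective mapping, so uncorrected lens distortion away from the image centre leaves residual error.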
TABLE I: Scheme showing the mapping of labels
Reem Ashour1, Tarek Taha2, Jorge Dias1, Lakmal Seneviratne1, and Nawaf Almoosa1
ACKNOWLEDGMENT
This publication is based upon work supported by the Khalifa
University of Science and Technology under Award No.
RC1-2018-KUCARS.
REFERENCES
Abstract— Understanding pedestrian interaction is of great importance for autonomous vehicles (AVs). The present study investigates pedestrian behaviour during crossing scenarios with an autonomous vehicle using Virtual Reality. The self-driving car is driven by a game theoretic controller which adapts its driving style to pedestrian crossing behaviour. We found that subjects value collision avoidance about 8 times more than saving 0.02 seconds. A previous lab study found time saving to be more important than collision avoidance in a highly unrealistic board-game-style version of the game. The present result suggests that the VR simulation reproduces real-world road crossings better than the lab study and provides a reliable test-bed for the development of game theoretic models for AVs.

Keywords: Autonomous Vehicles; Game Theory; Cognitive architectures for action selection; Pedestrian Behaviour

I. INTRODUCTION

The upcoming arrival of autonomous vehicles on the roads poses several concerns regarding their future interaction with other road users, in particular with pedestrians and cyclists, whose behaviour is more complex and unpredictable. Pedestrian interaction is challenging due to multiple uncertainties in their pose estimation, gestures and intention recognition. We thus recently proposed a game theory model for such interactions [3], where a pedestrian encounters an autonomous vehicle at an unsignalized intersection.

Fig. 1: Two agents negotiating for priority at an intersection

In this model, two agents (e.g. pedestrian and/or human or autonomous driver) called Y and X are driving straight towards each other at an unmarked intersection as in Fig. 1. In the model, this process occurs over discrete space as in Fig. 2 and discrete times ('turns') during which the agents can adjust their discrete speeds, simultaneously selecting speeds of either 1 square per turn or 2 squares per turn, at each turn. Both agents want to pass the intersection as soon as possible to avoid travel delays, but if they collide, they are both bigger losers, as they both receive a negative utility Ucrash. Otherwise, if the players pass the intersection, each receives a time delay penalty −T·UT, where T is the time from the start of the game and UT represents the value of saving one turn of travel time.

Fig. 2: Sequential Chicken Game

The model assumes that the two players choose their actions (speeds) aY, aX ∈ {1, 2} simultaneously, then implement them simultaneously, at each of several discrete-time turns. There is no lateral motion (positioning within the lanes of the roads) or communication between the agents other than via their visible positions. The game is symmetric, as both players are assumed to know that they have the same utility functions (Ucrash, UT); hence they both have the same optimal strategies. These optimal strategies are derivable from game theory together with meta-strategy convergence, via recursion. Sequential Chicken can be viewed as a sequence of one-shot sub-games, whose payoffs are the expected values of the new games resulting from the actions, and which are solvable by standard game theory.

The (discretized) locations of the players can be represented by (y, x, t) at turn t, with their actions aY, aX ∈ {1, 2} as speed selections. The new state at turn t + 1 is given by (y + aY, x + aX, t + 1). Define v_{y,x,t} = (v^Y_{y,x,t}, v^X_{y,x,t}) as the value (expected utility, assuming all players play optimally) of the game for state (y, x, t). As in standard game theory, the value of each 2 × 2 payoff matrix can then be written as

    v_{y,x,t} = v( [ v(y−1, x−1, t+1)   v(y−1, x−2, t+1)
                     v(y−2, x−1, t+1)   v(y−2, x−2, t+1) ] ),    (1)

which can be solved using dynamic programming, assuming meta-strategy convergence equilibrium selection. Under some approximations based on the temporal gauge invariance described in [3], we may remove the dependence on the time t in our implementation, so that only the locations (y, x) are required in the computation of v_{y,x} and optimal strategy selection.

Virtual Reality (VR) offers the opportunity to experiment on human behaviour in simulated real-world environments that can be dangerous or difficult to study, such as pedestrian road crossing. The present study uses VR to run the game theoretic model on a virtual autonomous vehicle and then examines the responses of human participants to it.

Contributions: To the best of our knowledge, this is the first attempt to evaluate pedestrian behaviour during interaction scenarios with a game theoretic autonomous vehicle in a virtual reality environment. It examines pedestrian road-crossing preferences (Ucrash, UT) when interacting with the virtual autonomous vehicle and demonstrates the importance of VR for the development of the model.

1 Institute for Transport Studies (ITS), University of Leeds, UK
2 Lincoln Centre for Autonomous Systems (LCAS), School of Computer Science, University of Lincoln, UK
3 Ibex Automation Ltd, UK
This project has received funding from the European Union's Horizon 2020 Research and Innovation programme under grant agreement No 723395 (InterACT).

II. RELATED WORK

There are few previous studies which have investigated interactions between autonomous vehicles and other road users in VR. Wang et al. [7] developed 5 different behaviours for an autonomous vehicle. The vehicle behaviour was successfully tested in different simulated traffic scenarios, such as at intersections and during lane changing, in simulated city and highway road networks. Keferböck et al. [5] studied autonomous vehicle interactions with pedestrians in a virtual environment. In one of their experiments, participants are asked to cross a road in front of them while a car is approaching. This experiment differs from ours in that the AV stops and shows (or not) a stop intent to pedestrians. This study aimed to show the importance of substituting communications between pedestrians and drivers with some explicit communication forms for self-driving cars. Pillai [6] used task analysis to divide pedestrian-vehicle interaction into a sequence of actions giving two outcomes, either the vehicle passes first or the pedestrian crosses, and performed some experiments with participants on their crossing behaviour using virtual reality. Hartmann et al. [4] proposed a testing procedure for pedestrian collision avoidance for autonomous vehicles using VR techniques. This test bed can take into account different factors that could influence pedestrian behaviour, such as their understanding of the environment, their body movement and their personality.

We previously performed laboratory experiments to fit data to the game theory model [3]. We first asked participants to play this game as a board game in [2]. Secondly, participants were asked to play the game in person, moving on squares [1]. These previous laboratory experiments have shown unrealistic results, with participants preferring time saving rather than collision avoidance. The present study aims to extend these experiments and put participants in more realistic interaction scenarios with a game theoretic autonomous vehicle in a virtual environment.

III. METHODS

A. VR Setup

The study was conducted using an HTC Vive Pro head mounted display (HMD). Participants did not use the HTC Vive controllers, as no interactions other than walking were required. The HMD was used with the HTC wireless adapter in order to facilitate easier movement during the simulation. We used an area of approximately 6m by 3m to conduct the simulation (as shown in Fig. 3), which was mapped using the usual HTC Vive room mapping system. The size of this area slightly exceeds that recommended by the manufacturer; however, we experienced no technical problems with tracking or system performance. The start position on the floor was marked with an 'X' using floor tape, so that participants knew where to stand at the start of each simulation, prior to placing the HMD on their head. The simulation was created using the Unity 3D engine, and was run under Windows 10 on a PC based on an Intel Core i7-7700K CPU, with 32GB of RAM, and an Nvidia GeForce GTX 1080 GPU.

Fig. 3: VR Lab

B. Car behaviour model

The virtual AV was designed to drive using the Sequential Chicken model described above. The car began driving 40 meters away from the intersection; its full speed was 30 km/h and its lowest speed was 15 km/h. The vehicle moved and adapted its behaviour to the participant's motion. Every 0.02s, the car observed the current position of the pedestrian and made its decision based on the game theory model. The car was designed not to stop for any pedestrian. Indeed, in the sequential chicken model, if the two players play optimally, then there must exist a non-zero probability for a collision to occur. Intuitively, if we consider an AV to be one player that always yields, it will make no progress, as the other player will always take advantage of it; hence there must be some threat of collision.

Fig. 4: Virtual Autonomous Vehicle

C. Human experiment

We invited 11 participants, 10 males and 1 female, aged between 19 and 37 years old, to take part in the study, under University of Lincoln Research Ethics. 7 participants had previous experience with VR. Participants were asked to cross a road in front of them as they would do in everyday life. They should stop moving on the other side of the road when they reached a yellow cube, located there for safety reasons. A vehicle approaches from their right-hand side. Participants began walking about 4 meters away from the intersection. Prior to the experiment, participants were introduced to the experimental setup and trained on walking within the VR environment with the VR headset. There were 6 trials per participant in the virtual environment, with the first trials considered as training data.

Fig. 7: Pedestrian behaviour preference

Similar to the optimal solution computation method developed in the laboratory experiments [2] [1], we obtain an optimal parameter, θ = |Ucrash/UT| = 60/8 = 7.5, for participants, as shown in Fig. 7. This reveals that pedestrians valued avoidance of a crash about 8 times more than a 0.02s time saving per turn, resulting in pedestrians being less assertive in crossing the road. In comparison, previous laboratory experiments found that participants valued time saving more than collision avoidance [2][1].
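The value recursion of Eq. (1) can be sketched in a few lines. This is an illustrative simplification, not the authors' implementation: uniform mixing over the speed pair stands in for the meta-strategy convergence equilibrium selection of [3], and the terminal rule (a crash when both agents reach the intersection on the same turn) is likewise simplified. Ucrash = −60 and UT = 8 are the utilities reported in the study.

```python
from functools import lru_cache
import itertools

U_CRASH, U_T = -60.0, 8.0   # utilities fitted in the VR experiment

@lru_cache(maxsize=None)
def value(y, x):
    """Expected values (v_Y, v_X) for distances y, x (in squares) to
    the intersection, time-invariant as in the paper's approximation.
    Equilibrium selection is simplified to uniform mixing over the
    speed pair (an assumption for illustration only)."""
    if y <= 0 and x <= 0:            # both in the intersection: crash
        return (U_CRASH, U_CRASH)
    if y <= 0:                       # Y has passed; X cruises through alone
        return (0.0, -((x + 1) // 2) * U_T)
    if x <= 0:
        return (-((y + 1) // 2) * U_T, 0.0)
    vy = vx = 0.0
    for a_y, a_x in itertools.product((1, 2), repeat=2):  # the 2x2 sub-game
        ny, nx = value(y - a_y, x - a_x)
        vy += 0.25 * (ny - U_T)      # each turn spent costs U_T
        vx += 0.25 * (nx - U_T)
    return (vy, vx)

vy, vx = value(5, 5)   # symmetric start, 5 squares each from the intersection
```

Because |Ucrash| is so much larger than UT, states close to a simultaneous arrival lose value sharply, which is the asymmetry the fitted ratio θ = 7.5 quantifies.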
Abstract— This paper proposes the usage of a bio-inspired action selection mechanism, known as the multi-hypothesis sequential probability ratio test (MSPRT), as a decision making tool in the field of autonomous driving. The focus is to investigate the capability of the MSPRT algorithm to effectively select the optimal action, whether the autonomous agent is required to drive the vehicle or to infer the human driver's intention when the agent is acting as an intention prediction mechanism. After a brief introduction to the agent, we present numerical simulations to demonstrate how simple action selection mechanisms may fail to deal with noisy measurements, while the MSPRT provides the robustness needed for the agent's implementation on the real vehicle.

I. INTRODUCTION

Autonomous vehicles (AVs) require effective algorithms to perform robust decision making in the shortest time frame possible. Indeed, in a dynamic environment such as the one faced by AVs, the capability of reacting promptly is a major factor in potentially avoiding collisions and saving lives. The inherent complexity of the process is worsened by the presence of sensor noise and uncertainties, which affect the way the behavioural level selects the proper action.

In the early days of autonomous driving, tactical/behavioural level planning typically relied on manually engineered state machines; this approach was adopted by many competitors in the 2007 DARPA Grand Challenge (a.k.a. urban challenge) [1], [2]. Although some participants actually managed to succeed, state machines inherently lack the capability of safely generalizing to unmodeled scenarios. More recent autonomous driving software is built on top of probabilistic approaches, including Markov Decision Processes [3], or machine learning-based techniques such as behaviour networks [4] or support vector machines [5]. A promising method is the adoption of reinforcement learning (RL) as a high-level biasing mechanism for learning an optimal action selection policy [6] or, oppositely, the exploitation of the inverse reinforcement learning (IRL) framework to learn the reward function from human data [7].

Conversely, the problem of action selection is not a peculiar feature of AVs; rather, any agent (artificial or biological) dealing with complex dynamical environments where multiple mutually exclusive behaviours are possible shares similar dilemmas. Indeed, there exists a huge amount of ethology literature investigating "behaviour switching" and "decision making" [8], the common jargon among cognitive scientists to refer to the action selection problem in robotics.

Several theories have been proposed in the literature on how animals perform effective decision making [9]. For instance, in [10] the affordance competition concept underlines a parallel processing of multiple actions competing against each other until the selection of the winning behavior. Such a modeling framework is based on the definition of criteria for assessing the worthiness of the actions and on the selection process itself.

We exploit this concept of parallel competing actions in the context of the European projects SafeStrip1 and Dreams4Cars2. In particular, in SafeStrip we take advantage of the mirroring mechanism introduced in [11] to infer the human driver's intended action in several dangerous scenarios, like in the proximity of a pedestrian crossing, in a road work zone or at an intersection. In the latter case a more complex mirroring is performed, taking into account the right-of-way rules and mirroring other vehicles. This is made possible through vehicle-to-vehicle and vehicle-to-infrastructure communication [12].

Such an inference process boils down to the selection, among a set of longitudinal maneuvers called motor primitives, of the one matching the driver's intended action in terms of instantaneous jerk j0. Each motor primitive has an optimality-based formulation characterized by an associated initial jerk. By defining the jerk space as a 1-dimensional grid, we can explore a set of possible actions, also taking into account infrastructure-based information.

In Dreams4Cars we utilize a similar optimality-based motor primitives approach for the synthesis of an autonomous driving agent called Co-driver [13]. In addition to the longitudinal manoeuvres, we also generate a set of lateral manoeuvres by defining a 1-dimensional grid on the instantaneous lateral jerk r0. By combining the two grids we devise a 2-dimensional matrix where each entry is a pair (j0, r0) which encodes a latent action. Each pair is then assigned a merit via the definition of a scenario-dependent salience.

Common to both projects is the need to select the best action after the computation of the grids. The rest of this paper is devoted to demonstrating how we can perform such a task by taking advantage of a biologically inspired action selection mechanism.

II. THE MOTOR CORTEX CONCEPT

In order to better clarify how the affordance competition process takes place, let us inspect an example simulation …

1 Department of Industrial Engineering, University of Trento, 38123
1 https://www.safestrip.eu

… the potential maneuvers and the one currently performed by the driver. After the computation of the scenario-based merit for each initial control, a bias function measuring the proximity of the driver's maneuver to each action is applied to the motor cortex, as shown in [16] for the longitudinal control only.

Fig. 3: MSPRT vs. WTA channel selection errors. Parameters of the simulation as in Table I.

Overall, the behaviour of the MSPRT algorithm can be shaped by adjusting the hyper-parameters in Table I.
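The excerpt does not spell out the MSPRT update itself, so the sketch below follows the standard formulation of the test: accumulate each channel's evidence over time and stop as soon as one channel's softmax posterior crosses a confidence threshold. The Gaussian evidence model, the channel drifts and the function names are illustrative assumptions, not the paper's parameters.

```python
import math
import random

def msprt(evidence_stream, n_channels, threshold=0.99):
    """Multi-hypothesis sequential probability ratio test (standard
    form): integrate per-channel evidence and decide as soon as one
    channel's posterior exceeds `threshold`.
    Returns (selected channel, decision time)."""
    x = [0.0] * n_channels                  # accumulated evidence per channel
    best, t = 0, 0
    for t, increments in enumerate(evidence_stream, start=1):
        x = [xi + di for xi, di in zip(x, increments)]
        m = max(x)                          # stabilised log-sum-exp
        lse = m + math.log(sum(math.exp(xi - m) for xi in x))
        post = [math.exp(xi - lse) for xi in x]
        best = max(range(n_channels), key=post.__getitem__)
        if post[best] >= threshold:
            break                           # confident enough: decide now
    return best, t

# Illustrative use: three competing action channels with noisy
# saliences, the third having the highest mean drift (assumed model).
random.seed(0)
noisy = ([m + random.gauss(0.0, 0.5) for m in (0.0, 0.1, 0.5)]
         for _ in range(500))
channel, steps = msprt(noisy, 3)
```

A winner-take-all (WTA) baseline would instead pick the argmax of a single noisy sample; without the temporal integration above, one noise spike can select the wrong channel, which is the failure mode the paper's Fig. 3 compares against.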
Abstract— For many people, driving a car is a routine activity: they tend to go to the same places at about the same time of day or day of the week. We propose a learning system, based on dynamic neural fields, that allows cognitive vehicles/cars to acquire sequential information about driver destinations and the corresponding time properties. Importantly, the learning system can memorize long sequences, can deal with different temporal scales, and does not require the destinations to be fixed in advance. Learning occurs implicitly and is a continuous process. Memory recall allows the car to predict the driver's intended destination, when she/he intends to arrive/leave, and for how long she/he intends to stay at a destination. Such personalized information can be used to plan the next trip.

I. INTRODUCTION

Many studies report that human mobility is characterized by a high degree of regularity [1], [2], a significant tendency to spend most of the time in a few locations [3], and a tendency to visit specific destinations at specific times [4], [5]. For example, for many drivers, weekdays consist of leaving home in the morning, driving to the children's school, then to work, again to the children's school, and returning home in the evening. A person's daily routines are typically coupled with routines across other temporal scales, such as going to the gym or to church on specific days of the week.

Several different approaches, most of them statistical models [6], [7], have been proposed for predicting the next location in human mobility, but they require large amounts of data. Traditional Markov models work well for specific sets of behaviors but have difficulty incorporating temporal patterns across different timescales [8], and the destinations need to be fixed in advance. Here, we propose a dynamic neural model for learning information about the sequence of places and the timing of the habits of individual drivers. The fundamental assumption is that driving is mostly a routine, and memory recall of past sequences (of destinations) with time information can be used to predict the driver's intent. Learning occurs implicitly, the driver does not need to be asked for his/her destinations, and is a continuous process modeled in the form of coupled dynamic neural fields (DNFs). The theoretical framework of DNFs has been proven to provide key processing mechanisms to implement working memory, prediction, and decision making in cognitive systems (e.g. [9], [10]), including the learning of sequential tasks ([12], [13], [14]).

In this study, the central idea is to explore learning mechanisms able to learn not only the sequence of driver destinations but also time properties, e.g. (i) when to be at a destination and (ii) when to leave. Memory recall allows the car to predict the driver's intended destination, when she/he intends to arrive, and for how long she/he intends to stay there.

*The work received financial support from European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) and national funds, through the FCT (Project "Neurofield", ref POCI-01-0145FEDER-031393) and ADI (Project "Easy Ride: Experience is everything", ref POCI-01-0247-FEDER-039334), FCT PhD fellowship PD/BD/128183/2016 and the Pluriannual Funding Programs of the research centres CMAT and Algoritmi.

1 Flora Ferreira, Weronika Wojtak, Wolfram Erlhagen and Paulo Vicente are with the Center of Mathematics, University of Minho, Portugal [email protected]
2 Weronika Wojtak, Paulo Vicente, Ankit R. Patel, Sérgio Monteiro and Estela Bicho are with the Center Algoritmi, University of Minho, Portugal [email protected]

II. THE APPROACH

The approach presented in this paper is based on previous work on memory mechanisms for the order and timing of sequential processes [11], [13], [15], built on Dynamic Neural Fields (DNFs) [16], [17]. The central idea of dynamic field models is that relevant information is expressed by supra-threshold bumps of neural activity, where each bump represents a specific parameter value. Input from external sources, such as information from a sensor, causes activation in the corresponding population that remains active with no further external input due to recurrent excitatory and inhibitory interactions within the populations. These interactions are able to hold self-sustained multi-bump patterns. We assume that the vehicle GPS coordinates and the information whether the car is turning on or off are available. We consider as input the GPS coordinates (latitude and longitude) when the vehicle is turning off or on, which represent the GPS coordinates of a destination at the time the driver arrives or departs, respectively. Figure 1 illustrates an overview of the model architecture, consisting of several interconnected neural fields that take as input the GPS coordinates when the car departs or arrives. For concreteness, we assume the case of "arrive signals" to describe the model.

The 2D field and the four 1D fields at the top of the figure implement the encoding and memorizing of the GPS coordinates (latitude and longitude) of the places where the driver was and the relative timing between the moments in
which the driver arrived at those places. At the moment the driver turns off the car at specific GPS coordinates, these coordinates trigger the evolution of a bump in the input-encoding ON/OFF field uION/OFF. This activity is projected to the corresponding neurons in two 1D fields, the latitude perception field uPlat and the longitude perception field uPlong, resulting in a localized bump in each of these fields. Each of these bumps triggers, through excitatory connections, the evolution of a self-sustained activity pattern at the corresponding site in the latitude sequence memory field uSMlat and the longitude sequence memory field uSMlong, respectively. Inhibitory feedback from uSMlat to uPlat (from uSMlong to uPlong) destabilizes the existing bump in the perception field. This ensures that newly arrived localized input to uPlat (uPlong) will automatically drive the evolution of a bump at a different field location even if the specific cue value is repeated during the course of the sequence. The series of GPS coordinates at the moments when the car was turned off creates a multi-bump pattern in uSMlat and in uSMlong that stores the last sequence of visited places, with a strength of activation decreasing from bump to bump as a function of elapsed time since sequence onset.

To guarantee the memory of successive routines for the same period of time, a long-term memory in uLMlat (uLMlong) is dynamically built through excitatory connections from uSMlat (uSMlong). The multi-bump pattern of uSMlat (uSMlong) is projected via excitatory connections to uPlat (uPlong), ensuring the robustness of the encoding process in the face of noisy and potentially incomplete sensory information. The resulting preshaping in uPlat (uPlong), based on prior experience, modulates perception thresholds and speeds up the processing of inputs.

When predictions are needed, the sequence recall mechanism is activated. During recall, the four 1D fields and the 2D field at the bottom of Figure 1 become active. The latitude (longitude) decision field uDlat (uDlong) receives the multi-bump pattern of uLMlat (uLMlong) as subthreshold input. By a continuous increase of the baseline activity in uDlat (uDlong), all subpopulations are brought closer to the threshold for the evolution of self-stabilized bumps. When the currently most active population reaches this threshold, the corresponding output in uR is triggered. At the same time, the excitatory-inhibitory connections between associated populations in uDlat (uDlong) and the latitude (longitude) working memory field uWMlat (uWMlong) guarantee that the suprathreshold activity representing the latest sequence event is first stored in the working memory field and subsequently suppressed. The global initial value of h in uDlat (uDlong) is proportional to the sequence duration (e.g. 24 hours) minus the margin (e.g. 10 minutes) by which the arrival time at, or departure time from, a specific place should be predicted in advance.

The population dynamics in each field is governed by an integro-differential equation, which describes the activation of interconnected neurons along a one- or two-dimensional domain [18]:

τ ∂u(r,t)/∂t = −u(r,t) + S(r,t) + h + ∫_Ω w(r,r′) f(u(r′,t)) dr′,   (1)

where u(r,t) represents the activity at time t at position r on the domain Ω, a subset of R^d with d = 1 or d = 2. The constant τ > 0 defines the time scale of the field dynamics. The function S(r,t) represents the time-dependent, localized input to the field. The global inhibition h < 0 defines the baseline level of activation to which field excitation decays without external stimulation. The connectivity function w(r,r′) models how a population of neurons at position r in the field interacts with a population at position r′. For the fields in which only one bump at a time should evolve (e.g. uPlat, uD), we use a standard kernel of lateral-inhibition type [18]. To enable multi-bump solutions in the sequence and long-term memory fields (e.g. uSMlat, uLMlat), we assume a kernel with oscillatory rather than monotonic decay [16], [17]:

w(r) = A_w e^(−br) (b sin(αr) + cos(αr)),   (2)

where r = |x| for 1D and r = √(x² + y²) for 2D; the parameters A_w > 0, b > 0 and α > 0 control the amplitude, the rate at which the oscillations in w decay with distance, and the zero crossings of w, respectively. The firing function f is taken as the Heaviside step function with threshold 0.

A. Memory of interval timing between the destinations

To establish a stable activation gradient in the sequence memory fields, we consider the following state-dependent dynamics [11], [13], [14]:

τ_h ∂h(x,t)/∂t = (1 − f(u(x,t))) (−h(x,t) + h_0) + k f(u(x,t)),   (3)

where h_0 defines the level to which h relaxes without suprathreshold activity at position x, and k > 0 measures the growth rate when such activity is present. The adaptation of the resting level h is performed locally at field sites with suprathreshold activity. The interval timing between the places at which the car arrives or departs is memorized in the peak amplitudes: taking as input the GPS coordinates of a place at the moment the car is turned off, the difference between two successive peak amplitudes represents the interval timing between the two successive places where the car was parked.

B. Memory of time spent in each destination

To create a memory of the time duration spent in each place, a 2D dynamic neural field is used. During recall, this field receives as input the corresponding localized activation from the output recall OFF field, and a representation of the time duration spent in each place is obtained by applying the following dynamics for h:

τ_h ∂h(x,y,t)/∂t = k f(uROFF(x,y,t)) (1 − f(uRON(x,y,t))).   (4)

The h value increases locally as a function of elapsed time only in the presence of activation in uROFF and when
Fig. 1. Schematic view of the model architecture with several interconnected neural fields implementing sequence learning, memory and sequence recall
for the depart/arrive signals (GPS coordinates when the car arrives or departs at destination as input). For details see the text.
simultaneously the corresponding subpopulation in uRON is not activated (see Figure 2).

Fig. 2. Left ("Prediction of where and when the car arrives"): snapshot of the activation of a bump in the output recall fields uROFF and uRON. Right ("Memory of time spent in each destination"): snapshot of a stable activation pattern corresponding to the duration of time in each place already recalled.

Having a field with information about the time spent in each place will be useful, for example, to predict how long the car will stay at a specific destination.

III. RESULTS

As an example, we consider a driver's daily routine: depart from home to take the kids to school, next go to work, go to a restaurant for lunch, come back to work, in the evening pick up the children from school, take them to the gym, and finally come back home. To simulate this example, we assume as input real GPS coordinates of a Portuguese city (Guimarães) at realistic times. Figure 3 (A) illustrates the sequence memory of where and when the car departs. GPS coordinates memorized close to each other (less than 400 meters apart in this case) represent the same place. Two of the places (School and Work) were visited by the driver twice. Stable activation patterns corresponding to the memory of the GPS coordinate (latitude and longitude) sequence are represented in two 1D fields, where the bump amplitudes reflect the order of the places from which the car departed and the relative timing between them. Figure 3 (B) shows the representation in a 2D field of the time duration that the car was parked in each place during a day (from 0:00 to 24:00). Each P symbol on the map corresponds to a bump in the 2D field, and the amplitude represents the duration at each location. The higher amplitudes represent the places in which the car was parked for a longer time (i.e. work and home).

IV. CONCLUSION

We have presented an approach to learn ordinal and temporal aspects of driver routines using the theoretical framework
[Fig. 3 graphic: (A) "Sequence of places where the car departed during a day"; (B) "Memory of time spent in each place during a day".]
Fig. 3. (A) Part of the map of Guimarães, Portugal, generated from Google Maps, showing a sequence memory of the GPS coordinates from which the car departed during a day. The GPS coordinates represented by the bump centers are marked with a P symbol; nearby points represent the same place. The seven marks correspond to five different places (Home, School, Work, Restaurant and Gym), two of which have two marks. (B) Representation of the time duration at each visited place.
of dynamic neural fields. The learning is implicit, continuous, and can be scaled to different temporal scales. The model can be instantiated for each day of the week, and hence different routines can be learned. There are several possible uses for such learned memories. In terms of navigation systems, smarter route selection/recommendation could be provided through the integration of these memories with other factors such as traffic conditions, without requiring input from the driver. The car could predict the next destination and the desired time of arrival, and alert the driver if she is getting late to come to the car. Predicting the next departure time could be used to prepare the cockpit's comfort in advance – e.g. demist/defrost the windows and set a pleasant temperature – some time before the driver (and occupants) enter the car. Future work concerns implementing and testing this learning system in real driving scenarios, in the scope of the joint UMinho–Bosch project "Easy Ride: Experience is everything" (ref POCI-01-0247-FEDER-039334).

REFERENCES

[1] N. Eagle and A. S. Pentland, Eigenbehaviors: Identifying structure in routine, Behavioral Ecology and Sociobiology, 63(7), 2009, pp. 1057-1066.
[2] C. Song, Z. Qu, N. Blumm, and A. L. Barabási, Limits of predictability in human mobility, Science, 327(5968), 2010, pp. 1018-1021.
[3] C. Song, T. Koren, P. Wang, and A. L. Barabási, Modelling the scaling properties of human mobility, Nature Physics, 6(10), 2010, p. 818.
[4] S. Jiang, J. Ferreira, and M. C. González, Clustering daily patterns of human activities in the city, Data Mining and Knowledge Discovery, 25(3), 2012, pp. 478-510.
[5] S. Rinzivillo, L. Gabrielli, M. Nanni, L. Pappalardo, D. Pedreschi, and F. Giannotti, The purpose of motion: Learning activities from individual mobility networks, In 2014 International Conference on Data Science and Advanced Analytics (DSAA), 2014, pp. 312-318.
[6] R. Simmons, B. Browning, Y. Zhang, and V. Sadekar, Learning to predict driver route and destination intent, In 2006 IEEE Intelligent Transportation Systems Conference, 2006, pp. 127-132.
[7] M. Boukhechba, A. Bouzouane, S. Gaboury, C. Gouin-Vallerand, S. Giroux, and B. Bouchard, Prediction of next destinations from irregular patterns, Journal of Ambient Intelligence and Humanized Computing, 9(5), 2018, pp. 1345-1357.
[8] B. P. Clarkson, Life patterns: structure from wearable sensors, PhD diss., Massachusetts Institute of Technology, 2002.
[9] G. Schöner, Dynamical systems approaches to cognition, Cambridge Handbook of Computational Cognitive Modeling, 2008, pp. 101-126.
[10] Y. Sandamirskaya, S. K. Zibner, S. Schneegans, and G. Schöner, Using dynamic field theory to extend the embodiment stance toward higher cognition, New Ideas in Psychology, 31(3), 2013, pp. 322-339.
[11] F. Ferreira, W. Erlhagen, and E. Bicho, A dynamic field model of ordinal and timing properties of sequential events, In International Conference on Artificial Neural Networks, Springer, Berlin, Heidelberg, 2011, pp. 325-332.
[12] Y. Sandamirskaya and G. Schöner, Dynamic field theory of sequential action: A model and its implementation on an embodied agent, In 2008 7th IEEE International Conference on Development and Learning, 2008, pp. 133-138.
[13] F. Ferreira, W. Erlhagen, E. Sousa, L. Louro, and E. Bicho, Learning a musical sequence by observation: A robotics implementation of a dynamic neural field model, In 4th International Conference on Development and Learning and on Epigenetic Robotics, 2014, pp. 157-162.
[14] F. Ferreira, W. Wojtak, E. Sousa, L. Louro, E. Bicho, and W. Erlhagen, Rapid learning of complex sequences with time constraints: A dynamic neural field model, (Under review).
[15] W. Wojtak, F. Ferreira, L. Louro, E. Bicho, and W. Erlhagen, Towards temporal cognition for robots: A neurodynamics approach, In 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), 2017, pp. 407-412.
[16] C. R. Laing, W. C. Troy, B. Gutkin, and G. B. Ermentrout, Multiple bumps in a neuronal model of working memory, SIAM Journal on Applied Mathematics, 63(1), 2002, pp. 62-97.
[17] F. Ferreira, W. Erlhagen, and E. Bicho, Multi-bump solutions in a neural field model with external inputs, Physica D: Nonlinear Phenomena, 326, 2016, pp. 32-51.
[18] S. Amari, Dynamics of pattern formation in lateral-inhibition type neural fields, Biological Cybernetics, 27, 1977, pp. 77-87.
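The field dynamics of Eq. (1) with the oscillatory kernel of Eq. (2) can be illustrated numerically. The following sketch integrates a 1D field with forward Euler; all parameter values are illustrative assumptions, not the values used by the authors. Two transient Gaussian inputs create bumps that remain self-sustained after the inputs are removed, i.e. a multi-bump memory.

```python
import numpy as np

# 1D Amari field (Eq. 1) with the oscillatory kernel of Eq. 2.
# All parameter values are illustrative assumptions, not the paper's.
L, N = 60.0, 600
x = np.linspace(-L / 2, L / 2, N, endpoint=False)
dx = L / N

A_w, b, alpha = 2.0, 0.3, 1.0                    # amplitude, decay rate, zero crossings
r = np.abs(x)
w = A_w * np.exp(-b * r) * (b * np.sin(alpha * r) + np.cos(alpha * r))
w_hat = np.fft.fft(np.fft.ifftshift(w))          # kernel spectrum for fast convolution

tau, h = 1.0, -1.0                               # time scale and global inhibition (h < 0)
u = np.full(N, h)                                # field starts at its resting level
f = lambda v: (v > 0).astype(float)              # Heaviside firing function

def gaussian(center, amp=3.0, width=1.0):
    return amp * np.exp(-((x - center) ** 2) / (2 * width ** 2))

dt, steps, input_steps = 0.05, 2000, 400
for t in range(steps):
    # Two localized inputs, switched off after the first 400 steps.
    S = gaussian(-10.0) + gaussian(10.0) if t < input_steps else 0.0
    conv = np.real(np.fft.ifft(w_hat * np.fft.fft(f(u)))) * dx
    u += dt / tau * (-u + S + h + conv)

# Long after the inputs were removed, both bumps are still suprathreshold.
print(u[np.argmin(np.abs(x - 10.0))] > 0, u[np.argmin(np.abs(x + 10.0))] > 0)
```

The oscillatory (rather than purely lateral-inhibition) kernel is what allows several bumps to coexist without merging or suppressing each other.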
Towards an Evaluation Methodology for the
Environment Perception of Automotive Sensor Setups
How to define optimal sensor setups for future vehicle concepts?
Abstract— With an increasing degree of automation, vehicles require more and more perception sensors to observe their surrounding environment. Car manufacturers are facing the challenge of defining a suitable sensor setup that covers all requirements. Besides the sensors' performance and field-of-view coverage, other factors like setup costs, vehicle integration and design aspects need to be taken into account. Additionally, a redundant sensor arrangement and the sensors' sensitivity to environmental influences are of crucial importance for safety. It is not feasible to explore every possible sensor combination in test drives. This paper presents a new simulation-based evaluation methodology, which allows the configuration of arbitrary sensor setups and enables virtual test drives within specific scenarios to evaluate the environment perception in early development phases with metrics and key performance indicators. This evaluation suite is an important tool for researchers and developers to analyze setup correlations and to define optimal setup solutions.

Keywords— external vehicle sensors, perception, vehicle sensor setup configuration, sensor performance, virtual testing, simulation

I. INTRODUCTION

Advanced Driver Assistance Systems (ADAS) support the driver with functions like Adaptive Cruise Control (ACC), Emergency Braking and Parking Assistant [1, 2]. Vehicles are equipped with several external perception sensors to provide these functions with information about the car's environment. Automotive development is now heading towards Highly Automated Driving (HAD), Fully Automated Driving (FAD) and finally towards driverless Autonomous Driving (AD) to enhance driving comfort and road safety. This rising degree of automation comes along with an increasing number of required perception sensors like cameras, radars, lidars and ultrasonic sensors to ensure sufficient coverage of the vehicle's surroundings. While the sensor setup for a typical ADAS was still manageable with one radar, a few cameras and four ultrasonic sensors at the front and back respectively, future setups will be more complex. To satisfy all requirements resulting from the higher automation degree, like 360° surround view, far and […]

Sensor Configuration: The setup can be established with diverse sensors that work according to different measurement principles. A redundant sensor arrangement shall ensure that important functions are still executable if one sensor drops out or a particular sensor technology is weak in a specific situation. The setup has to cover the vehicle's surroundings without dangerous blind spots. Aside from that, three different areas of interest have to be covered: the far field (highway driving), the near field (urban driving) and the ultra-near field (parking/start driving).

Sensor Integration: Besides optimal mounting positions and sensor alignment concerning Field of View (FOV) coverage and sensor functionality, the feasibility of the geometrical integration and design aspects have to be considered. In addition, environmental influences like sensor occlusion due to dirt, weather and lighting conditions are crucial for sensor performance.

Sensor Benchmarking: Sensor specifications like FOV, range, detection probability and accuracy are crucial. However, another important factor is the cost of the overall setup. Cost-benefit analyses can reveal, e.g., whether two adjoining sensors with a small FOV can replace an expensive sensor with a high FOV.

Defining an adequate sensor setup is a complex task. So far, there is a lack of a consistent evaluation procedure and a suitable tool to support developers in solving this task in a time- and cost-efficient way. Thus, we are addressing the research question:

"Which evaluation methodology can be applied to determine the performance of automotive sensor setups regarding their environmental perception in an early development phase?"

In this paper, we introduce a simulation-based evaluation concept that assists the procedure of evolving an optimal sensor setup in the context of automated driving, based on a reliable evaluation methodology. This also helps researchers to analyze setup correlations and influences of sensor parameters.

II. RELATED WORK AND BASICS

Perception sensors are the first part of a complex data processing chain, which is visualized schematically in Fig. 1.
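One elementary ingredient of such a setup evaluation is a geometric coverage test: does a target point lie inside a sensor's field of view? The helper below is a hypothetical sketch in 2D vehicle coordinates; the sensor dictionary, its field names and all numeric values are assumptions for illustration, not taken from the paper.

```python
import math

def in_fov(sensor, point):
    """Check whether a 2D point (vehicle frame) lies inside a sensor's FOV cone."""
    px, py = point[0] - sensor["x"], point[1] - sensor["y"]
    if math.hypot(px, py) > sensor["range"]:
        return False                               # beyond the sensor's range
    bearing = math.degrees(math.atan2(py, px)) - sensor["yaw"]
    bearing = (bearing + 180.0) % 360.0 - 180.0    # wrap to [-180, 180)
    return abs(bearing) <= sensor["fov_deg"] / 2.0

# Hypothetical long-range front radar mounted at the front bumper.
front_radar = {"x": 3.8, "y": 0.0, "yaw": 0.0, "range": 200.0, "fov_deg": 20.0}

print(in_fov(front_radar, (100.0, 5.0)))   # target ahead, inside the cone → True
print(in_fov(front_radar, (10.0, 15.0)))   # far off-axis → False
```

Running such a test over a grid of points for every sensor in a candidate setup yields the coverage map (far/near/ultra-near field, blind spots) that the text describes.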
The sensor configuration and evaluation suite contains three modules: First, the ground truth data enter the data fusion designer together with the sensor models and the setup configuration. In this part, the ground truth data are modified according to the settings of the probabilistic sensor models. For each virtual sensor of the setup, a measurement model, a detection model as well as a track management strategy is selected via a graphical user interface (GUI). For this, the individual sensor parameters are set according to the sensor properties obtained in step (1) of the three-step procedure, Identification of sensor specifications. In addition, it is possible to modify the available sensor models or to add new models through the programmatic use of a Software Development Kit (SDK). The simulated sensor data are fused in the data fusion module, which relies on extended Kalman filters and feeds the subsequent environment model. Afterwards, the simulated and fused sensor data are compared with the initial ground truth data in the evaluation metrics module, which was built especially for this workflow. In this part, custom evaluation metrics can be added. Based on those metrics, a KPI report containing the results for all scenarios is created to assess the perception performance quantitatively in a compact overview. After that, the KPI reports of different setups are compared, and further aspects like cost-benefit considerations are analyzed to decide whether a particular sensor setup is worth being tested in the next step of the three-step procedure: (3) Test drives.

To consider the position confidence of the estimated objects, the Normalized Estimation Error Squared (NEES) in (4), also known as the Mahalanobis distance [15], includes the position error (g − e) and the covariance matrix P_ge:

NEES(g_i, e_j) = √((g_i − e_j)^T P_ge^(−1) (g_i − e_j))   (4)

VI. CONCLUSION

We presented a new simulation-based concept that is suitable for the evaluation of perception sensor setups for automated vehicles regarding particular sensor mounting positions, diverse setup configurations and different sensor technologies. Our framework is an important instrument that assists automotive system engineers during the early stages of development and supports them with a performance overview for different setups in relevant scenarios. The evaluation suite allows configuring the sensor setup via a simple GUI, while it is still possible to access the software code to modify the probabilistic sensor models and to implement evaluation metrics. By studying the correlations within the sensor data processing chain, requirement profiles in terms of a roadmap for future sensor technologies can be derived. In future work, the method could be adapted to a physics-based simulation like PreScan [19] to extend the sensor evaluation by considering physical effects.
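The metric in (4) is straightforward to compute; the sketch below uses made-up numbers for a ground truth object g, a track estimate e, and a diagonal position covariance P_ge:

```python
import numpy as np

def nees(g, e, P):
    """NEES / Mahalanobis distance of Eq. (4) between ground truth g and estimate e."""
    d = g - e
    return float(np.sqrt(d @ np.linalg.inv(P) @ d))

g = np.array([10.0, 4.0])      # ground truth position (x, y) in m (illustrative)
e = np.array([10.5, 3.0])      # estimated track position
P = np.diag([0.25, 1.0])       # position covariance of the estimate

print(round(nees(g, e, P), 3))   # → 1.414
```

Because the error is weighted by the inverse covariance, the same metric value means a larger tolerated absolute error along directions where the track itself reports more uncertainty.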
Fig. 1. [Workflow schematic: scenario collection/generation and the sensor setup configuration (via GUI and SDK) feed the probabilistic sensor models; the simulated data pass through data fusion into the environment model; the evaluation metrics produce a KPI report, which enters the cost-benefit evaluation.]
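The fusion step in this chain can be reduced to its simplest form for illustration: a scalar linear Kalman update combining two noisy position measurements of the same object. The actual module relies on extended Kalman filters; the measurement and noise values below are invented.

```python
def kalman_update(x, P, z, R):
    """One scalar Kalman update (measurement model H = 1).

    x, P: prior mean and variance; z, R: measurement and its noise variance.
    """
    K = P / (P + R)                    # Kalman gain
    return x + K * (z - x), (1 - K) * P

x, P = 0.0, 1e6                        # diffuse prior on the object position
for z, R in [(10.2, 0.5), (9.8, 0.5)]: # e.g. one radar and one camera measurement
    x, P = kalman_update(x, P, z, R)

print(round(x, 2))                     # → 10.0 (fused estimate between the two)
```

With equal measurement noise the fused estimate lands midway between the two measurements, and the posterior variance is half that of a single measurement, which is exactly the redundancy benefit a multi-sensor setup is meant to deliver.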
REFERENCES
[1] A. Ziebinski, R. Cupek, D. Grzechca, and L. Chruszczyk, "Review of advanced driver assistance systems (ADAS)," in AIP Conference Proceedings 1906, p. 120002.
[2] K. Bengler, K. Dietmayer, B. Farber, M. Maurer, C. Stiller, and H. Winner, "Three decades of driver assistance systems: Review and future perspectives," IEEE Intelligent Transportation Systems Magazine, vol. 6, no. 4, pp. 6–22, Winter 2014.
[3] J. van Brummelen, M. O'Brien, D. Gruyer, and H. Najjaran, "Autonomous vehicle perception: The technology of today and tomorrow," Transportation Research Part C: Emerging Technologies, vol. 89, pp. 384–406, 2018.
[4] S. Hasirlioglu, A. Kamann, I. Doric, and T. Brandmeier, "Test methodology for rain influence on automotive surround sensors," IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pp. 2242–2247, 2016.
[5] T. Dörfler, "Testing ADAS sensors in early development phases," ATZelektronik worldwide, vol. 2018, no. 13, p. 54.
[6] P. Viswanath, M. Mody, S. Nagori, J. Jones, and H. Garud, "Virtual simulation platforms for automated driving: Key care-about and usage model," Electronic Imaging, vol. 2018, no. 17, 164-1–164-6, 2018.
[7] D. Gruyer, S. Choi, C. Boussard, and B. d'Andrea-Novel, "From virtual to reality, how to prototype, test and evaluate new ADAS: Application to automatic car parking," in 2014 IEEE Intelligent Vehicles Symposium Proceedings, MI, USA, 2014, pp. 261–267.
[8] W. Huang, K. Wang, Y. Lv, and F. Zhu, "Autonomous vehicles testing methods review," in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Rio de Janeiro, Brazil, 2016, pp. 163–168.
[9] T. Hanke, N. Hirsenkorn, B. Dehlink, A. Rauch, R. Rasshofer, and E. Biebl, "Classification of sensor errors for the statistical simulation of environmental perception in automated driving systems," in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Rio de Janeiro, Brazil, 2016, pp. 643–648.
[10] R. Schubert, N. Mattern, and R. Bours, "Simulation of sensor models for the evaluation of advanced driver assistance systems," ATZelektronik worldwide, vol. 9, no. 3, pp. 26–29, 2014.
[11] C. van Driesten and T. Schaller, "Overall approach to standardize AD sensor interfaces: Simulation and real vehicle," in Proceedings, Fahrerassistenzsysteme 2018, T. Bertram, Ed., Wiesbaden: Springer Fachmedien Wiesbaden, 2019, pp. 47–55.
[12] C. Goodin, R. Kala, A. Carrillo, and L. Y. Liu, "Sensor modeling for the virtual autonomous navigation environment," in 2009 IEEE Sensors, Christchurch, New Zealand, Oct. 2009, pp. 1588–1592.
[13] S. Matzka and R. Altendorfer, "A comparison of track-to-track fusion algorithms for automotive sensor fusion," in Lecture Notes in Electrical Engineering, Multisensor Fusion and Integration for Intelligent Systems, H. Hahn, H. Ko, and S. Lee, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 69–81.
[14] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics: The MIT Press, 2005.
[15] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software, 1st ed.: Wiley-Interscience, 2001.
[16] IPG Automotive GmbH, CarMaker. [Online] Available: https://ipg-automotive.com/de/produkte-services/simulation-software/carmaker/. Accessed on: Jun. 24 2019.
[17] Baselabs GmbH, Data fusion for automated driving: Data fusion results. [Online] Available: https://www.baselabs.de/. Accessed on: Mar. 21 2019.
[18] D. Schuhmacher, B.-T. Vo, and B.-N. Vo, "A consistent metric for performance evaluation of multi-object filters," IEEE Trans. Signal Process., vol. 56, no. 8, pp. 3447–3457, 2008.
[19] F. Leneman, D. Verburg, and S. Buijssen, Eds., PreScan, testing and developing active safety applications through simulation, 2008.
Risk-Aware Reasoning for Autonomous Vehicles
Majid Khonji, Jorge Dias, and Lakmal Seneviratne
Abstract— A significant barrier to deploying autonomous vehicles (AVs) on a massive scale is safety assurance. Several technical challenges arise due to the uncertain environment in which AVs operate, such as road and weather conditions, errors in perception and sensory data, and model inaccuracy. In this paper, we propose a system architecture for risk-aware AVs capable of reasoning about uncertainty and deliberately bounding the risk of collision below a given threshold. We discuss key challenges in the area, highlight recent research developments, and propose future research directions in three subsystems. First, a perception subsystem that detects objects within a scene while quantifying the uncertainty that arises from different sensing and communication modalities. Second, an intention recognition subsystem that predicts the driving style and the intention of agent vehicles (and pedestrians). Third, a planning subsystem that takes the uncertainty from the perception and intention recognition subsystems into account and propagates it all the way to control policies that explicitly bound the risk of collision. We believe that such a white-box approach is crucial for the future adoption of AVs on a large scale.

I. INTRODUCTION

Over the past hundred years, innovation within the automotive industry has created more efficient, affordable, and safer vehicles, but progress has been incremental so far. The industry is now on the verge of a substantial change due to the advancements in Artificial Intelligence (AI) and Autonomous Vehicle (AV) sensing technologies. These advancements offer the possibility of significant benefits to society, saving lives, and reducing congestion and pollution. Despite the progress, a significant barrier to large scale deployment is safety assurance. Most technical challenges are due to the uncertain environment in which AVs operate, such as road and weather conditions, errors in perception and sensory input data, and uncertainty in the behavior of pedestrians and agent vehicles. A robust AV control algorithm should account for different sources of uncertainty and generate control policies that are quantifiably safe. In addition, algorithms that respect precise safety measures can […] individual events becomes of critical importance, as the public would require transparency and explainable AI. Recent AV fatal crashes raise further debates among scholars and pioneers in the industry concerning how an autonomous vehicle should act when human safety is at risk. On a more philosophical level, a study [2] sheds light on the major challenges of understanding societal expectations about the principles that should guide decision making in life-critical situations. As an illustrative example, suppose a self-driving vehicle, experiencing a partial system failure, is forced into an ultimatum choice between running over pedestrians or sacrificing itself and its passenger to save them. What should be the reasoning behind such a situation and, more fundamentally, what should be the moral choice? Despite the profound philosophical dilemma, and its impact on the public perception of AI as a whole and on the regulatory aspects for AVs in particular, the current state of the art of the technological stack of AVs does not explicitly capture and propagate uncertainty sufficiently well throughout decision processes in order to accurately assess these edge scenarios.

In this work, we discuss an algorithmic pipeline and a technical stack for AVs to capture and propagate uncertainty from the environment throughout perception, prediction, planning, and control. An AV has to be able to plan and optimize trajectories from its current location to a goal while avoiding static and dynamic (moving) obstacles, and while meeting deadlines and efficiency constraints. The risk of collision should be bounded by a given safety threshold that meets governmental regulations, while meeting deadlines should meet a quality-of-service threshold.

To expand the AV perception range, we consider the Vehicular Ad-Hoc Network (VANET) communication model. Vehicle-to-Vehicle (V2V), Vehicle-to-Infrastructure (V2I), and more recently Vehicle-to-Everything (V2X) are technologies that enable vehicles to exchange safety and mobility information between each other and with the surrounding agents, in-
assist policymakers addressing legislative issues related to cluding pedestrians with smart phones and smart wearables.
AVs, such as insurance policies and ultimately convince the Vehicles can collect information en route, such as road
public for a wide deployment of AVs. conditions and position estimates of static and dynamic
One of the most prevalent measures for AV safety is objects, and can use this information to continuously predict
the number of crashes per million miles [1]. Although actions performed by other vehicles and infrastructure. V2V
such a measure provides some estimate on overall safety messages would have a range of approximately 300 meters,
performance in a particular environment, it fails to capture which exceeds the capabilities of systems with cameras,
unique differences and the richness of individual scenarios. ultrasonic sensors, and LIDAR, allowing greater capability
As AVs become more prevalent, the reasoning behind in- and time to warn vehicles.
In this work, we propose a system architecture (Sec. II)
Majid Khonji, Jorge Dias, and Lakmal Seneviratne are with and discuss key challenges in quantifying uncertainty at dif-
KU Center for Autonomous Robotic Systems, Khalifa University,
Abu Dhabi, UAE (email {majid.khonji, jorge.dias, ferent levels of abstractions: scene representation (Sec. III),
lakmal.seneviratne}@ku.ac.ae). intention recognition (Sec. IV), risk-bounded planning
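The recurring requirement in this paper, keeping the probability of plan failure below a threshold set by a policymaker, can be illustrated with a minimal risk-aggregation check. This is our own sketch with invented numbers, not the paper's implementation; a real system would derive the per-step risks from the perception and intention recognition subsystems.

```python
# Sketch (ours, not the paper's implementation): checking a chance
# constraint on a candidate plan. Each step i carries an estimated
# collision probability p_i; the plan is admissible only if the overall
# failure probability stays below a policymaker-defined threshold delta.

def plan_failure_probability(step_risks):
    """Probability that at least one step fails, assuming independent steps."""
    survive = 1.0
    for p in step_risks:
        survive *= (1.0 - p)
    return 1.0 - survive

def satisfies_chance_constraint(step_risks, delta):
    """True if total collision risk is bounded by delta. Boole's inequality
    (summing the risks) gives a cheaper, strictly more conservative test."""
    if sum(step_risks) <= delta:           # conservative upper bound
        return True
    return plan_failure_probability(step_risks) <= delta

# A 5-step plan with small per-step risks against a 0.1% risk budget:
risks = [1e-4, 2e-4, 1e-4, 3e-4, 2e-4]
print(satisfies_chance_constraint(risks, delta=1e-3))  # → True
```

Setting `delta` is exactly the conservatism-versus-performance trade-off the paper attributes to the system operator or policymaker.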
II. SYSTEM ARCHITECTURE

[Fig. 1: Risk-aware AV stack. Diagram: a High Level Planner, given a goal, sends short-term objectives to (and renegotiates objectives with) the Short Horizon Planner (Risk-Bounded POMDP); an Intent Recognizer supplies agent-vehicle intentions from observed vehicle states; a Motion Model Generator supplies a PFT library learned from demonstrated trajectories; Perception/V2X turns raw sensing and V2X data into a Probabilistic Scene Representation; the output is control actions for the ego-vehicle.]

In the following, we present the architecture of a risk-aware AV stack with six technical objectives in mind:

• A probabilistic perception and object representation system that takes into consideration uncertainty that arises from hardware modalities and sensor fusion. The system will capture uncertainty in object classification, bounding geometries, and temporal inconsistencies under diverse conditions.
• Leverage the communication network to gain knowledge of the surrounding agents (vehicles and pedestrians) that are beyond line-of-sight, and then improve upon the scene representation.
• An intention recognition system that takes into account all dynamic objects (vehicles and pedestrians), from perception and V2X communication, and estimates a distribution over potential future trajectories.
• Generalize upon recently developed risk-aware optimization algorithms [3], [4], in order to ensure that movements are safe.
• On a higher level, propose goal-directed autonomous planners that strive to meet the passenger goals and preferences, and help the passengers to think through adjustments to their goals when they cannot be safely met.
• To ensure that decisions are made in a timely manner, design polynomial-time approximation algorithms that offer formal bounds on sub-optimality and produce near-optimal results.

In addition, by specifying the probability that a plan is executed successfully, the system operator or policymaker can set the desired level of conservatism in the plan in a meaningful manner and can trade conservatism against performance. Fig. 1 shows the interaction between the key components of the system, as we illustrate throughout the paper.

III. PROBABILISTIC SCENE REPRESENTATION

Scene understanding is a research topic with strong impact on technologies for autonomous vehicles. Most of the efforts have been concentrated on understanding the scenes surrounding the ego-vehicle (the autonomous vehicle itself). This comprises a sensor data processing pipeline that includes different stages, such as low-level vision tasks and detection, tracking, and segmentation of the surrounding traffic environment, e.g., pedestrians, cyclists, and vehicles. However, for an autonomous vehicle, these low-level vision tasks are insufficient for comprehensive scene understanding. It is necessary to include reasoning about the past and the present of the scene participants. This paper intends to guide future research on the interpretation of traffic scenes in autonomous driving from a probabilistic event reasoning perspective.

A. Probabilistic Context Layout for Driving

Scene representation includes context representations that capture spatially geometrical relationships [5] among different traffic elements with certain semantic labels. It differs from semantic segmentation frameworks [6], [7] because the context representation does not only contain the static components of the traffic scene (a typical technique for this aspect is simultaneous localization and mapping (SLAM)), such as the road, the type of traffic lanes, traffic direction, and participant orientation, but also consists of several kinds of dynamic elements, e.g., the motion correlation of participants. The studies [8], [9] give a detailed review of semantic segmentation, taking traffic geometry inference into consideration.

A key aspect of context representation is to extract salient features from a large set of sensor data. For that purpose, it is necessary to establish a saliency mechanism, that is, a critical region extraction and information simplification technique that is widely used for attractive region selection in images. Over the past few decades, saliency has generally been formulated in bottom-up and top-down modes. Bottom-up modes [10], [11] are fast, data-driven, pre-attentive, and task-independent. Top-down approaches [12], [13], [14], [15] often entail supervised learning with pre-collected task labels from a large set of training examples, and are task-oriented and vary across environments.

A recent work [16] presents a fast algorithm that obtains a probabilistic occupancy model for dynamic obstacles in the scene with few sparse LIDAR measurements. Typically, the occupancy states exhibit highly nonlinear patterns that cannot be captured with a simple linear classification model. Therefore, deep learning models and kernel-based models can be considered as potential candidates. However, these approaches require either a massive amount of data or a high number of hyper-parameters to tune. A promising future direction is to extend this approach to account for different object classes (rather than an occupancy map) and other sensors as well, such as cameras.

B. Beyond Line-of-sight

Any sensing modality has blind spots. For objects that lie beyond line-of-sight, one can consider a communication network to improve upon the scene representation. This can be critical in certain edge scenarios.
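As a small illustration of the Gaussian uncertainty models mentioned above, consider fusing an uncertain onboard estimate of a remote vehicle's position with the position it reports over V2V. The precision-weighted (product-of-Gaussians) update below is standard; the numbers and variable names are ours, for illustration only.

```python
# Sketch (our notation): fuse two uncertain 1-D position estimates of the
# same remote vehicle, one from onboard sensing and one received over V2V,
# modeled as Gaussians, via the standard precision-weighted product.

def fuse_gaussians(mu1, var1, mu2, var2):
    """Mean and variance of the fused (product) Gaussian estimate."""
    k = var1 / (var1 + var2)          # Kalman-style gain
    mu = mu1 + k * (mu2 - mu1)
    var = (1.0 - k) * var1
    return mu, var

onboard = (52.0, 9.0)   # noisy long-range onboard estimate (m, m^2)
v2v     = (50.0, 1.0)   # position self-reported over V2V (m, m^2)
mu, var = fuse_gaussians(*onboard, *v2v)
print(round(mu, 2), round(var, 2))  # → 50.2 0.9
```

The fused variance is always smaller than either input variance, which is why V2X messages remain useful even when onboard sensors do detect the object.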
For example, in Fig. 2, the ego-vehicle (red) has two options: either maintain speed or overtake the vehicle ahead. Suppose that another agent vehicle is approaching from a distance at which it is not detected by the onboard sensors of the ego-vehicle. In this scenario, both the speed and the location of the distant vehicle might not be accurately estimated, and maneuver A2 may therefore lead to a collision.

[Fig. 2: V2V communication.]

There has been substantial progress in the standardization of vehicle-to-everything/V2X (V2V/V2I/V2P) communication protocols. The major V2X standards are known as DSRC (Dedicated Short-Range Communications) [17] as well as 5G [18]. The introduction of 5G's millimeter-wave transmissions brings a new paradigm to wireless communications. Depending on the application, 5G positioning can also enhance tracking techniques, which leverage short-term historical data (local signatures and key features). Uncertainty can be captured by probabilistic models (e.g., Gaussian) through sampling temporal inconsistencies in historical data streams, such as localization data, and parameter tuning.

IV. INTENTION RECOGNITION

This subsystem involves prediction and machine learning tasks to reliably estimate the future trajectories of uncontrollable agents in the scene, including pedestrians and other agent vehicles. Many existing trajectory prediction algorithms [19], [20] obtain deterministic results quite efficiently. However, these approaches fail to capture the uncertain nature of human actions. Probabilistic predictions are beneficial in many safety-critical tasks such as collision checking and risk-aware motion planning. They can express both the intrinsically uncertain prediction task at hand (human nature) and reasoning about the limitations of the prediction method (knowing when an estimate could be wrong [21]). To incorporate uncertainties into prediction results, data-driven approaches can learn common characteristics from datasets of demonstrated trajectories [22], [23]. These methods often express uni-modal predictions, which may not perform well in sophisticated urban scenarios where the driver can choose among multiple actions. A recent work [24] presents a hybrid approach using a variational neural network that predicts future driver trajectory distributions for the ego-vehicle based on multiple sensors in urban scenarios. The work can be extended in the future to predict trajectories for agent vehicles using V2V data streams, if available.

We propose a simple intent recognition scheme that is divided into two steps. First, we continuously record high-level maneuvers of surrounding vehicles (both offline and online). Examples of such maneuvers are merge left, merge right, and accelerate, each at different velocities and variations, and so on. Each of these maneuvers comprises a set of collected trajectories. Due to the uncertainties in the motions of human-driven vehicles, we learn a compact motion representation called a Probabilistic Flow Tube (PFT) [25] from demonstration trajectories to capture human-like driving styles and uncertainties for each maneuver. A library of pre-learned PFTs can then be used to estimate the current maneuver as well as predict the probabilistic motion of each agent vehicle using a Bayesian approach.

V. RISK-BOUNDED PLANNING

Deterministic optimization approaches have been well developed and widely used in several disciplines and industries, in order to optimize processes both off-line and on-line. In this work, we characterize uncertainty in a probabilistic manner and find the optimal sequence of ego-vehicle trajectory controls, subject to the constraint that the probability of failure must be below a certain threshold. Such a constraint is known as a chance constraint. In many applications, the probabilistic approach to uncertainty modeling has a number of advantages over a deterministic approach. For instance, disturbances such as vehicle wheel slip can be represented using a stochastic model. When using a Kalman filter to enhance localization, the location estimate is provided as a probability distribution. In addition, by specifying the probability that a plan is executed successfully, the system operator or policymaker can set the desired level of conservatism in the plan in a meaningful manner and can trade conservatism against performance. Therefore, robustness is achieved by designing solutions that guarantee feasibility as long as disturbances do not exceed these bounds. Furthermore, if the passenger goals cannot be safely achieved, then the chance constraints can be analyzed to pinpoint the sources of risk, and the user goals can be adjusted, based on their preferences, in order to restore safety.

Reasoning under uncertainty has several challenges. The trajectory optimization problem is non-convex, due to discrete choices and the presence of obstacles in the feasible space. One approach to tackle these challenges is to introduce multiple layers of abstraction. Instead of solving high-level problems (e.g., route planning) and low-level problems (e.g., steering wheel angle, acceleration, and brake commands) in a single shot, one can decouple them into sub-problems. We achieve such a hierarchy through a high-level planner, a short-horizon planner, and precomputed and learned maneuver trajectories, as we illustrate below.

A. High Level Planner

High-level planning involves route planning, applying traffic rules, and consequently setting short-term objectives (aka set points), which are fed into the Short Horizon Planner (as shown in Fig. 1). The planner adjusts those short-term objectives when no safe solution exists. To be able to model the feasibility of an obtained plan, we leverage Temporal Plan Networks (TPN) [26]. A TPN is a graph where the nodes represent events and the edges represent activities. In temporal planning, the ego-vehicle is presented with a series of events and must decide precisely when to schedule them.
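The scheduling question can be made concrete with a plain Simple Temporal Network (STN), the formalism that TPNs and STNUs build on: each activity constrains the time between two events, and the network is schedulable exactly when its distance graph has no negative cycle. The events and bounds below are hypothetical.

```python
# Sketch with hypothetical bounds: a Simple Temporal Network (STN), the
# choice-free core of a TPN. Events are nodes; each activity constrains
# lo <= t_v - t_u <= hi. The network is consistent iff the induced
# distance graph has no negative cycle (checked via Floyd-Warshall).

INF = float("inf")

def stn_consistent(n_events, constraints):
    """constraints: list of (u, v, lo, hi) meaning lo <= t_v - t_u <= hi."""
    d = [[0 if i == j else INF for j in range(n_events)] for i in range(n_events)]
    for u, v, lo, hi in constraints:
        d[u][v] = min(d[u][v], hi)    # t_v - t_u <= hi
        d[v][u] = min(d[v][u], -lo)   # t_u - t_v <= -lo
    for k in range(n_events):
        for i in range(n_events):
            for j in range(n_events):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return all(d[i][i] >= 0 for i in range(n_events))

# Events: 0 = depart, 1 = reach merge point, 2 = complete merge (seconds).
print(stn_consistent(3, [(0, 1, 5, 10), (1, 2, 2, 4), (0, 2, 0, 12)]))  # → True
# Tightening the overall deadline to 6 s makes the plan infeasible:
print(stn_consistent(3, [(0, 1, 5, 10), (1, 2, 2, 4), (0, 2, 0, 6)]))   # → False
```

STNUs extend this check to edges whose durations the planner does not control, which is what makes them suitable for the stochastic setting described next.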
STNs with Uncertainty (STNUs) are an extension that allows reasoning over stochastic, or uncontrollable, actions and their corresponding durations [27]. Such a formalism allows us to check the feasibility of a high-level plan and prompt the user to adjust his or her intermediate goals and time constraints, in order to output smooth intermediate plans that are fed into the short horizon planner.

B. Short Horizon Planner

Planning under uncertainty is a fundamental area in artificial intelligence. For the AV application, it is crucial to plan for potential contingencies instead of planning a single trajectory into the future. This is often the case in dynamic environments, where the vehicle has to react quickly (in milliseconds) to any potential event. Partially observable Markov decision processes (POMDPs) [28], [29] provide a model for optimal planning under actuator and sensor uncertainty, where the goal is to find policies (contingency plans) that maximize (or minimize) some measure of expected utility (or cost).

In many real-world applications, a single measure of performance is not sufficient to capture all requirements (e.g., an AV tasked to minimize commute time while keeping the distance from obstacles below a given threshold). This extension is often called the constrained POMDP (C-POMDP) [30]. When constraints involve stochasticity (e.g., distance following a probabilistic model), the problem is modeled as a chance-constrained POMDP (CC-POMDP) [4], where we have a bound on the probability of violating constraints. To calculate the risk of each decision, one can leverage the probabilistic flow tube (PFT) concept to model a set of possible trajectories [25]. The current state-of-the-art solver for CC-POMDP is called RAO* [4]. RAO* generates a conditional plan based on action and risk models and likely possible scenarios for agent vehicles.

[Fig. 3: CC-POMDP hypergraph: nodes are the probability distributions over states (belief states) of the ego-vehicle. At each node, there are n possible actions that can be taken by the ego-vehicle. At each level, the belief state is updated with respect to the chosen action and observations of the environment.]

RAO* explores from a probability distribution over vehicle states (a belief state) by incrementally constructing a hypergraph, called the explicit hypergraph, shown in Fig. 3. At each node of the hypergraph, the planner considers possible actions provided by the Motion Model Generator (see Fig. 1) and receives several possible observations. At each level, it utilizes a value heuristic to guide the search towards optimal policies. It also uses a risk heuristic to prune the search space, removing high-risk branches that violate the chance constraints. Hence, at each level, the action that maximizes expected reward and meets the chance constraint is selected for the vehicle. However, one of the drawbacks of RAO* is that it does not always return optimal solutions and also does not provide any bound on the sub-optimality gap. In a recent work [3], we provide an algorithm that guarantees optimality (namely, a fully polynomial time approximation scheme (FPTAS)) while preserving safety constraints, all within polynomial running time.

Recently, [31] applied RAO* to self-driving vehicles under restricted settings (e.g., a known distribution of actions taken by agent vehicles). CC-POMDPs, while otherwise expressive, allow only for sequential, non-durative actions. This poses restrictions in modeling real-world planning problems. In our recent ongoing work, we extend the framework of CC-POMDP to account for durative actions, and leverage heuristic forward search to prune the search space to improve upon the running time.

VI. MOTION MODEL GENERATOR

Based on each driving scenario, we compute a library of maneuvers. Each maneuver is associated with nominal control signals by solving a model predictive control (MPC) optimization problem [31]. The set of possible maneuver actions is constrained by traffic rules and vehicle dynamics and is informed by the expected evolution of the situation. Computing the actions can be accomplished through offline and online computation, and also through publicly available datasets (e.g., Berkeley DeepDrive BDD100k).

The size of the search space of the CC-POMDP described above is sensitive to the number of maneuver actions. To tackle this issue, we consider three different levels of abstraction: i) micro actions are primitive actions like accelerate, decelerate, and maintain; ii) maneuver actions are sequences of micro actions, like merge left and merge right; iii) macro actions are sequences of maneuver actions, such as pass the front vehicle and go straight until the next intersection [32].

To calculate the risk of collision, we leverage PFTs, each of which represents a sequence of probabilistic reachable sets. PFTs provide probabilistic future predictions for the states of the vehicles under a selected action. In this context, the intersection between two temporally aligned PFT trajectories represents the risk of collision. To construct PFTs, we use vehicle dynamics and probabilistic information about uncertainties, as well as learning from datasets. By propagating the probability distributions of the uncertainties through the continuous dynamics of the vehicle, we construct probability distributions for the locations of the vehicle over a finite planning horizon.

VII. CONCLUSION

In this work, we proposed a system architecture for risk-aware AVs that can deliberately bound the risk of collision below a given threshold defined by the policymaker. We presented the related work, discussed key challenges, and proposed research directions in three key subsystems: perception, intention recognition, and risk-aware planning.
We believe that our white-box approach is crucial for a better understanding of AV decision making and ultimately for future adoption of AVs on a large scale.

REFERENCES

[1] N. Kalra and S. M. Paddock, "Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?" Transportation Research Part A: Policy and Practice, vol. 94, pp. 182–193, 2016.
[2] J.-F. Bonnefon, A. Shariff, and I. Rahwan, "The social dilemma of autonomous vehicles," Science, vol. 352, no. 6293, pp. 1573–1576, 2016.
[3] M. Khonji, A. Jasour, and B. Williams, "Approximability of constant-horizon constrained POMDP," in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 7 2019, pp. 5583–5590. [Online]. Available: https://doi.org/10.24963/ijcai.2019/775
[4] P. Santana, S. Thiébaux, and B. Williams, "RAO*: an algorithm for chance-constrained POMDPs," in Proc. AAAI Conference on Artificial Intelligence, 2016.
[5] C. Landsiedel and D. Wollherr, "Road geometry estimation for urban semantic maps using open data," Advanced Robotics, vol. 31, no. 5, pp. 282–290, 2017.
[6] E. Levinkov and M. Fritz, "Sequential Bayesian model update under structured scene prior for semantic road scenes labeling," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1321–1328.
[7] Z. Zhang, S. Fidler, and R. Urtasun, "Instance-level segmentation for autonomous driving with deep densely connected MRFs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 669–677.
[8] J. Janai, F. Güney, A. Behl, and A. Geiger, "Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art," arXiv preprint arXiv:1704.05519, 2017.
[9] B. Zhao, J. Feng, X. Wu, and S. Yan, "A survey on deep learning-based fine-grained object classification and semantic segmentation," International Journal of Automation and Computing, vol. 14, no. 2, pp. 119–135, 2017.
[10] J. Zhang and S. Sclaroff, "Exploiting surroundedness for saliency detection: a boolean map approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 5, pp. 889–902, 2015.
[11] Q. Wang, Y. Yuan, P. Yan, and X. Li, "Saliency detection by multiple-instance learning," IEEE Transactions on Cybernetics, vol. 43, no. 2, pp. 660–672, 2013.
[12] S. He, R. W. Lau, and Q. Yang, "Exemplar-driven top-down saliency detection via deep association," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5723–5732.
[13] J. Yang and M.-H. Yang, "Top-down visual saliency via joint CRF and dictionary learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 3, pp. 576–588, 2016.
[14] J. Pan, E. Sayrol, X. Giro-i Nieto, K. McGuinness, and N. E. O'Connor, "Shallow and deep convolutional networks for saliency prediction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 598–606.
[15] Y. Xia, D. Zhang, A. Pozdnoukhov, K. Nakayama, K. Zipser, and D. Whitney, "Training a network to attend like human drivers saves it from common but misleading loss functions," arXiv preprint arXiv:1711.06406, 2017.
[16] R. Senanayake, A. Tompkins, and F. Ramos, "Automorphing kernels for nonstationarity in mapping unstructured environments," in CoRL, 2018, pp. 443–455.
[17] H. Hartenstein and K. Laberteaux, VANET: Vehicular Applications and Inter-networking Technologies. Wiley Online Library, 2010, vol. 1.
[18] J. Gante, G. Falcão, and L. Sousa, "Deep learning architectures for accurate millimeter wave positioning in 5G," Neural Processing Letters, https://doi.org/10.1007/s11063-019-10073, pp. 1–28, 2019.
[19] A. Houenou, P. Bonnifait, V. Cherfaoui, and W. Yao, "Vehicle trajectory prediction based on motion model and maneuver recognition," in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2013, pp. 4363–4369.
[20] H. Woo, Y. Ji, H. Kono, Y. Tamura, Y. Kuroda, T. Sugano, Y. Yamamoto, A. Yamashita, and H. Asama, "Lane-change detection based on vehicle-trajectory prediction," IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 1109–1116, 2017.
[21] A. Kendall and Y. Gal, "What uncertainties do we need in Bayesian deep learning for computer vision?" in Advances in Neural Information Processing Systems, 2017, pp. 5574–5584.
[22] D. Vasquez, T. Fraichard, and C. Laugier, "Growing hidden Markov models: An incremental tool for learning and predicting human and vehicle motion," The International Journal of Robotics Research, vol. 28, no. 11-12, pp. 1486–1506, 2009.
[23] J. Wiest, M. Höffken, U. Kreßel, and K. Dietmayer, "Probabilistic trajectory prediction with Gaussian mixture models," in 2012 IEEE Intelligent Vehicles Symposium. IEEE, 2012, pp. 141–146.
[24] X. Huang, S. McGill, B. C. Williams, L. Fletcher, and G. Rosman, "Uncertainty-aware driver trajectory prediction at urban intersections," arXiv preprint arXiv:1901.05105, 2019.
[25] S. Dong and B. Williams, "Learning and recognition of hybrid manipulation motions in variable environments using probabilistic flow tubes," International Journal of Social Robotics, vol. 4, no. 4, pp. 357–368, 2012.
[26] A. G. Hofmann and B. C. Williams, "Temporally and spatially flexible plan execution for dynamic hybrid systems," Artificial Intelligence, vol. 247, pp. 266–294, 2017.
[27] N. Bhargava and B. C. Williams, "Faster dynamic controllability checking in temporal networks with integer bounds," in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 7 2019, pp. 5509–5515. [Online]. Available: https://doi.org/10.24963/ijcai.2019/765
[28] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, "Planning and acting in partially observable stochastic domains," Artificial Intelligence, vol. 101, no. 1-2, pp. 99–134, 1998.
[29] E. J. Sondik, "The optimal control of partially observable Markov decision processes," Ph.D. thesis, Stanford University, 1971.
[30] P. Poupart, A. Malhotra, P. Pei, K.-E. Kim, B. Goh, and M. Bowling, "Approximate linear programming for constrained partially observable Markov decision processes," in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[31] X. Huang, A. Jasour, M. Deyo, A. Hofmann, and B. C. Williams, "Hybrid risk-aware conditional planning with applications in autonomous vehicles," in 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018, pp. 3608–3614.
[32] S. Omidshafiei, A.-A. Agha-Mohammadi, C. Amato, and J. P. How, "Decentralized control of partially observable Markov decision processes using belief space macro-actions," in 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 5962–5969.
Cognitively-inspired episodic imagination for self-driving vehicles
Sara Mahmoud1 , Henrik Svensson1 and Serge Thill2,1
Abstract— The controller of an autonomous vehicle needs the ability to learn how to act in different driving scenarios that it may face. A significant challenge is that it is difficult, dangerous, or even impossible to experience and explore various actions in situations that might be encountered in the real world. Autonomous vehicle control would therefore benefit from a mechanism that allows the safe exploration of action possibilities and their consequences, as well as the ability to learn from the experience thus gained to improve driving skills. In this paper we demonstrate a methodology that allows a learning agent to create simulations of possible situations. These simulations can be chained together in a sequence that allows the progressive improvement of the agent's performance, such that the agent is able to appropriately deal with novel situations at the end of training. This methodology takes inspiration from the human ability to imagine hypothetical situations using episodic simulation; we therefore refer to it as episodic imagination. An interesting question in this respect is what effect the structuring of such a sequence of episodic imaginations has on performance. Here, we compare a random process to a structured one, and initial results indicate that a structured sequence outperforms a random one.

Funding: the authors acknowledge financial support from the EC H2020 research project Dreams4Cars (no. 731593).
1 Interaction Lab, University of Skövde, 54128 Skövde, Sweden ([email protected], [email protected]).
2 Donders Institute for Brain, Cognition, and Behaviour, Radboud University, 6525 HR Nijmegen, Netherlands ([email protected]).

I. INTRODUCTION

A. Simulation abilities in humans

The ability to internally simulate what has or will happen in past and future situations provides agents with increased flexibility when interacting with the world. In humans, these mental simulations occur in many forms, ranging from low-level embodied simulations to higher-level episodic simulation [1]. These can briefly be described as follows:

Embodied simulations, in which the sensorimotor systems of the brain are extensively reactivated in similar ways as during overt interaction with the world, have been shown to improve subsequent motor performance in, for example, path navigation [2], sports activities [3], and rehabilitation [4]. Thus, embodied simulation seems to facilitate learning despite the absence of direct feedback from the environment. Episodic simulations, on the other hand, refer to simulations concerning more abstract aspects of interactions that do not directly affect motor performance, but rather are more flexible and diverse in terms of the content of the simulations and influence action selection on a higher level, such as contemplating different places for the next vacation, preparing your arguments for the next salary negotiation, or imagining where you will be in 10 years [5], [6].

Another instance of episodic simulation is found in dreams, during which the brain is more or less cut off from sensory input and motor output. Although the function of dreams remains heavily debated, some theories suggest that they might help to prepare agents for action and can improve performance in the wake state. For example, Revonsuo [7] suggests the "Threat Simulation Theory", according to which a major function of dreams is to rehearse possibly threatening situations. Others have hypothesized that rudimentary mental simulations during early childhood interact with wake behavior to facilitate the formation of more mature mental simulations during development [8], [9]. It follows from this that the importance lies not only in some general reactivation of previous sensorimotor activity, but also in the content, since this might influence the usefulness of the simulation for future behaviors.

B. Implementations of simulations in artificial agents

Simulation theories of various kinds have previously been implemented in various artificial agents to investigate how such an ability affects behavior and can improve performance [10], [11], [12]. For example, an early approach was adopted by Mel [13], who created a robot arm that, by means of forward models, could plan its movements by "imagining" its future movements. Many other approaches have since utilized the ability to internally re-create sensory and motor states to assist in various tasks (see e.g. [12], [14], [8]). Many of the previous attempts at implementing mental simulations in robots have been rather simplistic, and more closely related to embodied simulations than to episodic simulation due to the nature of the mechanisms used.

The hallmark of episodic simulations is increased flexibility and diversity, with the content of the simulations not being dictated to the same degree by the physical constraints of body and environment as would be the case in embodied simulations [1]. Implementing a more strongly biologically inspired mechanism [8], [7] would require such simulations to be more flexible with respect to their content. Deep neural networks [15] and Generative Adversarial Networks (GANs) [16], for example, may provide viable approaches for implementing episodic-like simulations in artificial agents. GANs, for example, provide the required flexibility because they are able to create previously unseen data in a useful way. As such, GANs have been used as imagination to generate video scenes similar to collected real-world video data, which subsequently were used to run a reinforcement-learning based driving agent in the generated scenes [17]. Initial results showed that a trained GAN generated simulated images very close to the real data. Thus, rather than recreating very simple sensor data, these networks are able to create more episodic-like images [18]. In other work, using a setup with a variational auto-encoder combined with a recurrent neural network, Ha and Schmidhuber [19] also showed promising results of using episodic-like simulations (or dreams/hallucinations, in their terms) in the OpenAI Gym [20] and VizDoom [21] environments.

[Fig. 1: Imagination types used for training the driving agent: no imagination, stochastic structure and systematic
While previous work has put much effort into image generation mechanisms, it is not clear which variables affect the learning process when learning and generating behaviors are instead based on episodic simulations. In particular, it is an open question whether the structure of the content of episodic simulations affects learning performance. This may be critical for autonomous driving [22]. For example, one could vary the number of vehicles encountered when learning to overtake in such simulations, but is it enough, as has been done in previous work, to merely randomly hallucinate different overtaking scenarios [19], [8] until performance converges to a satisfactory level, or should there be some guiding structure to the process?

Fig. 2: Episodic generator system architecture for a self-driving car in the OpenDS simulation.
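The two alternatives contrasted above, purely random hallucination versus a systematically structured sequence of episodes, can be sketched as follows. The single curvature parameter and both generator functions are illustrative assumptions of ours, not the paper's actual road generator:

```python
import random

# Illustrative sketch: two ways of generating imagined road episodes,
# each episode reduced to a single road-curvature parameter in [-1, 1].
def random_episodes(n, rng=random.Random(0)):
    # Stochastic structure: each episode draws an arbitrary curvature.
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

def structured_episodes(n):
    # Systematic structure: curvature is swept gradually across its range,
    # giving an ordered curriculum rather than random hallucination.
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
```

A learning agent would then be trained on one episode sequence or the other, and its lane-keeping performance compared across the two conditions.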
In the remainder of this paper, we investigate this using a lane-keeping task for an autonomous vehicle. Since the focus is not on the image generation process per se, but on how the content of episodic simulations interacts with the learning process, we here use the rendered simulation in a driving simulator directly as a model for embodied and episodic simulation. The simulation consists of both embodied aspects, such as the physics model of the vehicle, and episodic aspects, such as the type of road environment. However, since the study only manipulates the road environment, we use the term episodic imagination for the test conditions in the study. This allows us to create a tool that is able to flexibly create new episodic-like simulations and to focus on the question of how their content may affect learning and subsequent performance. It should also be noted that the imagination mechanism proposed here differs from the common approach of manually designing the simulations: here, they are automatically generated by the proposed system architecture. The work thus also contributes to the development of more effective means of learning from imagination by developing an automatic imagination mechanism.

The remainder of the paper is structured as follows: Section II describes the research method, Section III presents the results, and Section IV concludes the paper.

II. METHODS

A. System Architecture

In a nutshell, the system architecture consists of four main components (see Figure 2): (1) OpenDS, the physical simulation, in which the training and testing driving is executed (Figure 3); (2) the learning agent, which is trained under different imagination conditions; (3) a middleware connector, which converts the simulation into an RL environment; and (4) the road generator, which describes the road specifications used in the simulations.

We use the middleware connector to calculate the reward function (optimized for a lane-keeping task) at each step (see Eqns. 1–3). The function depends on the lateral distance from the left lane margin of the road (Eqn. 1) and on the heading angle between the lane and the car (Eqn. 2), as shown in Figure 3b.

r_e = min(d_l, w − d_l)    (1)

r_h = 2 · e^(−15 · |l_h|)    (2)

r_t = r_e + r_h    (3)
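As a minimal sketch, the reward of Eqns. 1–3 can be written as below; the function and variable names are our own illustration, since the actual middleware implementation is not shown in the paper:

```python
import math

# Sketch of the lane-keeping reward (Eqns. 1-3); names are illustrative
# assumptions, not the authors' actual middleware code.
def lane_keeping_reward(d_l: float, w: float, l_h: float) -> float:
    """d_l: lateral distance from the left lane margin,
    w: lane width, l_h: heading angle between lane and car (radians)."""
    r_e = min(d_l, w - d_l)             # Eqn. 1: maximal at the lane center
    r_h = 2 * math.exp(-15 * abs(l_h))  # Eqn. 2: maximal when heading aligned
    return r_e + r_h                    # Eqn. 3: total reward
```

The distance term peaks when the car is centered in the lane, and the sharp exponential makes the heading term dominate reward shaping near alignment.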
Abstract— Deep learning is responsible for the current renewed success of artificial intelligence. Applications that in the recent past were considered beyond imagination now appear to be feasible. The best example is autonomous driving. However, despite the growing research aimed at implementing autonomous driving, no artificial intelligence can yet claim to have reached or closely approached the driving performance of humans. Deep learning is an evolution of the artificial neural networks introduced in the '80s with the Parallel Distributed Processing (PDP) project. There is a fundamental difference in aims between the first generation of artificial neural networks and deep neural models. The former was motivated primarily by the exploration of cognition. Current deep neural models are instead developed with engineering goals in mind, without any ambition or interest in exploring cognition. Some important components of deep learning – for example reinforcement learning or recurrent networks – do owe an inspiration to neuroscience and cognitive science, as a distant legacy of PDP, but this connection is now neglected: what matters is only pragmatic success in applications. We argue that it is urgent to reconnect artificial modeling with an updated knowledge of how complex tasks are realized by the human mind and brain. In this paper, we first try to distill concepts within neuroscience and cognitive science relevant to driving behavior. Then, we identify possible algorithmic counterparts of such concepts, and finally build an artificial neural model exploiting these components for the visual perception task of an autonomous vehicle.

This work was developed inside the EU Horizon 2020 Dreams4Cars Research and Innovation Action project, supported by the European Commission under Grant 731593. The authors also want to thank the Deep Learning Lab at the ProM Facility in Rovereto (TN) for supporting this research with computational resources funded by Fondazione CARITRO.
1 Alice Plebe is with the Dept. of Information Engineering and Computer Science, University of Trento, Italy [email protected]
2 Mauro Da Lio is with the Dept. of Industrial Engineering, University of Trento, Italy [email protected]

I. FROM THE COGNITIVE SIDE

A. The Simulation Theory

A well-established theory in cognitive science is the one proposed by Jeannerod and Hesslow, the so-called simulation theory of cognition, which proposes that thinking is essentially a simulated interaction with the environment [1], [2]. In their view, simulation is a general principle of cognition, which can be expressed in at least three different components: perception, action and anticipation.

The simplest case of simulation is mental imagery, especially in the visual modality. This is the case, for example, when a person tries to picture an object or a situation. During this phenomenon, the primary visual cortex (V1) is activated with a simplified representation of the object of interest, but the visual stimulus is not actually perceived.

B. Convergence–Divergence Zones

Although the simulation theory is one of the most established, it does not identify how simulation takes place at the neural level. A prominent proposal in this direction is the formulation of convergence-divergence zones (CDZs) [3]. They highlight the "convergent" aspect of certain neuron ensembles, located downstream from primary sensory and motor cortices. Such a convergent structure consists in the projection of neural signals from multiple cortical regions in a many-to-one fashion. On the other hand, the neuron ensembles have the ability to reciprocate feedforward projections with feedback projections in a one-to-many fashion, realizing the divergent flow.

The primary purpose of convergence is to exploit synaptic plasticity in order to record which patterns of features – coded as knowledge fragments in the early cortices – occur in relation with a specific higher-level concept. Such records are built through experience, by interacting with objects. The convergent flow is dominant during perceptual recognition, while the divergent flow dominates imagery.

Convergent-divergent connectivity patterns can be identified for specific sensory modalities, but also in higher-order association cortices. It should be stressed that CDZs are rather different from a conventional processing hierarchy, where processed patterns are transferred from earlier to higher cortical areas. In CDZs, part of the knowledge about perceptual objects is retained in the synaptic connections of the convergent-divergent ensemble. This makes it possible to reinstate an approximation of the original multi-site pattern of a recalled object or scene.

C. Transformational Abstraction

One major challenge in cognitive science is explaining the mental mechanisms by which we build conceptual abstractions. The conceptual space is the mental scaffolding the brain gradually learns through experience, as an internal representation of the world. In particular, conceptual abstraction is derived mostly from perceptual experience, which fits well with the approach implemented by artificial neural networks.

As highlighted by [4], CDZs are a valid systemic candidate for how the formation of high-level concepts takes place at the brain level. However, the idea of CDZs is only sketched and cannot provide a detailed mechanism for conceptual abstraction. A difficulty with acquiring abstract categories lies in the inconsistent manifestations of the characteristic features across real exemplars.

A suggested solution to this difficult issue is the transformational abstraction [5], [6] performed by a hierarchy of cortical operations, as in the ventral visual cortex. The essence of transformational abstraction, from a mathematical point of view, lies in the combination of two operations: linear convolutional filtering and nonlinear downsampling. Operations of this sort have been identified in V1 [7], [8], and are well recognized in the primate ventral visual path as well [9], [10].

D. The Predictive Theory

The reason why cognition is mainly explicated as simulation, according to Hesslow or Jeannerod, is that through simulation the brain can achieve the most precious information for an organism: a prediction of the state of affairs in the future environment. The need for prediction, and how it molds the entire cognition, has become the core of another popular theory, known as the "Bayesian brain", "predictive brain", or "free-energy principle for the brain", introduced by Friston [11]. According to him, the behavior of the brain – and of the organism as a whole – can be conceived as the minimization of free-energy, a quantity that can be expressed in several ways depending on the kind of behavior and the brain systems involved.

Free-energy is a concept that originated in thermodynamics, as a measure of the amount of work that can be extracted from a system. What Friston borrows is not the thermodynamic meaning of free-energy, but only its mathematical form, which is derived from the framework of variational Bayesian methods in statistical physics. We will see in §II-B how the same probabilistic framework is used in the derivation of a deep neural model. For example, this is his free-energy formulation in the case of perception [12, p. 427]:

F_P = Δ_KL( p̌(c|z) ‖ p(c|x, a) ) − log p(x|a)    (1)

where x is the sensorial input of the organism, c is the collection of environmental causes producing x, a are actions that act on the environment to change sensory samples, and z are inner representations of the brain. The quantity p̌(c|z) is the brain's encoding of the estimate of the causes of the sensorial stimuli. The quantity p(c|x, a) is the conditional probability of the environmental causes given the sensorial input and the actions. The discrepancy between the estimated probability and the actual probability is given by the Kullback-Leibler divergence Δ_KL. The minimization of F_P in equation (1) optimizes z.

II. TO THE ARTIFICIAL SIDE

A. Convergence–divergence as Autoencoder

In the realm of artificial neural networks, the computational idea that most closely resonates with CDZs is the autoencoder. The idea has been around for a long time: it was the cornerstone of the evolution from shallow to deep neural architectures [13], [14]. More recently, autoencoders have been widely adopted for their ability to capture compact information from high-dimensional data. The basic structure of an autoencoder is composed of a feature-extracting part called the encoder and a decoder part mapping from feature space back into input space. There is a clear correspondence between the encoder and the convergence zone in the CDZ neurocognitive concept, and a similarity between the decoder and the divergence zone.

How exactly, then, can convergence–divergence be achieved inside autoencoders? An interesting approach is one closely related to the transformational abstraction hypothesis described in §I-C: deep convolutional neural networks (DCNNs). They implement the hierarchy of convolutional filtering alternated with nonlinear downsampling, and are considered the essence of transformational abstraction. In addition, there is growing evidence of striking analogies between patterns in DCNN models and patterns of voxels in the brain's visual system. Several studies have successfully related results of deep learning models with the visual system [15], [16], finding reasonable agreement between features computed by DCNN models and fMRI data. Convolutional–deconvolutional autoencoders are therefore a highly biologically plausible implementation of the CDZ theory, at least in the case of visual information.

B. Predictive Brain as Variational Autoencoder

In the last few years there has been renewed interest in the area of Bayesian probabilistic inference in learning models of high-dimensional data. The Bayesian framework, variational inference in particular, has found fertile ground in combination with neural models. Two concurrent and unrelated developments [17], [18] have made this theoretical advance possible, connecting autoencoders and variational inference. This new approach quickly became popular under the term variational autoencoder, and a variety of neural models have been proposed over the years.

The loss function for a variational autoencoder is defined as follows:

L(Θ, Φ | x) = Δ_KL( q_Φ(z|x) ‖ p_Θ(z) ) − E_{z∼q_Φ(z|x)}[ log p_Θ(x|z) ]    (2)

where x is a high-dimensional random variable and z is the representation of the variable in the low-dimensional latent space. Θ and Φ are parameters describing, respectively, the decoder and the encoder of the network. p_Θ is computed by the decoder and represents the desired approximation of the unknown input distribution p, and q_Φ is the auxiliary distribution computed by the encoder, from which z is sampled. E[·] is the expectation operator, and Δ_KL is the Kullback-Leibler divergence.

It is evident how impressively similar this mathematical formulation is to Friston's concept of free energy. Despite this close analogy, the proposers of the variational autoencoder appear either unaware of, or fully uninterested in, this coincidence. This is not so surprising, because mainstream deep learning is driven by engineering goals without any interest in connections with cognition. We believe instead that a strong connection between a well-established cognitive theory and a computational solution greatly argues in favor of adopting such a solution.
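As a concrete illustration of Eqn. (2), a minimal single-sample Monte Carlo version of the loss can be sketched as follows. The diagonal-Gaussian encoder with standard-normal prior (which makes the KL term closed-form), the Bernoulli decoder likelihood, and all names are our illustrative assumptions, not any specific model from the papers cited:

```python
import numpy as np

# Sketch of the VAE loss in Eqn. (2), assuming a diagonal Gaussian
# q_Phi(z|x) = N(mu, exp(logvar)) and a standard normal prior p_Theta(z).
# The expectation is approximated with one reparameterized sample.
def vae_loss(mu, logvar, x, decode, rng):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) )
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    x_hat = decode(z)  # decoder output: Bernoulli means in (0, 1)
    # Monte Carlo estimate of -E[log p_Theta(x|z)] under a Bernoulli likelihood
    rec = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
    return kl + rec
```

The first term mirrors the Δ_KL term of Eqn. (2) and the second the negative expected log-likelihood, which is exactly the structural correspondence with Friston's free energy discussed above.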
III. IMPLEMENTATION

In the previous section we have reviewed several components that match quite closely the relevant neurocognitive theories identified in §I. Our proposed model attempts to weave together these components, aimed at visual perception in autonomous driving agents.

Similarly to the hierarchical arrangement of CDZs in the brain, our model is provided with different processing paths. A first processing path starts from the raw image data and converges up to a low-dimensional representation of visual features. Consequently, the divergent path outputs in the same format as the input image. The other processing path leads to representations that are no longer in terms of visual features, but rather in terms of concepts. As discussed in §I-C, our brain naturally projects sensorial information – especially visual – into conceptual space, where the local perceptual features are pruned and neural activations code the nature of the entities in the environment that produced the stimuli. In the driving context it is not necessary to infer categories for every entity present in the scene; it is useful to project into conceptual space only the objects relevant to the driving task. In the model presented here we choose to consider the two main concepts of cars and lane markings.

As depicted in Fig. 1, the presented variational autoencoder is composed of one shared encoder and three independent decoders. All the components of the architecture are trained jointly. The encoder compresses an RGB image into a compact high-level feature representation. Then the decoders map different parts of the latent space back to separate output spaces: one into the same visual space as the input; the other two into conceptual space, producing binary images containing, respectively, car entities and lane marking entities. So, in our implementation the entire latent vector z represents the visual space, while at the same time two inner segments specifically represent the car and lane concepts. The rationale for this choice is that in mental imagery there is no clear-cut distinction between low-level features and semantic features: the entire scene is mentally reproduced, but with awareness of the salient concepts present in the scene.

Note that the idea of partitioning the latent vector into meaningful components is not new. In the context of processing human heads, the vector has been forced to encode separate representations for viewpoints, lighting conditions and shape variations [19]. In [20] the latent vector is partitioned into one segment for the semantic content and a second segment for the position of the object. Our approach is different. While we keep the two segments for the car and lane concepts disjoint, we fully overlap these two representations within the entire visual space. In this way we adhere entirely to the CDZ principle and try to achieve the full scene by divergence, while at the same time including awareness of the car and lane concepts.

Fig. 1. The architecture of our model. (Diagram: the visual input feeds a shared encoder producing the latent vector z; three decoders diverge from z, outputting the concept of lanes, the reconstruction of the visual data, and the concept of cars.)

IV. RESULTS

We present here a selection of results achieved with an instance of the model described in the previous section. The final architecture is trained for 200 epochs and uses 4 convolutional layers in the encoder, 4 deconvolutional layers for each decoder, and a latent space representation of 128 neurons, of which 16 encode the car concept and another 16 the lane marking concept. We would like to highlight that, since the images fed to the network have dimensions of 256 × 128 × 3 and the latent space dimension is 128, the compression performed by the network is of nearly three orders of magnitude (98 304/128 = 768). This is a considerable achievement compared to other relevant works adopting variational autoencoders [21], [22], which limit the compression of the encoder to only one order of magnitude.

We trained and tested the presented model on the SYNTHIA dataset [23], a large collection of synthetic images representing various urban scenarios. The dataset contains about 100,000 color images (and as many corresponding segmented images, used as ground truth for the conceptual branches of the network). We used 70% of the data for training, 25% for validation and 5% for testing.

Fig. 2 shows the image results produced by our model for a selection of driving scenarios. The images are processed to show at the same time the results in conceptual space and in visual space. The colored overlays highlight the concepts computed by the network: the cyan regions are the output of the car divergent path, and the pink overlays are the output of the lane markings divergent path. Fig. 2 includes a variety of driving situations, going from sunny environments (top rows) to very adverse driving conditions (bottom rows), in which the detection of other vehicles can be challenging even for a human. These results nicely show how the projection of the sensorial input (original frames) into the conceptual representation is very effective in identifying
and preserving the sensible features of cars and lane markings, despite the large variations in lighting and environmental conditions.

Fig. 2. Results of our model for a selection of frames from the SYNTHIA dataset, with different environmental and lighting conditions (each example shows an input frame and the corresponding output).

Lastly, we would like to stress that the purpose of our network is not the mere segmentation of the visual input. The segmentation task is to be considered a support task, used to force the network to learn a more robust latent space representation, which now explicitly takes into account two of the concepts that are fundamental to the driving task.

V. CONCLUSIONS

The model presented here is an attempt to convert into an artificial neural network model the fundamental theories about how the brain processes its sensory inputs to produce purposeful representations. We especially identified the consolidated variational autoencoder architecture as the best candidate for implementing convergence-divergence zone schemes. The reason for constraining a deep learning model on cognitive theoretical grounds, instead of starting from scratch as is often done, derives from the observation of how humans excel in sophisticated sensorimotor control tasks such as driving.

REFERENCES

[1] M. Jeannerod, "Neural simulation of action: A unifying mechanism for motor cognition," NeuroImage, vol. 14, pp. S103–S109, 2001.
[2] G. Hesslow, "The current status of the simulation theory of cognition," Brain Research, vol. 1428, pp. 71–79, 2012.
[3] K. Meyer and A. Damasio, "Convergence and divergence in a neural architecture for recognition and memory," Trends in Neurosciences, vol. 32, pp. 376–382, 2009.
[4] J. S. Olier, E. Barakova, C. Regazzoni, and M. Rauterberg, "Re-framing the characteristics of concepts and their relation to learning and cognition in artificial agents," Cognitive Systems Research, vol. 44, pp. 50–68, 2017.
[5] D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick, "Neuroscience-inspired artificial intelligence," Neuron, vol. 95, pp. 245–258, 2017.
[6] C. Buckner, "Empiricism without magic: transformational abstraction in deep convolutional neural networks," Synthese, vol. 195, pp. 5339–5372, 2018.
[7] D. Hubel and T. Wiesel, "Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex," Journal of Physiology, vol. 160, pp. 106–154, 1962.
[8] C. D. Gilbert and T. N. Wiesel, "Morphology and intracortical projections of functionally characterised neurones in the cat visual cortex," Nature, vol. 280, pp. 120–125, 1979.
[9] D. J. Felleman and D. C. Van Essen, "Distributed hierarchical processing in the primate cerebral cortex," Cerebral Cortex, vol. 1, pp. 1–47, 1991.
[10] D. C. Van Essen, "Organization of visual areas in macaque and human cerebral cortex," in The Visual Neurosciences, L. Chalupa and J. Werner, Eds. Cambridge (MA): MIT Press, 2003.
[11] K. Friston, "The free-energy principle: a unified brain theory?" Nature Reviews Neuroscience, vol. 11, pp. 127–138, 2010.
[12] K. Friston and K. E. Stephan, "Free-energy and the brain," Synthese, vol. 159, pp. 417–458, 2007.
[13] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, pp. 504–507, 2006.
[14] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[15] U. Güçlü and M. A. J. van Gerven, "Unsupervised feature learning improves prediction of human brain activity in response to natural images," PLoS Computational Biology, vol. 10, pp. 1–16, 2014.
[16] B. P. Tripp, "Similarities and differences between stimulus tuning in the inferotemporal visual cortex and convolutional networks," in International Joint Conference on Neural Networks, 2017, pp. 3551–3560.
[17] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proceedings of the International Conference on Learning Representations, 2014.
[18] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," in Proceedings of Machine Learning Research, E. P. Xing and T. Jebara, Eds., 2014, pp. 1278–1286.
[19] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. B. Tenenbaum, "Deep convolutional inverse graphics network," in Advances in Neural Information Processing Systems, 2015, pp. 2539–2547.
[20] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun, "Stacked what-where auto-encoders," in International Conference on Learning Representations, 2016, pp. 1–12.
[21] E. Santana and G. Hotz, "Learning a driving simulator," CoRR, vol. abs/1608.01230, 2016.
[22] D. Ha and J. Schmidhuber, "World models," CoRR, vol. abs/1803.10122, 2018.
[23] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3234–3243.
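As a shape-level illustration of the Fig. 1 architecture above (one shared encoder, a 128-unit latent vector z, and three decoders), the following sketch uses toy dense maps in place of the real convolutional and deconvolutional stacks. The 16-unit car and lane segments follow Sect. IV, but their index positions inside z and the reduced input size are our assumptions:

```python
import numpy as np

# Toy sketch of the shared-encoder / three-decoder layout of Fig. 1.
rng = np.random.default_rng(0)
IMG = 300                     # stand-in for the real 256 x 128 x 3 input
LATENT = 128                  # latent dimension from Sect. IV
CAR, LANE = slice(0, 16), slice(16, 32)   # assumed positions of the segments

W_enc = 0.01 * rng.normal(size=(IMG, LATENT))
W_vis = 0.01 * rng.normal(size=(LATENT, IMG))  # visual decoder reads all of z
W_car = 0.01 * rng.normal(size=(16, IMG))      # car decoder reads its segment
W_lane = 0.01 * rng.normal(size=(16, IMG))     # lane decoder reads its segment

def forward(x):
    z = np.tanh(x @ W_enc)                            # convergence (encoder)
    recon = z @ W_vis                                 # divergence: visual space
    car_map = 1 / (1 + np.exp(-(z[CAR] @ W_car)))     # binary car-concept image
    lane_map = 1 / (1 + np.exp(-(z[LANE] @ W_lane)))  # binary lane-concept image
    return z, recon, car_map, lane_map
```

The point of the sketch is the wiring: the full latent vector drives the visual reconstruction, while the two disjoint segments, which overlap with the visual representation rather than being separated from it, drive the two conceptual outputs.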
Cognitive Wheelchair: A Personal Mobility Platform
Mahendran Subramanian, Suhyung Park, Pavel Orlov, and A. Aldo Faisal
David Windridge and Seyed Ali Ghorashi, Middlesex University, London, UK Member, IEEE
Abstract— By modelling driving as a Perception-Action (PA) hierarchy it is possible to combine high-level symbolic logical reasoning (in particular, the Highway Code applied to hypothetical road configurations) with low-level sub-symbolic processes (specifically, Optimal Control and stochastic machine learning). In this context, we propose a cortical frontal loop analogue for autonomous vehicles in which progressively abstracted bottom-up scene understanding is followed by top-down legal action specification (with progressive contextual grounding), such that final action selection is carried out via a simulated basal ganglia model. Although the top level of the PA-hierarchy employs explicit first-order logical reasoning, we can exploit the duality principle of Hölldobler to generate a functionally equivalent deep neural network such that the PA hierarchy can learn adaptively at all levels.

I. INTRODUCTION

Perception-Action (PA) learning proposes an intrinsic link between the perceptive and active capabilities of an agent (motto: action precedes perception). This may be modelled as an explicit bijection constraint between percept transitions and actions, P × P → A, such that any perceptual redundancy is eliminated in relation to the agent's affordances with respect to the environment.

The notion of a Perception-Action hierarchy further relates the Brooksian notion of action subsumption to this progressive perceptual abstraction via layer-wise application of the PA bijection principle. We can thus model human car-driving as a PA hierarchy, enabling the combination of high-level symbolic logical reasoning (i.e. in relation to the Highway Code) with low-level sub-symbolic processes.

To this end, we here outline a run-time Cortical Frontal Loop analogue in which (progressively abstracted) bottom-up scene representation is followed by top-down (legal) action specification. The top level of the PA-hierarchy, the Logical Reasoning Module (LRM), hence employs explicit first-order logical reasoning in order to compute the full set of equi-legal agent actions (constituting the Herbrand base of the LRM's logical programme) with respect to the current configuration as interpreted via the bottom-up scene understanding.

The PA-hierarchy so constructed utilizes both neural and formal reasoning processes. However, there is a fundamental duality principle suggesting that logic programmes are always capable of neuralization (cf. Hölldobler & Kalinke's equivalence between 3-layer neural networks and logic programmes (LPs)). If the rule-base is hierarchical (as it must be in a PA-hierarchy), then the above equivalence becomes one between the reasoning hierarchy and a functionally equivalent deep neural network. This means that the system can, in principle, be constructed so as to be able to learn adaptively (via back-propagation) at all levels via an end-to-end neuralization of the PA hierarchy.

II. A HIERARCHICAL SENSORIMOTOR CONTROL SYSTEM FOR AUTONOMOUS VEHICLES

The driving agent model proposed here implements a biologically-analogous cortical system for hierarchical sensorimotor control, with physically connected (non-symbolic) bottommost layers and a top-most symbolic subsumption architecture. Long-term (strategic) goals, in particular compliance with the Highway Code, are enacted by the symbolic module. The module acts on the sub-symbolic (physical) layer by specifying desirable target areas, hence biasing low-level action selection. The symbolic module thus steers the behavior of the lowermost (physically-connected) layer, which retains the final authority, vetoing all but the safe tactical maneuvers that are moment-by-moment available (i.e. it can veto incorrect high-level requests).

Our architecture is inspired by the organization of the human brain's visual processing [1-3], with differing cortical loops permitting different agent learning modalities:

The cerebellar loop learns forward/inverse models of the vehicle/environment dynamics (used for motor control and adaptation to differing environments, as well as for embodied simulation to train the dorsal stream to learn the value of novel short-term tactical-level maneuvers).

The dorsal stream has a convergence-divergence organization and learns compact representations of simple events, which are used to construct simple episodes for developing short-term motor strategies (e.g., imagining other road users' possible behaviors and learning collision-avoidance countermeasures).

The symbolic level learns long-term strategies with high-level action selection via reinforcement learning in an episodic simulation context.

Agent evolution is conducted via off-line learning, utilizing wake-dream cycles to replace the various neural network building blocks.

The role of the dorsal stream is hence to recognize actions latent in the environment and to prepare motor plans accordingly [1]. The dorsal stream also has a role in conceptualizing episodes. Both of these capabilities are naturally implemented within our system via an auto-encoded convergence-divergence architecture.

Because of its intrinsically discrete and iterative nature, however, neuralization of the symbolic frontal cortex is less
straightforward. We thus first give a detailed account of this scene-description and annotation may be seen as a process of
loop prior to discussing our symbolic neuralization strategy. bottom-up symbolic abstraction. The two processes are hence
the precise inverse of each other in the LRM’s design.
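The duality referred to above can be made concrete in the propositional case. Below is a toy sketch of our own (the rule and predicate names are invented for illustration, not drawn from the actual Highway Code rule base) of a Hölldobler–Kalinke-style construction: a definite logic program realized as a 3-layer threshold network whose forward pass applies the immediate-consequence operator T_P, iterated to its least fixed point.

```python
# Toy Hölldobler–Kalinke-style neuralization of a propositional logic
# program (illustrative rules only -- not the paper's actual rule base).
RULES = [
    ("overtake_legal", ["gap_in_right_lane", "dashed_lane_marking"]),
    ("gap_in_right_lane", ["right_lane_exists", "right_lane_clear"]),
    ("keep_lane_legal", []),  # a fact: a clause with an empty body
]
ATOMS = sorted({a for head, body in RULES for a in [head] + body})

def t_p(state):
    """One forward pass of the 3-layer network: the hidden (AND) layer
    has one threshold unit per clause, the output (OR) layer one unit
    per atom; already-true atoms are retained."""
    fired = [all(state[a] for a in body) for _, body in RULES]
    return {a: state[a] or any(f and head == a
                               for f, (head, _) in zip(fired, RULES))
            for a in ATOMS}

def least_fixed_point(facts):
    """Iterate the network until no new atoms are derived."""
    state = {a: a in facts for a in ATOMS}
    while (nxt := t_p(state)) != state:
        state = nxt
    return {a for a, v in state.items() if v}

derived = least_fixed_point({"right_lane_exists", "right_lane_clear",
                             "dashed_lane_marking"})
# 'overtake_legal' is now derivable alongside the asserted facts
```

Here each clause corresponds to one AND unit and each atom to one OR unit; for a hierarchical rule base, stacking such layer pairs yields exactly the functionally equivalent deep network described above.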
III. THE BIASING LOOP (FRONTAL CORTEX LOOP)

To implement complex symbolic rule-based behaviors such as legal action-sequence planning (e.g. overtaking), further layers are constructed on top of the dorsal stream that steer the agent's low-level behaviors so as to produce legal action sequences for longer-term goals.

This is specified as a hierarchical PA-subsumption architecture that provides a unified framework for semantically-annotated event logging, for the generation of legal priors for action selection via the basal ganglia (BG) loop, and for high-level motor babbling/top-down dream instantiation (in the offline system).

There is hence a unified architecture within the LRM incorporating a common symbolic/sub-symbolic interface that operates across three distinct symbolic/sub-symbolic information-flow modalities: bottom-up semantic annotation, top-down legal-intention biasing, and top-down dream instantiation.

While the sub-symbolic system operates via goal salience (i.e. defined regions of the motor cortex), which may be learned so as to optimize long-term strategic behaviors, the LRM can only make recommendations (as per the subsumptive 'principle of lower-level veto'), such that the final choice lies with the lowest level of motor control (the dorsal stream). The final authority thus always rests with the dorsal stream's physical loops.

A. The Logical Reasoning Module

The subsumptive Perception-Action hierarchy embodied within the LRM consequently implements the symbolic (i.e. high-level representational) component of the architecture, being responsible for high-level scene interpretation and annotation, and for introducing legal biasing of intention (note that the highway code itself does not generally identify unique actions within a given road context, but rather gives rise to a degenerate, equi-legal set of action possibilities).

The LRM acts via a mixture of theorem proving-via-resolution and functional extrapolation in order to apply the HWC in unfamiliar scenarios, with the former constituting the highest level of PA subsumption. The road configuration is thus represented within the LRM as instantiated logical variables, irrespective of the LRM's operational modality. As indicated, the LRM subsumption framework is constrained to have the capability to act reversibly, that is to say, in a generative manner via reverse PA logical-variable instantiation, such that hallucinated high-level legal road configurations are spontaneously generated alongside the corresponding legal intentionality in the offline dreaming process. The latter (although beyond the scope of this paper) is an instance of top-down exploratory PA motor babbling, in which theorem proving-via-resolution is applied to random instantiations of logical variables in order to establish Herbrand (i.e. logically self-consistent) interpretations, i.e. scenarios consistent with the legal road protocols.

Thus, while the offline dreaming process is one of top-down symbolic grounding through the full PA subsumption architecture, it is conversely the case that run-time high-level scene-description and annotation may be seen as a process of bottom-up symbolic abstraction. The two processes are hence the precise inverse of each other in the LRM's design.

B. PA Subsumption Design Principle Adopted by the LRM

The criterion for the number of levels in the hierarchy is defined by the notions of subsumption and Percept-Action bijection. Application of the PA bijectivity criterion implies that we should, as far as possible, represent only those percepts that distinguish intentional actions on a given layer. This means that each intention must bring about a perceivable change, in such a way that the total set of percepts is minimized with respect to the available actions (affordances), consistent with the highway-code representation of a priori meaningful perceptual objects. In practical terms, application of this principle means that, for example, it is not possible to have two consecutive legal gaps within a lane, since a 'legal gap', in order to exist as a high-level percept, must be distinguishable by a correspondingly legally-definable intention (a legal gap is defined as a potential legal place of relative occupation for the Ego car within a given lane, and as such is not sub-divisible at the highest level of legal intentionality).

The notion of subsumption in the LRM is thus related to the legal sub-structuring of high-level intentionality; in particular, where perceptual targets are fine-grained by sub-intentions, for which the same PA bijectivity condition also applies.

This bijectivity principle also extends to levels below that indicated by the HWC; however, the lowest intentional level defined by the HWC is that of the linearized road metric; this therefore dictates the interface point of the LRM with the rest of the system (equally, this is the symbolic/sub-symbolic cut-off), as indicated in Fig. 1.

Figure 1. Run-time system PA hierarchy (OC refers to the optimal control trajectories existing at the physical layer).

From the hierarchical PA perspective, there are thus two distinct symbolic reasoning layers implicit in the Highway Code (because the HWC explicitly excludes both navigational considerations and motor processes from its remit, which would respectively extend the higher and lower levels of the hierarchy if present). The two levels are: the discrete symbolic level and the logico-linear metric level, as shown in Figure 1. Consequently, legal-intention-related configurations can only be defined in the above terms; they collectively represent the high-level semantic annotation (or equivalently, the high-level scene understanding) brought about by hierarchical PA considerations.
The LRM is therefore architected as two distinct layers (see Figure 1), with a perception/action interface specified between each level at the appropriate level of symbolic abstraction.

C. Interlayer Interface Structure of the LRM

The Highway Code refers both to discrete symbolic entities (cars, lanes, signs, gaps, etc.) and to linearized-metric entities, i.e. metric entities expressed in terms of distance-to/time-to and distance-from/time-from other entities, described in relation to the Ego car.

At the high-level node, lane-wise road configurations are characterized in the LRM via a logical-list format: ordered in-lane lists of cars and gaps, with (the equivalent of) predicatized assertions as to which cars/gaps are legally adjacent to which others.

At the immediately lower level of the LRM, the (symbolic/sub-symbolic) node is characterized by annotated metric bounding boxes relating to legal transitions, produced by a two-stage process corresponding to the two stages of subsumption at the apex of the PA hierarchy listed above (the annotation aspect of the metric bounding boxes thus correlates with their high-level representation, illustrating the progressively grounded nature of symbols generated in a PA hierarchy).

Contextual metric information (distances to, and velocities of, other cars) received from the agent is hence converted into a non-metrical list of cars and gaps by means of linear extrapolation according to the HWC protocols (i.e. assuming constant speeds and legally-specified reaction times). This list is passed to the second level of the LRM as the equivalent of a declaratively-enacted predicate script, from which a set of high-level legal intentions with uniform priors is generated (they are uniform since road protocols do not distinguish between legal intentional possibilities a priori).

D. Bottom-up Communication from the Pre-LRM Layer

The bottom-up semantic annotation function of the run-time system thus involves communication through the various levels of the LRM in the form of abstractions of the perceptual data, consistent with the outlined notion of perceptual subsumption:

At the symbolic/sub-symbolic interface layer (Linearized Metric Layer), geometric details such as the exact shapes of the lanes are hence discarded, while the topology and linear distances (constituting a higher-level legal-symbolic parametrization) are retained. The speeds and distances of individual objects in relation to the Ego car, together with road-configuration information in the form of lane numbering, width, lane-marker types (e.g. whether lane change is allowed), etc., are passed to the LRM.

The net result of the bottom-up communication of the road configuration, after processing by the logical-reasoning system, is thus a high-level symbolic representation of both the legal status of, and the legal possibilities with respect to, the current road configuration. This hence constitutes a semantic annotation of the road situation described with respect to a (legal) intentional frame, or equivalently the high-level scene interpretation.

E. Top-down Communication from the LRM (Legal Intention Grounding)

The logic-symbolic reasoning process, as well as providing the high-level interpretation of the road circumstances indicated above, also serves to provide a full set of Herbrand (i.e. logically self-consistent) interpretations of the future legal action possibilities (for example, whether it is legal to change lane in the current context).

These Herbrand sets are then grounded, i.e. propagated downwards (as instantiated hierarchical variables in the run-time mode) through the perception-action subsumption hierarchy, so that at the point of interface they manifest as a set of binary saliency indicators attached to legally-designated areas in the linearized metric space.

The top-down communication from the LRM thus takes the form of metric bounding boxes augmented by discrete legal saliency indicators, which are used to directly compute the motor-cortex biasing matrix. This annotation thus simultaneously satisfies the requirements of perception-action bijectivity and legal self-consistency; in particular, bijectivity allows the bounding-box annotation to be directly interpretable at the motor cortex in regard to action selection.

Note that the top-down LRM logical annotation process is exhaustive, with a complete Herbrand interpretation of the scene generated as the annotation output (this is a natural consequence of the logic program being applied recursively until an inferential fixed point is arrived at). This means that, in the event of incomplete input data, the system generates a full range of self-consistent 'completion' sets, which are effectively the equivalent of equally-weighted 'possible worlds' (in the modal-logic sense) consistent with the input, composed of alternative groundings of predicate variables with the available constants.

F. Dreaming Initiation via Top-down Communication of Legal-Perceptual Priors (LRM Percept-Motor Babbling)

As a corollary, where no input is given to the LRM, there are no grounded logical road-configuration variables asserted at the symbolic/sub-symbolic interface. In principle, this allows the LRM to initiate offline learning via a dreaming process (i.e. high-level percept-action babbling) without any modification of the system's subsumptive structure; exactly the same mechanism for legal biasing can be utilized for dreaming, since the Herbrand fixed points in the absence of any assertion as to road configuration (i.e. no assertions relating either to road topology or to vehicle traffic using that topology) are simply a uniform set of possible worlds consistent with the legal constraints on road configurations in general (the LRM's logical axioms necessarily have only a nominal distinction between intentional rules and environmental-consistency rules).
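The exhaustive-completion behaviour just described can be caricatured in a few lines (a deliberately tiny sketch of our own; the slot names and the single consistency rule are invented for illustration):

```python
# Tiny model of the LRM's Herbrand 'completion' sets (illustrative names
# and constraint only).  Partial assertions about which legal gap each
# car occupies are extended to every logically self-consistent total
# grounding -- the equally-weighted 'possible worlds' of the text.  With
# no assertions at all (the dreaming case), the full uniform set of
# legal worlds is generated.
from itertools import product

SLOTS = ("lane1_gap", "lane2_gap_front", "lane2_gap_rear")  # legal gaps
CARS = ("ego", "other")

def consistent(world):
    # environmental-consistency rule: two cars cannot occupy one gap
    return len(set(world.values())) == len(world)

def completions(asserted):
    """asserted: partial {car: slot} dict.  Returns every consistent
    total grounding extending it."""
    worlds = []
    for slots in product(SLOTS, repeat=len(CARS)):
        world = dict(zip(CARS, slots))
        if consistent(world) and all(world[c] == s
                                     for c, s in asserted.items()):
            worlds.append(world)
    return worlds

print(len(completions({"other": "lane2_gap_front"})))  # 2: ego's options
print(len(completions({})))                            # 6: dreaming case
```

The real LRM reaches the analogous sets by recursive rule application to an inferential fixed point rather than by brute-force enumeration, but the uniform possible-worlds semantics is the same.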
G. Action Selection Loops

Action selection within the system is consequently organized in a hierarchical fashion. There are two distinct action-selection modules, acting at the symbolic (the LRM) and sub-symbolic (physical) levels of description. The levels are differentiated firstly by their differing inputs, and secondly by the differing timescales over which their decisions are made.

The higher-level action-selection loop takes the outputs of the logical reasoning module (LRM) as its inputs. The high-level action-selection module first assigns, at each time step, scalar weights representing the "desirability" of each of the LRM's bounding boxes. These weights are learned, thereby enabling the agent to learn long-term strategies. Once the high-level action-selection loop has concluded its decision-making process, the conclusion can be passed down to the lower levels.

Neurally-inspired action selection within the agent hence takes the form of a computational model of the basal ganglia. In particular, it has been demonstrated that the basal ganglia could be performing a form of action selection known as multi-hypothesis sequential probability ratio testing (MSPRT) [6]. This algorithm sums evidence for each action over time, and finds the log likelihood that each channel is drawn from a distribution with a higher mean than the other channels. Once the log likelihood crosses a threshold, the action becomes selected. The threshold has to be tuned such that some predetermined error rate is permitted. Subject to a few assumptions, the algorithm can be shown to be optimal in decision time, given a particular error rate.
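A minimal MSPRT sketch (our own illustration, with arbitrary parameter values; the neural basal-ganglia implementation of [6] is considerably richer): each channel accumulates evidence, and a channel is selected once its total exceeds the log-sum-exp of the competing channels by a tunable margin.

```python
# Minimal multi-hypothesis SPRT sketch (illustrative parameters only).
import math
import random

def msprt(sample_channels, n_channels, threshold=3.0, max_steps=100_000):
    """sample_channels() yields one evidence sample per channel per step.
    Returns (selected_channel, decision_step)."""
    y = [0.0] * n_channels                     # summed evidence per channel
    for step in range(1, max_steps + 1):
        for i, x in enumerate(sample_channels()):
            y[i] += x
        for i in range(n_channels):            # test each channel's margin
            others = [y[j] for j in range(n_channels) if j != i]
            m = max(others)                    # stable log-sum-exp
            lse = m + math.log(sum(math.exp(v - m) for v in others))
            if y[i] - lse > threshold:         # margin over competitors
                return i, step
    return max(range(n_channels), key=y.__getitem__), max_steps

# Noise-free illustration: channel 2 gains 0.4 evidence per step and is
# selected at step 10, once its margin over the others exceeds 3.0.
assert msprt(lambda: [0.0, 0.0, 0.4], 3) == (2, 10)

random.seed(0)
noisy = lambda: [random.gauss(mu, 1.0) for mu in (0.0, 0.0, 0.5)]
action, steps = msprt(noisy, 3)  # typically channel 2, after some tens of steps
```

Raising the threshold trades decision time for a lower error rate, which is the tuning described in the text.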
The MSPRT process is readily neuralizable, and so amenable to back-propagative learning in relation to the system as a whole. In order to produce a fully end-to-end neuralization of the PA hierarchy it thus only remains to neuralize the LRM.

IV. NEURALIZATION OF THE LRM

The direct translation of logic programs into artificial neural networks has a relatively long history. A standard approach to the neuralization of Horn clauses, using a local representation in which each (ground) atom corresponds to a single dedicated neuron, is exemplified by the Knowledge-Based Artificial Neural Network (KBANN) of Towell & Shavlik [7].

Networks of this type have been criticized as having a "propositional fixation": a finite neural network can represent only a finite number of ground atoms, and can therefore represent a logic program only for a finite base. A language with first-order syntax but only a finite alphabet of symbols is equivalent to propositional logic, because any universally quantified ("for all X") clause can be translated into finitely many propositional clauses of the same form, one for each possible value of X. Thus, networks of the KBANN type cannot implement "true" first-order logic programs, only a finite fragment of first-order logic.

Hölldobler et al. [8] give a neural method (in fact a precise duality) that replicates the immediate-consequence operator T_P of a true first-order logic program, to a desired degree of accuracy, in a real embedding. However, this may involve a thousand or a million copies of a clause, one for each possible grounding of a variable X.

In neuralizing a logic program it is thus desirable to maintain the concept that there is a universal rule, rather than a thousand unconnected rules, each of which might be only very weakly evidentially supported on its own and which jointly involve a combinatorial explosion in the number of neurons.

We thus construct a "neural logic programme parser" that simplifies the LRM logical programmes via a three-fold strategy:

1. Appropriate thresholding considerations w.r.t. single-predicate clauses, which potentially reduce the mid-layer neuronal budget by orders of magnitude (and also naturally give rise to a more pyramidal, CNN-like hierarchy).

2. Assertion of facts, which can be accommodated straightforwardly to reduce the input-layer size.

3. Explicit observance of rule subsumption.

Applying these three strategies very naturally results in a deep neural network structure, moreover one for which the layers intrinsically form a PA-hierarchy (since the underlying LRM logical rule base is constructed so as to respect PA bijectivity). Furthermore, there is intrinsic parameter-sharing amongst certain of the resulting network's weights. Consequently, during training, all deep-learning tools appropriate for back-propagation within convolutional neural networks can be applied. Critically, FOL syntax is retained during training, irrespective of the network architecture of the sub-symbolic levels. We thus obtain an end-to-end trainable deep network for implementing a perception-action hierarchy-based frontal cortex model within an autonomous driving context.

Acknowledgment: This work was supported by the EU H2020 project Dreams4Cars, grant number 731593.

REFERENCES

[1] P. Cisek, "Cortical mechanisms of action selection: the affordance competition hypothesis," Philos. Trans. R. Soc. B Biol. Sci., vol. 362, no. 1485, pp. 1585–1599, Sep. 2007.
[2] G. Pezzulo and P. Cisek, "Navigating the Affordance Landscape: Feedback Control as a Process Model of Behavior and Cognition," Trends Cogn. Sci., vol. 20, no. 6, pp. 414–424, Jun. 2016.
[3] K. Meyer and A. Damasio, "Convergence and divergence in a neural architecture for recognition and memory," Trends Neurosci., vol. 32, no. 7, pp. 376–382, Jul. 2009.
[4] D. Windridge, "Emergent Intentionality in Perception-Action Subsumption Hierarchies," Front. Robot. AI, vol. 4, Aug. 2017.
[5] D. Windridge and S. Thill, "Representational fluidity in embodied (artificial) cognition," Biosystems, vol. 172, pp. 9–17, Oct. 2018.
[6] R. Bogacz and K. Gurney, "The basal ganglia and cortex implement optimal decision making between alternative actions," Neural Comput., vol. 19, no. 2, pp. 442–477, 2007.
[7] G. Towell and J. W. Shavlik, "Knowledge-based artificial neural networks," Artificial Intelligence, vol. 70, no. 4, 1994.
[8] S. Hölldobler, F. Kurfess, and H.-P. Störr, "Approximating the semantics of logic programs by recurrent neural networks," Applied Intelligence, vol. 11, no. 1.
Following Social Groups: Socially Compliant Autonomous
Navigation in Dense Crowds
Xinjie Yao1, Ji Zhang2 and Jean Oh2
Abstract— In densely populated environments, socially compliant navigation is critical for autonomous robots, as driving close to people is unavoidable. This manner of social navigation is challenging given the constraints of human comfort and social rules. Traditional methods based on hand-crafted cost functions to achieve this task have difficulty operating in the complex real world. Other learning-based approaches fail to address the naturalness aspect from the perspective of collective formation behaviors. We present an autonomous navigation system capable of operating in dense crowds and utilizing information about social groups. The underlying system incorporates a deep neural network to track social groups and join the flow of a social group to facilitate navigation. A collision-avoidance layer in the system further ensures navigation safety. In experiments, our method generates socially compliant behaviors on par with state-of-the-art methods. More importantly, the system is capable of navigating safely in a densely populated area (10+ people in a 10 m × 20 m area), following crowd flows to reach the goal.

I. INTRODUCTION

The ability to safely navigate in populated scenes, e.g. airports, shopping malls, and social events, is essential for autonomous robots. The difficulty comes from the fact that people walk close to the robot, cutting in front of it or between the robot and the goal point. The safety margin for the robot to drive in crowded scenes is pushed to the minimum. In such a case, the navigation system has to trade off between driving safely close to people and reaching the goal quickly. Furthermore, a previous study of socially compliant navigation [1] states three aspects of robot behavior: comfort, as the absence of annoyance and stress for humans interacting with robots; naturalness, as the similarity between robot and human behaviors; and sociability, as abiding by general cultural conventions. Among these three aspects, the first essentially reflects the safety of the navigation.

Previous studies on socially compliant navigation attempt to solve the problem with various methods, including data-driven approaches for human trajectory prediction [2], [3], and potential field-based [4] and social force model-based [5] approaches. In particular, reinforcement learning-based methods use reward functions to penalize improper robot behaviors, eliminating the causes of discomfort [6], [7]. Inverse reinforcement learning-based methods learn from expert demonstrations [8]. These methods are hard to generalize, due to the fact that a large set of comprehensive expert demonstrations is hard to acquire.

The study of this paper is based on our previous work, which uses deep learning to solve the socially compliant navigation problem [9]. This paper extends that work in two ways. First, we consider the finding from a previous study [10] that 70% of people walk in social groups. Crowd behavior can be summarized as flows of social groups, and humans tend to move along the flow. It is our understanding that the behavior of joining a flow that shares a similar heading direction is more socially compliant, causing fewer collisions and disturbances to surrounding pedestrians. Our method recognizes social groups and selects the flow to follow. Second, we ensure safety with a multi-layer navigation system. In this system, a deep learning-based global planning layer makes high-level socially compliant behavioral decisions while a geometry-based local planning layer handles collision avoidance at a low level.

The paper is further related to previous work on modeling aggregate interactions among social groups [10] and leveraging learned social relations in tracking group formations [11]. Our main contributions are a deep learning-based method for socially compliant navigation with an emphasis on tracking and joining the crowd flow, and an overall system integrated with the deep learning method capable of safe autonomous navigation in dense crowds.

II. METHOD

A. System Overview

Fig. 1 gives an overview of the autonomous navigation system, which consists of three subsystems as follows.
• State Estimation Subsystem involves a multi-layer data-processing pipeline which leverages lidar, vision, and inertial sensing [12]. The subsystem computes the 6-DOF pose of the vehicle as well as registering laser-scan data with the computed pose.
• Local Planning Subsystem is a low-level planning subsystem in charge of obstacle avoidance in the vicinity of