

Preface
The 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019)
is the flagship international conference in robotics and intelligent systems. It is co-sponsored
by the IEEE, the IEEE Robotics and Automation Society (RAS), the IEEE Industrial Elec-
tronics Society (IES), the Robotics Society of Japan (RSJ), the Society of Instruments and
Control Engineers (SICE), and the New Technology Foundation (NTF). IEEE is a non-profit,
technical professional association of more than 400,000 members in 160 countries. It is a lead-
ing authority in technical areas ranging from computer engineering, biomedical technology and
telecommunications, to electric power, aerospace and consumer electronics, among others.
This volume contains the papers presented at the workshop "TCV2019: Towards Cognitive
Vehicles: perception, learning and decision making under real-world constraints. Is bio-
inspiration helpful?", held on November 8, 2019 at IROS.
Objectives of the workshop:
Autonomous driving is only one out of many aspects of intelligence required for future trans-
portation systems. Human-machine interaction in a cognitive vehicle is an intriguing use case
that requires intelligence beyond the state of the art in machine learning, computer vision, and
AI. For safe and convenient human-machine interaction, an intelligent system such as a smart
vehicle needs to be able to perceive its environment and make decisions based on the received
data. Current state-of-the-art approaches to both intelligent perception and decision making
typically rely on machine learning with offline training of neural networks using elaborated
datasets. To enable truly adaptive intelligence, as we know it from biological systems, learning
that supports decision making and perception needs to happen in real time, in an online fashion.
But can such adaptive perceiving, deciding, and learning systems be safe enough to actually be
deployed in an intelligent vehicle?

While biological inspiration has led to some of the most successful approaches in perception
and machine learning (deep neural networks), its deployment in real-world, safety-critical
settings is still limited. We aim to explore and critically discuss what biological inspiration in
perception, learning, and decision making could bring in the future for increasing the intelligence
of vehicles and other robotic systems.
Thus, the aim of the workshop is to discuss potential benefits and pitfalls in applying bio-
inspired approaches when developing intelligent real-world systems that perceive, interact, learn,
and make decisions. We will focus on the application area of intelligent, "cognitive" vehicles and
will use an unconventional format: for each of three subtopics we invited 2-4 experts from dif-
ferent schools of thought (for example, traditional machine learning and brain-inspired learning,
conventional approaches to planning and decision making and cognitive architecture-based ap-
proaches, event-based bio-inspired vision and conventional machine vision, etc.). Each speaker
will give a short introductory talk followed by a moderated panel discussion around each topic.
Furthermore, we will invite researchers from intelligent robotics and vehicles with a focus on
perception, learning and decision making to present their work in posters and short spotlight
talks.

The workshop will stimulate discussion of the role of biological inspiration in the develop-
ment of future AI systems in the context of real-world, safety-critical applications of robotic
systems in environments shared with humans.
Topics of interest:
Applications


• Intelligent vehicles (cars, UAVs, ...)


• Human-machine interaction
• Intelligence in the cockpit

Perception

• Robust, accountable and scalable perception, with and without neural networks
• Multi-modal perception and sensory integration
• Attention and cognitive control in visual and tactile perception

• Gesture recognition
• Perception for action

Learning

• Machine learning for vehicles


• Fast inference and learning
• Online learning and reliability
• Embedded machine learning

• Learning in complex hierarchical control systems

Cognitive Architectures

• Cognitive architectures and machine learning / neuronal networks

• Cognitive architectures for action selection


• Scalable cognitive architectures
• Learning cognitive architectures

We would like to thank the technical committee for cognitive robotics of the IEEE Robotics
& Automation Society, NEUROTECH, BMW Group and BOSCH for their support.

November 8, 2019
Macau

Florian Mirus
Mohsen Kaboli
Yulia Sandamirskaya
Nicolai Waniek

Table of Contents
Enhancing Object Detection in Adverse Conditions using Thermal Imaging . . . . . . . . . . . . . . . 1
Kshitij Agrawal and Anbumani Subramanian
Exploration for Objects Labelling Guided by Environment Semantics using UAVs . . . . . . . . 4
Reem Ashour, Tarek Taha, Jorge Dias, Lakmal Seneviratne and Nawaf Almousa
Towards game theoretic AV controllers: measuring pedestrian behaviour in Virtual
Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Fanta Camara, Patrick Dickinson, Natasha Merat and Charles W. Fox
MSPRT action selection model for bio-inspired autonomous driving and intention
prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Riccardo Donà, Gastone Pietro Rosati Papini and Giammarco Valenti
A dynamic neural model for endowing intelligent cars with the ability to learn driver
routines: where to go, when to arrive and how long to stay there? . . . . . . . . . . . . . . . . . . . . . . . . 15
Flora Ferreira, Weronika Wojtak, Wolfram Erlhagen, Paulo Vicente, Ankit Patel,
Sérgio Monteiro and Estela Bicho
Towards an Evaluation Methodology for the Environment Perception of Automotive
Sensor Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Maike Hartstern, Viktor Rack and Wilhelm Stork
Risk-Aware Reasoning for Autonomous Vehicles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Majid Khonji, Jorge Dias and Lakmal Seneviratne
Cognitively-inspired episodic imagination for self-driving vehicles . . . . . . . . . . . . . . . . . . . . . . . . . 28
Sara Mahmoud, Henrik Svensson and Serge Thill
A Cognitively Informed Perception Model for Driving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Alice Plebe and Mauro Da Lio
Cognitive Wheelchair: A Personal Mobility Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Mahendran Subramanian, Suhyung Park, Pavel Orlov and Aldo Faisal
A Frontal Cortical Loop For Autonomous Vehicles Using Neuralized Perception-Action
Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
David Windridge and Seyed Ali Ghorashi
Following Social Groups: Socially-Compliant Autonomous Navigation in Dense Crowds . . . 44
Xinjie Yao, Ji Zhang and Jean Oh


Author Index

Agrawal, Kshitij 1
Almousa, Nawaf 4
Ashour, Reem 4

Bicho, Estela 15

Camara, Fanta 7

Da Lio, Mauro 32
Dias, Jorge 4, 23
Dickinson, Patrick 7
Donà, Riccardo 11

Erlhagen, Wolfram 15

Faisal, Aldo 36
Ferreira, Flora 15
Fox, Charles W. 7

Ghorashi, Seyed Ali 40

Hartstern, Maike 19

Khonji, Majid 23

Mahmoud, Sara 28
Merat, Natasha 7
Monteiro, Sérgio 15

Oh, Jean 44
Orlov, Pavel 36

Park, Suhyung 36
Patel, Ankit 15
Plebe, Alice 32

Rack, Viktor 19
Rosati Papini, Gastone Pietro 11

Seneviratne, Lakmal 4, 23
Stork, Wilhelm 19
Subramanian, Anbumani 1
Subramanian, Mahendran 36
Svensson, Henrik 28


Taha, Tarek 4
Thill, Serge 28

Valenti, Giammarco 11
Vicente, Paulo 15

Windridge, David 40
Wojtak, Weronika 15

Yao, Xinjie 44

Zhang, Ji 44


Keyword Index

3D reconstruction 4

action selection problem 11


affordance competition 11
Artificial Cognition 28
autonomous driving 11, 32
Autonomous driving robot 44
Autonomous Exploration 4
Autonomous Vehicles 7, 28
autonomous vehicles 23
Autonomous Wheelchair 36

bio-inspired cognitive models 11

chance constraints 23
Cognitive architectures and neural networks 15
Cognitive architectures for action selection 7
Cognitive Vehicle 36
Cognitive Vehicles 40
convergence-divergence zones 32
Cost function 4

deep learning 32
Deep Learning 40
Deep Reinforcement Learning 28
Driver routines 15

Episodic Imagination 28
external vehicle sensors 19
Eye Gaze 36

Fast inference and learning 15


faster-rcnn 1
First Order Logic 40
flir 1
free energy 32

Game Theory 7
Gaze tracking 36

Intelligent vehicles (cars) 15


Intention Decoding 36

MSPRT 11


object detection 1
Online learning 15

Pedestrian Behaviour 7
perception 19
Perception-Action Learning 40

risk-aware planning 23

Safety in dense crowds 44


Semantic Exploration and Mapping 4
Semi-autonomous Wheelchair 36
sensor performance 19
Severely Disabled 36
Simulation 28
simulation 19
Social navigation 44

thermal imaging 1

Utility function 4

variational autoencoder 32
vehicle sensor setup configuration 19
virtual testing 19

WTA 11


Program Committee

Mohsen Kaboli Institute for Cognitive Systems


Florian Mirus BMW
Yulia Sandamirskaya University of Zurich
Nicolai Waniek Bosch Center for Artificial Intelligence

Enhancing Object Detection in Adverse Conditions using Thermal
Imaging
Anonymous submission

Abstract— Autonomous driving relies on deriving an understanding of objects and scenes
through images. These images are often captured by sensors in the visible spectrum. For
improved detection capabilities we propose the use of thermal sensors to augment the vision
capabilities of an autonomous vehicle. In this paper, we present our investigations on the
fusion of visible and thermal spectrum images using a publicly available dataset, and use it
to analyze the performance of object recognition on other known driving datasets. We present
a comparison of object detection in night-time imagery and qualitatively demonstrate that
thermal images significantly improve detection accuracies.

I. INTRODUCTION

Object detection is one of the primary components of scene understanding in an autonomous
vehicle. The detected objects are used to plan the trajectory of the vehicle. Cameras are used
to capture images of the environment, which are then input to a myriad of computer vision
tasks, including object detection.

While significant progress has been achieved in using the visible spectrum for object detection
algorithms, it poses inherent limitations due to the response of cameras in the visible spectrum.
Some of the shortcomings include low dynamic range, slow exposure adjustment and
inefficiencies in high-contrast scenes, while also being subject to weather conditions like fog
and rain. Bio-inspired vision, like infra-red based thermal vision, could be an effective tool to
compensate for the shortcomings of imagers that operate in the visible spectrum.

Other sensing modalities like LIDAR-based systems are sufficient to detect depth in a scene.
However, the data may be too coarse to detect objects at further distances and may lack the
resolution to classify objects. Thermal imagers, on the other hand, can easily visualize objects
that emit infra-red radiation due to their inherent heat. Due to this property, thermal imagers
can visualize important participants on the road, like people, cars and animals, at any time of
the day. Augmenting the detection of objects with the thermal spectrum could be a good way
to enable robust object detection for safety-critical systems like autonomous vehicles.

Object detection methods have progressed significantly over the years, from simple
contour-based methods using support vector machines (SVM) [1-7] to deep classification
models [16]-[20] that utilize hierarchical representations of data. Data-driven models are the
flavor of the day, dominating the detection benchmarks on large-scale datasets like PASCAL
VOC [8] and COCO [9].

There is a large body of work on recognizing and localizing objects in the visible spectrum,
for objects like people [13, 14], vehicles [10] and traffic lights. The features extracted from an
image can help identify an object in good lighting and normal weather conditions. However,
images obtained using camera systems in low-light conditions (night, dusk and dawn) and
adverse weather conditions (rain and snow) contain partially illuminated objects, low contrast
and low information content. Such images are often difficult for object detection algorithms.

The primary contribution of our work is to investigate the nature of object detectors in the
thermal spectrum in driving scenarios for autonomous navigation. We utilize the FLIR ADAS
[11] dataset, which consists of annotated thermal images and time-synchronized visible images.
Datasets like KAIST [12] exist for a similar purpose; however, they are limited to annotations
of people only.

The next sections are organized as follows: in Section 2 we cover related research; in Section 3
we describe the datasets, the generation of ground truth for the visible and thermal pairs in
the FLIR ADAS dataset, and the setup of our experiment. In Section 4 we present our results,
followed by conclusions in Section 5.

II. RELATED WORK

Object detection consists of the recognition and localization of object boundaries within an
image. Early work in the computer vision field focused on building task-based classifiers using
specific image properties. In some of the earlier approaches a sliding window is used to classify
parts of an image based on feature pyramids [15]; histograms of oriented gradients (HoG)
combined with an SVM have been used to classify pedestrians [13]; and pools of Haar features
[14] have been employed for face detection.

A more generalized form of object detection has evolved over the years due to advances in
deep learning. The exhaustive search for classification has been replaced by convolutional
classifiers. Object detection models have been proposed that work with relatively good accuracy
on the visible spectrum, consisting of either a) a two-stage system, a classifier connected to a
region proposal network, such as RCNN [16], Fast-RCNN [17] and Faster-RCNN [18], or b) a
single-stage network with the classification and localization layers in a cohesive space, like
YOLO [19] and SSD [20].

Models trained on large-scale datasets are known to perform quite well. With driving datasets
like KITTI [21] and Cityscapes [22], object detection models have been employed to detect
pedestrians, cars and bicycles.

Some work has been done on the detection of objects in thermal images [23-26], especially
focusing on human detection. Since some of this work is from static cameras,
the proposals can be generated from background subtraction techniques in the thermal domain
[26]. However, most of this work does not investigate the effect of multiple day and night
conditions across the thermal and visible spectrum in driving scenarios.

III. DATASETS

In this section we detail the datasets that we utilize in our study and the process we employed
to create a baseline for training the Faster-RCNN model.

A. FLIR ADAS Dataset

The FLIR thermal imaging dataset is acquired via an RGB and a thermal camera mounted on
a vehicle, with annotations created for 14,452 thermal images. It is primarily captured on
streets and highways in Santa Barbara, California, USA from November to May, with clear-sky
conditions at both day and night. Annotations exist for the thermal images based on the COCO
annotation scheme. However, no annotations exist for the corresponding visible images.

To analyse the night-time performance of object detection it was essential to have corresponding
annotated images in the visible spectrum in the day and night scenarios. We built a custom
point-based correspondence generator and utilized an 8-point homography method to generate a
correspondence from the thermal to the visible spectrum. Using such methods we are able to
translate the annotations to the visible space as well, resulting in about 8000 training and
1247 validation images with a 42-58 split in night vs. day. In the rest of our work we refer to
this translated dataset as the FLIR RGB dataset. Fig. 1 shows the translation of bounding
boxes from the thermal images to the corresponding registered image in the RGB domain. The
input images in the FLIR dataset are uncorrected, and slight radial distortions due to the lens
can be observed. The drawback of our technique is that points closer to the center register
well, whereas points radially distant from the center do not align well.

Fig. 1. Annotated and RGB-translated pairs from the FLIR ADAS dataset.

TABLE I: Scheme showing the mapping of labels

FLIR     | IDD                                    | KITTI
Person   | Person, Rider                          | Pedestrian, Cyclist
Car      | Car, Caravan, Autorickshaw             | Car
Bicycle  | Bicycle, Motorcycle                    | -
Dog      | Animal                                 | -
-        | Bus, Trailer, Truck, Vehicle fallback  | Truck
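The annotation-translation step described above can be illustrated with a minimal sketch: fit a
planar homography from a handful of thermal-to-RGB correspondence points and warp the thermal
box corners into the RGB frame. The point coordinates and the example box below are hypothetical
placeholders, and the snippet is only an approximation of the authors' custom correspondence
generator, not their exact procedure.

```python
import numpy as np
import cv2

# Manually picked correspondences between the thermal and RGB frames
# (hypothetical values; at least 4 well-spread points are needed).
thermal_pts = np.float32([[102, 80], [540, 95], [520, 430], [90, 420]])
rgb_pts     = np.float32([[150, 118], [610, 130], [585, 500], [140, 488]])

# Estimate the thermal -> RGB homography from the correspondences.
H, _ = cv2.findHomography(thermal_pts, rgb_pts)

def translate_box(box, H):
    """Map a thermal-domain box (x1, y1, x2, y2) into the RGB image."""
    x1, y1, x2, y2 = box
    corners = np.float32([[x1, y1], [x2, y1], [x2, y2], [x1, y2]]).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
    # Re-axis-align the warped quadrilateral to obtain a bounding box again.
    xs, ys = warped[:, 0], warped[:, 1]
    return float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())

print(translate_box((200, 150, 260, 300), H))
```

As noted in the text, a single global homography registers points near the image centre well but
leaves residual error towards the borders of the uncorrected FLIR images.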

B. Indian Driving Dataset

The India Driving Dataset [27] consists of images taken in driving conditions in city and
highway situations, primarily during the day. It is unique in the 26 classes that it proposes and
in the high number of objects in each scene. We pick the common traffic participants that also
exist in the FLIR dataset and translate them to similar labels. Table 1 shows the translation
mechanism.

C. KITTI

The KITTI object detection dataset consists of day-time images captured in urban and highway
driving conditions in Karlsruhe, Germany. Again, classes corresponding to the FLIR dataset are
chosen and translated. A detailed translation can be seen in Table 1.

IV. EXPERIMENT & RESULTS

The Faster-RCNN implementation from Ren et al. [18] was used to train the model on three
datasets: FLIR thermal (FLIR THM), IDD and KITTI. The Faster-RCNN model used a ResNet-101
for high-level feature extraction, and the complete model is initialized from pre-trained COCO
weights. The model is trained on each dataset until convergence, for about 180,000 iterations.
We present the results of each baseline model, tested on a validation dataset from the same
domain, in Table 2.
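A minimal sketch of such a training setup, using torchvision's reference Faster R-CNN, is shown
below. Two assumptions to note: torchvision's stock detector uses a ResNet-50 FPN backbone
rather than the ResNet-101 reported above, and the single synthetic image/target pair merely
stands in for batches drawn from the FLIR THM, IDD or KITTI loaders.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# COCO-pretrained detector (ResNet-50 FPN backbone in torchvision's reference model).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Replace the box predictor head for the FLIR classes (+ background).
num_classes = 5  # background, person, car, bicycle, dog
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# One dummy training step on a synthetic image/annotation pair.
images = [torch.rand(3, 512, 640)]
targets = [{"boxes": torch.tensor([[100.0, 120.0, 200.0, 300.0]]),
            "labels": torch.tensor([1])}]

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()
loss_dict = model(images, targets)   # classification + box regression losses
loss = sum(loss_dict.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```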
In the first part of our study the trained model performance is tested on the night-time images
(653 out of 1247) from the translated FLIR RGB dataset. Table 3 shows that the performance of
models trained in the visible spectrum degrades significantly on the night images from the FLIR
RGB. We can also see that training on FLIR thermal does not translate well to the visible domain,
with a drop of 40% from the baseline inference on the FLIR thermal dataset. Thus training in the
thermal domain does not improve performance at night time on the same dataset. Training on IDD
retains the highest performance because of the better correlation to road scenes from the IDD in
day and night conditions.

TABLE II: Average precision per class for dataset combinations tested on night-time images from
the FLIR RGB translated dataset

Train Dataset | Test Dataset | Bicycle | Car   | Dog   | Person | mAP
FLIR THM      | FLIR RGB     | 0.1312  | 0.571 | 0     | 0.245  | 0.237
IDD           | FLIR RGB     | 0.3314  | 0.625 | 0.042 | 0.365  | 0.341
IDD+FLIR THM  | FLIR RGB     | 0.1319  | 0.570 | 0     | 0.260  | 0.240
KITTI         | FLIR RGB     | -       | 0     | -     | 0.403  | 0.201
KITTI         | FLIR THM     | -       | 0     | -     | 0.141  | 0.070
KITTI         | KITTI        | -       | 0.970 | -     | 0.899  | 0.935

We conduct another evaluation, of domain-transfer performance, by introducing the large-scale
driving datasets into the training. The trained models are tested in the thermal and visible
domains for performance gains. We observe a significant drop in performance when testing the IDD
and KITTI models on FLIR thermal images: a 2.6x drop and a 13x drop, respectively. This shows
that a model trained in the visible domain does not infer well in another domain due to the
inherent difference in visual representations. In the case of inference on the RGB domain itself
we observe drops of 1.6x and 6.2x respectively from the baseline performance on the same dataset.

V. CONCLUSIONS

TABLE III: Baseline results from training the object detection model on the three datasets

Train Dataset | Test Dataset | Bicycle | Car   | Dog   | Person | mAP
IDD           | FLIR RGB     | 0.192   | 0.473 | 0.052 | 0.339  | 0.264
IDD           | FLIR THM     | 0.126   | 0.265 | 0.099 | 0.160  | 0.163
IDD           | IDD          | 0.569   | 0.617 | 0.070 | 0.448  | 0.426
KITTI         | FLIR RGB     | -       | 0     | -     | 0.316  | 0.158
KITTI         | FLIR THM     | -       | 0     | -     | 0.141  | 0.070
KITTI         | KITTI        | -       | 0.970 | -     | 0.899  | 0.935

From our experiments in Tables 2 and 3 we can conclude that there is no domain transfer from a
model trained in the visible spectrum to inference in the thermal domain. Thus thermal imagers
can prove to be a valuable addition to object detection pipelines, especially for the robustness
of systems like autonomous vehicles. The results in Table 3 also show that few-shot training of
the Faster-RCNN model from a previously trained model does not perform well across domains and
on new datasets.

VI. FUTURE WORK

Further investigation to evaluate the effect of fusion strategies in the Faster-RCNN network is
ongoing. We would also like to compare the effect of multiple fusion strategies with the baseline
performance.

REFERENCES
[1] S. Agarwal and D. Roth, "Learning a sparse representation for object detection," in Proc. ECCV, p. IV: 113 ff., 2002.
[2] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, "Local features and kernels for classification of texture and object categories: A comprehensive study," in Proc. IEEE CVPR Workshops, 2006.
[3] A. Opelt, A. Pinz, and A. Zisserman, "Learning an alphabet of shape and appearance for multi-class object detection," International Journal of Computer Vision, vol. 80, pp. 16-44, 2008.
[4] Z. Si, H. Gong, Y. N. Wu, and S. C. Zhu, "Learning mixed templates for object recognition," in Proc. IEEE CVPR, 2009.
[5] J. Shotton, "Contour and texture for visual recognition of object categories," Doctor of Philosophy thesis, Queen's College, University of Cambridge, Cambridge, 2007.
[6] J. Shotton, A. Blake, and R. Cipolla, "Multiscale categorical object recognition using contour fragments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 1270-1281, 2008.
[7] K. Schindler and D. Suter, "Object detection by global contour shape," Pattern Recognition, vol. 41, 2008.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results," http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html
[9] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Computer Vision, Springer, 2014, pp. 740-755.
[10] S. Gupte, O. Masoud, R. F. K. Martin, and N. P. Papanikolopoulos, "Detection and classification of vehicles," IEEE Transactions on Intelligent Transportation Systems, 3(1):37-47, Mar. 2002.
[11] FLIR ADAS dataset, https://www.flir.in/oem/adas/adas-dataset-form/
[12] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon, "Multispectral pedestrian detection: Benchmark dataset and baseline," in Proc. IEEE CVPR, Jun. 2015.
[13] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR, 2005.
[14] P. Viola and M. Jones, "Robust Real-time Face Detection," IJCV, 57(2), 2004.
[15] P. Dollar, R. Appel, S. Belongie, and P. Perona, "Fast Feature Pyramids for Object Detection," TPAMI, 36(8), 2014.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE CVPR, 2014.
[17] R. Girshick, "Fast R-CNN," in Proc. IEEE ICCV, 2015, pp. 1440-1448.
[18] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[19] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE CVPR, 2016, pp. 779-788.
[20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu and A. C. Berg, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Computer Vision, Springer, 2016, pp. 21-37.
[21] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," Int. J. Robot. Res., vol. 32, no. 11, pp. 1231-1237, 2013.
[22] M. Cordts et al., "The Cityscapes dataset for semantic urban scene understanding," in Proc. IEEE CVPR, Jun. 2015, pp. 3213-3223.
[23] J. Ge, Y. Luo, and G. Tei, "Real-Time Pedestrian Detection and Tracking at Nighttime for Driver-Assistance Systems," IEEE Transactions on ITS, 10(2), 2009.
[24] Y. Lee, Y. Chan, L. Fu, and P. Hsiao, "Near-Infrared-Based Nighttime Pedestrian Detection Using Grouped Part Models," IEEE Transactions on ITS, 16(4), 2015.
[25] R. Miezianko and D. Pokrajac, "People Detection in Low Resolution Infrared Videos," in CVPR Workshops, 2008.
[26] M. Teutsch, T. Muller, M. Huber, and J. Beyerer, "Low resolution person detection with a moving thermal infrared camera by hot spot classification," in CVPR Workshops, 2014.
[27] G. Varma, A. Subramanian, A. Namboodiri, M. Chandraker and C. V. Jawahar, "IDD: A Dataset for Exploring Problems of Autonomous Navigation in Unconstrained Environments," in IEEE Winter Conf. on Applications of Computer Vision (WACV 2019).
Exploration for Objects Labelling Guided by Environment
Semantics using UAVs

Reem Ashour1, Tarek Taha2, Jorge Dias1, Lakmal Seneviratne1, and Nawaf Almoosa1

Abstract— This paper presents an efficient autonomous exploration strategy for unknown indoor
environments. This strategy focuses on 3D mapping of the environment and on performing
grid-level semantic labeling in order to identify all objects. Unlike conventional exploration
techniques that utilize geometric heuristics and information gain theory on occupancy grid maps,
the work presented in this paper considers semantic information, such as the class of objects,
to gear the exploration towards environment segmentation and object labeling. The proposed
approach utilizes deep learning to map 2D semantically segmented images into 3D semantic point
clouds that encapsulate both occupancy and semantic annotations. A Rapidly-exploring Random
Tree (RRT) algorithm with the proposed semantic cost functions is employed to iteratively
evaluate the global map to label all the objects in the environment, using a novel utility
function that balances exploration and object labeling. The proposed strategy was evaluated in
a realistic simulated indoor environment, and results were benchmarked against other exploration
strategies.

I. INTRODUCTION

The growth in aerial robotics has led to their ubiquitous presence in various fields such as
urban search and rescue (USAR) [1], infrastructure inspection [2] and surveillance [3]. The
completeness and efficiency of mapping and exploration processes are crucial to facilitate these
applications. Some of the recent research has focused on search and rescue activities performed
by Unmanned Aerial Vehicles (UAVs). These robots assist rescue teams with vital information in
time-sensitive situations without endangering human lives. Autonomous capabilities such as the
exploration and mapping of unknown environments are crucial to provide rescuers with richly
reconstructed mapped environments and to increase their understanding of the situation.

In this work, an efficient autonomous semantic-aware 3D mapping and exploration method for
unknown indoor environments is proposed, which utilizes the semantic information encapsulated
in the proposed 3D semantic-aware map for object localization and labeling using a UAV with an
onboard RGBD camera. New utility functions are proposed to direct the robot towards the objects
in the environment. The results show that the model is capable of exploring unknown environments
and labeling objects effectively.

1 Reem Ashour, Jorge Dias and Lakmal Seneviratne are with Khalifa University of Science and
Technology, Abu Dhabi, UAE.
2 Tarek Taha is with Algorythma's Autonomous Aerial Lab, Abu Dhabi, UAE.

Figure 1. Proposed Semantic-aware Exploration and Mapping System Architecture.

II. MAPPING AND INFORMATION QUANTIFICATION

The procedure for mapping and information quantification is shown in Fig. 1. It involves three
stages: object detection, annotated point cloud generation, and 3D semantic-aware mapping. In
the object detection stage, semantic segments are generated for the objects found in a 2D image
frame. The deep neural network Pyramid Scene Parsing Network (PSPNet) for semantic segmentation
[4] is employed to provide semantic segments for the different objects in 2D images. After that,
the point cloud captured from the depth camera is annotated with the corresponding class from
the deep learning model output. This is performed by first registering the depth to the same
reference frame in which the color image is registered, usually the camera frame, and then
transforming the pixels from the camera frame to the real-world frame using the image position,
the depth, and the intrinsic camera parameters to form the point cloud. In the last stage, a 3D
occupied semantic map structure based on an occupancy grid map, the OctoMap [5], is proposed.
The map M = {m1, m2, ..., mi} consists of cubical elements of the same size, where mi is the
voxel with index i. Each voxel mi encapsulates volumetric information and semantic information,
namely the semantic color, a confidence value, and the number of visits.
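The annotated point cloud generation stage can be sketched as a straightforward back-projection:
every pixel with a valid depth is lifted into the camera frame with the pinhole intrinsics,
transformed into the world frame, and tagged with the class predicted by the segmentation
network. The intrinsics, pose and label image below are hypothetical values standing in for the
PSPNet output and the UAV's RGB-D calibration.

```python
import numpy as np

# Hypothetical pinhole intrinsics of the RGB-D camera and camera-to-world pose.
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5
T_world_cam = np.eye(4)          # 4x4 homogeneous transform (identity for brevity)

def semantic_point_cloud(depth, labels):
    """Back-project a depth image into a world-frame point cloud with class labels.

    depth  : (H, W) array of metric depths, 0 where invalid
    labels : (H, W) array of per-pixel class ids from the segmentation network
    returns: (N, 4) array of [x, y, z, class_id] rows
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0.0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)      # camera frame
    pts_world = (T_world_cam @ pts_cam.T).T[:, :3]              # world frame
    return np.column_stack([pts_world, labels[valid]])

# Tiny synthetic example: a flat 1 m plane labelled with class id 3.
cloud = semantic_point_cloud(np.full((480, 640), 1.0), np.full((480, 640), 3))
print(cloud.shape)
```

The resulting labelled points would then be inserted into the OctoMap voxels, updating each
voxel's semantic color, confidence and visit count as described above.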
the ability to explore unknown environments, while
simultaneously optimizing information gathering and
UAE, email: [email protected]
directing the robot to label objects in the environment. To
enable this functionality, two new multi-objective utility should explore. The robot position is assumed to be perfectly
functions are proposed to account for the semantic known. The simulation environment has multi-connected
information (confidence or number of visits) that each voxel rooms with a corridor with several objects placed inside the
encapsulates. The proposed system used quantified rooms. The environment contains 11 objects which are walls,
information from the semantic-occupied map to evaluate the floors, three people, a sofa, two chairs, a book shelve, a vase,
next best exploration action. and a table. The constructed maps are based on 3D occupancy
grid using OctoMap library with res = 0:15m per pixel. Each
A remarkable technique used to explore an unknown
utility function is tested separately under controlled
environment is the Next Best View (NBV) [6] approach. The
environments.
main steps in the exploration task are: A) viewpoint sampling,
B) viewpoint evaluation, C) navigating toward the selected
viewpoint, and D) termination. The exploration process is
summarized in Fig. 2. At the beginning of the exploration
process, a robot uses the onboard sensors to observe the scene
and produce a set of viewpoints candidates (also known as
states or poses). In this work, the Rapidly-Exploring Random
Tree (RRT) [7] is used. The RRT selects a series of points
randomly in a tree-like manner instead of multiple single
points for evaluation. The tree is expanded throughout all the
exploration space, and each branch forms a group of random
branches. The accepted viewpoints candidates are then
evaluated using a utility function (also known as reward, cost, Figure 3: Simulation Environment
or heuristic function). In this work, each point in the branch is
evaluated using a utility function, and the branch which V. EVALUATION METRICS
maximizes the utility function is selected as the next goal.
Although the evaluation is performed for the whole branch, Table I summarizes the evaluation metrics used.
and the best branch is selected for execution, only the first Table I: Evaluation Metrics
edge of the selected branch is executed. The exploration Coverage Percentage of the number of known
process repeats in a receding horizon fashion until a voxels compared with the total number
of voxels the environment can cover.
termination condition is met, indicating the end of the
After each iteration, the coverage is
exploration. In this work, the termination goal used is a calculated as follows
predefined number of iteration. Free  Occupied
VC 
Free  Occupied  Uknown
Detected objects Counting the number of detected objects
in the environment
Efficiently labeled objects Counting manually the number of
objects that are correctly labeled using
the semantic color table
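The receding-horizon NBV loop summarized above can be condensed into a short schematic in code.
The sampler, utility and motion functions are hypothetical placeholders for the RRT sampler, the
semantic utility functions and the UAV motion/mapping interface; only the control flow (evaluate
whole branches, execute only the first edge, stop after a fixed number of iterations) reflects
the description in the text.

```python
import random

def explore(map_state, sample_branches, utility, move_to, max_iterations=50):
    """Receding-horizon next-best-view exploration loop (schematic).

    sample_branches(map) -> list of branches, each a list of candidate viewpoints
    utility(viewpoint, map) -> scalar merit (e.g. volumetric or semantic gain)
    move_to(viewpoint, map) -> updated map after flying to the viewpoint
    """
    for _ in range(max_iterations):                    # termination: fixed iteration budget
        branches = sample_branches(map_state)          # A) viewpoint sampling (RRT in the paper)
        # B) evaluate whole branches, keep the one maximising the summed utility
        best = max(branches, key=lambda b: sum(utility(v, map_state) for v in b))
        # C) execute only the first edge of the best branch, then replan
        map_state = move_to(best[0], map_state)
    return map_state

# Toy stand-ins so the loop runs; real versions would query the OctoMap.
demo = explore(
    map_state={"coverage": 0.0},
    sample_branches=lambda m: [[(random.random(), random.random()) for _ in range(3)]
                               for _ in range(5)],
    utility=lambda v, m: 1.0 - m["coverage"],
    move_to=lambda v, m: {"coverage": min(1.0, m["coverage"] + 0.02)},
    max_iterations=10,
)
print(demo)
```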

VI. EXPERIMENTAL RESULTS


The two proposed utility functions are compared with the
state of the art volumetric gain [9]. The reported results are for
three different experiments simulation tests. The tests are
divided according to the viewpoint evaluation approaches.
Table II shows the recorded values for the evaluation metrics.
Figure 2. General components of th e NBV method The reconstructed 3D semantic maps are shown in Fig. 4, Fig.
5, and Fig. 6 when using the volumetric gain, semantic visible
IV. EXPERIMENTAL SETUP
voxel, and semantic visited voxels for objects of interest
Simulation experiments performed on an ASUS laptop respectively. Figure 7 shows the semantic annotations.
(Intel Core i7 @ 2.8 GHz x 8, 16 GB RAM). The NBV
framework was implemented on Ubuntu 16.04 using the Table II. : Evaluation Results
Robot Operating System (ROS- kinetic) [8] to handle message
Num VS Viewpoint Evaluation VC(%) NDO NSDO
sharing and ease the transition to hardware. The gazebo 1 RRT Volumetric Gain [9] 91.3 % 11 8
simulator was used to perform the actual simulation, with 2 Semantic Visible Voxels 88.5 % 11 7
3 Semantic Visited Object 93.1 % 11 8
programming done in both C++ and Python. The simulation
were performed using a UAV equipped with one RGB-D Viewpoint Sampling (VS), Volumetric Coverage (VC), Number of Detected
Object (NDO), Number of Sufficiently Detected Objects (NSDO)
camera. The environment, shown in Fig. 3, was designed via
gazebo and used as the unknown environment that the robot
Figure 4: 3D Map Using Figure 5: 3D Map Using Semantic
Volumetric Gain Utility Function Visible Voxels

Figure 6: 3D map using Semantic Figure 7: Color annotations map


Visited Voxel for Object of
Interest

ACKNOWLEDGMENT
This publication is based upon work supported by the Khalifa
University of Science and Technology under Award No.
RC1-2018-KUCARS.

REFERENCES

[1] M. Erdelj, E. Natalizio, K. R. Chowdhury, and I. F. Akyildiz, “Help from


the sky: Leveraging uavs for disaster management,” IEEE Pervasive
Computing, vol. 16, no. 1, pp. 24–32, 2017.
[2] N. Hallermann and G. Morgenthal, “Visual inspection strategies for large
bridges using unmanned aerial vehicles (uav),” in Proc. of 7th
IABMAS, International Conference on Bridge Maintenance, Safety and
Management, 2014, pp. 661–667.
[3] A. Wada, T. Yamashita, M. Maruyama, T. Arai, H. Adachi, and H. Tsuji,
“A surveillance system using small unmanned aerial vehicle (uav) related
technologies,” NEC Technical Journal, vol. 8, no. 1, pp. 68–72, 2015.
[4] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
network,” in Proceedings of the IEEE conference on computer vision and
pattern recognition, 2017, pp. 2881–2890.
[5] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard,
“Octomap: An efficient probabilistic 3d mapping framework based on
octrees,” Autonomous Robots, vol. 34, no. 3, pp. 189–206, 2013.
[6] C. Connolly, “The determination of next best views,” in Robotics and
automation. Proceedings. 1985 IEEE international conference on, vol. 2.
IEEE, 1985, pp. 432–435.
[7] S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal
motion planning,” The international journal of robotics research, vol. 30, no.
7, pp. 846–894, 2011.
[8] Koubâa, A. Robot Operating System (ROS). Springer, 2011
[9] A. Bircher, M. Kamel, K. Alexis, H. Oleynikova, and R. Siegwart,
“Receding horizon” next-best-view” planner for 3d exploration,” in Robotics
and Automation (ICRA), 2016 IEEE International Conference on. IEEE,
2016, pp. 1462–1468
Towards game theoretic AV controllers:
measuring pedestrian behaviour in Virtual Reality
Fanta Camara1,2 , Patrick Dickinson2 , Natasha Merat1 and Charles W. Fox1,2,3

Abstract— Understanding pedestrian interaction is of great importance for autonomous vehicles
(AVs). The present study investigates pedestrian behaviour during crossing scenarios with an
autonomous vehicle using Virtual Reality. The self-driving car is driven by a game theoretic
controller which adapts its driving style to pedestrian crossing behaviour. We found that
subjects value collision avoidance about 8 times more than saving 0.02 seconds. A previous lab
study found time saving to be more important than collision avoidance in a highly unrealistic
board-game style version of the game. The present result suggests that the VR simulation
reproduces real-world road crossings better than the lab study and provides a reliable test-bed
for the development of game theoretic models for AVs.

Keywords: Autonomous Vehicles; Game Theory; Cognitive architectures for action selection;
Pedestrian Behaviour

1 Institute for Transport Studies (ITS), University of Leeds, UK
2 Lincoln Centre for Autonomous Systems (LCAS), School of Computer Science, University of
Lincoln, UK
3 Ibex Automation Ltd, UK
This project has received funding from the European Union's Horizon 2020 Research and
Innovation programme under grant agreement No 723395 (InterACT).

I. INTRODUCTION

The upcoming arrival of autonomous vehicles on the roads poses several concerns regarding their
future interaction with other road users, in particular with pedestrians and cyclists, whose
behaviour is more complex and unpredictable. Pedestrian interaction is challenging due to
multiple uncertainties in their pose estimation, gestures and intention recognition. We thus
recently proposed a game theory model for such interactions [3], where a pedestrian encounters
an autonomous vehicle at an unsignalized intersection.

Fig. 1: Two agents negotiating for priority at an intersection.

In this model, two agents (e.g. pedestrian and/or human or autonomous driver) called Y and X
are driving straight towards each other at an unmarked intersection as in Fig. 1. In the model,
this process occurs over discrete space, as in Fig. 2, and discrete times ('turns') during which
the agents can adjust their discrete speeds, simultaneously selecting speeds of either 1 square
per turn or 2 squares per turn, at each turn. Both agents want to pass the intersection as soon
as possible to avoid travel delays, but if they collide, they are both bigger losers, as they
both receive a negative utility Ucrash. Otherwise, if the players pass the intersection, each
receives a time delay penalty -T·UT, where T is the time from the start of the game and UT
represents the value of saving one turn of travel time.

Fig. 2: Sequential Chicken Game.

The model assumes that the two players choose their actions (speeds) aY, aX ∈ {1, 2}
simultaneously and then implement them simultaneously, at each of several discrete-time turns.
There is no lateral motion (positioning within the lanes of the roads) or communication between
the agents other than via their visible positions. The game is symmetric, as both players are
assumed to know that they have the same utility functions (Ucrash, UT); hence they both have the
same optimal strategies. These optimal strategies are derivable from game theory together with
meta-strategy convergence, via recursion. Sequential Chicken can be viewed as a sequence of
one-shot sub-games, whose payoffs are the expected values of the new games resulting from the
actions, and which are solvable by standard game theory.

The (discretized) locations of the players can be represented by (y, x, t) at turn t, with their
actions aY, aX ∈ {1, 2} for speed selection. The new state at turn t + 1 is given by
(y + aY, x + aX, t + 1). Define v_{y,x,t} = (v^Y_{y,x,t}, v^X_{y,x,t}) as the value (expected
utility, assuming all players play optimally) of the game for state (y, x, t). As in standard
game theory, the value
of each 2 × 2 payoff matrix can then be written as

v_{y,x,t} = v\begin{pmatrix} v(y-1,\,x-1,\,t+1) & v(y-1,\,x-2,\,t+1) \\ v(y-2,\,x-1,\,t+1) & v(y-2,\,x-2,\,t+1) \end{pmatrix},   (1)

which can be solved using dynamic programming, assuming meta-strategy convergence equilibrium
selection. Under some approximations based on the temporal gauge invariance described in [3],
we may remove the dependency on the time t in our implementation, so that only the locations
(y, x) are required in the computation of v_{y,x} and in optimal strategy selection.
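A compact sketch of the value recursion in Eq. (1) is given below, dropping the time index as
just described so that values are indexed only by the two distances to the intersection. The
crash/time utilities, the collision rule, and the equilibrium selection (each player
best-responding to a uniformly mixing opponent, as a crude stand-in for the meta-strategy
convergence solution of [3]) are illustrative assumptions, not the authors' implementation.

```python
from functools import lru_cache
import math

U_CRASH, U_T = -60.0, 8.0   # illustrative: crash penalty and value of one turn saved

def alone(d):
    """Delay cost for a player who no longer interacts: drive at top speed (2 per turn)."""
    return -U_T * math.ceil(max(d, 0) / 2)

@lru_cache(maxsize=None)
def value(y, x):
    """(v_Y, v_X): expected utilities with y and x squares left before the intersection."""
    if y <= 0 and x <= 0:
        return (U_CRASH, U_CRASH)            # both enter the intersection together: crash
    if y <= 0:
        return (0.0, alone(x))               # Y has passed; X finishes alone
    if x <= 0:
        return (alone(y), 0.0)
    # 2x2 sub-game: both players pay one turn of delay, then the game continues.
    cont = {(ay, ax): value(y - ay, x - ax) for ay in (1, 2) for ax in (1, 2)}
    # Simplified equilibrium selection: best response to a uniform opponent.
    best_ay = max((1, 2), key=lambda ay: sum(cont[(ay, ax)][0] for ax in (1, 2)))
    best_ax = max((1, 2), key=lambda ax: sum(cont[(ay, ax)][1] for ay in (1, 2)))
    vy, vx = cont[(best_ay, best_ax)]
    return (-U_T + vy, -U_T + vx)

print(value(6, 6))   # e.g. a symmetric start, both six squares from the intersection
```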
Virtual Reality (VR) offers the opportunity to experiment on human behaviour in simulated
real-world environments that can be dangerous or difficult to study, such as pedestrian road
crossing. The present study uses VR to run the game theoretic model on a virtual autonomous
vehicle and then examines the responses of human participants to it.

Contributions: To the best of our knowledge, this is the first attempt to evaluate pedestrian
behaviour during interaction scenarios with a game theoretic autonomous vehicle in a virtual
reality environment. It examines pedestrian road-crossing preferences (Ucrash, UT) when
interacting with the virtual autonomous vehicle and demonstrates the importance of VR for the
development of the model.

II. RELATED WORK

There are few previous studies which investigated interactions between autonomous vehicles and
other road users in VR. Wang et al. [7] developed 5 different behaviours for an autonomous
vehicle. The vehicle behaviour was successfully tested in different simulated traffic scenarios,
such as intersections and lane changing, in simulated city and highway road networks. Keferböck
et al. [5] studied autonomous vehicle interactions with pedestrians in a virtual environment. In
one of their experiments, participants were asked to cross a road in front of them while a car
is approaching. This experiment differs from ours in that their AV stops and shows (or not) a
stopping intent to pedestrians. This study aimed to show the importance of substituting
communications between pedestrians and drivers with some explicit communication forms for
self-driving cars. Pillai [6] used task analysis to divide pedestrian-vehicle interaction into a
sequence of actions giving two outcomes, either the vehicle passes first or the pedestrian
crosses, and performed experiments with participants on their crossing behavior using virtual
reality. Hartmann et al. [4] proposed a testing procedure for pedestrian collision avoidance for
autonomous vehicles using VR techniques. This test bed can take into account different factors
that could influence pedestrian behaviour, such as their understanding of the environment, their
body movement and their personality.

We previously performed laboratory experiments to fit data to the game theory model [3]. We
first asked participants to play this game as a board game in [2]. Secondly, participants were
asked to play the game in person, moving on squares [1]. These previous laboratory experiments
have shown unrealistic results, with participants preferring time saving rather than collision
avoidance. The present study aims to extend these experiments and put participants in more
realistic interaction scenarios with a game theoretic autonomous vehicle in a virtual
environment.

III. METHODS

A. VR Setup

The study was conducted using an HTC Vive Pro head-mounted display (HMD). Participants did not
use the HTC Vive controllers, as no interactions other than walking were required. The HMD was
used with the HTC wireless adapter in order to facilitate easier movement during the simulation.
We used an area of approximately 6 m by 3 m to conduct the simulation (as shown in Fig. 3),
which was mapped using the usual HTC Vive room mapping system. The size of this area slightly
exceeds that recommended by the manufacturer; however, we experienced no technical problems with
tracking or system performance. The start position on the floor was marked with an 'X' using
floor tape, so that participants knew where to stand at the start of each simulation, prior to
placing the HMD on their head. The simulation was created using the Unity 3D engine, and was run
under Windows 10 on a PC based on an Intel Core i7-7700K CPU, with 32 GB of RAM and an Nvidia
GeForce GTX 1080 GPU.

Fig. 3: VR Lab.

B. Car behaviour model

The virtual AV was designed to drive using the Sequential Chicken model described above. The car
began driving 40 meters away from the intersection; its full speed was 30 km/h and its lowest
speed was 15 km/h. The vehicle moved and adapted its behaviour to the participant's motion.
Every 0.02 s, the car observed the current position of the pedestrian and made its decision
based on the game theory model. The car was designed not to stop for any pedestrian. Indeed, in
the sequential chicken model, if the two players play optimally, then there must exist a
non-zero probability for a collision to occur. Intuitively, if we consider an AV to be a player
that always yields, it will make no progress, as the other player will always take advantage of
it; hence there must be some threat of collision.
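The car behaviour just described reduces to a short control loop: every 0.02 s the AV
discretizes its own and the pedestrian's distance to the intersection into grid squares, looks
up the game-optimal discrete speed for that state, and applies the corresponding velocity. The
grid size, the speed mapping and the policy function below are illustrative assumptions rather
than the authors' implementation.

```python
SQUARE = 2.0                     # metres represented by one grid square (assumed)
SPEED = {1: 15.0, 2: 30.0}       # km/h for the two discrete game speeds

def av_step(av_dist_m, ped_dist_m, choose_action):
    """One 0.02 s decision step: map metric distances to grid squares and pick a speed.

    choose_action(y, x) -> 1 or 2, the AV's optimal discrete speed for grid state (y, x),
    e.g. derived from the dynamic-programming values of the sequential chicken game.
    """
    y = max(1, round(av_dist_m / SQUARE))     # AV squares to the intersection
    x = max(1, round(ped_dist_m / SQUARE))    # pedestrian squares to the intersection
    return SPEED[choose_action(y, x)]

# Example with a dummy policy that yields only when the pedestrian is clearly closer.
print(av_step(30.0, 4.0, lambda y, x: 1 if x < y / 2 else 2))
```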
Fig. 4: Virtual Autonomous Vehicle.

C. Human experiment

We invited 11 participants, 10 males and 1 female, aged between 19 and 37 years old, to take
part in the study, under University of Lincoln Research Ethics. 7 participants had previous
experience with VR. Participants were asked to cross a road in front of them as they would do in
everyday life. They were to stop moving on the other side of the road when they reached a yellow
cube, located there for safety reasons. A vehicle approaches from their right-hand side.
Participants began walking about 4 meters away from the intersection. Prior to the experiment,
participants were introduced to the experimental setup and trained on walking within the VR
environment with the VR headset. There were 6 trials per participant in the virtual environment,
with the first trials considered as training data.

Fig. 5: Participant taking part in the study.

IV. RESULTS

In total, 55 pedestrian-vehicle interactions were recorded. Among those interactions,
pedestrians managed to cross the road before the car reached the intersection only 9 times.
These crossings happened after the first trials, by pedestrians who felt confident after
evaluating/gauging the car's driving style. Most interactions looked similar to Fig. 6, which
shows the trajectories of a participant and the autonomous vehicle during one interaction. The
trajectory profile shows that pedestrians slowed down very quickly after seeing the car; they
were not playing the game of chicken optimally, so the AV could cross most of the time.

Fig. 6: Example of pedestrian and AV trajectories (magenta: AV; green: pedestrian).

Similar to the optimal solution computation method developed in the laboratory experiments
[2] [1], we obtain an optimal parameter, θ = Ucrash/UT = −60/8 = 7.5, for participants, as shown
in Fig. 7. This reveals that pedestrians valued avoidance of a crash 8 times more than a 0.02 s
time saving per turn, resulting in pedestrians being less assertive in crossing the road. In
comparison, previous laboratory experiments found that participants valued time saving more than
collision avoidance [2][1].

Fig. 7: Pedestrian behaviour preference.

V. CONCLUSION

The present study demonstrated work in progress on the use of virtual reality for the
development of game theoretic AV controllers. We examined the trajectories of pedestrians
interacting with a virtual autonomous vehicle which makes its decisions based on the sequential
chicken model. The results reveal that pedestrian behaviour is more natural in VR than in
previous laboratory experiments. This is important, as it shows that virtual reality makes
pedestrian crossing behaviour more realistic, and it can therefore help improve the development
of the game theoretic model. Future work will include the evaluation of pedestrian crossing
behaviours with different car models and within different environments. Methods of learning the
best behaviour parameters for the autonomous vehicle will be explored in future VR studies.
R EFERENCES
[1] F. Camara, S. Cosar, N. Bellotto, N. Merat, and C. W. Fox. Towards
pedestrian-av interaction: method for elucidating pedestrian preferences.
In IEEE/RSJ Intelligent Robots and Systems (IROS) Workshops, 2018.
[2] F. Camara, R. Romano, G. Markkula, R. Madigan, N. Merat, and C. W.
Fox. Empirical game theory of pedestrian interaction for autonomous
vehicles. In Measuring Behavior 2018: 11th International Confer-
ence on Methods and Techniques in Behavioral Research. Manchester
Metropolitan University, March 2018.
[3] C. W. Fox, F. Camara, G. Markkula, R. Romano, R. Madigan, and
N. Merat. When should the chicken cross the road?: Game theory
for autonomous vehicle - human interactions. In Proceedings of
VEHITS 2018: 4th International Conference on Vehicle Technology and
Intelligent Transport Systems, January 2018.
[4] M. Hartmann, M. Viehweger, W. Desmet, M. Stolz, and D. Watzenig.
Pedestrian in the loop: An approach using virtual reality. In Inter-
national Conference on Information, Communication and Automation
Technologies (ICAT), 2017.
[5] F. Keferböck and A. Riener. Strategies for negotiation between
autonomous vehicles and pedestrians. In D. G. Oldenbourg, editor,
Mensch und Computer 2015 –Workshopband, 2015.
[6] A. Pillai. Virtual reality based study to analyse pedestrian attitude
towards autonomous vehicles. Master’s thesis, 10 2017.
[7] H. Wang, J. K. Kearney, J. Cremer, and P. Willemsen. Steering
behaviors for autonomous vehicles in virtual environments. In IEEE
Virtual Reality, 2005.
MSPRT action selection model for
bio-inspired autonomous driving and intention prediction
Riccardo Donà1 , Gastone Pietro Rosati Papini 1 and Giammarco Valenti1

Abstract— This paper proposes the use of a bio-inspired action selection mechanism, known as the
multi-hypothesis sequential probability ratio test (MSPRT), as a decision making tool in the
field of autonomous driving. The focus is to investigate the capability of the MSPRT algorithm
to effectively select the optimal action whenever the autonomous agent is required to drive the
vehicle, or to infer the human driver's intention when the agent is acting as an intention
prediction mechanism. After a brief introduction to the agent, we present numerical simulations
to demonstrate how simple action selection mechanisms may fail to deal with noisy measurements,
while the MSPRT provides the robustness needed for the agent implementation on the real vehicle.

1 Department of Industrial Engineering, University of Trento, 38123 Trento, Italy

I. INTRODUCTION

Autonomous vehicles (AVs) require effective algorithms to perform robust decision making in the
shortest time frame possible. Indeed, in a dynamic environment such as the one faced by AVs, the
capability of reacting promptly is a major factor in potentially avoiding collisions and saving
lives. The inherent complexity of the process is worsened by the presence of sensor noise and
uncertainties, which affect the way the behavioural level selects the proper action.

In the early days of autonomous driving, tactical/behavioural level planning typically relied on
manually engineered state machines; this approach was adopted by many competitors of the 2007
DARPA Grand Challenge (a.k.a. the Urban Challenge) [1], [2]. Although some participants actually
managed to succeed, state machines inherently lack the capability of safely generalizing to
unmodeled scenarios. More recent autonomous driving software is built on top of probabilistic
approaches, including Markov Decision Processes [3], or machine learning-based techniques such
as behaviour networks [4] or support vector machines [5]. A promising method is the adoption of
reinforcement learning (RL) as a high-level biasing mechanism for learning an optimal action
selection policy [6] or, conversely, the exploitation of the inverse reinforcement learning
(IRL) framework to learn the reward function from human data [7].

Conversely, the problem of action selection is not a peculiar feature of AVs; rather, any agent
(both artificial and biological) dealing with complex dynamical environments where multiple
mutually exclusive behaviours are possible shares similar dilemmas. Indeed, there exists a huge
amount of ethology literature investigating "behaviour switching" and "decision making" [8], the
common jargon among cognitive scientists to refer to the action selection problem in robotics.

Several theories have been proposed in the literature on how animals perform effective decision
making [9]. For instance, in [10] the affordance competition concept underlines a parallel
processing of multiple actions competing against each other until the selection of the winning
behavior. Such a modeling framework is based on the definition of criteria for assessing the
worthiness of the actions and on the selection process itself.

We exploit this concept of parallel competing actions in the context of the European projects
SafeStrip (https://www.safestrip.eu) and Dreams4Cars (https://www.dreams4cars.eu). In particular,
in SafeStrip we take advantage of the mirroring mechanism introduced in [11] to infer the human
driver's intended action in several dangerous scenarios, such as in the proximity of a pedestrian
crossing, in a road work zone or at an intersection. In the latter case a more complex mirroring
is performed, taking into account the right-of-way rules and mirroring other vehicles. This is
done through vehicle-to-vehicle and vehicle-to-infrastructure communication [12].

Such an inference process boils down to the selection, from a set of longitudinal maneuvers
called motor primitives, of the one matching the driver's intended action in terms of
instantaneous jerk j0. Each motor primitive has an optimality-based formulation characterized by
an associated initial jerk. By defining the jerk space as a 1-dimensional grid we can explore a
set of possible actions, also taking into account infrastructure-based information.

In Dreams4Cars we utilize a similar optimality-based motor primitives approach for the synthesis
of an autonomous driving agent called the Co-driver [13]. In addition to the longitudinal
manoeuvres, we also generate a set of lateral manoeuvres by defining a 1-dimensional grid on the
instantaneous lateral jerk r0. By combining the two grids we devise a 2-dimensional matrix where
each entry is a pair (j0, r0) which encodes a latent action. Each pair is then assigned a merit
via the definition of a scenario-dependent salience.

Common to both projects is the need to select the best action after the computation of the
grids. The rest of this paper is devoted to demonstrating how we can perform such a task by
taking advantage of a biologically inspired action selection mechanism.
process takes place, let us inspect an example simulation
robotics.
1 Department of Industrial Engineering, University of Trento, 38123 1 https://www.safestrip.eu

Trento, Italy [email protected] 2 https://www.dreams4cars.eu


scenario as in Fig. 1. In the proposed situation the ego car, driven by the Co-driver agent, is travelling at high speed on a straight road when a slower vehicle is detected (Fig. 1a). This scenario translates into the control space representation shown in Fig. 1b. The physical-space to control-space transformation is performed via the analytical solution of a linearized vehicle kinematic plant optimal control, similarly to [14], [13]. The green portion of the control space representation expresses the feasible control actions, i.e. the set of pairs (j0, r0) which allow the ego car to stay within the solid lane markings. On the other side, the orange/yellow portion conveys the control inhibition caused by the presence of a slower leading vehicle. The solid orange region is associated to controls that lead to a collision, while the yellow area encodes the potential danger in staying too close to the obstacle. Finally, the white region expresses the speed limit exceedance.

Fig. 1: Example simulation scenario bird-eye view (a) and corresponding control space representation (b).

The motor cortex corresponding to the action space in Fig. 1b can be computed by introducing some merit criterion. For the considered example scenario we model the merit as the maximum time at which, given the pair (j0, r0), the vehicle will leave the road or collide with other road users. In other words we are trying to find which controls allow the vehicle to navigate the longest without any further intervention during the execution. This idea is also known as the minimum intervention principle [15]. Given the biological inspiration of the procedure, we refer to such a time as the salience of the action. By establishing the criterion above, we can compute an artificial motor cortex as in Fig. 2, where the salience is displayed along the z-axis of the 3D plot. It can be noticed how lateral controls close to zero have high merit values as, clearly, steering abruptly will drive the vehicle out of the road sooner than steering mildly, while the orange region in Fig. 1b has a close-to-zero salience due to the inherent risk of colliding in a short time frame.

Fig. 2: Minimum intervention principle based motor cortex for the scenario in Fig. 1.

The motor cortex in Fig. 2 encodes the affordance concepts previously mentioned. Each of the actions is in fact associated to a merit value and competes against the others for winning the selection process. The outcome of the "competition" is the optimal pair (j0*, r0*) that will eventually guide the car for the next time-step.

In the inference-via-mirroring application, the merit assignment procedure is slightly modified to account for both the potential maneuvers and the one currently performed by the driver. After the computation of the scenario-based merit for each initial control, a bias function measuring the proximity of the driver maneuver to each action is applied to the motor cortex, as shown in [16] for the longitudinal control only.

III. ACTION SELECTION

A. WTA algorithm

The most trivial approach to model the affordances competition would be to simply choose the pair having the highest instantaneous salience. This selection mechanism is known as winner takes all (WTA) [9] and has proven to be fairly efficient in the simulation environment, where there is no signal noise. On the other side, this action selection procedure is likely to choose a sub-optimal action in the presence of noise, such as when the agent is driving a real car. Furthermore, even in the simulation environment, this mechanism may give rise to hysteresis when two competing actions share similar salience values, which can cause loss of vehicle control.
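The (j0, r0) grid of latent actions and the WTA rule can be illustrated with a short sketch. The grid bounds, the toy salience surface and all function names below are illustrative assumptions, not the Co-driver's actual merit computation; the real salience is the minimum-intervention time described above.

```python
import numpy as np

def build_action_grid(j_min=-3.0, j_max=3.0, r_min=-0.5, r_max=0.5, n_j=21, n_r=21):
    """2-D grid of latent actions: each entry is a pair (j0, r0) of
    longitudinal and lateral initial jerk (assumed ranges)."""
    j0 = np.linspace(j_min, j_max, n_j)
    r0 = np.linspace(r_min, r_max, n_r)
    return np.stack(np.meshgrid(j0, r0, indexing="ij"), axis=-1)  # shape (n_j, n_r, 2)

def toy_salience(grid, time_headway=2.0, horizon=10.0):
    """Stand-in for the minimum-intervention merit (time before leaving the
    road or colliding): penalizes large lateral jerk and strong acceleration
    towards a leading vehicle.  The real agent computes it from the scenario."""
    j0, r0 = grid[..., 0], grid[..., 1]
    lateral_penalty = np.exp(-(r0 / 0.2) ** 2)          # mild steering preferred
    longitudinal_merit = np.clip(horizon - np.maximum(j0, 0.0) * time_headway, 0.0, horizon)
    return lateral_penalty * longitudinal_merit          # "motor cortex" surface

def wta_select(salience):
    """Winner-takes-all: index of the highest instantaneous salience."""
    return np.unravel_index(np.argmax(salience), salience.shape)

grid = build_action_grid()
cortex = toy_salience(grid)
i, k = wta_select(cortex)
print("WTA action (j0, r0):", grid[i, k])
```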
B. MSPRT algorithm

In order to overcome the problems of the WTA procedure, we propose here the introduction of the multi-hypotheses sequential probability ratio test (MSPRT) [17] decision making algorithm.

The key idea of the MSPRT algorithm is to accumulate evidence for each channel and then pick an action only when the integral reaches a predefined threshold level. This mechanism should guarantee more robustness to noisy decisions by trading off some responsiveness.

The MSPRT has been shown to be asymptotically time-optimal in a multi-alternative process [18]. More recently, a link between the action selection process happening in the basal ganglia of the human brain and the MSPRT algorithm has been drawn [19].

The overall procedure for action selection using the MSPRT algorithm is reported in Algorithm 1. First we append the set of observations at time-step Mt to the list of observations Mlist, which contains observations from t − ws up to t, where ws is the dimension of the averaging window. Then we compute the mean of the observations along the time dimension in order to get an average of each channel's evidence for the considered window. Next we compute the likelihood of each channel according to

$L(t) = y(t) - \log \sum_{k=1}^{N} \exp\big(y_k(t)\big),$   (1)

where y(t) represents the vector of evidence at time-step t obtained via flattening the motor cortex. Then we compute max(exp(L)) to investigate whether some of the channels reached a predefined threshold value. In the positive case we reset the moving average list by taking only a percentage λ of the current average value. Otherwise we continue to follow the previous action until eventually a new action wins the affordance race.

Algorithm 1: MSPRT algorithm
Result: Action log-likelihood
Mlist ← Mt;
M̄ ← mean{Mlist};
compute likelihood L as in (1);
if max(exp(L)) > threshold then
    take action;
    Mlist = λ M̄;
else
    follow previous action;
end

Overall the behaviour of the MSPRT algorithm can be shaped by adjusting the hyper-parameters in Table I.

TABLE I: Parameters of the MSPRT algorithm.
name                 symbol   value     effect
threshold            th       0.0005    slows down the switch to a new channel
window size          ws       8         averages out noise, brings in more robustness
forgetting factor    λ        0.9       introduces a memory effect after the switch to a new channel
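A minimal sketch of Algorithm 1 in Python, using the hyper-parameter names of Table I. The observation source, the reset convention and the numerical details are assumptions for illustration rather than the on-vehicle implementation.

```python
import numpy as np
from collections import deque

class MSPRTSelector:
    """Multi-hypothesis SPRT action selection over a flattened motor cortex
    (a sketch of Algorithm 1 with the hyper-parameters of Table I)."""

    def __init__(self, threshold=0.0005, window_size=8, forgetting=0.9):
        self.threshold = threshold
        self.window = deque(maxlen=window_size)   # Mlist: last ws observations
        self.forgetting = forgetting              # lambda
        self.current_action = None

    def step(self, cortex):
        """cortex: salience array for the current time-step (Mt)."""
        y = np.asarray(cortex).reshape(-1)        # flatten the motor cortex
        self.window.append(y)                     # Mlist <- Mt
        y_bar = np.mean(self.window, axis=0)      # average evidence per channel
        # Channel log-likelihood, eq. (1): L = y_bar - logsumexp(y_bar)
        log_lik = y_bar - y_bar.max() - np.log(np.sum(np.exp(y_bar - y_bar.max())))
        if np.max(np.exp(log_lik)) > self.threshold:
            self.current_action = int(np.argmax(log_lik))   # switch to the winner
            self.window.clear()                   # reset, keeping a fraction lambda
            self.window.append(self.forgetting * y_bar)
        return self.current_action                # otherwise follow previous action

# Example: feed a motor-cortex frame and read out the selected channel
frame = np.random.rand(21, 21)                    # stand-in for a salience grid
selector = MSPRTSelector()
print(selector.step(frame))
```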

IV. SIMULATION COMPARISON

We compare the performance of the MSPRT against the WTA on simulated logged data. Firstly we let the agent drive on a simulated scenario with no noise affecting the measurements. According to this set-up we can perform optimal decision making using a simple WTA algorithm. We then select a 9-second-long critical double lane change maneuver where the responsiveness of the action selection plays a fundamental role. Next, we re-execute the simulation offline, i.e. we take the logged motor cortex history, we apply some random noise on the channels and we re-execute the decision making algorithm only on the corrupted motor cortex. We then analyze again the performance of the WTA against the MSPRT with respect to the ground-truth case obtained previously. The exact parameters used in the simulation above are reported in Table I.

Fig. 3 reports the results of the assessment as a function of the adimensional noise variance σ injected into the motor cortex. In case of limited noise figures, the WTA still outperforms the MSPRT due to the worse transient performance of the latter. As soon as we introduce noise in the simulation, however, the advantages of the MSPRT start to be evident. In this case we chose a fairly conservative tuning for the MSPRT that will make it behave correctly even in the presence of high noise, while the performance of the WTA drops in a more significant manner. Indeed, by shrinking the threshold value and setting λ to zero the MSPRT will perform exactly like the WTA.

Fig. 3: MSPRT vs. WTA channels selection errors. Parameters of the simulation as in Table I.

Another valuable performance index is the number of switches: the lower the number of switches, the more stable the behaviour of the agent. Fig. 4 shows the switching logic for the MSPRT and WTA for a selection of the data-set.
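The offline comparison described above can be approximated in a few lines: corrupt a logged sequence of motor-cortex frames with zero-mean noise of variance σ and count disagreements with the noise-free selection, plus the number of switches. The synthetic log and the selector interface are assumptions consistent with the earlier sketches, not the authors' data-set.

```python
import numpy as np

def selection_errors(log, selector_step, sigma, seed=0):
    """Run a selector over a noise-corrupted copy of a logged motor-cortex
    history and count disagreements with the noise-free WTA ground truth,
    together with the number of channel switches."""
    rng = np.random.default_rng(seed)
    errors, switches, prev = 0, 0, None
    for frame in log:
        truth = int(np.argmax(frame))                         # ground-truth channel
        choice = selector_step(frame + sigma * rng.standard_normal(frame.shape))
        errors += int(choice != truth)
        switches += int(prev is not None and choice != prev)
        prev = choice
    return errors, switches

# Example with a synthetic 9 s log at 10 Hz (stand-in for the recorded maneuver)
log = [np.random.rand(21, 21) for _ in range(90)]
wta_errors, wta_switches = selection_errors(log, lambda f: int(np.argmax(f)), sigma=0.5)
print(wta_errors, wta_switches)
```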
It is evident how the MSPRT not only picks the best action more effectively than the WTA but also tends to stick with a sub-optimal action rather than continuously changing the channel, which could lead to vehicle instability.

Fig. 4: MSPRT vs. WTA channels switching for σ = 0.5. Parameters of the simulation as in Table I.

V. CONCLUSIONS

We have shown that bio-inspired cognitive models can play a substantial role in the process of decision making in automated driving. In particular we demonstrated how the biologically inspired MSPRT algorithm can be adapted to both the inference and the action selection process given a suitable lower-level architecture for the agent. The advantages of the proposed formulation lie in an improved robustness to noisy observations (Fig. 3) and a greater stability of the chosen action (Fig. 4) with respect to traditional action selection. Indeed the effectiveness of the MSPRT depends on the tuning of application-dependent hyperparameters. The activation threshold shapes the sensitivity to the process noise: the lower the threshold the more responsive the action picking, the higher the threshold the more robust the selection. Similar considerations apply for the tuning of the forgetting factor λ. However, we showed via simulation that it is possible to find an effective trade-off adjustment for the MSPRT such that the algorithm outperforms other techniques. In particular, for the considered data-set and σ = 0.5, the MSPRT guarantees an error rate up to 40% lower than the WTA algorithm. Further work will be devoted to the set-up of a "layered" action selection process where a lower layer will be in charge of merging the contributions of channels encoding the same action, to make sure that the affordance competition takes place among statistically independent channels only, in order to run the MSPRT more efficiently.

VI. FUNDING

This work is supported by the European Commission Grants 731593 (Dreams4Cars) and 723211 (SafeStrip).

REFERENCES

[1] T. Gindele, D. Jagszent, B. Pitzer, and R. Dillmann, "Design of the planner of team annieways autonomous vehicle used in the darpa urban challenge 2007," in 2008 IEEE Intelligent Vehicles Symposium. IEEE, 2008, pp. 1131–1136.
[2] C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan, D. Duggins, T. Galatali, C. Geyer et al., "Autonomous driving in urban environments: Boss and the urban challenge," Journal of Field Robotics, vol. 25, no. 8, pp. 425–466, 2008.
[3] S. Brechtel, T. Gindele, and R. Dillmann, "Probabilistic MDP-behavior planning for cars," in 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, 2011, pp. 1537–1542.
[4] J. Schroder, M. Hoffmann, M. Zollner, and R. Dillmann, "Behavior decision and path planning for cognitive vehicles using behavior networks," in 2007 IEEE Intelligent Vehicles Symposium. IEEE, 2007, pp. 710–715.
[5] C. Vallon, Z. Ercan, A. Carvalho, and F. Borrelli, "A machine learning approach for personalized autonomous lane change initiation and control," in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017, pp. 1590–1595.
[6] M. Mukadam, A. Cosgun, A. Nakhaei, and K. Fujimura, "Tactical decision making for lane changing with deep reinforcement learning," 2017.
[7] D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan, "Planning for autonomous cars that leverage effects on human actions," in Robotics: Science and Systems, vol. 2, Ann Arbor, MI, USA, 2016.
[8] D. McFarland, Problems of animal behaviour. Longman Sc & Tech, 1989.
[9] P. Redgrave, T. J. Prescott, and K. Gurney, "The basal ganglia: a vertebrate solution to the selection problem?" Neuroscience, vol. 89, no. 4, pp. 1009–1023, 1999.
[10] P. Cisek, "Cortical mechanisms of action selection: the affordance competition hypothesis," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 362, no. 1485, pp. 1585–1599, 2007.
[11] M. Da Lio, A. Mazzalai, and M. Darin, "Cooperative Intersection Support System Based on Mirroring Mechanisms Enacted by Bio-Inspired Layered Control Architecture," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 5, pp. 1415–1429, May 2018. [Online]. Available: https://ieeexplore.ieee.org/document/8011474/
[12] G. Valenti, D. Piscini, and F. Biral, "A cooperative intersection support application enabled by safe strip technology both for c-its equipped, non-equipped and autonomous vehicles," in European Conference on Intelligent Transportation Systems, 2019.
[13] M. Da Lio, F. Biral, E. Bertolazzi, M. Galvani, P. Bosetti, D. Windridge, A. Saroldi, and F. Tango, "Artificial Co-Drivers as a Universal Enabling Technology for Future Intelligent Vehicles and Transportation Systems," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 1, pp. 244–263, 2015.
[14] M. Werling, S. Kammel, J. Ziegler, and L. Gröll, "Optimal trajectories for time-critical street scenarios using discretized terminal manifolds," The International Journal of Robotics Research, vol. 31, no. 3, pp. 346–359, 2012.
[15] E. Todorov and M. I. Jordan, "A minimal intervention principle for coordinated movement," in Advances in Neural Information Processing Systems, 2003, pp. 27–34.
[16] G. Valenti, L. De Pascali, and F. Biral, "Estimation of longitudinal speed profile of car drivers via bio-inspired mirroring mechanism," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2140–2147.
[17] C. W. Baum and V. V. Veeravalli, "A sequential procedure for multihypothesis testing," IEEE Transactions on Information Theory, vol. 40, no. 6, 1994.
[18] V. Draglia, A. G. Tartakovsky, and V. V. Veeravalli, "Multihypothesis sequential probability ratio tests. I. Asymptotic optimality," IEEE Transactions on Information Theory, vol. 45, no. 7, pp. 2448–2461, 1999.
[19] R. Bogacz and K. Gurney, "The basal ganglia and cortex implement optimal decision making between alternative actions," Neural Computation, vol. 19, no. 2, pp. 442–477, 2007.
A dynamic neural model for endowing intelligent cars with the ability
to learn driver routines: where to go, when to arrive and how long to
stay there?
Flora Ferreira1 , Weronika Wojtak1,2 , Wolfram Erlhagen1 , Paulo Vicente1,2 , Ankit R. Patel2 ,
Sérgio Monteiro2 , and Estela Bicho2

Abstract— For many people, driving a car is a routine activity where they tend to go to the same places at about the same time of day or day of week. We propose a learning system – based on dynamic neural fields – that allows cognitive vehicles/cars to acquire sequential information about driver destinations and corresponding time properties. Importantly, the learning system allows to memorize long sequences and to deal with different temporal scales, and the destinations do not need to be fixed in advance. Learning occurs implicitly and it is a continuous process. Memory recall allows the car to predict the driver's destination intention, when she/he intends to arrive/leave, and for how long she/he intends to stay at a destination. Such personalized information can be used to plan the next trip.

I. INTRODUCTION

Many studies report that human mobility is characterized by a high degree of regularity [1], [2], a significant tendency to spend most of the time in a few locations [3] and a tendency to visit specific destinations at specific times [4], [5]. For example, for many drivers, weekdays consist of leaving home in the morning, driving to the children's school, to work, again to the children's school and returning home in the evening. A person's daily routines are typically coupled with routines across other temporal scales, such as going to the gym or the church on specific days of the week.

Several different approaches, most of them statistical models [6], [7], have been proposed for predicting the next location in human mobility, in which a large amount of data is necessary. Traditional Markov models work well for a specific set of behaviors but have difficulty incorporating temporal patterns across different timescales [8], and destinations need to be fixed in advance. Here, we propose a dynamic neural model for learning information about the sequence of places and timing in the habits of individual drivers. The fundamental assumption is that driving is mostly a routine, and memory recall of the past sequences (of destinations) with time information can be used to predict what is the driver's intent. Learning occurs implicitly – the driver does not need to be asked for his/her destinations – and is a continuous process modeled in the form of coupled dynamic neural fields (DNFs). The theoretical framework of DNFs has been proven to provide key processing mechanisms to implement working memory, prediction, and decision making in cognitive systems (e.g. [9], [10]), including the learning of sequential tasks ([12], [13], [14]).

In this study, the central idea is to explore learning mechanisms able to learn not only the sequence of driver destinations but also time properties, e.g. (i) when to be at a destination and (ii) when to leave. Memory recall allows the car to predict the driver's destination intention, when she/he intends to arrive, and for how long she/he intends to stay there.

*The work received financial support from European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) and national funds, through the FCT (Project "Neurofield", ref POCI-01-0145-FEDER-031393) and ADI (Project "Easy Ride: Experience is everything", ref POCI-01-0247-FEDER-039334), FCT PhD fellowship PD/BD/128183/2016 and the Pluriannual Funding Programs of the research centres CMAT and Algoritmi.
1 Flora Ferreira, Weronika Wojtak, Wolfram Erlhagen and Paulo Vicente are with Center of Mathematics, University of Minho, Portugal [email protected]
2 Weronika Wojtak, Paulo Vicente, Ankit R. Patel, Sérgio Monteiro and Estela Bicho are with Center Algoritmi, University of Minho, Portugal [email protected]

II. THE APPROACH

The approach presented in this paper builds on previous work on memory mechanisms for order and timing of sequential processes [11], [13], [15] based on Dynamic Neural Fields (DNFs) [16], [17]. The central idea of dynamic field models is that relevant information is expressed by supra-threshold bumps of neural activity where each bump represents a specific parameter value. Input from external sources, such as information from a sensor, causes activation in the corresponding population that remains active with no further external input due to recurrent excitatory and inhibitory interactions within the populations. Those interactions are able to hold auto-sustained multi-bump patterns. We assume that the vehicle GPS coordinates and the information whether the car is turning on or off are available. We consider as input the GPS coordinates (latitude and longitude) when the vehicle is turning off or on, which represent the GPS coordinates of a destination at the time the driver arrives or departs, respectively. Figure 1 illustrates an overview of the model architecture consisting of several interconnected neural fields assuming as input the GPS coordinates when the car departs or arrives. For concreteness, we assume the case of "arrives signals" to describe the model.

The 2D field and the four 1D fields on top of the figure implement the encoding and memorizing of the GPS coordinates (latitude and longitude) of the places where the driver was and the relative timing between the moments in
which the driver arrived at those places. At the moment the driver turns off the car, the specific GPS coordinates trigger the evolution of a bump in the input encode ON/OFF field uION/OFF. This activity is projected to corresponding neurons in two 1D fields: the latitude perception field uPlat and the longitude perception field uPlong, resulting in a localized bump in each of these fields. Each of these bumps triggers, through excitatory connections, the evolution of a self-sustained activity pattern at the corresponding site in the latitude sequence memory field uSMlat and the longitude sequence memory field uSMlong, respectively. Inhibitory feedback from uSMlat to uPlat (from uSMlong to uPlong) destabilizes the existing bump in the perception field. This ensures that newly arrived localized input to uPlat (uPlong) will automatically drive the evolution of a bump at a different field location even if the specific cue value is repeated during the course of the sequence. The series of GPS coordinates at the moments when the car was turned off creates a multi-bump pattern in uSMlat and in uSMlong that stores the last sequence of visited places with a strength of activation decreasing from bump to bump as a function of elapsed time since sequence onset. To guarantee the memory of successive routines for the same period of time, a dynamic building of a long term memory in uLMlat (uLMlong) is generated through an excitatory connection from uSMlat (uSMlong). The multi-bump pattern of uSMlat (uSMlong) is projected via excitatory connections to uPlat (uPlong), ensuring the robustness of the encoding process in the face of noisy and potentially incomplete sensory information. The resulting preshaping in uPlat (uPlong) based on prior experience modulates perception thresholds and speeds up the processing of inputs. When predictions are needed, the sequence recall mechanism is activated. During the recall, the four 1D fields and the 2D field on the bottom of Figure 1 become active. The latitude decision field uDlat (uDlong) receives the multi-bump pattern of uLMlat (uLMlong) as subthreshold input. By a continuous increase of the baseline activity in uDlat (uDlong), all subpopulations are brought closer to the threshold for the evolution of self-stabilized bumps. When the currently most active population reaches this threshold, the corresponding output in uR is triggered. At the same time, the excitatory-inhibitory connections between associated populations in uDlat (uDlong) and the latitude working memory field uWMlat (uWMlong) guarantee that the suprathreshold activity representing the latest sequence event becomes first stored in the working memory field and subsequently suppressed. The global initial value of h in uDlat (uDlong) is proportional to the sequence duration (e.g. 24 hours) minus the time in advance (e.g. 10 minutes) at which the arrival time at or departure time from a specific place should be predicted.

The population dynamics in each memory field is governed by an integro-differential equation, which describes the activation of interconnected neurons along a one- or two-dimensional domain [18]:

$\tau \frac{\partial u(r,t)}{\partial t} = -u(r,t) + S(r,t) + h + \int_{\Omega} w(r,r') f\big(u(r',t)\big)\, dr'$   (1)

where u(r,t) represents the activity at time t at position r on the domain Ω as a subset of R^d with d = 1 or d = 2. The constant τ > 0 defines the time scale of the field dynamics. The function S(r,t) represents the time-dependent, localized input to the field. The global inhibition h < 0 defines the baseline level of activation to which the field excitation decays without external stimulation. The connectivity function w(r, r') models how a population of neurons at position r in the field interacts with a population at position r'. For the fields in which only one bump at a time should evolve (e.g. uPlat, uD), we use a standard kernel of lateral inhibition type [18]. To enable multi-bump solutions in the memory fields (e.g. uSMlat, uLMlat) we assume a kernel with oscillatory rather than monotonic decay [16], [17]:

$w(r) = A_w e^{-b r}\big(b \sin(\alpha r) + \cos(\alpha r)\big)$   (2)

where r = |x| for 1D and r = \sqrt{x^2 + y^2} for 2D; the parameters A_w > 0, b > 0 and α > 0 control the amplitude, the rate at which the oscillations in w decay with distance, and the zero crossings of w, respectively. The firing function f is taken as the Heaviside step function with threshold 0.
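To experiment with the field dynamics, the sketch below integrates a 1D version of equation (1) with the oscillatory kernel of equation (2) by simple Euler stepping; a transient Gaussian input leaves behind a self-sustained bump. Grid size, time step and parameter values are assumptions chosen for illustration, not the values used in the model.

```python
import numpy as np

# 1-D spatial grid and oscillatory interaction kernel, eq. (2)
L, n, dx = 60.0, 600, 0.1
x = np.linspace(-L / 2, L / 2, n)
A_w, b, alpha = 2.0, 0.3, 1.0
r = np.abs(x)
w = A_w * np.exp(-b * r) * (b * np.sin(alpha * r) + np.cos(alpha * r))

def f(u):
    """Heaviside firing function with threshold 0."""
    return (u > 0.0).astype(float)

def step(u, S, h, tau=1.0, dt=0.05):
    """One Euler step of eq. (1): tau du/dt = -u + S + h + conv(w, f(u))."""
    interaction = dx * np.convolve(f(u), w, mode="same")
    return u + dt / tau * (-u + S + h + interaction)

# Transient localized input (e.g. one GPS reading mapped onto the field)
u, h = np.full(n, -1.0), -1.0
S_on = 3.0 * np.exp(-((x - 5.0) ** 2) / 2.0)
for t in range(400):
    u = step(u, S_on if t < 100 else 0.0, h)
print("bump persists after input is removed:", bool((u > 0).any()))
```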
A. Memory of interval timing between the destinations

To establish a stable activation gradient in the sequence memory fields we consider the following state-dependent dynamics [11], [13], [14]:

$\tau_h \frac{\partial h(x,t)}{\partial t} = \big(1 - f(u(x,t))\big)\big(-h(x,t) + h_0\big) + k\, f(u(x,t))$   (3)

where h0 defines the level to which h relaxes without suprathreshold activity at position x and k > 0 measures the growth rate when it is present. The adaptation of the resting level h is performed locally at field sites with suprathreshold activity. The memory of the interval timing between the places at which the car arrives or departs is stored in the peak amplitudes. Considering as input the GPS coordinates of a place at the moment in which the car is turned off, the difference between two successive peak amplitudes represents the interval timing between the two successive places at which the car was parked.

B. Memory of time spent in each destination

To create a memory of the time duration in each place a 2D dynamic neural field is used. During recall this field receives as input the corresponding localized activation from the output recall OFF field, and a representation of the time duration spent in each place is obtained by applying the following dynamics for h:

$\tau_h \frac{\partial h(x,y,t)}{\partial t} = k\, f\big(u_{ROFF}(x,y,t)\big)\big(1 - f(u_{RON}(x,y,t))\big).$   (4)

The h value increases locally as a function of elapsed time only in the presence of activation in uROFF and when simultaneously the corresponding subpopulation in uRON is not activated (see Figure 2).
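The timing mechanism of equations (3) and (4) can be illustrated in isolation: wherever the field is suprathreshold, the local resting level h grows at rate k, so differences between peak amplitudes encode elapsed time. The parameter values and the two synthetic events below are assumptions for illustration.

```python
import numpy as np

def adapt_h(h, active_mask, dt, tau_h=1.0, h0=-1.0, k=0.05):
    """State-dependent resting-level dynamics, eq. (3): h relaxes to h0 where
    the field is silent and grows at rate k where it is suprathreshold, so
    peak amplitude encodes elapsed activation time."""
    dh = (1.0 - active_mask) * (-h + h0) + k * active_mask
    return h + dt / tau_h * dh

n, dt = 200, 0.1
x = np.arange(n)
h = np.full(n, -1.0)

# Event 1: a bump around x=50 becomes active at t=0; event 2 around x=150
# becomes active 30 time units later.  Both stay active until t=60.
for step_i in range(600):
    t = step_i * dt
    mask = ((np.abs(x - 50) < 5) & (t >= 0.0)) | ((np.abs(x - 150) < 5) & (t >= 30.0))
    h = adapt_h(h, mask.astype(float), dt)

# The difference between the two peak amplitudes reflects the 30-unit lag
print(h[50] - h[150])   # roughly k * 30 = 1.5
```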
Fig. 1. Schematic view of the model architecture with several interconnected neural fields implementing sequence learning, memory and sequence recall for the depart/arrive signals (GPS coordinates when the car arrives at or departs from a destination as input). For details see the text.

Fig. 2. Left: snapshot of the activation of a bump in the output recall fields uROFF and uRON. Right: snapshot of a stable activation pattern corresponding to the duration time in each place already recalled.

Having a field with information about the time spent in each place will be useful, for example, to predict the time the car will stay at a specific destination.

III. RESULTS

As an example, we consider a day driver routine: depart from home to take the kids to school, next go to work, go to the restaurant to have lunch, come back to work and in the evening go to pick up the children from school, take them to the gym and finally come back home. To simulate this example we assume as input real GPS coordinates of a Portuguese city (Guimarães) at realistic times. Figure 3 (A) illustrates the sequence memory of where and when the car departs. The GPS coordinates memorized are represented with the letter P symbol. The closer points (less than 400 meters apart in this case) represent the same place. The map illustrates the memory of five different places (Home, School, Work, Restaurant and Gym), of which two places (School and Work) were visited by the driver twice. Stable activation patterns corresponding to the memory of the GPS coordinate (latitude and longitude) sequence are represented in two 1D fields where the bump amplitudes reflect the order of the places where the car departed from and the relative timing between them. Figure 3 (B) shows the representation in a 2D field of the time duration that the car was parked in each place during a day (from 0:00 to 24:00). Each P symbol on the map corresponds to a bump in the 2D field, and the amplitude represents the duration in each location. The higher amplitudes represent the places in which the car was parked for a longer time (i.e. work and home).

Fig. 3. (A) Part of the Guimarães map in Portugal generated from Google Maps showing a sequence memory of the GPS coordinates of where the car departed during a day. The GPS coordinates represented in the bump centers are marked with the letter P symbol. The closer points represent the same place. The seven marks are in five different places (Home, School, Work, Restaurant and Gym), of which two places have two marks. (B) Representation of the time duration in each visited place.

IV. CONCLUSION

We have presented an approach to learn ordinal and temporal aspects of driver routines using the theoretical framework of dynamic neural fields. The learning is implicit, continuous and can be scaled to different temporal scales. The model can be instantiated for each day of the week, and hence different routines can be learned. There are several possible uses for such learned memories. In terms of navigation systems, smarter route selection/recommendation could be provided through the integration of these memories with other factors such as traffic conditions, without requiring input from the driver. The car could predict the next destination and the desired time of arrival and alert the driver if she is getting late to come to the car. Predicting the next departure time could be used for preparing the cockpit's comfort in advance – e.g. demist/defrost the windows and set a pleasant temperature – some time before the driver (and occupants) enter the car. Future work concerns implementing and testing this learning system in real driving scenarios, in the scope of the joint UMinho and Bosch project – "Easy Ride: Experience is everything" (ref POCI-01-0247-FEDER-039334).

REFERENCES

[1] N. Eagle and A. S. Pentland, Eigenbehaviors: Identifying structure in routine, Behavioral Ecology and Sociobiology, 63(7), 2009, pp. 1057-1066.
[2] C. Song, Z. Qu, N. Blumm, and A.L. Barabási, Limits of predictability in human mobility, Science, 327(5968), 2010, pp. 1018-1021.
[3] C. Song, T. Koren, P. Wang, and A.L. Barabási, Modelling the scaling properties of human mobility, Nature Physics, 6(10), 2010, p. 818.
[4] S. Jiang, J. Ferreira, and M.C. González, Clustering daily patterns of human activities in the city, Data Mining and Knowledge Discovery, 25(3), 2012, pp. 478-510.
[5] S. Rinzivillo, L. Gabrielli, M. Nanni, L. Pappalardo, D. Pedreschi, and F. Giannotti, The purpose of motion: Learning activities from individual mobility networks, In 2014 International Conference on Data Science and Advanced Analytics (DSAA), 2014, pp. 312-318.
[6] R. Simmons, B. Browning, Y. Zhang, and V. Sadekar, Learning to predict driver route and destination intent, In 2006 IEEE Intelligent Transportation Systems Conference, 2006, pp. 127-132.
[7] M. Boukhechba, A. Bouzouane, S. Gaboury, C. Gouin-Vallerand, S. Giroux, and B. Bouchard, Prediction of next destinations from irregular patterns, Journal of Ambient Intelligence and Humanized Computing, 9(5), 2018, pp. 1345-1357.
[8] B.P. Clarkson, Life patterns: structure from wearable sensors, PhD diss., Massachusetts Institute of Technology, 2002.
[9] G. Schöner, Dynamical systems approaches to cognition, Cambridge Handbook of Computational Cognitive Modeling, 2008, pp. 101-126.
[10] Y. Sandamirskaya, S. K. Zibner, S. Schneegans, and G. Schöner, Using dynamic field theory to extend the embodiment stance toward higher cognition, New Ideas in Psychology, 31(3), 2013, pp. 322-339.
[11] F. Ferreira, W. Erlhagen, and E. Bicho, A dynamic field model of ordinal and timing properties of sequential events, In International Conference on Artificial Neural Networks, Springer, Berlin, Heidelberg, 2011, pp. 325-332.
[12] Y. Sandamirskaya and G. Schöner, Dynamic field theory of sequential action: A model and its implementation on an embodied agent, In 2008 7th IEEE International Conference on Development and Learning, 2008, pp. 133-138.
[13] F. Ferreira, W. Erlhagen, E. Sousa, L. Louro, and E. Bicho, Learning a musical sequence by observation: A robotics implementation of a dynamic neural field model, In 4th International Conference on Development and Learning and on Epigenetic Robotics, 2014, pp. 157-162.
[14] F. Ferreira, W. Wojtak, E. Sousa, L. Louro, E. Bicho, and W. Erlhagen, Rapid learning of complex sequences with time constraints: A dynamic neural field model, (under review).
[15] W. Wojtak, F. Ferreira, L. Louro, E. Bicho, and W. Erlhagen, Towards temporal cognition for robots: A neurodynamics approach, In 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), 2017, pp. 407-412.
[16] C.R. Laing, W.C. Troy, B. Gutkin, and G.B. Ermentrout, Multiple bumps in a neuronal model of working memory, SIAM Journal on Applied Mathematics, 63(1), 2002, pp. 62-97.
[17] F. Ferreira, W. Erlhagen, and E. Bicho, Multi-bump solutions in a neural field model with external inputs, Physica D: Nonlinear Phenomena, 326, 2016, pp. 32-51.
[18] S. Amari, Dynamics of pattern formation in lateral-inhibition type neural fields, Biological Cybernetics, 27, 1977, pp. 77-87.
Towards an Evaluation Methodology for the
Environment Perception of Automotive Sensor Setups
How to define optimal sensor setups for future vehicle concepts?

Maike Hartstern1,2, Viktor Rack1 and Wilhelm Stork2


1 BMW Group Research, New Technology and Innovation, Munich, Germany
2 Institute for Information Processing Technologies, KIT Karlsruhe Institute of Technology, Karlsruhe, Germany
[email protected]

Abstract— With an increasing degree of automation, vehicles require more and more perception sensors to observe their surrounding environment. Car manufacturers are facing the challenge of defining a suitable sensor setup that covers all requirements. Besides the sensors' performance and field of view coverage, other factors like setup costs, vehicle integration and design aspects need to be taken into account. Additionally, a redundant sensor arrangement and the sensors' sensitivity to environmental influences are of crucial importance for safety. It is not feasible to explore every possible sensor combination in test drives. This paper presents a new simulation-based evaluation methodology, which allows the configuration of arbitrary sensor setups and enables virtual test drives within specific scenarios to evaluate the environmental perception in early development phases with metrics and key performance indicators. This evaluation suite is an important tool for researchers and developers to analyze setup correlations and to define optimal setup solutions.

Keywords— external vehicle sensors, perception, vehicle sensor setup configuration, sensor performance, virtual testing, simulation

I. INTRODUCTION

Advanced Driver Assistance Systems (ADAS) support the driver with functions like Adaptive Cruise Control (ACC), Emergency Braking and Parking Assistant [1, 2]. Vehicles are equipped with several external perception sensors to provide these functions with information about the car's environment. Automotive development is now heading towards Highly Automated Driving (HAD), Fully Automated Driving (FAD) and finally towards driverless Autonomous Driving (AD) to enhance driving comfort and road safety. This rising degree of automation comes along with an increasing number of required perception sensors like cameras, radars, lidars and ultrasonic sensors to ensure sufficient coverage of the vehicle's surroundings. While the sensor setup for a typical ADAS was still manageable with one radar, a few cameras and four ultrasonic sensors at the front and back respectively, future setups will be more complex. To satisfy all requirements resulting from the higher automation degree, like 360° surround view, far and near field coverage and redundancy, they will need to be built up with a much higher quantity of diverse sensors.

This leads to a variety of sensor configurations, which have to be analyzed to find the best solution [3]. Concerning the setup, three aspects have to be considered:

Sensor Configuration: The setup can be established with diverse sensors that work according to different measurement principles. A redundant sensor arrangement shall ensure that important functions are still executable if one sensor drops out or a particular sensor technology is weak in a specific situation. The setup has to cover the vehicle's surroundings without dangerous blind spots. Aside from that, three different areas of interest have to be covered: the far field (highway driving), the near field (urban driving) and the ultra-near field (parking/start driving).

Sensor Integration: Besides optimal mounting positions and sensor alignment concerning Field Of View (FOV) coverage and sensor functionality, the feasibility of the geometrical integration and design aspects have to be considered. In addition, environmental influences like sensor occlusion due to dirt, weather and lighting conditions are crucial for sensor performance.

Sensor Benchmarking: Sensor specifications like FOV, range, detection probability and accuracy are crucial. However, another important factor is the cost of the overall setup. Cost-benefit analyses can reveal, e.g., whether two adjoining sensors with small FOV can replace an expensive sensor with high FOV.

Defining an adequate sensor setup is a complex task. So far, there is a lack of a consistent evaluation procedure and a suitable tool to support developers in solving this task in a time- and cost-efficient way. Thus, we are addressing the research question:

"Which evaluation methodology can be applied to determine the performance of automotive sensor setups regarding their environmental perception in an early development phase?"

In this paper, we introduce a simulation-based evaluation concept that assists the procedure of evolving an optimal sensor setup in the context of automated driving, based on a reliable evaluation methodology. This also helps researchers to analyze setup correlations and influences of sensor parameters.

II. RELATED WORK AND BASICS

Perception sensors are the first part of a complex data processing chain, which is visualized schematically in Fig. 1.

Figure 1: Schema of the data processing chain in automated vehicles (environment → sensors 1..n → preprocessing → data fusion → environment model → ADAS functions → actuators).

Research supported by PRYSTINE project (grant agreement n° 783190).
The measurement data are preprocessed and then enter the data fusion part. Afterwards, an environment model provides all the gathered information about the vehicle's surroundings, which is then forwarded to the ADAS functions that can trigger actuators, e.g. braking and steering. The determination of the final sensor setup plays a central role in the development of the whole data processing system since all these parts are based on it. Consequently, the setup has to be defined at an early phase of the development process.

To date, sensor evaluation consists of various measurement procedures to analyze specific use cases and to validate the stated sensor specifications and measurement protocols under specified conditions. This allows comparing different sensors of the same technology. However, a practicable method is missing to constitute the performance of the complete setup. The execution of test drives, as well as the subsequent data analysis, is time-consuming and cost-intensive. Hence, this method is not feasible to compare many setup concepts in an early development phase. Thus, we propose the use of a simulation tool that assists developers in sensor configuration and setup evaluation concerning the environmental perception. Using our approach, the process of defining the particular sensor setup could follow a three-step procedure:

(1) Identification of sensor specifications: Determining the performance of particular sensors based on datasheets, measurement protocols and corner-case tests [4].
(2) Simulation of the sensor setup: Using the information of step (1) to feed a simulation tool and obtain results on the overall setup performance.
(3) Test drives: Optimizing the setup through performance tests under real environmental conditions for the setup chosen in step (2).

Steps (1) and (3) describe the state-of-the-art evaluation method. The intermediate step (2) provides the potential to clarify correlations within this complex interrelationship of sensors. An appropriate simulation tool can thus be an important instrument to assist the process of defining the optimal setup in a time- and cost-efficient way at this early development stage [5].

III. APPROACH TOWARDS A SIMULATION FRAMEWORK

As Table I illustrates, many simulation tools exist for the purpose of automotive systems engineering [6]. Most of them address the technical consolidation and test bench applications in Hardware-in-the-Loop (HiL) and Software-in-the-Loop (SiL) systems to validate ADAS functions or to develop data fusion through virtual testing [7, 8]. For our use case, the focus has to be shifted towards the sensor configuration part.

TABLE I. SIMULATION TOOLS FOR AUTOMOTIVE SYSTEMS ENGINEERING
Simulation tool                        Company
CarMaker                               IPG Automotive
PreScan + DRS360                       Siemens
DRIVE                                  NVIDIA
Virtual Test Drive VTD                 Vires
CANape / vADASdeveloper                Vector
DYNA4 Driver Assistance                TESIS
ASM Traffic                            dSPACE
Pro-SiVIC                              CIVITEC
Automated Driving System Toolbox       MathWorks
Other simulation tools                 rFpro, ANSYS, Addfor, …

We are looking for a compact toolchain which enables a configuration of sensor setups and a report on the overall setup performance. Thus, the main tool aspects are investigated:

Virtual Environment: Environment simulations create object lists as ground truth output, i.e. abstract information as classified objects with indications like object type, position, heading, velocity, acceleration and bounding box dimensions for each time-stamp (high-level data). The trend is towards physics-based simulation, which generates low-level sensor data (radar signals, pixel pictures, lidar point clouds). This allows to incorporate environmental influences on the signal propagation (weather/lighting). Those tools are still in the development stage, particularly with regard to physical parameters like object shape and material properties for various object surfaces.

Sensor Models: Virtual sensors can be divided into four groups regarding their abstraction level (see Fig. 2). As a first approach, we decided for probabilistic models that consider parameters like FOV, range and the statistical error behavior (detection probability, accuracy, false positives/negatives) [9]. With those modification options, these generic models can be adjusted to the properties of a specific sensor with good fidelity [10]. Phenomenological models are extensions of the probabilistic ones, which incorporate "situational effects", e.g. more measurement errors while driving into a tunnel. Usually, this information is only available after test drives. Thus, it cannot be implemented for research purposes. Ideal models are purely geometry-based (FOV, range) and highly simplified: they do not cover measurement errors. Physics-based models provide the highest fidelity [11]. Based on raytracing, they simulate signal propagation considering influences of weather and lighting effects as well as signal reflections on objects [12]. They cannot be used for our use case so far, since they are not available at the early time of setup configuration. Besides, their low-level data output requires a processing module to extract object detections out of the simulated raw data, which is not provided by any tools.

Data Fusion: There are several approaches regarding data fusion solutions for multi-sensor systems [13]. Usually, extended Kalman filters are applied in this context [14, 15]. Simulations are used to adapt the fusion algorithms to the specific sensor. In contrast, this part should remain fixed for the purpose of a sensor evaluation to ensure a consistent base.

Unfortunately, available tools do not include a ready-to-use data fusion module and an evaluation suite to assess the environmental perception of sensor setups. However, the sensor setup evaluation requires an intermediate stage of framework which focuses on the sensor part with many modification options and an adequate fidelity. Thus, we established a new simulation workflow that allows the configuration of arbitrary setups and enables virtual test drives within specific scenarios to evaluate the environmental perception in early development phases quantitatively with key performance indicators (KPIs).

Figure 2: Abstraction levels of sensor models (ideal, probabilistic, phenomenological, physical), ordered by increasing complexity and realism. [cf. Baselabs GmbH [17]]
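As an illustration of the probabilistic sensor models discussed above, the sketch below maps ground-truth object positions to simulated detections through an FOV/range gate, a detection probability, Gaussian position noise and an occasional false positive. Parameter values and function names are assumptions; this is not the model shipped with any of the listed tools.

```python
import math
import random

def probabilistic_sensor(objects, fov_deg=120.0, max_range=80.0,
                         p_detect=0.9, pos_sigma=0.3, p_false=0.05):
    """Map ground-truth objects (x, y in the sensor frame, metres) to noisy
    detections: FOV/range gating, missed detections, Gaussian position error
    and occasional false positives."""
    detections = []
    for x, y in objects:
        rng, bearing = math.hypot(x, y), math.degrees(math.atan2(y, x))
        if rng > max_range or abs(bearing) > fov_deg / 2:
            continue                                   # outside the field of view
        if random.random() > p_detect:
            continue                                   # missed detection
        detections.append((x + random.gauss(0, pos_sigma),
                           y + random.gauss(0, pos_sigma)))
    if random.random() < p_false:                      # clutter / false positive
        r = random.uniform(1.0, max_range)
        a = math.radians(random.uniform(-fov_deg / 2, fov_deg / 2))
        detections.append((r * math.cos(a), r * math.sin(a)))
    return detections

print(probabilistic_sensor([(20.0, 1.5), (60.0, -4.0), (120.0, 0.0)]))
```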
IV. WORKFLOW CONCEPT AND METHODOLOGY

Our framework covers the entire process chain from data acquisition (ground truth, sensor data), data processing (data fusion) and data evaluation (metrics) to data analysis (KPIs) in order to derive data insights and knowledge (setup performance). The workflow is visualized in Fig. 3 and is divided into two parts.

In the first part, ground truth data are generated with an environment simulation based on object lists. In the presented context, IPG CarMaker [16] is used. In this simulation part, the virtual environment, the traffic and the driving maneuver of the ego vehicle are designed to create a scenario. The simulated ground truth data of the scenario are transferred to the second workflow part, which consists of the sensor configuration and evaluation suite. We used the software Baselabs Create [17], which is a development tool for environmental perception applications for automated vehicles. The ground truth data can either be processed directly or be recorded in a scenario collection for use in replay mode. After the scenarios have been recorded, the first workflow part is not required anymore for working with the toolchain. Thus, the second workflow part is an independent framework, which leads to a flexible working tool.

The sensor configuration and evaluation suite contains three modules: First, the ground truth data enter the data fusion designer with sensor models and the setup configuration. In this part, the ground truth data are modified according to the settings of probabilistic sensor models. For each virtual sensor of the setup, a measurement model, a detection model as well as a track management strategy is selected via a graphical user interface (GUI). For this, the individual sensor parameters are set according to the sensor properties obtained from step (1) of the three-step procedure, Identification of sensor specifications. In addition, it is possible to modify the available sensor models or to add new models by the programmatic usage of a Software Development Kit (SDK). The simulated sensor data are fused in the data fusion module, which relies on extended Kalman filters and feeds the subsequent environment model. Afterwards, the simulated and fused sensor data are compared with the initial ground truth data in the evaluation metrics module that was built especially for this workflow. In this part, custom evaluation metrics can be added. Based on those metrics, a KPI report containing the results for all scenarios is created to assess the perception performance quantitatively in a compact overview. After that, KPI reports of different setups are compared and further aspects like cost-benefit considerations are analyzed to decide whether a particular sensor setup is worth being tested in the next step of the three-step procedure: (3) Test drives.

V. EVALUATION AND METRICS

The evaluation suite calculates the KPIs "detection time", "detection rate" and "false alarm rate". In addition, three metrics are implemented and presented below. A set G = {g_i} of ground truth objects g_i for i = 1, ..., m as well as a set E = {e_j} of estimated objects e_j for j = 1, ..., n are assumed. The elements of the sets are multidimensional, including object position, heading, velocity and acceleration. The Euclidean distance d(g_i, e_j) calculates the distance between actual and estimated objects.

The Hausdorff metric d_H(G, E) in (1) is insensitive to different cardinalities and weights outliers heavily [18]:

$d_H(G,E) = \max\Big\{ \max_{g_i \in G} \min_{e_j \in E} d(g_i, e_j),\; \max_{e_j \in E} \min_{g_i \in G} d(g_i, e_j) \Big\}$   (1)

The Optimal SubPattern Assignment (OSPA) metric $\bar{d}_{OSPA}^{(c)}(G,E)$ in (2) punishes outliers less than the Hausdorff metric, depending on the cutoff parameter c. Instead, it penalizes the scenario when a ground truth object has several estimated objects [18]:

$\bar{d}_{OSPA}^{(c)}(G,E) = \sqrt{ \frac{1}{n} \Big( \min_{\pi \in \Pi_n} \sum_{i=1}^{m} d^{(c)}\big(g_i, e_{\pi(i)}\big)^2 + c^2 (n - m) \Big) }$   (2)

To consider the position confidence of the estimated objects, the Normalized Estimation Error Squared (NEES) in (4), also known as the Mahalanobis distance [15], includes the position error (g − e) and the covariance matrix P_ge:

$NEES(g_i, e_j) = \sqrt{ (g_i - e_j)^T P_{ge}^{-1} (g_i - e_j) }$   (4)
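A minimal sketch of the three metrics for 2D object positions, assuming NumPy/SciPy; the OSPA assignment uses the Hungarian algorithm and the cutoff convention of [18], so details may differ from the evaluation suite's internal implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def hausdorff(G, E):
    """Hausdorff metric, eq. (1): max of the two directed set distances."""
    D = cdist(G, E)                              # pairwise Euclidean distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def ospa(G, E, c=5.0):
    """OSPA metric, eq. (2), for m = len(G) <= n = len(E)."""
    m, n = len(G), len(E)
    D = np.minimum(cdist(G, E), c)               # cutoff distance d^(c)
    rows, cols = linear_sum_assignment(D ** 2)   # optimal assignment pi
    return np.sqrt(((D[rows, cols] ** 2).sum() + c ** 2 * (n - m)) / n)

def nees(g, e, P):
    """NEES / Mahalanobis distance, eq. (4), with error covariance P."""
    err = np.asarray(g) - np.asarray(e)
    return float(np.sqrt(err @ np.linalg.inv(P) @ err))

G = np.array([[0.0, 0.0], [10.0, 2.0]])               # ground truth positions
E = np.array([[0.3, -0.2], [9.5, 2.4], [30.0, 0.0]])  # estimates incl. a false track
print(hausdorff(G, E), ospa(G, E), nees(G[0], E[0], np.eye(2) * 0.25))
```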
VI. CONCLUSION

We presented a new simulation-based concept that is suitable for the evaluation of perception sensor setups for automated vehicles regarding particular sensor mounting positions, diverse setup configurations and different sensor technologies. Our established framework is an important instrument that assists automotive system engineers during the early stages of development and supports them with a performance overview for different setups in relevant scenarios. The evaluation suite allows configuring the sensor setup via a simple GUI, while it is still possible to access the software code to modify the probabilistic sensor models and to implement evaluation metrics. By studying the correlations within the sensor data processing chain, requirement profiles in terms of a roadmap for future sensor technologies can be derived. In future work, the method could be adapted to a physics-based simulation like PreScan [19] to extend the sensor evaluation by considering physical effects.

Sensor Specifications Test Drives


Environment Simulation Sensor Simulation and Analysis Part
Virtual Environment Ground Truth (Object Lists) Live Sensor Configuration and Evaluation Suite
+ Setup Analysis
Record
Replay

Scenario Generation
Data Fusion
Environment Model
+

Benefit

Evaluation
Cost

Sensor Models
s

+ Metrics
Sensor Setup
</>
Collection

Configuration
Scenario

[GUI] + -
* % KPI
[SDK] Report

Data Acquisition Data Processing Data Evaluation Data Analysis


Figure 3. Schematic visualization of our simulation framework and our proposed workflow for the configuration and evaluation of perception sensor setups.

Fig. 1.
REFERENCES
[1] A. Ziebinski, R. Cupek, D. Grzechca, and L. Chruszczyk, "Review of advanced driver assistance systems (ADAS)," in AIP Conference Proceedings 1906, p. 120002.
[2] K. Bengler, K. Dietmayer, B. Farber, M. Maurer, C. Stiller, and H. Winner, "Three decades of driver assistance systems: Review and future perspectives," IEEE Intelligent Transportation Systems Magazine, vol. 6, no. 4, pp. 6–22, Winter 2014.
[3] J. van Brummelen, M. O'Brien, D. Gruyer, and H. Najjaran, "Autonomous vehicle perception: The technology of today and tomorrow," Transportation Research Part C: Emerging Technologies, vol. 89, pp. 384–406, 2018.
[4] S. Hasirlioglu, A. Kamann, I. Doric, and T. Brandmeier, "Test methodology for rain influence on automotive surround sensors," IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pp. 2242–2247, 2016.
[5] T. Dörfler, "Testing ADAS sensors in early development phases," ATZelektronik worldwide, vol. 2018, no. 13, p. 54.
[6] P. Viswanath, M. Mody, S. Nagori, J. Jones, and H. Garud, "Virtual simulation platforms for automated driving: Key care-about and usage model," Electronic Imaging, vol. 2018, no. 17, pp. 164-1–164-6, 2018.
[7] D. Gruyer, S. Choi, C. Boussard, and B. d'Andrea-Novel, "From virtual to reality, how to prototype, test and evaluate new ADAS: Application to automatic car parking," in 2014 IEEE Intelligent Vehicles Symposium Proceedings, MI, USA, 2014, pp. 261–267.
[8] W. Huang, K. Wang, Y. Lv, and F. Zhu, "Autonomous vehicles testing methods review," in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Rio de Janeiro, Brazil, 2016, pp. 163–168.
[9] T. Hanke, N. Hirsenkorn, B. Dehlink, A. Rauch, R. Rasshofer, and E. Biebl, "Classification of sensor errors for the statistical simulation of environmental perception in automated driving systems," in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), Rio de Janeiro, Brazil, 2016, pp. 643–648.
[10] R. Schubert, N. Mattern, and R. Bours, "Simulation of sensor models for the evaluation of advanced driver assistance systems," ATZelektronik worldwide, vol. 9, no. 3, pp. 26–29, 2014.
[11] C. van Driesten and T. Schaller, "Overall approach to standardize AD sensor interfaces: Simulation and real vehicle," in Proceedings, Fahrerassistenzsysteme 2018, T. Bertram, Ed., Wiesbaden: Springer Fachmedien Wiesbaden, 2019, pp. 47–55.
[12] C. Goodin, R. Kala, A. Carrrillo, and L. Y. Liu, "Sensor modeling for the virtual autonomous navigation environment," in 2009 IEEE Sensors, Christchurch, New Zealand, Oct. 2009, pp. 1588–1592.
[13] S. Matzka and R. Altendorfer, "A comparison of track-to-track fusion algorithms for automotive sensor fusion," in Lecture Notes in Electrical Engineering, Multisensor Fusion and Integration for Intelligent Systems, H. Hahn, H. Ko, and S. Lee, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 69–81.
[14] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. The MIT Press, 2005.
[15] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software, 1st ed. Wiley-Interscience, 2001.
[16] IPG Automotive GmbH, CarMaker. [Online]. Available: https://ipg-automotive.com/de/produkte-services/simulation-software/carmaker/. Accessed on: Jun. 24 2019.
[17] Baselabs GmbH, Data fusion for automated driving: Data fusion results. [Online]. Available: https://www.baselabs.de/. Accessed on: Mar. 21 2019.
[18] D. Schuhmacher, B.-T. Vo, and B.-N. Vo, "A consistent metric for performance evaluation of multi-object filters," IEEE Trans. Signal Process., vol. 56, no. 8, pp. 3447–3457, 2008.
[19] F. Leneman, D. Verburg, and S. Buijssen, Eds., PreScan, testing and developing active safety applications through simulation, 2008.
Risk-Aware Reasoning for Autonomous Vehicles
Majid Khonji, Jorge Dias, and Lakmal Seneviratne

Abstract— A significant barrier to deploying autonomous dividual events becomes of critical importance as the pub-
vehicles (AVs) on a massive scale is safety assurance. Several lic would require transparency and explainable AI. Recent
technical challenges arise due to the uncertain environment in AV fatal crashes raise further debates among scholars and
which AVs operate such as road and weather conditions, errors
in perception and sensory data, and also model inaccuracy. In pioneers in the industry concerning how an autonomous
this paper, we propose a system architecture for risk-aware vehicle should act when human safety is at risk. On a more
AVs capable of reasoning about uncertainty and deliberately philosophical level, a study [2] sheds light on the major
bounding the risk of collision below a given threshold. We challenges of understanding societal expectations about the
discuss key challenges in the area, highlight recent research principles that should guide the decision making in life-
developments, and propose future research directions in three
subsystems. First, a perception subsystem that detects objects critical situations. As an illustrative example, suppose a self-
within a scene while quantifying the uncertainty that arises driving vehicle, experiencing a partial system failure, is forced
from different sensing and communication modalities. Second, into an ultimatum choice between running over pedestrians
an intention recognition subsystem that predicts the driving- or sacrificing itself and its passenger to save them. What
style and the intention of agent vehicles (and pedestrians). should be the reasoning behind such a situation, and more
Third, a planning subsystem that takes into account the uncer-
tainty, from perception and intention recognition subsystems, fundamentally, what should be the moral choice? Despite
and propagates all the way to control policies that explicitly the profound philosophical dilemma and the impact on the
bound the risk of collision. We believe that such a white-box public perception of AI as a whole and the regulatory aspects
approach is crucial for future adoption of AVs on a large scale. for AVs in particular, the current state-of-the-art of the
technological stack of AVs does not explicitly capture and
I. I NTRODUCTION
propagate uncertainty sufficiently well throughout decision
Over the past hundred years, innovation within the auto- processes in order to accurately assess these edge scenarios.
motive industry has created more efficient, affordable, and In this work, we discuss algorithmic pipeline and a techni-
safer vehicles, but progress has been incremental so far. cal stack for AVs to capture and propagate uncertainty from
The industry now is on the verge of a substantial change the environment throughout perception, prediction, planning,
due to the advancements in Artificial Intelligence (AI) and and control. An AV has to be able to plan and optimize trajec-
Autonomous Vehicle (AV) sensing technologies. These ad- tories from its current location to a goal while avoiding static
vancements offer the possibility of significant benefits to and dynamic (moving) obstacles, while meeting deadlines
society, saving lives, and reducing congestion and pollution. and efficiency constraints. The risk of collision should be
Despite the progress, a significant barrier to large scale bounded by a given safety threshold that meets governmental
deployment is safety assurance. Most technical challenges regulations, while meeting deadlines should meet a quality
are due to the uncertain environment in which AVs operate of service threshold.
such as road and weather conditions, errors in perception To expand AV perception range, we consider the Vehicular
and sensory input data, and uncertainty in the behavior of Ad-Hoc Network (VANET) communication model. Vehicle-
the pedestrians and agent vehicles. A robust AV control to-Vehicle (V2V), Vehicle-to-Infrastructure (V2I), and more
algorithm should account for different sources of uncertainty recently Vehicle-to-Everything (V2X), are technologies that
and generate control policies that are quantifiably safe. In enable vehicles to exchange safety and mobility information
addition, algorithms that respect precise safety measures can between each other and with the surrounding agents, in-
assist policymakers addressing legislative issues related to cluding pedestrians with smart phones and smart wearables.
AVs, such as insurance policies and ultimately convince the Vehicles can collect information en route, such as road
public for a wide deployment of AVs. conditions and position estimates of static and dynamic
One of the most prevalent measures for AV safety is objects, and can use this information to continuously predict
the number of crashes per million miles [1]. Although actions performed by other vehicles and infrastructure. V2V
such a measure provides some estimate on overall safety messages would have a range of approximately 300 meters,
performance in a particular environment, it fails to capture which exceeds the capabilities of systems with cameras,
unique differences and the richness of individual scenarios. ultrasonic sensors, and LIDAR, allowing greater capability
As AVs become more prevalent, the reasoning behind in- and time to warn vehicles.
In this work, we propose a system architecture (Sec. II)
Majid Khonji, Jorge Dias, and Lakmal Seneviratne are with and discuss key challenges in quantifying uncertainty at dif-
KU Center for Autonomous Robotic Systems, Khalifa University,
Abu Dhabi, UAE (email {majid.khonji, jorge.dias, ferent levels of abstractions: scene representation (Sec. III),
lakmal.seneviratne}@ku.ac.ae). intention recognition (Sec. IV), risk-bounded planning
(Sec. V), and control (Sec. VI). We highlight the current state-of-the-art, and propose research directions at each level.

II. SYSTEM ARCHITECTURE

Fig. 1: Risk-aware AV stack. (Block diagram; component labels: Perception/V2X with raw sensing & V2X data; Probabilistic Scene Representation; Intent Recognizer with observed vehicle states and agent-vehicle intentions; High Level Planner with short-term objectives and renegotiated objectives; Short Horizon Planner: Risk-Bounded POMDP; Motion Model Generator with a PFT library of demonstrated trajectories; control actions for the ego-vehicle and its goal.)

In the following, we present the architecture of a risk-aware AV stack with six technical objectives in mind:
• A probabilistic perception and object representation system that takes into consideration uncertainty that arises from hardware modalities and sensor fusion. The system will capture uncertainty in object classification, bounding geometries, and temporal inconsistencies under diverse conditions.
• Leverage the communication network to gain knowledge of the surrounding agents (vehicles and pedestrians) that are beyond line-of-sight, and then improve upon the scene representation.
• An intention recognition system that takes into account all dynamic objects (vehicles and pedestrians), from perception and V2X communication, and estimates a distribution over potential future trajectories.
• Generalize upon recently developed risk-aware optimization algorithms [3], [4], in order to ensure that movements are safe.
• On a higher level, propose goal-directed autonomous planners that strive to meet the passenger goals and preferences, and help the passengers to think through adjustments to their goals when they cannot be safely met.
• To ensure that decisions are made in a timely manner, design polynomial-time approximation algorithms that offer formal bounds on sub-optimality, and which produce near-optimal results.
In addition, by specifying the probability that a plan is executed successfully, the system operator or policymaker can set the desired level of conservatism in the plan in a meaningful manner and can trade conservatism against performance. Fig. 1 shows the interaction between key components of the system, as we illustrate throughout the paper.

III. PROBABILISTIC SCENE REPRESENTATION

Scene understanding is a research topic with strong impact on technologies for autonomous vehicles. Most of the efforts have been concentrated on understanding the scenes surrounding the ego-vehicle (the autonomous vehicle itself). This is composed of a sensor data processing pipeline that includes different stages such as low-level vision tasks, detection, tracking and segmentation of the surrounding traffic environment, e.g., pedestrians, cyclists and vehicles. However, for an autonomous vehicle, these low-level vision tasks are insufficient for comprehensive scene understanding. It is necessary to include reasoning about the past and the present of the scene participants. This paper intends to guide future research on the interpretation of traffic scenes in autonomous driving from a probabilistic event reasoning perspective.

A. Probabilistic Context Layout for Driving

Scene representation includes context representations that capture spatially geometrical relationships [5] among different traffic elements with certain semantic labels. It is different from semantic segmentation frameworks [6], [7], because the context representation does not only contain the static components of the traffic scene (a typical technique for this aspect is simultaneous localization and mapping (SLAM)), such as the road, the type of traffic lanes, traffic direction, and participant orientation, but also consists of several kinds of dynamic elements, e.g., the motion correlation of participants. The studies [8], [9] give a detailed review of semantic segmentation, taking traffic geometry inference into consideration.
A key aspect of context representation is to extract salient features from a large set of sensor data. For that purpose, it is necessary to establish a saliency mechanism, that is, a critical region extraction and information simplification technique that is widely used for attractive region selection in images. Over the past few decades, saliency has generally been formulated in bottom-up and top-down modes. Bottom-up modes [10], [11] are fast, data-driven, pre-attentive and task-independent. Top-down approaches [12], [13], [14], [15] often entail supervised learning with pre-collected task labels from a large set of training examples, and are task-oriented and vary across environments.
A recent work [16] presents a fast algorithm that obtains a probabilistic occupancy model for dynamic obstacles in the scene from a few sparse LIDAR measurements. Typically, the occupancy states exhibit highly nonlinear patterns that cannot be captured with a simple linear classification model. Therefore, deep learning models and kernel-based models can be considered as potential candidates. However, these approaches require either a massive amount of data or a high number of hyper-parameters to tune. A promising future direction is to extend this approach to account for different object classes (rather than an occupancy map) and for other sensors as well, such as cameras.

B. Beyond Line-of-sight

Any sensing modality has blind spots. For objects that lie beyond line-of-sight, one can consider a communication network to improve upon the scene representation. This can
be critical in certain edge scenarios. For example, in Fig. 2, maneuvers comprises a set of collected trajectories. Due
the ego-vehicle (red) has two options: either maintain speed to the uncertainties in the motions of human-driven vehicles,
or overtake the vehicle ahead. Suppose that another agent we learn a compact motion representation called Probabilistic
vehicle is approaching from a distance that is not detected Flow Tube (PFT) [25] from demonstrating trajectories to
by onboard sensors of the ego-vehicle. In this scenario, both capture human-like driver styles and uncertainties for each
the speed and location of the distant vehicle might not be maneuver. A library of pre-learned PFTs can be used to esti-
accurately estimated, and therefore maneuver A2 may lead to a mate the current maneuver as well as predict the probabilistic
collision. motion of each agent vehicle using a Bayesian approach.
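As a concrete illustration of this Bayesian approach, the following Python fragment is a minimal sketch (not the authors' implementation) of recognizing the current maneuver against a library of pre-learned PFTs, under the assumption that each PFT is stored as a per-timestep mean position and covariance:

```python
import numpy as np

def maneuver_posterior(pft_library, observed, prior=None):
    """Posterior P(maneuver | observed prefix) under Gaussian per-timestep PFT models.

    pft_library: dict mapping maneuver name -> (means[T, 2], covs[T, 2, 2]).
    observed: array of shape (t, 2), the trajectory prefix of an agent vehicle.
    """
    names = list(pft_library)
    prior = prior or {m: 1.0 / len(names) for m in names}
    log_post = {}
    for m in names:
        means, covs = pft_library[m]
        log_lik = 0.0
        for t, x in enumerate(observed):            # align prefix with PFT timesteps
            mu, cov = means[t], covs[t]
            d = x - mu
            log_lik += -0.5 * (d @ np.linalg.solve(cov, d)
                               + np.log(np.linalg.det(2.0 * np.pi * cov)))
        log_post[m] = np.log(prior[m]) + log_lik
    z = np.logaddexp.reduce(list(log_post.values()))  # normalize in log space
    return {m: np.exp(v - z) for m, v in log_post.items()}
```

The resulting posterior weights can then be used to mix the PFTs of the competing maneuvers into a probabilistic motion prediction for that agent vehicle.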
V. R ISK - BOUNDED P LANNING
Deterministic optimization approaches have been well de-
veloped and widely used in several disciplines and industries,
in order to optimize processes both off-line and on-line.
In this work, we characterize uncertainty in a probabilistic
manner and find the optimal sequence of ego-vehicle trajec-
Fig. 2: V2V communication.
tory control, subject to the constraint that the probability of
tory control, subject to the constraint that the probability of
There has been substantial progress for the standardization failure must be below a certain threshold. Such constraint
of vehicle-to-everything/V2X (V2V/V2I/V2P) communica- is known as a chance constraint. In many applications, the
tion protocols. The major V2X standards are known as DSRC probabilistic approach to uncertainty modeling has a number
(Dedicated Short-Range Communications) [17] as well as 5G of advantages over a deterministic approach. For instance,
[18]. The introduction of 5G’s millimeter-wave transmissions disturbances such as vehicle wheel slip can be represented
brings a new paradigm to wireless communications. Depend- using a stochastic model. When using a Kalman Filter for
ing on the application, 5G positioning can also enhance enhancing localization, the location estimate is provided as
tracking techniques, which leverage short-term historical a probabilistic distribution. In addition, by specifying the
data (local signatures and key features). Uncertainty can be probability that a plan is executed successfully, the system
captured by probabilistic models (e.g., Gaussian) through operator or policymaker can set the desired level of conser-
sampling temporal inconsistencies in historical data streams vatism in the plan in a meaningful manner and can trade
such as localization data, and parameter tuning. conservatism against performance. Therefore, robustness is
IV. I NTENTION R ECOGNITION achieved by designing solutions that guarantee feasibility
as long as disturbances do not exceed these bounds. Fur-
This subsystem involves prediction and machine learning thermore, if the passenger goals cannot be safely achieved,
tasks to reliably estimate the future trajectories of uncontrol- then the chance constraints can be analyzed to pinpoint the
lable agents in the scene, including pedestrians and other sources of risk, and the user goals can be adjusted, based on
agent vehicles. Many existing trajectory prediction algo- their preferences, in order to restore safety.
rithms [19], [20] obtain deterministic results quite efficiently. Reasoning under uncertainty has several challenges. The
However, these approaches fail to capture the uncertain optimization problem of trajectory optimization is non-
nature of human actions. Probabilistic predictions are bene- convex, due to discrete choices and the presence of obstacles
ficial in many safety-critical tasks such as collision checking in the feasible space. One approach to tackle the challenges
and risk-aware motion planning. They can express both is by introducing multiple layers of abstractions. Instead of
the intrinsically uncertain prediction task at hand (human solving high-level problems (e.g., route planning) and low-
nature) and reasoning about the limitations of the prediction level problems (e.g., steering wheel angle, acceleration, and
method (knowing when an estimate could be wrong [21]). To brake commands) in a single shot, one can decouple them
incorporate uncertainties into prediction results, data-driven into sub-problems. We achieve such hierarchy through a
approaches can learn common characteristics from datasets high-level planner, short-horizon planner, and precomputed
of demonstrated trajectories [22], [23]. These methods often and learned maneuver trajectories as we illustrate below.
express uni-modal predictions, which may not perform well
in sophisticated urban scenarios where the driver can choose A. High Level Planner
among multiple actions. A recent work [24] presents a hybrid High-level planning involves route planning, applying traf-
approach using a variational neural network that predicts fic rules, and consequently setting short-term objectives (aka
future driver trajectory distributions for the ego-vehicle based set points), which will be fed into Short Horizon Planner
on multiple sensors in urban scenarios. The work can be (as shown in Fig. 1). The planner adjusts those short-term
extended in future to predict trajectories for agent-vehicles objectives when no safe solution exists. To be able to model
using V2V data streams, if available. the feasibility of an obtained plan, we leverage Temporal
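To make the contrast with uni-modal predictors concrete, a multi-modal prediction can be represented as a Gaussian mixture over future positions (in the spirit of [23]) and sampled for downstream risk evaluation. The sketch below is illustrative only; the weights, means and covariances are assumed to come from whichever learned predictor is used.

```python
import numpy as np

def sample_future_positions(weights, means, covs, n_samples=100, rng=None):
    """Draw samples from a Gaussian-mixture prediction of an agent's future position.

    weights: (K,) mixture weights, means: (K, 2) positions, covs: (K, 2, 2).
    """
    rng = rng or np.random.default_rng()
    comps = rng.choice(len(weights), size=n_samples, p=weights)
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in comps])

# Two hypotheses one second ahead: keep lane (70%) vs. merge left (30%).
w = np.array([0.7, 0.3])
mu = np.array([[10.0, 0.0], [9.0, 3.5]])
cov = np.array([np.diag([0.5, 0.2]), np.diag([0.8, 0.6])])
samples = sample_future_positions(w, mu, cov)
```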
We propose a simple intent recognition that is divided into Plan Networks (TPN) [26]. A TPN is a graph where the
two steps. First we continuously record high-level maneuvers nodes represent events, and the edges represent activities. In
of surrounding vehicles (both off-line and online). Examples temporal planning, the ego-vehicle is presented with a series
of such maneuvers are merge left, merge right, accelerate all of events and must decide precisely when to schedule them.
at different velocities and variations and so on. Each of these STNs with Uncertainty (STNUs) is an extension allowing
to reason over stochastic, or uncontrollable, actions and policies. It also uses a risk heuristic to prune the search
their corresponding durations [27]. Such formalism allows to space, removing high-risk branches that violate the chance
check the feasibility of a high-level plan and prompt the user constraints. Hence, at each level, the action that maximizes
to adjust his or her intermediate goals and time constraints to expected reward and meets chance constrained is selected
output smooth intermediate plans, fed into the short horizon for the vehicle. However, one of the drawbacks of RAO* is
planner. that it does not always return optimal solutions and also does
B. Short Horizon Planner not provide any bound on the sub-optimality gap. In a recent
work [3], we provide an algorithm that provides guarantee on
Planning under uncertainty is a fundamental area in ar- optimality (namely, a fully polynomial time approximation
tificial intelligence. For the application of AV, it is crucial scheme (FPTAS)) while preserving safety constraints, all
to plan for potential contingencies instead of planning a within polynomial running time.
single trajectory into the future. This often occurs in dynamic Recently [31] applied RAO* for the application of self-
environments where the vehicle has to react quickly (in driving vehicles under restricted settings (e.g., known dis-
milliseconds) to any potential event. Partially observable tribution of actions taken by agent-vehicles). CC-POMDP,
Markov decision processes (POMDP)[28], [29] provide a while otherwise expressive, allow only for sequential, non-
model for optimal planning under actuator and sensor uncer- durative actions. This poses restrictions in modeling real-
tainty, where the goal is to find policies (contingency plans) world planning problems. In our recent ongoing work, we
that maximize (or minimize) some measure of expected extend the framework of CC-POMDP to account for durative
utility (or cost). actions, and leverage heuristic forward search to prune the
In many real-world applications, a single measure of search space to improve upon the running time.
performance is not sufficient to capture all requirements (e.g., VI. M OTION M ODEL G ENERATOR
an AV tasked to minimize commute time while keeping
Based on each driving scenario, we compute a library
the distance from obstacle below a given threshold). This
of maneuvers. Each maneuver is associated with nominal
extension is often called constrained POMDP (C-POMDP)
control signals by solving a model predictive control (MPC)
[30]. When constraints involve stochasticity (e.g., distance
optimization problem [31]. The set of possible maneuver
following a probabilistic model), the problem is modeled
actions are constrained by traffic rules and vehicle dynamics
as chance-constrained POMDP (CC-POMDP) [4], where we
and are informed by the expected evolution of the situation.
have a bound on the probability of violating constraints. To
Computing the actions can be accomplished through offline
calculate the risk of each decision, one can leverage the
and online computation, and also through publicly available
probabilistic flow-tube (PFTs) concept to model a set of
datasets (e.g., Berkeley DeepDrive BDD100k).
possible trajectories [25]. The current state-of-the-art solver The size of the search space of CC-POMDP, described
of CC-POMDP is called RAO* [4]. RAO* generates a above, is sensitive to the number of maneuver actions.
conditional plan based on action and risk models and likely To tackle this issue, we consider three different levels for
possible scenarios for agent vehicles. abstractions. i) Micro Actions are primitive actions like
Accelerate, Decelerate, Maintain. ii) Maneuver Actions are
sequences of micro actions like Merge left, Merge right, iii)
Macro Actions are sequences of maneuver actions such as
pass the front vehicle, go straight until next intersection [32].
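The three abstraction levels can be pictured as a plain composition of action sequences; the following sketch is purely illustrative, with hypothetical names and without durations or parameters.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MicroAction:                      # primitive action, e.g. Accelerate
    name: str

@dataclass
class ManeuverAction:                   # sequence of micro actions, e.g. MergeLeft
    name: str
    micro: List[MicroAction] = field(default_factory=list)

@dataclass
class MacroAction:                      # sequence of maneuvers, e.g. PassFrontVehicle
    name: str
    maneuvers: List[ManeuverAction] = field(default_factory=list)

merge_left = ManeuverAction("MergeLeft", [MicroAction("Accelerate"), MicroAction("Maintain")])
pass_front = MacroAction("PassFrontVehicle", [merge_left, ManeuverAction("MergeRight")])
```

Planning over macro actions keeps the branching factor of the search small, which is the motivation given above for the three levels.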
To calculate the risk of collision, we leverage PFT, which
represents a sequence of probabilistic reachable sets. PFTs
show probabilistic future predictions for states of the vehicles
under a selected action. In this context, the intersection
between two, temporally aligned, PFT trajectories represents
the risk of collision. To construct PFTs, we use vehicle dy-
namics and also probabilistic information about uncertainties,
as well as through learning from datasets. By propagating the probability distributions of uncertainties through the continuous dynamics of the vehicle, we construct probability distributions for the locations of the vehicle over a finite planning horizon.
Fig. 3: CC-POMDP hypergraph: nodes are the probability distributions of states (belief states) of the ego vehicle. At each node, there are n possible actions that can be taken by the ego vehicle. At each level, the belief state is updated with respect to the chosen action and the observations of the environment.
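A minimal numerical sketch of that construction and of the chance-constraint check built on it: Gaussian state uncertainty is propagated through linearized discrete-time dynamics to form a flow tube, and the collision risk between two temporally aligned tubes is estimated by Monte Carlo sampling and compared against the risk bound. The dynamics matrix, noise levels, distance threshold and risk bound below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def propagate_pft(x0, P0, A, Q, horizon):
    """Propagate mean/covariance through linear(ized) dynamics x_{t+1} = A x_t + w."""
    means, covs = [x0], [P0]
    for _ in range(horizon):
        means.append(A @ means[-1])
        covs.append(A @ covs[-1] @ A.T + Q)
    return np.array(means), np.array(covs)

def collision_risk(pft_a, pft_b, radius=2.0, n_samples=2000, rng=None):
    """Monte Carlo estimate of P(min distance < radius) for two temporally aligned PFTs."""
    rng = rng or np.random.default_rng(0)
    (ma, Pa), (mb, Pb) = pft_a, pft_b
    hits = 0
    for _ in range(n_samples):
        xa = np.array([rng.multivariate_normal(m[:2], P[:2, :2]) for m, P in zip(ma, Pa)])
        xb = np.array([rng.multivariate_normal(m[:2], P[:2, :2]) for m, P in zip(mb, Pb)])
        if np.min(np.linalg.norm(xa - xb, axis=1)) < radius:
            hits += 1
    return hits / n_samples

# Chance-constraint check for one candidate ego maneuver against one agent PFT.
A = np.array([[1, 0, 0.1, 0], [0, 1, 0, 0.1], [0, 0, 1, 0], [0, 0, 0, 1]])  # state: x, y, vx, vy
Q = 0.01 * np.eye(4)
ego = propagate_pft(np.array([0.0, 0.0, 10.0, 0.0]), 0.1 * np.eye(4), A, Q, horizon=30)
agent = propagate_pft(np.array([40.0, 0.0, -8.0, 0.0]), 0.2 * np.eye(4), A, Q, horizon=30)
DELTA = 0.01                          # risk bound set by the policymaker
safe = collision_risk(ego, agent) <= DELTA   # maneuver accepted only if the risk is bounded
```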
RAO* explores from a probability distribution of vehicle VII. C ONCLUSION
states (belief state), by incrementally constructing a hyper- In this work, we proposed a system architecture for risk-
graph, called the explicit hyper-graph shown in Fig. 3. At aware AVs that can deliberately bound the risk of collision
each node of the hyper-graph, the planner considers possible below a given threshold, defined by the policymaker. We
actions provided by Motion Model Generator (see Fig. 1) presented the related work, discussed key challenges, and
and receives several possible observations. At each level, it proposed research directions in three key subsystems: per-
utilizes a value heuristic to guide the search towards optimal ception, intention recognition, and risk-aware planning. We
believe that our white-box approach is crucial for a better [20] H. Woo, Y. Ji, H. Kono, Y. Tamura, Y. Kuroda, T. Sugano, Y. Ya-
understanding of AV decision making and ultimately for mamoto, A. Yamashita, and H. Asama, “Lane-change detection based
on vehicle-trajectory prediction,” IEEE Robotics and Automation Let-
future adoption of AVs on a large scale. ters, vol. 2, no. 2, pp. 1109–1116, 2017.
[21] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian
deep learning for computer vision?” in Advances in neural information
R EFERENCES processing systems, 2017, pp. 5574–5584.
[22] D. Vasquez, T. Fraichard, and C. Laugier, “Growing hidden markov
[1] N. Kalra and S. M. Paddock, “Driving to safety: How many miles of models: An incremental tool for learning and predicting human and
driving would it take to demonstrate autonomous vehicle reliability?” vehicle motion,” The International Journal of Robotics Research,
Transportation Research Part A: Policy and Practice, vol. 94, pp. 182– vol. 28, no. 11-12, pp. 1486–1506, 2009.
193, 2016. [23] J. Wiest, M. Höffken, U. Kreßel, and K. Dietmayer, “Probabilistic
[2] J.-F. Bonnefon, A. Shariff, and I. Rahwan, “The social dilemma of trajectory prediction with gaussian mixture models,” in 2012 IEEE
autonomous vehicles,” Science, vol. 352, no. 6293, pp. 1573–1576, Intelligent Vehicles Symposium. IEEE, 2012, pp. 141–146.
2016. [24] X. Huang, S. McGill, B. C. Williams, L. Fletcher, and G. Rosman,
“Uncertainty-aware driver trajectory prediction at urban intersections,”
[3] M. Khonji, A. Jasour, and B. Williams, “Approximability of
arXiv preprint arXiv:1901.05105, 2019.
constant-horizon constrained pomdp,” in Proceedings of the Twenty-
[25] S. Dong and B. Williams, “Learning and recognition of hybrid
Eighth International Joint Conference on Artificial Intelligence,
manipulation motions in variable environments using probabilistic flow
IJCAI-19. International Joint Conferences on Artificial Intelligence
tubes,” International Journal of Social Robotics, vol. 4, no. 4, pp. 357–
Organization, 7 2019, pp. 5583–5590. [Online]. Available: https:
368, 2012.
//doi.org/10.24963/ijcai.2019/775
[26] A. G. Hofmann and B. C. Williams, “Temporally and spatially flexible
[4] P. Santana, S. Thiébaux, and B. Williams, “Rao*: an algorithm for plan execution for dynamic hybrid systems,” Artificial Intelligence,
chance constrained pomdps,” in Proc. AAAI Conference on Artificial vol. 247, pp. 266–294, 2017.
Intelligence, 2016. [27] N. Bhargava and B. C. Williams, “Faster dynamic controllability
[5] C. Landsiedel and D. Wollherr, “Road geometry estimation for urban checking in temporal networks with integer bounds,” in Proceedings
semantic maps using open data,” Advanced Robotics, vol. 31, no. 5, of the Twenty-Eighth International Joint Conference on Artificial
pp. 282–290, 2017. Intelligence, IJCAI-19. International Joint Conferences on Artificial
[6] E. Levinkov and M. Fritz, “Sequential bayesian model update under Intelligence Organization, 7 2019, pp. 5509–5515. [Online]. Available:
structured scene prior for semantic road scenes labeling,” in Proceed- https://doi.org/10.24963/ijcai.2019/765
ings of the IEEE International Conference on Computer Vision, 2013, [28] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and
pp. 1321–1328. acting in partially observable stochastic domains,” Artificial intelli-
[7] Z. Zhang, S. Fidler, and R. Urtasun, “Instance-level segmentation gence, vol. 101, no. 1-2, pp. 99–134, 1998.
for autonomous driving with deep densely connected mrfs,” in Pro- [29] E. J. Sondik, “The optimal control of partially observable markov
ceedings of the IEEE Conference on Computer Vision and Pattern decision processes.” PhD the sis, Stanford University, 1971.
Recognition, 2016, pp. 669–677. [30] P. Poupart, A. Malhotra, P. Pei, K.-E. Kim, B. Goh, and M. Bowling,
[8] J. Janai, F. Güney, A. Behl, and A. Geiger, “Computer vision for “Approximate linear programming for constrained partially observable
autonomous vehicles: Problems, datasets and state-of-the-art,” arXiv markov decision processes,” in Twenty-Ninth AAAI Conference on
preprint arXiv:1704.05519, 2017. Artificial Intelligence, 2015.
[9] B. Zhao, J. Feng, X. Wu, and S. Yan, “A survey on deep learning- [31] X. Huang, A. Jasour, M. Deyo, A. Hofmann, and B. C. Williams, “Hy-
based fine-grained object classification and semantic segmentation,” brid risk-aware conditional planning with applications in autonomous
International Journal of Automation and Computing, vol. 14, no. 2, vehicles,” in 2018 IEEE Conference on Decision and Control (CDC).
pp. 119–135, 2017. IEEE, 2018, pp. 3608–3614.
[10] J. Zhang and S. Sclaroff, “Exploiting surroundedness for saliency [32] S. Omidshafiei, A.-A. Agha-Mohammadi, C. Amato, and J. P. How,
detection: a boolean map approach,” IEEE transactions on pattern “Decentralized control of partially observable markov decision pro-
analysis and machine intelligence, vol. 38, no. 5, pp. 889–902, 2015. cesses using belief space macro-actions,” in 2015 IEEE International
[11] Q. Wang, Y. Yuan, P. Yan, and X. Li, “Saliency detection by multiple- Conference on Robotics and Automation (ICRA). IEEE, 2015, pp.
instance learning,” IEEE transactions on cybernetics, vol. 43, no. 2, 5962–5969.
pp. 660–672, 2013.
[12] S. He, R. W. Lau, and Q. Yang, “Exemplar-driven top-down saliency
detection via deep association,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2016, pp. 5723–5732.
[13] J. Yang and M.-H. Yang, “Top-down visual saliency via joint crf
and dictionary learning,” IEEE transactions on pattern analysis and
machine intelligence, vol. 39, no. 3, pp. 576–588, 2016.
[14] J. Pan, E. Sayrol, X. Giro-i Nieto, K. McGuinness, and N. E.
O’Connor, “Shallow and deep convolutional networks for saliency
prediction,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2016, pp. 598–606.
[15] Y. Xia, D. Zhang, A. Pozdnoukhov, K. Nakayama, K. Zipser, and
D. Whitney, “Training a network to attend like human drivers saves
it from common but misleading loss functions,” arXiv preprint
arXiv:1711.06406, 2017.
[16] R. Senanayake, A. Tompkins, and F. Ramos, “Automorphing kernels
for nonstationarity in mapping unstructured environments.” in CoRL,
2018, pp. 443–455.
[17] H. Hartenstein and K. Laberteaux, VANET: vehicular applications and
inter-networking technologies. Wiley Online Library, 2010, vol. 1.
[18] J. Gante, G. Falcão, and L. Sousa, “Deep learning architectures for accu-
rate millimeter wave positioning in 5g,” Neural Process Letters
https://doi.org/10.1007/s11063-019-10073, pp. 1–28, 2019.
[19] A. Houenou, P. Bonnifait, V. Cherfaoui, and W. Yao, “Vehicle trajec-
tory prediction based on motion model and maneuver recognition,”
in 2013 IEEE/RSJ international conference on intelligent robots and
systems. IEEE, 2013, pp. 4363–4369.
Cognitively-inspired episodic imagination for self-driving vehicles
Sara Mahmoud1 , Henrik Svensson1 and Serge Thill2,1

Abstract— The controller of an autonomous vehicle needs Another instance of episodic simulations is found in
the ability to learn how to act in different driving scenarios dreams, during which the brain is more or less cut off from
that it may face. A significant challenge is that it is difficult, sensory input and motor output. Although the function of
dangerous, or even impossible to experience and explore various
actions in situations that might be encountered in the real dreams remains heavily debated, some theories suggest that
world. Autonomous vehicle control would therefore benefit they might help to prepare agents for action and can improve
from a mechanism that allows the safe exploration of action performance in the wake state. For example, Revonsuo [7]
possibilities and their consequences, as well as the ability to suggests the “Threat Simulation Theory”, according to which
learn from experience thus gained to improve driving skills. a major function of dreams is to rehearse possibly threat-
In this paper we demonstrate a methodology that allows a
learning agent to create simulations of possible situations. These ening situations. Others have hypothesized that rudimentary
simulations can be chained together in a sequence that allows mental simulations during early childhood interact with wake
the progressive improvement of the agent’s performance such behavior to facilitate the formation of more mature mental
that the agent is able to appropriately deal with novel situations simulations during development [8], [9]. It follows from this
at the end of training. This methodology takes inspiration that the importance lies not only in some general reactivation
from the human ability to imagine hypothetical situations using
episodic simulation; we therefore refer to this methodology as of previous sensorimotor activity, but also in the content
episodic imagination. since this might influence the usefulness of the simulation
An interesting question in this respect is what effect the for future behaviors.
structuring of such a sequence of episodic imaginations has
on performance. Here, we compare a random process to a B. Implementations of simulations in artificial agents
structured one and initial results indicate that a structured
sequence outperforms a random one. Simulation theories of various kinds have previously also
been implemented in various artificial agents to investigate
I. INTRODUCTION how such an ability affects behavior and can improve perfor-
A. Simulation abilities in humans mance [10], [11], [12]. For example, an early approach was
adopted by Mel [13] who created a robot arm that by means
The ability to internally simulate what has or will happen
of forward models could plan its movements by “imagining”
in past and future situations provides agents with increased
its future movements. Many other approaches have since
flexibility when interacting with the world. In humans, these
then utilized the ability for internal re-creation of sensory
mental simulations occur in many forms, ranging from low-
and motor states to assist in various tasks (see e.g. [12],
level embodied simulations to higher level episodic simula-
[14], [8]). Many of the previous attempts of implementing
tion [1]. These can briefly be described as follows:
mental simulations in robots have been rather simplistic and
Embodied simulations, in which the sensorimotor systems
have been more related to embodied simulations rather than
of the brain are extensively reactivated in similar ways as
episodic simulation due to the nature of the mechanisms
during overt interaction with the world have been shown
used.
to improve subsequent motor performance in, for example,
The hallmark of episodic simulations is increased flex-
path navigation [2], sports activities [3], and rehabilitation
ibility and diversity with regards to the content of the
[4]. Thus, embodied simulations seems to facilitate learning
simulations not being dictated to the same degree by the
despite absence of direct feedback from the environment.
physical constraints of body and environment as would be
Episodic simulations, on the other hand, refer to simulations
the case in embodied simulations [1]. Implementing a more
concerning more abstract aspects of interactions not directly
strongly biologically inspired mechanism [8], [7] would
affecting motor performance, but rather being more flexible
require such simulations to be more flexible with respect
and diverse in terms of the content of the simulations
to their content. Deep neural networks [15] and Generative
and influencing action selection on a higher level, such
Adversarial Networks (GAN) [16], for example, may provide
as contemplating different places for the next vacation or
viable approaches for implementing episodic like simulations
preparing your arguments in the next salary negotiation, or
in artificial agents. GANs, for example, provide the required
imagining where you’ll be in 10 years [5], [6].
flexibility because they are able to create previously unseen
Funding: the authors acknowledge financial support from the EC H2020 data in a useful way. As such, GANs have been used for
research project Dreams4Cars (no. 731593) imagination to generate video scenes similar to collected
1 Interaction Lab, University of Skövde, 54128 Skövde, Sweden
real world video data, which subsequently was used to run
[email protected], [email protected]
2 Donders Institute for Brain, Cognition, and Behaviour, Radboud Univer- a reinforcement-learning based driving agent in the gener-
sity, 6525 HR Nijmegen, Netherlands [email protected] ated scenes [17]. Initial results showed that a trained GAN
generated simulated images very close to the real data. Thus,
rather than recreating very simple sensor data these networks
are able to create more episodic-like images [18]. In other work,
using a setup with a variational auto-encoder combined with
a recurrent neural network, Ha and Schmidhuber [19] also
showed promising results of using episodic like simulations
(or dreams/hallucinations in their terms) in the OpenAI gym [20] and VizDoom [21] environments.
Fig. 1: Imagination types used for training the driving agent: no imagination, stochastic structure and systematic structure.
While previous work has put much effort into image
generation mechanisms, it is not clear what variables affect
the learning process when learning and generating behaviors
are instead based on episodic simulations. In particular, it
is an open question whether the structure of the content of
episodic simulations affects the learning performance. This
may be critical for autonomous driving [22]. For example,
one could vary the number of vehicles encountered when
learning to overtake in such simulations, but is it enough,
as has been done in previous work, to merely randomly
hallucinate different overtaking scenarios [19], [8] until performance converges to a satisfactory level, or should there be some guiding structure to the process?
Fig. 2: Episodic generator system architecture for self-driving car on OpenDS simulation.
In the remainder of this paper, we investigate this using
a lane-keeping task for an autonomous vehicle. Since the A. System Architecture
focus is not on the image generation process per se, but
In a nutshell, the system architecture consists of four
on how the content of episodic simulations interacts with
main components (see Figure 2): (1) OpenDS, the physical
the learning process, we here use the rendered simulation
simulation, in which the training and testing driving is
in a driving simulator directly as a model for embodied
executed (Figure 3), (2) the learning agent that is trained in
and episodic simulation. The simulation consists of both
different imagination conditions, (3) a middleware connector
embodied aspects, such as the the physics model of the
that converts the simulation into a RL environment and (4)
vehicle, and episodic aspects, such as the type of road
the road generator that describes the road specifications used
environment. However, since the study only manipulates the
in the simulations.
road environment, we use the term episodic imagination for
the test conditions in the study. This allows us to create a tool We use the middleware connector to calculate the reward
that is able to flexibly create new episodic-like simulations function (optimized for a lane keeping task) at each step (see
and focus on the question of how their content may affect the Eqns. 1–3). The function depends on the lateral distance from
learning and subsequent performance. It should also be noted the left lane margin of the road (Eqn. 1), and the car heading
that the imagination mechanism proposed here differs from angle between the lane and the car (Eqn. 2), as shown in
the common approach of manually designing the simulations Figure 3b.
– here, these are automatically generated by the proposed
system architecture. The work here thus also contributes to the development of more effective means of learning from imagination by developing an automatic imagination mechanism.

r_e = min(d_l, w − d_l)        (1)
r_h = 2 · e^(−15 · |l_h|)      (2)
r_t = r_e + r_h                (3)
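A direct Python transcription of Eqns. 1–3 (the symbols are defined in Sec. II-A; the function and variable names are ours):

```python
import math

def lane_keeping_reward(d_l, w, l_h):
    """Total reward r_t = r_e + r_h (Eqns. 1-3).

    d_l: distance of the car from the left lane edge [m]
    w:   lane width [m]
    l_h: angle between the car heading and the road heading [rad]
    """
    r_e = min(d_l, w - d_l)                 # Eqn. 1: distance-to-edge term
    r_h = 2.0 * math.exp(-15.0 * abs(l_h))  # Eqn. 2: heading-alignment term
    return r_e + r_h                        # Eqn. 3

# For the 6 m lane used here, the maximum of 5 (3 + 2) is reached when the
# car is centred in the lane and aligned with the road direction.
assert abs(lane_keeping_reward(3.0, 6.0, 0.0) - 5.0) < 1e-9
```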
The remainder of the paper is structured as follows:
Section II describes the research method. Section III presents
the results and Section IV concludes the paper.

II. METHODS

The aim is to evaluate how the structure of the episodic


imagination affects the learning performance. We achieve
this by training the same Deep Q-network on a lane-keeping
task in three different imagination conditions: no imagina-
tion, stochastic imagination and systematic imagination, as
shown in Figure 1. In the following subsections, the task and
conditions are described in more detail.
Fig. 3: OpenDS road environment. (a) 3D rendered road. (b) Lane keeping parameters.
Here, r_e is the reward for the distance from the side of the road, d_l is the distance of the car from the left edge of the road in meters, w is the width of the lane, r_h is the reward for the car heading, l_h is the angle between the car heading and the road heading in radians, and r_t is the total reward. In plain terms, the function returns the highest reward when the car is at the middle of the lane and aligned with the road direction.
Fig. 4: Road samples of the three imagination conditions. (a) The same road is repeated for all episodes 100 successful times in the control condition of no imagination. (b) Sequence of random difficulty of the generated roads for the stochastic imagination condition. (c) Gradual increase in the road difficulty for the systematic imagination condition.
The Road Generator is a python script that automatically creates the road scenarios based on the defined features. The generator is primarily used for generating episodic imagination. The generator describes the road features and then sends the description to OpenDS, which constructs the described road (Figure 3a). Across all imagination conditions, roads are single lane with a width of six meters. Since the task is
lane keeping, the main factors for a road generation are the
ratio of straight to curved segments and the geometries of
the curves. All roads used in this paper have an approximate 3) Systematic Imagination: Systematic Episodic Imagi-
length of 500m. nation differs from the previous one in that the difficulty
The driving agent in this study is in the form of deep of the roads, quantified based on the number of roads and
neural networks using Deep-Q-Learning. During training, the their curvature, increases during training, some examples are
driving agent receives a representation of the road through shown in Figure 4c with respect to the training order:
the Middleware Connector which is the RL state of the the first road consists of 99% straight segments, which then
driving agent in the simulation environment. The driving gradually drops until the ratio reaches the previously used [40
agent then selects the action from set of available actions curves : 60 straight] after 40 roads. Curvature limits were set
(turning the steering wheel 0.05 rad/s to the left, right, or to (-0.007 to 0.007 m−1 ) for the first 40 roads, increasing to
maintaining its current position). The agent receives a reward (-0.01 to 0.01 m−1 ) until road 80, and settling at (-0.015 to
that represents how good the chosen action is in the given 0.015 m−1 ) for the last 20 roads.
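The schedule just described can be generated by a small routine like the one below. It follows the numbers given in the text (straight ratio decaying from 99% to the 40:60 mix over the first 40 roads; curvature caps of 0.007, 0.01 and 0.015 m^-1), but the linear interpolation and the field names are our assumptions.

```python
def systematic_road_specs(n_roads=100, length_m=500.0, lane_width_m=6.0):
    """Road specifications of increasing difficulty (systematic imagination)."""
    specs = []
    for i in range(n_roads):
        if i < 40:
            # straight ratio decays from 99% to 60% over the first 40 roads
            straight_ratio = 0.99 - (0.99 - 0.60) * i / 39
            max_curvature = 0.007
        elif i < 80:
            straight_ratio, max_curvature = 0.60, 0.010
        else:
            straight_ratio, max_curvature = 0.60, 0.015
        specs.append({
            "length_m": length_m,
            "lane_width_m": lane_width_m,
            "straight_ratio": straight_ratio,
            "curvature_range": (-max_curvature, max_curvature),
        })
    return specs
```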
state. The agent updates the network weights based on the
obtained experience and continues with this process until it III. RESULTS
reaches a terminal state, which is arrived at either when the To measure the effectiveness of the imagination ap-
agent successfully reaches the end of the road, or when it proaches, the three trained driving agents were tested on
leaves the road prematurely. 100 new roads that they had not been trained on. The
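The interaction just described (state from the Middleware Connector, three steering actions, reward from Eqns. 1–3, episode ending at the end of the road or when leaving it) can be summarised by a short episode loop. Here env and agent are hypothetical stand-ins for the OpenDS/Middleware interface and the Deep Q-network, not the paper's actual class names.

```python
STEER_LEFT, STEER_RIGHT, MAINTAIN = 0, 1, 2    # 0.05 rad/s left, right, or keep

def run_training_episode(env, agent, max_steps=1700):
    """One training episode on the current road; returns the total reward."""
    state, total = env.reset(), 0.0
    for _ in range(max_steps):
        action = agent.select_action(state)           # epsilon-greedy over 3 actions
        next_state, reward, done = env.step(action)   # reward from Eqns. 1-3
        agent.store(state, action, reward, next_state, done)
        agent.update()                                # one DQN gradient step
        state, total = next_state, total + reward
        if done:                                      # reached the end or left the road
            break
    return total
```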
measurement is the mean total rewards that the agent obtains
B. Road Generation Setup from the testing roads. The theoretical maximum mean total
Different types of imagination are used by the agents in reward is 8500 (if the agent scores reward of five at each of
the various experimental conditions, as shown in Figure 1. the 1700 steps per episode for the 100 episodes).
In detail, they are implemented as follows. In the No Imagination condition, which functions as a
1) No imagination: A single road is generated, randomly control in our setup, the agent completed all the 100 testing
selecting for the number of curves, curvature values and roads in openDS successfully. The condition resulted in a
length of components such that the road provides a reason- mean total reward of 6750 (see Fig. 5) which is 79% of the
able variation of the features (so as to not make the training theoretical maximum.
fundamentally impossible), as shown in Figure 4a. Training The Stochastic Imagination condition performed the worst
completes when the agent successfully completes this road compared to the two other conditions and failed to finish
100 times. some testing roads. The mean total rewards for the 100 roads
2) Stochastic Imagination: In this experiment setup the were 5800 (68%).
road generator creates 100 different roads before starting the The agent in this setup with Systematic Imagination
training phase, some examples are shown in Figure 4b in performed the highest among the other experiments with
respect to the training order. The parameters determining the an mean total rewards of over 7000 (82%). This shows a
complexity of the road are stochastically assigned based on significant improvement (t-test p < 0.01) from the controlled
a ratio and within ranges. condition.
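The testing protocol used for these results (mean total reward over 100 unseen roads, expressed as a share of the 8500 theoretical maximum, and a t-test against the no-imagination control) could be computed as in this sketch, assuming one total reward per test road; SciPy's two-sample t-test stands in for whatever test the authors used.

```python
import numpy as np
from scipy import stats

def compare_to_control(condition_rewards, control_rewards, max_reward=8500.0):
    """Mean total reward (absolute and as % of the maximum) plus a t-test vs. control."""
    cond = np.asarray(condition_rewards, dtype=float)
    ctrl = np.asarray(control_rewards, dtype=float)
    t, p = stats.ttest_ind(cond, ctrl)     # independent two-sample t-test
    return {"mean": cond.mean(),
            "percent_of_max": 100.0 * cond.mean() / max_reward,
            "t": t, "p": p}
```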
During training, the agent iterates to the next road upon
successful completion of the current one, and finishes when IV. CONCLUSION
the agent has successfully trained on all 100 roads. As shown In this paper, we demonstrated how to generate imagi-
in Figure 4b, the learning agent may start with learning a very nation road scenarios for training a self-driving vehicle us-
curvy road and then after successfully learning this road, the ing physics simulation. The imagination generator generates
agent moves to an easier road with slight curves. sequence of episodes with different road features that the
[5] R. L. Buckner and D. C. Carroll, “Self-
projection and the brain,” Trends in cognitive sciences,
vol. 11, no. 2, pp. 49–57, 2007. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S1364661306003275
[6] K. K. Szpunar, R. N. Spreng, and D. L. Schacter, “A taxonomy
of prospection: Introducing an organizational framework for future-
oriented cognition,” Proceedings of the National Academy of
Sciences, vol. 111, no. 52, pp. 18 414–18 421, 2014. [Online].
Available: http://www.pnas.org/content/111/52/18414.short
[7] A. Revonsuo, “The reinterpretation of dreams: An evolutionary hy-
pothesis of the function of dreaming,” Behavioral and Brain Sciences,
vol. 23, no. 6, pp. 877–901, 2000.
[8] H. Svensson, S. Thill, and T. Ziemke, “Dreaming of electric sheep?
exploring the functions of dream-like mechanisms in the development
of mental imagery simulations,” Adaptive Behavior, vol. 21, no. 4, pp.
222–238, 2013.
Fig. 5: Mean total rewards at testing phase for the three experimental conditions. Error bars indicate 95% confidence intervals.
[9] S. Thill and H. Svensson, “The inception of simulation: a hypothesis for the role of dreams in young children,” in Proceedings of the 33rd Annual Conference of the Cognitive Science Society. Cognitive
Science Society, Inc., 2011, pp. 231–236.
[10] T. Ziemke, D.-A. Jirenhed, and G. Hesslow, “Internal simulation of
perception: a minimal neuro-robotic model,” Neurocomputing, vol. 68,
driving agent needs to learn. This work doesn’t focus on pp. 85–104, Oct. 2005.
[11] H. Hoffmann and R. Mller, “Action selection and mental transforma-
the agent’s optimization but on the scenario generation as tion based on a chain of forward models,” From Animals to Animats,
a learning environment. The paper presents two ways of vol. 8, pp. 213–222, 2004.
generating the sequence of episodes as either stochastic or [12] E. Billing, H. Svensson, R. Lowe, and Z. Tom, “Finding Your
Way from the Bed to the Kitchen: Re-enacting and Re-combining
systematic and compares the learning performance for each. Sensorimotor Episodes Learned from Human Demonstration,”
A controlled condition is no imagination which means vari- Frontiers in Robotics and AI, vol. 3, no. 9, 2016. [Online]. Available:
ous of features are collected in a single road and the training http://www.diva-portal.org/smash/record.jsf?pid=diva2:915452
[13] B. W. Mel, “A connectionist model may shed light on neural mecha-
is conducted on this road. The results showed that even for nisms for visually guided reaching,” Journal of cognitive neuroscience,
a relatively simple task, the structure of the imagination has vol. 3, no. 3, pp. 273–292, 1991.
an impact on learning performance. These results are also in [14] Y. Demiris and B. Khadhouri, “Hierarchical attentive multiple models
for execution and recognition of actions,” Robotics and autonomous
line with theories of human episodic simulation, in particular systems, vol. 54, no. 5, pp. 361–369, 2006.
the observation that human dreams increase in complexity [15] J. Schmidhuber, “Deep learning in neural networks: An overview,”
during development [9], suggesting that there is a benefit to vol. 61, pp. 85–117, 2015.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
bio (and cognitively) inspired approaches in this domain. S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
A lot of research has been put into investigations of how Advances in neural information processing systems, 2014, pp. 2672–
2680.
to design and optimize various learning agents, much less [17] E. Santana and G. Hotz, “Learning a driving simulator,” arXiv preprint
efforts have focused on the environment. This work shows arXiv:1608.01230, 2016.
that the structure of the environment plays a considerable role [18] A. Plebe., R. Don., G. P. P. Rosati., and M. D. Lio., “Mental imagery
for intelligent vehicles,” in Proceedings of the 5th International
in learning. For future work, additional investigation can be Conference on Vehicle Technology and Intelligent Transport Systems
done to mathematically analyze how the episodic structure - Volume 1: VEHITS,, INSTICC. SciTePress, 2019, pp. 43–51.
contributes to the learning performance. Besides, further [19] D. Ha and J. Schmidhuber, “World models,” arXiv preprint
arXiv:1803.10122, 2018.
mechanisms can be proposed to improving the criteria of the [20] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman,
episodic generation. For example, continuously assess the J. Tang, and W. Zaremba, “Openai gym,” 2016.
agent’s learning performance and accordingly generate the [21] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski,
“Vizdoom: A doom-based ai research platform for visual reinforce-
suitable episodic imagination that the agent actually needs. ment learning,” in 2016 IEEE Conference on Computational Intelli-
gence and Games (CIG). IEEE, 2016, pp. 1–8.
R EFERENCES [22] M. D. Lio, A. Mazzalai, D. Windridge, S. Thill, H. Svensson,
M. Yksel, K. Gurney, A. Saroldi, L. Andreone, S. R. Anderson, and
[1] H. Svensson and S. Thill, “Beyond bodily anticipation: H. J. Heich, “Exploiting dream-like simulation mechanisms to develop
Internal simulations in social interaction,” Cognitive Systems safer agents for automated driving,” in 2017 IEEE 20th International
Research, vol. 40, pp. 161–171, 2016. [Online]. Available: Conference on Intelligent Transportation Systems (ITSC), Oct. 2017,
http://www.sciencedirect.com/science/article/pii/S1389041716300894 pp. 1–6.
[2] S. Vieilledent, S. M. Kosslyn, A. Berthoz, and M. D.
Giraudo, “Does mental simulation of following a path improve
navigation performance without vision?” Cognitive Brain Research,
vol. 16, no. 2, pp. 238 – 249, 2003. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0926641002002793
[3] A. Guillot and C. Collet, “Duration of mentally sim-
ulated movement: a review,” Journal of motor behavior,
vol. 37, no. 1, pp. 10–20, 2005. [Online]. Available:
http://www.tandfonline.com/doi/abs/10.3200/JMBR.37.1.10-20
[4] J. Munzert, B. Lorey, and K. Zentgraf, “Cognitive motor processes: the
role of motor imagery in the study of motor representations,” Brain re-
search reviews, vol. 60, no. 2, pp. 306–326, 2009. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0165017309000022
A Cognitively Informed Perception Model for Driving
Alice Plebe1 and Mauro Da Lio2

Abstract— Deep learning is responsible for the current re- B. Convergence–Divergence Zones
newed success of artificial intelligence. Applications that in
the recent past were considered beyond imagination, now
Although the simulation theory is one of the most estab-
appear to be feasible. The best example is autonomous driving. lished, it does not identify how simulation takes place at
However, despite the growing research aimed at implementing neural level. A prominent proposal in this direction is the
autonomous driving, no artificial intelligence can claim to have formulation of the convergence-divergence zones (CDZs) [3].
reached or closely approached the driving performance of They highlight the “convergent” aspect of certain neuron
humans, yet. Deep learning is an evolution of artificial neural
networks introduced in the ’80s with the Parallel Distributed
ensembles, located downstream from primary sensory and
Processing (PDP) project. There is a fundamental difference in motor cortices. Such convergent structure consists in the
aims between the first generation of artificial neural networks projection of neural signals on multiple cortical regions in a
and deep neural models. The former was motivated primarily many-to-one fashion. On the other hand, the neuron ensem-
by the exploration of cognition. Current deep neural models are bles have the ability to reciprocate feedforward projections
instead developed with engineering goals in mind, without any ambition or interest in exploring cognition. Some important components of deep learning – for example reinforcement learning or recurrent networks – do indeed owe an inspiration to neuroscience and cognitive science, as a far legacy of PDP. But this connection is now neglected; what matters is only pragmatic success in applications. We argue that it is urgent to reconnect artificial modeling with an updated knowledge of how complex tasks are realized by the human mind and brain. In this paper, we will first try to distill concepts within neuroscience and cognitive science relevant to driving behavior. Then, we will identify possible algorithmic counterparts of such concepts, and finally build an artificial neural model exploiting these components for the visual perception task of an autonomous vehicle.

I. FROM THE COGNITIVE SIDE

A. The Simulation Theory

A well-established theory in cognitive science is the one proposed by Jeannerod and Hesslow, the so-called simulation theory of cognition, which proposes that thinking is essentially a simulated interaction with the environment [1], [2]. In their view, simulation is a general principle of cognition, which can be expressed in at least three different components: perception, action and anticipation.

The simplest case of simulation is mental imagery, especially in the visual modality. This is the case, for example, when a person tries to picture an object or a situation. During this phenomenon, the primary visual cortex (V1) is activated with a simplified representation of the object of interest, but the visual stimulus is not actually perceived.

This work was developed inside the EU Horizon 2020 Dreams4Cars Research and Innovation Action project, supported by the European Commission under Grant 731593. The authors also wish to thank the Deep Learning Lab at the ProM Facility in Rovereto (TN) for supporting this research with computational resources funded by Fondazione CARITRO.
1 Alice Plebe is with the Dept. of Information Engineering and Computer Science, University of Trento, Italy [email protected]
2 Mauro Da Lio is with the Dept. of Industrial Engineering, University of Trento, Italy [email protected]

with feedback projections in a one-to-many fashion, realizing the divergent flow.

The primary purpose of convergence is to exploit synaptic plasticity in order to record which patterns of features – coded as knowledge fragments in the early cortices – occur in relation with a specific higher-level concept. Such records are built through experience, by interacting with objects. The convergent flow is dominant during perceptual recognition, while the divergent flow dominates imagery.

Convergent-divergent connectivity patterns can be identified for specific sensory modalities, but also in higher-order association cortices. It should be stressed that CDZs are rather different from a conventional processing hierarchy, where processed patterns are transferred from earlier to higher cortical areas. In CDZs, part of the knowledge about perceptual objects is retained in the synaptic connections of the convergent-divergent ensemble. This allows an approximation of the original multi-site pattern of a recalled object or scene to be reinstated.

C. Transformational Abstraction

One major challenge in cognitive science is explaining the mental mechanisms by which we build conceptual abstractions. The conceptual space is the mental scaffolding the brain gradually learns through experience, as an internal representation of the world. In particular, conceptual abstraction is derived mostly from perceptual experience, which fits perfectly with the approach implemented by artificial neural networks.

As highlighted by [4], CDZs are a valid systemic candidate for how the formation of high-level concepts takes place at brain level. However, the idea of CDZs is only sketched and cannot provide a detailed mechanism for conceptual abstractions. A difficulty with acquiring abstract categories lies in the inconsistent manifestations of the characteristic features across real exemplars.

A suggested solution to this difficult issue is the transformational abstraction [5], [6] performed by a hierarchy
of cortical operations, as in the ventral visual cortex. The essence of transformational abstraction, from a mathematical point of view, lies in the combination of two operations: linear convolutional filtering and nonlinear downsampling. Operations of this sort have been identified in V1 [7], [8], and are well recognized in the primate ventral visual path as well [9], [10].

D. The Predictive Theory

The reason why cognition is mainly explicated as simulation, according to Hesslow or Jeannerod, is that the brain can achieve through simulation the most precious information for an organism: a prediction of the state of affairs in the future environment. The need for prediction, and how it molds the entire cognition, has become the core of another popular theory, known as the "Bayesian brain", "predictive brain", or "free-energy principle for the brain", introduced by Friston [11]. According to him, the behavior of the brain – and of an organism as a whole – can be conceived as minimization of free-energy, a quantity that can be expressed in several ways depending on the kind of behavior and the brain systems involved.

Free-energy is a concept originated in thermodynamics, as a measure of the amount of work that can be extracted from a system. What is borrowed by Friston is not the thermodynamic meaning of free-energy, but its mathematical form only, which is derived from the framework of variational Bayesian methods in statistical physics. We will see in §II-B how the same probabilistic framework will be used in the derivation of a deep neural model. For example, this is his free-energy formulation in the case of perception [12, p. 427]:

F_P = \Delta_{KL}\left[ \check{p}(c|z) \,\|\, p(c|x,a) \right] - \log p(x|a)    (1)

where x is the sensorial input of the organism, c is the collection of the environmental causes producing x, a are actions that act on the environment to change sensory samples, and z are inner representations of the brain. The quantity \check{p}(c|z) is the encoding in the brain of the estimate of the causes of sensorial stimuli. The quantity p(c|x,a) is the conditional probability of the environmental causes given the sensorial input and actions. The discrepancy between the estimated probability and the actual probability is given by the Kullback-Leibler divergence \Delta_{KL}. The minimization of F_P in equation (1) optimizes z.

II. TO THE ARTIFICIAL SIDE

A. Convergence–divergence as Autoencoder

In the realm of artificial neural networks, the computational idea that most closely resonates with CDZs is the autoencoder. It is an idea that has been around for a long time; it was the cornerstone of the evolution from shallow to deep neural architectures [13], [14]. More recently, autoencoders have been widely adopted for their ability to capture compact information from high dimensional data. The basic structure of an autoencoder is composed of a feature-extracting part called encoder and a decoder part mapping from feature space back into input space. There is a clear correspondence between the encoder and the convergence zone in the CDZ neurocognitive concept, and similarity between the decoder and the divergence zone.

Then how exactly can convergence–divergence be achieved inside autoencoders? An interesting approach is the one closely related to the transformational abstraction hypothesis described in §I-C: deep convolutional neural networks (DCNNs). They implement the hierarchy of convolutional filtering alternated with nonlinear downsampling, and are considered the essence of transformational abstraction. In addition, there is growing evidence of striking analogies between patterns in DCNN models and patterns of voxels in the brain visual system. Several studies have successfully related results of deep learning models with the visual system [15], [16], finding reasonable agreement between features computed by DCNN models and fMRI data. Convolutional–deconvolutional autoencoders are therefore a highly biologically plausible implementation of the CDZ theory, at least in the case of visual information.

B. Predictive Brain as Variational Autoencoder

In the last few years there has been renewed interest in the area of Bayesian probabilistic inference in learning models of high dimensional data. The Bayesian framework, variational inference in particular, has found fertile ground in combination with neural models. Two concurrent and unrelated developments [17], [18] have made this theoretical advance possible, connecting autoencoders and variational inference. This new approach quickly became popular under the term variational autoencoder, and a variety of neural models have been proposed over the years.

The loss function for a variational autoencoder is defined as follows:

\mathcal{L}(\Theta, \Phi | x) = \Delta_{KL}\left[ q_\Phi(z|x) \,\|\, p_\Theta(z) \right] - \mathbb{E}_{z \sim q_\Phi(z|x)}\left[ \log p_\Theta(x|z) \right]    (2)

where x is a high dimensional random variable and z the representation of the variable in the low-dimensional latent space. \Theta and \Phi are parameters describing, respectively, the decoder and the encoder of the network. p_\Theta is computed by the decoder and represents the desired approximation of the unknown input distribution p, and q_\Phi is the auxiliary distribution computed by the encoder from which to sample z. \mathbb{E}[\cdot] is the expectation operator, and \Delta_{KL} is the Kullback-Leibler divergence.

It is evident how this mathematical formulation is impressively similar to the concept of free energy in Friston. Despite this close analogy, the proposers of the variational autoencoder appear either unaware of or uninterested in this coincidence. This is not so surprising, because mainstream deep learning is driven by engineering goals without any interest in connections with cognition. We believe instead that a strong connection between a well-established cognitive theory and a computational solution greatly argues in favor of adopting such a solution.
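To make Eq. (2) concrete, here is a minimal sketch of the loss as it is commonly implemented in PyTorch. It assumes a standard-normal prior p_Θ(z) and a Gaussian (mean-squared-error) reconstruction term; both are conventional choices and are not prescribed by the text.

```python
import torch

def vae_loss(x, x_recon, mu, logvar):
    """Negative ELBO of Eq. (2): KL[q_Phi(z|x) || p_Theta(z)] - E_q[log p_Theta(x|z)].

    mu, logvar: parameters of the Gaussian q_Phi(z|x), shape (batch, latent);
    x, x_recon: input images and their reconstructions, shape (batch, C, H, W).
    """
    # Closed-form KL divergence between N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    # Gaussian log-likelihood up to a constant: per-image squared error.
    recon = torch.sum((x_recon - x) ** 2, dim=(1, 2, 3))
    return (kl + recon).mean()
```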
Fig. 1. The architecture of our model: the visual input is compressed by a single encoder into the latent vector z, from which three decoders produce, respectively, the concept of lanes, the reconstruction of the visual data, and the concept of cars.

III. IMPLEMENTATION

In the previous section we have reviewed several components that match quite closely the relevant neurocognitive theories identified in §I. Our proposed model attempts to weave together these components, finalized at visual perception in autonomous driving agents.

Similarly to the hierarchical arrangement of CDZs in the brain, our model is provided with different levels of processing paths. A first processing path starts from the raw image data and converges up to a low-dimension representation of visual features. Consequently, the divergent path outputs in the same format as the input image. The other processing paths lead to representations that are no longer in terms of visual features, but rather in terms of concepts. As discussed in §I-C, our brain naturally projects sensorial information – especially visual – into conceptual space, where the local perceptual features are pruned and neural activations code the nature of the entities present in the environment that produced the stimuli. In the driving context it is not necessary to infer categories for every entity present in the scene; it is useful to project into conceptual space only the objects relevant to the driving task. In the model presented here we choose to consider the two main concepts of cars and lane markings.

As depicted in Fig. 1, the presented variational autoencoder is composed of one shared encoder and three independent decoders. All the components of the architecture are trained jointly. The encoder compresses an RGB image to a compact high-level feature representation. Then the decoders map different parts of the latent space back to separate output spaces: one into the same visual space as the input; the other two into conceptual space, producing binary images containing, respectively, car entities and lane marking entities. So, in our implementation the entire latent vector z represents the scene inside the visual space, and at the same time two inner segments represent specifically the car and lane concepts. The rationale for this choice is that in mental imagery there is no clear-cut distinction between low-level features and semantic features: the entire scene is mentally reproduced, but including the awareness of the salient concepts present in the scene.

Note that the idea of partitioning the entire latent vector into meaningful components is not new. In the context of processing human heads, the vector has been forced to encode separate representations for viewpoints, lighting conditions and shape variations [19]. In [20] the latent vector is partitioned into one segment for the semantic content and a second segment for the position of the object. Our approach is different. While we keep the two segments for the car and lane concepts disjoint, we fully overlap these two representations within the entire visual space. This way, we adhere entirely to the CDZ principle, and try to achieve the full scene by divergence, but at the same time including awareness of the car and lane concepts.
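As an illustration of this layout, the following PyTorch-style sketch wires a shared convolutional encoder to one scene decoder and two concept decoders. The 128-dimensional latent vector and the two disjoint 16-unit concept segments follow the numbers quoted in §IV; kernel sizes, strides and channel counts are purely illustrative and are not taken from the paper.

```python
import torch
import torch.nn as nn

class ConceptVAE(nn.Module):
    """Sketch of the shared-encoder / three-decoder layout of Fig. 1."""

    def __init__(self, latent=128, concept=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU(),
            nn.Flatten(),
        )
        feat = 256 * 8 * 16                     # for a 128x256 input after four stride-2 layers
        self.mu = nn.Linear(feat, latent)
        self.logvar = nn.Linear(feat, latent)
        self.dec_scene = self._decoder(latent, out_ch=3)   # full visual reconstruction
        self.dec_cars = self._decoder(concept, out_ch=1)   # binary car mask
        self.dec_lanes = self._decoder(concept, out_ch=1)  # binary lane-marking mask

    def _decoder(self, z_dim, out_ch):
        return nn.Sequential(
            nn.Linear(z_dim, 256 * 8 * 16), nn.ReLU(),
            nn.Unflatten(1, (256, 8, 16)),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_ch, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # The whole latent vector drives the scene decoder; two disjoint 16-unit
        # segments additionally drive the car and lane-marking decoders.
        return (self.dec_scene(z),
                self.dec_cars(z[:, :16]),
                self.dec_lanes(z[:, 16:32]),
                mu, logvar)
```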
IV. RESULTS

We present here a selection of results achieved with an instance of the model described in the previous section. The final architecture was trained for 200 epochs, and used 4 convolutional layers in the encoder, 4 deconvolutional layers for each decoder, and a latent space representation of 128 neurons, of which 16 encode the car concept and another 16 the lane marking concept. We would like to highlight that, since the images fed to the network have dimensions of 256 × 128 × 3 and the latent space dimension is 128, the compression performed by the network is of almost three orders of magnitude. This is a considerable achievement compared to other relevant works adopting variational autoencoders [21], [22], which limit the compression of the encoder to only one order of magnitude.

We trained and tested the presented model on the SYNTHIA dataset [23], a large collection of synthetic images representing various urban scenarios. The dataset contains about 100,000 color images (and as many corresponding segmented images, used as ground truth for the conceptual branches of the network). We used 70% of the data for training, 25% for validation and 5% for testing.

Fig. 2 shows the image results produced by our model for a selection of driving scenarios. The images are processed to show at the same time the results on conceptual space and visual space. The colored overlays highlight the concepts computed by the network: the cyan regions are the output of the car divergent path, and the pink overlays are the output of the lane markings divergent path. Fig. 2 includes a variety of driving situations, going from sunny environments (top rows) to very adverse driving conditions (bottom rows) in which the detection of other vehicles can be challenging even for a human. These results nicely show how the projection of the sensorial input (original frames) into conceptual representation is very effective in identifying
and preserving the salient features of cars and lane markings, despite the large variations in lighting and environmental conditions.

Fig. 2. Results of our model for a selection of frames from the SYNTHIA dataset, with different environmental and lighting conditions.

Lastly, we would like to stress that the purpose of our network is not mere segmentation of the visual input. The segmentation task is to be considered a support task, used to force the network to learn a more robust latent space representation, which now explicitly takes into consideration two of the concepts that are fundamental to the driving task.

V. CONCLUSIONS

The model presented here is an attempt to convert into an artificial neural network model the fundamental theories about how the brain processes its sensory inputs to produce purposeful representations. We especially identified the consolidated variational autoencoder architecture as the best candidate for implementing convergence-divergence zone schemes. The reason for constraining a deep learning model on cognitive theoretical grounds, instead of starting from scratch as is often done, derives from the observation of how humans excel in sophisticated sensorimotor control tasks such as driving.

REFERENCES

[1] M. Jeannerod, "Neural simulation of action: A unifying mechanism for motor cognition," NeuroImage, vol. 14, pp. S103–S109, 2001.
[2] G. Hesslow, "The current status of the simulation theory of cognition," Brain Research, vol. 1428, pp. 71–79, 2012.
[3] K. Meyer and A. Damasio, "Convergence and divergence in a neural architecture for recognition and memory," Trends in Neurosciences, vol. 32, pp. 376–382, 2009.
[4] J. S. Olier, E. Barakova, C. Regazzoni, and M. Rauterberg, "Re-framing the characteristics of concepts and their relation to learning and cognition in artificial agents," Cognitive Systems Research, vol. 44, pp. 50–68, 2017.
[5] D. Hassabis, D. Kumaran, C. Summerfield, and M. Botvinick, "Neuroscience-inspired artificial intelligence," Neuron, vol. 95, pp. 245–258, 2017.
[6] C. Buckner, "Empiricism without magic: transformational abstraction in deep convolutional neural networks," Synthese, vol. 195, pp. 5339–5372, 2018.
[7] D. Hubel and T. Wiesel, "Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex," Journal of Physiology, vol. 160, pp. 106–154, 1962.
[8] C. D. Gilbert and T. N. Wiesel, "Morphology and intracortical projections of functionally characterised neurones in the cat visual cortex," Nature, vol. 280, pp. 120–125, 1979.
[9] D. J. Felleman and D. C. Van Essen, "Distributed hierarchical processing in the primate cerebral cortex," Cerebral Cortex, vol. 1, pp. 1–47, 1991.
[10] D. C. Van Essen, "Organization of visual areas in macaque and human cerebral cortex," in The Visual Neurosciences, L. Chalupa and J. Werner, Eds. Cambridge (MA): MIT Press, 2003.
[11] K. Friston, "The free-energy principle: a unified brain theory?" Nature Reviews Neuroscience, vol. 11, pp. 127–138, 2010.
[12] K. Friston and K. E. Stephan, "Free-energy and the brain," Synthese, vol. 159, pp. 417–458, 2007.
[13] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, pp. 504–507, 2006.
[14] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[15] U. Güçlü and M. A. J. van Gerven, "Unsupervised feature learning improves prediction of human brain activity in response to natural images," PLoS Computational Biology, vol. 10, pp. 1–16, 2014.
[16] B. P. Tripp, "Similarities and differences between stimulus tuning in the inferotemporal visual cortex and convolutional networks," in International Joint Conference on Neural Networks, 2017, pp. 3551–3560.
[17] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proceedings of the International Conference on Learning Representations, 2014.
[18] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," in Proceedings of Machine Learning Research, E. P. Xing and T. Jebara, Eds., 2014, pp. 1278–1286.
[19] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. B. Tenenbaum, "Deep convolutional inverse graphics network," in Advances in Neural Information Processing Systems, 2015, pp. 2539–2547.
[20] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun, "Stacked what-where auto-encoders," in International Conference on Learning Representations, 2016, pp. 1–12.
[21] E. Santana and G. Hotz, "Learning a driving simulator," CoRR, vol. abs/1608.01230, 2016.
[22] D. Ha and J. Schmidhuber, "World models," CoRR, vol. abs/1803.10122, 2018.
[23] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3234–3243.
Cognitive Wheelchair: A Personal Mobility Platform
Mahendran Subramanian, Suhyung Park, Pavel Orlov, and A. Aldo Faisal
Abstract— Cognitive technologies towards smart vehicles are evolving, but those specific to wheelchairs are often overlooked. High-level gaze informatics can be complemented with context-aware algorithms for natural interaction with the environment through robotic devices or autonomous vehicles. Herein, we harvest eye movements to enhance the cognitive abilities of an autonomous wheelchair platform, as eye movements are correlated with motor intention and act as a precursor to movement. First, we developed tools to estimate the 3D gaze point and to recognise the object at the wheelchair user's 'Area Of Interest' for high-level intention decoding. This 3D eye-tracking tool, with practical accuracy levels for analysing natural human behaviour during physical interaction with the environment, was used to obtain human intention in autonomous systems for natural, user-centric interaction with the environment. With high-level intention decoding capability, such cognitive wheelchair systems can not only perform autonomous mobility tasks but also function in a contextual, semantic and continuous spectrum of tasks.

Figure 1. Wheelchair setup with a paraplegic pilot.

I. INTRODUCTION

Autonomous navigation and context-aware algorithms have been studied within the framework of autonomous wheelchairs for urban mobility [1-3]. These algorithms most often make use of radar, lidar, RGB-Depth and ultrasonic proximity sensors to reconstruct a map of the environment and compute the path to reach a predefined destination. However, the Artificial Intelligence (AI) algorithms employed in these applications work in a closed-loop manner, where the algorithms adapt to the environment directly based on the sensor data. Such an approach is expected to work well in known environments, which have well-defined conditions like traffic systems and landmarks for GPS-based navigation. However, wheelchair operation does not have predefined conditions, as the users have a more diverse range of needs and environments that they need to move in. For this, the human intention needs to be included in the AI-environment loop, so that the wheelchair can understand users' intentions while monitoring environmental conditions through the sensor data. Through such semi-autonomy, users can provide one-time goal-based commands rather than continuous directional commands, freeing them from constant interaction with the interface, while AI algorithms take care of the navigation.

Natural gaze-based intention decoding is worthy of consideration [4,5]. Herein, we incorporate autonomous driving technology with gaze-based intention decoding. Specifically, we demonstrate how gaze information can be translated into user intention for an autonomously driving wheelchair, which essentially leads to a human-in-the-loop cognitive wheelchair. Our patient-centred approach considerably reduces the need for constant interaction with an interface and lowers users' cognitive load. With improved context-aware algorithms, this cognitive wheelchair can recognise the objects and surface conditions in the proximity of the wheelchair and can work with high-level information like semantically annotated locations and more natural eye-gaze interaction. This empowers users to communicate or interact with the environment while navigating the wheelchair. Our cognitive, personal mobility technology makes the wheelchair control contextually responsive to the dynamic environment of the user and potentially enables navigation of an urban continuum from room to city scale.

II. MATERIAL AND METHODS

A. System Hardware Architecture

A 2D lidar (YDLIDAR X4) and a custom-developed 3D lidar were fitted onto a powered wheelchair (Invacare) to fabricate an autonomous wheelchair platform, as shown in Fig. 1. The control system on the powered wheelchair was replaced with a regenerative dual-channel motor driver (H-bridge from Dimension Engineering) to control the wheelchair's motors. The prototype is fitted with an RGB-D camera (RealSense D435 – 1280 x 720 @ 90 fps; 10 m range; 87°±3° H x 58°±1° V x 95°±3° D angle of view; or Kinect v2 – 512 x 424 @ 30 fps; 0.5-4.5 m range; 70/60 angle of view). The RGB-D camera in combination with a Tobii Eye 4c remote eye tracker was used for 3D gaze estimation. The RGB feed of the SMI wearable eye-tracking spectacles for the ego-centric view was used to perform 2D gaze-point object identification and simultaneously record eye movements.

* This work was supported by EPSRC [EP/N509486/1: 1979819] and a Toyota Mobility Discovery Grant. MS, SP, PO, and AAF are with the Brain & Behaviour Lab, Dept. of Bioengineering & Dept. of Computing, Imperial College London, South Kensington, SW7 2AZ, UK (Corresponding author email: [email protected]). We sincerely thank Paul Moore and Tom Nebarro, our wheelchair pilots.
B. Gazeinformatics-based intention decoding

Figure 2. Labelled ego-centric image with gaze fixation and fixation duration (white cross and white circle). (A) 'Observing a door' (non-interactive task). (B) 'Open the door' (interactive task). The bounding boxes of the same objects have different sizes and proportions from a different point of view.

This AI module decodes users' high-level intentions, such as "Take me to the Door", by investigating their eye movements. To collect data for model training and evaluation, we performed a study with five healthy subjects. During the experimental session, each subject performed 10 trials. We propose meta-tasks, in which participants are shown a door <targetObject> and are asked to perform different actions with it (see Fig. 2). A computer provides voice commands based on these tasks for each subject, and the gaze tracking software tool (BeGaze, SensoMotoric Instruments) records the corresponding eye movements. A 3-point calibration was performed prior to recording. The meta-tasks include tasks such as 'look at the open door' and 'look at the door and imagine opening the door'.

Macro eye-movement events, like gaze fixations, fixation duration, saccades, and long drifts, have been used previously to model subjects' actions [6,7] and for the development of gaze-controlled systems [8-10]. However, during their everyday activity, humans use all available information for motor planning, for example, information from memory, context, and visual information from the foveal and extra-foveal area. In our system, we used raw eye-tracker data with a sampling frequency of 120 Hz. Fig. 2 illustrates the bounding boxes that we used for object labelling using the AOI tool within BeGaze. The size of the bounding box corresponds with the object's pixels in the current ego-centric image. From one image to another, the pixel area that is drawn for a single object might change because of head movement, different object locations, and different viewing points (see Fig. 2, e.g. the door). That is why we average bounding boxes per object and normalise the gaze point position with respect to this. We have collected 3718 gaze points (from 5 subjects), out of which ~50% of gaze fixations were related to non-interactive meta-tasks and were labelled as Class 0. Gaze fixations related to interactive meta-tasks were labelled as Class 1. In order to predict a high-level intention, we used visual attention density to determine intention. To this end, we used a simple but robust approach: gaze locations within the object bounding box were normalised. This 2D location was fed to an object-specific SVM classifier that was trained on the two classes (10-fold cross-validated). Each frame classification output was fed into a ring-buffer of 40 frames, and we performed a winner-take-all vote to determine the temporally averaged intention. The above process was done offline but, in order to achieve real-time intention decoding, Semantic Fovea was used to compute the object label and bounding box [11,12]. An evaluation of our real-time gazeinformatics-based intention decoding module is under review [13].
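As a concrete illustration of this classification step, the following scikit-learn sketch trains a per-object intention classifier on normalised gaze coordinates. The file names and array shapes are hypothetical, and the 'fine Gaussian' SVM is approximated here by a standard RBF kernel; the ring-buffer vote is indicated only in comments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Hypothetical data: gaze positions normalised to the object's averaged bounding
# box, with labels 0 (non-interactive fixation) and 1 (interactive fixation).
gaze_xy = np.load("door_gaze_points.npy")   # shape (n_samples, 2)
labels = np.load("door_gaze_labels.npy")    # shape (n_samples,)

# Object-specific RBF SVM, scored with 10-fold cross-validation.
clf = SVC(kernel="rbf", gamma="scale")
scores = cross_val_score(clf, gaze_xy, labels, cv=10)
print(f"10-fold accuracy: {scores.mean():.3f}")

clf.fit(gaze_xy, labels)
# At run time, per-frame predictions would be pushed into a 40-frame ring buffer
# and the majority (winner-take-all) label taken as the decoded intention.
```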
C. 3D Gaze-based destination definition

3D gaze estimation is a suitable technique for end-point control [14]. Herein, in order to determine the 3D coordinates of the user's intended destination, we combine remote eye-trackers with an RGB-D camera. The remote eye-trackers, placed at a distance of 60 cm from the user, track the user's gaze and provide 2D screen coordinates of the gaze on a 60 cm x 34 cm display at a rate of 60 Hz. To convert this 2D fixation information into 3D coordinates, we overlay the 2D gaze point of the user on the 3D point cloud map of the environment reconstructed by the RGB-D camera. The calibration process and the accuracy of the 3D gaze-point estimation, along with the absolute errors in three dimensions as well as the Euclidean distance error, are reported in our earlier study [15].
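For illustration only, the following sketch back-projects a 2D gaze pixel onto the RGB-D point cloud using a pinhole camera model; the variable names and intrinsics are placeholders, and the actual calibration procedure is the one reported in [15].

```python
import numpy as np

def gaze_to_3d(gaze_px, depth_m, fx, fy, cx, cy):
    """Back-project a 2D gaze point onto the RGB-D point cloud.

    gaze_px: (u, v) pixel coordinates of the gaze in the depth image;
    depth_m: depth at that pixel in metres;
    fx, fy, cx, cy: pinhole intrinsics of the RGB-D sensor.
    Returns the 3D point in the camera frame.
    """
    u, v = gaze_px
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])
```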
D. System architecture

Figure 3. Diagram showing the system architecture of the semi-autonomous wheelchair prototype. The nodes in red are native to this architecture. Grey boxes show the ROS message type.

The ROS-based autonomous wheelchair platform (Fig. 3) was built on a 2D real-time SLAM algorithm [16]. The home-built 3D lidar was used for obstacle detection. Navigation_stack, in combination with a GPU-based Asynchronous Advantage Actor-Critic (G-A3C) Reinforcement Learning (RL) controller trained for collision avoidance in a small unmanned ground vehicle [17], was utilised for path planning, obstacle detection and collision avoidance. The autonomous navigation architecture was then incorporated with the gaze-based destination commands. The natural gaze_intention decoder node publishes predictor_msgs and
object_identifier_msgs, i.e. whether the user intends to interact with the object of interest within the field of view. The gaze_monitor node subscribes to these messages, and the gaze-based commands are published in a ROS message (wheelchair_nav/gaze) received from the Windows client. 3D gaze-based end-point control ('Wink' detection) was used for low-level intention (free navigation) input as well as a double confirmation of the decoded high-level intention. The gaze_to_move_base_goal node subscribes to this message and publishes the goal pose as a move_base_msg to the move_base node. Once a path to the goal has been computed, the required velocity commands are sent to the cmd_vel_to_wheelchair_drive node, which is the driver for the wheel motors, and all of these processes were achieved in real-time.
A. SLAM generated map evaluation and pose estimation achieving wheelchair orientation with a tolerance of 3.4
degrees at low travel time is possible.
SLAM generated map was evaluated by superimposing the
resulting map on top of the blueprint of the scanned floor. It
did form fit. To evaluate the fit, area (pixel) comparison was
performed (Image J, NIH, USA) between the SLAM
generated-map and the floor plan after adjusting their scale to
be uniform. Floor plan section area was 37202 px; SLAM
derived Map area for the same section was 38383 px. The
additional area 1181 px detected was due to the laser scan
penetrating through windows and fitted glasses. Which infer
that the SLAM derived map accuracy is 96.8 %, i.e. the
discrepancy is 3.1%. However, comparison of surface area
between a rectangle sized room (2 x 4 m) and a SLAM
derived-map of the same room resulted in 95% accuracy.

Figure 5. Task: 1. (A) shows the time taken to travel 4m varied as a


function of the planner frequency. (B) shows the time taken to travel 4m
varied as a function of the goal (x, y) position tolerance. (C) shows the time
Figure 4. SLAM derived map form fit with the floor plan (Floor-4,
taken to travel 4m varied as a function of orientation tolerance.
Royal School of Mines, Imperial College London)

Similarly, calculation of SLAM derived pose accuracy was


evaluated by measuring the room coordinates with respect to
the wheelchair and comparing it with the SLAM derived map
and localisation coordinates. Localisation pose accuracy
measured had a tolerance of +/- 10 cm. However, the pose
update frequency parameter had to be set to low, as higher
values resulted in a constant oscillation of the pose within the
tolerance.
B. Autonomous Wheelchair Performance Evaluation
In order to find the optimal specifications for safe use of
the wheelchair, we investigated three parameters: planner
frequency, position tolerance and orientation tolerance.
Different values for the planner frequency were investigated to
optimise the pose based path update during navigation and
understand their effect on wheelchair transmission. These
parameters were first evaluated in three different tasks by Figure 6. Performance results for Task 2 and Task 3. (A) shows the
measuring the time it takes the wheelchair to reach the time taken to travel 5m around a static obstacle varied as a function of the
destination. In all three tasks, the wheelchair was positioned at planner frequency. (B) depicts the time taken to travel 5m around both a
static obstacle and a dynamic obstacle varied as a function of the planner disabled to gain more independence and mobility in daily life
frequency.
activities. Our natural user interface may lead to better
Based on these results, the parameters for the system architecture were optimised, and task 4 (n=6) was performed. In task 4, the goal for the autonomous wheelchair was defined as moving from A to B (5 m) within different dynamic environment scenarios, i.e. the number of static and dynamic obstacles and their positions were changed for each run to simulate different routes within the same room, for the same start and end goal coordinates. The time taken to move from A to B varied for different routes. However, the autonomous wheelchair was able to detect both static and dynamic obstacles with perfect accuracy using the optimised parameters.

Next, a questionnaire about the comfort level and required additional features was provided to 3 volunteers (2 quadriplegic and 1 paraplegic) who showed interest in evaluating our system further following a demonstration. The state of the passenger during autonomous navigation was recorded. Based on the observations, the system architecture was improved [13].

D. 3D Gaze Based Semi-Autonomous Wheelchair Evaluation

The wheelchair participant's (subject's) intention to get to the Object of Interest was decoded successfully. From the accuracy level comparisons, we see that the naive gaze-pointer approach with a fine Gaussian SVM results in 78.9% accuracy, and that result was just for a single intention, i.e. "Take me to the Door". The SVM was built from the 2D gaze point positions, meaning that we can explain how the system works to our users. Such an explanation of machine learning results for end-users is highly essential from the perspective of usability, accessibility and trust in the system. From Fig. 2, we can observe that there is a clear separation between the positions of gaze points for different tasks. The obstacle detection and collision avoidance were improved in the autonomous wheelchair platform by deploying a deep RL controller. The 3D gaze-based modules were then integrated with the autonomous wheelchair platform to test the semi-autonomous functionality and cognitive ability. In addition to the intention decoding module, 3D gaze-based end-point control used to decode the user's navigational intentions was investigated. A 'Wink' lasting for 3 seconds was used to define a destination within the depth camera FOV. Three 'Winks' with the right eye were used for turning right, and three 'Winks' with the left eye were used for turning left. Thus, continuous gaze-based semi-autonomy was achieved successfully. For evaluation, the wheelchair was positioned between static and dynamic obstacles and the task was carried out six times (n=6). When a low-level navigation intention was successfully detected, the wheelchair was able to navigate to the destination accounting for the obstacles along its path.

IV. CONCLUSION

Our fully functioning cognitive wheelchair can be built using frugal sensors and state-of-the-art modules based on machine learning algorithms. A natural gaze-based intention decoder and a 3D gaze-based destination estimation method were developed and showed practical accuracy levels. We place the user in the centre of the AI loop, allowing the user to provide minimum input for successful navigation. Such cognitive, personal mobility technology can benefit the severely disabled to gain more independence and mobility in daily life activities. Our natural user interface may lead to better adoption by patients over time as we have approached its development from an embodied perspective [18], by relying on natural interactions to drive human-robot interaction.

REFERENCES

[1] H. Grewal, A. Matthews, R. Tea, and K. George, "Lidar-based autonomous wheelchair," in IEEE Sensors Applications Symposium (SAS), pp. 1–6, 2017.
[2] P. Marin-Plaza, A. Hussein, D. Martin, and A. de la Escalera, "Global and local path planning study in a ROS-based research platform for autonomous vehicles," Journal of Advanced Transportation, 2018.
[3] D. Schwesinger, A. Shariati, C. Montella, and J. Spletzer, "A smart wheelchair ecosystem for autonomous navigation in urban environments," Autonomous Robots, vol. 41, no. 3, pp. 519–538, 2017.
[4] S. I. Ktena, W. Abbott, and A. A. Faisal, "A virtual reality platform for safe evaluation and training of natural gaze-based wheelchair driving," in Proc. 7th IEEE/EMBS Conf. Neural Engineering (NER), pp. 236–239, 2015.
[5] L. A. Raymond, M. Piccini, M. Subramanian, P. Orlov, G. Zito, and A. A. Faisal, "Natural Gaze Data Driven Wheelchair," bioRxiv, p. 252684, 2018.
[6] A. M. Brouwer, V. H. Franz, and K. R. Gegenfurtner, "Differences in fixations between grasping and viewing objects," Journal of Vision, vol. 9, no. 1, p. 18, 2009.
[7] A. H. Fathaliyan, X. Wang, and V. J. Santos, "Exploiting Three-Dimensional Gaze Tracking for Action Recognition During Bimanual Manipulation to Enhance Human–Robot Collaboration," Frontiers in Robotics and AI, vol. 5, p. 25, 2018.
[8] S. Dziemian, W. W. Abbott, and A. A. Faisal, "Gaze-based teleprosthetic enables intuitive continuous control of complex robot arm use: writing & drawing," in Proc. 6th IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics (BioRob), p. 5, 2016.
[9] P. M. Tostado, W. W. Abbott, and A. A. Faisal, "3D gaze cursor: Continuous calibration and end-point grasp control of robotic actuators," in Robotics and Automation (ICRA), IEEE International Conference, pp. 3295–3300, 2016.
[10] P. Orlov and K. Gorshkova, "Gaze-Based Interactive Comics," in Proc. 9th Nordic Conference on Human-Computer Interaction, ACM, p. 116, 2016.
[11] C. Auepanwiriyakul, A. Harston, P. Orlov, A. Shafti, and A. A. Faisal, "Semantic fovea: real-time annotation of ego-centric videos with gaze context," in Proc. ACM Symposium on Eye Tracking Research & Applications, p. 87, 2018.
[12] A. Shafti, P. Orlov, and A. A. Faisal, "Gaze-based, context-aware robotic system for assisted reaching and grasping," in Robotics and Automation (ICRA), IEEE International Conference, pp. 863–869, 2019.
[13] M. Subramanian, S. Park, P. Orlov, A. Shafti, and A. A. Faisal, "Gaze-contingent decoding of human navigation intention on an autonomous wheelchair platform," [Under review].
[14] W. W. Abbott and A. A. Faisal, "Ultra-low-cost 3D gaze estimation: an intuitive high information throughput compliment to direct brain–machine interfaces," Journal of Neural Engineering, vol. 9, p. 046016, 2012.
[15] M. Subramanian, N. Songur, D. Adjei, P. Orlov, and A. A. Faisal, "A.Eye Drive: Gaze based semi-autonomous wheelchair interface," in Engineering in Medicine and Biology (EMBC), IEEE International Conference, 2019. In press.
[16] W. Hess, D. Kohler, H. Rapp, and D. Andor, "Real-time loop closure in 2D LIDAR SLAM," in Robotics and Automation (ICRA), IEEE International Conference, pp. 1271–1278, 2016.
[17] M. Everett, Y. F. Chen, and J. P. How, "Motion planning among dynamic, decision-making agents with deep reinforcement learning," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3052–3059, 2018.
[18] T. R. Makin, F. de Vignemont, and A. A. Faisal, "Neurocognitive barriers to the embodiment of technology," Nature Biomedical Engineering, vol. 1, p. 0014, 2017.
A Frontal Cortical Loop For Autonomous Vehicles Using Neuralized
Perception-Action Hierarchies
David Windridge and Seyed Ali Ghorashi, Middlesex University, London, UK Member, IEEE
Abstract— By modelling driving as a Perception-Action (PA) hierarchy it is possible to combine high-level symbolic logical reasoning (in particular, the Highway Code applied to hypothetical road configurations) with low-level sub-symbolic processes (specifically, Optimal Control and stochastic machine learning). In this context, we propose a cortical frontal loop analogue for autonomous vehicles in which progressively abstracted bottom-up scene understanding is followed by top-down legal action specification (with progressive contextual grounding), such that final action selection is carried out via a simulated basal ganglia model. Although the top level of the PA-hierarchy employs explicit first-order logical reasoning, we can exploit the duality principle of Hölldobler to generate a functionally equivalent deep neural network such that the PA hierarchy can learn adaptively at all levels.

I. INTRODUCTION

Perception-Action (PA) learning proposes an intrinsic link between the perceptive and active capabilities of an agent (motto: action precedes perception). This may be modelled as an explicit bijection constraint between percept transitions and actions, P × P → A, such that any perceptual redundancy is eliminated in relation to the agent's affordances with respect to the environment.

The notion of a Perception-Action hierarchy further relates the Brooksian notion of action subsumption to this progressive perceptual abstraction via layer-wise application of the PA bijection principle. We can thus model human car-driving as a PA hierarchy by enabling the combination of high-level symbolic logical reasoning (i.e. in relation to the Highway Code) with low-level sub-symbolic processes.

To this end, we here outline a run-time Cortical Frontal Loop analogue in which (progressively abstracted) bottom-up scene representation is followed by top-down (legal) action specification. The top level of the PA-hierarchy, the Logical Reasoning Module (LRM), hence employs explicit first-order logical reasoning in order to compute the full set of equi-legal agent actions (constituting the Herbrand base of the LRM's logical programme) with respect to the current configuration as interpreted via the bottom-up scene understanding.

The PA-hierarchy so constructed utilizes both neural and formal reasoning processes. However, there is a fundamental duality principle which suggests that Logic Programmes are always capable of neuralization (cf. Hölldobler & Kalinke's equivalence between 3-layer NNs and logic programmes (LPs)). If the rule-base is hierarchical (as it must be in a PA-hierarchy), then the above equivalence becomes that between the reasoning hierarchy and a functionally equivalent deep neural network. This means that the system can, in principle, be so constructed as to be able to learn adaptively (via back-propagation) at all levels via an end-to-end neuralization of the PA hierarchy.

II. A HIERARCHICAL SENSORIMOTOR CONTROL SYSTEM FOR AUTONOMOUS VEHICLES

The driving agent model here proposed implements a biologically-analogous cortical system for hierarchical sensorimotor control, with physically connected (non-symbolic) bottommost layers and a top-most symbolic subsumption architecture. Long-term (strategic) goals, in particular compliance with the Highway Code, are enacted by the symbolic module. The module acts on the sub-symbolic (physical) layer by specifying desirable target areas, hence biasing low-level action selection. The symbolic module thus steers the behavior of the lowermost (physically-connected) layer, which retains the final authority, vetoing all but the safe tactical maneuvers that are moment-by-moment available (it can veto incorrect high-level requests).

Our architecture is inspired by the organization of the human brain's visual processing [1-3], with differing cortical loops permitting different agent learning modalities:

The cerebellar loop learns forward/inverse models of the vehicle/environment dynamics (used for motor control and adaptation to differing environments, as well as embodied simulation for training the dorsal stream to learn the value of novel short-term tactical-level maneuvers).

The dorsal stream has a convergence-divergence organization and learns compact representations of simple events that are used to construct simple episodes for developing short-term motor strategies (e.g., imagining other road users' possible behaviors and learning collision avoidance countermeasures).

The symbolic level learns long-term strategies with high-level action selection via reinforcement learning in an episodic simulation context.

Agent evolution is conducted via off-line learning, utilizing wake-dream cycles to replace the various neural network building blocks.

The role of the dorsal stream is hence to recognize actions latent in the environment and to prepare motor plans accordingly [1]. The dorsal stream also has a role in conceptualizing episodes. Both of these capabilities are naturally implemented within our system via an auto-encoded convergence-divergence architecture.

Because of its intrinsically discrete and iterative nature, however, neuralization of the symbolic frontal cortex is less
straightforward. We thus first give a detailed account of this loop prior to discussing our symbolic neuralization strategy.

III. THE BIASING LOOP (FRONTAL CORTEX LOOP)

To implement complex symbolic rule-based behaviors such as legal action sequence-planning (e.g. overtaking), further layers are constructed on top of the dorsal stream that steer the agent's low-level behaviors so as to produce legal action sequences for longer-term goals.

This is specified as a hierarchical PA-subsumption architecture that provides a unified framework for semantically annotated event logging, generation of legal priors for action selection via the basal ganglia (BG) loop, and high-level motor babbling/top-down dream instantiation (in the offline system).

There is hence a unified architecture within the LRM incorporating a common symbolic/sub-symbolic interface that operates across three distinct symbolic/sub-symbolic information-flow modalities: bottom-up semantic annotation, top-down legal intention biasing, and top-down dream instantiation.

While the sub-symbolic system operates via goal salience (i.e. defined regions of the motor cortex), which may be learned so as to optimize long-term strategic behaviors, the LRM can only make recommendations (as per the subsumptive 'principle of lower-level veto'), such that the final choice is the charge of the lowest-level motor control (the dorsal stream). The final authority thus always remains the responsibility of the dorsal stream physical loops.

A. The Logical Reasoning Module

The subsumptive Perception-Action hierarchy embodied within the LRM consequently implements the symbolic (i.e. high-level representational) component of the architecture, being responsible for high-level scene interpretation and annotation, and for introducing legal biasing in intention (note that the Highway Code itself does not generally identify unique actions within a given road context, but rather gives rise to a degenerate, equi-legal set of action possibilities).

The LRM acts via a mixture of theorem proving-via-resolution and functional extrapolation in order to apply the HWC in unfamiliar scenarios, with the former constituting the highest level of PA subsumption. The road configuration is thus represented within the LRM as instantiated logical variables, irrespective of the LRM's operational modality (as indicated, the LRM subsumption framework is constrained to have the capability to act reversibly, that is to say, in a generative manner via reverse PA logical-variable instantiation, such that hallucinated high-level legal road configurations are spontaneously generated alongside the corresponding legal intentionality in the offline dreaming process. The latter (although beyond the scope of this paper) is an instance of top-down exploratory PA motor babbling, in which theorem proving-via-resolution is applied to random instantiations of logical variables in order to establish self-consistent Herbrand (i.e. logically self-consistent) interpretations, i.e. scenarios consistent with the legal road protocols).

Thus, while the offline dreaming process is one of top-down symbolic grounding through the full PA subsumption architecture, it is conversely the case that run-time high-level scene-description and annotation may be seen as a process of bottom-up symbolic abstraction. The two processes are hence the precise inverse of each other in the LRM's design.

B. PA Subsumption Design Principle Adopted by the LRM

The criterion for the number of levels in the hierarchy is defined by the notions of subsumption and Percept-Action bijection. Application of the PA bijectivity criterion implies that we should, as far as possible, represent only those percepts that distinguish intentional actions on a given layer. This means that each intention must bring about a perceivable change in such a way that the total set of percepts is minimized with respect to the available actions (affordances), consistent with the Highway Code representation of a priori meaningful perceptual objects. In practical terms, application of this principle means that, for example, it is not possible to have two consecutive legal gaps within a lane, since a 'legal gap', in order to exist as a high-level percept, must be distinguishable by a correspondingly legally-definable intention (a legal gap is defined as a potential legal place of relative occupation for the Ego car within a given lane, and as such is not sub-divisible at the highest level of legal intentionality).

The notion of subsumption in the LRM is thus related to the legal sub-structuring of high-level intentionality; in particular, where perceptual targets are fine-grained by sub-intentions, for which the same PA bijectivity condition also applies.

This bijectivity principle also extends to levels below that indicated by the HWC; however, the lowest intentional level defined by the HWC is that of the linearized road metric; this therefore dictates the interface point of the LRM with the rest of the system (equally, this is the symbolic/sub-symbolic cut-off), as indicated in Fig. 1.

Figure 1. Run-time system PA hierarchy (OC refers to the optimal control trajectories existing at the physical layer).

From the hierarchical PA perspective, there are thus two distinct symbolic reasoning layers implicit in the Highway Code (because the HWC explicitly excludes both navigational considerations and motor processes from its remit, which would respectively extend the higher and lower levels of the hierarchy if present). The two levels are: the discrete symbolic level and the logico-linear metric level, as shown in Figure 1. Consequently, legal-intention-related configurations can only be defined in the above terms; they collectively represent the high-level semantic annotation (or equivalently, the high-level scene understanding) brought about by hierarchical PA considerations.
The LRM is therefore architected on two distinct layers (see Figure 1), with a perception/action interface specified between each level at the appropriate level of symbolic abstraction.

C. Interlayer Interface Structure of the LRM

The Highway Code refers to both discrete symbolic entities (cars, lanes, signs, gaps, etc.) and linearized-metric entities, i.e. metric entities expressed in terms of distance-to/time-to and distance-from/time-from other entities, described in relation to the Ego Car.

At the high-level node, lane-wise road configurations are characterized in the LRM via a logical-list format: ordered in-lane lists of cars and gaps, with (the equivalent of) predicatized assertions as to which cars/gaps are legally adjacent to which others.
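Purely as an illustration of such a representation, a lane-wise logical list might be encoded as follows; the entity and predicate names are hypothetical, since the paper does not commit to a concrete syntax.

```python
# Illustrative encoding of the LRM's lane-wise "logical list" road description.
road = {
    "lane_1": ["car_ego", "gap_a", "car_1"],      # entities ordered along the lane
    "lane_2": ["gap_b", "car_2", "gap_c"],
}
facts = [
    ("legally_adjacent", "car_ego", "gap_b"),     # the Ego car may legally occupy gap_b
    ("legally_adjacent", "car_1", "gap_c"),
    ("ahead_of", "car_1", "car_ego"),
]
```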
At the immediately lower level of the LRM, the (symbolic/sub-symbolic) node is characterized by annotated metrical bounding boxes relating to legal transitions, produced by a two-stage process corresponding to the two stages of subsumption at the apex of the PA hierarchy listed above (the annotation aspect of the metric bounding boxes thus correlates to their high-level representation, illustrating the progressively grounded nature of symbols generated in a PA hierarchy).

Contextual metric information (distances to, and velocities of, other cars) received from the agent is hence converted into a non-metrical list of cars and gaps by means of linear extrapolation according to the HWC protocols (i.e. assuming constant speeds and legally-specified reaction times). This list is passed to the second level of the LRM as the equivalent of a declaratively-enacted predicate script, from which a set of high-level legal intentions with uniform priors is generated (they are uniform since road protocols do not distinguish between legal intentional possibilities a priori).

D. Bottom-up Communication from the Pre-LRM Layer

The bottom-up semantic annotation function of the runtime system thus involves communication through the various levels of the LRM in the form of abstractions of the perceptual data consistent with the outlined notion of perceptual subsumption.

At the symbolic/sub-symbolic interface layer (Linearized Metric Layer), geometric details such as the exact shape of the lanes are hence discarded, while the topology and linear distances (constituting a higher-level legal-symbolic parametrization) are retained. The speed and distance of individual objects in relation to the Ego car, and road configuration information in the form of lane numbering, width, lane marker types (e.g. whether lane change is allowed), etc., are passed to the LRM.

The net result of the bottom-up communication of road configuration, after processing by the logical-reasoning system, is thus a high-level symbolic representation of both the legal status of, and the legal possibilities with respect to, the current road configuration. This hence constitutes a semantic annotation of the road situation described with respect to a (legal) intentional frame, or equivalently the high-level scene interpretation.

E. Top-down Communication from the LRM (Legal Intention Grounding)

The logic-symbolic reasoning process, as well as providing the high-level interpretation of the road circumstances indicated above, also serves to provide a full set of Herbrand (i.e. logically self-consistent) interpretations of the future legal action possibilities (for example, whether it is legal to change lane in the current context).

These Herbrand sets are then grounded, i.e. propagated downwards (as instantiated hierarchical variables in the run-time mode) through the perception-action subsumption hierarchy so that, at the point of interface, they manifest as a set of binary saliency indicators attached to legally-designated areas in the linearized metric space.

The top-down communication from the LRM thus takes the form of metrical bounding boxes augmented by discrete legal saliency indicators that are used to directly compute the motor cortex biasing matrix. This annotation thus simultaneously satisfies the requirements of perception-action bijectivity and legal self-consistency; in particular, bijectivity allows the bounding box annotation to be directly interpretable at the motor cortex in regard to action selection.

Note that the top-down LRM logical annotation process is exhaustive, with a complete Herbrand interpretation of the scene generated as the annotation output (this is a natural consequence of the logic program being applied recursively until an inferential fixed point is arrived at). This means that, in the event of incomplete input data, the system generates a full range of self-consistent 'completion' sets, which are effectively the equivalent of equally-weighted 'possible worlds' (in the modal logic sense) consistent with the input, composed of alternative groundings of predicate variables with the available constants.

F. Dreaming Initiation via Top-down Communication of Legal-Perceptual Priors (LRM Percept-Motor Babbling)

As a corollary, where no input is given to the LRM, there are no grounded logical road configuration variables asserted at the symbolic/sub-symbolic interface. In principle, this allows the LRM to initiate offline learning via a dreaming process (i.e. high-level percept-action babbling) without any modification of the system's subsumptive structure; exactly the same mechanism for legal biasing can be utilized for dreaming, since the Herbrand fixed points in the absence of any assertion as to road configuration (i.e. no assertions relating to either road topology or to vehicle traffic using that topology) are simply a uniform set of possible worlds consistent with the legal constraints on road configurations in general (the LRM's logical axioms necessarily have only a nominal distinction between intentional rules and environmental-consistency rules).

G. Action selection loops

Action selection within the system is consequently organized in a hierarchical fashion. There are two distinct
G. Action selection loops

Action selection within the system is consequently organized in a hierarchical fashion. There are two distinct action-selection modules, acting at the symbolic (the LRM) and sub-symbolic (physical) levels of description. The levels are differentiated firstly by their differing inputs, and secondly by the differing timescales over which their decisions are made.

The higher-level action-selection loop takes the outputs of the logical reasoning module (LRM) as its inputs. At each time step, the high-level action-selection module first assigns scalar weights representing the "desirability" of each of the LRM's bounding boxes. These weights are learned, thereby enabling the agent to learn long-term strategies. Once the high-level action-selection loop has concluded its decision-making process, the conclusion can be passed down to the lower levels.

Neurally-inspired action selection within the agent hence takes the form of a computational model of the basal ganglia. In particular, it has been demonstrated that the basal ganglia could be performing a form of action selection known as multi-hypothesis sequential probability ratio testing (MSPRT) [6]. This algorithm sums evidence for each action over time, and finds the log likelihood that each channel is drawn from a distribution with a higher mean than the other channels. Once the log likelihood crosses a threshold, the action becomes selected. The threshold has to be tuned such that some predetermined error rate is permitted. Subject to a few assumptions, the algorithm can be shown to be optimal in decision time, given a particular error rate.

The MSPRT process is readily neuralizable, and so amenable to back-propagative learning in relation to the system as a whole. In order to produce a fully end-to-end neuralization of the PA hierarchy it thus only remains to neuralize the LRM.
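As an illustration of the decision rule just described, the following is a minimal, non-neural Python sketch of MSPRT. It assumes the per-channel evidence is already expressed in log-likelihood units and uses a toy Gaussian evidence stream; the channel count, noise level, and threshold value are illustrative assumptions rather than parameters of the system described above.

```python
import numpy as np

def msprt_select(evidence, threshold):
    """Multi-hypothesis sequential probability ratio test (sketch).

    evidence  : array of shape (T, n_channels), one evidence sample per action
                channel at each time step (assumed to be in log-likelihood units).
    threshold : negative value close to zero; the closer to zero, the lower the
                permitted error rate and the longer the decision time.
    Returns (selected_channel, decision_time), or (None, T) if no decision is reached.
    """
    accumulated = np.zeros(evidence.shape[1])
    for t, sample in enumerate(evidence, start=1):
        accumulated += sample                       # sum evidence for each action over time
        # Log likelihood that each channel has the highest mean, relative to all
        # channels: a numerically stable log-softmax of the accumulators.
        m = accumulated.max()
        log_ratio = accumulated - (m + np.log(np.sum(np.exp(accumulated - m))))
        best = int(np.argmax(log_ratio))
        if log_ratio[best] > threshold:             # evidence boundary crossed: select action
            return best, t
    return None, evidence.shape[0]

# Toy usage: channel 2 has a slightly higher mean drift than the others.
rng = np.random.default_rng(0)
drift = np.array([0.0, 0.0, 0.2, 0.0])
samples = drift + 0.5 * rng.standard_normal((500, 4))
print(msprt_select(samples, threshold=-0.05))
```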
III. NEURALIZATION OF THE LRM

The direct translation of logic programs into artificial neural networks has a relatively long history. A standard approach to the neuralization of Horn clauses, using a local representation in which each (ground) atom corresponds to a single dedicated neuron, is exemplified by the Knowledge-Based Artificial Neural Network (KBANN) of Towell & Shavlik [7]. Networks of this type have been criticized as having a "propositional fixation": a finite neural network can represent only a finite number of ground atoms, and can therefore represent a logic program only over a finite base. A language with first-order syntax but only a finite alphabet of symbols is equivalent to propositional logic, because any universally quantified ("for all X") clause can be translated into finitely many propositional clauses of the same form, one for each possible value of X. Thus, networks of the KBANN type cannot implement "true" first-order logic programs, only a finite fragment of first-order logic.
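For readers unfamiliar with this construction, the sketch below illustrates the KBANN-style local encoding on two made-up propositional road rules: each clause becomes an AND-like hidden unit and each head atom an OR-like output unit, with weights and biases chosen so the network reproduces the rules. The atoms, rules, and weight value are hypothetical illustrations, not the LRM rule base or the parser described below.

```python
import numpy as np

# Toy propositional Horn rules over hypothetical road atoms (illustrative only).
atoms = ["red_light", "near_junction", "lane_clear", "must_stop", "may_change_lane"]
rules = [
    ("must_stop",       ["red_light", "near_junction"]),  # must_stop :- red_light, near_junction.
    ("may_change_lane", ["lane_clear"]),                   # may_change_lane :- lane_clear.
]
W = 4.0                                   # large shared weight on antecedent links
idx = {a: i for i, a in enumerate(atoms)}

# Hidden layer: one AND-like unit per clause; the bias is set so the unit only
# activates when all of its antecedents are active.
W_hid = np.zeros((len(rules), len(atoms)))
b_hid = np.zeros(len(rules))
for r, (head, body) in enumerate(rules):
    for atom in body:
        W_hid[r, idx[atom]] = W
    b_hid[r] = -W * (len(body) - 0.5)

# Output layer: one OR-like unit per head atom, fed by the clauses deriving it.
heads = sorted({head for head, _ in rules})
W_out = np.zeros((len(heads), len(rules)))
b_out = -0.5 * W * np.ones(len(heads))
for r, (head, _) in enumerate(rules):
    W_out[heads.index(head), r] = W

def infer(truth):
    """truth: dict atom -> 0/1 for the input atoms. Returns truth of the head atoms."""
    x = np.array([float(truth.get(a, 0)) for a in atoms])
    h = 1.0 / (1.0 + np.exp(-(W_hid @ x + b_hid)))      # clause (AND) units
    y = 1.0 / (1.0 + np.exp(-(W_out @ h + b_out)))      # head (OR) units
    return {head: int(v > 0.5) for head, v in zip(heads, y)}

print(infer({"red_light": 1, "near_junction": 1, "lane_clear": 0}))
# -> {'may_change_lane': 0, 'must_stop': 1}
```

Because every atom here is a ground propositional fact, the construction also makes the "propositional fixation" criticism concrete: a universally quantified rule would require one such clause unit per possible grounding.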
Hölldobler et al. [8] give a neural method (in fact a precise duality) that replicates the immediate-consequence operator T_P of a true first-order logic program, to a desired degree of accuracy in a real embedding. However, this may involve a thousand or a million copies of a clause, one for each possible grounding of a variable X.

In neuralizing a logic program it is thus desirable to maintain the concept that there is a single universal rule, rather than a thousand unconnected rules, each of which might be only very weakly evidentially supported on its own, and which collectively involve a combinatorial explosion in the number of neurons.

We thus construct a "neural logic programme parser" that simplifies the LRM logical programmes via a three-fold strategy:

1. Appropriate thresholding considerations w.r.t. single-predicate clauses potentially reduce the mid-layer neuronal budget by orders of magnitude; this also naturally gives rise to a more pyramidal, CNN-like hierarchy.
2. Assertion of facts can be accommodated straightforwardly to reduce the input-layer size.
3. Explicit observance of rule subsumption.

Applying these three strategies very naturally results in a deep neural network structure, moreover one for which the layers intrinsically form a PA-hierarchy (since the underlying LRM logical rule base is constructed so as to respect PA bijectivity). Furthermore, there is intrinsic parameter-sharing amongst certain of the resulting network's weights. Consequently, during training, all deep-learning tools appropriate for back-propagation within convolutional neural networks can be applied. Critically, FOL syntax is retained during training, irrespective of the network architecture of the sub-symbolic levels. We thus obtain an end-to-end trainable deep network for implementing a perception-action hierarchy-based frontal cortex model within an autonomous driving context.

Acknowledgment: This work was supported by the EU H2020 project Dreams4Cars, grant number 731593.

REFERENCES

[1] P. Cisek, "Cortical mechanisms of action selection: the affordance competition hypothesis," Philos. Trans. R. Soc. B Biol. Sci., vol. 362, no. 1485, pp. 1585–1599, Sep. 2007.
[2] G. Pezzulo and P. Cisek, "Navigating the Affordance Landscape: Feedback Control as a Process Model of Behavior and Cognition," Trends Cogn. Sci., vol. 20, no. 6, pp. 414–424, Jun. 2016.
[3] K. Meyer and A. Damasio, "Convergence and divergence in a neural architecture for recognition and memory," Trends Neurosci., vol. 32, no. 7, pp. 376–382, Jul. 2009.
[4] D. Windridge, "Emergent Intentionality in Perception-Action Subsumption Hierarchies," Front. Robot. AI, vol. 4, Aug. 2017.
[5] D. Windridge and S. Thill, "Representational fluidity in embodied (artificial) cognition," Biosystems, vol. 172, pp. 9–17, Oct. 2018.
[6] R. Bogacz and K. Gurney, "The basal ganglia and cortex implement optimal decision making between alternative actions," Neural Comput., vol. 19, no. 2, pp. 442–477, 2007.
[7] G. Towell and J. W. Shavlik, "Knowledge-based artificial neural networks," Artificial Intelligence, vol. 70, no. 4, 1994.
[8] S. Hölldobler, F. Kurfess, and H.-P. Störr, "Approximating the semantics of logic programs by recurrent neural networks," Applied Intelligence, vol. 11, no. 1.
Following Social Groups: Socially Compliant Autonomous
Navigation in Dense Crowds
Xinjie Yao1, Ji Zhang2 and Jean Oh2
Abstract— In densely populated environments, socially compliant navigation is critical for autonomous robots, as driving close to people is unavoidable. This manner of social navigation is challenging given the constraints of human comfort and social rules. Traditional methods based on hand-crafted cost functions to achieve this task have difficulty operating in the complex real world. Other learning-based approaches fail to address the naturalness aspect from the perspective of collective formation behaviors. We present an autonomous navigation system capable of operating in dense crowds and utilizing information about social groups. The underlying system incorporates a deep neural network to track social groups and join the flow of a social group to facilitate the navigation. A collision avoidance layer in the system further ensures navigation safety. In experiments, our method generates socially compliant behaviors comparable to state-of-the-art methods. More importantly, the system is capable of navigating safely in a densely populated area (10+ people in a 10 m × 20 m area) following crowd flows to reach the goal.

I. INTRODUCTION

The ability to safely navigate in populated scenes, e.g. airports, shopping malls, and social events, is essential for autonomous robots. The difficulty comes from the fact that people walk close to the robot, cutting in front of it or between the robot and the goal point. The safety margin for the robot to drive in crowded scenes is pushed to the minimum. In such a case, the navigation system has to trade off between driving safely close to people and reaching the goal quickly. Furthermore, a previous study of socially compliant navigation [1] states three aspects of robot behavior: comfort as the absence of annoyance and stress for humans interacting with robots, naturalness as the similarity between robot and human behaviors, and sociability as abiding by general cultural conventions. Among these three aspects, the first essentially reflects the safety of the navigation.

Previous studies on socially compliant navigation attempt to solve the problem with various methods, including data-driven approaches for human trajectory prediction [2], [3], potential field-based [4] and social force model-based [5] approaches. In particular, reinforcement learning-based methods use reward functions to penalize improper robot behaviors, eliminating the cause of discomfort [6], [7]. Inverse reinforcement learning-based methods learn from expert demonstrations [8]. These methods are hard to generalize because a large set of comprehensive expert demonstrations is hard to acquire.

1 X. Yao is with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology. Email: [email protected]
2 J. Zhang and J. Oh are with the Robotics Institute, Carnegie Mellon University. Emails: [email protected], [email protected]

The study of this paper is based on our previous work which uses deep learning to solve the socially compliant navigation problem [9]. This paper extends that work in two ways. First, we consider the finding from a previous study [10] that 70% of people walk in social groups. Crowd behavior can be summarized as flows of social groups, and humans tend to move along the flow. It is our understanding that the behavior of joining a flow that shares a similar heading direction is more socially compliant, causing fewer collisions and disturbances to surrounding pedestrians. Our method recognizes social groups and selects the flow to follow. Second, we ensure safety with a multi-layer navigation system. In this system, a deep learning-based global planning layer makes high-level socially compliant behavioral decisions while a geometry-based local planning layer handles collision avoidance at a low level.

The paper is further related to previous work on modeling aggregate interactions among social groups [10] and leveraging learned social relations in tracking group formations [11]. Our main contributions are a deep learning-based method for socially compliant navigation with an emphasis on tracking and joining the crowd flow, and an overall system integrated with the deep learning method capable of safe autonomous navigation in dense crowds.

II. METHOD

A. System Overview

Fig. 1: Navigation software system diagram.

Fig. 1 gives an overview of the autonomous navigation system, which consists of three subsystems as follows.

• State Estimation Subsystem involves a multi-layer data processing pipeline which leverages lidar, vision, and inertial sensing [12]. The subsystem computes the 6-DOF pose of the vehicle as well as registers laser scan data with the computed pose.
• Local Planning Subsystem is a low-level planning subsystem in charge of obstacle avoidance in the vicinity of
the vehicle. The planning algorithm involves a trajectory
library and computes collision-free paths for the vehicle
to navigate [13].
• Social Navigation Planning Subsystem takes in observations consisting only of pedestrians, obtained by subtracting the prior map. The subsystem tracks pedestrians in the surroundings of the vehicle, and then extracts the grouping information from the pedestrian walking patterns, with which it generates way-points (as input to the Local Planning Subsystem), leveraging Group-Navi GAN, a generative planning algorithm in an adversarial training framework based on a deep neural network, Navi-GAN [9].

B. Group-Navi GAN
Following the extended social force model [10], we propose Group-Navi GAN, a framework to jointly address the safety and naturalness aspects at a group's level. Group-Navi GAN is inspired by our previous work Navi-GAN [9], which models social forces at an individual's level. An intention-force generator in the Group-Navi GAN deep network models the driving force $\vec{f}_i^{\,0}$ for target agent $i$ to move toward the goal. A group-force generator models the repulsive force from other pedestrians $j$ as $\vec{f}_{ij}$ and the interaction force from other group members as $\vec{f}_i^{\,group}$. The joint output of the intention-force generator and group-force generator defines the path for the robot to navigate.

In the group-force generator, a group pooling module first associates the target agent to a group based on the motion information (see Fig. 2). Then, the group pooling module computes path adjustments which essentially guide the robot to follow the group. We apply a support vector machine classifier [14], trained by [11], to determine whether two agents belong to the same group. This uses the local spatio-temporal relation to cluster agents with similar motions based on the coherent motion indicators, i.e. the differences in walking speed, spatial locations, and headings.

Fig. 2: Group pooling module in the Group-Navi GAN deep network. The input of the module is the relative displacements of the surrounding pedestrians w.r.t. the target agent. The module associates the target agent to a group based on the motion information and outputs path adjustments for the robot to follow the group.

We use the following equation to aggregate the hidden state from $h_j^t$ to $h_j'^t$:

$$h_j'^t = I_{ij}[s_i = s_j] \cdot \cos(\theta_i - \theta_j) \cdot h_j^t \qquad (1)$$

where $I_{ij}[s_i = s_j]$ indicates whether the two agents are in the same group,

$$I_{ij}[s_i = s_j] = \begin{cases} 1, & \text{if } i \text{ and } j \text{ are in the same group} \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

and $\theta_i$ and $\theta_j$ are the agent headings. The resulting embedding $H_i^t$ of the hidden states $h_j'^t$ is computed as a row vector consisting of the element-wise maxima over all other agents. The embedding is further concatenated for decoding,

$$H_i'^t = [H_i^t, h_i^t, n_i] \qquad (3)$$

where $n_i$ is random noise drawn from $\mathcal{N}(0, 1)$.
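The following Python sketch shows one way to realize Eqs. (1)-(3) for a single target agent. The array shapes and the simple threshold test standing in for the SVM grouping classifier of [14], [11] are assumptions made for illustration; they are not the Group-Navi GAN implementation.

```python
import numpy as np

def same_group(pos_i, pos_j, vel_i, vel_j,
               speed_tol=0.5, dist_tol=2.0, heading_tol=np.pi / 6):
    """Heuristic stand-in for the SVM grouping classifier: thresholds on the
    coherent-motion indicators (walking speed, spatial, and heading differences)."""
    speed_diff = abs(np.linalg.norm(vel_i) - np.linalg.norm(vel_j))
    dist = np.linalg.norm(pos_i - pos_j)
    d_heading = np.arctan2(vel_i[1], vel_i[0]) - np.arctan2(vel_j[1], vel_j[0])
    d_heading = abs((d_heading + np.pi) % (2 * np.pi) - np.pi)   # wrap to [0, pi]
    return speed_diff < speed_tol and dist < dist_tol and d_heading < heading_tol

def group_pooling(i, hidden, pos, vel):
    """Eqs. (1)-(3): gate and weight the other agents' hidden states, take the
    element-wise maximum, and concatenate with agent i's own state and noise."""
    theta = np.arctan2(vel[:, 1], vel[:, 0])                     # agent headings
    gated = []
    for j in range(hidden.shape[0]):
        if j == i:
            continue
        I_ij = 1.0 if same_group(pos[i], pos[j], vel[i], vel[j]) else 0.0  # Eq. (2)
        gated.append(I_ij * np.cos(theta[i] - theta[j]) * hidden[j])       # Eq. (1)
    H_i = np.max(np.stack(gated), axis=0)                        # max over the other agents
    n_i = np.random.standard_normal(hidden.shape[1])             # noise ~ N(0, 1)
    return np.concatenate([H_i, hidden[i], n_i])                 # Eq. (3)

# Toy usage: three agents with 4-dimensional hidden states.
hidden = np.random.rand(3, 4)
pos = np.array([[0.0, 0.0], [0.5, 0.2], [8.0, -3.0]])
vel = np.array([[1.0, 0.1], [1.0, 0.0], [-1.0, 0.0]])
print(group_pooling(0, hidden, pos, vel).shape)   # -> (12,)
```

In the actual network the hidden states come from the trajectory encoder and the pooling is differentiable end-to-end; the explicit loop above is only meant to make the gating and max-pooling of Eqs. (1)-(3) concrete.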
III. EXPERIMENTS

A. Social Compliance Evaluation

We evaluate our method on two publicly available datasets: ETH [15] and UCY [16]. These datasets include rich social interactions in real-world scenarios. We follow the same evaluation methodology as the leave-one-out approach and the error metrics used in the prior work [3]:

1) Average Displacement Error: The average L2 distance between predicted way-points and ground-truth trajectories over the predicted time steps.
2) Final Displacement Error: The L2 distance between the predicted way-point and the true final position at the last predicted time step.
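These two metrics translate directly into code; a minimal version (the array shapes are assumptions) is:

```python
import numpy as np

def displacement_errors(pred, gt):
    """pred, gt: arrays of shape (T_pred, 2) with predicted way-points and
    ground-truth positions over the predicted time steps (here T_pred = 8)."""
    dists = np.linalg.norm(pred - gt, axis=1)   # L2 distance at each predicted step
    ade = float(dists.mean())                   # Average Displacement Error
    fde = float(dists[-1])                      # Final Displacement Error (last step)
    return ade, fde
```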
We compare against a linear regressor that only predicts straight paths, Social-GAN (SGAN) [3], and Navi-GAN [9]. We use the past eight time steps to predict the future eight time steps. As shown in TABLE I, our method yields considerable accuracy improvements for some of the datasets where rich group interactions are prevalent. In particular, UNIV and ZARA1 have more than 70% of the pedestrians moving in social groups, and thus our model performs better. Our model performs slightly worse than the state-of-the-art approaches with the ETH and HOTEL datasets due to the lack of social group interactions. Further, our method assumes the existence of a goal point for each person in the dataset; lacking precise goal-point information results in relatively low accuracy. In the next experiments, we will show results with author-collected data where the strength of our method is more obvious.

Metric  Dataset      Group Percentage   Linear   SGAN [3]   Navi-GAN [9]   Group-Navi GAN
ADE     ETH [15]     18%                0.84     0.60       0.95           1.33
        HOTEL [15]   19%                0.35     0.48       0.43           0.39
        UNIV [16]    73%                0.56     0.36       0.85           0.29
        ZARA1 [16]   70%                0.41     0.21       0.40           0.21
        ZARA2 [16]   69%                0.53     0.27       0.47           0.30
        AVG          50%                0.54     0.39       0.62           0.50
FDE     ETH [15]     18%                1.60     1.22       1.64           1.98
        HOTEL [15]   19%                0.60     0.95       0.74           0.93
        UNIV [16]    73%                1.01     0.75       1.36           0.68
        ZARA1 [16]   70%                0.74     0.42       0.66           0.40
        ZARA2 [16]   69%                0.95     0.54       0.72           0.85
        AVG          50%                0.98     0.78       1.02           0.96

TABLE I: Social compliance evaluation of Group-Navi GAN and other baseline approaches. Two error metrics, Average Displacement Error and Final Displacement Error, are reported (in meters) for t_obs = 8 and t_pred = 8. We manually count the number of pedestrians moving in social groups. Our method outperforms the prior work with the UNIV and ZARA1 datasets where social groups are richly available.
Fig. 3: Simulation results in a 10 m × 20 m area. The tests involve 18 people walking in 6 groups. Each group moves in a different
direction. The three columns present three representative cases. The first and second rows show screenshots of the simulation environment.
The coordinate frame indicates the robot. The goal point is marked as the magenta dot. The red dots are the tracked pedestrians using
laser scan data. The third row displays the trajectories of the pedestrians (gray and green) and the robot (yellow and red). The dots are
the start points and the star is the goal point of the robot. When using Navigation without Social Model, the robot produces the yellow
path. When using Navigation with Social Model, the robot follows the group in green color and produces the red path. A blue square is
labeled on each robot path where the corresponding screenshot is captured on the first and second rows. Specifically, on the first row, the
screenshots show the moments when the robot drives overly close to people due to not using the social model. On the second row, the
screenshots are taken while the robot follows a group during the navigation.
B. Group Following Evaluation

We further evaluate the method with a robot vehicle as shown in Fig. 4. The robot is equipped with a Velodyne Puck laser scanner for collision avoidance and pedestrian tracking. Our method is evaluated in two configurations: Navigation with Social Model refers to the full navigation system as shown in Fig. 1, and Navigation without Social Model has the Social Navigation Planning Subsystem removed. In the latter, the State Estimation Subsystem and the Local Planning Subsystem are directly coupled; the robot navigates directly toward the goal and uses the Local Planning Subsystem to avoid collisions locally.

Fig. 4: Experiment platform. A wheelchair-based robot carries a sensor pack on the top. The sensor pack consists of a Velodyne Puck laser scanner, a camera, and a low-grade IMU. The scan data is used for collision avoidance and pedestrian tracking. A laptop computer carries out all onboard processing.

We show results in both simulation and real-world experiments with pedestrian data collected by the robot. In simulation, we show scenarios with 18 people walking around the robot in 6 groups. In real-world experiments, we have 6 people walking in 2 groups. One group moves along the robot navigation direction and the other group moves in the opposite direction. The results are shown in Fig. 3 and Fig. 5. In each scenario, the robot selects a group to follow with the full navigation system (Navigation with Social Model). If using Navigation without Social Model, the robot drives directly toward the goal, which results in interactions with groups moving in other directions.

Finally, we conduct an Amazon Mechanical Turk (AMT) study to further understand the safety and naturalness of the robot navigation. A total of 466 participants evaluate the simulation and real-world results. As shown in Table II, > 90% of the participants consider Navigation without Social Model to be unsafe (with collisions), while the ratio reduces to < 40% using Navigation with Social Model.
With the real-world results, 95% of the participants report that the robot forces other pedestrians to change their paths if using Navigation without Social Model. When using Navigation with Social Model, the ratio reduces to 4%. The survey result validates that our method helps reduce disturbances to other pedestrians as well as improves safety of the navigation. A video of these results can be seen at www.youtube.com/watch?v=I_SkA9rmxYE.

Fig. 5: Real-world experiments in a 10 m × 20 m area. (a) Without Social Model; (b) With Social Model. The first row shows photos of 6 people walking in 2 groups. One group moves along the robot navigation direction and the other group moves in the opposite direction. The second row shows the corresponding trajectories of the people (blue and green) and the robot (orange). Dots indicate the start points and the star indicates the goal point of the robot. In (a), when using Navigation without Social Model, the robot drives directly toward the goal point and results in cutting through the group on the left that moves against the robot. In (b), when using Navigation with Social Model, the robot follows the group on the right and avoids disturbances to the pedestrians.

Metric                      Scene        Without Social Model   With Social Model
Collision (Safety)          (1)          97%                    42%
                            (2)          92%                    6%
                            (3)          92%                    36%
                            AVG          93%                    28%
Path Change (Naturalness)   Real world   95%                    4%

TABLE II: Results of survey study. A total of 466 participants evaluate the simulation results in Fig. 3 and the real-world results in Fig. 5. We can see that > 90% of the participants consider the Navigation without Social Model to have collisions. For Navigation with Social Model, the ratio reduces to < 40%. Further, 95% of the participants report that the robot forces other pedestrians to change their paths if using Navigation without Social Model. When using Navigation with Social Model, the ratio reduces to 4%. The ratios reduce by about 3 times in terms of collision and 20 times in terms of path change, which validates that our method helps reduce disturbances to other pedestrians as well as improves safety.

IV. CONCLUSION

The paper proposes an autonomous navigation system capable of operating in dense crowds. In this system, a Social Navigation Planning Subsystem incorporating a deep neural network generates socially compliant behaviors. This involves a group pooling mechanism that infers social relationships to encourage the autonomous navigation to join the flow of a social group sharing the same moving direction. We show the effectiveness of our method through quantitative and empirical studies in both simulation and real-world experiments. The result is that by joining the crowd flow, the robot has fewer collisions with people crossing sideways or walking toward the robot. Joining the flow also creates fewer disturbances to the pedestrians. As a result, the robot navigates in a safe and natural manner. Since this paper focuses on human-robot interactions at a group's level, future extensions of this work can model interactions between groups and scattered individuals.

ACKNOWLEDGMENT

Special thanks are given to C.-E. Tsai, Y. Song, and D. Zhao for facilitating experiments.

REFERENCES

[1] T. Kruse, A. K. Pandey, R. Alami, and A. Kirsch, "Human-aware robot navigation: A survey," Robotics and Autonomous Systems, vol. 61, no. 12, pp. 1726–1743, Dec. 2013.
[2] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, F.-F. Li, and S. Savarese, "Social LSTM: Human trajectory prediction in crowded spaces," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[3] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, "Social GAN: Socially acceptable trajectories with generative adversarial networks," arXiv preprint arXiv:1803.10892, 2018.
[4] F. Hoeller, D. Schulz, M. Moors, and E. F. Schneider, "Accompanying persons with a mobile robot using motion prediction and probabilistic roadmaps," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2007.
[5] D. Helbing and P. Molnar, "Social force model for pedestrian dynamics," Physical Review E, vol. 51, 1995.
[6] Y. F. Chen, M. Everett, M. Liu, and J. P. How, "Socially aware motion planning with deep reinforcement learning," arXiv preprint arXiv:1703.08862, 2018.
[7] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, "Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning," arXiv preprint arXiv:1809.08835, 2019.
[8] H. Kretzschmar, M. Spies, C. Sprunk, and W. Burgard, "Socially compliant mobile robot navigation via inverse reinforcement learning," The International Journal of Robotics Research, vol. 35, no. 11, 2016.
[9] C.-E. Tsai, "A generative approach for socially compliant navigation," Master's thesis, Pittsburgh, PA, June 2019.
[10] M. Moussaïd, N. Perozo, S. Garnier, D. Helbing, and G. Theraulaz, "The walking behaviour of pedestrian social groups and its impact on crowd dynamics," arXiv preprint arXiv:1003.3894, 2010.
[11] T. Linder and K. O. Arras, "Multi-model hypothesis tracking of groups of people in RGB-D data," in Proc. of the 17th Intl. Conf. on Information Fusion, 2014.
[12] J. Zhang and S. Singh, "Random field topic model for semantic region analysis in crowded scenes from tracklets," in Journal of Field Robotics, vol. 35, no. 8, 2018.
[13] J. Zhang, C. Hu, R. Gupta Chadha, and S. Singh, "Maximum likelihood path planning for fast aerial maneuvers and collision avoidance," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2019.
[14] C. Cortes and V. N. Vapnik, "Support-vector networks," Machine Learning, vol. 20, 1995.
[15] S. Pellegrini, A. Ess, and L. Van Gool, "Improving data association by joint modeling of pedestrian trajectories and groupings," in Computer Vision - ECCV, 2010.
[16] L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese, "Improving data association by joint modeling of pedestrian trajectories and groupings," in CVPR, 2014.