Deep Learning For Sensor Fusion
by
SHAUN M. HOWARD
August, 2017
CASE WESTERN RESERVE UNIVERSITY
Shaun M. Howard
candidate for the degree of Master of Science*.
Committee Chair
Dr. Wyatt S. Newman
Committee Member
Dr. Murat Cenk Cavusoglu
Committee Member
Dr. Michael S. Lewicki
Date of Defense
Contents

List of Tables
Abstract

1 Introduction
  1.1 Motivation
  1.2 Sensors
  1.3 Sensor Fusion
  1.4 Current Sensor Fusion Systems
  1.5 Neural Networks
    1.5.1 Perceptron
    1.5.2 Feedforward Neural Networks
    1.5.3 Activation Functions
    1.5.4 Training with Backpropagation
    1.5.5 Recurrent Neural Networks
    1.5.6 Training with Backpropagation through time
  1.6 Deep Learning
  1.7 Network Tuning and Evaluation Techniques

2 Methods
  2.1 Sensor Data
  2.2 Camera
  2.3 Radar
  2.4 Utilized Sensor Streams
  2.5 Problem Formulation
    2.5.1 Correlation of Consistent Sensor Tracks
    2.5.2 Notion of Sensor Data Consistency
    2.5.3 Notion of Sensor Stream Comparison for Fusion
  2.6 Data Consumption, Preprocessing and Labeling
    2.6.1 Reading and Processing of Big Data
    2.6.2 Preprocessing Data for Neural Networks
    2.6.3 Labeling of Data and Dataset Creation for Consistency
    2.6.4 Labeling of Data and Dataset Creation for Sensor Fusion
  2.7 Sensor Consistency Deep Neural Network
    2.7.1 Sensor Consistency Deep Neural Network Design
    2.7.2 Sensor Consistency Network Training Method
    2.7.3 Sensor Consistency Network Training Cycles and Data Partitioning
  2.8 Sensor Consistency Network Validation Cycles and Metrics
  2.9 Sensor Fusion Deep Neural Network
    2.9.1 Sensor Fusion Deep Neural Network Design
    2.9.2 Sensor Fusion Network Training Method
    2.9.3 Sensor Fusion Network Training Cycles and Data Partitioning
  2.10 Sensor Fusion Network Validation Cycles and Metrics

3 Results
  3.1 Correlation of Consistent Sensor Tracks
  3.2 Sensor Consistency Neural Network Training
    3.2.1 Training Errors vs. Time
  3.3 Sensor Consistency Neural Network Validation Results
  3.4 Sensor Fusion Dataset Labels
    3.4.1 State-of-the-art Sensor Fusion Labels Statistics
    3.4.2 Generated Sensor Fusion Labels Statistics
  3.5 Sensor Fusion Neural Network Training
    3.5.1 Training Errors vs. Time
  3.6 Sensor Fusion Neural Network Validation Results
    3.6.1 State-of-the-art Fusion Dataset Validation Results
    3.6.2 Sensor Fusion Neural Network Precision and Recall Validation Results
  3.7 Deep Learning Sensor Fusion System Evaluation
    3.7.1 Agreement between Deep Learning Fusion System and State-of-the-art Fusion System
  3.8 Deep Learning Sensor Fusion System Runtime Results

4 Discussion
  4.1 Sensor Data Consistency and Correlation
  4.2 Sensor Consistency Neural Network Training Results
  4.3 Sensor Consistency Neural Network Validation Results
  4.4 Sensor Fusion Neural Network Training Results
  4.5 Sensor Fusion Neural Network Validation Results

5 Conclusion

Appendices
  .1 Most Restrictive Fusion Dataset Validation Results
  .2 Mid-Restrictive Fusion Dataset Validation Results
  .3 Least Restrictive Fusion Dataset Validation Results
  .4 Maximum Threshold Fusion Dataset Validation Results
List of Tables

2.1 Sensor Consistency Neural Network Encoder and Decoder Layer Types
2.2 Sensor Fusion Neural Network Encoder and Decoder Layer Types
3.9 Confusion Matrix Between Deep Learning Fusion System with Fusion RNN and State-of-the-art (SOA) Fusion System on Second Most Populated Radar Track.
3.10 Confusion Matrix Between Deep Learning Fusion System with Fusion GRU and State-of-the-art (SOA) Fusion System on Most Populated Radar Track.
3.11 Confusion Matrix Between Deep Learning Fusion System with Fusion GRU and State-of-the-art (SOA) Fusion System on Second Most Populated Radar Track.
3.12 Confusion Matrix Between Deep Learning Fusion System with Fusion LSTM and State-of-the-art (SOA) Fusion System on Most Populated Radar Track.
3.13 Confusion Matrix Between Deep Learning Fusion System with Fusion LSTM and State-of-the-art (SOA) Fusion System on Second Most Populated Radar Track.
3.14 Deep Learning Fusion System Runtime on Intel CPU with Serial, Parallel and Batch Processing.
3.15 Deep Learning Fusion System Runtime with Intel CPU and NVIDIA GPU with Serial, Parallel and Batch Processing.
3.16 Deep Learning Fusion System Runtime on RaspberryPi 3 CPU with Serial, Parallel and Batch Processing.
List of Figures

1.5 The gated recurrent unit (GRU) cell proposed in [Chung et al., 2014]. In this diagram, r is the reset gate, z is the update gate, h is the activation, and ĥ is the candidate activation.
2.6 A time-series sample of fusion neural network input and output, which consist of the absolute differences between fused radar and camera track streams belonging to one consistent moving object over 25 consecutive 40 ms time steps and a corresponding binary fusion label sequence.
2.7 The multi-stream sequence-to-sequence (seq2seq) consistency sequence deep neural network. In experiments, the encoder and decoder layers were selected from table 2.1.
2.8 The multi-stream sensor fusion classification deep neural network. In experiments, the encoder and decoder layers were selected from table 2.2.
3.1 An example of the train and test error over time per one complete cycle of consistency MLP network training.
3.2 An example of the train and test error over time per one complete cycle of consistency recurrent neural network (RNN) network training.
3.3 An example of the train and test error over time per one complete cycle of consistency GRU network training.
3.4 An example of the train and test error over time per one complete cycle of consistency LSTM network training.
3.5 MSE, RMSE and coefficient of determination (r2-metric) performance for the consistency MLP neural network on the test dataset across 9 bootstrapping train/test trials from over 1 million randomly-selected samples. The average score and its standard deviation are included in the title of each subplot.
3.6 MSE, RMSE and r2-metric performance for the consistency RNN neural network on the test dataset across 9 bootstrapping train/test trials from over 1 million randomly-selected samples. The average score and its standard deviation are included in the title of each subplot.
3.7 MSE, RMSE and r2-metric performance for the consistency GRU neural network on the test dataset across 9 bootstrapping train/test trials from over 1 million randomly-selected samples. The average score and its standard deviation are included in the title of each subplot.
3.8 MSE, RMSE and r2-metric performance for the consistency LSTM neural network on the test dataset across 9 bootstrapping train/test trials from over 1 million randomly-selected samples. The average score and its standard deviation are included in the title of each subplot.
3.9 An example of the train and test error over time per one complete cycle of fusion MLP network training on the most restrictive fusion dataset.
3.10 An example of the train and test error over time per one complete cycle of fusion RNN network training on the most restrictive fusion dataset.
3.11 An example of the train and test error over time per one complete cycle of fusion GRU network training on the most restrictive fusion dataset.
3.12 An example of the train and test error over time per one complete cycle of fusion LSTM network training on the most restrictive fusion dataset.
3.13 Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC micro and ROC AUC macro scores for the MLP fusion neural network across 9 bootstrapping trials of random train/test data selection from over 1 million randomly-selected samples via the state-of-the-art fusion dataset. The average score and its standard deviation are included in the title of each subplot.
3.21 A sample true positive fusion agreement between the deep learning sensor fusion system and the state-of-the-art sensor fusion system over a one-second period. The fused radar and camera track streams are plotted against each other to demonstrate the fusion correlation.
3.22 A sample false positive fusion disagreement between the deep learning sensor fusion system and the state-of-the-art sensor fusion system over a one-second period. The radar and camera track streams are plotted against each other to demonstrate the potential fusion correlation where the systems disagreed.
3.23 A sample true negative fusion agreement between the deep learning sensor fusion system and the state-of-the-art sensor fusion system over a one-second period. The non-fused radar and camera track streams are plotted against each other to demonstrate the absence of valid data in the camera track while the radar track is consistently populated, thus leading to non-fusion.
3.24 A sample false negative fusion disagreement between the deep learning sensor fusion system and the state-of-the-art sensor fusion system over a one-second period. The radar and camera track streams are plotted against each other to demonstrate the potential fusion correlation where the systems disagreed.
List of Acronyms

BP backpropagation.
FC fully-connected.
ID identification.
ms millisecond.
seq2seq sequence-to-sequence.
Deep Learning for Sensor Fusion
Abstract
by
SHAUN M. HOWARD
Chapter 1
Introduction
1.1 Motivation
The application of deep learning for sensor fusion is important to the future of self-
driving vehicles. Autonomous vehicles must be able to accurately depict and predict
their surroundings with enhanced robustness, minimal uncertainty and high reliabil-
ity. Given the lack of human interaction in these vehicles, they need to be self-aware
and resilient in order to deal with their surroundings intelligently and handle errors
logically. In order to determine the effectiveness of deep learning for the sensor fusion
task, this research seeks to evaluate the applicability, success rate, and performance of
novel deep learning techniques in order to learn and provide the complex capabilities
of a preexisting state-of-the-art sensor fusion system.
1.2 Sensors
• Spatial coverage limits: Each sensor may only cover a certain region of space.
For example, a dashboard camera will observe less surrounding region than a
camera with a wide-view lens.
• Temporal coverage limits: Each sensor may only provide updates over cer-
tain periods of time, which may cause uncertainty between updates.
Many different types of sensors exist in the world, and each has its own unique ap-
plication. Four sensors are pertinent to the field of autonomous driving [Levinson et al., 2011].
The four sensors, their descriptions, uses, advantages and disadvantages are men-
tioned below:
• Radar: RAdio Detection And Ranging (radar) is a remote sensing device that
uses an antenna to scatter radio signals across a region in the direction it is
pointing and listens for response signals that are reflected by objects in that area.
Radar measures signal time of flight to determine the distance. Radars may use
the Doppler effect to compute speed based on the shift in frequency of scattered
waves as an object moves. Radar is useful for detecting obstacles, vehicles and
pedestrians around a vehicle [Huang et al., 2016]. Tracking multiple targets at
once is a primary use for an automotive radar.
1.3 Sensor Fusion
Sensor fusion is the act of combining data acquired from two or more sensor sources such that the resulting combination of sensory information provides a more certain description of the observed factors than the separate sensors would provide if used individually [Elmenreich, 2001]. Sensor fusion is pertinent in many applications that entail
the use of multiple sensors for inference and control. Examples of applications in-
clude intelligent and automated systems such as automotive driver assistance systems,
autonomous robotics, and manufacturing robotics [Elmenreich, 2007].
Sensor fusion methods aim to solve many of the problems inherently present in
sensors. Several important benefits may be derived from sensor fusion systems over
single or disparate sensor sources. The benefits of sensor fusion over a single source are
the following [Elmenreich, 2001]:
• Extended spatial coverage: Each sensor may cover different areas. Com-
bining the covered areas will lead to a greater overall coverage of surrounding
environment and accommodate sensor deprivation.
According to [Rao, 2001], sensor fusion can yield results that outperform the mea-
surements of the single best sensor in the system if the fusion function class satisfies a
proposed isolation property based on independently and identically distributed (iid)
samples. The fusion function classes outlined that satisfy this isolation property
are linear combinations, certain potential functions and feed-forward piecewise linear
neural networks based on minimization of empirical error per sample.
1.4 Current Sensor Fusion Systems
Many researchers and companies have developed their own versions of sensor fusion systems for various purposes. Many of these systems are well-known and widely used in practice within the automotive and robotics fields. In [Steux et al., 2002], a vehicle detection and tracking system called FADE was proposed, which fuses monocular color vision and radar data using a 3-layer belief network. The fusion system
focused on lower-level fusion and combined 12 different features to generate target
position proposals at each time step for each target. FADE performed in real-time and
yielded good detection results in most cases according to scenarios recorded in a real
car.
A fusion system for collision warning using a single camera and radar was applied
to detect and track vehicles in [Srinivasa et al., 2003]. The detections were fused using
a probabilistic framework in order to compute reliable vehicle depth and azimuth
angles. Their system clustered object detections into meta-tracks for each object and
fused object tracks between the sensors. They found that the radar had many false
positives due to multiple detections on large vehicles, structures, roadway signs and
overhead structures. They also found that the camera had false positive detections on
larger vehicles and roadway noise like potholes. Their system worked appropriately
for nearby vehicles that were clearly visible by both sensors, but the system failed to
detect vehicles more than 100 meters away due to insufficient resolution or vehicle
occlusion.
In [Dagan et al., 2004], engineers from Mobileye successfully applied a camera
system to compute the time to collision (TTC) from the size and position of
vehicles in the image. Although they did not test this theory, they mentioned the
future use of radar and camera in a sensor fusion system since the radar would
give more accurate range and range-rate measurements while the vision would solve
angular accuracy problems of the radar. When the research was conducted, it was
suggested that the fusion solution between radar and camera was costly, but since
then, costs have decreased.
A collision mitigation fusion system using a laser-scanner and stereo-vision was
constructed and tested in [Labayrade et al., 2005]. The combination of the complementary laser scanner and stereo-vision sensors provided a high detection rate, low
false alarm rate, and a system reactive to many obstacle occurrences. They men-
tioned that the laser-scanner was fast and accurate but could not be used alone due
to many false alarms from collisions with the road surface and false detections with
laser passes over obstacles. They also mentioned that stereo-vision was useful for
modeling road geometry and obstacle detection, but it was not accurate for comput-
ing precise velocities or TTC for collision mitigation.
In [Laneurit et al., 2003], a Kalman filter was successfully developed and applied
for the purpose of sensor fusion between multiple sensors including GPS, wheel angle
sensor, camera and LiDAR. They showed that this system was useful for detection
and localization of vehicles on the road, especially when using the wheel angle sensor
for detecting changes in vehicle direction. Their results revealed that cooperation
between the positioning sensors for obstacle detection and location paired with LiDAR
were able to improve global positioning of vehicles.
A deep learning framework for signal estimation and classification applicable for
mobile devices was created and tested in [Yao et al., 2016]. This framework applied
convolutional and recurrent layers for regression and classification mobile comput-
ing tasks. The framework exploited local interactions of different sensing modalities
using convolutional neural networks (CNNs), merged them into a global interaction
and extracted temporal relationships via stacked GRU or LSTM layers. Their frame-
work achieved a notable mean absolute error on vehicle tracking regression tasks as
compared to existing sensor fusion systems and high accuracy on human activity
recognition classification tasks while it remained efficient enough to use on mobile
devices like the Google Nexus 5 and Intel Edison.
A multimodal, multi-stream deep learning framework designed to tackle egocentric activity recognition using data fusion was proposed in [Song et al., 2016b]. To
begin, they extended a multi-stream CNN to learn spatial and temporal features from
egocentric videos. Then, they proposed a multi-stream LSTM architecture to learn
features from multiple sensor streams including accelerometer and gyroscope. Third,
they proposed a two-level fusion technique using SoftMax classification layers and dif-
ferent pooling methods to fuse the results of the neural networks in order to classify
egocentric activities. The system performed worse than a hand-crafted multi-modal
Fisher vector, but it was noted that hand-crafted features tended to perform better
on smaller datasets. In review of the research, it seems there were limited amounts of
data, flaws in the fusion design with SoftMax combination and flaws in the sensors,
such as limited sensing capabilities. These factors all may have led to worse results
than hand-crafted features on the utilized dataset.
In [Wu et al., 2015], a multi-stream deep fusion neural network system using con-
volutional neural networks and LSTM layers was applied to classify multi-modal tempo-
ral stream information in videos. Their adaptive multi-stream fusion system achieved
an accuracy level much higher than other methods of fusion including averaging, ker-
nel averaging, multiple kernel learning (MKL), and logistic regression fusion methods.
1.5 Neural Networks
1.5.1 Perceptron
The perceptron is classically known as the “fundamental probabilistic model for in-
formation storage and organization of the brain” [Rosenblatt, 1958]. It is the key
to modern machine learning applications. It was developed by an American psy-
chologist, Frank Rosenblatt, during his time at the Cornell Aeronautical Labora-
tory in 1957. The perceptron is a simplified form of a hypothetical nervous sys-
tem designed to emphasize central properties of intelligent systems that could learn
without involving the unknown circumstances often portrayed by biological organ-
isms [Rosenblatt, 1958]. The perceptron was developed based on early brain models
which responded to stimuli sequences in order to perform certain algorithms using
elements of symbolic logic and switching theory. Some of its predecessors were logi-
cal calculus models that sought to characterize nervous activity using propositional
logic as in [Mcculloch and Pitts, 1943] and a finite set of “universal” elements as in
[Minsky, 1956].
The perceptron differed from existing models, such as the deterministic systems of [Mcculloch and Pitts, 1943], [Kleene, 1956], and [Minsky, 1956], in that it modeled imperfect neural networks with many random connections, like the brain. The type of logic and algebra used to create the previous models was unsuitable for describing such imperfect networks, and this shortcoming motivated the perceptron, which instead applied probability theory to describe such behaviors. The perceptron theory was based on work derived from
[Hebb, 1949] and [Hayek, 1952], who were much more concerned with the learning
mechanisms of biological systems.
The theory on which the perceptron was based is summarized in the following:
• Physical nervous system connections used in learning and recognition are not
identical between organisms.
• The system of connected cells may change, and the probability that a stimulus
applied to one set of cells will cause a response in another set will change over
time.
Figure 1.1: A perceptron model with n real-valued inputs each denoted as an element of
the vector X along with a bias input of 1. There is a weight parameter along each input
connection which leads to the dot product of the combined inputs and weights including
the bias term and weight. The weights belong to the vector W, which is adjusted to tune
the perceptron network. The output of the dot product is fed to an activation function
which determines whether the perceptron fires with a 1 on output or does not fire with a 0
on output.
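The thresholded dot product described in this figure can be written in a few lines. The following NumPy sketch is purely illustrative and is not drawn from the thesis software; the inputs and weights shown are hypothetical.

```python
import numpy as np

def perceptron_forward(x, w, b):
    """Fire (1) or not (0) based on the thresholded dot product of inputs and weights."""
    activation = np.dot(w, x) + b  # weighted sum, with b acting as the bias weight
    return 1 if activation > 0 else 0

# Hypothetical usage: a three-input perceptron with hand-picked weights.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, 0.2])
print(perceptron_forward(x, w, b=-0.3))
```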
Because the perceptron by itself is a single element, or layer, of a neural network, it is limited in the functions it can model. Since
the perceptron was designed to model a single decision boundary, it fails in multi-
class classification with greater than two linearly-separable classes. One of the initial
drawbacks found in the single-layer perceptron model was that it could not solve
the exclusive OR (XOR) logic problem [Minsky and Papert, 1969]. This publication,
despite its questionable assumptions, led to a great decrease in neural network re-
search funding and interest until the 1980s. In order to solve this problem, it was
found that multiple perceptrons could be chained together such that the outputs of
the first perceptron layer could be fed-forward as the inputs of the second percep-
tron layer in order to solve the problem of multi-class classification. This multi-layer
network configuration, now known as the MLP, was proven to easily solve the XOR
problem and even more complex mathematical functions in several works including
[Grossberg, 1973] and [Hornik et al., 1989]. A modern MLP is shown in figure 1.2.
The MLP came to belong to a class of neural networks called feedforward neural
networks. These biologically-inspired perceptron models helped form the foundation
of modern machine learning techniques which are successfully used in many practical
applications today.
the preceding and following layers, if either are present. The input is given at the first
visible layer. The output is generated by the last visible layer. The layers in-between
are known as hidden layers. A MLP with a single hidden layer is shown in figure
1.2. Layers are called “hidden” because the values used and created by these layers
are not provided as input data, but rather they are generated by abstract, weighted
mathematical functions contained within the network. More hidden neurons, which
often entail more hidden layers, allow the network to generate more abstract repre-
sentations and avoid over-fitting any specific examples, which is a desirable property
for generalizing across a large feature space proportional to the number of training
examples [Hornik et al., 1989]. However, too many hidden neurons or layers may slow
training time, increase training difficulty given increased metric entropy, or drown out
notable features when the learning space or number of training instances is small.
Figure 1.2: A single-hidden layer MLP neural network. Each neuron in each layer has
its own activation function. Each layer of neurons will typically have the same activation
function. Each neuron in each layer is connected to each of the preceding and following layer
neurons by directed, non-cyclic, weighted connections, making the MLP a fully-connected
feedforward neural network. From the left, i real-valued inputs from the Xi vector are
fed-forward to the input layer. After passing through the input layer neuron activation
functions, j outputs included in the Xj vector are fed-forward to the first hidden layer
through weighted connections. The weights for each of these connections are kept in the
Wi,j vector. A weighted bias term is also fed to the hidden layer per each neuron. This
term helps the network handle zero-input cases. The hidden layer has j neurons. k outputs
from the hidden layer neurons, stored in the Xk vector, are fed-forward through weighted
connections to the output layer. The weights for these connections are stored in the Wj,k
vector. The output layer is also fed a weighted bias term. The output layer has k neurons.
The outputs from these neurons, stored in the O vector, are the final outputs generated by
the network.
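As a concrete illustration of the fully-connected feedforward computation described in the caption, the sketch below propagates an input vector through one hidden layer and one output layer with sigmoid activations. The layer sizes and random weights are assumptions for demonstration only, not the networks used later in this work.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W_ih, b_h, W_ho, b_o):
    """Forward pass of a single-hidden-layer MLP with sigmoid activations."""
    h = sigmoid(W_ih @ x + b_h)      # hidden layer activations
    return sigmoid(W_ho @ h + b_o)   # output layer activations

rng = np.random.default_rng(0)
x = rng.normal(size=4)                               # i = 4 inputs
W_ih, b_h = rng.normal(size=(8, 4)), np.zeros(8)     # j = 8 hidden neurons
W_ho, b_o = rng.normal(size=(2, 8)), np.zeros(2)     # k = 2 outputs
print(mlp_forward(x, W_ih, b_h, W_ho, b_o))
```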
There are several neuronal activation functions that can be applied in a neural net-
work. Most of these activation functions are non-linear in order to model a target
response that varies non-linearly with the explanatory input variables. In this sense,
non-linear combinations of functions generate output that cannot be reproduced by
a linear combination of the inputs. Some notable activation functions applied in this work are the following:
• Rectified Linear Unit (ReLU): The rectified linear unit activation function,
where x is a real-valued input vector, is the following:
f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases} \qquad (1.1)
• Sigmoid: The logistic sigmoid activation function is the following:

f(x) = \frac{1}{1 + e^{-x}} \qquad (1.3)
It has a range of (0, 1) and is a monotonic function that does not have a mono-
tonic derivative. It is, by characteristic, a sigmoidal function due to its S-shape.
• Hyperbolic Tangent (tanh): The hyperbolic tangent activation function is the following:

f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1 \qquad (1.5)
It has a range of (−1, 1), and it is a monotonic function that does not have a
monotonic derivative. It is, by characteristic, a sigmoidal function due to its
S-shape.
• SoftMax: The SoftMax function, which normalizes a K-dimensional vector of real values into a probability distribution, is the following:

\sigma(x)_j = \frac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}} \quad \text{for } j = 1, \ldots, K \qquad (1.7)
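The activation functions above translate directly into NumPy. The sketch below is a plain rendering of equations 1.1, 1.3, 1.5 and 1.7 for illustration; the max-shift in the SoftMax is a common numerical-stability convenience rather than part of the definition.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # equation 1.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # equation 1.3

def tanh(x):
    return np.tanh(x)                  # equation 1.5

def softmax(x):
    e = np.exp(x - np.max(x))          # shifted for numerical stability
    return e / e.sum()                 # equation 1.7

x = np.array([-2.0, 0.0, 1.5])
print(relu(x), sigmoid(x), tanh(x), softmax(x))
```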
1.5.4 Training with Backpropagation
In theory, the algorithm has two phases: propagation and weight update. The
phases are outlined below via [Rojas, 1996]:
• Phase 1: Propagation: The training input is propagated forward through the network to produce the output activations, and the resulting output error is then propagated backward to compute a delta, or error contribution, for each output and hidden neuron.
• Phase 2: Weight update: The following two steps must be followed for each
and every weight in the network: A. The output delta is multiplied by the
input activation of the weight to find the weight gradient. B. The weight is
decremented by a fraction of the weight gradient.
These phases are repeated until the network error is tolerable. In phase 2, the
fraction of weight gradient subtracted affects the quality and speed of the learning
algorithm. The learning rate is the fraction of weight gradient subtracted from a given
weight. The higher the fraction, the higher the learning rate. A general approach is
to increase the learning rate until the errors diverge, but more analytical procedures
have been developed [Werbos, 1990]. As the learning rate increases, so does the rate
at which the neuron learns, but there is an inverse relationship between learning rate
and learning quality [Thimm et al., 1996]. As the learning rate increases, the learning
quality decreases. The sign of the weight gradient determines how the error varies
with respect to the weight. The weight must be updated in the opposite direction in
order to minimize the error in the neuron [Werbos, 1990].
Gradient Descent
Objective function minimization is frequently posed in machine learning and often results
in a summation model such as:
O(\theta) = \frac{1}{n} \sum_{i=1}^{n} O_i(\theta) \qquad (1.8)

where O_i(\theta) is the objective evaluated on the i-th of the n training examples. The full-batch gradient descent update of the parameters \theta with learning rate \eta is then:

\theta := \theta - \eta \sum_{i=1}^{n} \frac{\nabla O_i(\theta)}{n} \qquad (1.9)
Stochastic gradient descent (SGD) is a technique based on the gradient descent op-
timization method for stochastically approximating the minimization of an objective
function formed by a sum of differentiable functions. The iterative SGD algorithm
computes updates on a per-example basis by approximating the gradient of O(θ)
using one example as follows [Bottou, 2010], [Ruder, 2016]:

\theta := \theta - \eta \, \nabla O_i(\theta)
This update is performed over all examples. After all example updates, the train-
ing epoch is complete. Upon the start of a new epoch, shuffling of the training exam-
ples may occur to prevent repetition, and the process is repeated until the optimal
weights are found.
Calculation of the gradient against a “mini-batch” of examples upon each step
leads to a balance between computing the gradient per a single example as in SGD
and computing the gradient across a large batch. Vectorization libraries tend to
optimize this method based on the vectorized batch update process. Essentially, the
mini-batch gradient descent algorithm is equation 1.9 computed on batches of smaller
sizes, b, than the total training dataset size of n [Ruder, 2016]. For example, if there
were n = 50 examples in the training set and a mini-batch size of b = 5 was used
for training, then there would be 10 mini-batch training updates to the weight vector
per each training epoch for the amount of epochs needed until error minimization.
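A minimal sketch of the mini-batch update loop described above, assuming a generic, user-supplied gradient function; it follows equation 1.9 restricted to batches of size b and is not the training code used in this work.

```python
import numpy as np

def minibatch_sgd(theta, grad_fn, X, Y, lr=0.01, batch_size=5, epochs=10):
    """Update parameters theta with averaged gradients over shuffled mini-batches."""
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)               # shuffle the examples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = np.mean([grad_fn(theta, X[i], Y[i]) for i in idx], axis=0)
            theta = theta - lr * grad                  # descend along the averaged gradient
    return theta
```

With n = 50 examples and batch_size = 5, the inner loop performs the 10 weight updates per epoch mentioned above.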
AdaGrad
RMSProp
\theta_{t+1} = \theta_t - v_{t+1}
1.5.5 Recurrent Neural Networks
An RNN is an ANN where information flows in many directions and has at least
one feed-back connection between neurons to form a directed cycle. This type of
network demonstrates dynamic temporal behavior because it forms an internal state
of memory useful for learning time-dependent sequences. One of the simplest forms
of recurrent neural networks is a MLP with outputs from one hidden layer feeding
back into the same hidden layer over a discretized time scale set by a delay unit. A
simple RNN, recognized as the Elman RNN proposed in [Elman, 1990], is pictured in
figure 1.3. The weights along the edges from hidden units to context units are held
constant at 1, while the context unit outputs feed back into the same hidden units
along standard, weighted edges. Fixed-weight recurrent edges laid the foundation
for future recurrent networks such as the LSTM [Lipton et al., 2015]. Additionally,
it has been shown that RNNs outperform MLPs at time-series forecasting of stock
market outcomes [Oancea and Ciucu, 2014]. It was shown that the RNNs were able
to provide significantly greater predictive power than MLPs in the time-series realm
given their ability to model temporal context.
Figure 1.3: A simple Elman recurrent neural network with a single hidden layer as proposed
in [Elman, 1990] and reviewed in [Lipton et al., 2015]. The hidden units feed into context
units, which feed back into the hidden units in the following time step.
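A minimal sketch of a single Elman-style recurrent step, in which the previous hidden state plays the role of the context units in figure 1.3. The weight shapes and the 25-step input sequence are illustrative assumptions.

```python
import numpy as np

def elman_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One time step: combine the current input with the fed-back hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(1)
W_xh, W_hh, b_h = rng.normal(size=(6, 3)), rng.normal(size=(6, 6)), np.zeros(6)
h = np.zeros(6)
for x_t in rng.normal(size=(25, 3)):  # a 25-step sequence of 3-dimensional inputs
    h = elman_step(x_t, h, W_xh, W_hh, b_h)
print(h)
```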
Despite the usefulness of simple RNN units, they tend to utilize typical activa-
tion functions that have trouble capturing long-term dependencies due to vanishing
or exploding gradients. These gradients tend to mask the importance of long-term
dependencies with exponentially larger short-term dependencies. Two notable ap-
proaches have been applied to handle these gradient problems which include gradient
clipping, where the norm of the gradient vector is clipped, and second-order deriva-
tive methods where the second derivative is assumed less susceptible to the gradient
problems than the first derivative [Chung et al., 2014]. Unfortunately, these solutions
are not generally applicable to all learning problems and have their own flaws.
The activation functions of the simple RNN units are not as robust for modeling
long-term dependencies as other methods which track important occurrences over
time. The LSTM introduced an intermediate storage to simple RNN units by the
addition of a memory cell. The memory cell of the LSTM was designed to main-
tain a state of long and short term dependencies using gated logic. The memory
cell was built from simpler cells into a composite unit including gate and multiplica-
tive nodes. The elements of the LSTM are described below [Lipton et al., 2015],
[Gers et al., 2000]:
• Input Node: This unit is labeled as gc . It takes the activation from the input
layer x(t) at the current time step in the standard way and along recurrent
edges from the hidden layer at previous time step h(t−1) . A squashing function
is applied to the summed weight input.
• Input Gate: A gate is a sigmoidal unit that takes activation from the current
data point x(t) and hidden layer at the previous time step. The gate value is
multiplied with another node’s value. The value of the gate determines the flow
of information and its continuation. For instance, if the gate value is 0, the
flow from another node is discontinued whereas if the gate value is 1, the flow
is continually passed. The input node value is multiplied by the input gate, ic .
• Internal State: Each memory cell has a sc node with linear activation, which is
called the internal state by the original authors. This state has a recurrent, self-
connected edge with a fixed weight of 1. Due to the constant weight, the error
flows across multiple time steps without exploding or vanishing. The update
for the internal state is outlined in figure 1.4, which also includes influence from
the forget gate, fc .
• Forget Gate: These gates were introduced by [Gers et al., 2000] and provide
the network ability to flush the internal state contents when necessary. These
are useful for networks that continuously run. The update equation for the
internal state in figure 1.4 shows the influence of the forget gate in the update
of the internal state.
• Output Gate: The output gate, oc , multiplies the internal state value, sc ,
to generate the output value, vc . Sometimes, the value of the internal state
is run through a non-linear activation function to give the output values the
same range as ordinary hidden units with squashing functions, but this may be
omitted depending on the scenario.
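A compact sketch of the gate logic described in the list above, written in the style of [Gers et al., 2000] with a forget gate. The per-gate weight dictionaries and the tanh squashing of the output are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, W, U, b):
    """One LSTM step. W, U and b hold per-gate weights keyed by 'g', 'i', 'f' and 'o'."""
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])  # input node
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate
    s = f * s_prev + i * g                                # internal state update
    h = o * np.tanh(s)                                    # cell output
    return h, s
```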
A LSTM unit with a forget gate in the style described in [Gers et al., 2000] is pre-
sented in figure 1.4. LSTM-based neural networks have been shown to significantly
outperform feedforward neural networks for speech recognition tasks [Sundermeyer et al., 2013].
In [Sundermeyer et al., 2013], the LSTM network was able to achieve perplexities
similar to those of the state-of-the-art speech recognition model, even though the
LSTM was trained on much less data. Multiple LSTMs used in combination yielded
even better performance on the test data, however, it was found that standard speech
recognition technology limited the potential capabilities of the LSTM given the limits
of preexisting speech recognition models.
Figure 1.4: A LSTM unit with a forget gate as proposed in [Gers et al., 2000] and reviewed
in [Lipton et al., 2015].
A recurrent unit simpler than the LSTM, called the GRU, was proposed by
[Cho et al., 2014]. This unit, like the LSTM, modulates the flow of information inside
the unit but is simpler because it does not have separate memory cells. The GRU,
unlike the LSTM, exposes the whole state at each time step because it does not have
a way to control the degree of the state that is exposed. Both the LSTM and GRU
add to their internal content between updates from one time step to another, where
the RNN unit typically replaces the previous activation with a new one.
Figure 1.5: The GRU cell proposed in [Chung et al., 2014]. In this diagram, r is the reset
gate, z is the update gate, h is the activation, and ĥ is the candidate activation.
In the GRU, the activation h_t^j of unit j at time t is a linear interpolation between the previous activation h_{t-1}^j and the candidate activation ĥ_t^j. The update gate, z_t^j, determines how much the unit activation is updated. The candidate activation computation is similar to that of a traditional RNN unit, except that the reset gate provides the ability to forget the previous state computation, like the LSTM. The reset gate computation, r_t^j, is similar to that of the update gate; both utilize linear sums and element-wise multiplication.
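The interpolation described above can be written out directly. The following sketch implements the GRU equations of [Cho et al., 2014] for one time step, with assumed per-gate weight matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step with reset gate r, update gate z and candidate activation h_hat."""
    z = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])            # update gate
    r = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])            # reset gate
    h_hat = np.tanh(W['h'] @ x_t + U['h'] @ (r * h_prev) + b['h'])  # candidate activation
    return (1.0 - z) * h_prev + z * h_hat                           # linear interpolation
```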
RNNs are harder to train than feedforward neural networks due to their cyclic nature.
In order to deal with the complex, cyclic nature of such a network, it must first be un-
folded through time into a deep feedforward neural network [Lipton et al., 2015]. The
deep unfolded network has k instances of a feedforward network called f [Werbos, 1990].
After unfolding, the training proceeds in a manner similar to backpropagation as de-
scribed in section 1.5.4. One difference is that the training patterns are visited in
sequential order. The input actions over k time steps are necessary to include in
the training patterns because the unfolded network requires inputs at each of the k
unfolded levels. After a given training pattern is presented to the unfolded network,
weight updates in each instance of f are summed together and applied to each in-
stance of f [Werbos, 1990]. The training is usually initialized with a zero-vector. In
theory, the algorithm has eight steps, which are listed below via [Werbos, 1990]:
1.6 Deep Learning
Deep learning is a large part of modern-day intelligent applications. Deep learning
involves the use of deep neural networks for the purpose of advanced machine learning
techniques that leverage high performance computational architectures for training.
These deep neural networks consist of many processing layers arranged to learn data
representations with varying levels of abstraction [LeCun et al., 2015]. The more lay-
ers in the deep neural network, the more abstract learned representations become.
Previous machine learning techniques worked well on carefully engineered features ex-
tracted from the data using heavy preprocessing and data transformation. However,
given the inherent variations in real-world data, these methods lacked robustness because they required undesirable variations to be disentangled and discarded by hand. Given the variability and unpredictable nature of real data, such machine learning methods have become infeasible.
In order to solve the difficulties of learning the best representation to distin-
guish data samples, representation learning is applied. Representation learning al-
lows a computer provided with raw data the ability to discover features useful for
learning on its own. Deep learning is a form of representation learning that aims
to express complicated data representations by using other, simpler representations
[Ian Goodfellow and Courville, 2016]. Deep learning techniques are able to under-
stand features using a composite of several layers each with unique mathematical
transforms to generate abstract representations that better distinguish high-level fea-
tures in the data for enhanced separation and understanding of true form.
A tremendous breakthrough allowing deep neural networks to train and evaluate
on GPUs caused rapid growth in the research area. NVIDIA is a pioneer in GPU computing and offers many graphics cards that can be used to accelerate deep learning. NVIDIA provides a range of application programming interfaces (APIs) for programming its GPUs to handle mathematical operations, such as those involved in neural network computations, in parallel. Due to the highly parallel and asynchronous architectures of GPUs, they can be applied to parallelize and speed up math operations in many diverse ways. NVIDIA CUDA is a parallel computing platform and API model created by NVIDIA for the purpose of CUDA-enabled GPU programming [Nvidia, 2017]. Network training and evaluation on the GPU has been shown to be quicker and more power efficient than on a CPU [Nvidia, 2015].
Many of the neural network layers described in sections 1.5.2 and 1.5.5 can be
applied in deep, multi-layer structures. Typically, more layers lead to more abstract
features learned by the network. Not all combinations of layers or depth of layers work
properly though, at least not without specific weight initialization, regularization,
training parameters or gradient clipping techniques [Zaremba et al., 2014].
It has been proven that stacking multiple types of layers in a heterogeneous mix-
ture can outperform a homogeneous mixture of layers [Plahl et al., 2013]. One of the
main reasons that stacking layers can improve network performance is that recurrent
layers generate features with temporal context while feedforward networks generate
features with static context. When the recurrent, temporal features are provided to
a feedforward, static learning layer, generated outcomes may have a greater certainty
than when generated either by homogeneous mixtures of recurrent or feedforward
layers. The reasoning is that recurrent networks are not always certain of outcomes
given their greater complexity and multi-path cyclic network while feedforward net-
works are more certain of outcomes but have lower complexity and less predictive
power. Combining these two types of layers, for example, can greatly increase the
predictive power of the overall neural network.
Multi-stream deep neural networks were created to deal with multi-modal data.
These networks have multiple input layers each with potentially different input for-
mats and independent modules of multiple layers that run in parallel. After for-
ward propagation of each stream network, the resultant features are merged into a
single module with multiple additional layers combined for joint inference using all
independently-generated features. Multi-stream neural networks are useful for generat-
ing predictions from multi-modal data where each data stream is important to the
overall joint inference generated by the network. Multi-stream approaches have been
shown successful for multi-modal data fusion in [Yao et al., 2016], [Singh et al., 2016],
[Wu et al., 2015], [Xing and Qiao, 2016], [Ngiam et al., 2011], and [Song et al., 2016b].
Overall, deep neural networks have been applied successfully in multiple applica-
tions such as neural machine translation in [Bahdanau et al., 2014] and [Cho et al., 2014],
time-series sensor data fusion [Yao et al., 2016], handwriting recognition [Xing and Qiao, 2016],
speech recognition [Graves et al., 2013], sequence labeling [Graves, 2012], multiple
kernel fusion [Song et al., 2016a], egocentric activity recognition [Song et al., 2016b],
video classification [Wu et al., 2015], action detection [Singh et al., 2016], and multi-
modal deep learning between noisy and accurate streams [Ngiam et al., 2011].
1.7 Network Tuning and Evaluation Techniques
A deep neural network weight initialization scheme that brings quicker training convergence, called GlorotUniform weight initialization, was developed in [Glorot and Bengio, 2010].
This initialization procedure takes into account the variance of the inputs to deter-
mine an optimal uniformly-distributed weight initialization strategy for the provided
training examples. This weight initialization scheme takes much of the guesswork out
of network initialization.
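A minimal sketch of the GlorotUniform scheme, which draws weights uniformly from a range determined by the fan-in and fan-out of a layer [Glorot and Bengio, 2010]; the layer sizes below are arbitrary.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Sample a (fan_out, fan_in) weight matrix from U(-limit, limit)."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))  # variance-balancing bound
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W = glorot_uniform(fan_in=25, fan_out=50)
print(W.min(), W.max())
```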
The optimal number of hidden nodes for feedforward networks is usually between
the size of the input and size of the output layers. Deep feedforward and recur-
rent neural networks may be hard to train. Feedforward networks tend to over-fit
on training data and perform poorly on held-out test datasets [Hinton et al., 2012].
Recurrent networks in deep configurations tend to have the problems of exploding or
vanishing gradient. These problems have to do with the inability of the backprop-
agation through time (BPTT) algorithm to properly derive the error gradient for
training the weights between unit connections in the RNNs. One way to solve this
problem is with dropout [Zaremba et al., 2014]. Dropout is a technique for reducing
network overfitting by randomly omitting a percentage of the feature detectors per
each training case [Hinton et al., 2012]. In the case of feedforward neural networks,
the feature detectors would be connections to random neurons while in the recur-
rent case they would be random connections to either the same unit or another unit.
Dropout can be applied in many cases to regularize the network generalization per-
formance. It has been proven effective in [Hinton et al., 2012], [Srivastava, 2013], and
[Srivastava et al., 2014].
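A sketch of dropout applied to a layer's activations during training, using the common inverted-scaling variant so that no rescaling is needed at test time; the drop probability shown is an illustrative choice, not a value reported in this work.

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True, rng=None):
    """Randomly zero a fraction of activations and rescale the survivors."""
    if not training or drop_prob == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    keep = rng.random(activations.shape) >= drop_prob  # random keep mask
    return activations * keep / (1.0 - drop_prob)      # inverted scaling preserves the expectation

print(dropout(np.ones(10)))
```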
Bootstrapping is a comprehensive approach for modification of the prediction ob-
jective in order to consistently manage random fluctuations in and absence of accurate
labeling [Reed et al., 2014]. Many deep neural networks are trained with supervised
learning where datasets are labeled and the networks learn to map the inputs to the
labeled outputs. However, in many cases, not all possible examples are present in
the dataset and many labels may be missing. Thus, the labeling for each dataset
is subjective. Bootstrapping is a way to determine if a given network architecture,
specifically a deep network, is able to generalize on a variety of data sampled from a
sample distribution. This method can be applied for enhanced training and regular-
ization by sampling random training examples with replacement from a large dataset
useful for modeling the population distribution of the overall data. This allows the
network to approximately learn the population data distribution without over-fitting
on specific examples.
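Bootstrapped training sets can be drawn by sampling indices with replacement, as in the sketch below; the dataset size and number of trials are placeholders rather than the values used in the experiments.

```python
import numpy as np

def bootstrap_indices(n_samples, n_trials, seed=42):
    """Yield one resampled index set per bootstrap trial (sampling with replacement)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        yield rng.choice(n_samples, size=n_samples, replace=True)

for trial, idx in enumerate(bootstrap_indices(n_samples=1000, n_trials=3)):
    print(trial, idx[:5])
```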
Common regression error metrics include the mean squared error (MSE) and root mean squared error (RMSE). In these error metrics, the smallest possible value is 0 and
it happens when the predicted values exactly match the target values. The largest
possible value is unbounded and depends greatly on the values being compared.
The single sample RSS metric for computing the residual sum of squares in a
regression prediction is the following:
RSS(Y, \hat{Y}) = \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 \qquad (1.13)
Across M samples each with N outputs, the mean squared error and its root are computed as:

MSE(Y, \hat{Y}) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} (Y_{i,j} - \hat{Y}_{i,j})^2 \qquad (1.14)

RMSE(Y, \hat{Y}) = \sqrt{MSE(Y, \hat{Y})} \qquad (1.15)
The coefficient of determination (r2-metric) relates the residual sum of squares to the total variance of the targets about their mean \bar{Y}:

R^2(Y, \hat{Y}) = 1 - \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{N} (Y_i - \bar{Y})^2} \qquad (1.16)

\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i \qquad (1.17)
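The regression metrics in equations 1.13 through 1.17 can be computed directly, as in the following illustrative NumPy sketch (the sample vectors are made up).

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RSS, MSE, RMSE and the coefficient of determination (r2)."""
    residuals = y_true - y_pred
    rss = np.sum(residuals ** 2)                    # equation 1.13
    mse = np.mean(residuals ** 2)                   # equation 1.14
    rmse = np.sqrt(mse)                             # equation 1.15
    tss = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares about the mean
    r2 = 1.0 - rss / tss                            # equation 1.16
    return rss, mse, rmse, r2

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(regression_metrics(y_true, y_pred))
```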
For classification, a model typically outputs a vector of per-class confidences. If the confidence for one class is higher than the other output confidences for a given input sample, then the model predicts that the
input sample belongs to that class. Typically, these confidence vectors are converted
into what is called a binary one-hot representation, where the list index for the class
with the highest confidence is converted to a 1 while all other classes are converted
to 0s.
An often-used metric for evaluating the error of a binary classifier is the binary cross-entropy error. The binary cross-entropy cost function is the following:

C = \sum \left[ -t \log(y) - (1 - t) \log(1 - y) \right] \qquad (1.18)
where t is the target class label and y is the network output [Nervana Systems, 2017c].
When this cost metric is applied on a neural network with a SoftMax classifier for
the output layer, backpropagation (BP) training takes advantage of a shortcut in the
derivative that saves computation.
Another valuable metric for classification is accuracy. Accuracy is the number of
correct predictions made over the total number of predictions made as shown below:
ACC = \frac{TP + TN}{P + N} \qquad (1.19)

where TP and TN are the numbers of true positive and true negative predictions and P and N are the total numbers of positive and negative samples. Precision, also called positive predictive value (PPV), and recall, also called true positive rate (TPR), are defined as:

PPV = \frac{TP}{TP + FP} \qquad (1.20)

TPR = \frac{TP}{TP + FN} \qquad (1.21)

The F1 score combines precision and recall:

F_1 = \frac{2 \cdot P \cdot R}{P + R} \qquad (1.22)
where P is the precision and R is the recall [Asch, 2013]. Another way to visually
evaluate the performance of a classifier is through a receiver operating characteristic
(ROC) curve. This curve is created by plotting the true positive rate on the vertical axis against the false positive rate on the horizontal axis. The ROC curve is thus the sensitivity of prediction as a function of the false alarm rate [Fawcett, 2006].
The true positive rate is synonymous to the recall shown in equation 1.21 while the
false positive rate (FPR) is the rate of false alarm. The FPR is shown numerically
below:
FPR = \frac{FP}{FP + TN} \qquad (1.23)
where FP is the number of false positive predictions and TN is the number of true negative predictions. When the ROC curve hugs the upper-left corner, forming a horizontal line at a true positive rate of 1, the classifier is perfect according to this metric. The area under the precision-recall curve or the
ROC curve can be calculated. This area is referred to as the area under the curve
(AUC) score. The larger the area, the better the classifier performance.
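The classification metrics above follow directly from the confusion-matrix counts; the sketch below computes them from raw counts, which are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 and false positive rate from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)          # equation 1.19
    precision = tp / (tp + fp)                          # equation 1.20
    recall = tp / (tp + fn)                             # equation 1.21
    f1 = 2 * precision * recall / (precision + recall)  # equation 1.22
    fpr = fp / (fp + tn)                                # equation 1.23
    return accuracy, precision, recall, f1, fpr

print(classification_metrics(tp=80, fp=10, tn=95, fn=15))
```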
The scores can be averaged to determine a single number representative of the
classifier performance. There are several ways to average the scores, including micro-averaged, macro-averaged and weighted macro-averaged scores.
According to [Asch, 2013], each of these scores has its own pitfalls based on label
imbalance. However, many of the averaging methods perform similarly when labels
are balanced. Thus, it is a good idea to achieve balanced datasets when possible.
Chapter 2
Methods
2.1 Sensor Data
Time-series sensor data was collected from multiple vehicles with forward-facing sen-
sor systems as provided by supplier A. The systems consisted of radar and camera
mounted on the front of vehicles. Each sensor was able to track up to 10 objects in
front of the vehicle at once. Per each sensor, the important objects were provided with
their own track of multiple informative data streams yielding properties observed by
the sensor. Synchronized sensor data was collected from vehicles at a 40 millisecond
(ms) sample rate while driving along various roadways with varying light, weather
and traffic conditions.
The data used was sampled from week-long spans of synchronized multi-track
sensor data acquired from both sensors. The sampled data captured a multitude of
scenarios useful for data analysis, training and evaluation of a robust sensor fusion
system involving deep learning. The fusion methodology presented was developed in
order to fuse synchronized pairs of the acquired types of radar and camera sensor track
data based on a preexisting state-of-the-art sensor fusion system with demonstrated
success on similar data.
2.2 Camera
The camera was provided by vendor A. The camera could track up to ten detectable
vehicles at once within its field-of-view. These vehicles were each arbitrarily assigned
to one of ten object tracks. Unique vehicle identification (ID) labels were provided
by the camera in order to identify each tracked vehicle at any given time. The
camera was noisy in its approximation of various attributes like longitudinal distance
and relative velocity with respect to the mounted observer vehicle. Two notable
camera advantages were its wide field-of-view and unique vehicle recognition and
tracking abilities. On the other hand, two disadvantages of the camera were its noisy
approximation of depth and susceptibility to noise from varying conditions. Figure
2.1 shows the camera field-of-view with respect to the observer vehicle and other
moving vehicles.
2.3 Radar
The radar was provided by vendor B. The radar could track up to ten targets at once
within its field-of-view. A target is synonymous with a detected object in front of the
observer vehicle. The radar was able to provide the same information as the camera
along with more accurate tracking of distance and relative velocity with respect to
the observer vehicle. Unlike the camera, however, targets tracked by the radar were
arbitrarily identified and did not necessarily represent vehicles on the road. There
was no unique ID assigned to any one target at any time. Targets would abruptly be
re-assigned to new tracks over time. This arbitrary target track re-assignment made
it difficult to determine fusion between both sensors for a long period of time. For this
reason, the radar data required preprocessing in order to uniquely segment targets
with the same consistent properties over time. Two notable radar advantages were its
deeper depth-of-field for target detection and tracking as well as smooth signals. Two
disadvantages of the radar were its narrow field-of-view and spontaneous target track
re-assignment. Figure 2.1 shows the radar field-of-view with respect to the observer
vehicle and other moving vehicles.
2.4 Utilized Sensor Streams
By visual analysis with plotting tools, it was determined that four specific streams
of data per each object track would be useful to fuse between the sensors. While
addressing the sensor fusion problem, these four streams were found reliable and con-
sistent between sensors when overlap occurred without object re-assignment. Given
the formulation of a machine learning solution to address sensor fusion between the
two sensors, it was necessary to find correlated time-series data streams between any
two selected pairs of object tracks in order to train and evaluate the fusion models
constructed. Such data would have to be separable in order to distinguish between
fused and non-fused examples. The four chosen streams that seemed to best correlate
visually were the following:
• Longitudinal Distance: This data represented the longitudinal distance of the tracked object from the observer vehicle. The units were meters.
• Lateral Position: This data represented the lateral position of the object
relative to the lateral axis of the observer vehicle. Left of the vehicle was a
distance increase and right of the vehicle was a distance decrease, where the
center of the vehicle was the origin. This data was equivalent to, but not limited
to, tracking the lane occupied by the object in focus according to the observer
vehicle. The units were meters.
• Relative Velocity: This data represented the velocity of the tracked object
relative to the observer vehicle. The units were meters per second.
• Width: This data represented the width of the object tracked by the observer
vehicle. The units were meters.
The radar already provided the chosen data streams. The camera data required
preprocessing to compute the lateral position based on azimuth angles and longitu-
dinal distance. The units of the azimuth angles were radians. The following formula was used to compute the lateral position of each vehicle with respect to the observer vehicle based on the azimuth angles and longitudinal distance:

\mathrm{lat} = dist \cdot \tan\left(\frac{\min(angles) + \max(angles)}{2}\right) \qquad (2.1)

where tan was the tangent function, min was the minimum value function of a collection, max was the maximum value function of a collection, dist was the longitudinal distance in meters of the tracked object from the vehicle, and angles were the left and right azimuth angles of the tracked object with respect to the center of the mounted camera in radians.
provided by each of the four chosen sensor streams.
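To make this computation concrete, the following is a minimal NumPy sketch of the lateral position calculation under the mid-angle interpretation of equation 2.1 above; the function name, the example values, and the assumption that the object center lies midway between the left and right azimuth angles are illustrative rather than the exact implementation used in this work.

```python
import numpy as np

def lateral_position(dist, angles):
    """Approximate lateral position (meters) of a tracked object.

    dist   : longitudinal distance to the object in meters.
    angles : the left and right azimuth angles of the object, in radians,
             measured from the center of the mounted camera.
    """
    # Assume the object center sits at the midpoint of the two azimuth angles.
    center_angle = 0.5 * (min(angles) + max(angles))
    return dist * np.tan(center_angle)

# Example: an object 40 m ahead subtending azimuth angles of 0.02 and 0.06 rad
print(lateral_position(40.0, [0.02, 0.06]))  # roughly 1.6 m from the lateral origin
```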
In order to determine the strength of correlation between the four chosen object track streams reconstructed from both sensors where fusion was implied, multiple metrics were computed, including the coefficient of determination, referred to as the r2-metric, and the normalized discrete cross-correlation. The normalized cross-correlation was built on the discrete cross-correlation

\mathrm{xcorr}(f, g) = (f \star g)[n] = \sum_{m=-\infty}^{\infty} f^{*}[m]\, g[m+n] \qquad (2.2)
where f and g were aligned temporal sequences, f* was the complex conjugate of f, m was the current point in time, and n was the displacement or time lag between
data points being compared. There were several modes of this algorithm that could
have been used depending on the type of convolution desired. It was chosen to use
the convolution mode for cross-correlation only where the signal vectors overlapped
completely.
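The overlap-only cross-correlation of equation 2.2 can be sketched with NumPy as shown below; the function name is illustrative, and numpy.correlate is used because it conjugates its second argument, matching the f* term in the equation.

```python
import numpy as np

def xcorr(f, g):
    """Discrete cross-correlation (f * g)[n] over fully overlapping lags only."""
    f = np.asarray(f)
    g = np.asarray(g)
    # numpy.correlate(a, v)[k] = sum_n a[n + k] * conj(v[n]), so passing g first
    # and f second yields sum_m conj(f[m]) * g[m + k], as in equation 2.2.
    return np.correlate(g, f, mode="valid")

# For two equal-length, time-synchronized streams, 'valid' mode returns the
# single zero-lag correlation value used by the normalization step that follows.
print(xcorr([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # [14.]
```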
In order to compute the normalized, discretized cross-correlation, the auto-correlation of each vector was taken using equation 2.2 by providing the same vector as both inputs. Each selected stream vector was then divided by the square root of its auto-correlation to normalize the vector. These normalized vectors were subsequently fed to equation 2.2 to compute the normalized cross-correlation. Hence, the normalized cross-correlation X between a given pair of reconstructed, fused streams f and g was computed as

X = \mathrm{xcorr}\left(\frac{f}{\sqrt{\mathrm{xcorr}(f, f)}}, \frac{g}{\sqrt{\mathrm{xcorr}(g, g)}}\right) \qquad (2.3)

\mathrm{dist}(X, Y) = |X - Y| \qquad (2.4)
where X is one real-valued time-series sensor stream sample vector from radar and
Y is another real-valued time-series sensor stream sample vector from camera. For
proper comparison, both samples were time-synchronized and the streams compared
were of the same type as listed in section 2.4. In this formula, the closest possible
correlation was represented by zero.
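A NumPy sketch of equations 2.3 and 2.4 follows; the function names are illustrative, and taking the median of the element-wise distance, as reported later as the median 1-D Euclidean distance, is shown only as an example usage.

```python
import numpy as np

def normalized_xcorr(f, g):
    """Normalized cross-correlation of two equal-length streams (equation 2.3)."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    f_norm = f / np.sqrt(np.correlate(f, f, mode="valid"))
    g_norm = g / np.sqrt(np.correlate(g, g, mode="valid"))
    return np.correlate(g_norm, f_norm, mode="valid")

def stream_distance(x, y):
    """Element-wise 1-D distance between two stream vectors (equation 2.4)."""
    return np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))

radar = np.array([10.0, 10.2, 10.4, 10.5])
camera = np.array([10.1, 10.1, 10.5, 10.6])
print(normalized_xcorr(radar, camera))            # close to 1 for well-correlated streams
print(np.median(stream_distance(radar, camera)))  # median 1-D Euclidean distance
```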
Understanding the noise distribution and tracking error between the sensors with
reference to the preexisting fusion system was necessary to develop alternative sensor
fusion datasets. The absolute differences acquired over a variety of conditions were
subsequently used to represent allowable error tolerances between the two sensors in
a variety of generated datasets. Furthermore, it was conjectured that comparable sets of data streams from both sensors, where the tracked objects were consistent and close with respect to their measured stream values, would be strongly correlated and therefore suitable for fusion.
Figure 2.3: An example of consistent sensor fusion between a single radar track and camera
object for approximately 36 seconds. The fusion is apparent through correlation of all four
selected sensor streams.
The notion of sensor data consistency was developed based on the inability to fuse
object track data directly between both sensors due to spontaneous reassignment of
an object to a new track by either sensor. It was found that the radar tended to swap
target information between two tracks over time or even reassign a target to a new
track spontaneously. The camera, however, had less predictable track reassignment.
Once object re-assignment took place on either fused sensor track, the fusion corre-
lation between those particular tracks was lost. Figure 2.4 reveals the swapping of
target information between two radar tracks over time. Such data swapping would
break the correlation between fused sensor tracks without inherent knowledge of the
swap.
Figure 2.4: A demonstration of target information swapping between two radar tracks over
a 4-second time period.
To solve this problem, it was determined that fusion would be considered only
after the consistency of all object tracks per each sensor was marked per each time
step in the time-series sample. Thus, if a given track held information about the same
object consistently, it would be marked as consistent for each possible time step. If
a given track started providing information about a new object or unknown data,
that particular track would be marked as inconsistent for each time a drastic change
occurred. Finally, if a given track began providing unknown data as in empty space,
very noisy data, or zeros, then the track would be marked unknown for that amount
of time. These three classifications of data consistency were necessary for building a
robust fusion system.
The state-of-the-art sensor fusion system was able to provide one means of labeling
sensor fusion to create a dataset. Based on these labeled fusion occurrences, an error
distribution was created for the absolute differences between each of the four sensor
streams. Each sensor stream type had its own independent distribution of acceptable
tracking error with respect to the other sensor. Maximum allowable error thresholds
per each stream were considered when labeling fusion occurrences between any two
sensor tracks for further generated sensor fusion datasets.
2.6 Data Consumption, Preprocessing and Labeling
2.6.1 Reading and Processing of Big Data
In order to experiment with the idea of fusion between two sensor sources and create
a robust system, several gigabytes of data was acquired from real vehicles. This
amount of data was equivalent to several million samples of input sensor data. Given
the fact that there were millions of data samples, a fast, robust programming language
was necessary for data processing and analysis. Given the involvement of machine
learning, it was inherently necessary to involve the GPU for model training and
evaluation. The tools used had to be compatible with both CPU and GPU in this
case.
There was also the idea of building an embeddable fusion system, so the language
had to be compatible with multiple platforms and architectures such as Intel x86-64
CPU, NVIDIA GPU, and ARM 32-bit and 64-bit. Given the fact that the project
had many aspects requiring rapid development across multiple areas, a general pur-
pose language possessing the necessary traits was chosen. The choice was to use
Python 2.7.12, given its versatility, easy-to-use interface, compatibility across multiple
systems with multiple architectures and operating systems, breadth of available open-
source libraries, and speed when properly implemented [Python Software Foundation, 2017].
Python has many open-source libraries in the scientific computing and machine
learning realms that are maintained by industry experts. Python also has inter-
faces to other languages like C, for tremendous speedup. Cython is a python in-
terface to C that provides enhanced performance for python scripts and libraries
[Behnel et al., 2011].
The acquired database was read, stored and manipulated using pandas and numpy.
Pandas is a python package that provides fast, flexible, and expressive data structures
designed to facilitate work on data that is relational [McKinney, 2010]. Numpy is a
fundamental library for scientific computing in python that provides vectorized, high-
speed data structures and directly interfaces with cython [van der Walt et al., 2011].
While pandas made it easy to manipulate relational data with multiple named rows
and columns, numpy made it easy to quickly manipulate matrices of numerical data
for vectorized math operations. Many of the tools used relied heavily on numpy
due to its tremendous speedup over typical Python data structures. Optimized Ba-
sic Linear Algebra Subprograms (BLAS) libraries, like OpenBLAS, were also used
to tremendously enhance the speed of applications by means of vectorized numpy
functions.
In order to efficiently compute metrics and statistics on the data, scipy and scikit-
learn were used in addition to built-in functions from numpy. Scipy is an open-source
library for scientific computing in Python that relies on numpy [Jones et al., 01 ]. It
provides many functions for signal processing and data analysis. Scikit-learn is an
open-source Python library that provides implementations for a series of data pre-
processing, machine learning, cross-validation and visualization algorithms for unified
data science applications [Pedregosa et al., 2011]. Scikit-learn also has a heavy depen-
dence on numpy, like scipy. Most notably, scikit-learn provides evaluation metrics for
regression and classification problems that are versatile, utilize scipy and numpy for
high performance, and are tested by many involved in machine learning development.
Neon is Intel Nervana Systems’ open-source deep learning framework for design-
ing, training and evaluating deep neural networks [Nervana Systems, 2017a]. Neon
has many of the important building blocks necessary to create feedforward, recurrent
and convolutional neural networks or load a pre-trained deep neural network. Neon
has a fast and robust implementation of most network layer types, but more impor-
tantly, it is easier to implement than the other machine learning frameworks in Python
including TensorFlow and Theano. For machine learning purposes, it has been shown
that the Intel Nervana Neon Deep Learning Framework is one of the fastest perform-
ers in Python alongside Theano [Bahrampour et al., 2015]. However, the abilities of
fast prototyping and ease-of-use outweighed the slight performance gains available
using Theano. Given the advantages of using Neon, it was the selected option for
implementing and evaluating deep neural networks in this research.
Parallelization
In order to design algorithms with high performance for rapid prototyping, high
productivity and greater scalability, it was necessary to parallelize multiple aspects
of the data read/write, preprocessing, labeling, post-processing, machine learning,
statistical analysis and deep learning fusion system scripts. In order to parallelize
these aspects, multiple libraries were used and compared. The libraries that were
used included the “threading,” “multiprocessing,” “joblib,” and “PyCuda” libraries
in Python. The “threading” library in Python provides multi-threading ability but
has the limitation of the global interpreter lock (GIL). Thus, threading in Python is
not nearly as fast or scalable as it would be in a language like C or C++; however, it
was used for experimentation with runtimes in the deep learning fusion system.
To unlock the GIL in Python, multiprocess programming was used. The multi-
processing module in Python provided the ability to fork multiple processes and share
memory between processes in addition to providing control for all child processes from
the main parent process. The joblib module had an implementation of “embarrass-
ingly parallel” utilities for parallelizing “for” loops and encapsulating threading and
multiprocessing modules behind an abstract, straightforward interface. The multiprocess-
ing and joblib libraries also allowed the creation of process pools for reusing processes
for multiple jobs to amortize costs. Multiprocessing was used to parallelize just about
every aspect possible in all the mentioned portions of this project.
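The CPU parallelization pattern described above can be illustrated with a short joblib sketch; the worker function below is a hypothetical stand-in for one unit of preprocessing or labeling work, not a routine from this project.

```python
from joblib import Parallel, delayed

def process_window(window):
    # Hypothetical unit of work, e.g. preprocessing or labeling one 25-step window.
    return sum(window) / len(window)

windows = [[float(i), float(i + 1), float(i + 2)] for i in range(1000)]

# Fan the work out over a reusable pool of worker processes; n_jobs=-1 uses
# every available CPU core, bypassing the global interpreter lock.
results = Parallel(n_jobs=-1)(delayed(process_window)(w) for w in windows)
print(len(results))  # 1000
```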
Parallelization for most script aspects took place on the CPU via multiprocessing
or joblib. However, training and evaluation of neural networks often took place on
the GPU via NVIDIA CUDA 8 through the PyCuda module. PyCuda is a python
library designed for interfacing with the NVIDIA CUDA parallel computation API
[Klockner, 2017]. NVIDIA CUDA 8 is a version of CUDA that supports the latest
Pascal architecture as of 2017. CuBLAS is an implementation of BLAS for CUDA by
NVIDIA that greatly enhances the speed of vectorized GPU functions during linear
algebra and batch operations, like dot-products across large matrices. It was also
applied for faster neural network training and evaluation.
In typical neural networks, there are a fixed number of inputs provided. The neural
networks used for this project consisted of a fixed number of inputs to simplify the
overall problem. Given the 40 ms update rate for the sensor signals and a need
for the number of inputs to represent a consistent sensor signal, it was determined
that the use of the past 25 time steps of acquired data points would suffice in order
to provide enough data to determine whether two tracked objects were fused. The
chosen number of time steps was equivalent to using the past one second of sensor
data in order to determine whether a pair of independent sensor tracks were fused
between the two sensors at the current time step.
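A short NumPy sketch of this fixed-size input construction follows; the helper name and the random data are illustrative.

```python
import numpy as np

STEPS = 25  # 25 samples at a 40 ms update rate span the past one second

def latest_window(stream, steps=STEPS):
    """Return the most recent `steps` samples of one sensor stream."""
    stream = np.asarray(stream, dtype=float)
    if stream.shape[0] < steps:
        raise ValueError("not enough history for a full window")
    return stream[-steps:]

# Four synchronized streams of one track flatten into the 100 network inputs.
streams = [np.random.randn(500) for _ in range(4)]
inputs = np.concatenate([latest_window(s) for s in streams])
print(inputs.shape)  # (100,)
```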
Given the need to train a neural network for generating the consistency sequence of
any given sample of four synchronized input stream samples for a given radar track,
as discussed in section 2.4, it was determined that integer sequences would be the
easiest to learn for the neural network. This determination was based on the idea that
simplifying the separation of learned representations for the neural network would lead
to better recognition outcomes with respect to sequence generation. The camera would
also need the notion of consistency sequences, but those sequences would be generated
based on the provided vehicle IDs given the camera vehicle recognition algorithm.
Since three consistency categories were already necessary to best determine when
fusion was applicable between any two sensor object tracks, these three categories
were mapped to integers as in a classification problem per each time step in the
sample series. Thus, due to the time-series nature of the data, each time-step had
to have its own consistency classification, which would require the consistency neural
network to generate a numerical consistency sequence per the four input streams.
Following were the three consistency labeling categories:
• Consistent (2): For each time step that a sensor track was considered consis-
tent as determined by the tracking of a single object at any given time according
to all four applicable streams, that particular sensor track was marked with a 2.
Consistent objects are determined after two consistent time steps of the same
data coming from any given sensor track.
• Inconsistent (1): For each time step that a sensor track was considered inconsistent, as determined by a drastic change in tracking such as the track beginning to provide information about a new object, according to all four applicable streams, that particular sensor track was marked with a 1.
• Unknown (0): For each time step that a sensor track was considered unknown
as determined by tracking very noisy, unrecognizable or zero data, according to
all four applicable streams, that particular sensor track was marked with a 0.
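A simplified sketch of how such per-time-step labels could be assigned is shown below; the single-stream view, the jump threshold, and the treatment of zeros as unknown data are simplifying assumptions for illustration and do not reproduce the exact labeling algorithm used in this work.

```python
import numpy as np

UNKNOWN, INCONSISTENT, CONSISTENT = 0, 1, 2

def label_consistency(stream, jump_threshold=2.0):
    """Assign a consistency label to every time step of one track stream."""
    stream = np.asarray(stream, dtype=float)
    labels = np.full(stream.shape[0], CONSISTENT, dtype=int)
    # Mark steps where the value jumps drastically, suggesting a new object.
    jumps = np.abs(np.diff(stream)) > jump_threshold
    labels[1:][jumps] = INCONSISTENT
    # Mark steps with missing or zero data as unknown.
    labels[np.isnan(stream) | (stream == 0.0)] = UNKNOWN
    return labels

print(label_consistency([10.0, 10.1, 10.2, 55.0, 55.1, 0.0]))
# [2 2 2 1 2 0]
```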
Sample consistency neural network inputs and outputs, which showcase an
inconsistent radar track with its generated consistency sequence, are shown in figure
2.5.
Figure 2.5: A time-series sample of consistency neural network input and output, which
consist of synchronized radar track streams over 25 consecutive 40 ms time steps and a
corresponding consistency sequence. In this case, the radar track streams reveal an incon-
sistent radar track. The consistency sequence is 2 when the track follows the same target
and 1 when the track begins to follow a new target.
The consistency sequence labels were generated by the labeling algorithm. The input values were the radar track streams
and the target output values for model training and validation were the consistency
sequences.
2.6.4 Labeling of Data and Dataset Creation for Sensor Fusion
In order to create fusion datasets for training, testing and validating the fusion neural
networks, examples of where two sensor tracks were fused or not fused were extracted
and labeled. The examples consisted of 25 time steps each of the absolute differences
between the two paired sensor track streams, where the data was either consistent
and fused across both sensor tracks for fused cases or had mixed consistency and
was not fused for non-fused cases. Thus, each example consisted of 100 time-series
data points, 25 for each of the four stream differences between the two sensor tracks.
Integer classes were used to represent the binary fusion cases. The fused cases were
mapped to the class represented by a one while the non-fused cases were mapped to
the class represented by a zero.
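A minimal NumPy sketch of this example construction follows; the column ordering of the window arrays and the helper name are illustrative assumptions.

```python
import numpy as np

def build_fusion_example(radar_window, camera_window, fused):
    """Build one fusion example from a pair of synchronized 25-step windows.

    radar_window, camera_window : arrays of shape (25, 4), one column per
        chosen stream type.
    fused : True for a fused pair, False for a non-fused pair.
    """
    diffs = np.abs(np.asarray(radar_window) - np.asarray(camera_window))
    # Flatten stream-by-stream into 100 time-series difference inputs.
    inputs = diffs.T.reshape(-1)
    label = 1 if fused else 0
    return inputs, label

x, y = build_fusion_example(np.random.rand(25, 4), np.random.rand(25, 4), True)
print(x.shape, y)  # (100,) 1
```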
Five different fusion datasets were constructed to test the fusion neural networks
using the same sensor data. One dataset consisted of fusion examples from the state-
of-the-art sensor fusion system applied to the data. Four other datasets were created
based on a hand-engineered fusion labeling algorithm which was constructed to fuse
sensor tracks based on configured error thresholds between each of the four chosen
synchronized sensor stream track pairs. The error thresholds were determined using
statistics from fusion occurrences based on the state-of-the-art fusion system. Sample
fusion neural network inputs and outputs, which showcase a fused pair of camera and
radar track absolute differences with their corresponding binary fusion label sequence,
are shown in figure 2.6.
Figure 2.6: A time-series sample of fusion neural network input and output, which consist
of the absolute differences between fused radar and camera track streams belonging to one
consistent moving object over 25 consecutive 40 ms time steps and a corresponding binary
fusion label sequence.
One labeled fusion dataset was extracted from the state-of-the-art fusion system with
outliers filtered out. The state-of-the-art fusion system occurrences were generated
based on the provided sensor data database by a black-box sensor fusion system known
to perform well in real-world scenarios. This preexisting sensor fusion system used
more than the four selected streams discussed in section 2.4 to determine fusion,
which most likely included actual video data from the camera sensor. The fusion
results generated by this system were verified to be accurate within their own means
based on calibrated sensor errors and noise learned from the actual vehicles. The
scenarios were gathered from many cases of real-world driving across multiple road,
lighting, and weather conditions. All fusion occurrences from this system, excluding
outliers, were selected for analysis to determine reasonable error thresholds between
the utilized sensor track streams for generating other datasets.
This algorithm allowed for each stream error threshold to be set individually, so that
the user could decide their own separate stream error thresholds based on a selected
level of noise found in each sensor stream as well as a selected level of error between
the two chosen sensor streams when fusion occurred. This method had no guarantee
that the fused cases would be true fusions, but it left the discretion of how to label
fusion cases up to the user. The user would measure the typical error in sensor streams
between the two sensor tracks when fused by the state-of-the-art fusion system and
consider the typical noise of each sensor stream in order to best determine allowable
error thresholds for fused samples.
The errors considered when labeling fused and non-fused example cases between any pair of object tracks were the absolute differences in longitudinal distance, lateral position, relative velocity, and width between the two synchronized sensor track streams.
Multiple fusion datasets were generated to showcase the fusion network perfor-
mance across multiple variations of noise and notions of “fusion” between the two
sensors. All sensors are different and have different noise distributions and assump-
tions. The idea was that depending on the two sensors and their noise amounts,
the dataset generation algorithm could be tuned to generate a binary fusion dataset
with “fused” examples that allowed for the assumed amount of error between the
selected sensor streams up to a certain threshold in any of the four streams. Once the
absolute error was beyond any of these tuned thresholds in any of the four streams
between the two sensor tracks, the examples were deemed “non-fused” in the dataset.
Given that each dataset allowed for different amounts of absolute error between the
two sensors, the training and testing of the neural network versions on each of these
datasets showcased their ability to perform on multiple variations of sensor noise
fusion assumptions.
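The threshold-based labeling idea can be sketched as follows; the per-stream threshold values shown are placeholders rather than the thresholds derived in this work.

```python
import numpy as np

def label_fusion(radar_window, camera_window, thresholds):
    """Hand-engineered fusion label for a pair of synchronized 25-step windows.

    thresholds : four per-stream maximum allowable absolute errors, ordered
        like the window columns. The pair is labeled fused (1) only if every
        stream stays within its threshold at every time step.
    """
    diffs = np.abs(np.asarray(radar_window) - np.asarray(camera_window))
    return int(np.all(diffs <= np.asarray(thresholds)))

radar = np.random.rand(25, 4)
camera = radar + 0.05  # nearly identical tracks
print(label_fusion(radar, camera, [2.5, 0.5, 1.0, 0.4]))  # 1 (placeholder thresholds)
```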
The sensor consistency neural network was a model created for the purpose of pre-
dicting the consistency of each time step of a 25 time step sample of a radar target
track. The network was trained to accept 4 streams of 25 time step samples of radar
track data in order to generate a 25 time step sequence of consistency values in the
range [0, 2]. The consistency labels were discussed in section 2.6.3.
The consistency network was a multi-stream neural network. It accepted 100 inputs
and generated 25 outputs. It consisted of four pathways, each of which accepted 25
inputs. Each pathway was designated for one of the four chosen streams of radar input
data consisting of 25 consecutive time steps from a given target track. Each pathway
of the network consisted of the same sequential neural network model, but they were
all trained to learn different weights between the connections. The pathway model was
developed based on [Cho et al., 2014], which applied an RNN encoder-decoder model
as a seq2seq network for translating one language sequence into another. This method
was shown to perform well on short sequences. Their model accepted a sequence in
one language which passed through an encoder RNN layer and a decoder RNN layer,
subsequently yielding a translated output sequence. A follow-up to their work show-
casing an advanced version of this model was developed in [Bahdanau et al., 2014],
which performed even better on longer sequences.
The chosen pathway model had an encoder layer and a decoder layer as in [Cho et al., 2014]
and [Bahdanau et al., 2014], but it also had a fully-connected (FC) central embed-
ding layer. This model was very similar to an auto-encoder in nature, except that it
understood temporal context and its training was supervised. It has been shown
that stacking network layers, especially recurrent with FC layers, improves net-
work predictive performance by providing temporal context for sequence generation
[Plahl et al., 2013]. The encoder and decoder layers each had 25 units and their own
sigmoidal activation functions. Each of these layers had its own GlorotUniform ran-
dom weight initialization scheme and was followed by a dropout layer. In training,
the dropout layer was necessary for network regularization in order to prevent over-fit
of certain training examples [Hinton et al., 2012], [Srivastava et al., 2014]. A central
embedding layer with non-linear activation functions and random weight initializa-
tion connected the encoder and decoder layers. Experiments were conducted with
four versions of simultaneously paired encoder and decoder layers in the multi-stream
network. The four chosen alternatives of encoder and decoder layers are listed in
table 2.1.
Table 2.1: Sensor Consistency Neural Network Encoder and Decoder Layer Types
1 Fully-Connected
2 Recurrent
3 Gated Recurrent Unit (GRU)
4 Long Short-Term Memory (LSTM)
Each of these pathway networks yielded a series of features with potentially tem-
poral context, which were concatenated into a single vector of features and fed to a
MLP network. The MLP consisted of one hidden layer and an output layer. The
neurons of both layers had non-linear activation functions. The last layer generated
the consistency sequence of the multi-modal input sample consisting of real-valued
numbers in the range [0, 2]. These numbers represented where data was unknown,
inconsistent, or consistent. The general model structure is shown in figure 2.7.
Figure 2.7: The multi-stream seq2seq consistency sequence deep neural network. In exper-
iments, the encoder and decoder layers were selected from table 2.1.
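The overall structure can be illustrated with a plain NumPy forward pass; the random weights, the embedding and hidden-layer widths, and the purely feedforward pathways are assumptions made for illustration, and dropout as well as the recurrent encoder and decoder variants are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, n_out):
    # Stand-in for a trained fully-connected layer; weights are random here.
    w = rng.normal(scale=0.1, size=(x.shape[-1], n_out))
    return x @ w

def pathway(stream_25):
    """One pathway: 25-unit encoder -> central embedding -> 25-unit decoder."""
    enc = sigmoid(dense(stream_25, 25))
    emb = sigmoid(dense(enc, 25))   # embedding width assumed
    dec = sigmoid(dense(emb, 25))
    return dec

# Four synchronized 25-step radar streams -> four pathways -> merged MLP head.
streams = [rng.normal(size=25) for _ in range(4)]
features = np.concatenate([pathway(s) for s in streams])  # 100 merged features
hidden = sigmoid(dense(features, 50))                     # hidden width assumed
outputs = dense(hidden, 25)  # 25 values; the trained network maps these into [0, 2]
print(outputs.shape)  # (25,)
```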
For training the seq2seq neural network, the RMSProp optimization algorithm as
described in section 1.5.4 was utilized with either BP or BPTT. BP was applied
when the network was feedforward, while BPTT was applied when the network was
recurrent. Batch training was applied with a batch size of 10,000 in order to increase
speed on large datasets and accommodate nearly one million examples across train
and test partitions. A learning rate of 0.002 and a state decay rate of 0.95 were
used for the problem. These parameters provided a balance of learning speed and
quality. The cost function used to evaluate and minimize the error of outputs from
the network was the RSS cost function. The RSS cost function was chosen because it
was best able to minimize the error of sequences generated by the network across all
components of the output sequence rather than focusing on the mean of the squared
errors as in the MSE function.
The RMSProp algorithm with RSS was proven effective for training seq2seq re-
current neural networks as in the work of [Graves, 2014]. In [Graves, 2014], an LSTM
was trained to generate cursive handwriting sequences with a high level of realism. In
the case of sensor fusion, the smallest errors needed to be acknowledged when training
the neural network due to the inherent reliability required in such a system. Given
the susceptibility of the RSS to outliers and the lack thereof in the MSE cost function,
the RSS was needed for training this particular network in order to minimize error
even in rare cases.
2.7.3 Sensor Consistency Network Training Cycles and Data Partitioning
Each of the four consistency network versions constructed based on table 2.1 was trained and tested to completion over 9 bootstrapping cycles to determine the average
regression performance of the selected network version. Bootstrapping was utilized
to provide statistical certainty that a given network version was appropriate across a
sample distribution over all available consistency sequence data.
In each bootstrapping cycle, the labeled consistency network data was randomly
partitioned into two separate categories of 70 percent train and 30 percent test data
sampled from the overall set of millions. Per each bootstrapping cycle, there were
hundreds of thousands of samples per train and test partition.
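The repeated random 70/30 partitioning referred to above as bootstrapping can be sketched with scikit-learn as follows; the array shapes are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def bootstrap_splits(X, y, cycles=9, test_size=0.3, seed=0):
    """Yield `cycles` independent random 70/30 train/test partitions."""
    for i in range(cycles):
        yield train_test_split(X, y, test_size=test_size, random_state=seed + i)

X = np.random.rand(1000, 100)            # placeholder consistency inputs
y = np.random.randint(0, 3, (1000, 25))  # placeholder consistency sequences
for X_train, X_test, y_train, y_test in bootstrap_splits(X, y):
    pass  # train on (X_train, y_train), evaluate on (X_test, y_test) each cycle
print(X_train.shape, X_test.shape)  # (700, 100) (300, 100)
```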
Upon each training epoch, the network was trained using RMSProp with RSS on
the train data partition. After each training epoch, the network was tested using the
same metric on the test data partition. The test data was never exposed to the net-
work during training, thus it was used to evaluate the general regression performance
of the network. Once the network reached a plateau in cost between both train and
test partitions, the network training was ceased. The number of selected train epochs
for each network was determined through observation of the train and test cost trends
characteristic of the network version across multiple random samples of the data.
2.8 Sensor Consistency Network Validation Cycles and Metrics
At the end of each bootstrapped training cycle, the trained consistency network model
was validated using the randomly-selected test dataset with multiple metrics in in-
ference mode to determine how well the trained network model performed on the
test set after training was complete. The metrics used to evaluate the consistency
seq2seq network after training were the MSE, RMSE, and r2 -metric. The implemen-
tation and visualizations for each of these metrics were constructed via scikit-learn
and matplotlib.
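A minimal scikit-learn sketch of these validation metrics follows; flattening the generated sequences across time steps before scoring is an assumption about how the metrics were aggregated.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def consistency_validation_metrics(y_true, y_pred):
    """MSE, RMSE and r2-metric over flattened consistency sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    mse = mean_squared_error(y_true, y_pred)
    return {"mse": mse, "rmse": float(np.sqrt(mse)), "r2": r2_score(y_true, y_pred)}

print(consistency_validation_metrics([2, 2, 1, 0], [1.9, 2.1, 1.2, 0.1]))
```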
The sensor fusion neural network was a model created for the purpose of predicting
whether two pairs of synchronized sensor track samples, each consisting of 25 consec-
utive time steps, were fused, based on the absolute differences of the 4 chosen streams
between the two tracks. The fusion dataset generation was discussed in section 2.6.4.
The network was trained to accept 4 difference sequences between the 4 streams of
the paired sensor tracks in order to classify whether the two sensor tracks were fused
via binary classification.
The fusion network, much like the consistency network, was a multi-stream neural
network. It accepted 100 inputs and generated 25 outputs. It consisted of four path-
ways, each of which accepted 25 inputs. Each pathway was designated for the abso-
lute difference sequence computed between one of the four chosen stream types of two
synchronized, paired sensor tracks each consisting of 25 consecutive time steps. Each
pathway of the network consisted of the same sequential neural network model, but
they were all trained to learn different weights between the connections. This pathway
model was also developed based on [Cho et al., 2014] and [Bahdanau et al., 2014],
which applied an RNN encoder-decoder network for seq2seq translation from one lan-
guage sequence into another.
The chosen pathway model had an encoder layer and a decoder layer as in [Cho et al., 2014]
and [Bahdanau et al., 2014], but it also had a FC central embedding layer. This model
was very similar to an auto-encoder in nature, except that it understood temporal
context and its training was supervised. It has been shown that stacking network lay-
ers, especially recurrent with FC layers, improves network predictive performance by
providing temporal context for sequence generation [Plahl et al., 2013]. The encoder
and decoder layers each had 25 units and their own sigmoidal activation functions.
Each of these layers had its own GlorotUniform random weight initialization scheme
and was followed by a dropout layer. In training, the dropout layer was necessary
for network regularization in order to prevent over-fit of certain training examples
[Hinton et al., 2012], [Srivastava et al., 2014]. A central embedding layer with non-
linear activation functions and random weight initialization connected the encoder
and decoder layers. Experiments were conducted with four versions of simultane-
ously paired encoder and decoder layers in the multi-stream network. The four chosen
alternatives of encoder and decoder layers are listed in table 2.2.
Table 2.2: Sensor Fusion Neural Network Encoder and Decoder Layer Types
1 Fully-Connected
2 Recurrent
3 Gated Recurrent Unit (GRU)
4 Long Short-Term Memory (LSTM)
Each of these pathway networks yielded a series of features with potentially tem-
poral context, which were concatenated into a single vector of features and fed to
a MLP network. The MLP consisted of two hidden layers and an output SoftMax
classification layer. The neurons in the hidden layers had non-linear activation func-
tions. The output layer applied the SoftMax activation function with two classes to
generate confidences in whether the samples were fused via binary classification. The
general model structure is shown in figure 2.8.
The fusion neural network was trained via the AdaGrad optimization algorithm as
described in section 1.5.4 with either BP or BPTT. BP was applied when the network
was feedforward, while BPTT was applied when the network was recurrent. Batch
training was applied with a batch size of 10,000 to increase speed on large datasets
and accommodate millions of examples across train and test partitions. The cost
function used to evaluate and minimize error in network classification was the cross-
entropy binary cost function. This cost function was useful for minimizing error in
the binary classification problem. A learning rate of 0.01 for AdaGrad was applied
for the problem in order to provide a balance between learning speed and quality.
Figure 2.8: The multi-stream sensor fusion classification deep neural network. In experi-
ments, the encoder and decoder layers were selected from table 2.2.
2.9.3 Sensor Fusion Network Training Cycles and Data Partitioning
Each of the four fusion network versions constructed based on table 2.2 was trained and tested to completion over 9 bootstrapping cycles per each of the five fusion dataset
types to determine the average classification performance of the selected network
version per the selected dataset type. Bootstrapping was utilized to provide statistical
certainty that for a given dataset with specific fusion error thresholds, a given network
version was appropriate to apply by training and testing across a sample distribution
of the given dataset over all available fusion data for that dataset.
In each bootstrapping cycle, per each network and dataset, the labeled fusion
network data was randomly partitioned into two separate categories of 70 percent
train and 30 percent test data randomly sampled from the overall specifically-labeled
dataset of millions of examples. Per each bootstrapping cycle, there were millions of
training and testing samples.
Upon each training epoch, the network was trained using AdaGrad with the cross-
entropy binary cost function on the train data partition. After each training epoch,
the network was tested using the same metric on the test data partition. The test data
was never exposed to the network during training, thus it was used to evaluate the
general classification performance of the network. Once the network reached a plateau
in cost between both train and test partitions, the network training was ceased. The
number of selected train epochs for each network was determined through observation
of the train and test cost trends characteristic of the network version across multiple
random samples of the selected dataset.
2.10 Sensor Fusion Network Validation Cycles and Metrics
At the end of each bootstrapped training cycle, the trained fusion network model was
validated on the randomly-selected test dataset with multiple metrics in inference
mode to determine how well the trained network model performed on the test set after
training was complete. The metrics used to evaluate the fusion network after training
were accuracy, misclassification, precision, recall, F1 weight macro, ROC AUC micro
average and ROC AUC macro average. The implementation and visualizations for
each of these metrics were constructed via scikit-learn and matplotlib.
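A scikit-learn sketch of these classification metrics follows; the decision threshold and the use of the plain binary ROC AUC, rather than the micro- and macro-averaged variants reported later, are simplifications for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def fusion_validation_metrics(y_true, y_prob, threshold=0.5):
    """Binary fusion metrics computed from predicted class-1 confidences."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    acc = accuracy_score(y_true, y_pred)
    return {
        "accuracy": acc,
        "misclassification": 1.0 - acc,
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
        "roc_auc": roc_auc_score(y_true, y_prob),
    }

print(fusion_validation_metrics([1, 0, 1, 1], [0.9, 0.2, 0.6, 0.4]))
```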
Training Results Storage and Visualization
The training results for all networks were stored in the Hierarchical Data Format 5 (HDF5). This format was developed to store huge amounts of numerical data in a binary, hierarchical structure for easy manipulation and fast access [hdf, 2017]. The
library used for reading data of this type in python was h5py. The h5py library al-
lowed one to store and manipulate data in this format as if it were a numpy array
[Collette, 2013]. The results were visualized using the Nervana Systems nvis script
designed for easy training results visualization using h5py and Bokeh to generate
html-embedded Bokeh plots [Nervana Systems, 2017d]. Bokeh is an interactive visu-
alization library designed around modern web browsing for high-performance interac-
tivity over large datasets much like the popular D3.js [Continuum Analytics, 2015].
The deep learning fusion system was designed much in the style of the fusion dataset
generation algorithm with the use of trained deep neural networks instead of static
error thresholds for measuring similarity. The best version of the consistency network
was employed to generate radar track consistency sequences based on the latest 25
time steps of radar track data. The vehicle IDs provided by the camera software were
utilized to determine the consistency sequences of the latest 25 time steps of camera
track data. These sequences were binarized into bins of consistent and unknown
data. Then, for each radar track and each camera track, the sequences were logically
AND’ed to determine where the tracks were both consistent. The absolute differences
were taken between the given track pairs, and these differences were masked based on the combined consistency sequences before being provided to the fusion neural network to classify whether the given track pair was fused.
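A NumPy sketch of this consistency gating is shown below; treating only time steps where both tracks are consistent as valid, and zeroing the differences elsewhere, are assumptions made for illustration rather than the exact masking rule used in this work.

```python
import numpy as np

CONSISTENT = 2

def joint_consistency_mask(radar_seq, camera_seq):
    """Binarize two 25-step consistency sequences and logically AND them."""
    radar_ok = np.asarray(radar_seq) == CONSISTENT
    camera_ok = np.asarray(camera_seq) == CONSISTENT
    return radar_ok & camera_ok

def masked_differences(radar_window, camera_window, mask):
    """Absolute stream differences, zeroed where either track is not consistent."""
    diffs = np.abs(np.asarray(radar_window) - np.asarray(camera_window))
    return diffs * mask[:, None]  # (25,) mask broadcast over the four streams

mask = joint_consistency_mask(np.full(25, 2), np.full(25, 2))
masked = masked_differences(np.random.rand(25, 4), np.random.rand(25, 4), mask)
print(masked.shape)  # (25, 4); flattened, these feed the fusion network
```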
Chapter 3
Results
According to the state-of-the-art fusion system, when any two fused sensor tracks were
simultaneously consistent, with no sudden track or ID reassignments over a selected
period of time, there was a strong correlation between the data streams of those two
tracks. The correlations in these cases were measured with the r2 -metric (R2-Metric),
the normalized cross-correlation (Norm-XCorr), and the median 1-dimensional Eu-
clidean distance (Median 1-D Euclid) between each of the four stream pairs. The
correlations found between selected samples of fused, consistent sensor track pairs
are shown in table 3.1.
Table 3.1: Sample Correlation Metrics between Consistent, Fused Sensor Tracks according
to State-of-the-art Fusion System.
All consistency neural networks were trained using the NVIDIA GTX 980Ti GPU
with CUDA 8. Each network was trained over 9 bootstrapping trials on over one mil-
lion samples. For each trial, there were hundreds of thousands of randomly-sampled
training examples, amounting to 70 percent of the dataset, and hundreds of thou-
sands of test examples, derived from the remainder of the dataset. For each training
trial, the sum-squared error according to equation 1.13 was measured between the
predicted outputs of the network and the target outputs for both the train and test
sets of data. Each of the following plots showcases an example of a training trial for
each of the four consistency neural network versions from random weight initialization
to training completion once minimal regression error was met. Plot 3.1 showcases the
regression error over time for a train/test trial of the MLP version of the consistency
neural network.
Figure 3.1: An example of the train and test error over time per one complete cycle of
consistency MLP network training.
According to plot 3.1, the MLP version of the consistency network took 100 epochs
to train until the minimal error amount was reached. Each training epoch took 1.05
seconds over 85 batches, and each evaluation cycle took 0.1 seconds on average. Plot
3.2 showcases the regression error over time for a train/test trial of the RNN version
of the consistency neural network.
Figure 3.2: An example of the train and test error over time per one complete cycle of
consistency RNN network training.
According to plot 3.2, the RNN version of the consistency network took 150 epochs
to train until the minimal error amount was reached. Each training epoch took 1.2
seconds over 85 batches, and each evaluation cycle took 0.12 seconds on average. Plot
3.3 showcases the regression error over time for a train/test trial of the GRU version
of the consistency neural network.
Figure 3.3: An example of the train and test error over time per one complete cycle of
consistency GRU network training.
According to plot 3.3, the GRU version of the consistency network took 100 epochs
to train until the minimal error amount was reached. Each training epoch took 1.8
seconds over 85 batches, and each evaluation cycle took 0.25 seconds on average. Plot
3.4 showcases the regression error over time for a train/test trial of the LSTM version
of the consistency neural network.
Figure 3.4: An example of the train and test error over time per one complete cycle of
consistency LSTM network training.
According to plot 3.4, the LSTM version of the consistency network took 200
epochs to train until the minimal error amount was reached. Each training epoch
took 1.8 seconds over 85 batches, and each evaluation cycle took 0.25 seconds on
average.
3.3 Sensor Consistency Neural Network Validation Results
Figures 3.5, 3.6, 3.7, and 3.8 showcase the validation performance of the consistency
neural network variations on randomly-selected test datasets generated from actual
sensor data using the MSE, RMSE, and r2 -metric regression metrics across 9 boot-
strapping trials. The neural networks were never exposed to the test dataset during the training cycle.
Figure 3.5: MSE, RMSE and r2 -metric performance for the consistency MLP neural network
on the test dataset across 9 bootstrapping train/test trials from over 1 million randomly-
selected samples. The average score and its standard deviation are included in the title of
each subplot.
Figure 3.6: MSE, RMSE and r2 -metric performance for the consistency RNN neural network
on the test dataset across 9 bootstrapping train/test trials from over 1 million randomly-
selected samples. The average score and its standard deviation are included in the title of
each subplot.
Figure 3.7: MSE, RMSE and r2 -metric performance for the consistency GRU neural network
on the test dataset across 9 bootstrapping train/test trials from over 1 million randomly-
selected samples. The average score and its standard deviation are included in the title of
each subplot.
Figure 3.8: MSE, RMSE and r2 -metric performance for the consistency LSTM neural
network on the test dataset across 9 bootstrapping train/test trials from over 1 million
randomly-selected samples. The average score and its standard deviation are included in
the title of each subplot.
Statistics between fused camera and radar sensor tracks were computed to determine
thresholds for the allowable error between each of the four sensor stream types when
paired. The statistics were computed based on millions of randomly-sampled fusion
occurrences between radar and camera from the state-of-the-art fusion system with
outliers removed. These statistics shaped the fused examples found in the first fusion
dataset based on the state-of-the-art fusion system. The statistics are shown in table
3.2.
Table 3.2: State-of-the-art fusion system error statistics between fused sensor track streams.
The amount of allowable error between the sensors depended directly on the sensor
noise present in either sensor track. The amount of sensor noise depended on multiple
factors, including time of day, weather, calibration and mounting point of sensor.
Per each vehicle, the sensors were slightly different in regards to all of these factors
and thus the sensor noise behaved differently depending on the vehicle the data was
acquired from. These statistics were useful to determine multiple levels of allowable
error thresholds for creating binary fusion datasets via the hand-engineered fusion
labeling algorithm.
Four binary fusion datasets were generated based on the computed statistics from
the state-of-the-art fusion system. The error thresholds for each generated dataset
version, as derived from the computed fusion statistics, were modeled according to
table 3.3.
Table 3.3: Error threshold models for the four generated fusion dataset versions: mean − 0.5 · std, mean + 0.5 · std, 1/3 · max, and 1/2 · max + noise.
The tolerated error thresholds between fused camera and radar tracks for each
of the four streams for each of the four fusion datasets were computed based on the
state-of-the-art fusion system statistics. These error thresholds are listed in table
3.4, where stream 1 is longitudinal distance, stream 2 is lateral position, stream 3 is
relative velocity, and stream 4 is width.
Table 3.4: Tolerated Fusion Dataset Error Thresholds between Fused Sensor Streams.
All fusion neural networks were trained using the NVIDIA GTX 980Ti GPU with
CUDA 8. Each network was trained over 9 bootstrapping trials on over two million
samples per each fusion dataset. For each trial, there were 2.5 million randomly-
sampled training examples, amounting to 70 percent of the dataset, and over 1 million
test examples, derived from the remainder of the dataset. For each training trial, the
cross-entropy error according to equation 1.18 was measured between the predicted
outputs of the network and the target outputs for both the train and test sets of
data. Each of the following plots showcases an example of a training trial for each of
the four fusion neural network versions from random weight initialization to training
completion once minimal classification error was met. Example plots are used for
brevity since most of the plots per network version contained very similar outcomes.
Plot 3.9 showcases the classification error over time for a train/test trial of the MLP
version of the fusion neural network.
Figure 3.9: An example of the train and test error over time per one complete cycle of
fusion MLP network training on the most restrictive fusion dataset.
The MLP version of the fusion neural network took 10 epochs to train per each
fusion dataset. Each training epoch took 3.4 seconds over 254 batches, and each
evaluation cycle took 0.33 seconds on average. Plot 3.10 showcases the classification
error over time for a train/test trial of the RNN version of the fusion neural network.
Figure 3.10: An example of the train and test error over time per one complete cycle of
fusion RNN network training on the most restrictive fusion dataset.
The RNN version of the fusion neural network took 10 epochs to train per each
fusion dataset. Each training epoch took 3.5 seconds over 250 batches, and each
evaluation cycle took 0.4 seconds on average. Plot 3.11 showcases the classification
error over time for a train/test trial of the GRU version of the fusion neural network.
Figure 3.11: An example of the train and test error over time per one complete cycle of
fusion GRU network training on the most restrictive fusion dataset.
The GRU version of the fusion neural network took 10 epochs to train per each
fusion dataset. Each training epoch took 5 seconds over 250 batches, and each eval-
uation cycle took 0.7 seconds on average. Plot 3.12 showcases the classification error
over time for a train/test trial of the LSTM version of the fusion neural network.
Figure 3.12: An example of the train and test error over time per one complete cycle of
fusion LSTM network training on the most restrictive fusion dataset.
The LSTM version of the fusion neural network took 10 epochs to train per each
fusion dataset. Each training epoch took 6 seconds over 285 batches, and each eval-
uation cycle took 0.8 seconds on average.
Sensor Fusion Neural Network Validation Results
Four different figures follow which showcase the validation performance of the four
fusion neural network variations on the state-of-the-art sensor fusion system dataset.
The networks were validated with the accuracy, misclassification, precision, recall,
F1 weighted macro, ROC AUC micro, and ROC AUC macro metrics across 9 bootstrapping trials of random train/test data selection. The neural networks were never exposed to the test dataset during the training cycle. Validation
results for each of the four fusion network variations trained on the generated fusion
datasets based on the hand-engineered fusion algorithm described in section 2.6.4 are
included in the appendices. The neural network validation figures in the appendix
are divided into sections based on the fusion dataset types outlined in table 3.4.
Figure 3.13: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC
micro and ROC AUC macro scores for the MLP fusion neural network across 9 bootstrap-
ping trials of random train/test data selection from over 1 million randomly-selected samples
via the state-of-the-art fusion dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 3.14: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC
micro and ROC AUC macro scores for the RNN fusion neural network across 9 bootstrap-
ping trials of random train/test data selection from over 1 million randomly-selected samples
via the state-of-the-art fusion dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 3.15: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC
micro and ROC AUC macro scores for the GRU fusion neural network across 9 bootstrap-
ping trials of random train/test data selection from over 1 million randomly-selected samples
via the state-of-the-art fusion dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 3.16: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC
micro and ROC AUC macro scores for the LSTM fusion neural network across 9 boot-
strapping trials of random train/test data selection from over 1 million randomly-selected
samples via the state-of-the-art fusion dataset. The average score and its standard deviation
are included in the title of each subplot.
Precision and Recall Validation Results
Following are sample precision and recall curve plots including micro-averaged preci-
sion and recall curves for each of the fusion neural network versions. Only samples
are included for brevity. These plots echo information from the previous figures but are still worth visualizing. Additionally, the plots per network version
were very similar per dataset type.
Figure 3.17: Sample precision, recall and micro-averaged precision and recall curves for the
fusion MLP network.
Figure 3.18: Sample precision, recall and micro-averaged precision and recall curves for the
fusion RNN network.
Figure 3.19: Sample precision, recall and micro-averaged precision and recall curves for the
fusion GRU network.
Figure 3.20: Sample precision, recall and micro-averaged precision and recall curves for the
fusion LSTM network.
Fusion System Agreement Evaluation
The agreement between the deep learning sensor fusion system and the state-of-the-
art fusion system outputs was evaluated. The deep learning sensor fusion system
with both consistency and fusion neural networks was used for the evaluations. The
LSTM version of the consistency neural network was combined with each of the four
versions of the fusion neural network trained on the state-of-the-art fusion system
labels for each separate evaluation. The outputs between the state-of-the-art fusion
system and deep learning fusion systems were compared in order to determine the
percentage agreement between the two systems.
The confusion matrices that follow correspond to the fusion results between the
two most populated radar tracks with multiple camera objects over 1 hour and 15
minutes of never-before-seen test sensor data. Over the course of that time period,
approximately 542 unique cars were visible to both radar and camera sensors. In
each confusion matrix, the diagonal elements imply agreement between the two fusion
systems while the off-diagonal elements imply disagreement between the two fusion
systems. An ideal agreement between the two would be a diagonal matrix where the
diagonal elements sum to 1. A template confusion matrix is shown in table 3.5.
Table 3.5: Confusion Matrix Template Between Deep Learning Fusion System and State-
of-the-art (SOA) Fusion System.
              SOA Fused                       SOA Non-Fused
NN Fused      true positive agreement         false negative disagreement
NN Non-Fused  false positive disagreement     true negative agreement
The upper left cell in each matrix represents the percentage of true positive fusion
agreements between both fusion systems, shown by example in figure 3.21. The bot-
tom left cell of each matrix represents the percentage of false positive fusion disagree-
ments, as depicted in figure 3.22. In these cases, the state-of-the-art fusion system
found fusion but the deep learning fusion system did not. The bottom right cell of
each matrix represents the percentage of true negative fusion agreements, where nei-
ther system found fusion. This occurrence is shown by example in figure 3.23. Lastly,
the upper right cell in each matrix represents the percentage of false negative fu-
sion disagreements between the two fusion systems, shown by example in figure 3.24.
In these cases, the deep learning fusion system found fusion but the state-of-the-art
fusion system did not.
Figure 3.21: A sample true positive fusion agreement between the deep learning sensor
fusion system and the state-of-the-art sensor fusion system over a one-second period. The
fused radar and camera track streams are plotted against each other to demonstrate the
fusion correlation.
Figure 3.22: A sample false positive fusion disagreement between the deep learning sensor
fusion system and the state-of-the-art sensor fusion system over a one-second period. The
radar and camera track streams are plotted against each other to demonstrate the potential
fusion correlation where the systems disagreed.
Figure 3.23: A sample true negative fusion agreement between the deep learning sensor
fusion system and the state-of-the-art sensor fusion system over a one-second period. The
non-fused radar and camera track streams are plotted against each other to demonstrate
the absence of valid data in the camera track while the radar track is consistently populated
thus leading to non-fusion.
Figure 3.24: A sample false negative fusion disagreement between the deep learning sensor
fusion system and the state-of-the-art sensor fusion system over a one-second period. The
radar and camera track streams are plotted against each other to demonstrate the potential
fusion correlation where the systems disagreed.
The confusion matrices for the fusion MLP are shown in tables 3.6 and 3.7. The
first matrix showcases a 95.08% agreement between the deep learning fusion system
and the state-of-the-art fusion system for the most populated radar track. The second
matrix showcases a 93.88% agreement between the two fusion systems for the second
most populated track. The agreement per each matrix was determined by summing
the diagonal elements of the matrix.
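The agreement computation reduces to the trace of each normalized confusion matrix, as in the short sketch below; the matrix values shown are illustrative and are not results from this work.

```python
import numpy as np

def percent_agreement(confusion):
    """Percentage agreement: sum of the diagonal of a normalized confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    return 100.0 * np.trace(confusion) / confusion.sum()

cm = np.array([[0.62, 0.02],   # illustrative proportions only
               [0.03, 0.33]])
print(round(percent_agreement(cm), 2))  # 95.0
```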
Table 3.6: Confusion Matrix Between Deep Learning Fusion System with Fusion MLP and
State-of-the-art (SOA) Fusion System on Most Populated Radar Track.
Table 3.7: Confusion Matrix Between Deep Learning Fusion System with Fusion MLP and
State-of-the-art (SOA) Fusion System on Second Most Populated Radar Track.
The confusion matrices for the fusion RNN are shown in tables 3.8 and 3.9. The
first matrix showcases a 95.22% agreement between the deep learning fusion system
and the state-of-the-art fusion system for the most populated radar track. The second
matrix showcases a 94.1% agreement between the two fusion systems for the second
most populated track.
Table 3.8: Confusion Matrix Between Deep Learning Fusion System with Fusion RNN and
State-of-the-art (SOA) Fusion System on Most Populated Radar Track.
Table 3.9: Confusion Matrix Between Deep Learning Fusion System with Fusion RNN and
State-of-the-art (SOA) Fusion System on Second Most Populated Radar Track.
The confusion matrices for the fusion GRU are shown in tables 3.10 and 3.11. The
first matrix showcases a 95.6% agreement between the deep learning fusion system
and the state-of-the-art fusion system for the most populated radar track. The second
matrix showcases a 94.39% agreement between the two fusion systems for the second
most populated radar track.
Table 3.10: Confusion Matrix Between Deep Learning Fusion System with Fusion GRU and
State-of-the-art (SOA) Fusion System on Most Populated Radar Track.
Table 3.11: Confusion Matrix Between Deep Learning Fusion System with Fusion GRU and
State-of-the-art (SOA) Fusion System on Second Most Populated Radar Track.
The confusion matrices for the fusion LSTM are shown in tables 3.12 and 3.13. The
first matrix showcases a 95.2% agreement between the deep learning fusion system
and the state-of-the-art fusion system for the most populated radar track. The second
matrix showcases a 94.13% agreement between the two fusion systems for the second
most populated radar track.
Table 3.12: Confusion Matrix Between Deep Learning Fusion System with Fusion LSTM
and State-of-the-art (SOA) Fusion System on Most Populated Radar Track.
Table 3.13: Confusion Matrix Between Deep Learning Fusion System with Fusion LSTM
and State-of-the-art (SOA) Fusion System on Second Most Populated Radar Track.
NN Fused NN Non-Fused
Deep Learning Fusion System Runtime Results
The deep learning sensor fusion system was run on multiple computer systems to
evaluate the response of the system in terms of real-time updates, which were provided
at a 40 ms sample rate. The fusion system was evaluated with the LSTM versions of
both consistency and fusion neural networks because these were inherently the slowest
deep neural networks constructed given their complexity. The two notable computer
systems that the deep learning fusion system was evaluated on were the following:
– a deep learning PC with an Intel CPU, an NVIDIA GeForce GTX 980Ti GPU, and 32 GB DDR4 RAM
– an embedded Raspberry Pi 3 with an ARMv8 CPU
On the deep learning PC, the fusion system was evaluated using both CPU and
GPU, while on the embedded system, the fusion system was evaluated using only
CPU. The runtime results using CPU on the deep learning PC with multiple config-
urations involving serial and parallel computing with batch mode were as stated in
table 3.14. The runtime results using CPU and GPU on the deep learning PC with
multiple configurations in serial, parallel and batch mode were as stated in table 3.15.
The runtime results using CPU on the embedded system with multiple configurations
in both serial and parallel were as stated in table 3.16.
Table 3.14: Deep Learning Fusion System Runtime on Intel CPU with Serial, Parallel and
Batch Processing.
Version Mean (s) Std. Dev. (s)
Serial 0.347 0.008
Batch 0.106 0.01
Multi-thread 0.577 0.118
Multi-process 0.113 0.021
Table 3.15: Deep Learning Fusion System Runtime with Intel CPU and NVIDIA GPU
with Serial, Parallel and Batch Processing.
Version Mean (s) Std. Dev. (s)
Serial GPU 0.678 0.016
Batch GPU 0.081 0.01
Multi-thread GPU 0.715 0.152
Multi-process w/ Batch GPU 0.034 0.01
Table 3.16: Deep Learning Fusion System Runtime on Raspberry Pi 3 CPU with Serial,
Parallel and Batch Processing.
Version Mean (s) Std. Dev. (s)
Serial 3.23 0.42
Batch 0.68 0.014
Multi-thread 2.14 0.52
Multi-process 1.38 0.242
Chapter 4
Discussion
The sample correlation metrics computed in section 3.1 between each of the four
streams of consistent, fused sensor tracks, as labeled by the state-of-the-art fusion
system, reveal that the correlation between a camera track and a radar track is strong
when both tracks are consistent. Among all four streams, the most reliable streams
for fusion are the longitudinal distance, lateral position, and width, due to their strong
correlations between both sensors, while the relative velocity has the weakest correlation
between both sensors. The stream rankings in terms of correlation strength
from strongest to weakest with the r²-metric are the following:
1. Lateral position
2. Longitudinal distance
3. Width, 0.986
4. Relative velocity
These rankings reveal the reliability order with which a dependent variable, such
as the fusion between two sensor tracks, could be predicted from any of the four
selected streams based on the proportion of variance found between the fusion results
of the state-of-the-art fusion system. The lateral position was revealed to be the
of the state-of-the-art fusion system. The lateral position was revealed to be the
most reliable stream for predicting a dependent variable based on its variance. The
longitudinal distance was second in terms of reliable prediction based on its variance,
off by a marginal 0.0057 from the lateral position in terms of the correlation metric.
The width was shown to be less reliable than the previous two streams under this
metric, and the relative velocity was the least reliable stream for predicting a depen-
dent variable based on its proportion of variance according to this metric. Despite
the comparison shown, all the streams utilized for fusion were highly correlated when
fused, as compared to other potential data streams acquired from the sensors.
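As a minimal sketch, assuming the r²-metric here is the squared Pearson correlation between paired radar and camera samples of a single stream, the metric could be computed as:

    import numpy as np

    def r_squared(radar_stream, camera_stream):
        # Squared Pearson correlation between paired radar and camera samples of
        # one stream; values near 1 mean that one stream's variance is almost
        # fully explained by the other.
        r = np.corrcoef(radar_stream, camera_stream)[0, 1]
        return float(r ** 2)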
The stream rankings in terms of correlation strength from strongest to weakest
with the normalized cross correlation metric where the signals completely overlapped
are the following:
1. Width, 0.9998
These rankings reveal the reliability order of the streams upon which a displacement-
based comparison strategy, such as a convolution, could predict fusion between the
two sensors based on the fusion results from the state-of-the-art fusion system. The
width was revealed to be the most reliable stream for fusing the two sensor tracks
based on a correlation of displacement. This indicates that the width stream provides
the most reliable displacement correlation between the two sensors, and when both
width streams correlate strongly with direct overlap, there is a higher chance
that the compared sensor tracks are fused. The longitudinal distance was second to
the width under this metric.
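A minimal sketch of the complete-overlap (zero-lag) normalized cross correlation between two equal-length streams, assuming no mean subtraction is applied, is:

    import numpy as np

    def normalized_cross_correlation(a, b):
        # Zero-lag normalized cross correlation of two equal-length streams;
        # a value near 1 indicates the streams overlap almost perfectly up to a
        # positive scale factor.
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        eps = 1e-12  # guards against division by zero for all-zero streams
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))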
A further ranking of the streams by the distance between the two track streams reveals
that the width is the strongest indicator of when a radar track and a camera track
are fused. The width shows strong agreement with minimal error and deviation
between the two sensors. The rankings also reveal that the longitudinal distance has
a larger error tolerance and variance than all other streams between the radar and
camera when any two tracks are fused. This larger error tolerance may be due to many
factors involving the sensors including sensor position, calibration, and approximation
of distance. The camera must approximate the distance of any given vehicle using
various algorithms, and this distance approximation may be the cause of a larger
acceptable error between longitudinal distance streams of both sensors. The larger
error tolerance in the longitudinal distance is, however, proportionate to the standard
deviation in the error. As shown in previous correlation rankings, the lateral position
still requires a minimal error tolerance between both sensors while the relative velocity
requires a higher error tolerance. Based on this ranking order, the two streams that
agree with minimal error between the two sensors are the width and lateral position
while the relative velocity and longitudinal distance streams must have a higher error
tolerance between the two sensors due to sensor-specific reasons.
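A minimal sketch of how such per-stream error and deviation statistics might be computed over fused track pairs (the exact procedure used in this work may differ) is:

    import numpy as np

    def stream_error_stats(radar_values, camera_values):
        # Mean absolute difference and standard deviation of the differences
        # between paired radar and camera samples of one stream; smaller values
        # indicate a tighter error tolerance between the two sensors.
        diff = np.asarray(radar_values, dtype=float) - np.asarray(camera_values, dtype=float)
        return float(np.mean(np.abs(diff))), float(np.std(diff))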
The computed correlation results support the idea that the four selected sensor
streams, when consistent, are highly correlated between the radar and camera sensors
according to various correlation metrics. These streams thus proved to be very useful
for the determination of sensor fusion once they were rearranged or masked to be
consistent with no track reassignments.
Sensor Consistency Neural Network Training Results
The train/test error versus time results showcased in section 3.2.1 reveal the con-
vergence of each consistency neural network to minimal error using the RMSProp
training algorithm with the RSS cost metric. Each network took a different number
of epochs to reach minimal regression error. The network versions are ranked in terms
of training time in epochs from least to greatest below:
1. MLP
2. GRU (tied with the MLP)
3. RNN
4. LSTM
Interestingly, the MLP-based consistency neural network took the least amount of
time to train. The GRU-based network was tied with the MLP in terms of training
epoch time, but the errors observed over time in the GRU consistency training error plot
in figure 3.3 were greater than the errors observed in the MLP training plot in figure
3.1. An interesting phenomenon took place in the GRU training error plot: there
was a large sum-squared error for train and test samples until epoch 45, at which point
the error spontaneously dropped to less than half of its previous value. This sudden decrease
in error was not something that took place in any of the other consistency training
error plots, but it could have been based on a decay in the learning rate due to the
RMSProp optimization strategy. The RNN-based network ranked third in training
time. Surprisingly, the LSTM-based neural network took the longest amount of time
to train. Typically, the RNN-based network would take the most time to train due
to its lack of gated logic to modulate data flow, but in this case the LSTM network
took a longer time to train.
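For reference, a single RMSProp parameter update has roughly the following form (hyperparameter values here are illustrative rather than the settings used in this work); the running average of squared gradients adapts the effective step size, which can produce abrupt changes in training error:

    import numpy as np

    def rmsprop_step(param, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
        # Maintain a running average of squared gradients and scale the step by
        # its square root, giving each parameter its own effective learning rate.
        cache = decay * cache + (1.0 - decay) * grad ** 2
        param = param - lr * grad / (np.sqrt(cache) + eps)
        return param, cache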
The wall-clock time required per training and evaluation epoch grew as the consis-
tency network complexity increased as described in section 3.2.1. The required times
per training epoch over 85 batches are ranked from least to greatest below:
1. MLP
2. RNN
3. GRU
4. LSTM
Based on the wall-clock time ranking list, it is shown that the MLP-based network
takes the least amount of time for forward and backward propagation while the LSTM-
based network takes the most time for said propagations. These results support the
idea that the increased network complexity incurs additional runtime overhead.
These results demonstrate that the consistency neural networks were able to quickly
learn, with minimal error, when data was consistent, inconsistent, or unknown. The
observation that each training epoch took longer as the networks grew more complex
shows that the additional network complexity incurred extra overhead runtime for
forward and backward propagation of input samples.
Sensor Consistency Neural Network Validation Results
The consistency neural networks were evaluated over 9 bootstrapping trials on the
test datasets with three regression validation metrics including the MSE, RMSE, and
r²-metric. The validation results are shown in section 3.3. The mean validation
results for each network are summarized in table 4.1.
Table 4.1: Consistency Neural Networks Validation Results
Version MSE RMSE r²-metric
MLP 0.0319 0.1775 0.9714
RNN 0.033 0.182 0.9643
GRU 0.0276 0.1656 0.9755
LSTM 0.0281 0.167 0.973
According to the table, the best performing network across all 9 bootstrapping
trials was the GRU-based solution. The LSTM network was very close in comparison
to the GRU, and the difference in MSE was a mere 5 × 10⁻⁴. The MLP network was
the third best performer, and the RNN-based solution was the worst performer. All
networks were able to provide substantial generalization power for the consistency
sequence regression problem based on the four synchronized input streams of radar
track data.
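A minimal sketch of these regression validation metrics for a single bootstrapping trial, assuming the r²-metric corresponds to scikit-learn's coefficient of determination, is:

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    def regression_validation(y_true, y_pred):
        # MSE, RMSE and r^2 for one trial; the reported table values would be
        # the means of these metrics over all bootstrapping trials.
        mse = mean_squared_error(y_true, y_pred)
        return mse, float(np.sqrt(mse)), r2_score(y_true, y_pred)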
Sensor Fusion Neural Network Training Results
The train/test error versus time results showcased in section 3.5.1 display quick con-
vergence to minimal classification error for each trained fusion network version using
the AdaGrad optimization algorithm with the cross-entropy binary cost metric. Each
network took 10 epochs to fully train. The train and test errors in all plots looked
very similar. The training error became very small after only one epoch of
training for each network. The next 9 epochs were useful for learning edge cases that
were not absorbed in the first epoch. The test error was small for all versions of the
network throughout all training epochs. The train and test errors for each network
were able to converge to the range [0, 0.01]. The large batch size and number of
samples were most likely large contributors to the speed of the learning process.
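A minimal sketch of the binary cross-entropy cost and a single AdaGrad update (hyperparameters illustrative, not the values used in this work) is:

    import numpy as np

    def binary_cross_entropy(y_true, y_prob, eps=1e-12):
        # Mean binary cross-entropy between fused/non-fused labels and predicted
        # fusion probabilities.
        y_prob = np.clip(y_prob, eps, 1.0 - eps)
        return float(-np.mean(y_true * np.log(y_prob)
                              + (1.0 - y_true) * np.log(1.0 - y_prob)))

    def adagrad_step(param, grad, accum, lr=0.01, eps=1e-8):
        # AdaGrad accumulates squared gradients, so the per-parameter step size
        # shrinks monotonically as training progresses.
        accum = accum + grad ** 2
        param = param - lr * grad / (np.sqrt(accum) + eps)
        return param, accum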
The wall-clock time required per training and evaluation epoch grew as the fusion
network complexity increased as described in section 3.5.1. The required times per
training epoch over a range of 250 to 285 batches are ranked from
least to greatest below:
1. MLP
2. RNN
3. GRU
4. LSTM
Based on the wall-clock time ranking list, it is shown that the MLP-based network
takes the least amount of time for forward and backward propagation while the LSTM-
based network takes the most time for said propagations. These results support the
idea that the increased network complexity incurs additional runtime overhead.
These results support the application of the designed fusion neural network struc-
ture for classifying fusion of the four streams of absolute differences between two
radar and camera tracks. The networks were able to learn the complex nature of the
fusion scenarios labeled according to the state-of-the-art sensor fusion system quickly
with minimal classification error. These networks also learned the nature of the hand-
engineered fusion label generation algorithm just as quickly and with equally minimal error.
These similar results across all fusion datasets provide substantial evidence that all
designed versions of the fusion neural networks are well-suited for learning and provid-
ing the complex capabilities of the preexisting state-of-the-art fusion system as well as
other alternative sensor fusion algorithms. The observation that each training epoch
took longer as the networks grew more complex shows that the additional network
complexity incurred extra overhead runtime for forward and backward propagation
of input samples.
Sensor Fusion Neural Network Validation Results
The fusion neural networks were evaluated over 9 bootstrapping trials on the test
datasets with seven classification validation metrics including accuracy (ACC), mis-
classification (1-ACC), precision (PPV), recall (TPR), F1 weighted macro (F1), ROC
AUC micro (ROC micro), and ROC AUC macro (ROC macro). The validation re-
sults are shown in section 3.6. The mean validation results for each network trained
and evaluated on the state-of-the-art sensor fusion system dataset, as acquired from
3.6.1, are summarized in table 4.2.
According to the table, the best performing network across all 9 bootstrapping
trials was the GRU-based solution. The LSTM network was very close in comparison
to the GRU, and the difference in precision was a mere 2 × 10⁻⁴ while the difference
Table 4.2: Sensor Fusion Neural Networks Validation Results Based on State-of-the-art
Fusion Dataset
Version ACC 1-ACC PPV TPR F1 ROC micro ROC macro
MLP 99.85 0.15 0.9985 0.9984 0.9985 0.9999 0.9999
RNN 99.86 0.14 0.9986 0.9985 0.9986 0.9999 0.9999
GRU 99.88 0.12 0.9989 0.9989 0.9989 0.9999 0.9999
LSTM 99.87 0.13 0.9987 0.9986 0.9987 0.9999 0.9999
in recall was a mere 3 × 10⁻⁴. The RNN network was the third best performer, and
the MLP-based solution was the worst performer. All networks were able to provide
substantial generalization power for the sensor fusion classification problem based
on the absolute differences between the four synchronized input streams of camera
and radar track data. These networks were shown to perform exceptionally well on
fusion datasets based on the state-of-the-art sensor fusion system. The precision and
recall plots in section 3.6.2 show that all fusion networks were able to achieve near
perfect precision and recall on the fusion test datasets, notably those based on the
state-of-the-art fusion system.
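A minimal sketch of these seven metrics for one bootstrapping trial using scikit-learn, where the mapping of the F1 weighted macro score to average="weighted" and the two-column encoding for micro- and macro-averaged ROC AUC are assumptions, is:

    import numpy as np
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score)

    def fusion_validation(y_true, y_pred, y_prob):
        # y_pred holds hard fused/non-fused decisions, y_prob the predicted
        # fusion probabilities; two-column indicator/probability matrices allow
        # both micro- and macro-averaged ROC AUC.
        acc = accuracy_score(y_true, y_pred)
        true_2c = np.column_stack([1 - np.asarray(y_true), y_true])
        prob_2c = np.column_stack([1 - np.asarray(y_prob), y_prob])
        return {
            "ACC": acc,
            "1-ACC": 1.0 - acc,
            "PPV": precision_score(y_true, y_pred),
            "TPR": recall_score(y_true, y_pred),
            "F1": f1_score(y_true, y_pred, average="weighted"),
            "ROC micro": roc_auc_score(true_2c, prob_2c, average="micro"),
            "ROC macro": roc_auc_score(true_2c, prob_2c, average="macro"),
        }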
The agreement and confusion of the deep learning sensor fusion system as compared
to the state-of-the-art sensor fusion system were evaluated in section 3.7.1. Examples
of the four possible types of agreement and disagreement were shown. The confusion
matrices for the fusion results between both systems were computed for the two most
populated radar tracks over the course of 1.25 hours of test data. Four variants of the
deep learning fusion system were evaluated against the state-of-the-art fusion system
for these scenarios with the consistency LSTM held constant and each of the four
fusion network versions varied in turn. The fusion agreement percentages between the con-
structed and preexisting fusion systems for all four variants of the deep learning fusion
system along with the corresponding average agreement percentages are summarized
in table 4.3.
Table 4.3: Agreement Percentages Between Deep Learning Fusion System and State-of-the-
art Fusion System for Two Most Populated Radar Tracks (RT0) and (RT1)
Fusion Network RT0 Agreement RT1 Agreement Avg. Agreement
MLP 95.08% 93.88% 94.48%
RNN 95.22% 94.1% 94.66%
GRU 95.6% 94.39% 95%
LSTM 95.2% 94.13% 94.67%
According to the table, the GRU-based fusion neural network performed the best
out of all versions applied in the deep learning fusion system. The MLP-based fusion
neural network performed the worst of all versions applied in the fusion system.
Surprisingly, the LSTM and the RNN average agreement percentages differed by a
mere 0.01%, with the LSTM performing slightly better than the RNN. The results
reveal how closely the deep learning fusion system agreed with the state-of-the-art
sensor fusion system no matter the fusion neural network version applied. It was
shown that overall, all tested variants of the deep learning fusion system agreed with
the state-of-the-art fusion system at least 94.4% of the time.
Based on the agreement percentages, there was a disagreement percentage of ap-
proximately 5% between the deep learning and state-of-the-art sensor fusion systems.
Examples of false positive and false negative fusion disagreements between the sys-
tems are visualized in figures 3.22 and 3.24, respectively. The visualized examples
were representative of the primary cases in which the systems disagreed. Cases like those shown were hard to
distinguish and could be clarified using video data captured by the camera in order to
truly determine if the two sensors were fused during those disagreement occurrences.
The deep learning sensor fusion system runtimes were benchmarked for multiple con-
figurations of the system with serial, batch and parallel processing across multiple
hardware architectures including Intel CPU, NVIDIA GPU and ARMv8 CPU. The
results were shown in section 3.8.
Of all deep learning fusion system configurations with serial, batch and parallel
processing, only one notable combination of parallelization techniques enabled the system
to meet the real-time update rate of 40 ms on the deep learning PC. The winning
runtime system configuration involved parallelizing the consistency neural networks
over 10 processes, per the 10 radar tracks, and evaluating the fusion neural network
in a single process with a batch size of 100 samples on the NVIDIA GPU, per the
100 possible radar and camera track combinations. This successful result was shown
in the last row of table 3.15. This result reveals that the deep learning fusion system
can respond to real-time sensor updates in the proper parallelization configuration on
a computer with capable hardware.
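A minimal sketch of this winning configuration, with placeholder prediction functions standing in for the trained consistency and fusion networks (the real system's interfaces are not shown here), is:

    import numpy as np
    from multiprocessing import Pool

    N_RADAR_TRACKS = 10   # one consistency evaluation per radar track
    N_TRACK_PAIRS = 100   # 10 radar x 10 camera track combinations per update

    def predict_consistency(window):
        # Placeholder for one consistency-network forward pass on a radar track.
        return np.zeros(len(window))

    def predict_fusion_batch(pair_windows):
        # Placeholder for a single batched fusion-network forward pass (on GPU).
        return np.zeros(len(pair_windows))

    def update_cycle(radar_windows, pair_windows):
        # Consistency networks run in parallel, one process per radar track,
        # while the fusion network evaluates all track pairs in one batch.
        with Pool(processes=N_RADAR_TRACKS) as pool:
            consistency = pool.map(predict_consistency, radar_windows)
        fusion_scores = predict_fusion_batch(pair_windows)
        return consistency, fusion_scores

On platforms that spawn rather than fork worker processes, the call to update_cycle would need to be guarded by an if __name__ == "__main__" block.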
The slowest runtime result benchmarked on the deep learning PC involved the
configuration with multi-threaded GPU evaluation of the deep neural networks. This
result was shown in the second-to-last row of table 3.15. This result reveals the
slowdown involved in using the GPU for small tasks such as consistency neural net-
work evaluation as well as the overhead runtime entailed by the GIL in the Python
threading interface. Table 3.14 also supports this claim because the multi-threaded
fusion system on the CPU performed the worst of all CPU configurations on the deep
learning PC as well.
The slowest runtime results were benchmarked on the Raspberry Pi 3 with its
ARMv8 CPU. For this embedded system architecture, the fastest runtime of 0.68
seconds per update was found using batch processing for both neural networks. On
this system, there were many factors that may have contributed to the slow per-
formance observed including processor speed, RAM speed and amount, and lack of
optimized BLAS libraries to enhance linear algebra performance. An additional fac-
tor in this slow performance was the lack of a GPU to parallelize multiple neural
network computations.
These results support the idea that parallel computing with processes in Python,
coupled with the use of CUDA and cuBLAS for batch GPU neural network evalua-
tion over many samples, is necessary in order to achieve high-speed, real-time-capable
deep neural network computations. It was demonstrated that the GPU is not always
useful for small tasks, but it is very useful for large, equally-distributed tasks. The
runtime results acquired with the GPU revealed that evaluating the consistency neu-
ral network on the GPU was detrimental to the runtime of the fusion system because
the number of samples was small and the overhead of transferring data back and
forth from GPU took more time than necessary. However, evaluating the fusion neu-
ral network in batch mode on the GPU was beneficial for the runtime of the fusion
system because the number of samples was large and the neural network evaluation
tasks were equally-distributed.
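A minimal sketch of how the serial and batched evaluation modes can be compared for any prediction function (illustrative only) is:

    import time
    import numpy as np

    def time_serial_vs_batched(predict_fn, samples):
        # Compare wall-clock time of per-sample calls against one batched call;
        # batching amortizes per-call overhead such as host-to-device transfers.
        start = time.perf_counter()
        for sample in samples:
            predict_fn(sample[np.newaxis, ...])
        serial = time.perf_counter() - start

        start = time.perf_counter()
        predict_fn(np.stack(samples))
        batched = time.perf_counter() - start
        return serial, batched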
Chapter 5
Conclusion
Based on the consistency neural network training and validation results, it was ob-
served that all four versions of the consistency neural network performed well on the
dataset generated by the consistency labeling algorithm that was developed. These
networks were each able to learn to predict the consistency of any 1-second sample of
4 synchronized radar track streams with minimal error. Based on the fusion neural
network training and validation results, it was observed that all four versions of the
fusion neural network performed well across five different sensor fusion datasets, in-
cluding a dataset generated based on fusion labels from a preexisting state-of-the-art
sensor fusion system. These networks were each able to classify 1-second samples
of camera and radar track pairs, represented as the absolute differences of the 4 synchronized
streams between the two sensors, with high accuracy, precision, and recall. The
greater the complexity of either deep neural network, the longer it took for forward
and backward propagation of any input samples.
According to the results, the consistency neural network structure was able to
detect radar track reassignments across synchronized multi-modal track data input.
This neural network proved highly accurate at recognizing said reassignments in radar
track streams when the training dataset was appropriately designed to map the input
streams to consistency numerical sequences of integers in the range [0, 2]. Using this
track reassignment detection and the camera video IDs for a selected camera object
track, a masking of absolute difference data between any two paired sensor tracks
along with backward interpolation of the most recent consistent difference data was
able to provide the fusion neural network with the ability to classify fusion between
1-second camera and radar track samples in the constructed deep learning sensor
fusion system.
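A minimal sketch of one plausible reading of this masking step, in which samples flagged as inconsistent are overwritten with the most recent consistent absolute-difference values, is:

    import numpy as np

    def mask_and_fill(abs_diffs, consistent):
        # abs_diffs: (time, 4) window of absolute differences between a paired
        # radar and camera track; consistent: boolean mask marking samples with
        # no track reassignment. Inconsistent samples are overwritten with the
        # last consistent values seen in the window.
        out = np.array(abs_diffs, dtype=float)
        last_good = None
        for t in range(out.shape[0]):
            if consistent[t]:
                last_good = out[t].copy()
            elif last_good is not None:
                out[t] = last_good
        return out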
When combined in the constructed deep learning fusion system, it was shown that
the consistency and fusion deep neural networks could be utilized to fuse multi-modal
sensor data from two independent sensor sources of camera and radar to within a 95%
agreement of a state-of-the-art sensor fusion system. The constructed deep learning
sensor fusion system was thus able to mimic the complex fusion capabilities of a
preexisting state-of-the-art fusion system to within a 5% error bound.
The constructed deep learning sensor fusion system was shown to react to simulated
incoming sensor updates at the real-time rate of 40 ms on a
deep learning PC with a single NVIDIA GeForce GTX 980Ti GPU using combined
parallelization and batch mode deep neural network evaluation. The runtimes were
benchmarked using the slowest versions of the deep neural networks constructed in
order to acquire an upper-bound runtime, which still demonstrated a real-time re-
sponse rate of the constructed deep learning sensor fusion system with the proper
parallelization and hardware specifications.
The work described herein could be extended in a variety of ways. A physics
model could be applied to the sensor data or neural network structures in order to
ensure that fusion only took place in truly realistic situations where both sensors
tracked the same vehicles over time. The physics model would likely help eliminate
disagreement between the deep learning fusion system and the state-of-the-art fusion
system as shown in section 3.7.1. The additional analysis of forward-facing video
acquired from vehicles equipped with paired radar and camera sensors could add in-
creased reliability and certainty to the deep learning fusion results. A convolutional
neural network, provided with the video as input, could merge with the developed
multi-stream neural network structures in order to learn a more accurate representa-
tion of sensor fusion between the employed sensors via additive spatial video context.
Additionally, the multi-stream deep neural networks could be implemented in an em-
beddable, high-speed deep learning framework such as Caffe2 for increased efficiency
and performance. Lastly, the deep learning fusion system could be tested on an em-
bedded system with an NVIDIA GPU, such as the Jetson TX2, to determine if the
system was truly capable of an embedded real-time response rate with GPU hardware.
In conclusion, we found that the constructed deep learning sensor fusion system
was able to reproduce the complex, dynamic behavior of a state-of-the-art sensor fu-
sion system by means of multiple deep neural networks combined in the proposed novel
way. These deep neural networks required training datasets with enough breadth and
quality to sufficiently capture a wide spectrum of synchronized sensor track reassign-
ment and fusion scenarios between both sensors. The constructed deep learning fu-
sion system was proven versatile and highly generalizable given its ability to learn and
mimic both an existing state-of-the-art fusion algorithm and a novel hand-engineered
fusion algorithm with minimal confusion. It also demonstrated capabilities of real-
time inference with proper parallelization and hardware specifications. Thus, it was
shown that deep learning techniques are versatile enough to provide both data pre-
processing and data fusion capabilities when trained with enough consistent, high
quality examples, and such deep learning techniques may be accelerated for high per-
formance use in a variety of ways. Future work was proposed that could enhance the
realism and performance of the constructed deep learning fusion system to further
guide it toward embeddable deployment.
Appendices
.1 Most Restrictive Fusion Dataset Validation Results
Figure 1: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC micro
and ROC AUC macro scores for the MLP fusion neural network across 9 bootstrapping trials
of random train/test data selection from over 1 million randomly-selected samples via the
most restrictive fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 2: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC micro
and ROC AUC macro scores for the RNN fusion neural network across 9 bootstrapping trials
of random train/test data selection from over 1 million randomly-selected samples via the
most restrictive fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 3: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC micro
and ROC AUC macro scores for the GRU fusion neural network across 9 bootstrapping trials
of random train/test data selection from over 1 million randomly-selected samples via the
most restrictive fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 4: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC micro
and ROC AUC macro scores for the LSTM fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the most restrictive fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
.2 Mid-Restrictive Fusion Dataset Validation Results
Figure 5: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC micro
and ROC AUC macro scores for the MLP fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the mid-restrictive fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 6: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC micro
and ROC AUC macro scores for the RNN fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the mid-restrictive fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 7: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC micro
and ROC AUC macro scores for the GRU fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the mid-restrictive fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 8: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC micro
and ROC AUC macro scores for the LSTM fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the mid-restrictive fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
.3 Least Restrictive Fusion Dataset Validation Results
Figure 9: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC micro
and ROC AUC macro scores for the MLP fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the least restrictive fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 10: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC mi-
cro and ROC AUC macro scores for the RNN fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the least restrictive fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 11: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC mi-
cro and ROC AUC macro scores for the GRU fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the least restrictive fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 12: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC mi-
cro and ROC AUC macro scores for the LSTM fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the least restrictive fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
.4 Maximum Threshold Fusion Dataset Validation Results
Figure 13: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC mi-
cro and ROC AUC macro scores for the MLP fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the max threshold fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 14: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC mi-
cro and ROC AUC macro scores for the RNN fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the max threshold fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 15: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC mi-
cro and ROC AUC macro scores for the GRU fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the max threshold fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.
Figure 16: Accuracy, misclassification, precision, recall, F1 weighted macro, ROC AUC mi-
cro and ROC AUC macro scores for the LSTM fusion neural network across 9 bootstrapping
trials of random train/test data selection from over 1 million randomly-selected samples via
the max threshold fusion error dataset. The average score and its standard deviation are
included in the title of each subplot.