
2019 International Conference on Robotics and Automation (ICRA)

Palais des Congrès de Montréal, Montreal, Canada, May 20-24, 2019

Reconfigurable Network for Efficient Inferencing in Autonomous Vehicles

Shihong Fang∗ and Anna Choromanska∗

∗ The authors are with the Machine Learning Lab, Department of Electrical and Computer Engineering, New York University, 5 MetroTech Center, USA. {sf2584, ac5455}@nyu.edu

Abstract— We propose a reconfigurable network for efficient inference dedicated to autonomous platforms equipped with multiple perception sensors. The size of the network for steering autonomous platforms grows proportionally to the number of installed sensors, eventually preventing the usage of multiple sensors in real-time applications due to inefficient inference. Our approach hinges on the observation that multiple sensors provide a large stream of data, of which only a fraction is relevant for the performed task at any given moment in time. The architecture of the reconfigurable network that we propose contains separate feature extractors, called experts, for each sensor. The decisive block of our model is the gating network, which decides online which sensor provides the data that is most relevant for driving. It then reconfigures the network by activating only the relevant expert corresponding to that sensor and deactivating the remaining ones. As a consequence, the model never extracts features from data that are irrelevant for driving. The gating network takes the data from all inputs, and thus, to avoid an explosion of computation time and memory space, it has to be realized as a small and shallow network. We verify our model on an unmanned ground vehicle (UGV) comprising a 1/6-scale remote control truck equipped with three cameras. We demonstrate that the reconfigurable network correctly chooses experts in real time, reducing the computational cost of the whole model without deteriorating its performance.

I. INTRODUCTION

A plethora of real-life problems are currently addressed with convolutional neural networks (CNNs), including image recognition [1], [2] and segmentation [3], speech recognition [4], and natural language processing [5]. In some learning tasks the performance of deep networks exceeds that of a human [2]. Due to rapid advances in hardware technologies leading to powerful and efficient GPUs and mobile supercomputers, deep learning techniques have more recently been used successfully in complex intelligent autonomous systems such as self-driving cars [6], [7], [8]. Deep learning techniques enable automatic extraction of data features and consequently allow scaling up learning systems to large-data settings. Due to automatic feature extraction, employing multiple sensors in platforms based on deep learning became much easier than in the case of rule-based models. Using multiple sensors improves the safety of autonomous vehicles by increasing their perception abilities, and has become an industrial standard used in Advanced Driver Assistance Systems (ADAS) and the autonomous driving platforms of Google, NVIDIA, UBER, or Intel.

The bottleneck for using multiple sensors on an autonomous platform is the size of the network that needs to process the data registered by all the sensors in order to output the steering command. The size of the network needs to expand proportionally to accommodate the incoming data, which hurts the inference time and affects the real-time operation of the resulting model. Our work addresses this problem. We design a reconfigurable network, which contains feature extractors that we call experts, each processing the data coming from a different sensor, where at any given point in time only one expert is active and the remaining ones are deactivated. Thus the network reconfigures itself online, guided by the gating network, which each time chooses the most relevant sensor. To empirically verify our approach we built a UGV equipped with three front-facing cameras covering the left, center, and right fields of view. We demonstrate that our model can steer the UGV in real time and choose the correct sensors for navigation. Furthermore, we demonstrate that the reconfigurable network exhibits performance similar to the standard network without the gating mechanism while requiring significantly fewer computations.

This paper is organized as follows: Section II reviews relevant literature, Section III discusses the reconfigurable network (architecture and training), Section IV shows the empirical evaluation, and Section V concludes the paper with a brief summary of findings.

II. RELATED WORK

A. End-to-end learning for autonomous vehicles

An end-to-end learning system for steering autonomous vehicles typically learns the mapping from the raw sensor readings to the corresponding steering commands via supervised learning. The training data are acquired from human drivers, where the inputs to the learning system obtained from the sensors are captured together with the driver actions. A system known as the Autonomous Land Vehicle in a Neural Network (ALVINN) [9] was the first end-to-end learning system for autonomous driving and was based on a fully-connected network. Convolutional neural networks (CNNs) for data feature extraction, introduced later [10], were applied in the DARPA Autonomous Vehicle (DAVE) project [11]. The DAVE robot was able to drive autonomously using left and right cameras. Recently, CNNs were used to train a real car to drive autonomously with good performance [7]. The authors used three cameras in the data collection phase and a single one for actual driving. Their later work [12] explains why CNNs are very effective in self-driving tasks. Besides cameras, other sensors like LiDARs were used to provide

multi-modal input to the networks [13], [14] and make the system more robust to the variability of the real world.

B. Gating mechanism

The first gating network proposed in the literature [15] divides the learning task into subtasks and uses a separate expert for each subtask. Each expert takes the same input, and the gating network decides which experts to use at any given moment. It can be viewed as an ensemble method that uses a combination of selected experts to improve prediction. This idea was extended [16] to a tree-structured architecture called the Hierarchical Mixture of Experts, which solves nonlinear supervised learning problems by dividing the input space into a nested set of regions and fitting simple surfaces to the data that fall in these regions. The gating mechanism was also applied in a mixture of SVMs [17], where the authors demonstrated the feasibility of training SVMs on large datasets. A stacked Deep Mixture of Experts model [18] with multiple sets of gating and experts was later proposed, utilizing the concept of conditional computation. Conditional computation policies were also explored in the context of reinforcement learning [19]. Another notable work [20] presents an approach that can adaptively choose parts of the whole neural network to be evaluated for efficient inference. Similarly, this can be achieved by introducing the sparsely-gated mixture-of-experts layer [21], which allows picking the best-performing subset of k experts, which constitute the feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. The authors demonstrate the plausibility of their approach in language modeling and machine translation, where they show improvements in computational cost compared to standard approaches.

The gating mechanism was also applied to sensor fusion in order to boost the model's performance. Multi-sensor fusion allows capturing a more diverse data representation (i.e. multiple sensors can capture various modalities) and benefits from wider space coverage (i.e. multiple sensors can have a large combined field of view). These factors are crucial to increase the reliability and accuracy of the system. For example, in object detection problems, a network performing fusion of LiDAR and camera data has been shown to be successful in practical settings [22]. Combining such an approach with a gated multi-modal method was found to outperform other conventional fusion techniques [23].

Our work builds on the intuition that in autonomous driving applications, due to heavy redundancies in the data, one can use a subset of the inputs at any given moment to steer the car, i.e. when driving forward, the sensors situated on the vehicle's sides and back most often do not provide relevant information. Thus, we propose to utilize the idea of a gating network and sparse gating as a practical way to handle multi-sensor input and at the same time avoid an explosion of the computational cost. Our approach differs from other works existing in the literature in that it allows switching the experts on/off (in contrast to [13]) and it allows the experts to use different input data (in contrast to [21]).

III. RECONFIGURABLE NETWORK

The reconfigurable network proposed in this paper is captured in Fig. 1. The network receives n input signals from various sensors, i.e. the input to the network can be written as x = {x_1, x_2, ..., x_n}, where x_i is the signal from the i-th sensor. The reconfigurable network consists of n expert networks E_1, E_2, ..., E_n, the gating network G, and the fully connected layers. The expert networks extract features from the inputs and the gating network decides which input signals x_i should be processed for a given input x. Thus, only the experts corresponding to the selected input signals need to perform calculations. Finally, the fully connected layers form the final prediction based on the combined feature vector from all experts that processed the data. Thus, the final prediction ŷ can be written as

\hat{y} = F\left(\sum_{i=1}^{n} G_i(x)\, E_i(x_i)\right), \qquad (1)

where G_i(x) ∈ [0, 1] is the weight of expert i for input x assigned by the gating mechanism (0 denotes the situation when the expert is not used), E_i(x_i) is the feature vector outputted by expert i for input x_i, and F denotes the action of the fully connected layers.
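To make Eq. (1) concrete, a minimal PyTorch-style sketch of the forward pass is given below; the module and variable names are ours, and `gating` and `head` stand for the gating network G and the fully connected layers F, so this is an illustration of the mechanism rather than the exact implementation used in the paper. Experts whose gating weight is zero are skipped entirely, which is the source of the computational savings at inference.

    import torch
    import torch.nn as nn

    class ReconfigurableNet(nn.Module):
        """Sketch of Eq. (1): y_hat = F( sum_i G_i(x) * E_i(x_i) )."""

        def __init__(self, experts, gating, head):
            super().__init__()
            self.experts = nn.ModuleList(experts)  # per-sensor feature extractors E_i
            self.gating = gating                   # maps the full input x to n weights in [0, 1]
            self.head = head                       # fully connected layers F

        def forward(self, xs):
            # xs: list of per-sensor inputs [x_1, ..., x_n] (here: same-sized camera images)
            g = self.gating(torch.cat(xs, dim=1))  # (batch, n) gating weights
            combined = None
            for i, (expert, x_i) in enumerate(zip(self.experts, xs)):
                w = g[:, i:i + 1]
                if torch.all(w == 0):              # deactivated expert: never evaluated
                    continue                       # (at inference, batch size 1, this is the common case)
                f_i = w * expert(x_i)              # scale the expert's feature vector by its weight
                combined = f_i if combined is None else combined + f_i
            return self.head(combined)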
Fig. 1. Reconfigurable network with the gating mechanism. In this example, the second expert is activated based on the output of the gating network and the other experts are disabled (colored in gray).

The gating network in the proposed architecture selects the subset of experts which will process the data. Thus, the gating network needs to properly estimate the relevance of each expert for the given learning task, not just of the input x_i itself, i.e. even if x_i contains relevant information for driving the autonomous vehicle, expert E_i should not be activated if it is unable to extract high-quality features from x_i. This makes the reconfigurable network difficult to train in an end-to-end fashion, as there is no explicit correlation between the experts and the feature extractors used in the gating network. In order to overcome this problem we propose a dedicated training procedure.

In the first step of the proposed training procedure, we train the sensor fusion network proposed before in the literature [13] and illustrated in Fig. 2. In this step, the gating network uses the experts as feature extractors, which solves the problem mentioned before. The outputs of the experts are scaled by the outputs of the gating network and then concatenated to form the combined feature vector. The obtained feature vector is then passed to the fully-connected

layers (FC), which form the final prediction. To ensure that the feature extractors capture the necessary representation, we first pre-train each expert individually to make the final prediction. We then borrow weights from our pre-trained models for initialization. Finally, we train this network end-to-end to obtain properly trained experts and a functional gating network.

Fig. 2. The architecture of the network used in the first step of the proposed training procedure. The part of the network inside the green shaded box performs the task of the gating network. Thus, the gating network uses the experts as feature extractors.

In the second step of the proposed training procedure, we construct the reconfigurable network that contains a very specific gating component, as shown in Fig. 3 (we ask the reader to look either at the top or the bottom figure, as both of them contain the same gating network shown in the shaded green box). In this step we only train the gating network. The newly constructed gating network has its own feature extractors, which are significantly smaller than the experts in the main part of the network. The new gating network is trained in a supervised setting to mimic the behavior of the reference gating network obtained from the first step. At this stage we enforce sparsity by modifying the training labels for the new gating network. In particular, we take the label vector v = [v_1, ..., v_n] obtained from the reference gating network and convert it into a one-hot vector, where the max value in v is converted to 1 and the rest to 0. Additionally, after training the new gating network, we add hard thresholding on its output, so the output is always a one-hot vector. This forces the reconfigurable network to use only one input, or equivalently one expert, at a time. The extension to a constrained number of inputs is explained later in this section. At the end of this stage we obtain a compact version of the gating network with the desired behavior.
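The label conversion described above can be summarized by the following sketch (the helper name and the use of PyTorch tensors are our assumptions); setting k = 1 reproduces the one-hot labels used here, while k > 1 corresponds to the extension discussed at the end of this section, where only the smallest weights are zeroed out.

    import torch

    def make_gating_labels(v, k=1):
        """Convert the reference gating output v = [v_1, ..., v_n] into the
        sparse training label for the new gating network.

        k = 1: one-hot label (max entry -> 1, rest -> 0), as used in the paper.
        k > 1: keep the k largest weights and zero out the remaining ones.
        """
        if k == 1:
            label = torch.zeros_like(v)
            label[v.argmax()] = 1.0
            return label
        label = v.clone()
        _, drop = torch.topk(v, v.numel() - k, largest=False)  # indices of the smallest weights
        label[drop] = 0.0
        return label

    # After training, the same rule is applied as hard thresholding at inference,
    # so the gating output reaching the experts is always one-hot (or top-k).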
Fig. 3. The architecture of “Reconf Concat” (top) and “Reconf Select” (bottom). The part of the network inside the green shaded box performs the task of the gating network and is trained in the second step of the training process. In this case the gating network has its own feature extractors, which are significantly smaller than the experts in the main part of the network. For each data example, only the experts selected by the gating network are fine-tuned (in the picture only the first expert is activated, since the outputs of the gating network for the other two experts are zero for the given input image, and thus they are colored in gray). Top: The concatenated feature vectors are fed to the fully connected layers of the main network. Bottom: The selected feature vector is fed to the fully connected layers of the main network together with the output of the gating network.
In the third step, we fine-tune the experts and the fully-connected layers (keeping the gating network fixed) by training them all together on the same data as in the first step. This step allows the reconfigurable network to adjust to the modified behavior of the gating network. In the experimental section the resulting network is called “Reconf Concat” (see the top of Fig. 3).

So far, we have used concatenation for combining the feature vectors obtained from the experts. This results in a large combined feature vector, which leads to a significant amount of computations. Instead, we propose to use point-wise summation in place of concatenation, since we want to use one selected input at a time. In this case, the combined feature vector is equal to the feature vector corresponding to the selected input. As the encoding of the input by the expert may be different for each input and expert, we pass the gating network output together with the combined feature vector to the fully connected layers. This way, the fully connected layers have the information about which input is used at the moment and can process it properly. Finally, the expert networks are fine-tuned and the fully-connected layers are trained from scratch, while keeping the gating network fixed. In the experimental section the resulting network is called “Reconf Select” (see the bottom of Fig. 3).
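Assuming a one-hot gating output, the “Reconf Select” combination described above can be sketched as follows; the variable names are ours, and the sketch takes all feature vectors merely for clarity, whereas at inference only the selected expert actually needs to be evaluated.

    import torch

    def reconf_select_features(gate_onehot, expert_features):
        """Combine features by point-wise summation under a one-hot gate.

        With a one-hot gate the sum equals the feature vector of the selected
        expert; the gate itself is appended so that the fully connected layers
        know which sensor produced the features.
        """
        # expert_features: list of (batch, d) tensors, one per expert
        stacked = torch.stack(expert_features, dim=1)            # (batch, n, d)
        selected = (gate_onehot.unsqueeze(-1) * stacked).sum(1)  # (batch, d)
        return torch.cat([selected, gate_onehot], dim=1)         # (batch, d + n)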
In each step, we train the network until convergence. The first step of training is critical: as the gating network is trained in the first step and used as a reference in the following steps, shortening the first step too much may significantly degrade the performance.

At the end of the training process we obtain a network whose computational cost at inference is similar to that of a network processing only a single input. The proposed network selects the single most relevant sensor at any given moment. Extending to the case where a subset of sensors is used can be done in the following way. After the first step of training, the gating network outputs a vector of continuous weights corresponding to the experts. In order to keep k experts, in the second step of training the labels used for training the gating network should be modified appropriately, e.g. if one wants to use 2 out of 3 sensors and keep the weights for each input, one should only zero out the smallest weight.

IV. EXPERIMENTS

To validate our approach, we implemented the proposed reconfigurable network to steer an autonomous platform equipped with three cameras.

A. Hardware Overview

The block diagram of our autonomous platform used for the experiments is shown in Fig. 4. We used a Traxxas X-Maxx remote control truck as the base for the autonomous platform and an NVIDIA Jetson TX1 for computations. We installed three Logitech HD Pro C920 cameras on the platform. The center camera is oriented straight ahead. The side cameras are mounted at angles that allow capturing the front side views. The cameras capture non-overlapping views. A PCI Express (PCIe) USB 3.0 card was connected to the NVIDIA Jetson TX1 to provide the bandwidth for three cameras, enabling them to operate at the same time. For controlling the actuators of the autonomous platform we used a Micro Maestro 6-Channel USB Servo Controller. We use the same platform for autonomous driving and for training data collection.

In the training data collection mode we use a wireless gamepad to control the car. The corresponding steering commands and the images captured by the cameras are saved on the local storage of the NVIDIA Jetson TX1.

In the autonomous testing mode, we run our network on the NVIDIA Jetson TX1 in real time. The predicted commands are used to steer the platform while the speed is controlled by the operator.

Fig. 4. Block diagram of the autonomous platform used for experiments.
B. Data Preprocessing

Before training the network on the collected data, we pre-process the data in order to improve training efficiency. We perform two preprocessing steps, data balancing and data augmentation, described below. Data balancing and augmentation are used to achieve good driving performance, and this step is the same for the reconfigurable network and all other methods we experimented with. Data balancing is crucial to prevent the car from overfitting to driving straight, and augmentation ensures the same number of turns to the left and to the right.
1) Data Balancing: First, we normalize the values of the captured steering commands to the range −1 to 1, where 0 means driving straight. Next, we organize all the training examples into 7 categories based on their steering command, as shown in Table I. We collected 68245 scenes for training: 9896 scenes in category 1, 1671 in category 2, 2505 in category 3, 39926 in category 4, 4882 in category 5, 2383 in category 6, and 6982 in category 7. The collected data are therefore heavily imbalanced in terms of the captured steering commands. We address this problem with a data balancing procedure: for every training epoch, we sample uniformly at random from each of these seven categories (70000 samples in total) in order to create the actual training data set.

TABLE I
SEVEN CATEGORIES FOR DATA BALANCING

Category number    Steering Command
1                  [-1, -0.67)
2                  [-0.67, -0.33)
3                  [-0.33, 0)
4                  0
5                  (0, 0.33]
6                  (0.33, 0.67]
7                  (0.67, 1]
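A possible implementation of this balancing procedure is sketched below; the bin edges follow Table I, while the helper names, the use of NumPy, and sampling with replacement for the under-represented categories are our assumptions.

    import numpy as np

    # Steering commands are normalized to [-1, 1]; the bins reproduce the
    # seven categories of Table I (category 4 is exactly 0, i.e. straight).
    def steering_category(s):
        if s < -0.67: return 1
        if s < -0.33: return 2
        if s < 0:     return 3
        if s == 0:    return 4
        if s <= 0.33: return 5
        if s <= 0.67: return 6
        return 7

    def balanced_epoch_indices(steering, samples_per_category=10000, rng=np.random):
        """Sample uniformly at random from each category to build one epoch.

        With 7 categories and 10000 samples each this yields the 70000-scene
        training set described above.
        """
        categories = np.array([steering_category(s) for s in steering])
        chosen = []
        for c in range(1, 8):
            idx = np.where(categories == c)[0]
            chosen.append(rng.choice(idx, size=samples_per_category, replace=True))
        return np.concatenate(chosen)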

2) Data Augmentation: In order to increase the size of the training data set, we perform data augmentation. We use horizontal flipping of the images. As we use three cameras mounted symmetrically on the autonomous platform, we also have to swap the left and right camera images. In order to adjust the steering command accordingly, we multiply it by −1 when we perform the flipping. We apply the augmentation with probability 0.5.
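The augmentation can be sketched as follows (a hypothetical helper, assuming images stored as H x W x C NumPy arrays):

    import numpy as np

    def augment(left, center, right, steering, p=0.5, rng=np.random):
        """Horizontal-flip augmentation as described above (a sketch).

        All three images are mirrored, the left and right camera images are
        swapped (the cameras are mounted symmetrically), and the steering
        command is negated. Applied with probability p = 0.5.
        """
        if rng.rand() < p:
            left, right = np.fliplr(right), np.fliplr(left)
            center = np.fliplr(center)
            steering = -steering
        return left, center, right, steering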
Finally, note that we use the same amount of data for our network as well as for the other methods, including the ones without the gating mechanism; thus introducing the gating mechanism does not require an increased amount of training data.

C. Experimental Results

In the experiments we use a network which takes three camera images as its input and outputs a steering command for the autonomous platform. We construct and train the network as described in Section III. Each expert has the architecture provided in Table II. The fully connected layers block consists of two fully connected layers, for both the reconfigurable network and its component, the gating network.

TABLE II
EXPERT LAYER ARCHITECTURE

layer name    output size     parameters
conv1         16 × 58 × 78    5 × 5, stride=2, BN, ReLU
conv2         32 × 27 × 37    5 × 5, stride=2, BN, ReLU
conv3         64 × 12 × 17    5 × 5, stride=2, BN, ReLU
conv4         96 × 4 × 7      5 × 5, stride=2, BN, ReLU
conv5         128 × 2 × 5     3 × 3, stride=1, BN, ReLU
conv6         128 × 1 × 4     2 × 2, stride=1, BN, ReLU
vectorize     512
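For reference, the expert of Table II can be written as a PyTorch module as sketched below; the 3 x 120 x 160 input resolution is our assumption, inferred from the listed output sizes (with no padding it reproduces them exactly and yields the 512-dimensional feature vector).

    import torch.nn as nn

    def conv_block(c_in, c_out, k, stride):
        # conv -> batch norm -> ReLU, as listed in Table II
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=k, stride=stride),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    class Expert(nn.Module):
        """Expert feature extractor of Table II (sizes assume 3 x 120 x 160 input)."""

        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                conv_block(3, 16, 5, 2),     # -> 16 x 58 x 78
                conv_block(16, 32, 5, 2),    # -> 32 x 27 x 37
                conv_block(32, 64, 5, 2),    # -> 64 x 12 x 17
                conv_block(64, 96, 5, 2),    # -> 96 x 4 x 7
                conv_block(96, 128, 3, 1),   # -> 128 x 2 x 5
                conv_block(128, 128, 2, 1),  # -> 128 x 1 x 4
            )

        def forward(self, x):
            return self.features(x).flatten(1)  # vectorize -> (batch, 512)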
We use separate recordings for training and testing. The training and test data were recorded in different driving environments, i.e. different parts of the building, thus the training data are not replicated in the test set. The training data consisted of 70000 scenes, as explained before, and the test data had 5738 scenes.

The results obtained for the reconfigurable network of Fig. 2 after the first step of the training process are shown in Fig. 5. The results show that the network learned to correctly predict the steering command and also to properly choose the most relevant input, i.e. the center camera when driving straight and the side cameras in tight turns.

In the second step of the training procedure we trained multiple gating networks with the various architectures given in Table III. We modify the training labels as described before, to predict the single most relevant input. After the training we add hard thresholding on the gating network output to form a one-hot output vector. The comparison of the performance of the trained gating networks is shown in Fig. 6. The performance is measured with the classification accuracy. Based on our experiments we found that a gating architecture that performs well can be obtained by significantly scaling down the input image, which is done by introducing a large stride in the first layer of the gating network, and by using a shallow CNN architecture, which helps to minimize the computations required by the gating network. For further experiments we use the best performing gating network, which uses a feature extractor with the architecture given as case 6 in Table III.

TABLE III
DIFFERENT ARCHITECTURES OF THE FEATURE EXTRACTOR USED IN THE GATING NETWORK. FROM TOP TO BOTTOM: CASES 1-9.

Case 1:
layer name    output size     parameters
conv1         3 × 60 × 80     1 × 1, stride=2, BN, ReLU
conv2         16 × 28 × 38    5 × 5, stride=2, BN, ReLU
conv3         32 × 12 × 17    5 × 5, stride=2, BN, ReLU
conv4         48 × 4 × 7      5 × 5, stride=2, BN, ReLU
conv5         64 × 2 × 5      3 × 3, stride=1, BN, ReLU
conv6         96 × 1 × 4      2 × 2, stride=1, BN, ReLU
vectorize     384

Case 2:
layer name    output size     parameters
conv1         3 × 30 × 40     1 × 1, stride=4, BN, ReLU
conv2         16 × 13 × 18    5 × 5, stride=2, BN, ReLU
conv3         32 × 5 × 7      5 × 5, stride=2, BN, ReLU
conv4         48 × 3 × 5      3 × 3, stride=1, BN, ReLU
conv5         64 × 1 × 3      3 × 3, stride=1, BN, ReLU
vectorize     192

Case 3:
layer name    output size     parameters
conv1         3 × 24 × 32     1 × 1, stride=5, BN, ReLU
conv2         16 × 10 × 14    5 × 5, stride=2, BN, ReLU
conv3         32 × 3 × 5      5 × 5, stride=2, BN, ReLU
conv4         48 × 1 × 3      3 × 3, stride=1, BN, ReLU
vectorize     144

Case 4:
layer name    output size     parameters
conv1         3 × 15 × 20     1 × 1, stride=8, BN, ReLU, Maxpool
conv2         16 × 6 × 9      3 × 3, stride=1, BN, ReLU, Maxpool
conv3         32 × 2 × 3      3 × 3, stride=1, BN, ReLU, Maxpool
conv4         48 × 1 × 2      2 × 2, stride=1, BN, ReLU, Maxpool
vectorize     96

Case 5:
layer name    output size     parameters
conv1         3 × 12 × 16     1 × 1, stride=10, BN, ReLU, Maxpool
conv2         16 × 5 × 7      3 × 3, stride=1, BN, ReLU, Maxpool
conv3         32 × 2 × 3      2 × 2, stride=1, BN, ReLU, Maxpool
conv4         48 × 1 × 2      2 × 2, stride=1, BN, ReLU, Maxpool
vectorize     96

Case 6:
layer name    output size     parameters
conv1         3 × 12 × 16     1 × 1, stride=10, BN, ReLU
conv2         16 × 6 × 8      3 × 3, stride=1, padding=1, BN, ReLU, Maxpool
conv3         32 × 2 × 3      3 × 3, stride=2, BN, ReLU, Maxpool
vectorize     192

Case 7:
layer name    output size     parameters
conv1         3 × 12 × 16     1 × 1, stride=10, BN, ReLU
conv2         12 × 6 × 8      3 × 3, stride=1, padding=1, BN, ReLU, Maxpool
conv3         24 × 2 × 3      3 × 3, stride=2, BN, ReLU, Maxpool
vectorize     144

Case 8:
layer name    output size     parameters
conv1         3 × 12 × 16     1 × 1, stride=10, BN, ReLU
conv2         8 × 6 × 8       3 × 3, stride=1, padding=1, BN, ReLU, Maxpool
conv3         16 × 2 × 3      3 × 3, stride=2, BN, ReLU, Maxpool
vectorize     96

Case 9:
layer name    output size     parameters
conv1         3 × 6 × 8       1 × 1, stride=20, BN, ReLU, Maxpool
conv2         16 × 2 × 3      3 × 3, stride=1, BN, ReLU, Maxpool
conv3         32 × 1 × 2      2 × 3, stride=1, BN, ReLU, Maxpool
vectorize     64
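Analogously, a sketch of the case 6 feature extractor is given below; the 3 x 120 x 160 input resolution and the 2 x 2, stride-2 pooling are our assumptions, and the maxpool listed in the last row of the table is omitted here so that the flattened size matches vectorize = 192.

    import torch.nn as nn

    class GatingExtractorCase6(nn.Module):
        """Feature extractor of case 6 from Table III (a sketch).

        The 1x1, stride-10 first convolution aggressively downsamples the
        (assumed) 3 x 120 x 160 input to 3 x 12 x 16, which is what keeps the
        gating network small and shallow.
        """

        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 3, kernel_size=1, stride=10),             # -> 3 x 12 x 16
                nn.BatchNorm2d(3), nn.ReLU(inplace=True),
                nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # -> 16 x 12 x 16
                nn.BatchNorm2d(16), nn.ReLU(inplace=True),
                nn.MaxPool2d(2, 2),                                    # -> 16 x 6 x 8
                nn.Conv2d(16, 32, kernel_size=3, stride=2),            # -> 32 x 2 x 3
                nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            )

        def forward(self, x):
            return self.features(x).flatten(1)  # vectorize -> (batch, 192)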
Next, we train the two versions of the reconfigurable network described in the third step of the training procedure in Section III. The obtained results are shown in Fig. 7. The performance of both networks is very similar, while Reconf Select has a significantly smaller fully-connected layer.

For comparison, we also trained a single-input network, a three-input network without any gating mechanism, and a network with a soft-attention mechanism (first three rows in Table IV). These networks use experts with the same architecture as the other considered networks, and the single-input network uses only the center camera. We compare the performance of all considered networks in terms of the test loss and the amount of computations needed for the forward pass through the network. The test loss is measured as the mean squared error between the predicted and the reference steering commands, and the amount of computation is measured in the number of floating-point operations (FLOPs). We summarize the results of the comparison in Table IV.

The results clearly show that both versions of the proposed reconfigurable network achieve performance that is close to the network with three inputs while using almost the same amount of computations as the single-input network. The results also show that a shallow gating network is sufficient for selecting a relevant input at any given moment. We also found that roughly the same number of errors come from choosing the wrong camera as from the insufficiency of a single sensor for making a prediction.

Finally, we tested the second version of the reconfigurable network on our autonomous platform, where the network steers the car in an indoor environment. We recorded the videos captured by the three cameras as well as the gating network output. We used the recorded gating network output to highlight the images corresponding to the sensors that are activated by the gating mechanism at any given moment. The resulting video is attached as the Supplementary material. Example images captured by the three cameras during autonomous driving are shown in Fig. 8.

TABLE IV
COMPARISON OF DIFFERENT NETWORKS IN TERMS OF THE TEST LOSS AND REQUIRED AMOUNT OF COMPUTATIONS.

Model                                                                            Test loss    FLOPs
Network using only center camera                                                 0.20         36.36M
No gating mechanism; images from all three cameras are used                      0.09         109.08M
Network with gating mechanism and no thresholding (soft attention mechanism)     0.09         109.15M
Reconf Concat                                                                    0.12         36.54M
Reconf Select                                                                    0.11         36.41M

Fig. 5. Top: The output of the gating component of the reconfigurable network indicating the relevance of the inputs. Bottom: The comparison between the actual steering command and the predicted steering angles produced by the network after the first step of the training procedure.

Fig. 6. Performance comparison of 9 various gating network architectures.

Fig. 7. The comparison between the actual steering command and the predicted steering angles produced by the first (top) and second (bottom) version of the reconfigurable network after the third step of the training procedure. The output of the gating network is marked with shaded areas in different colors for each selected input.

Fig. 8. Exemplary images captured by the three cameras during autonomous driving when turning left (top row), driving straight (middle row), and turning right (bottom row). The color images in red frames are the inputs selected by the gating network.

V. CONCLUSION

In this work, we propose a reconfigurable network for handling multiple sensors on autonomous platforms. We discuss its architecture and the training procedure. We show that the proposed reconfigurable network effectively selects the most relevant input at any given moment and outperforms the single-sensor network by far (achieving an almost two times smaller test loss), while using the same amount of computations. We also demonstrate that a compact gating network is sufficient for selecting the relevant input at any given moment. Thus, the proposed concept can easily be scaled up to a large number of sensors while keeping the computations at a reasonable level, especially in applications where only a subset of sensors is expected to suffice to perform the learning task at any given time.

ACKNOWLEDGMENT

We would like to thank Dr. Mariusz Bojarski from NVIDIA for hardware assistance and helpful discussions.
REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.

[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” arXiv:1606.00915, 2016.
[4] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in ICASSP, 2012.
[5] J. Weston, S. Chopra, and K. Adams, “#tagspace: Semantic embeddings from hashtags,” in EMNLP, 2014.
[6] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R. Cheng-Yue, et al., “An empirical evaluation of deep learning on highway driving,” arXiv:1504.01716, 2015.
[7] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to end learning for self-driving cars,” CoRR, vol. abs/1604.07316, 2016.
[8] C. Chen, A. Seff, A. L. Kornhauser, and J. Xiao, “Deepdriving: Learning affordance for direct perception in autonomous driving,” in ICCV, 2015.
[9] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural network,” in NIPS, 1989.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[11] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun, “Off-road obstacle avoidance through end-to-end learning,” in NIPS, 2006.
[12] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner, L. Jackel, and U. Muller, “Explaining how a deep neural network trained with end-to-end learning steers a car,” arXiv:1704.07911, 2017.
[13] N. Patel, A. Choromanska, P. Krishnamurthy, and F. Khorrami, “Sensor modality fusion with cnns for ugv autonomous driving in indoor environments,” in IROS, 2017.
[14] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object detection network for autonomous driving,” in CVPR, 2017.
[15] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.
[16] M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the em algorithm,” Neural Computation, vol. 6, no. 2, pp. 181–214, 1994.
[17] R. Collobert, S. Bengio, and Y. Bengio, “A parallel mixture of svms for very large scale problems,” in NIPS, 2002.
[18] D. Eigen, M. Ranzato, and I. Sutskever, “Learning factored representations in a deep mixture of experts,” arXiv:1312.4314, 2013.
[19] E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup, “Conditional computation in neural networks for faster models,” arXiv:1511.06297, 2015.
[20] T. Bolukbasi, J. Wang, O. Dekel, and V. Saligrama, “Adaptive neural networks for efficient inference,” in ICML, 2017.
[21] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv:1701.06538, 2017.
[22] J. Schlosser, C. K. Chow, and Z. Kira, “Fusing lidar and images for pedestrian detection using convolutional neural networks,” in ICRA, 2016.
[23] O. Mees, A. Eitel, and W. Burgard, “Choosing smartly: Adaptive multimodal fusion for object detection in changing environments,” in IROS, 2016.

