Self-Driving Car Steering Angle Prediction Based on Image Recognition

Shuyang Du Haoli Guo Andrew Simpson


[email protected] [email protected] [email protected]

Abstract

Self-driving vehicles have expanded dramatically over the last few years. Udacity has released a dataset containing, among other data, a set of images with the steering angle captured during driving. The Udacity challenge aimed to predict the steering angle based only on the provided images. We explore two different models to perform high-quality prediction of steering angles from images using different deep learning techniques, including transfer learning, 3D CNNs, LSTMs, and ResNet. If the Udacity challenge were still ongoing, both of our models would have placed in the top ten of all entries.

1. Introduction

Self-driving vehicles are going to have an enormous economic impact over the coming decade. Creating models that meet or exceed the ability of a human driver could save thousands of lives a year. Udacity has an ongoing challenge to create an open source self-driving car [19]. In their second challenge, Udacity released a dataset of images taken while driving, along with the corresponding steering angle and ancillary sensor data for a training set (left, right, and center cameras with interpolated angles based on camera angle). The goal of the challenge was to find a model that, given an image taken while driving, minimizes the RMSE (root mean square error) between the model's prediction and the actual steering angle produced by a human driver. In this project, we explore a variety of techniques, including 3D convolutional neural networks, recurrent neural networks using LSTM, and ResNets, to output a predicted steering angle as a numerical value.

The motivation of the project is to eliminate the need for hand-coded rules and instead create a system that learns how to drive by observing. Predicting the steering angle is one important part of the end-to-end approach to self-driving cars and allows us to explore the full power of neural networks. For example, using only the steering angle as the training signal, deep neural networks can automatically extract features that locate the road and inform the prediction.

The first of the two models that will be discussed uses 3D convolutional layers followed by recurrent layers using LSTM (long short-term memory). This model explores how temporal information can be used to predict the steering angle; both 3D convolutional layers and recurrent layers make use of temporal information. The second model uses transfer learning (reusing the lower layers of a pre-trained model without retraining them), starting from a high-quality model trained on another dataset. Transfer learning helps to mitigate the amount of training data needed to create a high-quality model.

2. Related Work

Using a neural network for autonomous vehicle navigation was pioneered by Pomerleau (1989) [13], who built the Autonomous Land Vehicle in a Neural Network (ALVINN) system. The model structure was relatively simple, comprising a fully-connected network that is tiny by today's standards. The network predicted actions from pixel inputs applied to simple driving scenarios with few obstacles. However, it demonstrated the potential of neural networks for end-to-end autonomous navigation.

Last year, NVIDIA released a paper regarding a similar idea that built on ALVINN. In that paper [1], the authors used a relatively basic CNN architecture to extract features from the driving frames; the layout of the architecture can be seen in Figure 1. Augmentation of the collected data was found to be important: the authors used artificial shifts and rotations of the training set, and left and right cameras with interpolated steering angles were also incorporated. This framework was successful in relatively simple real-world scenarios, such as highway lane-following and driving on flat, obstacle-free courses.

Recently, work using deep CNNs and RNNs to tackle the challenges of video classification [9], scene parsing [4], and object detection [17] has stimulated the application of more complicated CNN architectures to autonomous driving. "Learning Spatiotemporal Features with 3D Convolutional Networks" shows how to construct 3D convolutional networks that capture spatiotemporal features in a sequence of images or videos [18]. "Beyond Short Snippets: Deep Networks for Video Classification" describes two approaches, including the use of LSTMs, for dealing with videos [20]. "Deep Residual Learning for Image Recognition" [6] and "Densely Connected Convolutional Networks" [7] describe techniques for constructing residual connections between different layers, which make deep neural networks easier to train.

Figure 1. CNN architecture used in [1]. The network contains approximately 27 million connections and 250 thousand parameters.

Besides the CNN and/or RNN methods, there are further research initiatives applying deep learning techniques to autonomous driving challenges. Another line of work treats autonomous navigation as a video prediction task. Comma.ai [14] has proposed learning a driving simulator with an approach that combines a Variational Auto-encoder (VAE) [11] and a Generative Adversarial Network (GAN) [5]. Their approach is able to keep predicting realistic-looking video for several frames based on previous frames, despite the transition model being optimized without a cost function in pixel space.

Moreover, deep reinforcement learning (RL) has also been applied to autonomous driving [3], [15]. RL had not been successful for automotive applications until recent work demonstrated the ability of deep learning algorithms to learn good representations of the environment, as shown by the learning of games such as Atari and Go by Google DeepMind [12], [16]. Inspired by these works, [3] proposed a framework for autonomous driving using deep RL. The framework is extensible to include an RNN for information integration, which enables the car to handle partially observable scenarios, and it also integrates attention models, using glimpse and action networks to direct the CNN kernels to the places in the input data that are relevant to the driving process.

3. Methods

We developed two types of models. The first uses 3D convolutional layers followed by recurrent layers using LSTM. The second uses transfer learning with 2D convolutional layers on a pre-trained model in which the first layers are blocked from training.

3.1. 3D Convolutional Model with Residual Connections and Recurrent LSTM Layers

3.1.1 3D Convolutional Layer

A 3D convolutional layer works much like a 2D convolutional layer; the only difference is that, in addition to height and width, there is now a third dimension, depth (temporal). Instead of a 2D filter (ignoring the channel dimension for a moment) moving over the image along height and width, a 3D filter moves along height, width, and depth. If the input has shape (D1, H1, W1, C), then the output has shape (D2, H2, W2, F), where F is the number of filters. D2, H2, and W2 can be calculated from the stride and padding in each dimension.
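
As an illustration (not from the original report; the filter count and strides below are hypothetical, and a TensorFlow/Keras setup is assumed), the output shape of a 3D convolution can be checked directly:

```python
# Sketch: a 3D convolution over a stack of D1 = 5 frames of 120x320 RGB images,
# using the (D, H, W, C) layout described above. Hypothetical filters/strides.
from tensorflow.keras import Input, Model, layers

frames = Input(shape=(5, 120, 320, 3))                  # (D1, H1, W1, C)
conv = layers.Conv3D(filters=16, kernel_size=(3, 3, 3),
                     strides=(1, 2, 2), padding='same')(frames)
# With 'same' padding: D2 = ceil(5/1) = 5, H2 = ceil(120/2) = 60,
# W2 = ceil(320/2) = 160, and the channel axis becomes F = 16.
print(Model(frames, conv).output_shape)                 # (None, 5, 60, 160, 16)
```
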
3.1.2 Residual Connection

Since deep neural networks may have trouble passing the gradient back through all of their layers, residual connections are used to help the training process. The idea of a residual connection is to use the network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping: rather than fitting H(x) directly, the layers fit F(x) = H(x) − x, and the input x is added back to the output. Essentially, the gradient can skip over different layers. This skipping lowers the effective number of layers the gradient has to pass through in order to make it all the way back, which alleviates problems with the backpropagation algorithm in very deep networks, such as the vanishing gradient problem.
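
A minimal sketch of such a residual block built from 3D convolutions follows (hypothetical sizes, assuming Keras; not the exact block used in the trained model). Batch normalization is placed after each convolution, as described in the next subsection:

```python
# Sketch: a 3D-convolutional residual block. The stacked layers fit F(x), and
# the input x is added back so the block outputs F(x) + x.
from tensorflow.keras import layers

def residual_block_3d(x, filters):
    # Assumes x already has `filters` channels so the shapes match for the add.
    shortcut = x
    y = layers.Conv3D(filters, (3, 3, 3), padding='same')(x)
    y = layers.BatchNormalization()(y)        # spatial batch normalization (3.1.3)
    y = layers.Activation('relu')(y)
    y = layers.Conv3D(filters, (3, 3, 3), padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])           # the residual (skip) connection
    return layers.Activation('relu')(y)
```
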
3.1.3 Spatial Batch Normalization

Batch normalization (see [8]) alleviates many of the headaches of properly initializing neural networks by explicitly forcing the activations throughout the network to take on a unit Gaussian distribution at the beginning of training. Spatial batch normalization normalizes not only across different samples but also across the spatial axes of the images. Here we add spatial batch normalization after each 3D convolutional layer.

3.1.4 Recurrent Neural Networks and Long Short-Term Memory

Recurrent neural networks have loops in them, which allows information to persist; this can be used to understand the present frame based on previous video frames. One problem with a vanilla RNN is that it cannot learn to connect information that is too far apart, due to vanishing gradients. An LSTM has a special cell state that serves as a conveyor belt, allowing information to flow with few interactions. After we use 3D convolutional layers to extract visual features, we feed them into LSTM layers to capture the sequential relationship.

3.1.5 New Architecture

For self-driving cars, incorporating temporal information could play an important role in production systems. For example, if the camera sensor is fully saturated from looking at the sun, knowing the information from the previous frames would allow for a much better prediction than basing the steering angle prediction only on the saturated frame. As discussed earlier, both 3D convolutional layers and recurrent layers incorporate temporal information, and in this model we combined these two ways of using it. We also used the idea of residual connections in constructing this model [6]; these connections allow more of the gradient to pass through the network by combining different layers. The input to the model consisted of five sequences of five frames of video shifted by one frame (5x5x120x320x3); these values were selected to fit the computational budget. This allowed both motion and differences in outputs between frames to be used. The model had 543,131 parameters, and its architecture can be seen in Figure 2.

The model consists of a few initial layers to shrink the size, followed by ResNet-like blocks of 3D convolutions with spatial batch normalization (only two of these in the trained model). Due to computational constraints, shrink layers were added to make the input to the LSTM layers much smaller. Only two levels of recurrent layers were used, because computation on these layers is much slower, as parts of it must be done serially. The output of the recurrent layers was fed into a fully connected stack that ends with the angle prediction. All of these layers used rectified linear units (ReLUs) as their activation, except the LSTM layers (a ReLU keeps positive values unchanged and sets negative values to zero). Spatial batch normalization was used on the convolutional layers. The LSTM layers used the hyperbolic tangent as their activation, which is common for these types of layers.

Figure 2. 3D convolutional model with residual connections and recurrent LSTM layers.
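
A minimal sketch of this style of 3D-convolution-plus-LSTM pipeline is shown below (layer counts and sizes are illustrative only, not the exact 543,131-parameter configuration; a Keras setup is assumed):

```python
# Sketch: 3D convolutions extract spatio-temporal features from 5 frames, the
# per-frame feature vectors are fed to LSTM layers, and a fully connected stack
# produces a single steering-angle output. Illustrative sizes only.
from tensorflow.keras import Input, Model, layers

frames = Input(shape=(5, 120, 320, 3))
x = layers.Conv3D(16, (3, 3, 3), strides=(1, 2, 2), padding='same',
                  activation='relu')(frames)          # "shrink" layer
x = layers.BatchNormalization()(x)
x = layers.Conv3D(32, (3, 3, 3), strides=(1, 2, 2), padding='same',
                  activation='relu')(x)
x = layers.BatchNormalization()(x)
x = layers.TimeDistributed(layers.Flatten())(x)        # one feature vector per frame
x = layers.LSTM(64, return_sequences=True)(x)          # tanh activation by default
x = layers.LSTM(16)(x)
x = layers.Dense(32, activation='relu')(x)
angle = layers.Dense(1)(x)                             # predicted steering angle
model = Model(frames, angle)
```
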
3.2. Transfer Learning

For this model, we used the idea of transfer learning. Transfer learning is a way of using high-quality models that were trained on existing large datasets. The idea is that features learned in the lower layers of the model, such as edges, are likely transferable to another dataset and would be useful on the new data.

Of the pre-trained models available, ResNet50 had good performance for this dataset. This model was trained on ImageNet. The weights of the first 15 ResNet blocks were blocked from updating (the first 45 individual layers out of 175 in total). The output of ResNet50 was connected to a stack of fully connected layers containing 512, 256, 64, and 1 units, respectively. The architecture of this model can be seen in Figure 3, and the overall number of parameters is 24,784,641. The fully connected layers used ReLUs as their activation. The ResNet50 model consists of several different repeating blocks that form residual connections; the number of filters varies from 64 to 512. A block consists of a convolutional layer, batch normalization, and ReLU activation repeated three times, with the block's input combined with its last layer.

Other amounts of locking were attempted, but produced either poor results or slow training. For example, training only the last 5 blocks gave poor results, only slightly better than predicting a steering angle of zero for all inputs. Training all of the layers also produced worse results on the validation set than blocking the first 45 (0.0870 on the validation set after 32 epochs).

The model took as input images of 224x224x3 (downsized and mildly stretched from the original Udacity data). The only augmentation provided for this model was mirrored images. Due to the size constraints of the input to ResNet50, cropping was not used, as it involved stretching the image: the filters in the pretrained model were not trained on stretched images, so they may not activate as well on stretched data (RMSE of 0.0891 on the validation set after 32 epochs). Additionally, using the left and right cameras from the training set proved not to be useful for the 32 epochs used to train (0.17 RMSE on the validation set).

Figure 3. Architecture used for the transfer learning model.
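
A sketch of this transfer-learning setup is shown below (assuming Keras; the global average pooling used to flatten the ResNet50 output is an assumption, and the snippet is illustrative rather than the exact training script):

```python
# Sketch: ResNet50 pre-trained on ImageNet, with the first 45 layers frozen and
# a 512/256/64/1 fully connected regression head for the steering angle.
from tensorflow.keras import Input, Model, layers
from tensorflow.keras.applications import ResNet50

image = Input(shape=(224, 224, 3))
base = ResNet50(weights='imagenet', include_top=False, pooling='avg')
for layer in base.layers[:45]:             # block the early layers from updating
    layer.trainable = False

x = base(image)                            # pooled ResNet50 features
x = layers.Dense(512, activation='relu')(x)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)
angle = layers.Dense(1)(x)
model = Model(image, angle)
```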

4. Dataset and Features

The dataset we used is provided by Udacity and was generated by NVIDIA's DAVE-2 System [1]. Specifically, three cameras are mounted behind the windshield of the data-acquisition car, and time-stamped video from the cameras is captured simultaneously with the steering angle applied by the human driver. This steering command is obtained by tapping into the vehicle's Controller Area Network (CAN) bus. In order to make the system independent of the car geometry, the steering command is represented as 1/r, where r is the turning radius in meters. They use 1/r instead of r to prevent a singularity when driving straight (the turning radius for driving straight is infinity); 1/r smoothly transitions through zero from left turns (negative values) to right turns (positive values). The training data contains single images sampled from the video, paired with the corresponding steering command (1/r).

The training data set contains 101397 frames and corresponding labels, including steering angle, torque, and speed. We further split this data set into training and validation in an 80/20 fashion. There is also a test set, which contains 5615 frames. The original resolution of the images is 640x480.

The training images come from 5 different driving videos:

1. 221 seconds, direct sunlight, many lighting changes. Good turns in the beginning, discontinuous shoulder lines, ends in a lane merge, divided highway.

2. 791 seconds, two-lane road, shadows are prevalent, traffic signal (green), very tight turns where the center camera can't see much of the road, direct sunlight, fast elevation changes leading to steep gains/losses over a summit. Turns into a divided highway around 350 s, quickly returns to 2 lanes.

3. 99 seconds, divided-highway segment of the return trip over the summit.

4. 212 seconds, guardrail and two-lane road, shadows in the beginning may make training difficult, mostly normalizes towards the end.

5. 371 seconds, divided multi-lane highway with a fair amount of traffic.

Figure 4 shows typical images for different light, traffic, and driving conditions.

Figure 4. Example images from the dataset. From left to right: bright sun, shadows, sharp left turn, uphill, straight, and heavy traffic conditions.

4.1. Data Augmentation Methods

4.1.1 Brightness Augmentation

Brightness is randomly changed to simulate different lighting conditions. We generate augmented images with different brightness by first converting images to HSV, scaling the V channel up or down, and converting back to RGB. Typical augmented images are shown in Figure 5.
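
A sketch of this HSV-based brightness augmentation follows (assuming OpenCV and NumPy; the scaling range is a hypothetical choice):

```python
# Sketch: random brightness change by scaling the V channel in HSV space.
import cv2
import numpy as np

def augment_brightness(image, low=0.5, high=1.5):
    """image: uint8 RGB array. Returns a brightness-perturbed copy."""
    hsv = cv2.cvtColor(image, cv2.COLOR_RGB2HSV).astype(np.float32)
    scale = np.random.uniform(low, high)              # hypothetical range
    hsv[:, :, 2] = np.clip(hsv[:, :, 2] * scale, 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)
```
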
4.1.2 Shadow Augmentation

Random shadows are cast across the images. The intuition is that even when the camera view is shadowed (perhaps by rainfall or dust), the model is still expected to predict the correct steering angle. This is implemented by choosing random points and shading all points on one side of the image.
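
A sketch of the shadow augmentation just described (assuming NumPy; the shadow strength and the choice of a top-to-bottom dividing line are hypothetical interpretations of "shading all points on one side"):

```python
# Sketch: darken every pixel on one side of a random line drawn from a random
# point on the top edge to a random point on the bottom edge of the image.
import numpy as np

def augment_shadow(image, strength=0.5):
    h, w, _ = image.shape
    x_top, x_bot = np.random.uniform(0, w, size=2)
    xs = np.tile(np.arange(w), (h, 1))                # column index of each pixel
    ys = np.tile(np.arange(h), (w, 1)).T              # row index of each pixel
    side = ys * (x_bot - x_top) - (xs - x_top) * h > 0
    shadowed = image.astype(np.float32)
    shadowed[side] *= strength                        # darken one side of the line
    return shadowed.astype(image.dtype)
```
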
Figure 5. Brightness augmentation examples.

Figure 6. Shadow augmentation examples.

4.1.3 Horizontal and Vertical Shifts

We shift the camera images horizontally to simulate the effect of the car being at different positions on the road, and add an offset corresponding to the shift to the steering angle. We also shift the images vertically at random to simulate the effect of driving up or down a slope.

Figure 7. Shift augmentation examples.
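
A sketch of this shift augmentation with its steering-angle offset follows (assuming OpenCV; the pixel ranges and the angle-per-pixel factor are hypothetical):

```python
# Sketch: translate the image and adjust the steering label. A horizontal shift
# simulates a different lateral position on the road; a vertical shift
# simulates driving up or down a slope.
import cv2
import numpy as np

def augment_shift(image, angle, x_range=25, y_range=10, angle_per_px=0.004):
    dx = np.random.uniform(-x_range, x_range)
    dy = np.random.uniform(-y_range, y_range)
    h, w = image.shape[:2]
    M = np.float32([[1, 0, dx], [0, 1, dy]])          # affine translation matrix
    shifted = cv2.warpAffine(image, M, (w, h))
    return shifted, angle + dx * angle_per_px         # offset proportional to dx
```
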
4.1.4 Rotation Augmentation

During the data augmentation experiment, the images were also rotated around the center. The idea behind rotations is that the model should be agnostic to camera orientation and that rotations may help reduce over-fitting.

Figure 8. Rotation augmentation example.

4.2. Preprocessing

For each image, we normalize the value range from [0, 255] to [-1, 1] using image = -1 + 2 * original_image / 255. We further rescale the image to a 224x224x3 square image for the transfer learning model. For the 3D LSTM model, we cropped the sky out of the image to produce data of size 120x320x3.
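
A sketch of this preprocessing follows (assuming OpenCV; the number of sky rows removed before resizing is a hypothetical value):

```python
# Sketch: scale pixel values from [0, 255] to [-1, 1]; resize to 224x224 for the
# transfer learning model, or crop the sky and resize to 120x320 for the 3D
# LSTM model.
import cv2
import numpy as np

def preprocess_transfer(image):
    resized = cv2.resize(image, (224, 224))
    return -1.0 + 2.0 * resized.astype(np.float32) / 255.0

def preprocess_3d_lstm(image, sky_rows=240):
    cropped = image[sky_rows:, :, :]                  # drop the top (sky) rows
    cropped = cv2.resize(cropped, (320, 120))         # dsize is (width, height)
    return -1.0 + 2.0 * cropped.astype(np.float32) / 255.0
```
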
5. Experiments, Results, and Discussion

5.1. Data Augmentation

In order to establish a baseline for our research, we used the architecture from NVIDIA [1] to test different forms of data augmentation. Teams in the Udacity challenge, along with the original NVIDIA researchers, noted that data augmentation was helpful. Knowing which forms of augmentation work well for the amount of time and computation available would be helpful in training our new models. The NVIDIA model architecture was shown previously in Figure 1. The input to this model was 120x320x3 with a batch size of 32. In the NVIDIA paper [1] it was not clear how they optimized the loss function; for this experiment, we used the default parameters of Adam (see [10]) provided in Keras (learning rate of 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8, and decay = 0).

Three different levels of augmentation were examined. The first had minimal augmentation, using only random flips and cropping of the top of the image. Randomly flipping the input images eliminates the bias towards right turns found in the dataset, and cropping the top of the image removes the sky, which should not play a role in predicting the steering angle. A second form of augmentation had the same augmentation as the minimal version along with small rotations (-5 to +5 degrees), shifts (25 pixels), and small brightness changes. The final, heavier version of augmentation used more exaggerated effects than the second version, including large rotations (up to 30 degrees), large shadows, shifts, and larger brightness changes. Results from this experiment can be seen in Table 1.

Table 1. RMSE on the validation set using the NVIDIA architecture for different levels of data augmentation with 32 epochs.

    Minimal   Moderate   Heavy
    0.09      0.10       0.19

Using heavy data augmentation produced very poor results that were not much better than predicting a steering angle of 0 for all the data. The moderate augmentation produced good results; however, the minimal augmentation produced the best results. These results could be explained by training for only 32 epochs: heavy augmentation, with its drastic shifts, could be hard for the model to learn from in so few epochs, and, similarly, the moderate version might have outperformed the minimal version over more epochs. In a later section, visualization of these tests will be examined. For our new models, we chose to use minimal augmentation.

5.2. Training Process

5.2.1 Loss Function and Optimization

For all models, the mean-square-error loss function was used. This loss function is common for regression problems, and it punishes large deviations harshly. It is simply the mean of the squared differences between the actual and predicted results (see Equation 1). The scores for the Udacity challenge were reported as the root-mean-square error (RMSE), which is simply the square root of the MSE.

    MSE  = (1/n) Σ_i (y_i − ŷ_i)²
                                                  (1)
    RMSE = √( (1/n) Σ_i (y_i − ŷ_i)² )

To optimize this loss, the Adam optimizer was used [10]. This optimizer is often the go-to choice for deep learning applications, and it usually substantially outperforms more generic stochastic gradient descent methods. Initial testing of these models indicated that their loss stopped improving quickly after a few epochs. Although Adam computes an adaptive learning rate through its formula, decay of the learning rate was also used: the decay rate of the optimizer was changed from 0 to the learning rate divided by the number of epochs. The other default values of the Keras Adam optimizer showed good results during training (learning rate of 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8, and decay = learning rate/batch size).
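
A sketch of this training configuration follows (the stand-in model is hypothetical and the argument names follow the Keras API of the era of this report; newer Keras versions replace lr/decay with learning_rate and learning-rate schedules):

```python
# Sketch: MSE loss with the Adam optimizer, with decay set to the learning rate
# divided by the number of epochs as described above. Illustrative only.
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam

# Stand-in model; in practice this would be one of the models described above.
model = Sequential([Flatten(input_shape=(120, 320, 3)), Dense(1)])

epochs = 32
optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8,
                 decay=0.001 / epochs)
model.compile(optimizer=optimizer, loss='mse')   # challenge scores use RMSE = sqrt(MSE)
```
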
5.3. Feature Visualization

In order to examine what our networks find relevant in an image, we can use saliency maps. These maps show how the gradient flows back to the image, highlighting the most salient areas. A similar approach was used in a recent NVIDIA paper [2].

5.3.1 Data Augmentation Experiment

What these models found important can be visualized in Figure 9. The minimal model found the lane markers important. In the moderate model, more of the lane markers were found to be important; however, this model's saliency maps appeared noisier, which could explain its slightly decreased performance. In the heavy model almost no areas were found to be salient, which is understandable given its poor performance.

Figure 9. NVIDIA model saliency maps for different levels of augmentation.

5.3.2 3D Convolutional LSTM Model

This model produced interesting saliency maps. Examining one of the video clips fed into the model, we can see in Figure 10 that the salient features change from frame to frame, which would indicate that the changes between frames are important.

Figure 10. Saliency map for a sequence in the 3D convolutional LSTM model for an example image.

This sequence of frames can be collapsed into a single image, shown in Figure 11; the collapsed version helps to visualize this better. The expressed salient features do cluster around road markers, but they also cluster around other vehicles and their shadows. This model may be using information about the car in front, along with the road markers, in order to make a steering angle prediction.

Figure 11. Saliency map for the 3D convolutional LSTM model for an example image.

5.3.3 Transfer Learning Model

An example saliency map for the ResNet50 transfer learning model can be seen in Figure 12. The model does appear to have salient features on the road markers; however, there are also regularly spaced blotches. These blotches may be artifacts of using this type of pretrained model with residual connections. Although this model had the best overall results, its saliency maps did not match well with what would be expected of salient features for predicting steering angles from road images.

Figure 12. Transfer learning model (ResNet50) saliency map for an example image.
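
A sketch of how such a saliency map can be computed as the gradient of the predicted angle with respect to the input pixels (assuming a TensorFlow 2-style gradient tape, which is a more recent API than the one used for the original experiments):

```python
# Sketch: saliency as the magnitude of d(predicted angle)/d(input pixels),
# reduced over the color channels.
import numpy as np
import tensorflow as tf

def saliency_map(model, image):
    """image: float32 array shaped like a single model input (no batch axis)."""
    x = tf.convert_to_tensor(image[np.newaxis], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        angle = model(x)
    grads = tape.gradient(angle, x)                   # d(angle) / d(pixels)
    return tf.reduce_max(tf.abs(grads), axis=-1).numpy()[0]
```
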
5.4. Results

These models were all run on the same datasets (training, validation, and test). The results for each model are listed in Table 2. The 3D convolutional model with LSTM and residual connections had an RMSE of 0.1123 on the test set. Since the Udacity challenge is over, the results can be compared to the leaderboard: the 3D LSTM model's test-set result would have placed tenth overall. The ResNet50 transfer model had the best results overall, with an RMSE of 0.0709 on the test set; this result would have placed the model fourth overall in the challenge. This is without using any external functions for the models (some teams used an external smoothing function in conjunction with their deep learning models).

Table 2. RMSE for the models on the Udacity dataset.

                Training Set   Validation Set   Test Set
    Predict 0   0.2716         0.2130           0.2076
    3D LSTM     0.0539         0.1139           0.1123
    Transfer    0.0212         0.0775           0.0709
    NVIDIA      0.0750         0.0995           0.0986

In order to help visualize what this level of error looks like, an example overlay for random images is shown in Figure 13. The green circle indicates the true angle and the red circle indicates the predicted angle; the predictions are generated from the ResNet50 transfer learning model.

Figure 13. Example actual vs. predicted angle on unprocessed images (transfer model).

5.5. Discussion

For the number of epochs we used, only minimal data augmentation proved to be of any major use for these models. For more expansive training, the strategy of data augmentation can allow for near-infinite training data given the right strategy.

Overall, these models showed that they were competitive with other top models from the Udacity challenge. For the 3D LSTM model, with more time and computational resources, the model could have been expanded to take in a longer period of video along with more ResNet blocks, which could have produced superior results.
One of the teams near the top of the competition used a full 250 frames, or 2.5 seconds, of video [19].

For the ResNet50 transfer model, the strategy of using a pre-trained model, locking approximately the first quarter of the layers, training the deeper layers from their existing weights, and connecting the output to a fully connected stack proved effective in producing a high-quality, competitive model for the Udacity self-driving car dataset. It was surprising that this model outperformed the other model: its architecture takes in no temporal data, yet it still predicts very good values.

Both of these models appeared to have some overfitting, with the ResNet50 model having more of an issue with this. Data augmentation could act as a form of regularization for this model. Different teams in the Udacity challenge tried different regularization methods, including dropout and L2 regularization; the results were mixed, with some teams claiming good results and others having less success.

6. Conclusion and Future Work

In examining the final leaderboard from Udacity, our models would have placed fourth (transfer learning model) and tenth (3D convolutional model with LSTM layers). These results were produced solely from the models, without any external smoothing function. We have shown that both transfer learning and a more advanced architecture have promise in the field of autonomous vehicles. The 3D model was limited by computational resources, but overall it still provided a good result from a novel architecture. In future work the 3D model's architecture could be expanded with larger and deeper layers, which may produce better results.

These models are far from perfect, and there is substantial research that still needs to be done on the subject before models like these can be deployed widely to transport the public. These models may benefit from a wider range of training data; for a production system, a model would have to be able to handle the environment in snowy conditions. Generative adversarial networks (GANs) could be used to transform a summer training set into a winter one, and they could also be used to generate more scenes with sharp angles. Additionally, a high-quality simulator could be used with deep reinforcement learning; a potential reward function could be getting from one point to another while minimizing time, maximizing smoothness of the ride, staying in the correct lane/following the rules of the road, and not hitting objects.

References

[1] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[2] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner, L. Jackel, and U. Muller. Explaining how a deep neural network trained with end-to-end learning steers a car. arXiv preprint arXiv:1704.07911, 2017.
[3] A. El Sallab, M. Abdou, E. Perot, and S. Yogamani. Deep reinforcement learning framework for autonomous driving. Autonomous Vehicles and Machines, Electronic Imaging, 2017.
[4] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929, 2013.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[7] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
[8] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[9] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[10] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[11] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[12] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[13] D. A. Pomerleau. ALVINN, an autonomous land vehicle in a neural network. Technical report, Carnegie Mellon University, Computer Science Department, 1989.
[14] E. Santana and G. Hotz. Learning a driving simulator. arXiv preprint arXiv:1608.01230, 2016.
[15] S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
[16] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[17] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems, pages 2553–2561, 2013.
[18] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[19] Udacity. An open source self-driving car, 2017.
[20] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.
