Self Driving Car
the input data that are relevant to the driving process.
3. Methods
We developed two types of models. The first uses 3D
convolutional layers followed by recurrent LSTM layers.
The second uses transfer learning: 2D convolutional layers
from a pre-trained model in which the first layers are
blocked from training.
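As a rough, hedged sketch of the second (transfer learning) model, the setup below freezes the first 45 ResNet50 layers and regresses a single steering angle with an MSE loss and the Adam optimizer, settings that match numbers reported later in the paper; the fully connected layer sizes are assumptions, and this is not the authors' exact code (the classic standalone Keras API is assumed).

```python
# Hedged sketch of the transfer-learning model: a pre-trained ResNet50 with its
# first 45 layers blocked from training and a small fully connected stack that
# regresses the steering angle. The Dense layer sizes are assumptions.
from keras.applications.resnet50 import ResNet50
from keras.layers import Dense
from keras.models import Model
from keras.optimizers import Adam

base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(224, 224, 3), pooling="avg")

# Block ("freeze") roughly the first quarter of the layers; train the rest.
for layer in base.layers[:45]:
    layer.trainable = False

x = Dense(256, activation="relu")(base.output)
x = Dense(64, activation="relu")(x)
steering_angle = Dense(1, activation="linear")(x)

model = Model(inputs=base.input, outputs=steering_angle)
model.compile(optimizer=Adam(lr=0.001), loss="mse")
```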
3.1.4 Recurrent Neural Networks and Long Short Term Memory
contains 5615 frames. The original resolution of the images is 640x480.

Training images come from 5 different driving videos:

1. 221 seconds, direct sunlight, many lighting changes. Good turns in beginning, discontinuous shoulder lines, ends in lane merge, divided highway.

2. 791 seconds, two lane road, shadows are prevalent, traffic signal (green), very tight turns where the center camera can't see much of the road, direct sunlight, fast elevation changes leading to steep gains/losses over the summit. Turns into divided highway around 350s, quickly returns to 2 lanes.

3. 99 seconds, divided highway segment of return trip over the summit.

4. 212 seconds, guardrail and two lane road, shadows in beginning may make training difficult, mostly normalizes towards the end.

5. 371 seconds, divided multi-lane highway with a fair amount of traffic.

Figure 4 shows typical images for different light, traffic and driving conditions.

Training only the last 5 blocks provided poor results, which were only slightly better than predicting a steering angle of zero for all inputs. Training all the layers also produced worse results on the validation set compared to blocking the first 45 layers (0.0870 on the validation set after 32 epochs).

The model took as input images of 224x224x3 (downsized and mildly stretched from the original Udacity data). The only augmentation provided for this model was mirrored images. Due to the size constraints of the input into ResNet50, cropping was not used, as it involved stretching the image. The filters in the pretrained model were not trained on stretched images, so the filters may not activate as well on the stretched data (RMSE of 0.0891 on the validation set after 32 epochs). Additionally, using the left and the right cameras from the training set proved not to be useful for the 32 epochs used to train (0.17 RMSE on the validation set).
Figure 5. Brightness augmentation examples
Figure 7. Shift augmentation examples
version along with small rotations (-5 to +5 degrees), shifts (25 pixels), and small brightness changes of the image. The final, Heavy version of augmentation used more exaggerated effects of the second version, including large-angle rotations (up to 30 degrees), large shadows, shifts, and larger brightness changes. Results from this experiment can be seen in Table 1.

Table 1. RMSE on the validation set using the NVIDIA architecture for different levels of data augmentation with 32 epochs.

    Minimal   Moderate   Heavy
     0.09       0.10      0.19

Using heavy data augmentation produced very poor results that were not much better than predicting a steering angle of 0 for all the data. The moderate augmentation produced good results; however, the minimal augmentation produced the best results. These results could be explained by only training for 32 epochs: with heavy augmentation, the model may find it hard to learn from such drastic shifts, and similarly, the moderate version may have outperformed the minimal version over more epochs. In a later section, visualizations of these tests will be examined. For our new models, we chose to use minimal augmentation.
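The sketch below illustrates the augmentation operations described above (small rotations, 25-pixel shifts, small brightness changes, and mirroring). The exact parameter ranges, the use of OpenCV, the mirroring probability, and the sign flip of the steering angle for mirrored images are assumptions, not the authors' implementation.

```python
# Hedged sketch of per-image augmentation: small rotation, shift, brightness
# change, and mirroring. Ranges and the angle sign flip are assumptions.
import cv2
import numpy as np

def augment(image, angle):
    h, w = image.shape[:2]

    # Small random rotation between -5 and +5 degrees.
    rot = np.random.uniform(-5, 5)
    m_rot = cv2.getRotationMatrix2D((w / 2, h / 2), rot, 1.0)
    image = cv2.warpAffine(image, m_rot, (w, h))

    # Random shift of up to 25 pixels in each direction.
    tx, ty = np.random.uniform(-25, 25, size=2)
    m_shift = np.float32([[1, 0, tx], [0, 1, ty]])
    image = cv2.warpAffine(image, m_shift, (w, h))

    # Small brightness change applied in HSV space.
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[:, :, 2] *= np.random.uniform(0.8, 1.2)
    image = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)

    # Mirror about half of the images; the steering angle flips sign (assumption).
    if np.random.rand() < 0.5:
        image = cv2.flip(image, 1)
        angle = -angle

    return image, angle
```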
5.2. Training Process
5.2.1 Loss Function and Optimization
For all models, the mean-squared-error (MSE) loss function was used. This loss function is common for regression problems, and it punishes large deviations harshly. It is simply the mean of the squared differences between the actual and predicted values (see Equation 1). The scores for the Udacity challenge were reported as the root-mean-square error (RMSE), which is simply the square root of the MSE.
\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\tag{1}
\]
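To make Equation 1 concrete, a small illustrative NumPy helper (not from the paper) is shown below; the variable names are hypothetical.

```python
# Equation 1 in code: MSE and RMSE over actual vs. predicted steering angles.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

# Example: predicting 0 for every frame gives the "zero steering" baseline RMSE.
# baseline = rmse(actual_angles, np.zeros_like(actual_angles))
```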
To optimize this loss, the Adam optimizer was used [10]. This optimizer is often the go-to choice for deep learning applications, and it usually substantially outperforms more generic stochastic gradient descent methods. Initial testing of these models indicated that their loss reduction slowed after a few epochs. Although Adam computes an adaptive learning rate through its formula, decay of the learning rate was also used: the decay rate of the optimizer was updated from 0 to the learning rate divided by the number of epochs. The other default values of the Keras Adam optimizer showed good results during training (learning rate of 0.001, β1 = 0.9, β2 = 0.999, ε = 1e−8, and decay = learning rate / batch size).
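A minimal sketch of these optimizer settings, assuming the classic standalone Keras 2 API the paper refers to; the epoch count mirrors the 32-epoch runs reported above, and `model` stands for one of the networks from Section 3.

```python
# Hedged sketch of the Adam settings described above (not the authors' script).
from keras.optimizers import Adam

learning_rate = 0.001
epochs = 32  # assumed to match the 32-epoch runs reported above

adam = Adam(lr=learning_rate, beta_1=0.9, beta_2=0.999, epsilon=1e-8,
            decay=learning_rate / epochs)  # decay raised from the default of 0

# model.compile(optimizer=adam, loss="mse")  # MSE loss from Equation 1
```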
5.3. Feature Visualization

In order to examine what our networks find relevant in an image, we can use saliency maps. These maps show how the gradient flows back to the image, highlighting the most salient areas. A similar approach was used in a recent NVIDIA paper [2].

5.3.1 Data Augmentation Experiment

What these models found important can be visualized in Figure 9. The minimal model found the lane markers important. In the moderate model, more of the lane markers were found to be important; however, this model's saliency maps appeared noisier, which could explain its slightly decreased performance. In the heavy model, almost no areas were found to be salient, which is understandable given its poor performance.

Figure 9. NVIDIA model saliency maps for different levels of augmentation.

5.3.2 3D Convolutional LSTM Model

This model produced interesting saliency maps. In examining one of the video clips fed into the model, we can see in Figure 10 that the salient features change from frame to frame, which would indicate that the changes between frames are important.

This sequence of frames can be collapsed into a single image, which is shown in Figure 11; the collapsed version helps to visualize this. The salient features do cluster around road markers, but they also cluster around other vehicles and their shadows. This model may be using information about the car in front, along with the road markers, in order to make a steering angle prediction.
Figure 12. Transfer learning model (ResNet50) saliency map for an example image.
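As a hedged illustration of the saliency-map idea from Section 5.3 (the gradient of the predicted steering angle flowing back to the input image), the sketch below assumes the classic Keras 2 / TensorFlow 1.x graph-mode backend; it is a generic gradient-saliency sketch, not necessarily the authors' exact method. For the sequence model, per-frame maps like this could be collapsed by taking a maximum over the time axis.

```python
# Hedged sketch: gradient-based saliency for a single-output regression model.
import numpy as np
from keras import backend as K

def saliency_map(model, image):
    """image: (H, W, 3) array preprocessed the same way as the training data."""
    grads = K.gradients(model.output[:, 0], model.input)[0]
    grad_fn = K.function([model.input], [grads])
    g = grad_fn([image[np.newaxis]])[0][0]
    # Collapse the channel dimension and normalize to [0, 1] for display.
    sal = np.max(np.abs(g), axis=-1)
    return sal / (sal.max() + 1e-8)
```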
longer period of video along with more ResNet blocks. Expanding the model in this way could have produced superior results. One of the teams near the top of the competition used a full 250 frames, or 2.5 seconds of video [19].

For the ResNet50 transfer model, the strategy of using a pre-trained model, locking approximately the first quarter of the layers, training the deeper layers from their existing weights, and connecting to a fully connected stack proved to be effective in producing a high-quality and competitive model for the Udacity self-driving car dataset. It was surprising that this model outperformed the other model: its architecture takes no temporal data, yet it still predicts very good values.

Both of these models appeared to have had some overfitting, with the ResNet50 model having more of an issue with this. Data augmentation could act as a form of regularization for this model. Different teams in the Udacity challenge have tried different regularization methods, including dropout and L2 regularization. The results for these regularization methods were mixed, with some teams claiming good results and others having less success.

6. Conclusion and Future Work
In examining the final leaderboard from Udacity, our models would have placed fourth (transfer learning model) and tenth (3D convolutional model with LSTM layers). These results were produced solely from the models, without any external smoothing function. We have shown that both transfer learning and a more advanced architecture have promise in the field of autonomous vehicles. The 3D model was limited by computational resources, but overall it still provided a good result from a novel architecture. In future work, the 3D model's architecture could be expanded with larger and deeper layers, which may produce better results.

These models are far from perfect, and substantial research still needs to be done on the subject before models like these can be deployed widely to transport the public. These models may benefit from a wider range of training data; for a production system, a model would have to be able to handle the environment in snowy conditions. Generative adversarial networks (GANs) could be used to transform a summer training set into a winter one, and could also be used to generate more scenes with sharp angles. Additionally, a high-quality simulator could be used with deep reinforcement learning. A potential reward function could be getting from one point to another while minimizing time, maximizing smoothness of the ride, staying in the correct lane and following the rules of the road, and not hitting objects.

Figure 13. Example actual vs. predicted angle on unprocessed images (transfer model).
References

[1] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[2] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner, L. Jackel, and U. Muller. Explaining how a deep neural network trained with end-to-end learning steers a car. arXiv preprint arXiv:1704.07911, 2017.
[3] A. El Sallab, M. Abdou, E. Perot, and S. Yogamani. Deep reinforcement learning framework for autonomous driving. Autonomous Vehicles and Machines, Electronic Imaging, 2017.
[4] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE transactions on pattern analysis and machine intelligence, 35(8):1915–1929, 2013.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[7] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
[8] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[9] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[10] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[11] D. P. Kingma and M. Welling. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.
[12] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves,
I. Antonoglou, D. Wierstra, and M. Riedmiller. Play-
ing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602, 2013.
[13] D. A. Pomerleau. Alvinn, an autonomous land vehicle in a
neural network. Technical report, Carnegie Mellon Univer-
sity, Computer Science Department, 1989.
[14] E. Santana and G. Hotz. Learning a driving simulator. arXiv
preprint arXiv:1608.01230, 2016.
[15] S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe,
multi-agent, reinforcement learning for autonomous driving.
arXiv preprint arXiv:1610.03295, 2016.
[16] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre,
G. Van Den Driessche, J. Schrittwieser, I. Antonoglou,
V. Panneershelvam, M. Lanctot, et al. Mastering the game
of go with deep neural networks and tree search. Nature,
529(7587):484–489, 2016.
[17] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks
for object detection. In Advances in Neural Information Pro-
cessing Systems, pages 2553–2561, 2013.
[18] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri.
Learning spatiotemporal features with 3d convolutional net-
works. In The IEEE International Conference on Computer
Vision (ICCV), December 2015.
[19] Udacity. An open source self-driving car, 2017.
[20] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan,
O. Vinyals, R. Monga, and G. Toderici. Beyond short snip-
pets: Deep networks for video classification. In Proceed-
ings of the IEEE conference on computer vision and pattern
recognition, pages 4694–4702, 2015.