Paper 3
e-mail: [email protected]
Abstract—Potholes are structural damage to the road in the form of hollows, which can cause severe traffic accidents and impair road efficiency. In this paper, we propose an efficient pothole detection system that uses deep learning algorithms to detect potholes on the road automatically. Four models are trained and tested with a preprocessed dataset: YOLO V3, SSD, HOG with SVM, and Faster R-CNN. In phase one, initial images with and without potholes are collected and labeled. In phase two, the four models are trained and tested on the processed image dataset so that their accuracy and loss can be compared. Finally, the accuracy and performance of all four models are analyzed. The experimental results show that the YOLO V3 model performs best, giving faster and more reliable detection results.

Keywords—YOLO, Deep learning, Pothole detection, CNN, SVM.

I. INTRODUCTION
A pothole is a structural failure in a road surface. It cannot be ignored, since it may cause severe traffic accidents and impair road efficiency. The 2006 Asian Development Bank (ADB) study showed that about half of these paved roads are in poor condition. Almost all developed countries have a similar problem. Potholes are formed by bad weather and heavy vehicle traffic. The most important step in maintaining road condition is to detect potholes with high accuracy[1,2]. In recent years, many studies have been conducted on detecting potholes in the road automatically. Lin J, et al. proposed using an SVM (Support Vector Machine) for pothole detection[3]. The image region was extracted based on the histogram of the image, and a simple kernel SVM was used to locate the pothole. The target was well recognizable using this method. CNN (Convolutional Neural Network) based deep learning has been used to classify potholes and cracks from images. A model was built using a CNN which was not influenced by noise due to poor illumination and shadows[4]. Hiroya Maeda, et al. [5] developed a system that detects road damage using CNN methods on images taken by phones. They gathered a huge dataset for pothole detection and applied deep learning algorithms to solve the problem. The accuracy and speed of the road damage detection system were satisfactory. Other researchers have employed binary classification [6] based on deep neural networks to classify road images as either normal road images or images with potholes. The features of the images need to be fed to the system before it can perform the classification. A new neural model, Crack-net [7], was proposed for detecting cracks on the road. It differs from other neural models in that pooling layers are not included. This method was very efficient in detecting cracks and uneven surfaces on the road. Zhang L, et al. proposed using a deep CNN model along with some sensors for automatic crack detection[8]. This model can learn the features automatically, without any feature extraction process. Tedeschi A, et al. developed a real-time pothole detection system for Android devices[9].

In this paper, we propose a solution that uses machine learning and artificial intelligence algorithms to create an accurate and efficient pothole detection system. Four models are trained to see which model or ensemble of models produces the best results: YOLO V3 (You Only Look Once), SSD (Single Shot Detector), HOG (Histogram of Oriented Gradients) with Support Vector Machine, and Faster R-CNN. Our pothole detection schema consists of two parts: (1) data preparation and (2) prediction of potholes in images using machine learning models. In the first part, the subsets of all available data that relate to our schema are selected, comprising training and test datasets with positive and negative images. A positive image indicates that there is a pothole in the street, and a negative image indicates that there are no potholes. We need to label images to generate these datasets and then convert the image files to train.record, which is used by the model as input. In the second part, the prepared data is fed to the deep learning models for training and for predicting the potholes. Finally, the experimental results show that YOLO V3 performs best in terms of speed, and it is also decent in terms of accuracy for all object sizes. In terms of time consumption, YOLO V3 still has superb performance. SSD has quite high accuracy but is slower than the other models. The HOG model has mediocre performance in both accuracy and speed. Faster R-CNN has the best accuracy, but this model needs more computing power and training time.

The rest of the paper is organized as follows: Section II presents the data pre-processing relevant to the pothole detection solution. Section III describes the proposed solution with four trained models. In Section IV, the simulation results and performance analysis are given. Finally, the conclusion is given in Section V.

II. FUNDAMENTAL KNOWLEDGE AND PRELIMINARIES
This section presents how the training dataset and test dataset were prepared, together with their statistics, including the classes, sizes, and media types in each cycle.

The dataset we chose was created by the Electrical and Electronic Department, Stellenbosch University, in 2015. The
Authorized licensed use limited to: Cornell University Library. Downloaded on August 29,2020 at 16:31:47 UTC from IEEE Xplore. Restrictions apply.
entire dataset consists of two different sets: one was considered to be simple and the other more complex. The dataset was collected by taking pictures with smartphones mounted on the dashboard of a car. The two sets share some files, and there are a few instances where two different images have the same name. Therefore, appropriate measures need to be taken if the data is combined into one larger dataset. Every folder contains 2 subfolders, which hold the training data and the test data. Furthermore, the training data folder is divided into 2 more subfolders: positive data, which contains pictures of roads with potholes, and negative data, which contains pictures of roads with no potholes. Figure 1 shows examples of training data, both positive and negative, and Figure 2 shows examples of test data.

Figure 1 Examples of training data

Figure 2 Examples of test data

A. Data Preparation
To create the training data, we need labeled images, since labeled images contain the position and name of the object to be classified by the model. The images are labeled by manually drawing a rectangular bounding box around the object in every training image. Finding the exact position of these bounding boxes for all the training images can be a tedious task. To overcome this, we use an image labeling tool such as Labelme or LabelImg. Such a tool makes labeling the potholes easier, as an object can be labeled by just dragging a line across a pothole. The following steps were performed for data preparation:
Step 1: Generate the dataset using LabelImg, which converts each JPEG image file to XML with the pothole labeled.
Step 2: Convert the XML files to CSV records, which hold the image details.
Step 3: Convert this CSV file to train.record, which is used by the model as input.
Once all the images are labeled, an .xml file is created for every image, containing the top-left and bottom-right coordinates of the bounding box. These coordinates are then fed to the model, which predicts the pothole positions. We have created a dataset of 2036 images. Of the total dataset, 1384 images are training data, while the remaining 652 images are used as test data. The detailed statistics are listed in Table 1.

TABLE 1
DETAILED STATISTICS OF THE ENTIRE DATASET

Index name              Description
Image size              72 dpi × 72 dpi
Total categories        2
Total dataset size      2036
Training dataset size   1384
Test dataset size       652

B. Data Validation
The models are tested on 652 images to derive the loss and accuracy of each model on the test images. The loss and accuracy are compared to choose the final model. The models we employed are explained in detail in the next section.

III. THE PROPOSED POTHOLE DETECTION SOLUTION
A. You Only Look Once(YOLO) Algorithm
YOLO is a popular algorithm for detecting objects in images. It uses a single neural network to predict the bounding-box vectors and the pothole class[10,11]. It works by splitting the image into a grid of size S×S. Every cell in the grid can predict N possible bounding boxes, each with a probability (i.e. confidence score) of containing the object, which in our case is a pothole. This gives S×S×N boxes. Figure 3 shows the architecture of YOLO. Most of these boxes have quite a low probability, so the algorithm deletes the boxes that fall below a minimum probability threshold. The remaining bounding boxes are then passed through non-max suppression to remove all duplicate detections. This paper uses YOLO V3, which is trained on full images and predicts the class probability for each bounding box. This method has many benefits over the original methods for object detection. The YOLO V3 model is very fast. A complex pipeline is not needed because YOLO V3 treats object detection as a regression problem. The neural network only needs to be run on the new image whenever we need to make predictions. On a GPU, the model runs at 45 fps, and a faster version runs at 150 fps. This implies that real-time video can also be processed, with latency as low as 25 ms. YOLO V3 looks at the image as a whole before detecting and making predictions.

Figure 3 YOLO V3 architecture (S×S grid on input; bounding boxes + confidence; class probability map; final detections)
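The box-filtering stage described for YOLO (discard low-confidence boxes, then apply non-max suppression to remove duplicates) can be illustrated with a minimal sketch. This is not the paper's implementation: the box layout (x1, y1, x2, y2, confidence), the thresholds 0.5 and 0.45, the grid values S = 13 and N = 3, and the sample boxes are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2, confidence)
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_suppress(boxes: List[Box],
                        conf_threshold: float = 0.5,   # assumed value
                        iou_threshold: float = 0.45    # assumed value
                        ) -> List[Box]:
    # Step 1: drop boxes whose confidence is below the minimum threshold.
    candidates = [b for b in boxes if b[4] >= conf_threshold]
    # Step 2: greedy non-max suppression, highest confidence first.
    candidates.sort(key=lambda b: b[4], reverse=True)
    kept: List[Box] = []
    for box in candidates:
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept


S, N = 13, 3                   # assumed grid size and boxes per cell
total_candidates = S * S * N   # the S x S x N boxes produced per image

raw = [
    (10, 10, 50, 50, 0.90),
    (12, 12, 52, 52, 0.80),      # near-duplicate of the first box
    (100, 100, 140, 140, 0.70),  # a separate pothole
    (200, 200, 220, 220, 0.10),  # low confidence, removed by the threshold
]
detections = filter_and_suppress(raw)
```

With these sample boxes, the low-confidence box is removed in step 1 and the near-duplicate is suppressed in step 2, leaving two detections.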
Unlike other methods such as the sliding window, YOLO V3 looks at the entire image during training as well as during prediction. This gives the model all the contextual information about the classes of objects. A CNN sometimes classifies the background as an object because it is not able to see the picture as a whole and cannot get the larger context out of the picture. As a result, the error rate of YOLO V3 in terms of background errors is half that of the CNN model. YOLO V3 learns the general structure of the object rather than memorizing its exact shape. For this reason, YOLO V3 can make good predictions on natural photos in which the object does not always have exactly the same shape. This feature allows YOLO V3 to outperform many other algorithms.

B. Single Shot Detector(SSD) Algorithm
SSD, which stands for single shot detector, is another algorithm used in object detection. It is based on simple feed-forward neural networks, in which the nodes do not form cycles and information moves only in the forward direction. This algorithm creates bounding boxes of fixed size and assigns each a score indicating the presence of the object in the box. Next comes the non-maximum suppression step, in which the bounding box with the maximum overlap gets the highest score and produces the detection of the object[12]. Localization and classification are done in one single step. SSD is similar to YOLO in the sense that it also divides the image into grids of equal size. Figure 4 shows the architecture of SSD and how it goes through the process of detecting an object.

Location Loss (LL): a parameter that measures how far the predicted bounding boxes are from the actual bounding boxes of the object.

C. Histogram of Oriented Gradients(HOG) with Support Vector Machine
The shape of an object is an important feature for distinguishing it. HOG is a feature extraction algorithm that distinguishes objects on the basis of their shapes. A histogram is computed over the gradient orientations of the picture. Each image has different colors, and the intensity of the colors varies across images. The gradient is the directional change in the color or intensity of the image, and its orientation is the direction of that change. The steps of feature extraction using HOG are as follows:

Step 1: Resize the image to a smaller size while preserving all the features, so that the code runs faster. The OpenCV function resize() can be used for this.
Step 2: Convert the image to a particular colorspace from which specific information can be extracted. There are many different colorspaces, such as RGB, YUV, and LUV. For example, we can vary the lighting and saturation of the image and then train the system on those images to identify objects under shadow. This is usually done with the HLS color scheme.
Step 3: Use the function numpy.histogram() to create a color histogram of the image. This is the most important step, as the histogram contains all the feature information.
Step 4: Use the function hog() to obtain the Histogram of Oriented Gradients (HOG). Figure 4 is an instance of how HOG visualizes an image.
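To make the idea behind Step 4 concrete, the following dependency-free sketch computes a single magnitude-weighted histogram of gradient orientations over a small grayscale patch. It is a simplified illustration, not the hog() function the paper refers to: there are no cells, blocks, or block normalization, and the 9 bins, the unsigned 0 to 180 degree range, and the sample patches are assumptions.

```python
import math
from typing import List


def orientation_histogram(gray: List[List[float]], bins: int = 9) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 deg)."""
    height, width = len(gray), len(gray[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the horizontal/vertical gradient.
            gx = gray[y][x + 1] - gray[y][x - 1]
            gy = gray[y + 1][x] - gray[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue  # flat region, contributes nothing
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle // bin_width), bins - 1)] += magnitude
    total = sum(hist)
    # Normalize so the histogram sums to 1 (left as zeros for a flat patch).
    return [v / total for v in hist] if total > 0 else hist


# Vertical edge: dark on the left, bright on the right.
vertical_edge = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
# Horizontal edge: dark on top, bright at the bottom.
horizontal_edge = [[0] * 6 for _ in range(3)] + [[255] * 6 for _ in range(3)]

features_v = orientation_histogram(vertical_edge)
features_h = orientation_histogram(horizontal_edge)
```

In this sketch the vertical edge puts all of its weight into the first bin (gradient direction near 0 degrees), while the horizontal edge falls into the bin containing 90 degrees, which is how the histogram encodes shape. A feature vector of this kind is what would then be passed to the SVM classifier.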
Figure 5 CNN architecture

Figure 5 shows the process of the CNN algorithm. The first step is to provide an image. Then, parameters are chosen, padding is added, and filters are applied to that image. Next, convolution is performed on the image. Pooling is also performed in order to reduce the number of parameters. Additional convolutional layers can be added if needed. The next step is to flatten the output and feed it to the fully connected layer. The last step is to output the class[4,5].

Like a CNN, Faster R-CNN is also used for classification. There are two main networks in Faster R-CNN. One is the RPN, which is used to generate region proposals. The other is a network that uses the region proposals to detect objects. We chose Faster R-CNN because its RPN generates a fixed set of regions and anchor boxes for object detection. Furthermore, it does not require extensive data augmentation. It is also faster than the earlier R-CNN variants.

The Faster R-CNN model we used has 10 layers: 3 convolutional layers, 3 max-pooling layers and 4 fully connected layers. The purpose of the convolutional layers is to reduce the dimensionality of the image for faster processing and lower complexity. The function of the pooling layers is to reduce the size of the image, or the number of input parameters, for the next layer. The final detection network takes input from both of the previous layers, generates the final bounding boxes, and classifies the images. It consists of 4 fully connected layers. The image is given as input to the CNN and then to an SVM (Support Vector Machine), which helps in predicting the class for each bounding box or region. Next, the bounding boxes are optimized by training each bounding box separately. Differences in image scale and aspect ratio must be handled, which is why Faster R-CNN includes the concept of anchor boxes. There are 3 different sizes of anchor box: 128×128, 256×256 and 512×512. Three different aspect ratios are used: 1:1, 2:1 and 1:2. This gives 9 possible anchor boxes at each location, each of which is labeled as either background or object.

E. Size Calculation of Potholes
We convert the image to a gray scale image to get rid of noise. Some unwanted edges are also created in the process, due to the shadows of trees and insufficient light. Other vehicles can also produce extra edges. One of the major problems in size calculation is that there are many unconnected sharp edges and much noise. These extra edges are cleaned up by the dilation process. Figure 6 shows the process of the size calculation of potholes.

Figure 6 Flow chart of size calculation process

The following morphological operations are applied to the images for size calculation:

Edge Detection: the set of processes that identify points in the image where the change in brightness is sharp and not continuous. These points are put together into a set of lines called edges.

Dilation: grow the bright regions of the gray scale image; it is used here to clean up the extra unwanted edges. Figure 7 shows the process of dilation.

Figure 7 Dilation process

Erosion: shrink the objects in the gray scale image. The process of erosion is shown in Figure 8, and the original image is shown in Figure 9.

Figure 8 Erosion process

Figure 9 Original image
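As a rough illustration of dilation and erosion in their standard textbook form, the sketch below applies both to a tiny binary image using a 3×3 neighbourhood. It is a pure-Python stand-in for an image-processing library, not the authors' code; the binary 0/1 representation, the 3×3 kernel, and the border handling (the window is clipped at the image edge) are assumptions.

```python
from typing import Callable, List

Image = List[List[int]]  # binary image: 1 = bright pixel, 0 = dark pixel


def _apply_3x3(img: Image, pick: Callable[[List[int]], int]) -> Image:
    """Replace each pixel with pick() of its 3x3 neighbourhood (clipped at borders)."""
    height, width = len(img), len(img[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(height, y + 2))
                      for i in range(max(0, x - 1), min(width, x + 2))]
            out[y][x] = pick(window)
    return out


def dilate(img: Image) -> Image:
    # A pixel becomes bright if any neighbour is bright: bright regions grow.
    return _apply_3x3(img, max)


def erode(img: Image) -> Image:
    # A pixel stays bright only if all neighbours are bright: regions shrink.
    return _apply_3x3(img, min)


# A single bright pixel in the middle of a 5x5 dark image.
dot = [[0] * 5 for _ in range(5)]
dot[2][2] = 1

grown = dilate(dot)    # the dot grows into a 3x3 bright block
shrunk = erode(grown)  # eroding the block gives back the single dot
vanished = erode(dot)  # an isolated pixel is removed entirely by erosion
```

The last line shows why erosion is useful against isolated noise pixels, while dilation merges bright fragments that lie close together.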
Thresholding: the process of deciding whether or not an edge is present at a particular point in the image. If the threshold is low, more edges are detected.

Closing: a dilation followed by an erosion, which fills small gaps in bright regions without destroying their original shape.

After the bounding boxes are predicted, the cropped image of the pothole is converted to a black and white image as shown. The depth of a pothole is defined as the maximum number of black pixels in the vertical direction, and it is calculated by looping across all the columns and counting the black pixels in each column. A lot of white noise can be seen in Figure 10, which could result in miscalculation of the pothole. To avoid this, we used several morphological techniques, such as closing, and experimented with various kernel sizes and iterations as shown above. The best result was obtained with a 9 by 9 kernel and one iteration, which is shown in Figure 11.

Finding the actual size and depth of the potholes was the most challenging task, since even potholes of the same size appear different depending on their distance from the camera. That is, potholes that are close to the camera appear bigger than potholes that are at some distance from it. To overcome this issue, we use multiple regression to predict the actual height, width and depth of the pothole, providing the calculated values (in pixels) as inputs.

From the plot of loss in Figure 13(a), we can see that the SSD model has comparable performance on both the train and test datasets (labeled test). If these parallel plots start to depart consistently, it might be a sign to stop training at an earlier epoch. From the plot of accuracy in Figure 13(b), we can see that the SSD model could probably be trained a little more, as the trend for accuracy on both datasets is still rising over the last few epochs. We can also see that the model has not yet overfit the training dataset, showing comparable skill on both datasets.

Figure 13(a) SSD loss vs epoch
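The stopping heuristic mentioned for the SSD loss curves (stop earlier if the train and test plots start to depart consistently) can be written as a small check. This is only a sketch of one way to operationalize it: the function name divergence_epoch, the patience of 3 epochs, the min_delta tolerance, and the loss values are hypothetical and are not taken from the paper's experiments.

```python
from typing import List, Optional


def divergence_epoch(train_loss: List[float],
                     test_loss: List[float],
                     patience: int = 3,        # assumed: epochs of growing gap
                     min_delta: float = 1e-3   # assumed: ignore tiny changes
                     ) -> Optional[int]:
    """Return the first epoch (0-based) of a run of `patience` consecutive
    epochs in which the test/train loss gap keeps growing, or None."""
    streak = 0
    for epoch in range(1, len(train_loss)):
        prev_gap = test_loss[epoch - 1] - train_loss[epoch - 1]
        gap = test_loss[epoch] - train_loss[epoch]
        streak = streak + 1 if gap > prev_gap + min_delta else 0
        if streak >= patience:
            return epoch - patience + 1
    return None


# Made-up curves: parallel for three epochs, then the test loss flattens
# while the training loss keeps falling.
train = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2]
test = [1.1, 0.9, 0.7, 0.65, 0.62, 0.60, 0.59]
stop_at = divergence_epoch(train, test)

# Curves that stay parallel never trigger the check.
parallel = divergence_epoch(train, [t + 0.1 for t in train])
```

Here the gap starts growing at epoch 3, so that is the epoch the check reports; for the parallel curves it returns None, matching the case described above where the model shows comparable skill on both datasets.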
datasets from training. The results of the different models are
compared in the table below:
Table 2
TIME TAKEN TO TRAIN DIFFERENT MODELS
The GPU time required to train the YOLO V3 model was not too high and was manageable when the model was trained on HPC (High Performance Computing). Furthermore, we evaluated the performance of the YOLO V3 model on a set of 1000 road images; the learning rate for YOLO V3 was 0.01. The F-1 score measures accuracy using the precision and recall statistics. Precision is the ratio of true positives to all the predicted positives. Recall is the ratio of true positives to all the actual positives. The biggest advantage of the YOLO V3 algorithm is its superb speed. It can be used in real time, as the processing speed is as fast as 45 frames per second. The YOLO V3 we used has improved average precision so even the accuracy for detection for the small objects improved greatly which was a big drawback

Figure 16 Time vs accuracy of different algorithms (axes: Time (h) and Accuracy; series: CNN, SSD, YOLO V3)
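The precision, recall and F-1 definitions given in the text can be expressed directly in code. The sketch below is illustrative only; the function name and the counts of true positives, false positives and false negatives are made-up values, not results from the paper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F-1 from detection counts."""
    # Precision: true positives over all predicted positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: true positives over all actual positives.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-1: harmonic mean of precision and recall.
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for illustration only.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

Because F-1 is a harmonic mean, it stays low whenever either precision or recall is low, which makes it a stricter summary than plain accuracy for a detector that may miss potholes or raise false alarms.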
all four models and size calculation of potholes is considered for more accurate detection results. Comparing the results of all four models, the YOLO V3 model performed best, with an accuracy of 82%. Future work includes extending the detected objects to broken drains and manhole covers and using images taken from moving vehicles in a realistic scenario.

REFERENCES
[1] Anon, (2019). [Online]. Available at: https://www.pothole.info/the-facts/
[2] Potholes Dataset, Google Drive. [Online]. Available at: https://drive.google.com/drive/folders/1vUmCvdW32lMrhsMbXdMWeLcEzOcuy.
[3] Lin J, Liu Y. Potholes detection based on SVM in the pavement distress image[C]//2010 Ninth International Symposium on Distributed Computing and Applications to Business, Engineering and Science. IEEE, 2010: 544-547.
[4] Cha Y J, Choi W, Büyüköztürk O. Deep learning-based crack damage detection using convolutional neural networks[J]. Computer-Aided Civil and Infrastructure Engineering, 2017, 32(5): 361-378.
[5] Maeda H, Sekimoto Y, Seto T, et al. Road damage detection using deep neural networks with images captured through a smartphone[J]. arXiv preprint arXiv:1801.09454, 2018.
[6] Bray J, Verma B, Li X, et al. A neural network based technique for automatic classification of road cracks[C]//The 2006 IEEE International Joint Conference on Neural Network Proceedings. IEEE, 2006: 907-912.
[7] Allen Zhang, Kelvin C. P. Wang, Baoxian Li, Enhui Yang, Xianxing Dai, Yi Peng, Yue Fei, Yang Liu, Joshua Q. Li, Cheng Chen. Automated Pixel-Level Pavement Crack Detection on 3D Asphalt Surfaces Using a Deep-Learning Network, Computer-Aided Civil and Infrastructure Engineering, vol. 00, pp. 1-15, 2017.
[8] Zhang L, Yang F, Zhang Y D, et al. Road crack detection using deep convolutional neural network[C]//2016 IEEE international conference on image processing (ICIP). IEEE, 2016: 3708-3712.
[9] Tedeschi A, Benedetto F. A real-time automatic pavement crack and pothole recognition system for mobile Android-based devices[J]. Advanced Engineering Informatics, 2017, 32: 11-25.
[10] Chablani M. YOLO—You only look once, real time object detection explained[J]. Towards Data Science [online]. [cit. 2019-04-25]. Available from: https://towardsdatascience.com/yolo-you-only-look-once-real-time-object-detection-explained-492dc9230006.
[11] M. Hollemans, Real-time object detection with YOLO. [Online]. Available: http://machinethink.net/blog/object-detection-with-yolo/. [Accessed: 13-Mar-2019].
[12] Liu W, Anguelov D, Erhan D, et al. Ssd: Single shot multibox detector[C]//European conference on computer vision. Springer, Cham, 2016: 21-37.