Efficient Detection of Small and Complex Objects for Autonomous Driving Using Deep Learning
Abstract – YOLOv2 is one of the most prominent models used for object detection; it works on the concept of anchor boxes. However, the model is prone to problems such as double anchor boxes, missed small objects, and high time complexity. In this paper, we aim to solve the problems of double anchor boxes and undetected small objects by tuning parameters such as the intersection over union (IoU) and the non-max suppression thresholds. Also, to reduce the time complexity of the model, we propose the use of depth-wise convolution (DW-Conv2D) in place of fundamental convolution (Conv2D). When we applied the proposed model to the PASCAL VOC07 and VOC12 datasets, we observed significant improvements, such as floating-point operations (FLOPs) reduced by 9.5% and better accuracy than the existing state-of-the-art models.

Keywords – Convolutional Neural Network (CNN), Depth-wise Separable Convolution, You Only Look Once (YOLO), Intersection over Union (IoU), Non-Max Suppression, Small Object Detection.

I. INTRODUCTION

Convolutional Neural Networks (CNNs) are among the most prominent methods for solving modern deep learning problems, and they can be used in any field related to computer vision. Extensive research by numerous data scientists leads to new and more efficient models in this field each year. Examples of these innovations include the discovery of Fast and Faster R-CNN as updates to the standard R-CNN models [1, 2, 3].

We have also observed significant improvements over time in sliding-window approaches to object detection [4, 5].

Object localization is a crucial step in object detection, and its advantages can easily be understood from examples [6, 7, 8] in which the implementation of Weakly Supervised Learning (WSL) in YOLO to perform complex computer vision tasks with low-level annotations has been highlighted [6, 9].

However, the YOLOv2 model faces major challenges, such as predicting more than one box for a single object or leaving objects totally undetected. Although this problem cannot be fully eliminated, our customized model limits these failures and achieves higher precision than the previously adopted models.

These deep learning problems arise in the area of generic object detection, the stream in which labeled classes are stored in a list and the model localizes each object and assigns it a class from that list. We expect two things from a successful Deep Neural Network (DNN) model: first, high accuracy, and second, high efficiency.

High accuracy is the first problem we deal with in this paper, by changing the IoU and non-max suppression parameters.

We also improve the efficiency of the standard YOLO model by working on its time complexity. We do this by incorporating depth-wise convolution, as explained later in the paper. High efficiency means that the model works within an acceptable memory expenditure in a given frame of time [10].

II. RELATED LITERATURE

For this paper, we have worked on a customized YOLO model [11, 12]. It is a 22-layer model with 50,983,561 parameters, of which 50,962,889 are trainable and the remaining 20,672 are not. We use depth-wise convolution (DW-CONV2D), Batch Normalization, Rectified Linear Unit activation (ReLU), and Max-Pooling layers in this model; the model is presented later in the paper.
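The exact layer configuration is given later in the paper; purely as an illustrative sketch, a block combining the four layer types named above could be composed in Keras roughly as follows (the kernel sizes and filter counts here are placeholders, not the paper's values):

```python
import tensorflow as tf
from tensorflow.keras import layers

def dw_block(x, pointwise_filters):
    """One illustrative DW-CONV2D + BatchNorm + ReLU + MaxPool block."""
    x = layers.DepthwiseConv2D(kernel_size=3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(pointwise_filters, kernel_size=1, use_bias=False)(x)  # pointwise step
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.MaxPooling2D(pool_size=2)(x)

inputs = tf.keras.Input(shape=(448, 448, 3))   # assumed input resolution
x = dw_block(inputs, 32)
x = dw_block(x, 64)
model = tf.keras.Model(inputs, x)
```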
One of the most popular and precise alternatives to our proposed method is semantic segmentation. DNN models such as Region-based CNN (R-CNN) use the concept of segmentation along with classification; these models are very accurate but at the same time have high complexity. One approach to segmentation labels every pixel of the input image in order to propose different regions. Segmentation is used in many applications, such as detecting brain tumors in MRI scans and analyzing chest X-ray scans. The major difference between detection and segmentation is the use of regions to classify the objects.

Prominent models such as U-Net and C-Net use the concept of segmentation. U-Net also uses transpose convolution, which is essentially reverse convolution by a filter. In the first half of the U-Net model, a basic DNN is implemented using standard convolution; in the second half, the network is implemented using transpose convolution. The image first shrinks in size and is then returned to its original dimensions (height and width) with the help of transpose convolution. C-Net, on the other hand, omits the second half of U-Net. Other alternative models include LeNet, with 60,000 parameters, AlexNet, with 60 million parameters, and the VGG-16 and VGG-19 networks. ResNets are also very prominent models used to lower the time complexity of many DNN models; they provide skip connections that allow a DNN to skip some steps in a CNN. We use depth-wise convolution instead of standard convolution in our model. One alternative to this proposed method would be Inception networks; however, they have a high computation cost, and eliminating this problem is one of the major purposes of this paper. Thus, we avoid Inception networks in our model.

Fig. 1. The deep learning pipeline of the proposed model: set up parameters such as classes (20 classes) and scores for YOLO box filtering; set up parameters such as the IoU threshold for non-max suppression; implement the YOLO and non-max suppression steps and complete the filtering; testing.
In our model, we first set up the YOLO anchor box parameters, using 20 classes for the purpose of this paper. Next, we set up the non-max suppression parameters. We then implement these steps together and complete our filtering step. Finally, we load a model pretrained on the PASCAL VOC07 dataset and test it on an image. The deep learning pipeline for our model is represented in figure 1.
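As a minimal sketch of the box-filtering step in figure 1 — thresholding class scores before non-max suppression — assuming the usual 19 x 19 grid with 5 anchors and 20 classes (our assumption, not stated in the paper):

```python
import numpy as np

def filter_boxes(box_confidence, box_class_probs, boxes, score_threshold=0.6):
    """Keep only boxes whose best class score exceeds the threshold.

    box_confidence:  (19, 19, 5, 1)  objectness per anchor box
    box_class_probs: (19, 19, 5, 20) class probabilities per anchor box
    boxes:           (19, 19, 5, 4)  box corner coordinates
    """
    box_scores = box_confidence * box_class_probs    # (19, 19, 5, 20)
    box_classes = np.argmax(box_scores, axis=-1)     # best class per box
    box_class_scores = np.max(box_scores, axis=-1)   # score of that class
    mask = box_class_scores >= score_threshold       # boolean filter
    return box_class_scores[mask], boxes[mask], box_classes[mask]
```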
III. PROPOSED MODEL

A. Depth-wise Separable Convolution (DW-CONV2D):

This is the fundamental layer of a CNN. In this layer, we take a filter and convolve it with the input image. Every time the image passes through this layer of the network, it shrinks in size, and in this process the corner pixels of the image are often ignored by the model. We can avoid such problems by using padding and a customized stride.

Output dimension = (I − F + 1) x (I − F + 1)   (1)

In equation 1, (I x I) is the dimension of the input image and (F x F) is the dimension of the filter.
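A quick worked check of equation 1, extended with the textbook padding/stride generalization alluded to in the paragraph above (output = (I − F + 2P)/S + 1; the generalized form is standard material, not the paper's own formula):

```python
def conv_output_size(i, f, p=0, s=1):
    """Spatial output size of a convolution with padding p and stride s."""
    return (i - f + 2 * p) // s + 1

print(conv_output_size(448, 3))        # 446: equation 1, no padding
print(conv_output_size(448, 3, p=1))   # 448: "same" padding preserves size
```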
B. Batch Normalization:

Batch normalization is used to simplify the network and make it more stable. It computes the mini-batch mean (μB) and standard deviation (σB) as shown in equation 2. Here, x ∈ ℝ^(N x H x W x C) [13].

Let B = {x_1, ..., x_m : m ∈ [1, N] x [1, H] x [1, W]},

μB = (1/m) Σ_{i=1}^{m} x_i ,   σB = sqrt((1/m) Σ_{i=1}^{m} (x_i − μB)^2 + ε)   (2)
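A minimal NumPy sketch of equation 2, normalizing each channel over the mini-batch set B (the learnable scale and shift parameters of full batch normalization are omitted for brevity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize x of shape (N, H, W, C) per channel, as in equation 2."""
    mu = x.mean(axis=(0, 1, 2), keepdims=True)                   # mini-batch mean
    sigma = np.sqrt(x.var(axis=(0, 1, 2), keepdims=True) + eps)  # mini-batch std
    return (x - mu) / sigma

x = np.random.randn(8, 14, 14, 32)
print(batch_norm(x).mean(), batch_norm(x).std())  # approximately 0 and 1
```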
C. Rectified Linear Unit Activation (ReLU):

ReLU is a layer used to simplify the input. In this layer, pixels with negative values are reset to 0, and positive pixels remain unchanged [14]. This layer is represented in equation 3.

f(x) = max(0, x)   (3)

• Non-max suppression and intersection over union:

Non-max suppression and intersection over union are methods used to avoid multiple anchor boxes and overfitting in computer vision. Non-max suppression first finds the maximum box score among the boxes detected for an object and then suppresses the remaining overlapping anchor boxes, whose scores fall below that maximum.

Intersection over union, on the other hand, follows a formula for removing multiple anchor boxes. This formula is represented in equation 4.

IoU = Size of Intersection / Size of Union   (4)

IoU in equation 4 lies in the range [0, 1], inclusive.
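A minimal sketch of equation 4 and greedy non-max suppression, assuming corner-format boxes (x1, y1, x2, y2); the 0.5 threshold below is a placeholder, not the paper's tuned value:

```python
import numpy as np

def iou(a, b):
    """Equation 4: intersection area over union area of two corner boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while len(order) > 0:
        best, order = order[0], order[1:]
        keep.append(best)
        order = np.array([i for i in order
                          if iou(boxes[best], boxes[i]) < iou_threshold],
                         dtype=int)
    return keep
```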
IV. RESULTS

To increase the efficiency of the model, we use depth-wise separable convolution instead of standard convolution. Depth-wise separable convolution has two major benefits: first, it works with fewer parameters than standard convolution, which helps the model avoid overfitting; second, owing to the smaller parameter count, it uses less memory and is faster than standard convolution. We thus manage to reduce the high time complexity, as fewer computations are required.

Depth-wise separable convolution splits the process of standard convolution into two parts [15]: first, depth-wise convolution, and second, pointwise convolution. Depth-wise convolution uses a normal filter of any dimension (f x f), whereas pointwise convolution uses a unit filter of dimension (1 x 1).
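The parameter saving can be made concrete with the standard counting formulas — a sketch under assumed example channel sizes, not the paper's exact layers: a standard f x f convolution from C_in to C_out channels needs f·f·C_in·C_out weights, while the depth-wise plus pointwise split needs only f·f·C_in + C_in·C_out:

```python
def standard_conv_params(f, c_in, c_out):
    return f * f * c_in * c_out

def depthwise_separable_params(f, c_in, c_out):
    return f * f * c_in + c_in * c_out   # depth-wise + pointwise (1 x 1)

f, c_in, c_out = 3, 128, 256             # example sizes, not from the paper
print(standard_conv_params(f, c_in, c_out))         # 294,912
print(depthwise_separable_params(f, c_in, c_out))   # 33,920 (~8.7x fewer)
```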
To increase the accuracy of the model, we change the intersection over union and non-max suppression parameters and their formula. This customized model produces better detection results than the standard models.

Our model detected 9 boxes in the test image, with classes Car, Motorbike, and Person. For all these boxes, we computed their output scores and output classes and highlighted these results in the image itself.
In figure 3, we display the results of our model on a test image. Figure 3A shows the image input to our model for testing, and Figure 3B shows the output of our model.

We can clearly see in Figure 3B that our model detected 7 objects of class "Car", 1 object of class "Motorbike", and 1 object of class "Person".

It is evident in figure 3 that our model has not detected any object more than once and has managed to detect very small objects in the test image.

Table I highlights some commonly used datasets for training object detection models; we have worked with the PASCAL VOC datasets from this table. The table shows the number of classes these datasets cover, along with the number of images present in their training, validation, and test sets, for versions of the PASCAL VOC and MS COCO datasets, both of which are highly preferred for training computer vision models. Here, we observe that the performance of our model drops in some classes on the VOC12 dataset compared to the VOC07 dataset. However, in many classes, such as "car", "motorbike", and "person", our model is more accurate on VOC12 than on VOC07.

TABLE I. SUMMARY OF THE DATASETS USED TO TRAIN DEEP NEURAL NETWORKS

                       PASCAL VOC12   PASCAL VOC07
  Total images             11,540          9,963
  Categories                   20             20
  Objects per image           2.4            2.5
  Annotated objects        27,450         24,640
TABLE II. COMPARISON BETWEEN THE PROPOSED MODEL AND OTHER MODELS ON PASCAL VOC07 DATASET

TABLE III. COMPARISON BETWEEN THE PROPOSED MODEL AND OTHER MODELS ON PASCAL VOC12 DATASET

Fig. 3. A. Test image; B. Output of YOLOv2 with double anchor boxes and undetected objects; C. Output of the improved YOLO model.
Table IV highlights the number of parameters and the performance, in terms of FLOPs, of these alternative methods on images of size 224 x 224 and 448 x 448 pixels.
TABLE IV. COMPARISON BETWEEN THE PROPOSED MODEL AND ALTERNATIVE MODELS ON COMPUTATION BASED ON FLOPS

  Architecture      Parameters     FLOPs (224 x 224)   FLOPs (448 x 448)
  VGG19            143,667,240          20G                 79.5G
  VGG16            138,357,544          15.9G               62.3G
  ResNet-152        60,419,944          11.5G               45.1G
  ResNet-101        44,707,176           8G                  30G
  ResNet-50         25,636,712           4.2G                15.4G
  Inceptionv3       23,851,784           6G                  13G
  Xception          22,910,480           8.3G                18G
  DenseNet201       20,242,984           4.5G                17G
  Proposed Model    50,983,561           3.8G                16.2G
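As a rough illustration of how FLOPs figures like those in Table IV scale with input size (using the common 2·H·W·f²·C_in·C_out multiply-add count on an assumed example layer, not the paper's exact measurement methodology):

```python
def conv_flops(h, w, f, c_in, c_out):
    """Approximate FLOPs of one standard convolution (multiply-adds x 2)."""
    return 2 * h * w * f * f * c_in * c_out

def depthwise_separable_flops(h, w, f, c_in, c_out):
    """Depth-wise pass plus 1 x 1 pointwise pass."""
    return 2 * h * w * (f * f * c_in + c_in * c_out)

# One example layer at both table resolutions (channel sizes are placeholders).
for side in (224, 448):
    std = conv_flops(side, side, 3, 128, 256)
    dws = depthwise_separable_flops(side, side, 3, 128, 256)
    print(side, f"{std / 1e9:.2f}G vs {dws / 1e9:.2f}G")
```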
V. CONCLUSION

In this paper, we have significantly enhanced the YOLOv2 model for vehicle detection by changing its IoU and non-max suppression parameters. We also replaced the standard CONV2D convolutional layer with depth-wise DW-CONV2D convolution in our model to reduce its time complexity. We observed that our model succeeds in eliminating multiple anchor boxes for a single object while, at the same time, reducing the number of undetected objects, with lower FLOPs recorded. Our model, however, might be limited to specific object classes for detection (20 classes). Hence, in the future our aim would be to add more classes to our detection model (more than 20) by training on a more versatile dataset such as ImageNet, MS COCO, or the Places dataset, without disturbing the improvements made in the model.

VI. REFERENCES
[1] B. Liu, W. Zhao and Q. Sun, "Study of object detection based on Faster R-CNN," 2017 Chinese Automation Congress (CAC), 2017, pp. 6233-6236.
[2] H. Tahir, M. Shahbaz Khan and M. Owais Tariq, "Performance Analysis and Comparison of Faster R-CNN, Mask R-CNN and ResNet50 for the Detection and Counting of Vehicles," 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), 2021, pp. 587-594.
[3] Y. Shen, R. Ji, C. Wang, X. Li and X. Li, "Weakly Supervised Object Detection via Object-Specific Pixel Gradient," in IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 5960-5970, Dec. 2018.
[4] S. Zhang and M. Xie, "Beyond sliding windows: Object detection based on hierarchical segmentation model," 2013 International Conference on Communications, Circuits and Systems (ICCCAS), 2013, pp. 263-266.
[5] J. Lee, J. Bang and S.-I. Yang, "Object detection with sliding window in images including multiple similar objects," 2017 International Conference on Information and Communication Technology Convergence (ICTC), 2017, pp. 803-806.
[6] H. Ibrahem, A. D. A. Salem and H.-S. Kang, "Real-Time Weakly Supervised Object Detection Using Center-of-Features Localization," in IEEE Access, vol. 9, pp. 38742-38756, 2021.
[7] J. U. Kim and Y. Man Ro, "Attentive Layer Separation for Object Classification and Object Localization in Object Detection," 2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 3995-3999.
[8] E. Etemad and Q. Gao, "Object localization by optimizing convolutional
neural network detection score using generic edge features," 2017 IEEE
International Conference on Image Processing (ICIP), 2017, pp. 675-679.
[9] Y. Zhang, "Multi-scale Object Detection Model with Anchor Free Approach
and Center of Gravity Prediction," 2020 IEEE 5th Information Technology
and Mechatronics Engineering Conference (ITOEC), 2020, pp. 38-45.
[10] N. Zarei, P. Moallem and M. Shams, "Fast-Yolo-Rec: Incorporating Yolo-
Base Detection and Recurrent-Base Prediction Networks for Fast Vehicle
Detection in Consecutive Images," in IEEE Access, vol. 10, pp. 120592-
120605, 2022.
[11] R. Widyastuti and C. -K. Yang, "Cat’s Nose Recognition Using You Only
Look Once (Yolo) and Scale-Invariant Feature Transform (SIFT)," 2018 IEEE
7th Global Conference on Consumer Electronics (GCCE), 2018, pp. 55-56.
[12] V. E.K. and C. Ramachandran, "Real-time Gender Identification from Face
Images using you only look once (yolo)," 2020 4th International Conference
on Trends in Electronics and Informatics (ICOEI)(48184), 2020, pp. 1074-
1077.
[13] M. M. Kalayeh and M. Shah, "Training Faster by Separating Modes of
Variation in Batch-Normalized Models," in IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 42, no. 6, pp. 1483-1500, 1 June 2020.
[14] F. Z. Ouadiay, H. Bouftaih, E. H. Bouyakhf and M. M. Himmi, "Simultaneous
object detection and localization using convolutional neural networks," 2018
International Conference on Intelligent Systems and Computer Vision (ISCV),
2018, pp. 1-8.
[15] R. Li and J. Yang, "Improved YOLOv2 Object Detection Model," 2018 6th
International Conference on Multimedia Computing and Systems (ICMCS),
2018, pp. 1-6.
[16] Q. Xu, R. Lin, H. Yue, H. Huang, Y. Yang and Z. Yao, "Research on Small
Target Detection in Driving Scenarios Based on Improved Yolo Network,"
in IEEE Access, vol. 8, pp. 27574-27583, 2020.
[17] X. Liang, J. Zhang, L. Zhuo, Y. Li and Q. Tian, "Small Object Detection in
Unmanned Aerial Vehicle Images Using Feature Fusion and Scaling-Based
Single Shot Detector With Spatial Context Analysis," in IEEE Transactions
on Circuits and Systems for Video Technology, vol. 30, no. 6, pp. 1758-1770,
June 2020.
[18] Tanupreet and R. Gupta, "Deep Facial Recognition after Medical Alterations," Multimedia Tools and Applications, March 2022.
[19] P. Punyani, R. Gupta and A. Kumar, "Neural networks for facial age estimation: a survey on recent advances," Artificial Intelligence Review, Springer, vol. 53, pp. 3299-3347, June 2020.
[20] R. Kapoor and R. Gupta, "Morphological Mapping for Non-linear Dimensionality Reduction Technique," IET Computer Vision, vol. 9, no. 2, pp. 226-233, April 2015.
[21] R. Kapoor and R. Gupta, "Non-Linear Dimensionality Reduction using Fuzzy Lattices," IET Computer Vision, vol. 7, no. 3, pp. 201-208, June 2013.