
IEEE International Conference on Communication System, Computing and IT Applications (CSCITA 2023)

Efficient Detection of Small and Complex Objects for Autonomous Driving Using Deep Learning

Ansh Sharma
Department of Electronics and Communication Engineering
Netaji Subhas University of Technology
Delhi, India
[email protected]

Rashmi Gupta
Department of Electronics and Communication Engineering
Netaji Subhas University of Technology
Delhi, India
[email protected]

DOI: 10.1109/CSCITA55725.2023.10104969

Abstract – YOLOv2 is one of the most prominent models used for object detection; it works on the concept of anchor boxes. However, this model is prone to problems like double anchor boxes, missed small objects, and high time complexity. In this paper, we aim to solve the problems of double anchor boxes and undetected small objects by tuning parameters like the intersection over union (IoU) and customizing the non-max suppression thresholds. Also, to reduce the time complexity of the model, we propose the use of depth-wise convolution (DW-Conv2D) instead of the fundamental convolution (Conv2D). Once we applied the proposed model to the PASCAL VOC07 and VOC12 datasets, we observed significant improvements, such as floating-point operations reduced by 9.5% and better accuracy than the existing state-of-the-art models.

Keywords – Convolutional Neural Network (CNN), Depth-wise Separable Convolution, You Only Look Once (YOLO), Intersection over Union (IoU), Non-Max Suppression, Small Object Detection.
I. INTRODUCTION

Convolutional Neural Networks (CNNs) are one of the most prominent methods for solving modern deep learning problems, and they can be used in virtually any field related to computer vision. Extensive research leads to new and more efficient models in this field each year; examples of these innovations are the fast and faster R-CNN models, introduced as updates to the standard R-CNN [1, 2, 3].

We have also observed significant improvements in the sliding-window approach to object detection over time [4, 5].

Object localization is a crucial step in object detection, and its advantages can easily be understood from examples [6, 7, 8], where the implementation of Weakly Supervised Learning (WSL) in YOLO to perform complex computer vision tasks with low-level annotations has been highlighted [6, 9].

However, the YOLOv2 model faces major challenges, such as predicting more than one box for a single object or leaving objects totally undetected. Although this problem cannot be fully eliminated, our customized model limits these failures and has higher precision than the previously adopted models.

These deep learning problems arise in the area of generic object detection, the stream where the labeled classes are stored in a list and the model localizes each object and assigns it a class from that list. We expect two things from a successful Deep Neural Network (DNN) model: first, high accuracy, and second, high efficiency.

High accuracy is the first problem we deal with in this paper, by changing the IoU and non-max suppression parameters.

We also improve the efficiency of the standard YOLO model by working on its time complexity. We do this by incorporating depth-wise convolution, as explained later in the paper. High efficiency means the model works within an acceptable memory expenditure in a given frame of time [10].

II. RELATED LITERATURE

For this paper, we have worked on a customized YOLO model [11, 12]. This is a 22-layer model with 50,983,561 parameters, of which 50,962,889 are trainable and the remaining 20,672 are not. We use depth-wise convolution (DW-Conv2D), Batch Normalization, Rectified Linear Unit activation (ReLU), and Max-Pooling layers in this model. The model is represented later in the paper.

One of the most popular and precise alternatives to our proposed method is semantic segmentation. DNN models like Region-based CNN (R-CNN) use the concept of segmentation along with classification. These models are also very accurate but at the same time have high complexity. One approach to segmentation is labeling every pixel of the input image to propose different regions. Segmentation is used in many applications, such as detecting brain tumors in MRI scans and analyzing chest X-rays. The major difference between detection and segmentation is the use of regions to classify the objects.

Prominent models like U-Net and C-Net use the concept of segmentation. U-Net also uses transpose convolution, which is essentially reverse convolution by a filter. In the first half of the U-Net model, we implement a basic DNN using basic convolution; in the second half, we implement the network using transpose convolution. The image first shrinks in size and is then returned to its original height and width with the help of transpose convolution. C-Net, on the other hand, omits the second half of U-Net in its model. Other alternative models include LeNet, with 60,000 parameters, AlexNet, with 60 million parameters, and the VGG-16 and VGG-19 networks. ResNets are also very prominent models used to lower the time complexity of many DNN models by providing skip connections that allow a CNN to skip some steps. We use depth-wise convolution instead of standard convolution in our model. One alternative to this proposed method might be Inception networks; however, they have a high computation cost, and eliminating this problem is one of the major purposes of this paper. Thus, we avoid using Inception networks in our model.
Fig. 1. Deep Learning Model Pipeline (set up parameters like classes (20 classes) and scores for yolo filter boxes → set up parameters like the IoU threshold for non-max suppression → implement the yolo and non-max suppression steps and complete filtering → testing)

In our model, we first set up the YOLO anchor box parameters, using 20 classes for the purpose of this paper. Next, we set up the non-max suppression parameters. After that, we implement these steps together and complete our filtering step. Finally, we load a model pretrained on the PASCAL VOC07 dataset and test it on an image. The deep learning pipeline for our model is represented in figure 1, and a rough sketch of its filtering stage is given below.
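To make the filtering stage concrete, the following is a minimal TensorFlow sketch of how the steps in figure 1 can be wired together; the threshold values, box format, and maximum box count are illustrative assumptions rather than the exact settings used in our experiments.

import tensorflow as tf

SCORE_THRESHOLD = 0.6  # assumed box-score cutoff for filtering
IOU_THRESHOLD = 0.5    # assumed overlap cutoff for non-max suppression

def filter_and_suppress(boxes, scores):
    """boxes: (N, 4) tensor of [y1, x1, y2, x2]; scores: (N,) tensor."""
    # Step 1: drop low-confidence boxes (score filtering).
    keep = scores >= SCORE_THRESHOLD
    boxes = tf.boolean_mask(boxes, keep)
    scores = tf.boolean_mask(scores, keep)
    # Step 2: non-max suppression removes overlapping duplicate boxes.
    selected = tf.image.non_max_suppression(
        boxes, scores, max_output_size=20, iou_threshold=IOU_THRESHOLD)
    return tf.gather(boxes, selected), tf.gather(scores, selected)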
III. PROPOSED MODEL

A. Depth-wise Separable Convolution (DW-Conv2D):

Convolution is the fundamental layer of a CNN. In this layer, we take a filter and convolve it with the input image. Every time the image passes through this layer of the network, it shrinks in size, and in this process the corner pixels of the image are often under-used by the model. We can avoid such problems by using padding and a customized stride.

Output Dimension = (I − F + 1) x (I − F + 1) (1)

In equation 1, (I x I) is the dimension of the input image and (F x F) is the dimension of the filter.
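Equation 1 can be checked numerically; the sketch below also includes the common generalization with padding and stride, which reduces to equation 1 when the stride is 1 and no padding is used.

def conv_output_dim(i, f, padding=0, stride=1):
    # With padding=0 and stride=1 this is exactly (I - F + 1) from equation 1.
    return (i - f + 2 * padding) // stride + 1

assert conv_output_dim(608, 3) == 606             # the image shrinks, per equation 1
assert conv_output_dim(608, 3, padding=1) == 608  # 'same' padding preserves the size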
B. Batch Normalization:

Batch normalization is used to simplify the network and make it more stable. It computes the mini-batch mean (μB) and standard deviation (σB) shown in equation 2 for an activation tensor x ∈ ℝ^(N x H x W x C) [13], over the set B = {x1...m : m ∈ [1, N] x [1, H] x [1, W]}:

μB = (1/m) Σi xi,   σB = sqrt((1/m) Σi (xi − μB)² + ε) (2)
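As a sketch, the mini-batch statistics of equation 2 can be computed per channel of an (N, H, W, C) activation tensor as follows; the epsilon value and toy tensor shape are assumptions for illustration.

import numpy as np

def batch_norm_stats(x, eps=1e-5):
    """Per-channel mini-batch mean and standard deviation, as in equation 2."""
    mu = x.mean(axis=(0, 1, 2))                   # mini-batch mean over N, H, W
    sigma = np.sqrt(x.var(axis=(0, 1, 2)) + eps)  # mini-batch standard deviation
    return mu, sigma

x = np.random.randn(4, 8, 8, 32)  # assumed toy activations
mu, sigma = batch_norm_stats(x)   # one (mu, sigma) pair per channel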
C. Rectified Linear Unit Activation (ReLU):

ReLU is a layer used to simplify the input. In this layer, pixels with negative values are reset to 0, and positive pixels remain unchanged [14]. This layer is represented in equation 3.

f(x) = max(0, x) (3)

D. Pooling Layer:

In this layer, we divide the input into equal sections and, for each section, keep the pixel with the desired value. For max pooling, we use the pixel with the maximum value; for average pooling, we take the average of all the pixels in that section and use that value.

For classification problems, we must use fully connected layers at the end of the network. In such problems, we can also use regression and pretrained models.

All these layers are represented in figure 2, with their inputs and outputs highlighted.

Fig. 2. Proposed Model (a pipeline of DW-Conv2D, Batch Normalization, ReLU, and Max Pool blocks repeated 21 times: input 608 x 608 x 3; intermediate shapes 608 x 608 x 32, 304 x 304 x 32, 304 x 304 x 64, down to 19 x 19 x 1024; output 19 x 19 x 425)
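A minimal Keras sketch of one repeating unit of figure 2 is given below; the kernel size and filter count are assumptions for illustration, and the 1 x 1 convolution is the assumed pointwise half of the depth-wise separable layer.

import tensorflow as tf
from tensorflow.keras import layers

def dw_block(x, filters):
    # One assumed unit of figure 2: DW-Conv2D -> Batch Normalization -> ReLU -> Max Pool.
    x = layers.DepthwiseConv2D(3, padding='same')(x)  # depth-wise 3 x 3 convolution
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1)(x)                  # pointwise 1 x 1 convolution
    return layers.MaxPooling2D(2)(x)

inputs = tf.keras.Input(shape=(608, 608, 3))
outputs = dw_block(inputs, 32)  # 608 x 608 x 3 -> 304 x 304 x 32, as in figure 2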
• Non-max suppression and intersection over union:

Non-max suppression and intersection over union are methods used to avoid multiple anchor boxes and overfitting in computer vision. Non-max suppression first computes the maximum box score among the detected boxes, keeps that box, and then suppresses the remaining anchor boxes that overlap it heavily.

Intersection over union, on the other hand, supplies the overlap measure used for removing multiple anchor boxes. This formula is represented in equation 4.

IoU = Size of Intersection / Size of Union (4)

IoU in equation 4 has a range of [0, 1], inclusive.
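Equation 4 translates directly into code; the sketch below assumes axis-aligned boxes in [x1, y1, x2, y2] corner format.

def iou(a, b):
    """Equation 4: size of intersection over size of union for two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0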

IV. RESULTS

To increase the efficiency of the model, we use depth-wise separable convolution instead of standard convolution. Depth-wise separable convolution has two major benefits: first, it works with fewer parameters than standard convolution, which helps the model avoid overfitting; second, because of the fewer parameters, it uses less memory and is faster than standard convolution. Thus, we manage to reduce the high time complexity, as fewer computations are required.

Depth-wise separable convolution splits the standard convolution into two parts [15]: first, a depth-wise convolution and, second, a pointwise convolution. The depth-wise convolution uses a normal filter of some dimension (f x f), whereas the pointwise convolution uses a unit filter of dimension (1 x 1).

To increase the accuracy of the model, we change the intersection over union and non-max suppression parameters and the scoring formula. This customized model produces better detection results than the standard models. Equation 5 represents the standard formula.

Anchor Box Score = Pc x P (5)

Here, Pc is the probability that an object of any class is detected and P is the class probability of the detected object. We have modified equation 5 into equation 6, where the sum runs over the classes:

Mean Anchor Box Score = Σi (Pc x Pi) / Number of Classes (6)
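A small numeric sketch of equations 5 and 6 follows; the probability values are made up for illustration.

import numpy as np

pc = 0.9                # Pc: probability that some object is present (assumed)
p = np.random.rand(20)  # per-class probabilities for the 20 classes (assumed)
p /= p.sum()

box_scores = pc * p                     # equation 5, one score per class
mean_score = box_scores.sum() / len(p)  # equation 6: mean anchor box score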

Along with this, we changed the non-max suppression and IoU threshold values in the standard model to fit the purpose of this paper [16, 17].

To compare floating-point operations between standard convolution and depth-wise convolution, we used the TensorFlow Profiler to calculate FLOPs throughout this paper. We can observe that FLOPs are higher for the standard convolution layer because more multiplications are required [18, 19, 20]. Therefore, we have implemented our model with the DW-Conv2D layer.

The fraction of multiplications required by depth-wise convolution, relative to standard convolution, is expressed in equation 7.

Multiplication Fraction = 1 / (Number of Pointwise Filters) + 1 / (Filter Dimension)² (7)
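Worked through for a typical layer, equation 7 gives the familiar saving of depth-wise separable convolution; the filter counts below are illustrative assumptions.

f = 3   # assumed depth-wise filter dimension (f x f)
n = 64  # assumed number of pointwise (1 x 1) filters

# Equation 7: fraction of multiplications DW-Conv2D needs relative to
# a standard convolution producing the same output shape.
ratio = 1 / n + 1 / f ** 2
print(f"DW-Conv2D uses about {ratio:.1%} of the multiplications")  # ~12.7%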

Our model has detected 9 boxes in the test image with classes
Car, Motorbike and Person. For all these boxes, we have computed

their output scores and output classes and highlighted these results in the image itself.

In figure 3, we display the results of our model on a test image. Figure 3A shows the image we input into our model for testing, and figure 3B shows the output of our model.

We can clearly see in figure 3B that our model has detected 7 objects of class "Car", 1 object of class "Motorbike", and 1 object of class "Person".

It is evident in figure 3 that our model has not detected any object more than once and has managed to detect very small objects in the test image.

Fig. 3. Test Results on our Image

The box class array works on the 20 class probabilities of the anchor boxes detected in the image. The box class scores are the maximum values of the output box scores, as seen in equation 8. The results depicted in figure 3 are box scores, not box class scores, and they also show the class name for each anchor box.

These box class scores are greater than or equal to the box score threshold values, as shown in equation 9. Setting this parameter for the box class scores is also called the masking process [21].

Box Class Scores = max(Box Scores) (8)

Box Class Scores ≥ Box Score Threshold (9)
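The masking step of equations 8 and 9 can be sketched as follows; the array shapes and threshold are assumptions for illustration.

import numpy as np

box_scores = np.random.rand(9, 20)  # assumed: 9 boxes x 20 class probabilities
threshold = 0.6                     # assumed box score threshold

box_classes = box_scores.argmax(axis=-1)    # class index for each box
box_class_scores = box_scores.max(axis=-1)  # equation 8: max over classes
mask = box_class_scores >= threshold        # equation 9: the masking process
kept = np.flatnonzero(mask)                 # indices of boxes that survive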

Table I highlights some commonly used datasets for training object detection models; we work with the PASCAL VOC datasets from this table. The table shows the number of classes each dataset covers along with the total number of images and annotated objects it contains. It lists the PASCAL VOC 2007 and 2012 datasets, both of which are highly preferred for training computer vision models. We observe that the performance of our model drops in some classes on the VOC12 dataset compared to VOC07; however, in many classes, like "car", "motorbike", and "person", our model shows higher accuracy on VOC12 than on VOC07.

TABLE I. SUMMARY OF THE DATASETS USED TO TRAIN DEEP NEURAL NETWORKS

Dataset Name        PASCAL VOC (2012)   PASCAL VOC (2007)
Total Images        11,540              9,963
Categories          20                  20
Objects per image   2.4                 2.5
Annotated objects   27,450              24,640
Dataset Size        2.44 GiB            837.73 MiB

Table II compares our improved model with some existing models that use localization and classification methods. These models were tested on the PASCAL VOC 2007 dataset to observe their accuracy (mean average precision, mAP). We compare the single shot detectors SSD300 and SSD512, YOLOv2, and our improved YOLO model. We can observe that our model has higher accuracy than the YOLOv2 and SSD models. In the case of the class "bus", SSD512 is the closest competitor, and on the VOC12 dataset (Table III) it retains the highest accuracy; even there, however, we have improved on the standard YOLO model.

SSD512 has performance comparable to the standard YOLO model; in some cases YOLO exceeds SSD512, but our improved YOLO model shows the highest accuracy in the majority of cases.

In figure 4, we can observe the difference between the YOLOv2 model and our improved YOLO model. Figure 4A is the test image given to both the standard and the customized model. Figure 4B represents the output of the YOLOv2 model on figure 4A; here, some objects are assigned double anchor boxes and some objects are left undetected, both of which are corrected in figure 4C. Figure 4C is the output of our improved model, which has successfully eliminated the double anchor boxes produced by the standard YOLOv2 model.

Tables II and III highlight the performance of our improved YOLO model on the PASCAL VOC07 and PASCAL VOC12 datasets. We chose the initial and the latest versions of this dataset to observe the performance of our model.

TABLE II. COMPARISON BETWEEN THE PROPOSED MODEL AND OTHER MODELS ON THE PASCAL VOC07 DATASET

Accuracy (mAP, %) of Different Models on the PASCAL VOC 2007 Dataset

Model Used                   mAP (Mean)   aero    bike    bus     car     mbike   person   train
SSD300                       80.91        76.20   81.14   83.30   82.92   82.70   75.85    84.32
SSD512                       83.61        82.55   84.85   86.01   86.31   83.99   77.94    83.67
YOLOv2                       86.04        87.97   87.46   84.90   85.92   85.70   82.85    87.52
Improved YOLOv2 (Proposed)   86.65        88.43   88.01   86.38   87.20   85.87   82.94    87.77

TABLE III. COMPARISON BETWEEN THE PROPOSED MODEL AND OTHER MODELS ON THE PASCAL VOC12 DATASET

Accuracy (mAP, %) of Different Models on the PASCAL VOC 2012 Dataset

Model Used                   mAP (Mean)   aero    bike    bus     car     mbike   person   train
SSD300                       83.06        77.21   83.10   84.33   84.99   85.04   79.86    86.90
SSD512                       85.84        85.35   85.95   88.21   87.41   85.10   81.43    87.43
YOLOv2                       86.99        88.86   89.21   86.23   85.95   86.88   83.55    88.29
Improved YOLOv2 (Proposed)   88.14        89.22   89.68   88.11   88.21   87.25   84.89    89.66

Fig. 4. Comparison Between the Old Model and the Improved Model. (A) Test image; (B) output of YOLOv2 with double anchor boxes and undetected objects; (C) output of the improved YOLO model.

Table IV highlights the number of parameters and the performance in terms of FLOPs of all these alternative methods on images of size 224 x 224 and 448 x 448 pixels.

TABLE IV. COMPARISON BETWEEN THE PROPOSED MODEL AND ALTERNATIVE MODELS ON COMPUTATION BASED ON FLOPS

Architecture     Parameters    FLOPs (224 x 224)   FLOPs (448 x 448)
VGG19            143,667,240   20G                 79.5G
VGG16            138,357,544   15.9G               62.3G
ResNet-152       60,419,944    11.5G               45.1G
ResNet-101       44,707,176    8G                  30G
ResNet-50        25,636,712    4.2G                15.4G
Inceptionv3      23,851,784    6G                  13G
Xception         22,910,480    8.3G                18G
DenseNet201      20,242,984    4.5G                17G
Proposed Model   50,983,561    3.8G                16.2G

V. CONCLUSION

In this paper, we have significantly enhanced the YOLOv2 model for vehicle detection by changing its IoU and non-max suppression parameters. We simultaneously replaced the Conv2D convolutional layer with the depth-wise DW-Conv2D convolution in our model to reduce its time complexity. We observed that our model successfully eliminates multiple anchor boxes for a single object while also reducing the number of undetected objects, with lower FLOPs recorded. Our model, however, is limited to a specific set of object classes for detection (20 classes). Hence, in the future our aim would be to add more classes to our detection model (more than 20) by training on a more versatile dataset like ImageNet, MS COCO, or the Places dataset, without disturbing the improvements made in the model.

VI. REFERENCES

[1] B. Liu, W. Zhao and Q. Sun, "Study of object detection based on Faster R-CNN," 2017 Chinese Automation Congress (CAC), 2017, pp. 6233-6236.
[2] H. Tahir, M. Shahbaz Khan and M. Owais Tariq, "Performance Analysis and Comparison of Faster R-CNN, Mask R-CNN and ResNet50 for the Detection and Counting of Vehicles," 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), 2021, pp. 587-594.
[3] Y. Shen, R. Ji, C. Wang, X. Li and X. Li, "Weakly Supervised Object Detection via Object-Specific Pixel Gradient," in IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 5960-5970, Dec. 2018.
[4] S. Zhang and M. Xie, "Beyond sliding windows: Object detection based on hierarchical segmentation model," 2013 International Conference on Communications, Circuits and Systems (ICCCAS), 2013, pp. 263-266.
[5] J. Lee, J. Bang and S.-I. Yang, "Object detection with sliding window in images including multiple similar objects," 2017 International Conference on Information and Communication Technology Convergence (ICTC), 2017, pp. 803-806.
[6] H. Ibrahem, A. D. A. Salem and H.-S. Kang, "Real-Time Weakly Supervised Object Detection Using Center-of-Features Localization," in IEEE Access, vol. 9, pp. 38742-38756, 2021.
[7] J. U. Kim and Y. Man Ro, "Attentive Layer Separation for Object Classification and Object Localization in Object Detection," 2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 3995-3999.

[8] E. Etemad and Q. Gao, "Object localization by optimizing convolutional
neural network detection score using generic edge features," 2017 IEEE
International Conference on Image Processing (ICIP), 2017, pp. 675-679.
[9] Y. Zhang, "Multi-scale Object Detection Model with Anchor Free Approach
and Center of Gravity Prediction," 2020 IEEE 5th Information Technology
and Mechatronics Engineering Conference (ITOEC), 2020, pp. 38-45.
[10] N. Zarei, P. Moallem and M. Shams, "Fast-Yolo-Rec: Incorporating Yolo-
Base Detection and Recurrent-Base Prediction Networks for Fast Vehicle
Detection in Consecutive Images," in IEEE Access, vol. 10, pp. 120592-
120605, 2022.
[11] R. Widyastuti and C. -K. Yang, "Cat’s Nose Recognition Using You Only
Look Once (Yolo) and Scale-Invariant Feature Transform (SIFT)," 2018 IEEE
7th Global Conference on Consumer Electronics (GCCE), 2018, pp. 55-56.
[12] V. E.K. and C. Ramachandran, "Real-time Gender Identification from Face
Images using you only look once (yolo)," 2020 4th International Conference
on Trends in Electronics and Informatics (ICOEI)(48184), 2020, pp. 1074-
1077.
[13] M. M. Kalayeh and M. Shah, "Training Faster by Separating Modes of
Variation in Batch-Normalized Models," in IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 42, no. 6, pp. 1483-1500, 1 June 2020.
[14] F. Z. Ouadiay, H. Bouftaih, E. H. Bouyakhf and M. M. Himmi, "Simultaneous
object detection and localization using convolutional neural networks," 2018
International Conference on Intelligent Systems and Computer Vision (ISCV),
2018, pp. 1-8.
[15] R. Li and J. Yang, "Improved YOLOv2 Object Detection Model," 2018 6th
International Conference on Multimedia Computing and Systems (ICMCS),
2018, pp. 1-6.
[16] Q. Xu, R. Lin, H. Yue, H. Huang, Y. Yang and Z. Yao, "Research on Small
Target Detection in Driving Scenarios Based on Improved Yolo Network,"
in IEEE Access, vol. 8, pp. 27574-27583, 2020.
[17] X. Liang, J. Zhang, L. Zhuo, Y. Li and Q. Tian, "Small Object Detection in
Unmanned Aerial Vehicle Images Using Feature Fusion and Scaling-Based
Single Shot Detector With Spatial Context Analysis," in IEEE Transactions
on Circuits and Systems for Video Technology, vol. 30, no. 6, pp. 1758-1770,
June 2020.
[18] Tanupreet and R. Gupta, "Deep Facial Recognition after Medical Alterations," Multimedia Tools and Applications, March 2022.
[19] P. Punyani, R. Gupta and A. Kumar, "Neural networks for facial age estimation: a survey on recent advances," Artificial Intelligence Review, vol. 53, pp. 3299-3347, June 2020.
[20] R. Kapoor and R. Gupta, "Morphological Mapping for Non-linear Dimensionality Reduction Technique," IET Computer Vision, vol. 9, no. 2, pp. 226-233, April 2015.
[21] R. Kapoor and R. Gupta, "Non-Linear Dimensionality Reduction using Fuzzy Lattices," IET Computer Vision, vol. 7, no. 3, pp. 201-208, June 2013.
