Second Progress Report UID - 17BCS2127

This document is a progress report submitted by Shanu Naval Singh for their Bachelor of Engineering degree in partial fulfillment of requirements. It discusses an object detection system using state-of-the-art deep learning techniques to achieve high accuracy in real-time. The system is trained on the PASCAL VOC dataset to detect multiple objects in images. Challenges include variable output dimensions and balancing accuracy vs performance. Popular approaches like RCNN, Fast RCNN, Faster RCNN, YOLO, and SSD are reviewed. The project uses SSD which applies convolutional layers to feature maps from a VGG network to predict bounding boxes and classes at multiple scales for high mAP.

Object Detection Smart Security System

Second Progress Report


Submitted in partial fulfillment of the requirements for the award of degree of

BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE & ENGINEERING

Submitted to:
Gagandeep Kaur

Submitted By:

Shanu Naval Singh (17bcs2127)

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Chandigarh University, Gharuan


June 2021
ACKNOWLEDGEMENT

I have put considerable effort into this project. However, it would not have been possible
without the kind support and help of many individuals and organizations. I would like to
extend my sincere thanks to all of them.

I am highly indebted to Ms. Gagandeep Kaur for guidance and constant supervision, for
providing the necessary information regarding the project, and for support in
completing the project.

I would like to express my gratitude to my parents and the members of Chandigarh University
for their kind co-operation and encouragement, which helped me complete this project.

My thanks and appreciation also go to my colleagues who helped develop the project and to
the people who have willingly helped me out with their abilities.

Thanking you.

Yours Sincerely,

Shanu Naval Singh (17bcs2127)

ABSTRACT
Efficient and accurate object detection has been an important topic in the advancement of
computer vision systems. With the advent of deep learning techniques, the accuracy of
object detection has increased drastically. The project aims to incorporate state-of-the-art
techniques for object detection with the goal of achieving high accuracy with real-time
performance. A major challenge in many object detection systems is the dependency
on other computer vision techniques to support the deep learning-based approach, which
leads to slow and non-optimal performance. In this project, we use a completely deep
learning-based approach to solve the problem of object detection in an end-to-end fashion.
The network is trained on the most challenging publicly available dataset (PASCAL VOC), on
which an object detection challenge is conducted annually. The resulting system is fast and
accurate, thus aiding applications which require object detection.

Table of Contents

Sr. No. Topic

1 Chapter 1: Introduction
2 Chapter 2: Related Work
3 Chapter 3: Approach
4 Chapter 4: Conclusion

1 Introduction
1.1 Problem Statement
Many problems in computer vision had plateaued in accuracy until about a decade ago.
However, with the rise of deep learning techniques, accuracy on these problems
improved drastically. One of the major problems was image classification, which
is defined as predicting the class of an image. A slightly more complicated problem is
image localization, where the image contains a single object and the system should
predict both the class and the location of the object in the image (a bounding box around
the object). The more complicated problem addressed in this project, object detection,
involves both classification and localization. In this case, the input to the system is an
image, and the output is a bounding box for every object in the image, along
with the class of the object in each box. An overview of these problems is depicted in
Fig. 1.

Figure 1: Computer Vision Tasks

1.2 Applications
A well-known application of object detection is face detection, which is used in almost all
mobile cameras. A more generalized (multi-class) application is
autonomous driving, where a variety of objects need to be detected. Object detection also
has an important role to play in surveillance systems. These systems can be integrated with
other tasks such as pose estimation, where the first stage in the pipeline is to detect the
object and the second stage is to estimate the pose in the detected region. It can also
be used for tracking objects, and thus in robotics and medical applications.
Thus this problem serves a multitude of applications.

Figure 2: Applications of object detection: (a) Surveillance, (b) Autonomous vehicles

1.3 Challenges
The major challenge in this problem is the variable dimension of the output,
which is caused by the variable number of objects that can be present in any given
input image. Most machine learning models require fixed input and output dimensions
in order to be trained. Another important obstacle to widespread adoption
of object detection systems is the requirement of real-time speed (>30 fps) while remaining
accurate in detection. The more complex the model is, the more time it requires for inference;
and the less complex the model is, the lower its accuracy. This trade-off between
accuracy and performance needs to be chosen as per the application. The problem
involves classification as well as regression, so the model must learn both tasks
simultaneously. This adds to the complexity of the problem.
2 Related Work
There has been a lot of work in object detection using traditional computer vision
techniques (sliding windows, deformable part models). However, they lack the accuracy of
deep learning based techniques. Among the deep learning based techniques, two broad
classes of methods are prevalent: two-stage detection (RCNN [1], Fast RCNN [2], Faster
RCNN [3]) and unified detection (YOLO [4], SSD [5]). The major concepts involved in
these techniques are explained below.

2.1 Bounding Box


The bounding box is a rectangle drawn on the image which tightly fits the object in the
image. A bounding box exists for every instance of every object in the image. For each
box, 4 numbers (center x, center y, width, height) are predicted. This can be trained
using a measure of overlap between the predicted and ground truth bounding boxes. The
measure is the Jaccard index, which computes intersection over union (IoU) between
the predicted and ground truth boxes as shown in Fig. 3; the Jaccard distance is one
minus this value.

Figure 3: Jaccard distance
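
As a minimal sketch of the overlap measure described above, the following computes intersection over union for two boxes given in the (center x, center y, width, height) format. The function names `to_corners` and `iou` are illustrative and are not taken from the project code.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(box_a, box_b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give an IoU of 1 and disjoint boxes give 0, so the value can be used directly as a similarity score between a prediction and its ground truth.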

2.2 Classification + Regression


The bounding box is predicted using regression and the class within the bounding box is
predicted using classification. The overview of the architecture is shown in Fig. 4.

Figure 4: Architecture overview


2.3 Two-stage Method
In this case, the proposals are extracted using some other computer vision technique
and then resized to a fixed input size for the classification network, which acts as a
feature extractor. Then an SVM is trained to classify between object and background (one SVM
for each class). A bounding box regressor is also trained that outputs a
correction (offsets) for the proposal boxes. The overall idea is shown in Fig. 5. These
methods are very accurate but are computationally intensive (low fps).

(a) Stage 1

(b) Stage 2

Figure 5: Two stage method

2.4 Unified Method


The difference here is that instead of producing proposals, a set of boxes in which to
look for objects is pre-defined. Using convolutional feature maps from later layers of the
network, another network is run over these feature maps to predict class scores and
bounding box offsets. The broad idea is depicted in Fig. 6. The steps are mentioned below:
1. Train a CNN with regression and classification objectives.

2. Gather activations from later layers to infer classification and location with
fully connected or convolutional layers.
3. During training, use the Jaccard overlap (IoU) to relate predictions to the ground truth.

4. During inference, use non-maximum suppression to filter multiple boxes around the
same object.

Figure 6: Unified Method
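
The non-maximum suppression step in the list above can be sketched as a greedy filter. This is an illustrative version, not the project's implementation: boxes are assumed to be in corner format (x_min, y_min, x_max, y_max), the names `iou_corners` and `non_max_suppression` are hypothetical, and the default threshold of 0.45 is an assumption rather than a value stated in this report.

```python
def iou_corners(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.45):
    """Return indices of the boxes that are kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better box.
        if all(iou_corners(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

In a multi-class detector this filter would typically be applied separately to the boxes of each class.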

The major techniques that follow this strategy are SSD (uses different activation
maps (multiple scales) for prediction of classes and bounding boxes) and YOLO (uses a
single activation map for prediction of classes and bounding boxes). Using multiple
scales helps to achieve a higher mAP (mean average precision) because objects of
different sizes in the image are detected better. Thus the technique used in this project
is SSD.
3 Approach
The network used in this project is based on the Single Shot MultiBox Detector (SSD) [5].
The architecture is shown in Fig. 7.

Figure 7: SSD Architecture

SSD normally starts with a VGG [6] model, which is converted to a fully convolutional
network. Then some extra convolutional layers are attached, which help to handle
bigger objects. The output of the VGG network is a 38x38 feature map (conv4_3). The
added layers produce 19x19, 10x10, 5x5, 3x3, and 1x1 feature maps. All these feature maps
are used for predicting bounding boxes at various scales (later layers are responsible for
larger objects).
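
To make the multi-scale idea concrete, the sketch below lays a grid of cell centers over the image for each of the feature map sizes listed above, in normalized [0, 1] coordinates. The helper name `cell_centers` is hypothetical; the only values taken from the text are the six feature map sizes.

```python
# Feature map sizes named in the text, from conv4_3 to the last added layer.
FEATURE_MAP_SIZES = [38, 19, 10, 5, 3, 1]


def cell_centers(fmap_size):
    """Centers (cx, cy) of every cell of an fmap_size x fmap_size grid."""
    step = 1.0 / fmap_size
    return [((col + 0.5) * step, (row + 0.5) * step)
            for row in range(fmap_size)
            for col in range(fmap_size)]


for size in FEATURE_MAP_SIZES:
    centers = cell_centers(size)
    # Coarser maps have fewer cells, each covering a larger part of the image.
    print(f"{size}x{size}: {len(centers)} cells, cell side {1.0 / size:.3f}")
```

The 38x38 map has 1444 small cells while the 1x1 map has a single cell spanning the whole image, which is why the later layers are suited to larger objects.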
Thus the overall idea of SSD is shown in Fig. 8. Some of the activations are passed to
the sub-network that acts as a classifier and a localizer.

Figure 8: SSD Overall Idea

Anchors (a collection of boxes overlaid on the image at different spatial locations, scales
and aspect ratios) act as reference points on ground truth images as shown in Fig. 9.
A model is trained to make two predictions for each anchor:
• A discrete class

• A continuous offset by which the anchor needs to be shifted to fit the ground-truth
bounding box
Figure 9: Anchors
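
The sketch below illustrates the two ideas above: laying anchors over one feature map, and expressing a ground-truth box as a continuous offset from an anchor. It is a sketch under stated assumptions: the aspect ratios and the center/log-size offset parameterization are choices made for this example and are not specified in this report, and the names `make_anchors`, `encode` and `decode` are hypothetical.

```python
import math


def make_anchors(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Anchors (cx, cy, w, h) in normalized [0, 1] image coordinates."""
    anchors = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Same area for every aspect ratio, different shape.
                anchors.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return anchors


def encode(gt, anchor):
    """Offset that moves `anchor` onto the ground-truth box `gt`."""
    gx, gy, gw, gh = gt
    ax, ay, aw, ah = anchor
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))


def decode(offsets, anchor):
    """Inverse of `encode`: apply predicted offsets to an anchor."""
    dx, dy, dw, dh = offsets
    ax, ay, aw, ah = anchor
    return (ax + dx * aw, ay + dy * ah, aw * math.exp(dw), ah * math.exp(dh))
```

At inference time the regression output for each anchor would be passed through `decode` to obtain the final box, while `encode` produces the regression target during training.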

During training, SSD matches ground truth annotations with anchors. Each element
of the feature map (cell) has a number of anchors associated with it. Any anchor whose
IoU (Jaccard overlap) with a ground truth box is greater than 0.5 is considered a match.
Consider the case shown in Fig. 10, where the cat has two anchors matched and the dog has
one anchor matched. Note that the two objects have been matched on different feature maps.

Figure 10: Matches
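
The matching rule described above can be sketched as follows. Only the IoU > 0.5 threshold rule from the text is implemented; any additional matching heuristics a full SSD implementation may use are left out, and the function names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    iw = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    ih = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes, threshold=0.5):
    """Map anchor index -> ground-truth index for anchors with IoU above threshold."""
    matches = {}
    for ai, anchor in enumerate(anchors):
        best_gt, best_iou = None, threshold
        for gi, gt in enumerate(gt_boxes):
            overlap = iou(anchor, gt)
            if overlap > best_iou:
                # If several objects qualify, keep the one with the highest overlap.
                best_gt, best_iou = gi, overlap
        if best_gt is not None:
            matches[ai] = best_gt
    return matches
```

Anchors that do not appear in the returned mapping are treated as background, and one ground-truth box may be matched by several anchors, as with the cat in Fig. 10.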

The loss function used is the multi-box classification and regression loss. The
classification loss is the softmax cross entropy, and the regression loss is the
smooth L1 loss.
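
A minimal sketch of how the two loss terms could be combined is given below. It assumes that class 0 denotes background, that the localization term is computed only for matched (non-background) anchors, and that the sum is normalized by the number of matched anchors with a weight `alpha` on the regression term. These choices, and the function names, are assumptions for illustration; refinements such as hard negative mining are omitted.

```python
import math


def smooth_l1(pred, target):
    """Smooth L1 loss summed over the coordinates of one box."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def softmax_cross_entropy(logits, label):
    """Cross entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def multibox_loss(cls_logits, labels, loc_preds, loc_targets, alpha=1.0):
    """Combined loss over all anchors; labels[i] == 0 means background."""
    num_pos = sum(1 for label in labels if label > 0)
    if num_pos == 0:
        return 0.0
    cls = sum(softmax_cross_entropy(z, l) for z, l in zip(cls_logits, labels))
    loc = sum(smooth_l1(p, t)
              for p, t, l in zip(loc_preds, loc_targets, labels) if l > 0)
    return (cls + alpha * loc) / num_pos
```

The smooth L1 term behaves quadratically for small errors and linearly for large ones, which makes the regression less sensitive to outlier boxes than a plain squared error.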
4 Conclusion
An accurate and efficient object detection system has been developed which achieves
metrics comparable with the existing state-of-the-art systems. This project uses recent
techniques in the field of computer vision and deep learning. A custom dataset was
created using labelImg and the evaluation was consistent. The system can be used in
real-time applications which require object detection for pre-processing in their pipeline.
An important future direction would be to train the system on video sequences for use in
tracking applications. The addition of a temporally consistent network would enable
smoother detection and would perform better than per-frame detection.
References
[1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies
for accurate object detection and semantic segmentation. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2014.

[2] Ross Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV),
2015.

[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards
real-time object detection with region proposal networks. In Advances in Neural
Information Processing Systems (NIPS), 2015.

[4] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look
once: Unified, real-time object detection. In The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2016.

[5] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed,
Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.

[6] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
