Second Progress Report UID - 17BCS2127
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE & ENGINEERING
Submitted to:
Gagandeep Kaur
Submitted By:
ACKNOWLEDGEMENT
I have taken efforts in this project. However, it would not have been possible without the kind support and help of many individuals and organizations. I would like to extend my sincere thanks to all of them.
I am highly indebted to Gagandeep Kaur Ma'am for her guidance and constant supervision, as well as for providing the necessary information regarding the project and for her support in completing it.
I would like to express my gratitude towards my parents and the members of Chandigarh University for their kind cooperation and encouragement, which helped me in the completion of this project.
My thanks and appreciation also go to my colleagues who helped in developing the project and to the people who have willingly helped me out with their abilities.
Thanking you.
Yours Sincerely,
ABSTRACT
Efficient and accurate object detection has been an important topic in the advancement of
computer vision systems. With the advent of deep learning techniques, the accuracy for
object detection has increased drastically. The project aims to incorporate state-of-the-art techniques for object detection, with the goal of achieving high accuracy and real-time performance. A major challenge in many object detection systems is the dependency
on other computer vision techniques for helping the deep learning-based approach, which
leads to slow and non-optimal performance. In this project, we use a completely deep
learning-based approach to solve the problem of object detection in an end-to-end fashion.
The network is trained on the most challenging publicly available dataset (PASCAL VOC), on which an object detection challenge is conducted annually. The resulting system is fast and accurate, and thus useful for applications that require object detection.
Table of Contents
Chapter 1: Introduction
Chapter 2: Related Work
Chapter 3: Approach
Chapter 4: Conclusion
References
1 Introduction
1.1 Problem Statement
Many problems in computer vision had saturated in accuracy until about a decade ago. However, with the rise of deep learning techniques, the accuracy on these problems improved drastically. One of the major problems was image classification, which is defined as predicting the class of the image. A slightly more complicated problem is image localization, where the image contains a single object and the system should predict the class of the object along with its location in the image (a bounding box around the object). The more complicated problem addressed in this project, object detection, involves both classification and localization. In this case, the input to the system is an image, and the output is a bounding box for every object in the image, along with the class of the object in each box. An overview of all these problems is depicted in Fig. 1.
1.2 Applications
A well known application of object detection is face detection, that is used in almost all
the mobile cameras. A more generalized (multi-class) application can be used in
autonomous driving where a variety of objects need to be detected. Also it has a
important role to play in surveillance systems. These systems can be integrated with
other tasks such as pose estimation where the first stage in the pipeline is to detect the
object, and then the second stage will be to estimate pose in the detected region. It can
be used for tracking objects and thus can be used in robotics and medical applications.
Thus this problem serves a multitude of applications.
Figure 2: Applications of object detection: (a) surveillance, (b) autonomous vehicles.
1.3 Challenges
The major challenge in this problem is the variable dimension of the output, which is caused by the variable number of objects that can be present in any given input image. Any general machine learning task requires a fixed dimension of input and output for the model to be trained. Another important obstacle to the widespread adoption of object detection systems is the requirement of real-time performance (>30 fps) while remaining accurate in detection. The more complex the model, the more time it requires for inference; the less complex the model, the lower the accuracy. This trade-off between accuracy and performance needs to be chosen as per the application. The problem also involves classification as well as regression, and the model has to learn both simultaneously. This adds to the complexity of the problem.
2 Related Work
There has been a lot of work in object detection using traditional computer vision techniques (sliding windows, deformable part models). However, they lack the accuracy of deep learning based techniques. Among the deep learning based techniques, two broad classes of methods are prevalent: two-stage detection (RCNN [1], Fast RCNN [2], Faster RCNN [3]) and unified detection (YOLO [4], SSD [5]). The major concepts involved in these techniques are explained below.
Figure: (a) Stage 1, (b) Stage 2.
2. Gather activations from later layers to infer classification and location with fully connected or convolutional layers.
3. During training, use the Jaccard index (IoU) to match predictions with the ground truth.
4. During inference, use non-maximum suppression to filter multiple boxes predicted around the same object (a sketch of both of these steps follows this list).
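As an illustration of steps 3 and 4 above, the sketch below computes the Jaccard index (IoU) of two boxes and applies a simple non-maximum suppression pass. The [x1, y1, x2, y2] box format and the 0.45 suppression threshold are assumptions made for this example, not details taken from the project code.

import numpy as np

def iou(box_a, box_b):
    """Jaccard index (IoU) of two boxes given as [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression: keep the highest-scoring box, drop overlapping ones."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep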
The major techniques that follow this strategy are SSD (which uses different activation maps at multiple scales to predict classes and bounding boxes) and YOLO (which uses a single activation map to predict classes and bounding boxes). Using multiple scales helps achieve a higher mAP (mean average precision) by better detecting objects of different sizes in the image. Thus, the technique used in this project is SSD.
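Since mAP is the evaluation metric referred to here, the short sketch below computes the 11-point interpolated average precision of the PASCAL VOC protocol for a single class; it assumes the precision and recall arrays have already been derived from the detector's ranked outputs.

import numpy as np

def average_precision(recalls, precisions):
    """11-point interpolated AP as defined in the PASCAL VOC evaluation."""
    recalls = np.asarray(recalls)
    precisions = np.asarray(precisions)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recalls >= t
        # Interpolated precision: best precision achieved at recall >= t.
        ap += (precisions[mask].max() if mask.any() else 0.0) / 11.0
    return ap

mAP would then be the mean of average_precision over all object classes.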
3 Approach
The network used in this project is based on the Single Shot Detector (SSD) [5]. The architecture is shown in Fig. 7.
SSD normally starts with a VGG [6] model, which is converted to a fully convolutional network. Some extra convolutional layers are then attached, which help to handle bigger objects. The output of the VGG network is a 38x38 feature map (conv4_3). The added layers produce 19x19, 10x10, 5x5, 3x3, and 1x1 feature maps. All these feature maps are used to predict bounding boxes at various scales (the later layers are responsible for larger objects).
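A minimal PyTorch-style sketch of this multi-scale prediction idea is shown below. The channel counts, the choice of four anchors per cell, and the class count of 21 (the 20 PASCAL VOC classes plus background) are illustrative assumptions and not taken from the project's actual SSD implementation.

import torch
import torch.nn as nn

class MultiScaleHead(nn.Module):
    """Toy SSD-style head: each feature map gets its own class and box predictors."""
    def __init__(self, in_channels, num_classes=21, anchors_per_cell=4):
        super().__init__()
        self.cls_layers = nn.ModuleList(
            [nn.Conv2d(c, anchors_per_cell * num_classes, kernel_size=3, padding=1)
             for c in in_channels])
        self.box_layers = nn.ModuleList(
            [nn.Conv2d(c, anchors_per_cell * 4, kernel_size=3, padding=1)
             for c in in_channels])

    def forward(self, feature_maps):
        # feature_maps: list of backbone tensors, e.g. with 38x38, 19x19,
        # 10x10, 5x5, 3x3 and 1x1 spatial sizes.
        cls_preds, box_preds = [], []
        for feat, cls_layer, box_layer in zip(feature_maps, self.cls_layers, self.box_layers):
            cls_preds.append(cls_layer(feat))   # per-anchor class scores
            box_preds.append(box_layer(feat))   # per-anchor box offsets
        return cls_preds, box_preds

In a full SSD, in_channels would match the channel counts of conv4_3 and the added layers; here they are placeholders supplied by the caller.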
Thus the overall idea of SSD is shown in Fig. 8. Some of the activations are passed to
the sub-network that acts as a classifier and a localizer.
For each anchor, the sub-network makes two predictions:
• A discrete class prediction
• A continuous offset by which the anchor needs to be shifted to fit the ground-truth bounding box
Figure 9: Anchors
During training, SSD matches ground-truth annotations with anchors. Each element (cell) of a feature map has a number of anchors associated with it. Any anchor with an IoU (Jaccard index) greater than 0.5 with a ground-truth box is considered a match. Consider the case shown in Fig. 10, where the cat has two anchors matched and the dog has one anchor matched. Note that they have been matched on different feature maps.
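A simplified sketch of this matching rule is given below, reusing the iou helper from the earlier sketch; the 0.5 threshold follows the text, while the plain-list data layout is an assumption made for illustration.

def match_anchors(anchors, gt_boxes, iou_thresh=0.5):
    """For each anchor, return the index of the best-overlapping ground-truth box,
    or -1 if the best IoU is below the matching threshold."""
    matches = []
    for anchor in anchors:
        if not gt_boxes:
            matches.append(-1)
            continue
        overlaps = [iou(anchor, gt) for gt in gt_boxes]  # iou() from the NMS sketch above
        best = max(range(len(gt_boxes)), key=lambda i: overlaps[i])
        matches.append(best if overlaps[best] >= iou_thresh else -1)
    return matches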
The loss function used is the multibox classification and regression loss. The classification loss is the softmax cross-entropy, and for regression the smooth L1 loss is used.
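A hedged sketch of this combined objective is given below, using PyTorch's built-in softmax cross-entropy and smooth L1 losses. The tensor shapes and the convention that label 0 marks a background anchor are assumptions for illustration, and refinements of the original SSD loss such as hard negative mining are omitted.

import torch
import torch.nn.functional as F

def multibox_loss(cls_logits, box_preds, cls_targets, box_targets):
    """
    cls_logits:  (N, num_anchors, num_classes) raw class scores
    box_preds:   (N, num_anchors, 4) predicted box offsets
    cls_targets: (N, num_anchors) integer class labels, 0 = background
    box_targets: (N, num_anchors, 4) ground-truth offsets for matched anchors
    """
    num_classes = cls_logits.size(-1)
    # Softmax cross-entropy over all anchors.
    cls_loss = F.cross_entropy(cls_logits.reshape(-1, num_classes),
                               cls_targets.reshape(-1))
    # Smooth L1 regression loss, computed only on anchors matched to an object.
    positive = cls_targets > 0
    if positive.any():
        reg_loss = F.smooth_l1_loss(box_preds[positive], box_targets[positive])
    else:
        reg_loss = box_preds.sum() * 0.0  # no positive anchors in this batch
    return cls_loss + reg_loss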
4 Conclusion
An accurate and efficient object detection system has been developed which achieves metrics comparable to the existing state-of-the-art systems. This project uses recent techniques in the fields of computer vision and deep learning. A custom dataset was created using labelImg, and the evaluation was consistent. The system can be used in real-time applications which require object detection as a pre-processing step in their pipeline.
An important future scope would be to train the system on a video sequence for use in tracking applications. Adding a temporally consistent network would enable smoother detection that is more optimal than per-frame detection.
References
[1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies
for accurate object detection and semantic segmentation. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2014.
[2] Ross Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV),
2015.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards
real-time object detection with region proposal networks. In Advances in Neural
Information Processing Systems (NIPS), 2015.
[4] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look
once: Unified, real-time object detection. In The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2016.
[5] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-
Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[6] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556, 2014.