Object Detection
The ability to recognize objects is innate in humans and animals. Both recognize objects
with little effort; object recognition is so much a part of daily life that it goes
unnoticed. For computers, the ability to recognize and classify objects is called
object detection. Object detection is a key ability for many computers, smartphones and robots.
Deep learning algorithms have advanced object detection greatly in many
directions. This thesis compares object detection using two algorithms, YOLO
and RCNN. Their configurations, performance and accuracy are compared and discussed.
CHAPTER 1
INTRODUCTION
1.1 Motivation
As technology has advanced significantly in recent years, people have wanted their
devices and gadgets to be automated, from smartphones and robots to self-driving
cars. Scripts, programs and sensors alone cannot make a device autonomous, because
they behave only in the way the programmer specified. Devices need intelligence to
make decisions or classify items. Thanks to the contributions of machine learning and
deep learning researchers and practitioners to the field of artificial intelligence, devices
can now recognize objects in images, classify music from audio files and predict prices
and stock values. Giving smartphones, machines, computers and robots the intelligence
to become more and more autonomous and independent of human supervision is a
long-held dream of mankind. Many science-fiction movies have shown robots that do
domestic work, provide healthcare, fight on battlefields and keep humans company.
A robot cannot be intelligent and independent if it cannot see and adapt to its
surrounding environment. Engineers and scientists have therefore implemented image
recognition technologies in intelligent robots. Such a robot must be able to recognize
people’s faces, determine which object to pick up, drop objects at the required place or
hand them to people, avoid objects that are obstacles in its path and understand human
language. The key ability for a robot or computer here is object detection. Scientists and
researchers have contributed several algorithms to carry out object detection.
The purpose of the thesis is to compare the algorithms in detecting, classifying and
tracking objects. Given the need for detecting objects, the goal of the thesis project
is to identify multiple objects in an image or video using two algorithms, YOLO and RCNN.
Once the development of the project is finished, the two algorithms will be measured and
evaluated in terms of configuration, performance and accuracy of object detection.
1.3 Development
CHAPTER 2
THEORY
2.2 YOLO
Detection algorithms from the last decade make use of classifiers to perform
detection. To detect an object, they take a classifier for that object and evaluate it at
different locations in an image, calculating probabilities and confidence values at each one.
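The classifier-at-many-locations idea can be illustrated with a sliding window. The following is a minimal sketch only: the window size, the stride, and the `score_fn` classifier are hypothetical placeholders, not part of any specific detector discussed in this thesis.

```python
from typing import Callable, Iterator, List, Tuple

# (left, top, width, height) in pixels; this layout is an assumption of the sketch
Box = Tuple[int, int, int, int]


def sliding_windows(img_w: int, img_h: int, win: int, stride: int) -> Iterator[Box]:
    """Yield square windows of side `win`, moving `stride` pixels at a time."""
    for top in range(0, img_h - win + 1, stride):
        for left in range(0, img_w - win + 1, stride):
            yield (left, top, win, win)


def detect(img_w: int, img_h: int, win: int, stride: int,
           score_fn: Callable[[Box], float],
           threshold: float) -> List[Tuple[Box, float]]:
    """Score every window with the (hypothetical) classifier and keep confident hits."""
    hits = []
    for box in sliding_windows(img_w, img_h, win, stride):
        score = score_fn(box)  # one classifier evaluation per location
        if score >= threshold:
            hits.append((box, score))
    return hits
```

The sketch makes the cost visible: the classifier runs once per window, so the number of evaluations grows with image size and shrinks with stride.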
More recent approaches like RCNN use region proposal techniques to first generate
candidate bounding boxes in the image and then run a classifier on those boxes. After
classification, post-processing is used to improve the quality of the bounding
boxes and eliminate nearby duplicate detections. These algorithms are slow, resource-hungry
and difficult to optimize because each individual component must be trained separately.
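One common way to eliminate nearby duplicate detections is greedy non-maximum suppression (NMS). The text above does not name the post-processing method, so NMS is used here as an assumption, and the corner-format boxes and the 0.5 threshold are illustrative choices.

```python
from typing import List, Sequence, Tuple

# (x_min, y_min, x_max, y_max); corner format is an assumption of this sketch
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: Sequence[Box], scores: Sequence[float],
                        iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

For example, with two heavily overlapping boxes and one distant box, only the higher-scoring box of the overlapping pair and the distant box are kept.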
YOLO reframes object detection as a single regression problem, straight from image
pixels to bounding box coordinates and class probabilities. Using YOLO, you only look once at
an image to predict what objects are present and where they are. YOLO is remarkably
simple: a single convolutional network simultaneously predicts multiple bounding boxes and
class probabilities for those boxes. YOLO trains on full images and directly optimizes
detection performance. This unified model has several benefits over traditional methods of
object detection.
Detecting a dog, a bike and a vehicle with YOLO, each color showing the class of the object
An example of a convolutional neural network
The detection system divides the input image into an S × S grid. If the center of an object falls
into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B
bounding boxes and confidence scores for those boxes. These confidence scores reflect how
confident the model is that the box contains an object and also how accurate it thinks the
predicted box is. If no object exists in that cell, the confidence scores should be zero.
Otherwise, the confidence score should be equal to the intersection over union (IOU)
between the predicted box and the ground truth. Each bounding box consists of 5 predictions:
x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to
the bounds of the grid cell. The width and height are predicted relative to the whole image.
Finally, the confidence prediction represents the IOU between the predicted box and any
ground truth box.
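The box parameterization described above can be made concrete with a small sketch that converts a box given in pixels into grid-relative targets. The function name `encode_box`, the pixel-center input format, and the 448 × 448 image with S = 7 in the usage note are illustrative assumptions.

```python
from typing import Tuple


def encode_box(cx: float, cy: float, w: float, h: float,
               img_w: float, img_h: float, S: int
               ) -> Tuple[int, int, float, float, float, float]:
    """Map a pixel box (center cx, cy; size w, h) to (row, col, x, y, w, h).

    row, col : grid cell containing the box center (the responsible cell)
    x, y     : center offset within that cell, in [0, 1]
    w, h     : box size as a fraction of the whole image
    """
    cell_w = img_w / S
    cell_h = img_h / S
    # Clamp so a center on the right or bottom image edge stays in the last cell
    col = min(int(cx / cell_w), S - 1)
    row = min(int(cy / cell_h), S - 1)
    x = cx / cell_w - col
    y = cy / cell_h - row
    return row, col, x, y, w / img_w, h / img_h
```

With a 448 × 448 image and S = 7, each cell is 64 pixels wide, so a box centered at (224, 224) falls in row 3, column 3, with offsets x = y = 0.5.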
Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These
probabilities are conditioned on the grid cell containing an object. We only predict one set of
class probabilities per grid cell, regardless of the number of boxes B. At test time we multiply
the conditional class probabilities and the individual box confidence predictions, which gives
class-specific confidence scores for each box.
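The bookkeeping implied by this description can be sketched as follows. Each cell predicts B boxes of 5 values plus C class probabilities, so the whole image yields S × S × (B × 5 + C) values. The function names are hypothetical, and the values S = 7, B = 2, C = 20 in the usage note are illustrative settings, not ones stated in this chapter.

```python
from typing import List, Sequence


def prediction_length(S: int, B: int, C: int) -> int:
    """Total number of predicted values for one image: S*S cells, B*5 + C each."""
    return S * S * (B * 5 + C)


def class_scores(class_probs: Sequence[float],
                 box_confidences: Sequence[float]) -> List[List[float]]:
    """Class-specific scores for one grid cell.

    class_probs     : the C conditional probabilities Pr(Class_i | Object)
    box_confidences : the B box confidence predictions of the same cell
    Returns scores[b][i] = box_confidences[b] * class_probs[i].
    """
    return [[conf * p for p in class_probs] for conf in box_confidences]
```

For instance, `prediction_length(7, 2, 20)` evaluates to 1470, and a box with confidence 0.5 in a cell whose class probabilities are 0.5 and 0.25 receives class-specific scores of 0.25 and 0.125.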
YOLO detecting a bird: bounding box (red), grid cells (green) and x, y, w, h values
IoU Formula
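Intersection over union is the area of overlap between two boxes divided by the area of their union. A minimal sketch of the computation follows, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max) corners; that box format is a choice made for this example.

```python
from typing import Tuple

# (x_min, y_min, x_max, y_max); corner format is an assumption of this sketch
Box = Tuple[float, float, float, float]


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes, in [0, 1]."""
    # Width and height of the overlap rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0
```

Two 2 × 2 boxes offset by one unit in each direction overlap in a 1 × 1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7. Identical boxes give 1.0 and disjoint boxes give 0.0.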
Ground truth box and predicted box while detecting a stop sign
Accuracy of YOLO depending on IoU