A
Minor Project Report
Submitted in partial fulfillment of the requirement for the award of degree of
Bachelor of Technology
In
Computer Science & Engineering
Submitted to
RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA,
BHOPAL (M.P.)
Guided By Submitted By
Supervisor
Project Approval Form
I hereby recommend that the project Object Detection prepared under my supervision by
Kratagya Mourya (0875CS191054), Anmol Saxena (0875CS191014), Abhinav
Shrivastava (0875CS191004) and Aryan Patel (0875CS191018) be accepted in partial
fulfillment of the requirement for the degree of Bachelor of Engineering in Computer
Science & Engineering.
Supervisor
Project In-charge
Project Coordinator
Shivajirao Kadam Institute of Technology & Management –
Technical Campus
Certificate
The project work entitled Object Detection submitted by Kratagya Mourya
(0875CS191054), Anmol Saxena (0875CS191014), Abhinav Shrivastava
(0875CS191004) and Aryan Patel (0875CS191018) is approved as partial fulfillment
for the award of the degree of Bachelor of Engineering in Computer Science &
Engineering by Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal (M.P.).
Acknowledgement
With boundless love and appreciation, we extend our heartfelt gratitude to the people who helped us bring this work to reality. We would like to set aside some space to acknowledge them.
Foremost, we would like to express our sincere gratitude to our supervisor, Prof. Jayesh Umre, whose expertise, consistent guidance, ample time, and constant advice helped us bring this study to success.
To the project in-charge, Prof. Virendra Dani, for their constructive comments, suggestions, and critique even in hardship.
To the project coordinator, Prof. Deepak Singh Chouhan, for their consistent guidance, coordination, and scheduling.
To the honorable Dr. Rashmi Yadav, Head, Department of Computer Science & Engineering, for her favorable responses regarding the study and for providing the necessary facilities.
Finally, we would like to thank the faculty members and staff of the Department of Computer Science & Engineering for their timely help and support.
We would also like to thank our parents for their eternal love, support, and prayers. Without them this would not have been possible.
Abstract
Object detection methods aim to identify all target objects in the target image and
determine the categories and position information to achieve machine vision
understanding. Numerous approaches have been proposed to solve this problem, mainly
inspired by methods of computer vision and deep learning.
Real-time object detection is a vast, vibrant and complex area of computer vision. If there
is a single object to be detected in an image, it is known as Image Localization and if
there are multiple objects in an image, then it is Object Detection. This detects the
semantic objects of a class in digital images and videos. The applications of real-time
object detection include tracking objects, video surveillance, pedestrian detection, people
counting, self-driving cars, face detection, ball tracking in sports, and many more.
Convolutional neural networks are a representative deep learning tool for detecting objects, used here with OpenCV (Open Source Computer Vision), a library of programming functions mainly aimed at real-time computer vision.
The main purpose of object detection is to identify and locate one or more effective
targets from still image or video data. It comprehensively includes a variety of important
techniques, such as image processing, pattern recognition, artificial intelligence, and
machine learning. It has broad application prospects in such areas as road traffic accident
prevention, warnings of dangerous goods in factories, military restricted area monitoring,
and advanced human-computer interaction. Since the application scenarios of multi-target
detection in the real world are usually complex and variable, balancing the relationship
between accuracy and computing costs is a difficult task.
The objective is to detect objects using the You Only Look Once (YOLO) approach. This method has several advantages over other object detection algorithms. Other algorithms, such as the convolutional-neural-network based detectors R-CNN and Fast R-CNN, do not look at the image as a whole, whereas YOLO looks at the complete image: a single convolutional network predicts the bounding boxes along with the class probabilities for those boxes, and it detects objects in the image faster than the other algorithms.
YOLO algorithm employs convolutional neural networks (CNN) to detect objects in real-
time. As the name suggests, the algorithm requires only a single forward propagation
through a neural network to detect objects. This means that prediction in the entire image
is done in a single algorithm run. The CNN is used to predict various class probabilities
and bounding boxes simultaneously.
Table of Content
List of Figures
List of Tables
Abbreviations
1.2 Goal
1.3 Objective
1.4 Methodology
1.5 Role
1.6.2 Innovativeness
1.6.3 Usefulness
3.6 E-R Diagram
Chapter 4: Methodology
Chapter 5: Construction
References
CHAPTER 1
INTRODUCTION
1.1 Goal:
Blind people do lead a normal life with their own style of doing things, but they definitely face trouble due to inaccessible infrastructure and social challenges. The biggest challenge for a blind person, especially one with complete loss of vision, is to navigate around places. Blind people can roam easily around their house without any help because they know the position of everything in it, yet they have a tough time finding objects around them. So, we decided to build a real-time object detection system. We became interested in this project after going through a few papers in this area, and as a result we are highly motivated to develop a system that recognizes objects in a real-time environment.
1.2 Objective:
The motive of object detection is to recognize and locate all known objects in a scene, preferably in 3D space; recovering the pose of objects in 3D is very important for robotic control systems.
Imparting intelligence to machines and making robots more and more autonomous and independent has been a long-standing technological dream of mankind. It is our dream to let robots take on tedious, boring, or dangerous work so that we can commit our time to more creative tasks. Unfortunately, the intelligent part still seems to be lagging behind. In real life, to achieve this goal, besides hardware development we need software that gives a robot the intelligence to do the work and act independently. One of the crucial components in this regard is vision, apart from other types of intelligence such as learning and cognitive thinking. A robot cannot be very intelligent if it cannot see and adapt to a dynamic environment.
The searching or recognition process in a real-time scenario is very difficult. So far, no fully effective solution has been found for this problem. Despite a lot of research in this area, the methods developed so far are not efficient, require long training times, are not suitable for real-time application, and do not scale to a large number of classes. Object detection is relatively simple if the machine is looking for one particular object. However, recognizing all objects inherently requires the skill to differentiate one object from another, even though they may be of the same type. Such a problem is very difficult for machines if they do not know about the various possibilities of objects.
1.3 Methodology:
The YOLO algorithm gives much better performance on all the parameters we discussed, along with a high FPS for real-time usage. YOLO is a regression-based algorithm: instead of selecting the interesting parts of an image, it predicts classes and bounding boxes for the whole image in one run of the algorithm.
YOLO is an abbreviation for the term 'You Only Look Once'. This is an algorithm that detects and recognizes various objects in a picture in real time. Object detection in YOLO is done as a regression problem, and the algorithm provides the class probabilities of the detected objects.
YOLO algorithm employs convolutional neural networks (CNN) to detect objects in real-time.
As the name suggests, the algorithm requires only a single forward propagation through a neural
network to detect objects.
This means that prediction in the entire image is done in a single algorithm run. The CNN is used
to predict various class probabilities and bounding boxes simultaneously.
The YOLO algorithm consists of various variants. Some of the common ones include tiny
YOLO and YOLOv3.
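To make the single-pass idea concrete, here is a minimal Python sketch using OpenCV's DNN module; the yolov3.cfg/yolov3.weights file names, the input image and the thresholds are placeholders, not the project's exact code.

import cv2
import numpy as np

# Load a pre-trained YOLOv3 network (file paths are placeholders).
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

# Prepare the input: YOLO expects a square, normalized blob (e.g. 416 x 416).
image = cv2.imread("input.jpg")
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)

# One forward pass yields predictions at all detection scales.
outputs = net.forward(net.getUnconnectedOutLayersNames())

# Each detection row is [bx, by, bw, bh, objectness, class scores ...].
for output in outputs:
    for detection in output:
        scores = detection[5:]
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        if confidence > 0.5:
            print(class_id, confidence)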
1.4 Role:
The most significant advantage of object detection projects is that they can be more accurate than human vision. The human brain is astounding, so much so that it can complete pictures based on only a couple of snippets of data. But this can sometimes also keep us from seeing what is actually there: the complete picture isn't always accurate, because human brains make assumptions.
Object detection projects react to images based only on the data presented, not just snippets of it like the human brain. Although they can make assumptions based on patterns, they do not share the human brain's tendency to leap to conclusions that may not be accurate. Object detection also operates at the pixel level, which the human brain cannot process. This allows object detection projects to provide more accurate results.
Today, object recognition is the core of most vision-based AI software and programs. Object
detection plays an important role in scene understanding, which is popular in security,
transportation, medical, and military use cases.
People detection in security: a wide range of security applications in video surveillance are based on object detection, for example to detect people in restricted or dangerous areas, for suicide prevention, or to automate inspection tasks at remote locations with computer vision.
1.5.2 Innovativeness:
As a real-time object detection system, YOLO object detection utilizes a single neural network.
The latest release of ImageAI v2.1.0 now supports training a custom YOLO model to detect any
kind and number of objects. Convolutional neural networks are instances of classifier-based
systems where the system repurposes classifiers or localizers to perform detection and applies
the detection model to an image at multiple locations and scales. Using this process, “high
scoring” regions of the image are considered detections. Simply put, the regions which look most
like the training images given are identified positively.
As a single-stage detector, YOLO performs classification and bounding box regression in one
step, making it much faster than most convolutional neural networks. For example, YOLO object
detection is more than 1000x faster than R-CNN and 100x faster than Fast R-CNN.
YOLOv3 achieves 57.9% mAP on the MS COCO dataset, compared with 53.3% for DSSD513 and 61.1% for RetinaNet. YOLOv3 uses multi-label classification with overlapping patterns for training, so it can be used in complex scenarios for object detection. Because of its multi-scale prediction capability, YOLOv3 performs well on small-object detection, while it shows comparatively worse performance for detecting large or medium-sized objects.
1.5.3 Usefulness:
Object detection is one of the fundamental problems of computer vision. It forms the basis of
many other downstream computer vision tasks, for example, instance segmentation, image
captioning, object tracking, and more. Specific object detection applications include pedestrian
detection, people counting, face detection, text detection, pose detection, or number-plate
recognition.
CHAPTER 2
ANALYSIS AND DESIGN
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined by and created from a use-case analysis. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors and their goals (represented as use cases), and of any dependencies between those use cases. The main purpose of a use case diagram is to show which system functions are performed for which actor; the roles of the actors in the system can also be depicted.
3.2 Activity Diagram
3.3 System Architecture
The architecture has been brought out keeping in view that general algorithms are usually implemented in OpenCV, the popular computer vision library. Past work on object identification also used these algorithms. However, these traditional algorithms are not accurate or useful enough for building the latest applications, and hence they could not meet the required specifications for performance and work efficiency under certain circumstances.
3.3.1 System Architecture of YOLOv3
YOLOv2 used a feature extractor known as Darknet-19, which consisted of 19 convolutional layers. The newer version of this algorithm, YOLOv3, uses a new feature extractor known as Darknet-53 which, as the name suggests, uses 53 convolutional layers, while the overall algorithm consists of 75 convolutional layers and 31 other layers, making a total of 106 layers [36]. Pooling layers have been removed from the architecture and replaced by convolutional layers with stride 2 for the purpose of downsampling. This key change was made to prevent the loss of features during the process of pooling. Figure 2.8, created by 'CyberailAB', clearly depicts the architecture of the YOLOv3 algorithm.
YOLOv3 performs detections at three different scales, as shown in Figure 2.8. 1 x 1 detection kernels are applied on feature maps of three different sizes located at three different places in the network. The shape of the detection kernel is 1 x 1 x (B * (4 + 1 + C)), where 'B' is the number of bounding boxes that can be predicted by a cell of the feature map, '4' is the number of bounding box attributes, '1' is the object confidence and 'C' is the number of classes. Figure 2.7 depicts the splitting of an image and the bounding-box prediction in YOLOv3, and Figure 2.8 depicts the architecture of the YOLOv3 algorithm trained on the COCO dataset, which has 80 classes; the number of bounding boxes is taken to be 3, so the kernel size is 1 x 1 x 255 [37]. In YOLOv3, the dimensions of the input image are downsampled by 32, 16 and 8 to make predictions at scales 3, 2 and 1, respectively.
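A quick back-of-the-envelope check of these numbers, as a small Python sketch rather than project code:

# Detection kernel depth: 1 x 1 x (B * (4 + 1 + C))
B, C = 3, 80                     # 3 boxes per cell, 80 COCO classes
kernel_depth = B * (4 + 1 + C)   # = 255

# Grid sizes for a 416 x 416 input at the three detection strides
input_size = 416
for scale, stride in [(3, 32), (2, 16), (1, 8)]:
    grid = input_size // stride  # 13, 26, 52
    print("scale", scale, ":", grid, "x", grid, "x", kernel_depth)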
In Figure 2.8, the size of the input image is 416 x 416. As mentioned in the earlier section, the total number of layers in YOLOv3 is 106. As shown in the network architecture diagram of Figure 2.8, the input image is downsampled by the network for the first 81 layers. Since the 81st layer has a stride of 32, the 82nd layer performs the first detection with a feature map of size 13 x 13. Since a 1 x 1 kernel is used to perform the detection, the size of the resulting detection feature map is 13 x 13 x 255, which is responsible for the detection of objects at scale 3.
Following this, the feature map from the 79th layer is upsampled by 2x after subjecting it to a few convolutional layers, resulting in the dimensions 26 x 26. This is then concatenated with the feature map from the 61st layer. The features are fused by subjecting the concatenated feature map to a few more 1 x 1 convolutional layers. As a result, the 94th layer performs the second detection with a feature map of 26 x 26 x 255, which is responsible for the detection of objects at scale 2.
Following the second detection, the feature map from the 91st layer is upsampled by 2x after subjecting it to a few convolutional layers, resulting in the dimensions 52 x 52. This is then concatenated with the feature map from the 36th layer. The features are fused by subjecting the concatenated feature map to a few more 1 x 1 convolutional layers. As a result, the 106th layer performs the third and final detection with a feature map of 52 x 52 x 255, which is responsible for the detection of objects at scale 1. Consequently, YOLOv3 is better at detecting smaller objects than its predecessors YOLOv2 and YOLO.
3.3.1 System Architecture for Tiny-YOLOv3
Tiny-YOLOv3 is the smaller and simplified version of YOLOv3. Even though the number of layers in Tiny-YOLOv3 is far lower than that of YOLOv3, the accuracy of the model is almost the same as that of its bigger counterpart when high frame rates are considered. Tiny-YOLOv3 consists of only 13 convolutional layers and 8 max-pool layers and therefore requires far less memory to run than YOLOv3. The major difference between YOLOv3 and Tiny-YOLOv3 is that the former is designed to detect objects at three different scales while the latter can only detect objects at two different scales. Apart from these differences, the working of both variants is similar. Figure 2.11 shows the architecture details of Tiny-YOLOv3.
Tiny-YOLOv3 performs detections at two different scales, as shown in Figure 2.11. 1 x 1 detection kernels are applied on feature maps of two different sizes located at two different places in the network. The shape of the detection kernel is again 1 x 1 x (B * (4 + 1 + C)), where 'B' is the number of bounding boxes that can be predicted by a cell of the feature map, '4' is the number of bounding box attributes, '1' is the object confidence and 'C' is the number of classes. Figure 2.11 depicts the architecture of the Tiny-YOLOv3 algorithm trained on the COCO dataset, which has 80 classes; with 3 bounding boxes, the kernel size is therefore 1 x 1 x 255 [37].
In Figure 2.11, the size of the input image is 416 x 416. As mentioned in the earlier section, the total number of layers in Tiny-YOLOv3 is 23. As shown in the network architecture diagram of Figure 2.11, the input image is max-pooled by the network for the first 15 layers. The 15th layer performs the first detection with a feature map of size 13 x 13. Since a 1 x 1 kernel is used to perform the detection, the size of the resulting detection feature map is 13 x 13 x 255, which is responsible for the detection of objects at scale 2.
Following this, the feature map from the 14th layer is upsampled by 2x after subjecting it to a convolutional layer, resulting in the dimensions 26 x 26. This is then concatenated with the feature map from the 9th layer. The features are fused by subjecting the concatenated feature map to a 1 x 1 and a 3 x 3 convolutional layer. As a result, the 23rd layer performs the second and final detection with a feature map of 26 x 26 x 255, which is responsible for the detection of objects at scale 1.
3.4 Sequence Diagram
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that
shows how processes operate with one another and in what order. It is a construct of a Message
Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and
timing diagrams.
3.5 E-R Diagram
CHAPTER 3
METHODOLOGY
Understanding visual scenes is a primary goal of computer vision; it involves recognizing what
objects are present, localizing the objects in 2D and 3D, determining the object’s attributes, and
characterizing the relationship between objects. Algorithms for object detection and object classification can therefore be trained using a suitable dataset.
COCO stands for Common Objects in Context; the dataset was created with the goal of advancing image recognition. The COCO dataset contains challenging, high-quality visual data for computer vision, mostly used to train state-of-the-art neural networks. For example, COCO is often used as a benchmark to compare the performance of real-time object detection algorithms. The format of the COCO dataset is automatically interpreted by advanced neural network libraries.
The COCO dataset classes for object detection and tracking include the following 80 pre-trained object classes:
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep',
'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase',
'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut',
'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse',
'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator',
'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
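In practice these labels are usually read from a plain-text file (commonly named coco.names, one class per line; the file name here is an assumption) so that the numeric class indices produced by the detector can be mapped back to readable names:

# Load the COCO class labels so that class indices map back to names.
with open("coco.names", "rt") as f:
    class_names = [line.strip() for line in f if line.strip()]

print(len(class_names))   # expected: 80
print(class_names[0])     # expected: 'person'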
The COCO key points include 17 different pre-trained key points (classes) that are annotated
with three values (x,y,v). The x and y values mark the coordinates, and v indicates the visibility
of the key point (visible, not visible).
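As a small sketch of how such an annotation could be unpacked (the triplet order follows the description above; the numbers are made-up placeholder values, not real data):

# A COCO keypoint annotation is a flat list of (x, y, v) triplets, one per keypoint.
keypoints = [231, 120, 2, 245, 118, 2, 0, 0, 0]   # placeholder values

for i in range(0, len(keypoints), 3):
    x, y, v = keypoints[i:i + 3]
    visible = v > 0   # v = 0 means the keypoint is not labeled as visible
    print(x, y, visible)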
The large dataset comprises annotated photos of everyday scenes of common objects in their
natural context. Those objects are labeled using pre-defined classes such as “chair” or “banana”.
The process of labeling, also called image annotation, is a very popular technique in computer vision.
While other object recognition datasets have focused on 1) image classification, 2) object bounding-box localization, or 3) semantic pixel-level segmentation, the MS COCO dataset focuses on 4) segmenting individual object instances.
With COCO, Microsoft introduced a visual dataset that contains a massive number of photos
depicting common objects in complex everyday scenes. This sets COCO apart from other object
recognition datasets that may focus on specific sectors of artificial intelligence. Such sectors include image classification, object bounding-box localization, and semantic pixel-level segmentation.
Meanwhile, the annotations of COCO are mainly focused on the segmentation of multiple,
individual object instances. This broader focus allows COCO to be used in more instances than
other popular datasets like CIFAR-10 and CIFAR-100. However, compared to the OID dataset,
COCO does not stand out too much and in most cases, both could be used.
With 2.5 million labeled instances in 328k images, COCO is a very large and expansive dataset
that allows many uses. However, this amount does not compare to Google’s OID, which contains
a whopping 9 million annotated images.
The COCO images were manually annotated, while OID discloses that it generated object bounding boxes and segmentation masks using automated, computerized methods. Neither COCO nor OID has disclosed bounding-box accuracy figures, so it remains up to the user whether to assume that automated bounding boxes are more precise than manually made ones.
4.2 Proposed Algorithm
Object detection can be carried out using convolutional neural networks as the underlying algorithm. Detection differs from plain classification in that the image is interpreted not only by identifying the label of a particular class but also by identifying the location at which each object is placed. The algorithm divides the picture into parts and identifies the different objects located within the image. This convolutional neural network uses only a single network over the complete image: it divides the image into separate portions and predicts bounding boxes and probabilities for each portion, and the bounding boxes in the image are weighted using the predicted probabilities.
The image under consideration is resized by YOLO to a fixed size. In general, only a fixed input size is considered, which keeps the evaluation of the detection process consistent across algorithms.
This algorithm initially fixes the image's height and width as input and produces the output. It lists all the boxes available in the frame with multi-label classes. Each box in the frame contains (pp, bx, by, bh, bw, p) as parameters, where 'pp' can be either 0 or 1 and defines whether a person 'p' is present in the box or not, 'bx' and 'by' define the midpoint of the box, and 'bh' and 'bw' define the height and width of the box, respectively.
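For illustration, a small helper (hypothetical, not part of the report's code) that converts such a box description into pixel corner coordinates for drawing:

def box_to_corners(bx, by, bh, bw, img_w, img_h):
    # bx, by give the box midpoint and bh, bw its height and width,
    # all expressed as fractions of the image size.
    x1 = int((bx - bw / 2) * img_w)
    y1 = int((by - bh / 2) * img_h)
    x2 = int((bx + bw / 2) * img_w)
    y2 = int((by + bh / 2) * img_h)
    return x1, y1, x2, y2

# Example: a box centred in a 416 x 416 image covering half of each dimension.
print(box_to_corners(0.5, 0.5, 0.5, 0.5, 416, 416))   # (104, 104, 312, 312)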
Workflow of YOLOv3
The above shows the workflow procedure of YOLOv3. Here, feature-prediction mapping is done on each box available in the frame. Each box is treated as three individual boxes, which defines v3. Each box's attributes contain 'box coordinates', 'probability scores', and 'multi-label scores' for bounding all the boxes available in the frame. The 'red' cell is at the 5th row, sixth column of the grid image; the feature mapping is then applied on it to detect the person.
Algorithm:
Step 1: if (setModel == YoloV3() || TinyYoloV3())
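Step 1 above corresponds to choosing between the full and the tiny model. A minimal sketch of such a setup, assuming the ImageAI library mentioned in Chapter 1 (the model file name and input image are placeholders, and the exact API may differ between ImageAI versions):

from imageai.Detection import ObjectDetection

detector = ObjectDetection()
detector.setModelTypeAsYOLOv3()           # or detector.setModelTypeAsTinyYOLOv3()
detector.setModelPath("yolo.h5")          # pre-trained model file (placeholder path)
detector.loadModel()

detections = detector.detectObjectsFromImage(input_image="input.jpg",
                                             output_image_path="output.jpg")
for det in detections:
    print(det["name"], det["percentage_probability"], det["box_points"])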
After applying the frame's echoing technique to get the two-dimensional shape, we compute the probability score for each box by applying the product. A sample example of how to obtain the probability scores follows:
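As an illustrative numerical sketch (the values are chosen arbitrarily), the score of a class is the box's objectness multiplied by its conditional class probability:

# Probability score = objectness * conditional class probability (example values only).
objectness = 0.85                  # confidence that the box contains an object
class_probs = {"person": 0.90, "car": 0.05, "dog": 0.02}

scores = {name: objectness * p for name, p in class_probs.items()}
print(scores)   # {'person': 0.765, 'car': 0.0425, 'dog': 0.017}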
CHAPTER 4
CONSTRUCTION
An object detection technique lets you understand the details of an image or a video as it allows
for the recognition, localization, and detection of multiple objects within an image.
It is usually utilized in applications like image retrieval, security, surveillance, and advanced driver assistance systems (ADAS). Object detection can be done in many ways:
Digital image processing is an area characterized by the need for extensive experimental work to establish the viability of proposed solutions to a given problem. An important characteristic underlying the design of image processing systems is the significant level of testing and experimentation that is normally required before arriving at an acceptable solution.
Processing on images:
Processing on images can be of three types: low-level, mid-level, and high-level.
Low-level processing:
Contrast enhancement.
Image sharpening.
Mid-level processing:
Segmentation.
Edge detection.
Object extraction.
High-level processing:
Image analysis.
Scene interpretation.
YOLOv3 is extremely fast and accurate. In mAP measured at 0.5 IOU, YOLOv3 is on par with Focal Loss but about 4x faster. Moreover, you can easily trade off between speed and accuracy simply by changing the size of the model; no retraining is required.
Performance on the COCO Dataset
Model              Train          Test      mAP   FLOPS      FPS
RetinaNet-50-500   COCO trainval  test-dev  50.9  -          14
RetinaNet-101-500  COCO trainval  test-dev  53.1  -          11
RetinaNet-101-800  COCO trainval  test-dev  57.5  -          5
YOLOv3-320         COCO trainval  test-dev  51.5  38.97 Bn   45
YOLOv3-416         COCO trainval  test-dev  55.3  65.86 Bn   35
YOLOv3-608         COCO trainval  test-dev  57.9  140.69 Bn  20
YOLOv3-tiny        COCO trainval  test-dev  33.1  5.56 Bn    220
YOLOv3-spp         COCO trainval  test-dev  60.6  141.45 Bn  20
How It Works
We use a totally different approach. We apply a single neural network to the full image.
This network divides the image into regions and predicts bounding boxes and
probabilities for each region. These bounding boxes are weighted by the predicted
probabilities. Our model has several advantages over classifier-based systems. It looks at
the whole image at test time so its predictions are informed by global context in the
image. It also makes predictions with a single network evaluation, unlike systems like R-CNN which require thousands of evaluations for a single image. This makes it extremely fast: more than 1000x faster than R-CNN and 100x faster than Fast R-CNN. See our paper for more details on the full system.
YOLOv3 uses a few tricks to improve training and increase performance, including:
multi-scale predictions, a better backbone classifier, and more. The full details are in our
paper!
This post will guide you through detecting objects with the YOLO system using a pre-
trained model. If you don't already have Darknet installed, you should do that first.
To run this demo, you will need to compile Darknet with CUDA and OpenCV. Then run
the command:
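A typical form of this command, as documented for Darknet (paths are relative to the Darknet directory and may differ for your installation), is:

./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights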
YOLO will display the current FPS and predicted classes as well as the image with
bounding boxes drawn on top of it.
You will need a webcam connected to the computer that OpenCV can connect to or it
won't work. If you have multiple webcams connected and want to select which one to use
you can pass the flag -c <num> to pick (OpenCV uses webcam 0 by default).
You can also run it on a video file if OpenCV can read the video:
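For example, with the path of the video file passed as the final argument:

./darknet detector demo cfg/coco.data cfg/yolov3.cfg yolov3.weights <video file>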
Implemented Classes:
There are many kinds of classes present in the weights section of YOLOv3, but some of the main classes that have been implemented in the project are as follows.
ClassIndex: Used for defining the index of the particular class and the array that we have
created using the coco dataset in our project.
Confidence: Used for storing the confidence score of each object that has been detected using YOLO; this score is shown alongside the detected object's label.
Bbox: Used for creating the boxes around the objects that have been detected using YOLO. The boxes can be green or any colour we want.
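These three values correspond to what OpenCV's high-level detection API returns. A minimal sketch of how such a loop could obtain them (file names and thresholds are assumptions, not the project's exact code):

import cv2

# The DetectionModel wrapper runs the network and returns, for each detection,
# a class index, a confidence score and a bounding box.
net = cv2.dnn_DetectionModel("yolov3.weights", "yolov3.cfg")
net.setInputSize(416, 416)
net.setInputScale(1 / 255.0)
net.setInputSwapRB(True)

class_names = [line.strip() for line in open("coco.names")]

img = cv2.imread("input.jpg")
class_ids, confidences, boxes = net.detect(img, confThreshold=0.5, nmsThreshold=0.4)

for class_id, confidence, bbox in zip(class_ids, confidences, boxes):
    x, y, w, h = bbox
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)   # green bounding box
    print(class_names[int(class_id)], float(confidence))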
Implemented Functions
Weighted Sum
Inputs to a neuron can either be features from a training set or outputs from the neurons
of a previous layer. Each connection between two neurons has a unique synapse with a
unique weight attached. If you want to get from one neuron to the next, you have to travel
along the synapse and pay the “toll” (weight). The neuron then applies an activation
function to the sum of the weighted inputs from each incoming synapse. It passes the
result on to all the neurons in the next layer. When we talk about updating weights in a
network, we’re talking about adjusting the weights on these synapses.
A neuron’s input is the sum of weighted outputs from all the neurons in the previous
layer. Each input is multiplied by the weight associated with the synapse connecting the
input to the current neuron. If there are 3 inputs or neurons in the previous layer, each
neuron in the current layer will have 3 distinct weights: one for each synapse.
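A toy numerical illustration of this weighted sum (the numbers are arbitrary):

import numpy as np

inputs = np.array([0.5, 0.2, 0.8])      # outputs of the 3 neurons in the previous layer
weights = np.array([0.4, 0.9, -0.3])    # one weight per incoming synapse

weighted_sum = np.dot(inputs, weights)  # 0.5*0.4 + 0.2*0.9 + 0.8*(-0.3) = 0.14
print(weighted_sum)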
Activation function
In a nutshell, the activation function of a node defines the output of that node. The activation function (or transfer function) translates the input signals to output signals. It maps the output values onto a range like 0 to 1 or -1 to 1. It is an abstraction that represents the rate of action potential firing in the cell: a number that represents the likelihood that the cell will fire. At its simplest, the function is binary: yes (the neuron fires) or no (the neuron doesn't fire). The output can be either 0 or 1 (on/off or yes/no), or it can be anywhere in a range. If you were using a function that maps a range between 0 and 1 to determine the likelihood that an image is a cat, for example, an output of 0.9 would show a 90% probability that your image is, in fact, a cat.
Threshold function
This is a step function. If the summed value of the input is less than zero, the function passes on 0; if it is equal to or more than zero, it passes on 1. It is a very rigid, straightforward, yes-or-no function.
Sigmoid function
This function is used in logistic regression. Unlike the threshold function, it's a smooth, gradual progression from 0 to 1. It's useful in the output layer, especially for models that have to predict a probability as the output.
Hyperbolic tangent function
This function is very similar to the sigmoid function. But unlike the sigmoid function, which goes from 0 to 1, the value goes below zero, from -1 to 1. Even though this isn't much like what happens in a brain, this function gives better results when it comes to training neural networks. Neural networks sometimes get "stuck" during training with the sigmoid function. This happens when there's a lot of strongly negative input that keeps the output near zero, which interferes with the learning process.
Rectifier function
This might be the most popular activation function in the universe of neural networks. It's the most efficient and biologically plausible. Even though it has a kink, it's smooth and gradual after the kink at 0. This means, for example, that your output would be either "no" or a percentage of "yes". This function doesn't require normalization or other complicated calculations.
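The four activation functions described above can be written in a few lines of NumPy; this is a sketch for illustration only:

import numpy as np

def threshold(x):
    # Step function: 1 if the weighted sum is at least 0, otherwise 0.
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    # Smooth curve from 0 to 1.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Like the sigmoid, but ranges from -1 to 1.
    return np.tanh(x)

def relu(x):
    # Rectifier: zero below the kink at 0, linear above it.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(threshold(x), sigmoid(x), tanh(x), relu(x))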
Object detection comes under machine learning, where machines can acquire skills and learn from past experience without human involvement. Deep learning is a branch of machine learning in which artificial neural networks, algorithms inspired by the human brain, learn from large amounts of data.
CHAPTER 5
Figure 8: Object detection in an outdoor environment with multi-labelling.
Figure 8 shows the loaded image of an outdoor environment on one side; on the other, the model has marked all the objects available in the picture with blue-coloured frames.
Figure 9: Accuracy values of the objects available in the image.
Figure 9 shows all the objects available in the image along with their accuracies. The detection module observed that five bottles, one chair, and eight persons were detected in the loaded image. By playing the audio, the module says, "Hey! There are five bottles, one chair and eight persons before you."
Figure 10 shows all the detected objects with labels in the traffic-signal environment. In this case, the model detected seven cars, two trucks, one person, and one bicycle in front of the person, along with the accuracy of each detection.
Figure 11: Playing the audio output.
Figure 11 shows the accuracy of the objects available in the loaded image. By playing the "play audio" module, visually impaired people can listen to the types of objects in the surrounding environment and their count.
Figure 12 shows all the detected objects at a traffic signal, used for a comparative analysis of the results of RetinaNet, YOLOv3, and Tiny-YOLO on the same image.
For humans and many other animals, visual perception is one of the most important senses; we
heavily rely on vision whenever we interact with our environment. In order to pick up a glass, we
need to first determine which part of our visual impression corresponds to the glass before we
can find out where we have to move our hands in order to grasp it.
The same code that can be used to recognize stop signs or pedestrians in a self-driving vehicle can also be used to find cancer cells in a tissue biopsy.
If we want to recognize another human, we first have to find out which part of the image we see
represents that individual, as well as any distinguishing factors of their face.
Notably, we generally do not actively consider these basic steps, but these steps pose a major
challenge for artificial systems dealing with image processing.
1. Problem Domain
People who can't see face problems in recognizing things by feeling them. It would help if there were some other way of recognizing things, for example someone telling them, or software that tells them the name of a thing and its distance from a given point.
2. Solution Domain
A person cannot remain with a blind person all the time in order to assist them in recognizing objects.
Our project will aid blind people in recognizing things by telling them the names of those things.
This is an object detection system which currently works only for recognizing the name or type of things, not their distance from a point.
3. System Domain
A smart device with a camera and a speaker.
The camera will first capture a single frame from the live field, and then the code will
scan the frame to identify the object. Once the object recognition is complete, the
name or information will be sent to the speakers. Finally, the speaker will dictate the
name or information it has.
4. Application Domain
Every blind person can get help from this.
6. References:
Kaggle
Darknet
REFERENCES
1. Asa Gautam, Anjana Kumari, Pankaj Singh, "The Concept of Object Recognition", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 5, Issue 3, March 2015.
2. Tatsuro UE, Hirohiko K, Tetsuo T, Akihisa O, Shin'ich Y, "Visual Information Assist System Using 3D SOKUIKI Sensor for Blind People", 32nd Annual Conference of the IEEE Industrial Electronics Society (IECON), 2006.