
VIDEO ANALYTICS: OBJECT DETECTION USING DEEP NEURAL NETWORKS

Submitted by

JAYADEVA JAVALI
202011

AIMIT, ST ALOYSIUS COLLEGE [AUTONOMOUS]


MANGALORE

Submitted in Partial Fulfillment of the Requirements for the Award of the Degree of
Master of Science (Big Data Analytics)

Under the guidance of

Dr. Hemalatha N
Dean, Information Technology
AIMIT, St Aloysius College (Autonomous)
Mangalore

Kevin Joy D’Souza
Cloud Specialist
Niveus Solutions Private Limited
Mangalore

Submitted to

ST ALOYSIUS INSTITUTE OF MANAGEMENT AND INFORMATION


TECHNOLOGY (AIMIT)
ST ALOYSIUS COLLEGE (AUTONOMOUS)
MANGALURU, KARNATAKA

2022
CERTIFICATE OF AUTHENTICATED WORK

This is to certify that the project report entitled VIDEO ANALYTICS: OBJECT
DETECTION USING DEEP NEURAL NETWORKS submitted to St. Aloysius
Institute of Management and Information Technology (AIMIT), St Aloysius College,
Mangalore affiliated to Mangalore University in partial fulfillment of the
requirement for the award of the degree of MASTER OF SCIENCE (BIG DATA ANALYTICS)
is an original work carried out by Mr. JAYADEVA JAVALI Register number
202011 under my guidance. The matter embodied in this project is authentic and is
genuine work done by the student and has not been submitted whether to this
University, or to any other University / Institute for the fulfilment of the requirement
of any course of study.

Signature of the Student:
Date: _______________
Jayadeva Javali
II M.Sc. Big Data Analytics
AIMIT, St. Aloysius College,
Mangaluru - 575 022
Register Number: 202011

Signature of the Guide:
Date: _______________
Mr. Kevin Joy D’Souza
Cloud Specialist,
Niveus Solutions Pvt. Ltd.
Mangaluru - 575 001


HR/Internship Letter/June/ 2022-23 Date : 6th June 2022

TO WHOMSOEVER IT MAY CONCERN

This is to certify that Mr. Jayadeva Javali has undergone internship with Niveus Solutions Pvt
Ltd from 21st Feb 2022 to 31st May 2022.

During the period of internship, he worked with the Data Modernization team and successfully
met the objectives of the internship. We found Mr. Jayadeva Javali to be hardworking and
sincere, and he displayed good conduct.

We are also happy to inform you that the student has been given a Pre-Placement Offer at
Niveus and has joined our organization as a permanent employee effective 1st Jun, 2022.

We wish him all success in future endeavors.

Yours sincerely,
For M/s Niveus Solutions Pvt. Ltd

____________________
Rashmi George - Chief Talent Officer

Doc ID: 7a8bcfbd0fe4dcdd9ea1f2db69afeea5b28b2d95


Audit trail

TITLE
Jayadeva Javali Internship completion ltr
FILE NAME
Jayadeva Javali -...tificate.docx.pdf
DOCUMENT ID
7a8bcfbd0fe4dcdd9ea1f2db69afeea5b28b2d95
AUDIT TRAIL DATE FORMAT
MM / DD / YYYY
STATUS
Signed

06/06/2022 11:44:22 UTC - Sent for signature to Rashmi George ([email protected]) from [email protected]. IP: 103.89.232.109

06/06/2022 11:58:59 UTC - Viewed by Rashmi George ([email protected]). IP: 202.140.47.58

06/06/2022 12:01:28 UTC - Signed by Rashmi George ([email protected]). IP: 202.140.47.58

06/06/2022 12:01:28 UTC - The document has been completed.
Project Proposal Synopsis for
Video Analytics: Object Detection Using
Deep Neural Networks

Jayadeva Javali
202011
MSc. Big Data Analytics
St Aloysius Institute of Management and Information Technology
Mangalore
30-03-2022

Under the guidance of


Dr. Hemalatha N
Dean, Department of IT
St Aloysius Institute of Management and Information Technology
Mangalore

Submitted to

ST ALOYSIUS INSTITUTE OF MANAGEMENT AND INFORMATION


TECHNOLOGY (AIMIT)
ST ALOYSIUS COLLEGE (AUTONOMOUS)
MANGALORE, KARNATAKA

2022

I. Title of the Project

Video Analytics: Object Detection Using Deep Neural Networks


Video analytics equips a rapidly growing number of embedded video products, such as smart
cameras and intelligent digital video recorders, with automated capabilities.

II. Statement of the Problem


The motivation behind the object detection problem is industry requirements and user experience.
In brief, industries have a specific responsibility to maintain and store large video datasets, e.g.,
news data used to track the news of any specific day.

Pre-trained object detection models (face, object, activity, etc.) make a video easier to understand
for any person, whether or not they know the people appearing in it.

Using video analytics makes your surveillance system more efficient, reduces the workload on
security and management staff, and helps you capture the full value of security video by making
your IP camera system more intelligent.

These systems would contribute a more reliable return on investment, as they can be applied
across organizations, compared with current systems.

III. Why this particular topic chosen?


Industry requirement: the need to easily understand the objects present in any type of video while
maintaining high accuracy by training the model to yield better results, and to apply Artificial
Intelligence to video analytics, which makes it a relevant asset for companies. These systems
would contribute a more reliable return on investment, as they can be applied across organizations,
compared with current systems.

Maintaining and storing large video datasets: the process includes management of both
unstructured and structured data. The primary objective is to ensure the data is of high quality
and accessible for business intelligence as well as big data analytics applications.

User experience: an individual who wants to watch a movie would rather skip directly to the video
content than go through the text information at the beginning; the aim is to meet user requirements
for a better understanding of various genres of videos.

IV. Objective and Scope
Using video analytics makes your surveillance system more efficient, reduces the workload on
security and management staff, and helps you capture the full value of security video by making
your IP camera system more intelligent. Video analytics also helps quantify the returns on video
marketing, making it easy to build the business case for producing more of it, and sales teams can
leverage video analytics to find the most contextual video for their customers and use it to engage
them in a meaningful way. The video analytics market size was valued at $4.10 billion, and,
traditionally, increasing security threats and the need for advanced surveillance have driven
demand in the video analytics market.

Object detection is a computer technology related to computer vision and image processing that
deals with detecting instances of semantic objects of a certain class (such as humans, buildings,
or cars) in digital images and videos. The state-of-the-art methods can be categorized into two
main types: one-stage methods and two-stage methods. One-stage methods prioritize inference
speed; example models include YOLO, SSD and RetinaNet. Two-stage methods prioritize
detection accuracy; example models include Faster R-CNN, Mask R-CNN and Cascade R-CNN.

Scope

Image recognition only outputs a class label for an identified object, and image segmentation
creates a pixel-level understanding of a scene’s elements. What separates object detection from
these tasks is its unique ability to locate objects within an image or video, which then allows us
to count and track those objects. Deep learning outperforms other techniques when the data size
is large, but with small data sizes, traditional machine learning algorithms are preferable. Deep
learning techniques need high-end infrastructure to train in reasonable time. Deep learning really
shines on complex problems such as image classification, natural language processing, and
speech recognition.

V. Methodology
LabelImg - a tool used for labelling images in a custom dataset. LabelImg is a graphical
annotation tool written in Python that uses Qt for its graphical interface. Annotations are saved as
XML files in PASCAL VOC format; it also supports the YOLO and CreateML formats.

YOLOv5 - YOLO, an acronym for 'You Only Look Once', is an object detection algorithm that
divides images into a grid system, where each cell in the grid is responsible for detecting objects
within itself. YOLO is one of the most famous object detection algorithms due to its speed and
accuracy.

Deep Neural Networks and Artificial Intelligence Approach-


AI-powered production is set to transform how you work, with technologies such as video
analytics bringing less trouble and better, more refined results. The primary goal of AI-based
video analytics is to detect temporal and spatial events in videos automatically.

VI. Process Description

Fig 1: - Project Work Flow Diagram


Collecting and creating custom datasets based on the requirement, preprocessing them against
conditional requirements, modeling using deep neural networks (the YOLOv5 model), training
and deploying at scale, and simultaneously taking feedback from the deployment to build a
better service model.

VII. Resources and Limitations

Hardware:
● Operating System: Windows 7 or higher
● Processor: i5 or higher
● HDD: 500 GB - 1 TB
● Network connectivity

Software:
● LabelImg (image annotation tool)
● Python, version 3 or higher
● Python IDE: Google Colab / Jupyter Notebook / PyCharm

VIII. Testing Technologies Used

Serving these models through APIs - FastAPI is used to serve the models under industry
constraints; API documentation is necessary to understand the workflow and to test the various models.
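A minimal sketch of such a serving endpoint, assuming a YOLOv5 model loaded via torch.hub (the route name, the weights and the response shape are illustrative, not the project's actual service):

from fastapi import FastAPI, File, UploadFile
from PIL import Image
import io
import torch

app = FastAPI()
# load a YOLOv5 model once at startup (yolov5s is a placeholder for the trained weights)
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

@app.post("/detect")
async def detect(file: UploadFile = File(...)):
    # decode the uploaded image and run the detector
    image = Image.open(io.BytesIO(await file.read()))
    results = model(image)
    # one record per detection: box coordinates, confidence, class
    return results.pandas().xyxy[0].to_dict(orient="records")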
Evaluation metrics - these are used to measure the quality of a statistical or machine learning
model. Evaluating any ML model is essential for any project, and many different types of
evaluation metrics are available to test a model.
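As context for the metrics named in the abbreviations list (TP, FP, FN, IoU, PR, mAP), a minimal sketch of the basic formulas, assuming boxes in (x1, y1, x2, y2) corner format:

def precision(tp, fp):
    # PPV: the fraction of predicted boxes that are correct
    return tp / (tp + fp)

def recall(tp, fn):
    # the fraction of ground-truth boxes that were found
    return tp / (tp + fn)

def iou(a, b):
    # intersection-over-union of two boxes given as (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

Average precision (AP) is the area under the precision-recall curve for one class, and mAP is the mean of AP over all classes, often averaged over IoU thresholds from 0.5 to 0.95.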

IX. Conclusion

Object detection using deep neural networks through an artificial intelligence approach is
the need of the hour, as artificial intelligence is the technology to watch both now and in the
near future. The work consists of a series of procedures, from data collection to accurately
detecting the object (a face, a thing, etc.), which includes training and testing various
modelling approaches in search of better accuracy. Suitable evaluation metrics will be used,
depending on the model, to understand it in depth.
Video processing can take surveillance and other monitoring tasks to a whole new level,
reducing time, money, and human effort; in turn, this makes industries more secure,
reliable, and consistent. This proposal has described the methodologies for extracting such
high-level information.

INDEX
Acknowledgement 13
List of Figures 14
Abbreviations 17

1. Introduction

1.1 Overview 19

1.2 Problem Statement 20

1.3 Project Objective 21

1.4 Organization Details 22

2. Literature Review

2.1 Introduction 24

2.2 Details of Literature Review 25

2.3 Models/Algorithms Used 29

2.4 Findings-Gaps Identified 32

3. Methodology

3.1 Introduction 38

3.2 Data Sources and Format 38

3.3 Data Pre-Processing -Extraction and Preparation 39

3.4 Data Exploration and Analysis 40

3.5 Process Description 44

3.6 Workflow Diagram 51

3.7 Hardware and Software Requirements 51

4. Model Details and Implementation

4.1 Model Building 53

4.2 Data Inputs 54

4.3 User Interface 56

4.4 Results and Model Accuracy 57

4.5 Results Analysis and Discussion 79

5. Conclusion

5.1 Summary 82

5.2 Limitation 82

5.3 Future Scope 83

References 84

Annexures

Acknowledgment

At the very beginning I would like to thank the Almighty who enabled me to go on with my
research work and give my best efforts to bring it to a conclusion.

Secondly, I would like to thank none other than my internal guide Dr. Hemalatha N and my
external guide Kevin Joy D’Souza, who supported and guided me from the very beginning of
my research. Without their proper guidance it would not have been possible for me to progress
and finish this work. Whenever I hit a wall, they encouraged me to overcome it and offered any
kind of help that was at their disposal. I am grateful to them for their excellent supervision,
guidance and encouragement, which pushed me to successfully conduct and finish my research
work. I would like to extend my gratitude to my parents, my friends and all well-wishers who
have been supportive and helpful through the entire journey.

Last but not the least, I thank AIMIT, St. Aloysius College (Autonomous) and every single
person associated with this organization for giving me the opportunity to conduct this research
and to complete my Master’s degree.

List of Figures
Fig 1:- Project Work Flow Diagram
Fig 2:-Yolo Architecture
Fig 3:-R-CNN Architecture
Fig 4:-DNN Based Regression
Fig 5:-Deep Neural Network
Fig 6:-Yolov5 Architecture
Fig 7:-Process Description for Object Detection
Fig 8:-Detailed Process of YoloV5 used in the paper
Fig 9:-SSD Architecture
Fig 10:-Working Example of SSD Model
Fig 11:-LabelImg via Command Prompt
Fig 12:-LabelImg Installation in Unix Systems
Fig 13:-LabelImg Installation Guide for Windows
Fig 14:-LabelImg Window
Fig 15:-LabelImg Window Showing Image with Bounding Box
Fig 16:-Packages Used and Cloning the Yolov5 Repo from github
Fig 17:-Annotation File
Fig 18:-The data configuration file
Fig 19:-Directory structure
Fig 20:-classes
Fig 21:-Model Workflow
Fig 22:-The data configuration file
Fig 23:-Directory structure
Fig 24:-classes
Fig 25:-Confusion Matrix(Case1)
Fig 26:-Labels
Fig 27:-PR-Curve
Fig 28:-Results

Fig 29:-Train - batch 0,1,2
Fig 30:-Validation Batch 0 Labels
Fig 31:-Validation Batch 0 Predicted
Fig 32:-Validation Batch 1 Labels
Fig 33:-Validation Batch 1 Predicted
Fig 34:-Validation Batch 2 Labels
Fig 35:-Validation Batch 2 Predicted
Fig 36:-Confusion Matrix(Case2)
Fig 37:-Labels
Fig 38:-PR Curve
Fig 39:-Model Results
Fig 40:-Train Batch 0 Labels
Fig 41:-Train Batch 1 Labels
Fig 42: -Validation Batch 0 Labels
Fig 43:-Validation Batch 0 Predicted
Fig 44:-Confusion Matrix(Case3)
Fig 45:-Labels
Fig 46: -PR Curve
Fig 47: -Model Results
Fig 48: -Train Batch 0 Labels
Fig 49: -Train Batch 1 Labels
Fig 50: -Validation Batch 0 Labels
Fig 51:-Validation Batch 0 Predicted
Fig 52:-Validation Batch 1 Labels
Fig 53:-Validation Batch 1 Predicted
Fig 54:-Confusion Matrix(Case4)
Fig 55:-Labels
Fig 56: -PR Curve
Fig 57: -Results
Fig 58: -Train Batch - 0,1,2
Fig 59: -Val Batch 0 Labels

Fig 60:-Val Batch 0 Predicted
Fig 61:-Val Batch 1 Labels
Fig 62:-Val Batch 1 Predicted
Fig 63:-Video Output Results(1)
Fig 64:-Video Output Results(2)
Fig 65:-Video Output Results(3)
Fig 66:-Video Output Results(4)
Fig 67:-Case 1:- Model Results
Fig 68:-Case 2:- Model Results
Fig 69:-Case 3:- Model Results
Fig 70:-Case 4:- Model Results

Abbreviations
AI Artificial Intelligence

ANPR Automatic Number Plate Recognition

R-CNN Region Based Convolutional Neural Network

YOLO You Only Look Once

GPU Graphics Processing Unit

IP Internet Protocol

SSD Single Shot Detector

DNN Deep Neural Networks

AP Average Precision

AIS Automatic Identification Systems

SMD Singapore Maritime Dataset

VOC Visual Object Classes

XML Extensible Markup Language

SVM Support Vector Machine

FPS Frames Per Second

YAML Yet Another Markup Language

RAM Random Access Memory

IDE Integrated Development Environment

PR Precision-Recall Curve

TP True Positive

TN True Negative

FP False Positive

FN False Negative

PPV Positive Predictive Value

mAP Mean Average Precision

IoU Intersection Over Union

Chapter 1
Introduction

1.1 OVERVIEW

Innovative video analytics solutions are quickly gaining traction. Companies that want to
leverage the latest artificial intelligence (AI) technology to tackle long-standing challenges, as
well as those that were using video surveillance systems before the advent of AI, are key adopters.
By applying computer vision and deep learning to video footage or live video streams, video
analytics uses artificial intelligence to fulfil numerous tasks. Video content analysis and intelligent
video analytics are other terms for video analytics.
Deep learning and machine learning, two types of AI, have enabled video analytics to revolutionise
the task-automation landscape, allowing tasks that previously required human intervention to be
successfully automated.
The market for video analytics is always changing. Deep Learning, the capacity to do real-time
video processing, and the increased accuracy of video recognition software are among the most
recent breakthroughs in video analytics.
In the video analytics market, the most popular applications include security (incident detection,
intrusion management), people counting, traffic monitoring, Automatic Number Plate
Recognition (ANPR), facial recognition, augmented reality, and ego-motion estimation. Video
analytics has also proven effective in a variety of areas, including manufacturing, security, retail,
healthcare, and hospitality.
The advent of techniques like Mask R-CNN and YOLO has made real-time object detection in
video feeds possible for years now. These algorithms are pre-programmed to distinguish between
objects in a field of view.
They allow video analysis systems to detect and track items in real time, such as vehicles,
people, traffic signals, and so on. These objects are labelled and can be used for purposes such as
counting people in congested locations or cars.
Video analytics remains an interesting aspect and application of computer vision as a part of
visual artificial intelligence.

Object detection is a computer vision approach for detecting things in photos and videos. To
obtain relevant results, object detection algorithms typically use machine learning or deep learning.
When we look at photographs or videos, we can quickly distinguish and find objects of interest;
the purpose of object detection is to use a computer to imitate this intelligence.
Object detection can be accomplished using a variety of methods. R-CNN and YOLO v2, two
popular deep learning–based techniques that use convolutional neural networks (CNNs),
automatically learn to detect objects within photos.
The optimal method for detecting objects depends on your application and the problem
you're trying to address. When deciding between machine learning and deep learning, the most
important factor to consider is whether you have a powerful GPU and a large number of labelled
training images. If you answered no to either of these questions, machine learning might be the
better option. Deep learning approaches work better when you have more photos, and GPUs
reduce the time it takes to train the model.

1.2 PROBLEM STATEMENT

The motivation behind the object detection problem is industry requirements and user experience.
In brief, industries have a specific responsibility to maintain and store large video datasets, e.g.,
news data used to track the news of any specific day.
Pre-trained object detection models (face, object, activity, etc.) make a video easier to understand
for any person, whether or not they know the people appearing in it.
Using video analytics makes your surveillance system more efficient, reduces the workload on
security and management staff, and helps you capture the full value of security video by making
your IP camera system more intelligent.
These systems would contribute a more reliable return on investment, as they can be applied
across organizations, compared with current systems.
The project "Object Detection Using Deep Neural Networks" efficiently detects objects using the
YOLO technique, which is applied to image and video data.
The development of a reliable object detection system that performs robust object recognition and
tracking is still a challenge today. We cannot underestimate the complexity of this problem:
recognition is a process that human eyes perform easily, but for a computer to model and imitate
human vision there are many challenges involved. Some of the challenges faced in object
detection are variation in viewpoints, illumination, and the shapes and sizes of the objects of
interest.

1.3 PROJECT OBJECTIVES

Industry requirement: the need to easily understand the objects present in any type of video while
maintaining high accuracy by training the model to yield better results, and to apply Artificial
Intelligence to video analytics, which makes it a relevant asset for companies. These systems
would contribute a more reliable return on investment, as they can be applied across organizations,
compared with current systems.
Maintaining and storing large video datasets: the process includes management of both
unstructured and structured data. The primary objective is to ensure the data is of high quality
and accessible for business intelligence as well as big data analytics applications.
User experience: an individual who wants to watch a movie would rather skip directly to the video
content than go through the text information at the beginning; the aim is to meet user requirements
for a better understanding of various genres of videos.
Using video analytics makes your surveillance system more efficient, reduces the workload on
security and management staff, and helps you capture the full value of security video by making
your IP camera system more intelligent. Video analytics also helps quantify the returns on video
marketing, making it easy to build the business case for producing more of it, and sales teams can
leverage video analytics to find the most contextual video for their customers and use it to engage
them in a meaningful way. The video analytics market size was valued at $4.10 billion, and,
traditionally, increasing security threats and the need for advanced surveillance have driven
demand in the video analytics market.
Object detection is a computer technology related to computer vision and image processing that
deals with detecting instances of semantic objects of a certain class (such as humans, buildings,
or cars) in digital images and videos. The state-of-the-art methods can be categorized into two
main types: one-stage methods and two-stage methods. One-stage methods prioritize inference
speed; example models include YOLO, SSD and RetinaNet. Two-stage methods prioritize
detection accuracy; example models include Faster R-CNN, Mask R-CNN and Cascade R-CNN.
Scope: Image recognition only outputs a class label for an identified object, and image
segmentation creates a pixel-level understanding of a scene’s elements. What separates object
detection from these tasks is its unique ability to locate objects within an image or video, which
then allows us to count and track those objects. Deep learning outperforms other techniques
when the data size is large, but with small data sizes, traditional machine learning algorithms are
preferable. Deep learning techniques need high-end infrastructure to train in reasonable time.
Deep learning really shines on complex problems such as image classification, natural language
processing, and speech recognition.

1.4 ORGANIZATION DETAILS

Niveus Solutions Pvt. Ltd. is a boot-strapped cloud engineering services organization founded by
Suyog Shetty, Rashmi George, Roshan Bava, and Mohsin Khan in Karnataka, India. The seeds
of Niveus were planted in 2013 when the four founders realised they shared a common aim of
developing a world-class cloud engineering services firm. They had extensive experience
working with companies such as Infosys, Wipro, Cognizant, and Sapient. The vast talent pool
available in adjacent education cities Udupi, Manipal, and Mangalore, which boast some of the
country's and world's best minds, persuaded the Founders that their goal might become a reality.
Niveus has grown quickly over the years, making the strategic decision to solely partner with
Google Cloud India in 2019, scaling up to become its 'Premier' partner in less than two years and
winning the 'Breakthrough Partner of the Year - Asia Pacific' award for 2020. Bengaluru, Delhi,
Mumbai, and Singapore are among the cities where the organisation now works. Niveus believes
a unique empathetic approach to problem-solving is at the heart of its success; a trait embraced
and imbibed by the team and articulated through its brand identity "We Solve For You", where
"You" refers to all stakeholders, including customers, customers' customers, and the Niveus
team. Among its fast-growing clientele are industry leaders in BFSI, Automotive, Media and
Entertainment, Manufacturing, PSUs, and Digital Natives.
Niveus employs cloud technologies as a Google Cloud Platform Partner to assist businesses with
cloud consulting, app modernization, infrastructure modernization, data modernization, platform

22
migration, cloud-native application development, cloud security, and managed services. The
corporation enables businesses to harness the power of cloud services and construct scalable,
resilient infrastructures.
Vision: “Our vision is to be a people-centric company and set an uncompromising standard in
terms of the work environment so that people can grow to their true potential”.
Mission: “To bring out the best in our people by providing an open work environment and
challenging responsibilities that will empower them to learn and grow while delivering their best
to delight our customers globally”.

Chapter 2
Literature Review
2.1 Introduction
Object detection is a computer technology that deals with finding instances of semantic items of
a specific class (such as individuals, buildings, or cars) in digital photos and videos. It is related
to computer vision and image processing. There are two primary categories of state-of-the-art
approaches: one-stage methods and two-stage methods. YOLO, SSD, and RetinaNet are
examples of one-stage algorithms that favour inference speed. Faster R-CNN, Mask R-CNN, and
Cascade R-CNN are examples of two-stage algorithms that prioritise detection accuracy.
Deep learning-based object detection predicts the location of an object in an image quickly and
accurately. Deep learning is a powerful machine learning technique in which the object detector
automatically learns the image attributes essential for detection tasks. A deep neural network is a
neural network with more than two layers and a certain level of complexity; deep neural
networks analyse data in complex ways using advanced mathematical modelling.

Deep learning is a machine learning technique that allows computers to learn by example in the
same way that humans do. Deep learning is a critical component of self-driving automobiles,
allowing them to detect a stop sign or discriminate between a pedestrian and a lamppost. It
enables voice control in consumer electronics such as phones, tablets, televisions, and hands-free
speakers. Deep learning has gotten a lot of press recently, and with good cause: it is achieving
results that were previously unattainable.

A computer model learns to execute categorization tasks directly from images, text, or sound in
deep learning. Deep learning models can achieve cutting-edge accuracy, sometimes even
outperforming humans. A vast set of labelled data and neural network topologies are used to
train the models. Deep learning achieves higher recognition accuracy than ever before. This helps
consumer electronics meet customer expectations, and it is vital for safety-sensitive
applications such as self-driving automobiles. Deep learning has progressed to the point where it
now beats humans at some tasks, such as categorising objects in photographs.
While deep learning was first proposed in the 1980s, it has only lately been relevant for two
reasons:

Large volumes of labelled data are required for deep learning. For example, the creation of
self-driving cars involves millions of photos and thousands of hours of video.
Deep learning takes a lot of computational power. The parallel design of high-performance GPUs
is ideal for deep learning. When paired with clusters or cloud computing, development teams can
reduce deep learning network training time from weeks to hours or even minutes.
The deep neural network has a strong feature representation capability in image processing,
and it is frequently employed for feature extraction in object detection. Deep learning models do
not always require additional assistance, and they can be developed and used as classifier and
regression instruments.
As a result, deep learning technology has a bright future in the field of object detection. The goal
of object detection is to figure out where objects are physically located in a given image
(object localization) and to classify them. The traditional object detection pipeline is therefore
divided into three stages: informative region selection, feature extraction, and detection.
2.2 Details of Literature Review
The paper introduces traditional vision tactics and discusses the similarities and differences
between traditional vision strategies and deep learning algorithms in object detection. It
describes the emergence of deep learning-assisted recognition algorithms and elaborates on the
most common deep learning-assisted object detection strategies now in use. The paper covers the
model's structure, style, and working rule, as well as the model's performance over time and
detection accuracy. Finally, it analyses the limitations of object detection using deep learning and
provides several options for consideration.
The precision and recall across each of the best-matched bounding boxes for the known objects
in the image are used to evaluate a model's object detection performance. The paper describes the
emergence of deep learning-based object detection algorithms and the most common
approaches now used in deep learning-based object detection. The study focuses on the
framework design and model operating principles, as well as the model's real-time performance
and detection accuracy. Object detection has long been a fascinating study direction and emphasis
in computer vision, with applications in autonomous vehicles, robotics, video surveillance, and
pedestrian identification. The introduction of deep learning technology has altered the old
methods of object detection and identification. In image processing, the deep neural network has
a strong feature representation capacity and is commonly employed as the feature extraction
module in object detection.[1]

In this research, the authors look into the ability of DNNs for object detection, trying not only
to classify but also to precisely localise objects. The problem is complex, since many object
instances of different sizes must be found within the same image while using a limited amount
of computing resources. The paper presents a formulation that can predict the bounding boxes
of many objects in a given image.
Furthermore, the authors formulate a DNN-based regression that outputs a binary mask of
the object bounding box (and of parts of the box as well), and use a simple bounding box
inference to extract detections from the masks. To increase localization precision, the DNN mask
generation is applied in a multi-scale fashion, on the entire image as well as on a small number
of large image crops, followed by a refinement step.
Object detection is thus formulated as a regression problem onto object bounding box masks in
this simple but strong formulation. A multi-scale inference technique lets a handful of network
applications produce high-resolution object detections at a low cost. The approach's improved
performance is illustrated on Pascal VOC. To evaluate the algorithm's performance,
precision-recall curves and average precision (AP) per class are employed.
These findings come at a computational expense during training, because a network must be
trained for each object type and mask type. They show that DNN-based regression can learn
features that are useful not only for classification but also for capturing substantial geometric
information. The general architecture proposed for classification is adopted, with a regression
layer substituted for the last layer. The somewhat surprising but powerful conclusion is that
networks that encode translation invariance to some extent can also capture object positions. To
obtain exact detections, multi-scale box inference is followed by a refinement process. In this
method, a DNN that predicts pixel-wise precision for a low-resolution mask, limited by the
output layer size, can be applied at a cheap cost: the network is only applied a few hundred
times per input image. There is no need to hand-design a model that explicitly captures parts
and their relationships. This simplicity has the benefit of being easily adaptable to a large number
of classes, as well as superior detection performance across a wider range of objects, both rigid
and deformable. In the future, the authors hope to lower the cost by using a single network to
detect objects from many categories, and so expand to a wider range of categories.[2]

In this paper, we see a saliency-inspired neural network detection model that predicts a set of
class-agnostic bounding boxes together with a single score for each box, namely its probability
of containing any object of interest. At the highest levels of the network, the model automatically
handles a variable number of instances for each category and allows for cross-class
generalisation. Competitive recognition performance was achieved on VOC2007 and ILSVRC2012
by utilising only the top few predicted locations in each image and a small number of neural
network evaluations.
The authors propose to train a detector called "DeepMultiBox" that generates a small number of
bounding boxes as object candidates. A single Deep Neural Network (DNN) generates these
boxes in a category-agnostic manner. The model makes several contributions. First, object
detection is formulated as a regression problem over the coordinates of several bounding boxes.
The second key contribution is the loss that trains the bounding box predictors as part of the
network training.

Bounding box: the upper-left and lower-right coordinates of each box are encoded as four node
values, which can be written as a vector l_i ∈ R^4. These coordinates are normalized w.r.t. the
image dimensions to achieve invariance to absolute image size; the normalized coordinates are
produced by a linear transformation of the last hidden layer.

Confidence: the confidence that the box contains an object is encoded as a single node value
c_i ∈ [0, 1]. This value is produced by a linear transformation of the last hidden layer followed
by a sigmoid.
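Putting the two outputs together, the training objective of the DeepMultiBox paper matches predicted boxes to ground-truth boxes and combines a localization term with a confidence term; a sketch of that loss (the weighting α is a hyperparameter) is:

F_match(x, l) = (1/2) Σ_ij x_ij ||l_i − g_j||²
F_conf(x, c) = −Σ_ij x_ij log(c_i) − Σ_i (1 − Σ_j x_ij) log(1 − c_i)
F = F_match + α F_conf

where x_ij ∈ {0, 1} indicates that the i-th predicted box has been matched to the j-th ground-truth box g_j.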

Finally, the object box predictor is trained in a class-agnostic way. This can be seen as a step
toward scalable detection, given the cost of detecting a wide range of object classes. The
experiments show that by simply post-classifying ten boxes obtained from a single network
application, a competitive detection result can be achieved. The final detection score is calculated
by multiplying the localizer score for the supplied box by the classifier score for the largest
square region around the crop. The precision-recall curves were computed using these scores,
which were provided to the evaluation.
In the future, the authors expect to combine the localization and recognition approaches into a
single network, allowing both location and class-label information to be extracted in a single
feed-forward pass.[3]

Road condition recognition, face detection, and cuisine recognition are just a few of the
applications that leverage domain-specific datasets. Another essential domain-specific topic,
driven by numerous security and safety needs in maritime contexts, is object recognition. An
autonomous ship equipped with an Automatic Identification System (AIS), for example, requires
safe navigation, which is accomplished by detecting nearby objects. This is a difficult task,
because objects at sea change dynamically with environmental conditions such as sunlight, fog,
rain, wind, and light reflection. Furthermore, depending on the angle, the same ship can appear
in a variety of shapes, while ships on the water usually have a wide-open view because the ocean
is open. The authors divided the SMD into three sections: training, validation, and testing. They
also published benchmark results for object detection with the Mask R-CNN model using the
split datasets. Their benchmark results, on the other hand, were for object detection only, with no
additional classification for each discovered object. In fact, the majority of previous studies that
used the dataset focused solely on object detection. However, for applications in marine security,
such as the use of Unmanned Surface Vehicles (USV), the type of detected object must
additionally be defined. The SMD can be used for both object detection and classification
problems, because the original SMD includes the class labels of the objects as well as their
bounding box information.[4]

Object detection, counting objects, security tools, and so on are some of the applications. Object
tracking is a popular image processing technique with a bright future ahead of it. Thanks to
deep learning, computer vision, and machine learning, MOT (multiple object tracking) has grown
significantly in recent years. The goal of this study is to develop software that keeps track of
objects and can handle object lists and counts. The system aims at object identification, tracking,
and counting using YOLO ("You Only Look Once") technology and PyTorch. In addition, unlike
the generic YOLO object detection tool, which detects all items at the same time, this MOT
system identifies only the objects that the user requires.[5]

Deep neural networks, particularly convolutional neural networks, are used to detect objects.
PASCAL VOC 2012, which has 20 labels, was used as the dataset. The dataset is widely used
in image recognition, object detection, and other types of image processing. Supervised learning
with decision trees or, more likely, SVMs can also be used to solve the problem.
However, neural networks work best in image processing because they handle images well.
Detecting a specific object in a complicated image with many lines and forms is known as object
detection. Face detection, object tracking, image retrieval, and automatic parking systems all use
object detection, and the number of applications is steadily increasing. Image classification, or
more precisely image retrieval, is the most common use of object detection. Deep neural networks
are useful for comprehending convolutional neural networks, and papers on deep neural networks
are used to investigate their concepts. Object detection is also employed in other sectors such as
defence and architecture.[6]

2.3 Models/Algorithms Used


Models and algorithms used in the papers referred were
● In the first paper R-CNN and YOLO models were used

Fig 2:-Yolo Architecture

Fig 3:-R-CNN Architecture
● In the Second paper DNN-Based Regression algorithm is used.

Fig 4:-DNN Based Regression


● In Third Paper DNN Algorithm has been used

Fig 5:- Deep Neural Network


● In Fourth Paper YoloV5 has been used

Fig 6:- Yolov5 Architecture

● In Fifth paper Neural Net and YOLOV5 is used

Fig 7 :- Process Description for Object Detection

Fig 8:- Detailed Process of YoloV5 used in the paper

● In Sixth Paper Yolo and SSD Algorithms are used

Fig 9:-SSD Architecture

Fig 10:- Working Example of SSD Model

2.4 Findings & Gaps Identified


Most of the papers used deep neural networks and a YOLO-based approach to detect objects.
● It is found that, in order to detect objects, we need an annotated dataset and must
perform validation checks.
● Data pre-processing plays a very important role in object detection.
● Deep Neural Networks, Single Shot Detector (SSD), R-CNN, YOLOv2 and YOLOv5 can be
used as object detection methods on our dataset.

Sl No | Concept | Performance | Advantages | Disadvantages | Reference

1. Concept: The paper introduces traditional object identification methods and explains the relationship and differences between them and deep learning methods in object detection. The study focuses on the framework design and model operating principles, as well as the model's real-time performance and detection accuracy.
Performance: The precision and recall of a model for object detection are measured across each of the best-matched bounding boxes for the known objects in the image. The paper describes the emergence of deep learning-based object identification algorithms and the most common approaches now used in deep learning-based object detection.
Advantages: Object detection is inextricably linked to other computer vision techniques like image segmentation and image recognition, which help us comprehend and evaluate scenes in videos and photos.
Disadvantages: Because of the complicated data models, training is quite costly. Deep learning also necessitates the use of pricey GPUs and hundreds of workstations, which raises the price for users.
Reference: [1]

2. Concept: DNNs are used to solve the problem of object detection, not only categorising items but also striving to precisely place them. The problem is difficult, since a high number of object instances of different sizes must be detected in the same image while utilising a limited quantity of computer resources. A concept is provided for estimating the bounding boxes of several objects in a single image, with a DNN-based regression that produces a binary mask of the object bounding box.
Performance: DNNs differ significantly from standard classification algorithms. For starters, they are deep architectures, which can learn more complex models than shallow architectures. Because of this expressivity and the robustness of the training techniques, sophisticated object representations can be learned without the need to hand-design features. This has been empirically verified across hundreds of classes on the difficult ImageNet classification problem.
Advantages: Part-based models inspired deep architectures for object detection and parsing, which are known as compositional models since the object is expressed as a layered composition of picture primitives.
Disadvantages: It requires a very large amount of data.
Reference: [2]

3. Concept: The paper proposes to train a detector called "DeepMultiBox" that generates a small number of bounding boxes as object candidates. Object detection is defined as a regression problem over the coordinates of multiple bounding boxes; in addition, for each predicted box, the network outputs a confidence value that the box may contain an object. This is very different from the traditional approach with predefined fields, and the very compact number of boxes used to represent object detections makes it an efficient method.
Performance: The method achieves class-agnostic, scalable object detection by predicting a set of bounding boxes for potential objects. Specifically, a deep neural network (DNN) outputs a fixed number of bounding boxes; in addition, each box has a score representing the network's confidence that the box contains an object. The final detection score is calculated by multiplying the localizer score for the supplied box by the classifier score for the largest square region around the crop. The precision-recall curves were computed using these scores, which were provided to the evaluation.
Advantages: As a fundamental feature extraction and learning model, the technique employs a deep convolutional neural network. It develops a multiple-box localization cost that can take advantage of a configurable number of ground-truth sites of interest in a given image and learn to anticipate these locations in unseen images.
Disadvantages: Although this number does not grow linearly with the number of classes to be recognised, the proposed method is still only about as competitive as DPM-like methods.
Reference: [3]

4. Concept: For the evaluation of DNN algorithms, the paper corrects the annotations of the SMD dataset and presents an upgraded variant, SMD-Plus. The authors also propose training improvements devised particularly for SMD-Plus; in particular, the proposed "Online Copy & Paste" scheme proved effective in alleviating the class-imbalance problem. The SMD-Plus dataset and the modified YOLO-V5 are available for future research, in the hope that the detection-then-classification YOLO-V5 model trained on SMD-Plus serves as a benchmark for future work on autonomous tracking in maritime surroundings.
Performance: The MATLAB ImageLabeler tool was used to correct the SMD's ground truth; its interface makes it simple to produce video clips and add annotations to each object. In both YOLO-V4 and all versions of YOLO-V5, detection performance on SMD-Plus increased by more than 10% compared to the SMD. The difficulty with detecting only foreground and background is that it can be used to assess bounding box detection accuracy but not class-label recognition accuracy; as a result, the data can be used to confirm the bounding box accuracy of the model.
Advantages: Copy & Paste was executed before training as an offline pre-processing technique.
Disadvantages: One of the most problematic aspects of object recognition is that an object might appear radically different depending on the angle from which it is seen.
Reference: [4]

5. Concept: The goal of this study is to develop software that keeps track of objects and can handle object lists and counts. The system aims at object identification, tracking, and counting using YOLO ("You Only Look Once") technology and PyTorch. In addition, unlike the generic YOLO object detection tool, which detects all items at the same time, this MOT system identifies only the objects that the user requires.
Performance: YOLO's performance is primarily assessed using three terms: mAP, precision, and recall. mAP is a metric that combines recall and precision to measure detection correctness. It is produced using the average precision value for recall values ranging from 0 to 1 and IoU (intersection over union) values ranging from 0.5 to 0.95. Precision refers to the accuracy with which objects are predicted, i.e., how well the model predicts the positive class. mAP is used to determine YOLO accuracy.
Advantages: This MOT system has various real-time applications, like detecting particular objects in crowded environments, tracking a particular type of object, detecting a set of object classes, or counting a particular object.
Disadvantages: Object motion, changing appearance patterns of both the item and the scene, non-rigid object structures, object-to-object and object-to-scene occlusions, and camera motion can all cause tracking problems.
Reference: [5]

6. Concept: The dataset is widely used in image recognition, object detection, and other types of image processing. Supervised learning using decision trees or, more likely, SVMs can be used to solve the problem.
Performance: Evaluation methodology: the final detection score is the product of the localizer score for the given box multiplied by the score of the classifier evaluated on the maximum square region around the crop. These scores were used for computing precision-recall curves.
Advantages: The region-based convolutional neural network is more optimized at a very basic level. Another researcher could engender new parameters and achieve lower error rates, but one cannot argue that R-CNN is better than the other neural nets.
Disadvantages: It is disputed whether this can be called the best solution to the problem or not; the result is valid only for certain parameters.
Reference: [6]

Chapter 3
Methodology
3.1 Introduction
Object detection's main goal is to identify and find one or more effective targets in still or
video data. It covers a wide range of techniques, including image processing, pattern
recognition, artificial intelligence, and machine learning. Object detection creates bounding
boxes around identified items, allowing us to see where they are in (and how they move
through) a scene. The components or patterns of an object in a picture that help to identify it
are called features. For example, a square contains four corners and four edges, which are
known as square characteristics and help us humans recognise it as such. The technique of
detecting a target object in an image or a single frame of video is known as object detection.
The goal of object detection is to find essential items, create rectangular bounding boxes
around them, and classify each one. Object detection has applications in a variety of
industries, including traffic sensors, robotics, people detection in security, animal detection in
agriculture, AI-assisted vehicle detection in transportation, and medical feature detection in
healthcare.
Object detection is inextricably linked to other computer vision techniques like image
segmentation and image recognition, which help us comprehend and evaluate scenes in
videos and photos. Object detection is traditionally thought to be far more difficult than
image classification, owing to five challenges: dual priorities (classification and localization),
speed, multiple scales, limited data, and class imbalance. A feature is a piece of information
about the content of an image in computer vision and image processing, usually concerning
whether a certain portion of the image has certain attributes. Specific structures in the image,
such as points, edges, or objects, might be used as features.

3.2 Data Sources & Format

Primary source: a custom dataset, i.e., 500 images for various brand logos and smoking, and
800 images for alcohol. We collected the images from various sources such as Google Images
and Shutterstock; all images are in .jpg format. After collecting the images, we
annotate them using the LabelImg software. LabelImg is a graphical image annotation tool.
It is written in Python and uses Qt for its graphical interface. Annotations are saved either as XML
files in PASCAL VOC format (the format used by ImageNet) or as .txt files in YOLO format. The
details of the objects in each photograph are stored in whichever format the user selects. We'll use
an easy-to-use programme called LabelImg to create these text files for the images, which allows
you to draw visual boxes around your objects in the photographs. It also stores the bounding box
values (.txt files) for your photographs automatically.

3.3 Data Preprocessing – Extraction & preparation

● The process of converting raw data into a comprehensible format is known as data
preparation. Certain procedures are followed to convert the data into a small, clean
dataset; this sequence of steps is called data preprocessing. It contains:

o Data Preparation: the set of collected images is read in the LabelImg software to
create corresponding text files. The final dataset will therefore contain the images,
their corresponding text files, and a classes.txt file that lists the labels in
the dataset.

Example: logo images of 10 brands, 50 each, gives 500 images; the corresponding

text files bring the total to 1000, and the classes.txt file makes it 1001

files.

o Data Cleaning: data preprocessing is mostly used to ensure data quality. We

must first read the directory of photos; the practice of removing undesired noise
from data is known as data cleaning.

In the dataset we use, a data validation check needs to be done.

The cases are explained below:

Case 1: if both the image and the text file exist, check whether the text file is empty.
If it is empty, either annotate the image or delete both the image and the text file; if
it is not empty, pass.

Case 2: if the image exists but the text file does not, either annotate the image or
delete it.

Case 3: if the image does not exist but the text file does, delete the .txt file.
Code Block - Please Refer Annexure 1
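The annexure holds the project's actual code; below is a minimal sketch of the three checks, assuming .jpg images and .txt labels sit in one hypothetical dataset folder:

import os

dataset_dir = "dataset"  # hypothetical folder of .jpg images and .txt label files

for name in os.listdir(dataset_dir):
    base, ext = os.path.splitext(name)
    if ext == ".jpg":
        txt = os.path.join(dataset_dir, base + ".txt")
        if not os.path.exists(txt):
            # Case 2: image without a label file -> annotate it or delete the image
            print("missing annotation:", name)
        elif os.path.getsize(txt) == 0:
            # Case 1: empty label file -> re-annotate or delete both image and file
            print("empty annotation:", txt)
    elif ext == ".txt" and base != "classes":
        if not os.path.exists(os.path.join(dataset_dir, base + ".jpg")):
            # Case 3: orphan label file -> delete the .txt file
            os.remove(os.path.join(dataset_dir, name))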

o Data Integration: data integration is the process of combining multiple sources
into a single dataset.

We might come across a situation where further processing is needed only

after merging two datasets, e.g., merging a dataset of logos with a dataset of smoking images.

Merging the datasets is not straightforward, because they contain their own labels, and the
class numbering (index values) inside the annotation files must be renumbered accordingly.
Code Block :- Please Refer Annexure 2
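Annexure 2 holds the project's actual code; a minimal sketch of the re-indexing idea follows, assuming two YOLO-format folders and that dataset A's classes.txt holds the first block of labels:

import os

def shift_class_ids(label_dir, offset):
    # YOLO lines look like: "<class_id> <x_center> <y_center> <width> <height>";
    # when appending dataset B to dataset A, shift B's class ids by len(A's classes)
    for name in os.listdir(label_dir):
        if not name.endswith(".txt") or name == "classes.txt":
            continue
        path = os.path.join(label_dir, name)
        with open(path) as f:
            lines = [line.split() for line in f if line.strip()]
        for parts in lines:
            parts[0] = str(int(parts[0]) + offset)
        with open(path, "w") as f:
            f.write("\n".join(" ".join(parts) for parts in lines) + "\n")

# e.g., dataset A has 10 logo classes, so dataset B's ids start at 10
shift_class_ids("dataset_b", offset=10)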

3.4 Data Exploration & Analysis


Many machine learning processes rely on data exploration. However, when it comes to object
detection and image segmentation datasets, there is no simple way to undertake data
exploration in a systematic fashion. LabelImg is a graphical image annotation tool.

It is developed in Python and has a graphical user interface built with Qt.

Annotations are saved in PASCAL VOC (XML) and YOLO file formats; it also supports the
CreateML file format.

Labels designate the components in your data that you wish to train your model to
recognize in unlabeled datasets. High-quality datasets are necessary for computer vision and
for constructing a high-performance model. The garbage-in, garbage-out principle is followed
while creating computer vision models, which means categorizing images thoroughly and
precisely is crucial.

Dealing with standard image datasets differs from working with object and segmentation
datasets in several ways:

● The image and the label are inextricably linked: anything you do to your
photographs must be carefully considered, because it may break the image-label
mapping.
● There are usually many more labels per image.
● There are many more hyperparameters to tune (especially if you train on custom datasets).

The easiest way to download and install LabelImg is via pip, and it assumes you’re running
Python3. Simply run the following in your command line:

pip3 install labelImg
Then, launch LabelImg by typing labelImg in your command line prompt.
If you require more specific instructions based on your machine (e.g. Python 2 on Linux,
Windows, MacOS Catalina, or using LabelImg with Anaconda), please check the
documentation below.

Website:- https://pypi.org/project/labelImg/1.4.0/

LabelImg Installation Guide

Fig 11 :- LabelImg via Command Prompt

Fig 12:- LabelImg Installation in Unix Systems

Windows

Fig 13 :- LabelImg Installation Guide for Windows

Link to the official repository :- GitHub - tzutalin/labelImg: LabelImg is a graphical


image annotation tool and label object bounding boxes in images

Using LabelImg
We address the following scenarios with regard to Unix systems.
To launch the software, we run the command below in the terminal:
python3 labelImg.py
We will see an interactive window of this open-source tool once the above command is
executed.

Fig 14 :- LabelImg Window

Step 1:- Click on Open Dir to choose the location where the image dataset exists.

Step 2:- Click on Change Save Dir to choose the location to store the text files. (Generally
we store the corresponding text files in the same directory as the images.)

Step 3:- Choose the file format you want to save with. There are 3 formats available i.e
PASCAL VOC,YOLO and CreateML.

Step 4:- Click on Create RectBox to draw bounding boxes on the images loaded from the
directory and name it with label. Ex:- Domino’s,Smoking etc

Fig 15:- LabelImg Window Showing Image with Bounding Box

Step 5:- Click on Save so that the generated text file is saved with the bounding-box values.

Step 6:- Click on Next Image and repeat Steps 4 and 5 until all the images are annotated.

Several boxes can be drawn in one image. The tool will prompt you to assign a class; choose the category that you specified in the previous stage.

Finally, alongside each image you will have a text file with the same name containing the image's label data. Object detection is now possible with your data.
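Before moving on to training, it is worth drawing a saved annotation back onto its image as a sanity check. A minimal sketch, assuming Pillow and Matplotlib are installed (file names are illustrative):

import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

image = Image.open("sample.jpg")
w, h = image.size
fig, ax = plt.subplots()
ax.imshow(image)

with open("sample.txt") as f:        # the YOLO label file saved by LabelImg
    for line in f:
        cls, xc, yc, bw, bh = line.split()
        box_w, box_h = float(bw) * w, float(bh) * h  # denormalize the box size
        x0 = float(xc) * w - box_w / 2               # center -> top-left corner
        y0 = float(yc) * h - box_h / 2
        ax.add_patch(patches.Rectangle((x0, y0), box_w, box_h,
                                       fill=False, edgecolor="red"))

plt.show()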

Object detection is one of the most exciting aspects of computer vision since it allows you to
detect and find each individual object in an image, as well as their position and size in
relation to the rest of the image. Deep Learning is at the heart of today's state-of-the-art object
detection models, and the training dataset is at the heart of training deep learning networks.

Images gathered as samples and labelled for deep neural network training make up the
training dataset. There are a variety of formats for preparing and annotating your dataset for
object detection training. The following are the most prevalent formats for annotating your
datasets:

● Pascal VOC

● CreateML

● YOLO

In the YOLO format, the details of the objects in each individual image are stored in text files. To quickly generate these YOLO files, we use LabelImg, an easy-to-use programme that lets you draw bounding boxes around the objects in your images and saves the corresponding text files automatically.

3.5 Process Description

3.5.1 Python Modules used for project implementation

The entire project has been implemented in Python, using the following modules: Sklearn, NumPy, Matplotlib, PyTorch, Minidom, Shutil, Os, and IPython. Each module and its role in the project is described below.

Fig 16:- Packages Used and Cloning the Yolov5 Repo from github

Numpy

NumPy is a general-purpose array-processing package and the fundamental package for scientific computing with Python. It provides a high-performance multidimensional array object and tools for working with these arrays. In this project it is used to convert values into arrays for the correlation plot and for standardization, as well as for data preprocessing and the implementation of the algorithm.

Matplotlib

Matplotlib is a cross-platform data visualization and graphical plotting library for Python and its numerical extension NumPy, offering a viable open-source alternative to MATLAB. In this project it is used for graphical representations in both the exploratory data analysis and the model implementation sections: regression plots, the contour plot, bar chart, pie chart, cross-tab, correlation plots, and box plot are all visualized with it. It also plays an important role in the graphical representation of skewness and outliers.

Sklearn

Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy, and Matplotlib.

PyTorch

PyTorch is a Python package that provides two high-level features:

● Tensor computation (like NumPy) with strong GPU acceleration

● Deep neural networks built on a tape-based autograd system

Usually, PyTorch is used either as:

● A replacement for NumPy to use the power of GPUs.

● A deep learning research platform that provides maximum flexibility and speed.
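As a tiny, hedged illustration of these two points (not project code):

import torch

x = torch.rand(3, 3)              # NumPy-like tensor creation
if torch.cuda.is_available():     # use GPU acceleration when available
    x = x.to("cuda")
y = (x @ x).sum()                 # computation runs on the selected device
print(y.item())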

Minidom
xml.dom.minidom is a stripped-down version of the Document Object Model interface, with an API similar to that in other languages. It is meant to be simpler and much smaller than the full DOM. For XML processing, users who are not familiar with the DOM should consider using the xml.etree.ElementTree module instead.
In this project it is basically used to create document-type output, for example .xml annotation files.
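A minimal sketch of producing an XML document with minidom (the element names here are illustrative, not the exact annotation schema):

from xml.dom import minidom

doc = minidom.Document()
root = doc.createElement("annotation")
doc.appendChild(root)

filename = doc.createElement("filename")
filename.appendChild(doc.createTextNode("sample.jpg"))
root.appendChild(filename)

print(doc.toprettyxml(indent="  "))  # pretty-printed XML output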

Shutil
The shutil module provides high-level file operations such as copy, move, and remove. It falls within the umbrella of Python's basic utility modules and aids in automating the copying and deleting of files and folders.
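For instance, copying an image together with its label file into a training folder, or moving a rejected pair aside, might look like this (paths are illustrative):

import shutil

shutil.copy("dataset/sample.jpg", "train/images/sample.jpg")  # copy an image
shutil.copy("dataset/sample.txt", "train/labels/sample.txt")  # copy its YOLO label
shutil.move("dataset/bad.jpg", "rejected/bad.jpg")            # move a rejected image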

Os
This module allows you to use operating-system-dependent functionality. open() is used to read and write files, the os.path module is used to manipulate paths, and the fileinput module is used to read all the lines in all the files given on the command line. The tempfile module creates temporary files and directories, whereas the shutil module handles high-level file and directory operations.
All of Python's built-in operating-system-dependent modules are designed so that they use the same interface wherever the same capability is available. The os module also provides extensions specific to a given operating system; however, employing them compromises portability. All functions that handle a path or file name accept both bytes and string objects, and if a path or file name is returned, it is an object of the same kind.

IPython
IPython is a robust framework for interactive computing that includes:

● A sophisticated interactive shell.
● A kernel for Jupyter notebooks.
● Support for interactive data visualisation and for GUI toolkits.
● Flexible interpreters that can be embedded in your own projects.
● Easy-to-use, high-performance tools for parallel computing.

3.5.2 YOLOV5

Identifying items in an image is a frequent task for the human brain, but it is not so simple for a machine. Object detection is a computer vision task that involves identifying and localising objects in pictures, and numerous methods have emerged in recent years to address the problem. YOLO (You Only Look Once), originally introduced by Joseph Redmon, is one of the most popular real-time object detection methods to date.

● Glenn Jocher established the project on GitHub under the Ultralytics organisation.
● It was created using the Python programming language and the PyTorch framework.
● It is a collection of object detection models in its own right, ranging from extremely small models capable of real-time FPS on edge devices up to very large, accurate models for cloud GPU deployments; it contains practically everything one would require.
● It also includes a slew of other features and capabilities that make it the go-to object detection model/repository for everyone who even considers object detection today.
● It is evident from the repository that it makes training and inference on customized datasets a cakewalk, so much so that, if you already have a dataset in a suitable format, you can get started with training in under two minutes. However, training and inference aren't everything: it also has a number of other traits that make it truly unique.

The dataset is collected from various sources and is prepared for training once it is annotated.

Most annotation services allow you to export your annotations in the YOLO labelling format, which gives you one annotation text file per image. Each object in the image has one bounding-box (BBox) annotation in that text file. The annotation values are normalized to the image size and therefore fall between 0 and 1. The following format is used to represent them:

<object-class-ID> <x_center> <y_center> <box_width> <box_height>

The content of the YOLO annotations text file might look like this if there are two items in the
image:

Fig 17 :- Annotation File
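For reference, converting a bounding box given in pixel coordinates into this normalized YOLO representation can be sketched as follows (to_yolo is a hypothetical helper for illustration, not part of the repository):

def to_yolo(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    # Convert corner coordinates in pixels to normalized center/width/height
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    box_w = (x_max - x_min) / img_w
    box_h = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"

# e.g. to_yolo(0, 100, 50, 300, 200, 640, 480)
# -> "0 0.312500 0.260417 0.312500 0.312500"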

Here, we will go through all the necessary and important coding parts. These include:

● The dataset preparation.


● Training of the model.
● Performance comparison.
● Inference on images and videos.
Configuration Files
The training configurations are divided into three YAML files that are included with the repo.
Depending on the work, we will modify these files to meet our requirements.

3.5.2.1 Data Configuration file:
The dataset parameters are described in the data-configurations file. Since we are training on our custom dataset, we change this file to provide the paths to the train, validation, and (optional) test datasets, the number of classes (nc), and the names of the classes in the same order as their index; there can be any number of classes. Our custom data-configurations file is named 'coco128.yaml' and is located in the 'data' directory. The following is the content of this YAML file:

Fig 18 :- The data configuration file
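In case the figure does not reproduce clearly, an illustrative file of this kind (the dataset path and class names here are hypothetical) looks like:

path: ../datasets/custom      # dataset root directory (hypothetical)
train: images/train           # training images, relative to 'path'
val: images/val               # validation images
test:                         # test images (optional)
nc: 2                         # number of classes
names: ['dominos', 'smoking'] # class names in index order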

The data is organised as follows to conform with the Ultralytics directory structure:

Fig 19 :- Directory structure


3.5.2.2 Model Configurations file:
The model architecture is determined by the model-configurations file. Five architectures are available: YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), and YOLOv5x (extra large). These architectures are well suited to training with an image size of 640x640 pixels. P6 is an additional series (YOLOv5n6, YOLOv5s6, YOLOv5m6, YOLOv5l6, YOLOv5x6) optimised for training with a larger image size of 1280x1280. P6 models have an additional output layer for detecting larger objects; they benefit the most from higher-resolution training and achieve greater results.
For each of the above architectures, Ultralytics provides built-in model-configuration files in the 'models' directory. If you are starting from scratch, select the model-configurations YAML file with the chosen architecture (in this example, 'YOLOv5s6.yaml'), and simply change the number-of-classes (nc) parameter to the number of classes in your custom data.

Fig 20 :- classes
There is no need to update the model-configurations file when training is started with pre-
trained weights.
3.5.2.3 Hyper-parameters configuration file
The hyperparameters-configurations file specifies the training hyperparameters, such as learning rate, momentum, losses, and augmentations. Ultralytics supplies a default hyperparameters file at 'data/hyp/hyp.scratch.yaml'. To establish a performance baseline, it is usually best to start training with the default hyperparameters.
Training Code
# Train YOLOv5s on custom dataset for 3 epochs
!python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt
--cache

● batch — batch size (-1 for auto batch size). Use the largest batch size that your hardware allows.
● epochs — number of epochs.
● data — path to the data-configurations file.
● cfg — path to the model-configurations file.
● weights — path to initial weights.
● cache — cache images for faster training.
● img — image size in pixels (default — 640).

If the ‘project’ and ‘name’ arguments are supplied, the results are automatically saved there; otherwise, they are saved to the ‘runs/train’ directory. We can view the metrics and losses saved to the results.png file.

Fine-tuning is an optional step in training that entails unfreezing the entire model we created earlier and retraining it on our data with a very small learning rate. By gradually adapting the pretrained features to the fresh data, this can lead to significant improvements. The learning-rate parameter can be changed in the hyperparameters-configurations file.
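Once training has produced a set of weights, inference on images and videos is run with the repository's detect.py script in the same style; a hedged example, assuming the default run directory and an input video named video.mp4:

# Run inference on a video with the best weights from training
!python detect.py --weights runs/train/exp/weights/best.pt --img 640 --conf 0.25 --source video.mp4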

3.6 Workflow diagram

Fig 21 :- Model Workflow

3.7 Hardware & Software specifications

Hardware specifications:

• Operating system: Windows 7 or higher/Linux above 2.5

• Memory: Minimum 4 GB RAM.

• Processor: Minimum Intel(R) Core(TM) i3

• Hard disk: Minimum 500 GB

Software Specifications:

• Tools : LabelImg

LabelImg is a graphical image annotation tool. It is written in Python and uses Qt for its graphical
interface. Annotations are saved as XML files in PASCAL VOC format, the format used by
ImageNet. Besides, it also supports YOLO and CreateML formats.

Coding language: Python

Python Interpreter – Python 3.7, Python 3.8

Python is a computer programming language often used to build websites and software, automate
tasks, and conduct data analysis. Python is a general-purpose language, meaning it can be used to
create a variety of different programs and isn't specialized for any specific problems.

Python 3.7

Python 3.7 adds new classes for data handling, optimizations for script compilation and garbage
collection, and faster asynchronous I/O. Python 3.7, the latest version of the language aimed at
making complex tasks simple, is now in production release.

New Features in Python 3.8

Lists use about 11% less memory in Python 3.8 compared with Python 3.7. Other optimizations include better performance in subprocess, faster file copying with shutil, improved default performance in pickle, and faster operations in the operator module.

IDE for Python:

● Pycharm - PyCharm is a dedicated Python Integrated Development Environment (IDE)


providing a wide range of essential tools for Python developers, tightly integrated to create
a convenient environment for productive Python, web, and data science development.
● Google Colab - Colab allows anybody to write and execute arbitrary python code through
the browser, and is especially well suited to machine learning, data analysis and education.

Chapter 4
Model Details and Implementation

4.1 Model building


YOLOV5
The model built in this project is YOLOv5, introduced in detail in Section 3.5.2. To recap: YOLOv5 is an open-source family of object detection models established by Glenn Jocher under the Ultralytics organisation, written in Python on the PyTorch framework, and ranging from very small real-time models for edge devices to large, accurate models for cloud GPU deployments.

The dataset collected from various sources is prepared for training once it is annotated in the YOLO labelling format: one annotation text file per image, one bounding-box (BBox) annotation per object, with values normalized to the image size so that they fall between 0 and 1:

<object-class-ID> <x_center> <y_center> <box_width> <box_height>

4.2 Data inputs
Configuration Files
As described in Section 3.5.2, the training configuration is divided into three YAML files included with the repo, which we modify to meet our requirements.
4.2.1 Data Configuration file:
The data-configurations file, 'coco128.yaml' in the 'data' directory, specifies the paths to the train, validation, and optional test sets, the number of classes (nc), and the class names in index order (see Section 3.5.2.1). The following is the content of this YAML file:

Fig 22 :- The data configuration file


The data is organised as follows to conform with the Ultralytics directory structure:

Fig 23 :- Directory structure

4.2.2 Model Configurations file:
The model-configurations file determines the model architecture (YOLOv5n/s/m/l/x, plus the P6 series for 1280x1280 training), as detailed in Section 3.5.2.2. Since we start from the 'YOLOv5s6.yaml' file in the 'models' directory, we only change the number-of-classes (nc) parameter to match our custom data:

Fig 24:- classes


There is no need to update the model-configurations file when training is started with pre-trained
weights.
4.2.3 Hyper-parameters configuration file
As explained in Section 3.5.2.3, training starts from the default hyperparameters file supplied by Ultralytics at 'data/hyp/hyp.scratch.yaml' in order to establish a performance baseline.
Training Code

# Train YOLOv5s on custom dataset for 3 epochs


!python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt
--cache

The training arguments (batch, epochs, data, cfg, weights, cache, img) are described in Section 3.5.2.3. If the ‘project’ and ‘name’ arguments are supplied, the results are automatically saved there; otherwise they are saved to the ‘runs/train’ directory, and the metrics and losses can be viewed in the results.png file.

Fine-tuning with a very small learning rate, as described in Section 3.5.2.3, remains available as an optional final training step.

4.3 User interface


Google Colab - Colab allows anybody to write and execute arbitrary Python code through the browser, and is especially well suited to machine learning, data analysis, and education. We likewise use it as the IDE to extract, view, and analyze results.

The outputs are in JPG and video format; we either view the images in the IDE or simply download them to the local system and open them in any image viewer, and the same applies to the videos.

4.4 Results, & Model Accuracy
4.4.1 Case 1:- Dataset-Logos Batch Size-32 Approach-Direct

Fig 25 :- Confusion Matrix

Fig 26 : Labels

Fig 27 :- PR-Curve

Fig 28:-Results

Fig 29 :- Train - batch 0,1,2

Fig 30:- Validation Batch 0 Labels

Fig 31:-Validation Batch 0 Predicted

Fig 32:-Validation Batch 1 Labels

Fig 33:-Validation Batch 1 Predicted

Fig 34:- Validation Batch 2 Labels

Fig 35:- Validation Batch 2 Predicted

4.4.2 Case 2:- Dataset-Logos Batch Size-32 Approach-Split

Fig 36:- Confusion Matrix

Fig 37:- Labels

Fig 38:- PR Curve

Fig 39:- Model Results

Fig 40:- Train Batch 0 Labels

Fig 41:-Train Batch 1 Labels

Fig 42:- Validation Batch 0 Labels

Fig 43:- Validation Batch 0 Predicted

4.4.3 Case 3:- Dataset-Merged Batch Size-32 Approach-Split

Fig 44:- Confusion Matrix

Fig 45:- Labels

Fig 46 :- PR Curve

Fig 47:- Model Results

Fig 48:- Train Batch 0 Labels

Fig 49:- Train Batch 1 Labels

Fig 50:-Validation Batch 0 Labels

Fig 51:- Validation Batch 0 Predicted

Fig 52:- Validation Batch 1 Labels

Fig 53:-Validation Batch 1 Predicted

4.4.4 Case 4:- Dataset:-Merged Batch Size:-64 Approach-Split

Fig 54 :- Confusion Matrix

Fig 55:- Labels

Fig 56:- PR Curve

Fig 57:-Results

Fig 58:- Train Batch - 0,1,2

Fig 59:- Val Batch 0 Labels

Fig 60:- Val Batch 0 Predicted

Fig 61:- Val Batch 1 Labels

Fig 62:- Val Batch 1 Predicted

4.4.5 Case 5:- Dataset:-Video

Fig 63:- Video Output Results(1)

Fig 64:- Video Output Results(2)

Fig 65:- Video Output Results(3)

Fig 66 :- Video Output Results(4)

4.5 Result Analysis & Discussion
PR Curve - A PR curve is a graph in which the y-axis represents Precision and the x-axis represents Recall. In other words, the PR curve has TP/(TP+FP) on the y-axis and TP/(TP+FN) on the x-axis. It is worth noting that Precision is sometimes referred to as the Positive Predictive Value (PPV), while Recall is also known as Sensitivity, Hit Rate, or True Positive Rate (TPR).
Precision highlights the relevance of the retrieved results: it measures how many of the bbox predictions are correct (True positives / (True positives + False positives)).
Recall measures how many of the true bboxes were correctly predicted (True positives / (True positives + False negatives)).
Confusion Matrix-A confusion matrix is a table that shows how well a classification model
(or "classifier") performs on a set of test data for which the true values are known. The confusion
matrix itself is straightforward, but the associated nomenclature might be perplexing.

Accuracy = Number of correct predictions / Total number of predictions

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False
Negatives.
True Positive: How many times did the model correctly categorise a Positive sample as Positive?
False Negative: How many times did the model categorise a Positive sample as Negative?
False Positive: How many times did the model wrongly categorise a Negative sample as Positive?
True Negative: How many times did the model properly categorise a Negative sample as Negative?
box_loss — bounding box regression loss (Mean Squared Error).
obj_loss — the confidence of object presence is the objectness loss (Binary Cross Entropy).
cls_loss — the classification loss (Cross Entropy).
‘mAP_0.5’ is the mean Average Precision (mAP) at an IoU (Intersection over Union) threshold of 0.5.
‘mAP_0.5:0.95’ is the average mAP over different IoU thresholds, ranging from 0.5 to 0.95.
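As a minimal sketch, the precision, recall, and accuracy formulas above can be computed directly from confusion-matrix counts (the counts below are made-up examples):

# Hypothetical counts taken from a confusion matrix
tp, fp, fn, tn = 80, 10, 20, 90

precision = tp / (tp + fp)                   # fraction of predictions that are correct
recall = tp / (tp + fn)                      # fraction of true boxes that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall fraction of correct decisions
print(precision, recall, accuracy)           # 0.888..., 0.8, 0.85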

Fig 67:- Case 1:- Model Results

Fig 68:- Case 2:- Model Results

Fig 69:- Case 3:- Model Results

Fig 70:- Case 4:- Model Results

Chapter 5
Conclusion

5.1 Summary
● Data is collected from primary sources such as Google.
● The data is preprocessed and annotated using LabelImg to create a dataset for YOLOv5 training.
● It is then validated against specific conditions.
● Where necessary, multiple datasets are merged to perform object detection using YOLOv5.
● The model is trained using Google Colab, which also serves as the UI to view the output.
● The model is trained with different datasets and batch sizes.
● The PR curve, [email protected], and [email protected]:0.95 are the evaluation metrics for the model.
● Box loss, objectness loss, and classification loss are further evaluation metrics.
● Using the best weights of the model, we test on video data to observe the accuracy.
5.2 Limitations
● Time consumption for data collection and preprocessing to create a dataset as required.
● Multiple Aspect Ratios and Spatial Sizes-As objects vary widely in size and aspect ratio, it is difficult for detection algorithms to identify different objects at different scales and views.
● Viewpoint Variation-Objects can look entirely different when viewed from a different
angle. As most of the models are tested in ideal scenarios, it is a formidable task for
detectors to recognize different objects from different viewpoints.
● Occlusion-Some items are difficult to notice because they are only partially visible.
Objects that take up a larger portion of the screen are easier to grasp than objects that take
up a smaller portion of the screen.
● Object Positioning-The classification of items and finding their location are the two
most difficult aspects of object detection.
● Cluttered or textured background-If an image's background is textured or busy, the
object of interest is at risk of being lost in the background. The object has a potential of
camouflaging in this situation, making it difficult for detectors to distinguish distinct
things of interest.

5.3 Future Scope
Object detection technology's future is still being proven, and like the first Industrial Revolution, it has the potential to free people from tiresome work that machines can accomplish more efficiently and effectively. It will also open up new research and operational possibilities, which will yield more benefits in the future. A remaining challenge is to reduce the need for the extensive training, and the correspondingly large datasets, currently required to perform more sophisticated jobs. With continuing evolution, as well as the devices and techniques that enable it, object detection could soon become the next big thing.

References
[1] Pranita Jadhav,Vrushali Koli,2020,Object Detection using Deep Learning,International
Research Journal of Engineering and Technology (IRJET)-Volume: 07 Issue: 09.
Website:- https://www.irjet.net/archives/V7/i9/IRJET-V7I943.pdf

[2] Christian Szegedy, Alexander Toshev, Dumitru Erhan,2013,Deep Neural Networks for Object
Detection.
Website:- http://papers.neurips.cc/paper/5207-deep-neural-networks-for-object-detection.pdf

[3] Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov, 2014, Scalable Object Detection using Deep Neural Networks
Website:-
https://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Erhan_Scalable_Object_D
etection_2014_CVPR_paper.pdf

[4] Jun-Hwa Kim , Namho Kim, Yong Woon Park and Chee Sun Won,2022,Object Detection
and Classification Based on YOLO-V5 with Improved Maritime Dataset,Journal of Marine
Science and Engineering
Website:- https://www.mdpi.com/2077-1312/10/3/377

[5] Abhinu C G, Aswin P, Kiran Krishnan, Bonymol Baby,2021,Multiple Object Tracking using
Deep Learning with YOLO V5,ISSN: 2278-0181
Website:-
https://www.ijert.org/research/multiple-object-tracking-using-deep-learning-with-yolo-v5-IJERT
CONV9IS13010.pdf

[6] Mrs. Swetha M S, Ms. Veena M Shellikeri , Mr. Muneshwara M S , Dr. Thungamani
M,Survey of Object Detection using Deep Neural Networks-International Journal of Advanced
Research in Computer and Communication Engineering(IJARCCE)-Vol. 7, Issue 11, November
2018
Website:- https://ijarcce.com/wp-content/uploads/2019/02/IJARCCE.2018.71104.pdf

[7] Yang Yang , Guang Shu , Mubarak Shah,Semi-supervised Learning of Feature Hierarchies for
Object Detection in a Video.
Website:-https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/Yang_Semi-supe
rvised_Learning_of_2013_CVPR_paper.pdf

[8] Oscar Chang, Patricia Constante, Andres Gordon, Marco Singana, Deep neural network that uses space-time features for tracking and recognizing a moving object
Website:-
https://bibliotekanauki.pl/api/full-texts/2020/12/10/d30c532a-f0f0-4fd9-a4e9-8548e757e56d.pdf

[9] Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu & Matti Pietikäinen, Deep Learning for Generic Object Detection: A Survey, International Journal of Computer Vision, volume 128.
Website:- Deep Learning for Generic Object Detection: A Survey | SpringerLink

[10] https://viso.ai/computer-vision/video-analytics-ultimate-overview/

[11] https://www.mathworks.com/discovery/object-detection.html#:~:text=Object%20detection%
20is%20a%20computer,learning%20to%20produce%20meaningful%20results.

[12] www.geeksforgeeks.org

[13] https://docs.ultralytics.com/

[14] https://colab.research.google.com/github/ultralytics/yolov5/blob/master/tutorial.ipynb

[15] https://towardsdatascience.com/

[16] https://pypi.org/project/yolov5/

