
COLLEGE OF TECHNOLOGY AND ENGINEERING

MAHARANA PRATAP UNIVERSITY OF AGRICULTURE & TECHNOLOGY

UDAIPUR (RAJ.)

PROJECT REPORT

ON

OBJECT DETECTION USING YOLOV8

Submitted in partial fulfillment of the requirements for the award of the Degree of Bachelor of Technology
in the Department of Computer Science & Engineering
(Session 2021-2025)

Submitted To:                                   Submitted By:

Dr. Kalpana Jain,                               Anjali Singh
HOD (Head of Department)                        Harsha Rajawat
Computer Science and Engineering                Sakshi Soni
                                                Tanu Sharma
                                                B.Tech. Final Year, AI & DS
ACKNOWLEDGEMENT

It is a well-known fact that the talent of a person must be nurtured properly to
obtain the best results. The main requirement for this is proper guidance along with
the right direction so that the desired goal can be achieved. Making optimum use
of each opportunity is very important.

First of all, we thank all those people who helped us with their guidance and
assistance, without which this project would not have been successful.

We acknowledge the kind grace of our Dean, Dr. Kalpana Jain, for providing us
with a great opportunity and giving us full support and cooperation.

We are very grateful to Dr. Kalpana Jain (H.O.D., C.S.E.) (Project In-charge),
College of Technology and Engineering, Udaipur (Raj.), who actively supported
our project by letting us select the project topic and offered us guidance.

We are also very grateful to Dr. Kalpana Jain for her guidance and constant
supervision, as well as for providing the necessary information regarding the report.

We cannot express our thanks in words to our parents, who have given us this
opportunity, and to our family members for their great support and love.

Submitted By:
Anjali Singh
Harsha Rajawat
Sakshi Soni
Tanu Sharma
B. Tech. Final Year, AI & DS
DECLARATION

We hereby declare that the project titled “Object Detection” has been developed by
us and has not been reproduced as-is from any other source. It has been submitted in
partial fulfillment of the requirement for the award of Bachelor of Technology in
Computer Science and Engineering, MPUAT Udaipur, and has not been
submitted anywhere else for the award of any other degree.

Date: 16th May, 2025                            Name of Students:

Place: Udaipur                                  Anjali Singh
                                                Harsha Rajawat
                                                Sakshi Soni
                                                Tanu Sharma
COLLEGE OF TECHNOLOGY AND ENGINEERING

MAHARANA PRATAP UNIVERSITY OF AGRICULTURE & TECHNOLOGY

UDAIPUR (RAJ.)

CERTIFICATE

This is to certify that the project entitled “Object Detection” has been completed
and submitted by Anjali Singh, Harsha Rajawat, Sakshi Soni, and Tanu Sharma in
partial fulfillment of the requirement for the award of Bachelor of Technology
in Computer Science and Engineering from the College of Technology and
Engineering, a constituent college of Maharana Pratap University of Agriculture and
Technology, Udaipur.

Dr. Kalpana Jain


Head of Department
Computer Science and Engineering
College of Technology and Engineering, Udaipur
TABLE OF CONTENTS

S.No.   TOPIC                                               PAGE No.

        List of Tables                                      i
        List of Figures                                     ii
        Abbreviations                                       iii
1.      Abstract                                            1
2.      Chapter 1: Introduction                             2
3.      Chapter 2: Vision Behind the Project                3
        2.1: Background and Significance of the Project     5
        2.2: Objectives of the Project                      7
        2.3: Scope of the Project                           9
        2.4: Methodology and Approach                       10
4.      Chapter 3: Overview of Object Detection             12
        3.1: Object Detection
        3.2: How Computer Vision Works
        3.3: History of Computer Vision
        3.4: Computer Vision Example
        3.5: Introduction to Object Detection
        3.6: Learning Object Detection
        3.7: Evaluation Metrics – Mean Average Precision
        3.8: Object Detection Algorithms
5.      Chapter 4: YOLO (You Only Look Once)
        4.1: Evolution from YOLO to YOLOv8
        4.2: Why We Used YOLOv8
6.      Chapter 5: Results and Discussion
        5.1: Codes and Output
        5.2: Results and Discussion
7.      Chapter 6: Conclusion and Future Scope
        6.1: Conclusion
        6.2: Future Scope
8.      References
LIST OF FIGURES

● Project front-end UI
● Example of object detection
● Difference between image classification and object detection
● Prediction of object
● Problem task
● Target class and target value
● Original bounding box
● Predicted bounding box
● Area of intersection
● Intersection over union
● Range of IoU
● Area of union
● Precision score
● Working of R-CNN
● Faster R-CNN
● Architecture of SSD
● Working of YOLO
Abbreviations
●​ AI: Artificial Intelligence

●​ YOLO: You Only Look Once

●​ CV: Computer Vision

●​ CNN: Convolutional Neural Network

●​ RNN: Recurrent Neural Network

●​ RCNN: Region-based Convolutional Neural Network

● mAP: Mean Average Precision

● API: Application Programming Interface

●​ GPU: Graphics Processing Unit

●​ IOU: Intersection over Union

●​ ICR: Intelligent Character Recognition

●​ SSD: Single Shot Detector

● SPP-Net: Spatial Pyramid Pooling Network

●​ FPN: Feature Pyramid Network

● HOG: Histogram of Oriented Gradients

●​ MLOps: Machine Learning Operations

●​ COCO: Common Objects in Context

●​ OCR: Optical Character Recognition

●​ HTML: HyperText Markup Language

●​ JS: JavaScript

●​ UI: User Interface

●​ UX: User Experience

●​ HTTP: HyperText Transfer Protocol


ABSTRACT
The Object Detection System project is designed to develop an intelligent visual
recognition solution capable of identifying and localizing multiple objects within
images or video streams in real-time. Leveraging deep learning techniques and
state-of-the-art convolutional neural network (CNN) architectures such as YOLO
(You Only Look Once), this system can detect, classify, and track objects with
high accuracy and speed.

This tool aims to address real-world challenges across various domains, including
surveillance, autonomous vehicles, smart retail, and industrial automation. The
system is developed using modern frameworks like TensorFlow, PyTorch, and
OpenCV, ensuring scalability, adaptability, and seamless integration into existing
workflows. Key features include real-time detection, confidence scoring,
bounding box visualization, and support for custom datasets to suit specific
application needs.

The project emphasizes modularity and user-friendliness, with a flexible design


that allows easy deployment on edge devices or cloud platforms. With a focus on
performance optimization and accuracy enhancement, the Object Detection
System provides actionable insights that drive intelligent decision-making in
automated environments.

Ultimately, this project showcases the transformative potential of computer vision


and AI, contributing to smarter, safer, and more efficient systems in a wide range
of industries.
Chapter 1

INTRODUCTION

The rapid evolution of artificial intelligence and computer vision has


revolutionized how machines interpret and interact with visual data. One of the
most impactful applications of this advancement is object detection, a
technology that enables systems to identify and locate multiple objects within
digital images or real-time video streams. From enhancing surveillance systems
to enabling autonomous vehicles and powering retail analytics, object detection
has become a cornerstone in numerous modern technological solutions.

This project explores the design and development of an intelligent object


detection system, utilizing deep learning techniques and advanced neural network
architectures such as YOLO (You Only Look Once). The system is built with a
focus on accuracy, speed, and adaptability, allowing for real-time analysis and
recognition of diverse object categories in dynamic environments.

In this report, we present the objectives, methodologies, tools, and


implementation process behind the development of the object detection solution.
Emphasis is placed on the training and evaluation of models using labelled
datasets, integration with real-time camera feeds, and optimization for
performance. The project also demonstrates practical applications across sectors
like security, transportation, and smart automation, showcasing the transformative
potential of AI-driven visual recognition.
Chapter 2

Vision Behind the Project


The vision behind the Object Detection System is to create a smart, efficient, and
highly accurate visual recognition solution that transforms the way machines
perceive and interpret visual information. The system is designed to be adaptable,
scalable, and capable of operating in real-time across diverse environments and
applications. By harnessing the power of deep learning and computer vision, this
project aims to bridge the gap between raw visual data and actionable insights,
enabling intelligent automation and enhanced decision-making.

2.1 Background and Significance of the Project:


In today’s data-driven and automation-focused world, the ability to accurately
interpret visual information is becoming increasingly critical across a wide range
of industries. From security and surveillance to autonomous vehicles, healthcare
diagnostics, retail analytics, and industrial automation, the need for intelligent
systems that can detect and recognize objects in real-time has never been more
significant. Traditional image processing methods often lack the precision,
adaptability, and scalability required to meet modern demands, leading to the
growing importance of deep learning-based object detection systems.

This Object Detection System project was conceived to address these challenges
by developing a robust and efficient solution capable of accurately identifying
and localizing objects in both static images and dynamic video streams. The
purpose of the system is to enhance automation, improve operational safety, and
enable faster, data-driven decisions through real-time visual analysis.

By leveraging cutting-edge technologies such as convolutional neural networks


(CNNs), transfer learning, and frameworks like YOLO and Faster R-CNN, the
system provides a comprehensive approach to visual recognition.
2.2 Objectives of the Project:
●​ To develop a robust system capable of accurately detecting and classifying
multiple objects in images or video streams.

●​ To enable real-time object recognition for use in dynamic and time-sensitive


environments.

●​ To improve the automation and efficiency of visual monitoring tasks across


various industries.

●​ To reduce human error and workload in object identification processes.

●​ To provide actionable insights through accurate localization and labelling of


objects.

●​ To ensure scalability and adaptability for integration into different platforms


and use cases (e.g., surveillance, transportation, healthcare, retail).

2.3​Scope of the Project:


The scope of the Object Detection Project encompasses the development of an
intelligent and efficient visual recognition system capable of identifying,
classifying, and localizing objects in images and video streams. The system is
designed to support a wide range of applications by incorporating key
functionalities, including:

●​ Real-Time Detection: Accurate identification and localization of multiple


objects in live video feeds or static images.

●​ Object Classification: Categorization of detected objects into predefined


classes using advanced machine learning models.

●​ Bounding Box Generation: Visualization of object positions through


dynamically generated bounding boxes.
●​ Custom Dataset Support: Flexibility to train and adapt models to
domain-specific objects and scenarios.

●​ Performance Metrics: Monitoring accuracy, precision, recall, and inference


time to ensure optimal system performance.

●​ Scalability: Capability to deploy across various platforms including edge


devices, web applications, and cloud environments.

●​ Integration: Support for integration with other systems (e.g., surveillance,


automation, or robotics) for enhanced decision-making and automation.

2.4 Methodology and Approach:


●​ The development of the Object Detection System follows the principles of Agile
methodology, particularly the Scrum framework. This iterative approach supports
continuous feedback, rapid prototyping, and adaptability to evolving project
requirements. The system leverages powerful deep learning frameworks such as
TensorFlow and PyTorch, alongside supportive technologies like OpenCV for
image processing. These tools ensure that the solution is accurate, efficient, and
scalable for deployment across various environments, including web, cloud, and
edge devices.

●​ In conclusion, the Object Detection System is positioned to become a valuable


asset across multiple industries, enabling intelligent automation, enhanced safety,
and real-time visual analysis. The subsequent chapters will explore the system’s
core functionalities in detail, showcasing how each component contributes to
meeting the project's goals and addressing real-world challenges in visual
recognition.
Chapter 3

Overview of Object Detection

The Object Detection Project is a robust AI-powered platform designed to identify,


classify, and localize objects within digital images and video streams. This solution
supports a wide range of applications—from security surveillance and autonomous
vehicles to retail analytics and industrial automation. The platform provides
end-to-end capabilities, including dataset management, model training and
evaluation, real-time inference, and performance monitoring.

This chapter outlines the core components and features of the Object Detection
System, detailing how each contributes to the efficient development, deployment,
and operation of intelligent visual recognition models.

By consolidating these features within a unified system, the Object Detection Project
empowers data scientists, ML engineers, and domain specialists to collaborate
seamlessly, reduce development time, and increase the accuracy and reliability of
detection results. Each module is designed with user experience and scalability in
mind, offering intuitive controls and flexible configuration options to support
evolving project needs.

In the following sections, we will delve into the primary modules of the Object
Detection Platform, beginning with the Data Ingestion Interface. This will provide a
clear view of how raw visual data is transformed into structured, actionable insights.

The entry point to the system is the Login Page. It ensures secure access by
authenticating all users before allowing entry into the system. Role-based access
control is enforced, allowing administrators to manage team roles, assign privileges,
and monitor user activity—ensuring both operational security and accountability
throughout the object detection lifecycle.

Technical Implementation:

• Frontend: The website is built using HTML and styled with CSS, ensuring a
responsive and clean design.

• Backend: The input image or video data is passed to the detection model, which
identifies the objects present in the input (a minimal sketch of this flow is shown below).
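
As an illustration of this backend flow, the following is a minimal sketch of how an
uploaded image could be passed to a YOLOv8 model through the Ultralytics Python API;
the file names and confidence threshold are placeholders chosen only for this example.

from ultralytics import YOLO

# Load a pretrained YOLOv8 model (the "nano" variant is the smallest and fastest).
model = YOLO("yolov8n.pt")

# Run inference on an uploaded image; conf filters out low-confidence detections.
results = model("uploaded_image.jpg", conf=0.25)

# Each result holds bounding boxes, class ids, and confidence scores.
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()   # corner coordinates of the bounding box
    label = model.names[int(box.cls[0])]    # class name, e.g. "car"
    score = float(box.conf[0])              # confidence score
    print(f"{label}: {score:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")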

3.1 Object Detection:
Object detection is a computer vision technique that uses neural networks to
localize and classify objects in images. It has a wide range of applications,
from medical imaging to self-driving cars. As such, it is an instance
of artificial intelligence that consists of training computers to see as humans do,
specifically by recognizing and classifying objects according to semantic
categories. Object localization is a technique for determining the location of
specific objects in an image by demarcating each object with a bounding box.
Object classification is another technique that determines to which category a
detected object belongs. The object detection task combines the subtasks of object
localization and classification to simultaneously estimate the location and type of
object instances in one or more images.

Object detection overlaps with other computer vision techniques, but developers
nevertheless treat it as a discrete endeavor.

Compared with image classification, object detection delineates individual objects in an
image according to specified categories. While image classification divides images between
those that contain stop signs and those that do not, object detection locates and categorizes
all of the road signs in an image, as well as other objects such as cars and people.

Computer vision is a field of artificial intelligence (AI) that uses machine


learning and neural networks to teach computers and systems to derive meaningful
information from digital images, videos and other visual inputs—and to make
recommendations or take actions when they see defects or issues.

Computer vision works much the same as human vision, except humans have a
head start. Human sight has the advantage of lifetimes of context to train how to
tell objects apart, how far away they are, whether they are moving or something
is wrong with an image.
Computer vision trains machines to perform these functions, but it must do it in
much less time with cameras, data and algorithms rather than retinas, optic nerves
and a visual cortex. Because a system trained to inspect products or watch a
production asset can analyze thousands of products or processes a minute,
noticing imperceptible defects or issues, it can quickly surpass human
capabilities.

Computer vision is used in industries that range from energy and utilities to
manufacturing and automotive, and the market is continuing to grow. It was
projected to reach USD 48.6 billion by 2022.

3.2​ How Computer Vision Works:

Computer vision needs lots of data. It runs analyses of the data over and over until it
discerns distinctions and ultimately recognizes images. For example, to train a
computer to recognize automobile tires, it needs to be fed vast quantities of tire
images and tire-related items to learn the differences and recognize a tire,
especially one with no defects.

Two essential technologies are used to accomplish this: a type of machine


learning called deep learning and a convolutional neural network (CNN).

Machine learning uses algorithmic models that enable a computer to teach itself
about the context of visual data. If enough data is fed through the model, the
computer will “look” at the data and teach itself to tell one image from another.
Algorithms enable the machine to learn by itself, rather than someone
programming it to recognize an image.
A CNN helps a machine learning or deep learning model “look” by breaking
images down into pixels that are given tags or labels. It uses the labels to perform
convolutions (a mathematical operation on two functions to produce a third
function) and makes predictions about what it is “seeing.” The neural network
runs convolutions and checks the accuracy of its predictions in a series of
iterations until the predictions start to come true. It is then recognizing or seeing
images in a way similar to humans.

Much like a human making out an image at a distance, a CNN first discerns hard
edges and simple shapes, then fills in information as it runs iterations of its
predictions. A CNN is used to understand single images. A recurrent neural
network (RNN) is used in a similar way for video applications to help computers
understand how pictures in a series of frames are related to one another.
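
To make the idea of a convolution concrete, the short sketch below applies a single
convolutional layer to a random image-sized tensor using PyTorch. It is only an
illustration of the operation a CNN repeats many times, not part of the project code,
and the sizes used are arbitrary.

import torch
import torch.nn as nn

# A stand-in 3-channel 224x224 "image" (batch size of 1).
image = torch.rand(1, 3, 224, 224)

# One convolutional layer: 16 learned filters, each sliding a 3x3 window over the image.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# The output is a stack of 16 feature maps, e.g. responses to edges and simple shapes.
feature_maps = conv(image)
print(feature_maps.shape)  # torch.Size([1, 16, 224, 224])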

3.3 The History of Computer Vision:

Scientists and engineers have been trying to develop ways for machines to see
and understand visual data for about 60 years. Experimentation began in 1959
when neurophysiologists showed a cat an array of images, attempting to correlate
a response in its brain. They discovered that it responded first to hard edges or
lines; scientifically, this meant that image processing starts with simple shapes
like straight edges.

At about the same time, the first computer image scanning technology was
developed, enabling computers to digitize and acquire images. Another milestone
was reached in 1963 when computers were able to transform two-dimensional
images into three-dimensional forms. In the 1960s, AI emerged as an academic
field of study and it also marked the beginning of the AI quest to solve the human
vision problem.

1974 saw the introduction of optical character recognition (OCR) technology,
which could recognize text printed in any font or typeface. Similarly, intelligent
character recognition (ICR) could decipher hand-written text using neural
networks. Since then, OCR and ICR have found their way into document and
invoice processing, vehicle plate recognition, mobile payments, machine
translation, and other common applications.

In 1982, neuroscientist David Marr established that vision works hierarchically


and introduced algorithms for machines to detect edges, corners, curves and
similar basic shapes. Concurrently, computer scientist Kunihiko Fukushima
developed a network of cells that could recognize patterns. The network, called
the Neocognitron, included convolutional layers in a neural network.

By 2000, the focus of study was on object recognition, and by 2001, the first
real-time face recognition applications appeared. Standardization of how visual
data sets are tagged and annotated emerged through the 2000s. In 2010, the
ImageNet data set became available. It contained millions of tagged images
across a thousand object classes and provides a foundation for the CNNs and deep
learning models used today. In 2012, a team from the University of Toronto
entered a CNN into an image recognition contest. The model, called AlexNet,
significantly reduced the error rate for image recognition. After this
breakthrough, error rates have fallen to just a few percent.

3.4 Computer Vision Example:


Many organizations don’t have the resources to fund computer vision labs and
create deep learning models and neural networks. They may also lack the
computing power that is required to process huge sets of visual data. Companies
such as IBM are helping by offering computer vision software development
services. These services deliver pre-built learning models available from the
cloud—and also ease demand on computing resources. Users connect to the
services through an application programming interface (API) and use them to
develop computer vision applications.

IBM has also introduced a computer vision platform that addresses both
developmental and computing resource concerns. IBM Maximo® Visual
Inspection includes tools that enable subject matter experts to label, train and
deploy deep learning vision models—without coding or deep learning expertise.
The vision models can be deployed in local data centers, the cloud and edge
devices.

While it’s getting easier to obtain resources to develop computer vision


applications, an important question to answer early on is: What exactly will these
applications do? Understanding and defining specific computer vision tasks can
focus and validate projects and applications and make it easier to get started.

Here are a few examples of established computer vision tasks:

●​ Image classification sees an image and can classify it (a dog, an apple, a person’s
face). More precisely, it is able to accurately predict that a given image belongs to
a certain class.
● Object detection can use image classification to identify a certain class of image
and then detect and tabulate its appearances in an image or video. Examples
include detecting damage on an assembly line or identifying machinery that
requires maintenance.

●​ Object tracking follows or tracks an object once it is detected. This task is often
executed with images captured in sequence or real-time video feeds. Autonomous
vehicles, for example, need to not only classify and detect objects such as
pedestrians, other cars and road infrastructure, they need to track them in motion
to avoid collisions and obey traffic laws.[7]​

●​ Content-based image retrieval uses computer vision to browse, search and


retrieve images from large data stores, based on the content of the images rather
than metadata tags associated with them. This task can incorporate automatic
image annotation that replaces manual image tagging. These tasks can be used
for digital asset management systems and can increase the accuracy of search and
retrieval.

3.5 Introduction to Object Detection:

Computer vision has advanced considerably but is still challenged in matching
the precision of human perception. This section introduces object detection from
scratch, as it can be challenging for beginners to distinguish between different
related computer vision tasks.

Humans can easily detect and identify objects present in an image. The human
visual system is fast and accurate and can
perform complex tasks like identifying multiple objects and detecting obstacles
with little conscious thought. With the availability of large amounts of data, faster
GPUs, and better algorithms, we can now easily train computers to detect and
classify multiple objects within an image with high accuracy.

With this kind of identification and localization, you can use object detection to
count objects in a scene, determine their precise locations, and track them while
accurately labeling them.

Object detection, within computer vision, involves identifying objects within


images or videos. These algorithms commonly rely on machine learning or deep
learning methods to generate valuable outcomes.

Now let’s simplify this statement a bit with the help of the below image.

So instead of classifying which type of dog is present in these images, we have
to actually locate the dog in the image. That is, we have to find out where the dog
is present in the image. Is it at the center or at the bottom left? And so on. The
next question that comes to mind is: how can we do that? Well, we can
create a box around the dog that is present in the image and specify the x and y
coordinates of this box.

For now, consider that you can represent the location of the object in the image as
coordinates of these boxes. This box around the object is formally known as a
bounding box. This situation creates an image localization problem where you
receive a set of images and must identify where the object is present in each
image.

Example:
In this image, we have to locate the objects in the image but note that all the
objects are not dogs. Here we have a dog and a car. So we not only have to locate
the objects in the image but also classify the located object as a dog or Car. So
this becomes an object detection problem.

This section will also discuss a few points regarding image classification; in
particular, we will compare image classification with object detection.

In the case of object detection problems, we have to classify the objects in the
image and also locate where these objects are present in the image. But the image
classification problem had only one task where we had to classify the objects in
the image.
So, in the first example image, we predict only the target class, and we refer
to such tasks as image classification problems. In the second case, along
with predicting the target class, we also have to find the bounding box which
denotes the location of the object.

This is what the object detection problem looks like. So broadly
we have three tasks for object detection problems:

●​ To identify if there is an object present in the image,


●​ Where is this object located,
●​ What is this object?

Specific to this example, we have an object in the image. We can create a


bounding box around the object and this object is an emergency vehicle.

Now the object detection problem can also be divided into multiple categories.

First is the case when you have images that have only one object. That is you can
have 1000 images in the data set, and all of these images will have only one
object. And if all these objects belong to a single class, that is all the objects are
cars, then this will be an image localization problem.
Another problem could be where you are provided with multiple images, and
within each of these images, you have multiple objects. Also, these objects can be
of the same class, or another problem can be that these objects are of different
classes.

So in case you have multiple objects in the image and the objects are of
different classes, you would have to not only locate the objects but also classify
them.

The next section will discuss the problem statement for object detection.

Why Object Detection Matters?

●​ Safety: It helps keep us safe by spotting dangers and intruders.


●​ Driving: It’s crucial for self-driving cars to avoid accidents.
●​ Shopping: It helps stores manage products and understand customers.
●​ Healthcare: Doctors use it to find diseases early in medical images.
●​ Manufacturing: It ensures products are made correctly in factories.

How Object Detection Works:

Looking at the Picture: Imagine a computer looking at a picture.


Finding Clues: The computer looks for clues like shapes, colors, and patterns
in the picture.
Guessing What’s There: Based on those clues, it makes guesses about what
might be in the picture.
Checking the Guesses: It checks each guess by comparing it to things it
already knows.
Drawing Boxes: If it is pretty sure about something, it draws a box around it to
show where it thinks the object is (a small sketch of this step is shown after the list).
Making Sure: Finally, it double-checks its guesses to make sure it got things
right and fixes any mistakes.
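
The "drawing boxes" step can be illustrated with OpenCV. The sketch below assumes the
detections are already available as (label, score, box) tuples from some detector; the
values and file names are made up for this example.

import cv2

# Hypothetical detections: (label, confidence, (x1, y1, x2, y2)) from any detector.
detections = [("dog", 0.91, (40, 60, 220, 300)), ("car", 0.87, (260, 120, 520, 330))]

image = cv2.imread("input.jpg")  # placeholder image path
for label, score, (x1, y1, x2, y2) in detections:
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)      # draw the bounding box
    cv2.putText(image, f"{label} {score:.2f}", (x1, y1 - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)    # draw the label text
cv2.imwrite("output.jpg", image)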

3.6 Learning Object Detection:


In the last section, we discussed the object detection problem and how it is
different from a classification problem. We also discussed that there are broadly
three tasks for an object detection problem.

Now in this section, we'll understand what the data would look like for an object
detection task.

So, let’s first take an example from the classification problem. In the below
image, we have an input image and a target class against each of these input
images.
Now, suppose the task at hand is to detect the cars in the images. In that case, we will
not only have an input image but also a target variable that contains the bounding box
denoting the location of the object in the image.

So, in this case, the target variable has five values: the value p denotes the
probability of an object being present in the image, whereas the four values Xmin,
Ymin, Xmax, and Ymax denote the coordinates of the bounding box. Let us
understand how these coordinate values are calculated.

So, consider the x-axis and y-axis defined over the image. The Xmin
and Ymin values represent the top-left corner of the bounding box, while Xmax and
Ymax represent the bottom-right corner. Now, note that the target variable
answers two questions:

1. Is there an object present in the image?

Answer: If an object is not present then p will be zero and when there is an object
present in the image p will be one.

2. if an object is present in the image where is the object located?

Answer: You can find the object location using the coordinates of the bounding
box.

So far, all the images have had a single class, that is, just a car. What happens when
there are more classes? In that case, this is what the target variable would look
like.

So, if you have two classes, an emergency vehicle and a non-emergency
vehicle, you'll have two additional values c1 and c2 denoting to which class
the object present in the image belongs.

So if we consider this example, we have the probability of an object present in the
image as one. We have the given Xmin, Ymin, Xmax, and Ymax as the
coordinates of the bounding box. And then we have c1 equal to 1, since this is
an emergency vehicle, and c2 equal to 0, since it is not a non-emergency vehicle.
This is what the training data should look like.

Now, let's say we build a model and get some predictions from it; this is a
possible output that you can get from the model. The probability that an object is
present in this predicted bounding box is 0.8. You have the coordinates of this
blue bounding box, which are (40, 20) and (210, 180), along with the class values
of c1 and c2.

So now we understand what an object detection problem is and what the
training data for such a problem would look like.
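
As a concrete illustration of this target format, the small sketch below (using NumPy,
with the coordinate values taken from the example above and the rest made up) builds
target vectors of the form [p, Xmin, Ymin, Xmax, Ymax, c1, c2].

import numpy as np

# An image containing an emergency vehicle: object present (p = 1), c1 = 1, c2 = 0.
target_with_object = np.array([1, 40, 20, 210, 180, 1, 0], dtype=np.float32)

# An image with no object at all: p = 0, and the remaining values are ignored.
target_without_object = np.array([0, 0, 0, 0, 0, 0, 0], dtype=np.float32)

# A model prediction for the first image might look like this (p = 0.8, as in the example above).
predicted = np.array([0.8, 40, 20, 210, 180, 0.9, 0.1], dtype=np.float32)
print(target_with_object, predicted, sep="\n")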

Before moving into depth, we need to know a few concepts regarding images
such that:

●​ How to do Bounding Box Evaluation?


●​ How to calculate IoU?
●​ Evaluation Metric – mean Average Precision
●​ Bounding Box Evaluation – Intersection over Union (IoU)

In this section, we are going to discuss a very interesting concept, which is
intersection over union (IoU). We are going to use this in order to
determine the target variable for the individual patches that we have created.

So, consider the following scenario. Here we have two bounding boxes, box1 and
box2. Now if I ask you which of these two boxes is more accurate, the obvious
answer is box1.

Why? Because it covers a major region of the WBC (white blood cell) and has
correctly detected the WBC. But how can we find this out mathematically?

So, compare the actual and the predicted bounding boxes. If we are able to find
out the overlap of the actual and the predicted bounding box, we will be able to
decide which bounding box is a better prediction.
So the bounding box that has a higher overlap with the actual bounding box is a
better prediction. Now, this overlap is called the area of intersection for this first
box, which is box1. We can say that the area of intersection is about 70% of the
actual bounding box.

Whereas, if you consider box2, the area of intersection of the second bounding
box, and the actual bounding box is about 20 %.

So we can say that of these two bounding boxes obviously, box1 is a better
prediction. But having the area of intersection alone is not enough.
Scenario 1: Let's consider another example. Suppose we have created multiple
bounding boxes or patches of different sizes.

Here, the intersection of the left bounding box is certainly 100%, whereas in the
second image, the intersection of the predicted bounding box, or this particular
patch, is just 70%. So at this stage, would you say that the bounding box on the
left is a better prediction? Obviously not. The bounding box on the right is more
accurate.

So, to deal with such scenarios, we also consider the area of union, which is the
patch area, as well as the actual bounding box area.

So, the larger this area of union (the blue region) is relative to the intersection, the
less accurate the predicted bounding box, or the particular patch, will be. This
combined measure is known as intersection over union (IoU).

So here we have the formula for the intersection over union, which is the area of the
intersection divided by the area of union.
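
Written symbolically (with B_pred denoting the predicted bounding box and B_actual
the actual bounding box, names introduced only for this note), the same definition is:

    \mathrm{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}
                 = \frac{|B_{pred} \cap B_{actual}|}{|B_{pred} \cup B_{actual}|}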
Now, what would be the range of IoU? Let's consider some extreme scenarios.
So in case we have our actual bounding box and predicted bounding box, and
both of these have no overlap at all, the area of the intersection will
be zero, whereas the area of union will be the sum of the areas of the two boxes. So,
overall, the IoU would be zero.

Scenario 2:

Another possible scenario could be when both the predicted bounding box and
the actual bounding box completely overlap.

In that case, the area of the intersection will be equal to this overlap, and the area
of union will also be the same. Since the numerator and the denominator would
be the same in this case, the IoU would be 1.

So, basically, the range of IoU or intersection over union is between 0 and 1.

Now we often consider a threshold in order to identify if the predicted bounding
box is the right prediction. Let's say the IoU is greater than a threshold, which
can be, for example, 0.5 or 0.6. In that case, we will consider that the actual bounding
box and the predicted bounding box are quite similar.

Whereas if the IoU is less than a particular threshold, we’ll say that​
the predicted bounding box is nothing close to the actual bounding box.

Example: We have to identify the target, that is, whether a WBC is present in either of
these patches.
So we can consider the intersection over union with a particular threshold. Let's
say if the IoU value is greater than 0.5, we'll classify the particular patch as having
a WBC, and if the IoU is less than this particular threshold, we can say that the
particular patch does not have a WBC.

We are obviously free to set this threshold at our own end.

Apart from determining target labels, IoU can also be used:

●​ For selecting the best bounding box


●​ As an evaluation Metric
If the intersection over union is high, then the predicted bounding boxes are
close to the actual bounding boxes, and we can say that the model is performing
well. Hence IoU can also be used as an evaluation metric. In the next
section, we'll learn how to calculate the IoU for bounding boxes.
Calculating IoU
In this section, we’ll learn how to calculate the IoU value or the intersection over
the union.

This will also be helpful for understanding the code for intersection over union.
As discussed in the last section, in order to calculate the IoU value, we need the
area of intersection as well as the area of union.

Now the question is, how do we find out these two values? So to find out the area
of intersection, we need the area of this blue box. And we can calculate that using
the coordinates for this blue box.

So the coordinates will be Xmin, Ymin, Xmax, and Ymax; using these
coordinate values, we will easily be able to calculate the area of intersection. So let's
focus on determining the value of Xmin here.

In order to find out the value of Xmin, we are going to use the Xmin values for
these two bounding boxes, which are represented as X1min and X2min.
Now, as you can see in the diagram above, the Xmin for this blue bounding box is
simply equivalent to X2min. We can also say that the Xmin for this blue box will
always be the maximum of the two values X1min and X2min.

Similarly, in order to find out the value of Xmax for this blue bounding box, we are
going to compare the values X1max and X2max. We can see that the Xmax for this
blue bounding box is equivalent to X1max. It can also be written as the minimum of
X1max and X2max.

Similarly, in order to find out the values for Ymin and Ymax, we are going to
compare Y1min with Y2min, and Y1max with Y2max. The value of Ymin will
simply be the maximum of Y1min and Y2min, as you can see
here.

And similarly, the Ymax will be the minimum of Y1max and Y2max.


Now once we have these four values which are Xmin, Ymin, Xmax, and Ymax.

We can calculate the area of intersection by multiplying the length and the width
of this rectangle, which is the blue rectangle right here.
So to find out the length, we are going to subtract Xmin from Xmax. And to find
out the height (or the width here), we are going to take the difference between
Ymax and Ymin. Once we have the length and width, the area of the intersection
will simply be the length multiplied by the width. So now we understand how to
calculate the area of intersection.

Area of union

Next, the focus is on calculating the area of union. So in order to calculate the
area of union, we are going to use the coordinate values of these two bounding
boxes which are the green bounding box and the red bounding box.

Now note that, when we are calculating the areas of box1 and box2, we are
actually counting the blue shaded region twice, since it is part of the green
rectangle as well as the red rectangle. Because this part is counted twice, we'll have
to subtract it once in order to get the area of union.

So the area of union will finally be the sum of the area of box1 and the
area of box2, minus the intersection area, since that has been
counted twice.
So now we have the area of intersection for two bounding boxes and also have
the area of union for two bounding boxes. Now we can simply​
calculate the intersection over union as the area of the intersection divided by the
area of union.
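
The whole calculation can be written in a few lines of Python. The sketch below follows
the steps described above, with boxes given as (Xmin, Ymin, Xmax, Ymax) tuples; the
sample boxes at the end are made up for illustration.

def iou(box1, box2):
    # Coordinates of the intersection rectangle (max of the mins, min of the maxes).
    xmin = max(box1[0], box2[0])
    ymin = max(box1[1], box2[1])
    xmax = min(box1[2], box2[2])
    ymax = min(box1[3], box2[3])

    # Area of intersection (zero if the boxes do not overlap at all).
    intersection = max(0, xmax - xmin) * max(0, ymax - ymin)

    # Area of union = area of box1 + area of box2 - intersection (otherwise counted twice).
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / union if union > 0 else 0.0

# Example with made-up boxes; an IoU above a threshold such as 0.5 counts as a good prediction.
print(iou((40, 20, 210, 180), (50, 30, 220, 190)))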

3.7 Evaluation Metric – Mean Average Precision:

Now, we are going to discuss some popularly used evaluation metrics for object
detection using deep learning.

Evaluation Metrics for Object Detection:

●​ Intersection over union(IoU)


●​ Mean Average Precision(mAP)

So we have previously discussed intersection over union and how it can be used
to evaluate model performance by comparing the predicted bounding boxes with
the actual bounding boxes. Another popularly used metric is mean average
precision. So in this section, we will understand what mean average precision is
and how it can be used.
Mean Average Precision

Now, I'm sure you're familiar with the metric precision, which is the number of
true positives divided by the sum of true positives and false positives. In other
words, it is the fraction of predicted positives that are actually positive.

Now, let’s take an example to understand, how precision is calculated. So let’s


say if we have a set of bounding box predictions. Along with that, we have the
IoU score which we calculated by comparing these bounding box predictions
with the actual bounding boxes.

Now, let’s say we have a threshold of 0.5.

So in that case, we would be able to classify these predictions as true positives


and false positives. Once we have the total number of true positives and false
positives, we would be able to calculate the precision rate. So the precision, in
this case, is 0.6.
Now there’s another metric which is average precision. So average precision
basically calculates the average of the precision values across the data.

So let’s understand this with an example of how it works that will give you a
better idea of what average precision is.

Example:

So we saw that in this above image example, we have five bounding boxes with
their IoU scores, and based on the IoU score we can define if this bounding box is
a true positive or a false positive. Now, we calculate the precision for this
particular scenario where we are only considering the bounding box1.

Let’s break down object detection for machine learning. We’re talking about how
well a system can spot objects in images. Now, let’s get into the numbers.
Imagine we’re looking at the first box around an object. If it’s correctly identified
(a true positive), we give it a score of one.

The bottom number of our precision calculation is the total of true positives and
false positives. In this case, it’s also one. So, the precision for this box is one.
Even if there’s a false positive, we keep the precision value the same. We repeat
this process for the other boxes. Say we’re checking the third box and find a true
positive. Now, we have two true positives in total. The sum of true positives and
false positives is three. So, the precision at this point is calculated as 2 divided by
3, which equals 0.66.

Similarly, we would calculate for all the bounding boxes. So for the fourth
bounding box, we’ll have three true positives and a total number of 4 true
positives and false positives. Hence, this value would be 3 by 4 or 0.75.

Once we calculate all the precision values for the bounding boxes, we will take
an average of these values, known as interpolated precision, to determine the
average precision.

Mean Average Precision

Now, mean average precision is simply calculated across all the classes.

So let’s say we have multiple classes or let’s say we have k classes, then for each
individual class, we’ll calculate this average precision, and take an average across
all the classes. This would give you the mean average precision. So this is how
mean average precision is calculated for the object detection problems and is
used as an evaluation metric to compare and evaluate the performance of these
object detectors.
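
A simplified sketch of this calculation is shown below: it takes the precision measured at
each true positive, as in the walkthrough above, averages those values to get the average
precision for one class, and then averages across classes to get mAP. Real evaluators (for
example the COCO protocol) interpolate precision over recall levels, so this is only
illustrative.

def average_precision(is_true_positive):
    # is_true_positive: detections for one class, sorted by confidence; True = TP, False = FP.
    precisions = []
    true_positives = 0
    for i, tp in enumerate(is_true_positive, start=1):
        if tp:
            true_positives += 1
            precisions.append(true_positives / i)   # precision after the first i predictions
    return sum(precisions) / len(precisions) if precisions else 0.0

detections = [True, False, True, True, False]   # same pattern as the walkthrough above
print(average_precision(detections))            # (1/1 + 2/3 + 3/4) / 3 ≈ 0.81

# Mean average precision: the average of the per-class AP values (illustrative numbers).
per_class_ap = {"car": 0.81, "person": 0.65}
print(sum(per_class_ap.values()) / len(per_class_ap))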

3.8 Object Detection Algorithms:


Since the popularization of deep learning in the early 2010s, there’s been a
continuous progression and improvement in the quality of algorithms used to
solve object detection. We’re going to explore the most popular algorithms while
understanding their working theory, benefits, and their flaws in certain scenarios.

1.​ Histogram of Oriented Gradients (HOG)

→ Introduction

The Histogram of Oriented Gradients is one of the oldest methods of object


detection. It was first introduced in 1986. Despite some developments in the
upcoming decade, the approach did not gain a lot of popularity until 2005 when it
started being used in many tasks related to computer vision. HOG uses a feature
extractor to identify objects in an image.​

The feature descriptor used in HOG is a representation of a part of an image
where we extract only the most necessary information while disregarding
anything else. The function of the feature descriptor is to convert the overall size
of the image into the form of an array or feature vector. In HOG, we use the
gradient orientation procedure to localize the most critical parts of an image.

→ Overview of architecture

HOG – Object Detection Algorithm

Before we understand the overall architecture of HOG, here’s how it works. For a
particular pixel in an image, the histogram of the gradient is calculated by
considering the vertical and horizontal values to obtain the feature vectors. With
the help of the gradient magnitude and the gradient angles, we can get a clear
value for the current pixel by exploring the other entities in their horizontal and
vertical surroundings.​

As shown in the above image representation, we’ll consider an image segment of
a particular size. The first step is to find the gradient by dividing the entire
computation of the image into gradient representations of 8×8 cells. With the help
of the 64 gradient vectors that are achieved, we can split each cell into angular
bins and compute the histogram for the particular area. This process reduces the
size of 64 vectors to a smaller size of 9 values.​

Once we obtain the 9-point histogram values (bins) for each cell, we can
choose to create overlaps for the blocks of cells. The final steps are to form the
feature blocks, normalize the obtained feature vectors, and collect all the feature
vectors to get the overall HOG feature.
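
For reference, the scikit-image library exposes this pipeline directly. The sketch below
(assuming scikit-image is installed; the image path is a placeholder) uses the 8x8 cells
and 9 orientation bins described above.

from skimage import color, io
from skimage.feature import hog

image = color.rgb2gray(io.imread("person.jpg"))   # placeholder image path

features = hog(
    image,
    orientations=9,            # 9-bin histogram of gradient orientations per cell
    pixels_per_cell=(8, 8),    # 8x8 pixel cells
    cells_per_block=(2, 2),    # blocks of cells used for normalization
    block_norm="L2-Hys",
)
print(features.shape)          # one long feature vector describing the whole image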

→ Achievements of HOG
1.​ Creation of a feature descriptor useful for performing object detection.

2.​ Ability to be combined with support vector machines (SVMs) to achieve


high-accuracy object detection.

3.​ Creation of a sliding window effect for the computation of each position.

→ Points to consider

1.​ Limitations – While the Histogram of Oriented Gradients (HOG) was quite
revolutionary in the beginning stages of object detection, there were a lot of
issues in this method. It’s quite time-consuming for complex pixel computation in
images, and ineffective in certain object detection scenarios with tighter spaces.

2.​ When to use HOG? – HOG should often be used as the first method of object
detection to test other algorithms and their respective performance. Regardless,
HOG finds significant use in most object detection and facial landmark
recognition with decent accuracy.

3. Example use cases – One of the popular use cases of HOG is in pedestrian
detection due to its smooth edges. Other general applications include object
detection of specific objects; a short OpenCV sketch of the pedestrian use case follows below.
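
OpenCV ships a HOG descriptor with a pre-trained pedestrian detector, which is a
convenient way to try this use case; the image paths below are placeholders.

import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("street.jpg")                         # placeholder image path
boxes, weights = hog.detectMultiScale(image, winStride=(8, 8))

# Draw a rectangle around every detected pedestrian.
for (x, y, w, h) in boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 0, 255), 2)
cv2.imwrite("street_detected.jpg", image)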

2.​ Region-based Convolutional Neural Networks (R-CNN)

→ Introduction

The region-based convolutional neural networks are an improvement in the object
detection procedure over the previous methods of HOG and SIFT. In the R-CNN
models, we try to extract the most essential regions (usually around 2,000 region
proposals) by making use of selective search. The process of selecting the most
significant regions is computed with the help of a selective search algorithm that
generates these important region proposals.

→ Working process of R-CNN

R-CNN – Object Detection Algorithm

The selective search algorithm selects the most important region proposals by
generating multiple sub-segmentations of a particular image and selecting the
candidate entries for the task. A greedy algorithm is then used to recursively
combine the smaller segments into suitable larger segments.

Once the selective search algorithm is successfully completed, our next tasks are
to extract the features and make the appropriate predictions. We can then make
the final candidate proposals, and the convolutional neural networks can be used
for creating an n-dimensional (either 2048 or 4096) feature vector as output. With
the help of a pre-trained convolutional neural network, we can achieve the task of
feature extraction with ease.

The final step of R-CNN is to make the appropriate predictions for the image
and label the respective bounding boxes accordingly. In order to obtain the best
results for each task, the predictions are made by computing a classification
model for each class, while a regression model is used to correct the bounding
box coordinates for the proposed regions.

→ Issues with R-CNN

1. Despite producing effective results for feature extraction with the pre-trained
CNN models, the overall procedure of extracting all the region proposals, and
ultimately the best regions, with the current algorithms is extremely slow.
2. Another major drawback of the R-CNN model is not only the slow rate of
training but also the high prediction time. The solution requires the use of large
computational resources, reducing the overall feasibility of the process. Hence,
the overall architecture can be considered quite expensive.
3. Sometimes, bad candidate selections occur at the initial step, because no
learning takes place at that stage to improve the proposals. This can cause a lot
of problems in the trained model.

→ Points to consider

1.​ When To Use R-CNN? – R-CNN, similar to the HOG object detection method,
must be used as a first baseline for testing the performance of the object detection
models. The time taken for predictions of images and objects can take a bit longer
than anticipated, so usually the more modern versions of R-CNN are preferred.
2. Example use cases – There are several applications of R-CNN for solving
different types of tasks related to object detection, for example, tracking objects
from a drone-mounted camera, locating text in an image, and enabling object
detection in Google Lens.
3.​ Faster R-CNN

→ Introduction

While the R-CNN model was able to perform object detection and achieve desirable results, it
had some major shortcomings, especially in terms of speed. Faster methods therefore had to be
introduced to overcome the problems that existed in R-CNN. Fast R-CNN was introduced first, to
combat some of these pre-existing issues.

In the Fast R-CNN method, the entire image is passed through the pre-trained convolutional
neural network instead of processing every sub-segment separately. Region of interest (RoI)
pooling is a special layer that takes two inputs (the feature map from the pre-trained model
and the region proposals from the selective search algorithm) and produces a fixed-size output
for the fully connected layers. In this section, we will learn more about the Faster R-CNN
network, which is an improvement on the Fast R-CNN model.
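As an illustration of the RoI pooling step described above, the following is a small sketch using torchvision.ops.roi_pool; the feature map and the boxes are dummy values, whereas in Fast R-CNN the feature map would come from the backbone CNN and the boxes from selective search.

import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)            # (batch, channels, H, W)

# Region proposals in (x1, y1, x2, y2) coordinates of the original image.
proposals = torch.tensor([[ 20.,  30., 200., 180.],
                          [100., 120., 400., 380.]])

# spatial_scale maps image coordinates onto the smaller feature map,
# e.g. 50 / 800 if the input image was 800 pixels wide.
pooled = roi_pool(feature_map, [proposals], output_size=(7, 7), spatial_scale=50 / 800)

print(pooled.shape)   # (num_proposals, 256, 7, 7), ready for the fully connected layers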

→ Understanding Faster R-CNN

The Faster R-CNN model is one of the best versions of the R-CNN family and improves speed
tremendously over its predecessors. While the R-CNN and Fast R-CNN models make use of a
selective search algorithm to compute the region proposals, Faster R-CNN replaces it with a
superior region proposal network (RPN). The RPN processes the image over a wide range of scales
and aspect ratios to produce effective proposals.
Faster R-CNN – Object Detection Algorithm

The region proposal network reduces the proposal computation time to roughly 10 ms per image.
It consists of convolutional layers from which the essential feature maps are obtained. At each
feature map location, multiple anchor boxes of varying scales, sizes, and aspect ratios are
placed. For each anchor box, the network predicts a binary objectness score and regresses a
corresponding bounding box.

These proposals are then passed through non-maximum suppression to remove redundant detections,
since many overlapping proposals are produced. The output of non-maximum suppression is passed
through the region of interest pooling layer, and the rest of the computation is similar to the
working of Fast R-CNN.
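The non-maximum suppression step mentioned above can be sketched briefly with torchvision.ops.nms; the boxes and scores here are illustrative values only.

import torch
from torchvision.ops import nms

boxes = torch.tensor([[ 10.,  10., 110., 110.],      # (x1, y1, x2, y2)
                      [ 12.,  12., 112., 112.],      # heavy overlap with the first box
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.92, 0.85, 0.70])

# Boxes whose IoU with a higher-scoring box exceeds the threshold are suppressed.
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)   # tensor([0, 2]): the redundant second box has been removed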

→ Points to consider

1. Limitations – One of the main limitations of the Faster R-CNN method is the delay in
proposing and processing different objects; the observed speed also depends on the hardware
being used.
2. When To Use Faster R-CNN? – Prediction time is much faster than for the earlier R-CNN
methods. While R-CNN usually takes around 40-50 seconds to predict the objects in an image,
Fast R-CNN takes around 2 seconds, and Faster R-CNN returns the result in about 0.2 seconds
(a minimal usage sketch follows these points).
3. Example use cases – The use cases for Faster R-CNN are similar to the ones described for
R-CNN. However, with Faster R-CNN, these tasks can be performed optimally and results achieved
more effectively.
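As referenced in the points above, the following is a minimal inference sketch for a pre-trained Faster R-CNN model from torchvision; the image path is a placeholder and the confidence threshold is an arbitrary choice.

import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = convert_image_dtype(read_image("street.jpg"), torch.float)

with torch.no_grad():
    prediction = model([image])[0]        # dict with 'boxes', 'labels' and 'scores'

for box, label, score in zip(prediction["boxes"], prediction["labels"], prediction["scores"]):
    if score > 0.8:                       # keep only confident detections
        print(label.item(), round(score.item(), 2), box.tolist())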

4.​ Single Shot Detector (SSD)

→ Introduction

The single-shot detector for multi-box predictions is one of the fastest ways to achieve
real-time object detection. While the Faster R-CNN methodology can achieve high prediction
accuracy, the overall process is quite time-consuming, running at only about 7 frames per
second, which is far from desirable for real-time tasks.

The single-shot detector (SSD) solves this issue by improving the frame rate to almost five
times that of the Faster R-CNN model. It removes the region proposal network and instead makes
use of multi-scale feature maps and default boxes.

→ Overview of architecture

SSD – Object Detection Algorithm

The single-shot multibox detector architecture can be broken down into three main components.
The first stage is the feature extraction step, where all the crucial feature maps are
produced; this part of the architecture consists only of fully convolutional layers. After
extracting the essential feature maps, the next stage is the detection heads, which also
consist of fully convolutional networks.
However, in this second stage the task is not to find the semantic meaning of the image.
Instead, the primary goal is to produce the most appropriate bounding boxes from all the
feature maps. Once these two stages have been computed, the final stage passes the results
through non-maximum suppression layers to reduce the error rate caused by repeated bounding
boxes.

→ Limitations of SSD

1. While boosting performance significantly, SSD downscales input images to a lower resolution,
reducing image quality.
2. The SSD architecture typically performs worse than Faster R-CNN on small-scale objects.
→ Points to consider

1. When To Use SSD? – The single-shot detector is often the preferred method when faster
predictions on an image are needed, mainly for detecting larger objects, and accuracy is not an
extremely important concern. For more accurate predictions on smaller, finer objects, other
methods should be considered.
2. Example use cases – The single-shot detector can be trained and evaluated on a multitude of
datasets, such as the PASCAL VOC, COCO, and ILSVRC datasets. It performs well on larger objects
such as humans, tables, chairs, and similar entities (a minimal inference sketch follows these
points).
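As referenced in the points above, the following is a rough sketch that runs torchvision's pre-trained SSD300 on a single image and times one forward pass; the file name is a placeholder and the measured time is only indicative of relative, not absolute, speed.

import time
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.ssd300_vgg16(weights="DEFAULT")
model.eval()

image = convert_image_dtype(read_image("scene.jpg"), torch.float)

with torch.no_grad():
    start = time.perf_counter()
    detections = model([image])[0]
    print(f"inference took {time.perf_counter() - start:.3f} s")

# Keep only the confident detections, typically the larger objects (people, chairs, tables).
keep = detections["scores"] > 0.5
print(detections["boxes"][keep])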

5.​ YOLO (You Only Look Once)


→ Introduction

You only look once (YOLO) is one of the most popular model architectures and
algorithms for object detection. Usually, the first concept found on a Google
search for algorithms on object detection is the YOLO architecture. There are
several versions of YOLO, which we will discuss in the upcoming sections. The
YOLO model uses one of the best neural network architectures to produce high accuracy and
overall processing speed. This combination of speed and accuracy is the main reason for its
popularity.

→ Working process of YOLO

YOLO – Object Detection Algorithm

The YOLO architecture relies on three primary concepts to achieve its goal of object detection.
Understanding these three techniques makes clear why this model performs so quickly and
accurately in comparison to other object detection algorithms. The first concept in the YOLO
model is the residual blocks, which divide the image into a grid; in the first architectural
design a 7×7 grid is created over the particular image.

Each grid cell acts as a central point, and a particular prediction is made for each cell
accordingly. The second technique is that each of these central points is used to create the
bounding boxes. While the classification task works well for each grid cell, it is more complex
to separate the bounding boxes for each of the predictions made. The third and final technique
is the use of intersection over union (IoU) to select the best bounding boxes for the
particular object detection task.
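The intersection over union measure used in the third step can be illustrated with a small, self-contained helper; the boxes below are arbitrary example values in (x1, y1, x2, y2) format.

def iou(box_a, box_b):
    # Coordinates of the overlapping rectangle, if any.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([10, 10, 100, 100], [50, 50, 150, 150]))   # roughly 0.16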

→ Advantages of YOLO

1.​ The computation and processing speed of YOLO is quite high, especially in
real-time compared to most of the other training methods and object detection
algorithms.
2.​ Apart from the fast computing speed, the YOLO algorithm also manages to
provide an overall high accuracy with the reduction of background errors seen in
other methods.
3.​ The architecture of YOLO allows the model to learn and develop an
understanding of numerous objects more efficiently.
→ Limitations of YOLO

1.​ Failure to detect smaller objects in an image or video because of the lower recall
rate.
2. It can struggle to detect two objects that are extremely close together, because each grid
cell can predict only a limited number of bounding boxes.
→ Versions of YOLO

The YOLO architecture is one of the most influential and successful object detection
algorithms. After the introduction of the original YOLO architecture in 2016, its successive
versions YOLO v2 and YOLO v3 arrived in 2017 and 2018. While there was no new release in 2019,
2020 saw three quick releases: YOLO v4, YOLO v5, and PP-YOLO. Each newer version of YOLO
improved slightly on its predecessor. Tiny YOLO was also released so that object detection
could be supported on embedded devices.
Chapter 4

YOLO (You Only Look Once)

The technical architecture and implementation of the Object Detection System are pivotal to its
performance, accuracy, and scalability, forming the foundation that enables its advanced
computer vision capabilities. This chapter explores the design strategies and technological
frameworks that support the system, explaining how its various components collaborate to
achieve efficient and precise object detection.

The Object Detection System utilizes cutting-edge AI and machine learning technologies,
combined with robust data processing pipelines, to ensure high accuracy and responsiveness. The
architecture is engineered to support real-time inference, handle large volumes of image and
video data, and scale efficiently across diverse deployment environments, including cloud and
edge devices. A modular and service-oriented design ensures that individual components, such as
data ingestion, model inference, and result visualization, can be independently upgraded and
optimized.

Critical aspects of the technical architecture include the model training pipeline,
data preprocessing workflows, system infrastructure, and integration with
external platforms such as APIs, camera feeds, or cloud storage. Each component
is meticulously designed to contribute to the system’s overall performance,
flexibility, and maintainability. This chapter will provide a comprehensive
overview of these elements, illustrating how they collectively enable the Object
Detection System to deliver reliable and intelligent visual analysis.
4.1​Evolution from YOLO to YOLOv8:
One of the most, if not the most, well-known models in Artificial intelligence
(AI) is the “YOLO” model series.

YOLO (You Only Look Once) is a popular set of object detection models used
for real-time object detection and classification in computer vision.

Originally developed by Joseph Redmon, Ali Farhadi, and Santosh Divvala, YOLO aims to achieve
high accuracy in object detection with real-time speed. The model family belongs to one-stage
object detection models that process an entire image in a single forward pass of a
convolutional neural network (CNN).

The key feature of YOLO is its single-stage detection approach, which is designed to detect
objects in real time and with high accuracy. Unlike two-stage detection models, such as R-CNN,
that first propose regions of interest and then classify these regions, YOLO processes the
entire image in a single pass, making it faster and more efficient.

In this chapter, we focus on YOLOv8, the latest version of the YOLO system developed by
Ultralytics. We discuss its evolution from YOLO to YOLOv8, its network architecture, its new
features, and its applications, and we provide a step-by-step view of how it is used.

Whether you are a seasoned machine learning engineer or just starting out, this overview
provides the knowledge and tools needed to get started with YOLOv8.
YOLOv1 was the first official YOLO model. It used a single convolutional neural
network (CNN) to detect objects in an image and was relatively fast compared to
other object detection models. However, it was not as accurate as some of the
two-stage models at that time.

YOLOv2 was released in 2016 and made several improvements over YOLOv1. It
used anchor boxes to improve detection accuracy and introduced the Upsample
layer, which improved the resolution of the output feature map.

YOLOv3 was introduced in 2018 with the goal of increasing the accuracy and
speed of the algorithm. The primary improvement in YOLOv3 over its
predecessors was the use of the Darknet-53 architecture, a variant of the ResNet
architecture specifically designed for object detection.

YOLO v3 also improved the anchor boxes, allowing different scales and aspect
ratios to better match the size and shape of the detected objects. The use
of Feature Pyramid Networks (FPN) and GHM loss function, along with a wider
range of object sizes and aspect ratios and improved accuracy and stability, were
also hallmarks of YOLO v3.

YOLOv4, released in 2020 by Bochkovskiy et al., introduced a number of improvements over
YOLOv3, including a new backbone network, improvements to the training process, and increased
model capacity. YOLOv4 also introduced Cross mini-Batch Normalization, a new normalization
method designed to increase the stability of the training process.

YOLOv5, introduced in 2020, builds upon the success of previous versions and was released as an
open-source project by Ultralytics. YOLOv5 uses a CSPDarknet backbone with a PANet-style neck,
together with several new features and improvements, to achieve improved object detection
performance. YOLOv5 became a state-of-the-art repository for object detection in 2020 thanks to
its flexible Pythonic structure, and it is widely used for model-assisted labelling workflows.

YOLOv6 focused on making the system more efficient and reducing its memory footprint. It made
use of a more efficient backbone together with spatial pyramid pooling (SPP) style modules,
designed to handle objects of different sizes and aspect ratios, making it well suited to
object detection tasks.

YOLOv7 was introduced in 2022. One of the key improvements in YOLOv7 is a redesigned backbone
based on extended efficient layer aggregation networks (E-ELAN).

YOLOv7 also introduces a new multi-scale training strategy, which involves training the model
on images at multiple scales and then combining the predictions. This helps the model handle
objects of different sizes and shapes more effectively.

Finally, YOLOv7 incorporates the Focal Loss function, which is designed to address the class
imbalance problem that often arises in object detection tasks. The Focal Loss function gives
more weight to hard examples and reduces the influence of easy examples.

4.2 Why We Use YOLOv8:

A few of the main reasons you should consider using YOLOv8 in your next computer vision project
are:

●	YOLOv8 has better accuracy than previous YOLO models.
●	The latest YOLOv8 implementation comes with many new features; we especially like the
user-friendly CLI and GitHub repo (a brief usage sketch is shown at the end of this section).
●	It supports object detection, instance segmentation, and image classification.
●	The community around YOLO is incredible: search for any edition of the YOLO model and you
will find hundreds of tutorials, videos, and articles. Furthermore, you can always find help in
communities such as the MLOps Community, DCAI, and others.
●	Training YOLOv8 will probably be faster than training two-stage object detection models.
One reason not to use YOLOv8:
●	At the time of writing, YOLOv8 does not support models trained at 1280-pixel resolution, so
if you need to run inference at high resolution, YOLOv8 is not recommended.
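To illustrate the user-friendly interface mentioned in the points above, here is a brief, hedged usage sketch based on the Ultralytics Python API; the image path is a placeholder and the confidence threshold is an arbitrary choice. The roughly equivalent command-line call is shown in the first comment.

# Command-line equivalent (assuming the ultralytics package is installed):
#   yolo predict model=yolov8n.pt source=bus.jpg
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                     # pre-trained nano detection model
results = model.predict("bus.jpg", conf=0.5)   # run inference on a single image

for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)         # class id, confidence, box coordinates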
YOLOv8 network architecture and design

At the time of writing, an official YOLOv8 paper has not been released, although the creators
have indicated that one is forthcoming. As a result, we do not yet have a complete overview of
the methodologies used during its creation, nor access to the ablation studies conducted by the
team.

The layout of YOLOv8

We won’t go too much into detail about the YOLOv8 architecture, but we will
cover some of the major differences from previous iterations.

The following layout was made by RangeKing on GitHub and is a great way of
visualizing the architecture.

Anchor-free Detections

Anchor-free detection is when an object detection model directly predicts the center of an
object instead of the offset from a known anchor box.

Anchor boxes are a pre-defined set of boxes with specific heights and widths,
used to detect object classes with the desired scale and aspect ratio. They are
chosen based on the size of objects in the training dataset and are tiled across the
image during detection.

The network outputs probability and attributes like background, IoU, and offsets
for each tiled box, which are used to adjust the anchor boxes. Multiple anchor
boxes can be defined for different object sizes, serving as fixed starting points for
boundary box guesses.
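To make the tiling of anchor boxes more concrete, the following is an illustrative sketch (not taken from any particular library) that generates anchors of several scales and aspect ratios over a regular grid; every value here is an assumption chosen only for demonstration.

def generate_anchors(image_size=640, stride=32,
                     scales=(64, 128, 256), aspect_ratios=(0.5, 1.0, 2.0)):
    # Returns anchors as (cx, cy, w, h) in pixels, one set per grid-cell centre.
    anchors = []
    for cy in range(stride // 2, image_size, stride):
        for cx in range(stride // 2, image_size, stride):
            for s in scales:
                for ar in aspect_ratios:
                    w = s * (ar ** 0.5)
                    h = s / (ar ** 0.5)
                    anchors.append((cx, cy, w, h))
    return anchors

anchors = generate_anchors()
print(len(anchors))   # 20 x 20 grid cells x 9 anchors per cell = 3600 anchors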

The advantage of anchor-free detection is that it is more flexible and efficient, as it does
not require the manual specification of anchor boxes, which can be difficult to choose and can
lead to suboptimal results in earlier anchor-based YOLO models such as v2 and v3.
New convolutions in YOLOv8

There are a series of updates and new convolutions in the YOLOv8 architecture according to the
introductory post from Ultralytics, most notably a new C2f building block that replaces the
earlier C3 module and a 3×3 stem convolution in place of the original 6×6 convolution.
Chapter 5 Results and Discussions

5.1​Coding:

Code for Object Detection Application:

Code for Object Detection Video Access:


Main Code for Object Detection:
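The original listing was included as a screenshot; as a stand-in, the following is a minimal sketch of what the main detection loop might look like, assuming the Ultralytics YOLOv8 API and an OpenCV video loop. The file name, window title, and model choice are placeholders, not the exact values used in the project.

import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                        # pre-trained detection model

capture = cv2.VideoCapture("input_video.mp4")     # or 0 for a webcam feed
while capture.isOpened():
    ok, frame = capture.read()
    if not ok:
        break
    detections = model(frame)[0]                  # run inference on one frame
    annotated = detections.plot()                 # draw boxes and labels on the frame
    cv2.imshow("Object Detection", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):         # press q to stop
        break

capture.release()
cv2.destroyAllWindows()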
​ Terminal:

5.2​ Output and Discussion:


Choose a file to provide as input to the Object Detection application.
Now, press the Upload and Process button shown on the screen.

The figure above shows a frame of the video given as input to the Object Detection application.
Discussion

The feedback collected from the evaluation of the Object Detection project reveals
both the effectiveness of the current implementation and areas that warrant
further development. The model demonstrated strong performance in detecting
objects with high accuracy across diverse test scenarios, indicating the robustness
of the underlying architecture and data preprocessing techniques. The following
action points have been identified based on the results and user feedback:

●	Model Refinement: While the current model performs well on standard objects, certain classes
showed lower precision and recall. Additional data augmentation, class balancing, or
fine-tuning with a more specialized dataset could improve detection for underrepresented
categories (a brief fine-tuning sketch follows these points).

●	Real-Time Performance: Users noted some latency in real-time applications, especially on
lower-end devices. Optimization strategies such as model quantization, pruning, and
hardware-specific acceleration should be explored to reduce inference time.

●	Deployment and Usability: To enhance usability, especially for non-technical users,
improvements in the user interface and API documentation are recommended. Simplified deployment
options (e.g., containerization or mobile support) would further broaden accessibility.

●	Edge Case Handling: The model occasionally misclassified overlapping or occluded objects.
Incorporating more training data with complex scenes and experimenting with attention
mechanisms or instance segmentation could help address these challenges.
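As a follow-up to the Model Refinement point, the following is a hedged sketch of fine-tuning YOLOv8 on a more specialized dataset using the Ultralytics API; custom_data.yaml is a hypothetical dataset description listing the train/validation paths and class names, and the hyperparameters are illustrative only.

from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # start from pre-trained weights

model.train(
    data="custom_data.yaml",        # hypothetical dataset configuration
    epochs=50,
    imgsz=640,
    fliplr=0.5,                     # built-in augmentation options can be tuned
    mosaic=1.0,                     # to help underrepresented classes
)

metrics = model.val()               # per-class precision/recall and mAP
print(metrics.box.map50)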
Chapter 6 Conclusion and Future Scope

6.1 Conclusion
The Object Detection Project was initiated to address the increasing demand for
accurate, efficient, and scalable computer vision solutions capable of identifying
and localizing objects in real time. Through various stages of development,
training, testing, and validation, the system has demonstrated strong potential in
automating visual recognition tasks across multiple domains.

Key Achievements

•	High Detection Accuracy: The system achieved reliable object detection performance across a
variety of test environments, accurately identifying and localizing multiple objects with
minimal false positives and false negatives.

•	Robust Model Architecture: Leveraging advanced deep learning models such as YOLO, the project
implemented a solid framework capable of handling complex visual data with high speed and
precision.

•	Real-Time Capabilities: Optimizations to model size and inference speed enabled real-time
object detection, making the solution suitable for live applications such as surveillance,
autonomous navigation, and retail monitoring.

•	Domain Flexibility: The system was designed to be adaptable, with the ability to retrain on
different datasets for applications ranging from industrial inspection to medical imaging,
showcasing its broad applicability.

•	User-Friendly Deployment: Efforts to simplify deployment via Docker containers, REST APIs,
and edge device compatibility have made the system more accessible for both developers and
end-users with varying technical backgrounds.

Summary:
In conclusion, the Object Detection Project has laid a strong foundation for the
development of intelligent visual recognition systems. Its high accuracy,
versatility, and real-time processing capabilities make it a promising solution for
a wide range of real-world applications. Guided by data-driven development and
user feedback, the project is well-positioned for future expansion and refinement.
Continued advancements in performance, customization, and usability will ensure
its relevance and value across industries embracing computer vision technologies.
The Object Detection Project aimed to build an intelligent system capable of
identifying and localizing objects within images.


The model achieved high accuracy in recognizing multiple object classes across
diverse environments. Robust preprocessing and data augmentation techniques
enhanced model generalization. Real-time inference was enabled through model
optimization and hardware acceleration.


The system demonstrated versatility across domains such as surveillance, retail,
and healthcare.​
User-friendly deployment options were implemented via APIs and Docker
containers.​
Edge-device compatibility ensured the system's applicability in low-latency
scenarios.​
Custom training support allowed adaptation to specific industry use cases.​
Challenges like detecting occluded or small objects were partially addressed and
noted for future work. ​
Feedback indicated strong performance but highlighted opportunities for UI and
speed improvements. Security and privacy concerns were acknowledged for
sensitive use cases.​
Scalability was considered, with design support for high-volume image
processing.​
The project sets a strong foundation for more advanced applications like instance
segmentation.​
Overall, the system proves to be a reliable, adaptable, and promising solution for
modern visual detection needs.
6.2 Future Scope
While the Object Detection project has demonstrated promising results in
identifying and localizing objects across various environments, there remain
several avenues for further enhancement and exploration. The following outlines
key future directions that will help improve the system’s performance, scalability,
and applicability across diverse real-world scenarios.

1.​ Model Improvements

●	Accuracy Enhancement: Future work could involve training on more diverse and larger datasets
to improve generalization, especially for rare or overlapping objects.

●	Instance Segmentation: Moving beyond bounding boxes, incorporating instance segmentation can
enable pixel-level accuracy for tasks requiring detailed object shapes.

●	Zero-Shot and Few-Shot Learning: Implementing techniques that allow the system to detect new
or rare classes with minimal training data could enhance adaptability.

2.​ Performance and Efficiency

●	Real-Time Optimization: Optimizing inference speed using model quantization, pruning, and
deploying lightweight models such as MobileNet or YOLO-Nano for edge devices (a brief export
sketch follows these points).

●	Hardware Acceleration: Leveraging hardware-specific optimizations (e.g., TensorRT, EdgeTPU,
or GPU parallelism) to ensure efficient deployment on various platforms.
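Building on the Real-Time Optimization point above, the following is a brief sketch of exporting a YOLOv8 model for faster deployment, assuming the Ultralytics API; further quantization (for example INT8 calibration with TensorRT) would be a separate, deployment-specific step.

from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Export to ONNX for use with runtimes such as ONNX Runtime or TensorRT.
onnx_path = model.export(format="onnx", imgsz=640)
print("exported to", onnx_path)

# Half-precision export is also possible for some formats, e.g. a TensorRT engine:
# model.export(format="engine", half=True)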

3.​ Application Expansion

●	Domain-Specific Models: Adapting the system for specialized domains like medical imaging,
autonomous driving, agriculture, or retail analytics to improve use-case precision.
●​ Multi-Modal Integration: Integrating object detection with other modalities
such as audio or sensor data to provide richer contextual understanding in
complex environments.

4.​ Deployment and Usability

●	Edge and IoT Deployment: Developing capabilities for low-latency object detection on embedded
or IoT devices for applications in surveillance, robotics, and smart cities.

●	Customizable Pipelines: Offering configurable pipelines that allow users to retrain or
fine-tune models on their own data with minimal setup.

5.​ Data Management and Annotation

●	Active Learning: Incorporating active learning frameworks to streamline the annotation
process by focusing on uncertain or high-impact samples.

●	Synthetic Data Generation: Using synthetic data or simulation environments to supplement
training data and reduce dependency on manual annotations.

6.​ Ethics, Privacy, and Security

●	Bias Mitigation: Addressing bias in training datasets and model predictions to ensure
fairness and accuracy across diverse demographic and environmental conditions.

●	Privacy Protection: Implementing techniques like federated learning or on-device inference to
protect user data and comply with privacy regulations.

Future Vision

The long-term vision for the Object Detection project is to evolve into a flexible, real-time,
and intelligent perception system capable of seamlessly integrating into a wide range of
applications, from autonomous systems to assistive technologies, ensuring high accuracy,
efficiency, and ethical reliability in every deployment.
Conclusion
The Object Detection project has successfully demonstrated the capability to
accurately identify and localize multiple objects within diverse visual
environments. Through the application of advanced deep learning techniques, the
system has achieved strong performance in both controlled and real-world
scenarios. The model's effectiveness validates the robustness of the data
preprocessing, architecture selection, and training strategies employed.

Despite its achievements, the project also highlights opportunities for further
enhancement in areas such as real-time performance, precision in complex
scenes, and broader applicability across specialized domains. By continuing to
refine the model, expand its integration potential, and address deployment
challenges, the system can become a powerful tool in a variety of industries
including surveillance, healthcare, retail, and autonomous systems.

With a clear path for future development and a strong foundational framework,
the Object Detection project stands as a significant step toward building
intelligent, adaptable, and scalable computer vision solutions.
References

●	Computer Vision from
https://www.analyticsvidhya.com/blog/2018/10/a-step-by-step-introduction-to-the-basic-object-detection-algorithms-part-1/

●	Object Detection Algorithms from
https://neptune.ai/blog/object-detection-algorithms-and-libraries

●	YOLOv8 description from
https://medium.com/cord-tech/yolov8-for-object-detection-explained-practical-example-23920f77f66a
