Project Report Object Detection
UDAIPUR (RAJ.)
PROJECT REPORT
ON
OBJECT DETECTION
submitted in partial fulfillment for the award of the Degree of Bachelor of Technology
in Department of Computer Science & Engineering
(Session 2021-2025)
First of all, we thank all those people who helped us with their guidance and
assistance, without which this project would not have been successful.
We gratefully acknowledge our Dean, Dr. Kalpana Jain, for providing us with a
great opportunity and for giving us her full support and cooperation.
We are very grateful to Dr. Kalpana Jain for her guidance and constant
supervision, as well as for providing the necessary information regarding this report.
Submitted By:
Anjali Singh
Harsha Rajawat
Sakshi Soni
Tanu Sharma
B. Tech. Final Year, AI & DS
DECLARATION
We hereby declare that the project titled “Object Detection” has been developed by
us and has not been reproduced as-is from any other source. It has been submitted in
partial fulfillment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science and Engineering, MPUAT Udaipur, and has not been
submitted anywhere else for the award of any other degree.
UDAIPUR (RAJ.)
CERTIFICATE
This is to certify that the project entitled “Object Detection” has been completed
and submitted by Anjali Singh, Harsha Rajawat, Sakshi Soni, and Tanu Sharma in
partial fulfillment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science and Engineering from the College of Technology and
Engineering, a constituent college of Maharana Pratap University of Agriculture and
Technology, Udaipur.
CONTENTS
List of Tables
List of Figures
Abbreviations
1. Abstract
2. Chapter 1: Introduction
6.1: Conclusion
8. References
LIST OF FIGURES
ABBREVIATIONS
● JS: JavaScript
ABSTRACT
This tool aims to address real-world challenges across various domains, including
surveillance, autonomous vehicles, smart retail, and industrial automation. The
system is developed using modern frameworks like TensorFlow, PyTorch, and
OpenCV, ensuring scalability, adaptability, and seamless integration into existing
workflows. Key features include real-time detection, confidence scoring,
bounding box visualization, and support for custom datasets to suit specific
application needs.
INTRODUCTION
This Object Detection System project was conceived to address these challenges
by developing a robust and efficient solution capable of accurately identifying
and localizing objects in both static images and dynamic video streams. The
purpose of the system is to enhance automation, improve operational safety, and
enable faster, data-driven decisions through real-time visual analysis.
This chapter outlines the core components and features of the Object Detection
System, detailing how each contributes to the efficient development, deployment,
and operation of intelligent visual recognition models.
By consolidating these features within a unified system, the Object Detection Project
empowers data scientists, ML engineers, and domain specialists to collaborate
seamlessly, reduce development time, and increase the accuracy and reliability of
detection results. Each module is designed with user experience and scalability in
mind, offering intuitive controls and flexible configuration options to support
evolving project needs.
In the following sections, we will delve into the primary modules of the Object
Detection Platform, beginning with the Data Ingestion Interface. This will provide a
clear view of how raw visual data is transformed into structured, actionable insights.
The entry point to the system is the Login Page. It ensures secure access by
authenticating all users before allowing entry into the system. Role-based access
control is enforced, allowing administrators to manage team roles, assign privileges,
and monitor user activity—ensuring both operational security and accountability
throughout the object detection lifecycle.
Technical Implementation:
• Frontend: The website is built using HTML and styled with CSS to ensure a
clean, responsive user interface.
• Backend: The input image or video is passed to the detection model, which
locates and labels the objects it contains (a rough sketch of this flow is shown below).
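As a rough illustration of this flow, the sketch below assumes a hypothetical detector object with a detect() method (a placeholder, not the project's actual API) and uses OpenCV to draw the returned bounding boxes and confidence scores on the input image.

import cv2

def run_detection(image_path, model):
    # 'model' is a placeholder for whichever detector the backend loads
    # (e.g. a YOLO or Faster R-CNN model). Its detect() method is assumed
    # to return a list of (label, confidence, (x1, y1, x2, y2)) tuples.
    image = cv2.imread(image_path)
    for label, confidence, (x1, y1, x2, y2) in model.detect(image):
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(image, f"{label} {confidence:.2f}", (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imwrite("output.jpg", image)   # annotated result returned to the frontend
    return image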
3.1 Object Detection:
Object detection is a technique that uses neural networks to localize and classify
objects in images. This computer vision task has a wide range of applications,
from medical imaging to self-driving cars. Object detection is a computer
vision task that aims to locate objects in digital images. As such, it is an instance
of artificial intelligence that consists of training computers to see as humans do,
specifically by recognizing and classifying objects according to semantic
categories [1]. Object localization is a technique for determining the location of
specific objects in an image by demarcating the object with a bounding box.
Object classification is another technique that determines to which category a
detected object belongs. The object detection task combines subtasks of object
localization and classification to simultaneously estimate the location and type of
object instances in one or more images.
Object detection overlaps with other computer vision techniques, but developers
nevertheless treat it as a discrete endeavor.
Computer vision works much the same as human vision, except humans have a
head start. Human sight has the advantage of lifetimes of context to train how to
tell objects apart, how far away they are, whether they are moving or something
is wrong with an image.
Computer vision trains machines to perform these functions, but it must do it in
much less time with cameras, data and algorithms rather than retinas, optic nerves
and a visual cortex. Because a system trained to inspect products or watch a
production asset can analyze thousands of products or processes a minute,
noticing imperceptible defects or issues, it can quickly surpass human
capabilities.
Computer vision is used in industries that range from energy and utilities to
manufacturing and automotive—and the market is continuing to grow. It is
expected to reach USD 48.6 billion by 2022.
Computer vision needs lots of data. It runs analyses of the data over and over until it
discerns distinctions and ultimately recognizes images. For example, to train a
computer to recognize automobile tires, it needs to be fed vast quantities of tire
images and tire-related items to learn the differences and recognize a tire,
especially one with no defects.
Machine learning uses algorithmic models that enable a computer to teach itself
about the context of visual data. If enough data is fed through the model, the
computer will “look” at the data and teach itself to tell one image from another.
Algorithms enable the machine to learn by itself, rather than someone
programming it to recognize an image.
A CNN helps a machine learning or deep learning model “look” by breaking
images down into pixels that are given tags or labels. It uses the labels to perform
convolutions (a mathematical operation on two functions to produce a third
function) and makes predictions about what it is “seeing.” The neural network
runs convolutions and checks the accuracy of its predictions in a series of
iterations until the predictions start to come true. It is then recognizing or seeing
images in a way similar to humans.
Much like a human making out an image at a distance, a CNN first discerns hard
edges and simple shapes, then fills in information as it runs iterations of its
predictions. A CNN is used to understand single images. A recurrent neural
network (RNN) is used in a similar way for video applications to help computers
understand how pictures in a series of frames are related to one another.
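As a minimal illustration of the convolutional layers described above, the following PyTorch sketch stacks two convolution-and-pooling stages (a rough analogue of the "edges first, richer patterns later" behaviour) in front of a linear classifier. The layer sizes and class count are arbitrary choices for the example, not the architecture used in this project.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early filters respond to edges and simple shapes
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper filters combine them into richer patterns
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x):
        x = self.features(x)                  # repeated convolutions over the pixel grid
        return self.classifier(x.flatten(1))  # class scores: what the network is "seeing"

model = TinyCNN()
scores = model(torch.randn(1, 3, 224, 224))   # one random 224x224 RGB image as a stand-in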
3.5 The History of Computer Vision:
Scientists and engineers have been trying to develop ways for machines to see
and understand visual data for about 60 years. Experimentation began in 1959
when neurophysiologists showed a cat an array of images, attempting to correlate
a response in its brain. They discovered that it responded first to hard edges or
lines and scientifically, this meant that image processing starts with simple shapes
like straight edges.
At about the same time, the first computer image scanning technology was
developed, enabling computers to digitize and acquire images. Another milestone
was reached in 1963 when computers were able to transform two-dimensional
images into three-dimensional forms. In the 1960s, AI emerged as an academic
field of study and it also marked the beginning of the AI quest to solve the human
vision problem.
By 2000, the focus of study was on object recognition; and by 2001, the first
real-time face recognition applications appeared. Standardization of how visual
data sets are tagged and annotated emerged through the 2000s. In 2010, the
ImageNet data set became available. It contained millions of tagged images
across a thousand object classes and provided a foundation for the CNNs and deep
learning models used today. In 2012, a team from the University of Toronto
entered a CNN into an image recognition contest. The model, called AlexNet,
significantly reduced the error rate for image recognition. After this
breakthrough, error rates have fallen to just a few percent [5].
IBM has also introduced a computer vision platform that addresses both
developmental and computing resource concerns. IBM Maximo® Visual
Inspection includes tools that enable subject matter experts to label, train and
deploy deep learning vision models—without coding or deep learning expertise.
The vision models can be deployed in local data centers, the cloud and edge
devices.
● Image classification sees an image and can classify it (a dog, an apple, a person’s
face). More precisely, it is able to accurately predict that a given image belongs to
a certain class.
● Object detection can use image classification to identify a certain class of image
and then detect and tabulate their appearance in an image or video. Examples
include detecting damages on an assembly line or identifying machinery that
requires maintenance.
● Object tracking follows or tracks an object once it is detected. This task is often
executed with images captured in sequence or real-time video feeds. Autonomous
vehicles, for example, need not only to classify and detect objects such as
pedestrians, other cars, and road infrastructure, but also to track them in motion
to avoid collisions and obey traffic laws. [7]
Humans can easily detect and identify objects present in an image. The human
visual system is fast and accurate and can
perform complex tasks like identifying multiple objects and detecting obstacles
with little conscious thought. With the availability of large amounts of data, faster
GPUs, and better algorithms, we can now easily train computers to detect and
classify multiple objects within an image with high accuracy.
With this kind of identification and localization, you can use object detection to
count objects in a scene, determine their precise locations, and track them while
accurately labeling them.
Now let’s simplify this statement a bit with the help of the below image.
For now, consider that you can represent the location of the object in the image as
coordinates of these boxes. This box around the object is formally known as a
bounding box. This situation creates an image localization problem where you
receive a set of images and must identify where the object is present in each
image.
Example:
In this image, we have to locate the objects in the image but note that all the
objects are not dogs. Here we have a dog and a car. So we not only have to locate
the objects in the image but also classify each located object as a dog or a car. So
this becomes an object detection problem.
This section will also cover a few points regarding image classification; in
particular, we will compare image classification with object detection.
In the case of object detection problems, we have to classify the objects in the
image and also locate where these objects are present in the image. But the image
classification problem had only one task where we had to classify the objects in
the image.
So, in the first example, we predict only the target class, and we refer to such
tasks as image classification problems, while in the second case, along with
predicting the target class, we also have to find the bounding box that denotes
the location of the object.
This is the essence of object detection using machine learning. Broadly, an object
detection problem involves three tasks: deciding whether an object is present,
locating it with a bounding box, and classifying it.
Now the object detection problem can also be divided into multiple categories.
First is the case when you have images that have only one object. That is you can
have 1000 images in the data set, and all of these images will have only one
object. And if all these objects belong to a single class, that is all the objects are
cars, then this will be an image localization problem.
Another problem could be where you are provided with multiple images, and
within each of these images, you have multiple objects. Also, these objects can be
of the same class, or another problem can be that these objects are of different
classes.
So, in case you have multiple objects in the image and the objects are of
different classes, you would have to not only locate the objects but also classify
each of them.
The next section will discuss the problem statement for object detection.
Now in this section, we’ll understand what the data would look like for an object
detection using deep learning task.
So, let’s first take an example from the classification problem. In the below
image, we have an input image and a target class against each of these input
images.
Now, suppose the task at hand is to detect the cars in the images. In that case, we will
not only have an input image but also a target variable that contains the bounding box
denoting the location of the object in the image.
So, in this case, the target variable has five values: the value p denotes the
probability of an object being present in the image, whereas the four values Xmin,
Ymin, Xmax, and Ymax denote the coordinates of the bounding box. Let us
understand how these coordinate values are calculated.
So, consider the x-axis and y-axis drawn over the image. In that case, Xmin
and Ymin represent the top-left corner of the bounding box, while Xmax and
Ymax represent the bottom-right corner. Note that the target variable answers
two questions:
1. Is an object present in the image? If an object is not present, then p will be zero,
and when there is an object present in the image, p will be one.
2. Where is the object located? The object location is given by the coordinates of
the bounding box.
This works when all the images contain a single class, say just cars. What happens
when there are more classes? In that case, this is what the target variable would look
like: if you have two classes, say an emergency vehicle and a non-emergency
vehicle, you will have two additional values, c1 and c2, denoting which class the
object present in the image belongs to.
Let's say we build a model and get some predictions from it; this is a possible
output you can get from a model. The probability that an object is present in the
predicted bounding box is 0.8. You have the coordinates of this blue bounding box,
which are (40, 20) and (210, 180), along with the class values c1 and c2 (a small
sketch of this representation follows).
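To make the target and prediction format concrete, here is a small sketch of how such a vector might be represented in Python. The class scores for c1 and c2 are hypothetical values added for illustration, since the text above gives only the probability and the box coordinates.

# [p, Xmin, Ymin, Xmax, Ymax, c1, c2] for the example prediction above
prediction = {
    "p": 0.8,                           # probability that an object is present in the box
    "box": (40, 20, 210, 180),          # (Xmin, Ymin, Xmax, Ymax) of the predicted bounding box
    "classes": {"c1": 0.9, "c2": 0.1},  # emergency vs. non-emergency vehicle (hypothetical scores)
}

# An image with no object would have p = 0 and the remaining values unused.
empty_target = {"p": 0.0, "box": None, "classes": None}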
Before moving into more depth, we need a few more concepts. In this section, we
discuss a very useful one, intersection over union (IoU), which we will use to
determine the target variable for the individual patches that we have created.
So, consider the following scenario. Here we have two bounding boxes, box1 and
box2. Now, if I ask you which of these two boxes is more accurate, the obvious
answer is box1. Why? Because it covers the major region of the WBC (white blood
cell) and has correctly detected it. But how can we find this out mathematically?
So, compare the actual and the predicted bounding boxes. If we can find the
overlap between the actual and the predicted bounding box, we will be able to
decide which bounding box is the better prediction.
So the bounding box that has a higher overlap with the actual bounding box is a
better prediction. Now, this overlap is called the area of intersection for this first
box, which is box1. We can say that the area of intersection is about 70% of the
actual bounding box.
Whereas, if you consider box2, the area of intersection of the second bounding
box, and the actual bounding box is about 20 %.
So we can say that of these two bounding boxes obviously, box1 is a better
prediction. But having the area of intersection alone is not enough.
Scenario 1: Let's consider another example, where we have created multiple
bounding boxes, or patches, of different sizes.
Here, the intersection for the left bounding box is certainly 100%, whereas in the
second image the intersection for the predicted bounding box, or this particular
patch, is just 70%. So, at this stage, would you say that the bounding box on the
left is the better prediction? Obviously not; the bounding box on the right is more
accurate.
So, to deal with such scenarios, we also consider the area of union, which covers
both the patch area and the actual bounding box area.
The higher this area of union (the blue region), the less accurate the predicted
bounding box, or the particular patch, will be. This measure is known as
intersection over union (IoU).
So here we have the formula for intersection over union, which is the area of the
intersection divided by the area of union.
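Written out as a single expression, that is:

IoU = Area of Intersection / Area of Union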
Now, what would be the range of IoU? Let's consider some extreme scenarios.
Suppose the actual bounding box and the predicted bounding box have no overlap
at all. In that case, the area of intersection will be zero, whereas the area of
union will be the sum of the areas of the two boxes. So, overall, the IoU would
be zero.
Scenario 2:
Another possible scenario could be when both the predicted bounding box and
the actual bounding box completely overlap.
In that case, the area of the intersection will be equal to this overlap, and the area
of union will also be the same. Since the numerator and the denominator would
be the same in this case, the IoU would be 1.
So, basically, the range of IoU, or intersection over union, is between 0 and 1. If
the IoU is above a chosen threshold, we treat the predicted bounding box as a good
match for the actual bounding box, whereas if the IoU is below that threshold, we
say that the predicted bounding box is nothing close to the actual bounding box.
This will also be helpful for understanding the code for intersection over union.
As discussed above, in order to calculate the IoU value, we need the area of
intersection as well as the area of union.
Now the question is, how do we find out these two values? So to find out the area
of intersection, we need the area of this blue box. And we can calculate that using
the coordinates for this blue box.
The coordinates will be Xmin, Ymin, Xmax, and Ymax; using these coordinate
values, we can easily calculate the area of intersection. So let's first focus on
determining the value of Xmin.
In order to find out the value of Xmin, we are going to use the Xmin values for
these two bounding boxes, which are represented as X1min and X2min.
Now, as you can see in the diagram above, the Xmin for this blue bounding box is
simply equal to X2min. We can also say that the Xmin for this blue box will
always be the maximum of the two values X1min and X2min.
To find Xmax for this blue bounding box, we compare the values X1max
and X2max. We can see that the Xmax for this blue bounding box is equal to
X1max; it can also be written as the minimum of X1max and X2max.
Similarly, to find the values of Ymin and Ymax, we compare Y1min with Y2min
and Y1max with Y2max. The value of Ymin will simply be the maximum of Y1min
and Y2min, as you can see here.
We can calculate the area of intersection by multiplying the length and the width
of this rectangle, which is the blue rectangle right here.
To find the length, we subtract Xmin from Xmax, and to find the height (or width)
we take the difference between Ymax and Ymin. Once we have the length and
width, the area of the intersection is simply the length multiplied by the width. So
now we understand how to calculate the area of intersection.
Area of union
Next, the focus is on calculating the area of union. So in order to calculate the
area of union, we are going to use the coordinate values of these two bounding
boxes which are the green bounding box and the red bounding box.
Now note that, when we are calculating the areas of box1 and box2, we are
actually counting this blue shaded region twice. So this is a part of the green
rectangle as well as the red rectangle. Since this part is counted twice we’ll have
to subtract it once, in order to get the area of union.
So the area of union will finally be the sum of the area of box1 and the area of
box2, minus the area of intersection, since that region has been counted twice.
So now we have the area of intersection for two bounding boxes and also have
the area of union for two bounding boxes. Now we can simply
calculate the intersection over union as the area of the intersection divided by the
area of union.
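As a minimal sketch of the steps just described (maximum of the two minimum coordinates, minimum of the two maximum coordinates, and union as the sum of the two areas minus the intersection), the following Python function computes IoU for two boxes given as (Xmin, Ymin, Xmax, Ymax). It is an illustration rather than the project's production code.

def iou(box1, box2):
    # Each box is (xmin, ymin, xmax, ymax).
    xmin = max(box1[0], box2[0])   # left edge of the intersection: maximum of the two xmins
    ymin = max(box1[1], box2[1])   # top edge: maximum of the two ymins
    xmax = min(box1[2], box2[2])   # right edge: minimum of the two xmaxes
    ymax = min(box1[3], box2[3])   # bottom edge: minimum of the two ymaxes

    # If the boxes do not overlap, the intersection area is zero.
    intersection = max(0, xmax - xmin) * max(0, ymax - ymin)

    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection   # the overlap was counted twice, so subtract it once

    return intersection / union if union > 0 else 0.0

# Example: actual box vs. a slightly shifted predicted box
print(iou((40, 20, 210, 180), (50, 30, 200, 170)))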
Now, we are going to discuss some popularly used evaluation metrics for object
detection using deep learning.
We have previously discussed intersection over union and how it can be used to
evaluate model performance by comparing the predicted bounding boxes with the
actual bounding boxes. Another popularly used metric is mean average precision
(mAP). In this section, we will understand what mean average precision is and how
it can be used.
Mean Average Precision
Now, I'm sure you're familiar with the metric precision, which is the number of
true positives divided by the sum of true positives and false positives; in other
words, the correctly predicted positives out of all predicted positives, written out
below.
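In symbols:

Precision = True Positives / (True Positives + False Positives)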
So let’s understand this with an example of how it works that will give you a
better idea of what average precision is.
Example:
So we saw that in this above image example, we have five bounding boxes with
their IoU scores, and based on the IoU score we can define if this bounding box is
a true positive or a false positive. Now, we calculate the precision for the
scenario where we consider only bounding box 1.
Let’s break down object detection for machine learning. We’re talking about how
well a system can spot objects in images. Now, let’s get into the numbers.
Imagine we’re looking at the first box around an object. If it’s correctly identified
(a true positive), we give it a score of one.
The bottom number of our precision calculation is the total of true positives and
false positives. In this case, it’s also one. So, the precision for this box is one.
Even if there’s a false positive, we keep the precision value the same. We repeat
this process for the other boxes. Say we’re checking the third box and find a true
positive. Now, we have two true positives in total. The sum of true positives and
false positives is three. So, the precision at this point is calculated as 2 divided by
3, which equals 0.66.
Similarly, we would calculate for all the bounding boxes. So for the fourth
bounding box, we’ll have three true positives and a total number of 4 true
positives and false positives. Hence, this value would be 3 by 4 or 0.75.
Once we calculate all the precision values for the bounding boxes, we will take
an average of these values, known as interpolated precision, to determine the
average precision.
Now, mean average precision is simply calculated across all the classes.
So let’s say we have multiple classes or let’s say we have k classes, then for each
individual class, we’ll calculate this average precision, and take an average across
all the classes. This would give you the mean average precision. So this is how
mean average precision is calculated for the object detection problems and is
used as an evaluation metric to compare and evaluate the performance of these
object detectors.
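A simplified Python sketch of the calculation walked through above (recording precision at each true positive, averaging those values per class, and then averaging across classes) is given below. Real evaluation code typically adds recall levels and interpolation, and the second class value here is purely hypothetical, so treat this as an illustration only.

def average_precision(detections_are_tp):
    # detections_are_tp: one boolean per predicted box, in the order evaluated,
    # True for a true positive (IoU above the threshold) and False otherwise.
    precisions, true_positives = [], 0
    for seen, is_tp in enumerate(detections_are_tp, start=1):
        if is_tp:
            true_positives += 1
            precisions.append(true_positives / seen)   # precision recorded at each true positive
    return sum(precisions) / len(precisions) if precisions else 0.0

# Worked example from above: TP, FP, TP, TP, FP -> precisions 1, 0.66, 0.75
ap_car = average_precision([True, False, True, True, False])

# Mean average precision: average the per-class APs across all k classes.
per_class_ap = {"car": ap_car, "truck": 0.80}
mean_ap = sum(per_class_ap.values()) / len(per_class_ap)
print(ap_car, mean_ap)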
1. Histogram of Oriented Gradients (HOG)
→ Introduction
→ Overview of architecture
Before we understand the overall architecture of HOG, here’s how it works. For a
particular pixel in an image, the histogram of the gradient is calculated by
considering the vertical and horizontal values to obtain the feature vectors. With
the help of the gradient magnitude and the gradient angles, we can get a clear
value for the current pixel by exploring the other entities in their horizontal and
vertical surroundings.
As shown in the above image representation, we’ll consider an image segment of
a particular size. The first step is to find the gradient by dividing the entire
computation of the image into gradient representations of 8×8 cells. With the help
of the 64 gradient vectors that are achieved, we can split each cell into angular
bins and compute the histogram for the particular area. This process reduces the
size of 64 vectors to a smaller size of 9 values.
Once we obtain the size of 9 point histogram values (bins) for each cell, we can
choose to create overlaps for the blocks of cells. The final steps are to form the
feature blocks, normalize the obtained feature vectors, and collect all the feature
vectors to get an overall HOG feature. Check the following links for more
information about this.
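As an illustration of the 9-bin, 8×8-cell pipeline described above, the sketch below uses scikit-image's hog() function on a hypothetical input image. The parameters mirror the description rather than any configuration used in this project.

from skimage import color, io
from skimage.feature import hog

image = color.rgb2gray(io.imread("person.jpg"))  # hypothetical input image

features, hog_image = hog(
    image,
    orientations=9,           # 9-bin histogram of gradient angles per cell
    pixels_per_cell=(8, 8),   # 8x8 gradient cells, as described above
    cells_per_block=(2, 2),   # overlapping blocks used for normalization
    visualize=True,           # also return an image visualizing the gradients
)

print(features.shape)         # the flattened, normalized HOG feature vector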
→ Achievements of HOG
1. Creation of a feature descriptor useful for performing object detection.
3. Creation of a sliding window effect for the computation of each position.
→ Points to consider
1. Limitations – While the Histogram of Oriented Gradients (HOG) was quite
revolutionary in the beginning stages of object detection, there were a lot of
issues in this method. It’s quite time-consuming for complex pixel computation in
images, and ineffective in certain object detection scenarios with tighter spaces.
2. When to use HOG? – HOG should often be used as the first method of object
detection to test other algorithms and their respective performance. Regardless,
HOG finds significant use in most object detection and facial landmark
recognition with decent accuracy.
3. Example use cases – One of the popular use cases of HOG is in pedestrian
detection due to its smooth edges. Other general applications include object
detection of specific objects. For more information, refer to the following link.
2. R-CNN (Region-Based Convolutional Neural Networks)
→ Introduction
The selective search algorithm selects the most important region proposals by
generating multiple sub-segmentations of a particular image and choosing the
candidate regions for your task. A greedy algorithm is then used recursively to
combine the smaller segments into suitable larger segments.
Once the selective search algorithm is successfully completed, our next tasks are
to extract the features and make the appropriate predictions. We can then make
the final candidate proposals, and the convolutional neural networks can be used
for creating an n-dimensional (either 2048 or 4096) feature vector as output. With
the help of a pre-trained convolutional neural network, we can achieve the task of
feature extraction with ease.
The final step of the R-CNN is to make the appropriate predictions for the image
and label the respective bounding box accordingly. In order to obtain the best
results for each task, the predictions are made by the computation of a
classification model for each task, while a regression model is used to correct the
bounding box classification for the proposed regions. For further reading and
information about this topic, refer to the following link.
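As a rough sketch of this pipeline, the code below uses OpenCV's selective search implementation (available in the opencv-contrib-python package) to generate region proposals; the feature-extraction step is left as a placeholder, since the choice of pre-trained CNN is not specified here.

import cv2

image = cv2.imread("street.jpg")  # hypothetical input image

# Selective search: repeatedly merge small segments into larger candidate regions.
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()      # fast mode trades some recall for speed
proposals = ss.process()              # array of candidate regions as (x, y, w, h)

# In an R-CNN-style pipeline, each proposal is cropped, resized to the CNN's
# input size, and turned into a feature vector, which a classifier scores and
# a regressor uses to refine the bounding box.
for (x, y, w, h) in proposals[:50]:
    crop = cv2.resize(image[y:y + h, x:x + w], (224, 224))
    # features = pretrained_cnn(crop)   # placeholder for the feature extraction step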
→ Limitations of R-CNN
1. Despite producing effective results for feature extraction with the pre-trained
CNN models, the overall procedure of extracting all the region proposals, and
ultimately selecting the best regions, with the current algorithms is extremely slow.
2. Another major drawback of the R-CNN model is not only the slow rate of
training but also the high prediction time. The solution requires large
computational resources, reducing the overall feasibility of the process. Hence,
the overall architecture can be considered quite expensive.
3. Sometimes, bad candidate selections can occur at the initial step due to the lack
of improvements that can be made in this particular step. A lot of problems in the
trained model could be caused by this.
→ Points to consider
1. When To Use R-CNN? – R-CNN, similar to the HOG object detection method, is
best used as a first baseline for testing the performance of object detection
models. Since the time taken to predict objects in an image can be longer than
anticipated, the more modern variants of R-CNN are usually preferred.
2. Example use cases – There are several applications of R-CNN for solving
different types of tasks related to object detection. For example, tracking objects
from a drone-mounted camera, locating text in an image, and enabling object
detection in Google Lens. Check out the following link for more information.
3. Faster R-CNN
→ Introduction
While the R-CNN model was able to perform the computation of object detection
and achieve desirable results, there were some major lackluster elements,
especially the speed of the model. So, faster methods for tackling some of these
issues had to be introduced to overcome the problems that existed in R-CNN.
Firstly, the Fast R-CNN was introduced to combat some of the pre-existing issues
of R-CNN.
In the Fast R-CNN method, the entire image is passed through the pre-trained
convolutional neural network instead of processing every sub-segment separately.
Region of interest (RoI) pooling is a special layer that takes two inputs, the
feature map from the pre-trained model and the proposals from the selective
search algorithm, and provides its output to a fully connected layer. In this
section, we will learn more about the Faster R-CNN network, which is an
improvement on the Fast R-CNN model.
The Faster R-CNN model is one of the best versions of the R-CNN family and
improves the speed of performance tremendously from its predecessors. While
the R-CNN and Fast R-CNN models make use of a selective search algorithm to
compute the region proposals, the Faster R-CNN method replaces this existing
method with a superior region proposal network. The region proposal network
(RPN) computes images from a wide range and different scales to produce
effective outputs.
Faster R-CNN – Object Detection Algorithm
The region proposal network reduces the proposal computation time to roughly
10 ms per image. This network consists of convolutional layers from which we
obtain the essential feature maps. At each feature map position, we have multiple
anchor boxes with varying scales, sizes, and aspect ratios. For each anchor box, we
predict a binary objectness class and generate a corresponding bounding box.
This information is then passed through non-maximum suppression to remove
redundant proposals, since many overlapping boxes are produced while creating
the feature maps. The output of non-maximum suppression is passed through the
region of interest pooling layer, and the rest of the process and computation is
similar to the working of Fast R-CNN.
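For reference, a pre-trained Faster R-CNN can be run in a few lines with torchvision. The sketch below, using a COCO-pretrained ResNet-50 FPN model and a hypothetical input image, illustrates the inference flow only; it is not the configuration used in this project, and the weights argument name may differ slightly between torchvision versions.

import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(pretrained=True)   # COCO-pretrained detector
model.eval()

image = Image.open("street.jpg").convert("RGB")    # hypothetical input image
tensor = transforms.ToTensor()(image)

with torch.no_grad():
    outputs = model([tensor])                      # one dict per input image

boxes = outputs[0]["boxes"]    # (xmin, ymin, xmax, ymax) for each detection
labels = outputs[0]["labels"]  # COCO class indices
scores = outputs[0]["scores"]  # confidence score for each box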
→ Points to consider
1. Limitations – One of the main limitations of the Faster R-CNN method is the
amount of time delay in the proposition of different objects. Sometimes, the
speed depends on the type of system being used.
2. When To Use Faster R-CNN? – The time for prediction is faster compared to
other CNN methods. While R-CNN usually takes around 40-50 seconds for the
prediction of objects in an image, the Fast R-CNN takes around 2 seconds, but
the Faster R-CNN returns the optimal result in just about 0.2 seconds.
3. Example use cases – The examples of use cases for Faster R-CNN are similar to
the ones described in the R-CNN methodology. However, with Faster R-CNN, we
can perform these tasks optimally and achieve results more effectively.
4. Single-Shot Detector (SSD)
→ Introduction
The single-shot detector for multi-box predictions is one of the fastest ways to
achieve real-time computation of object detection tasks. While the Faster
R-CNN methodologies can achieve high prediction accuracy, the overall
process is quite time-consuming, running at only about 7 frames per second,
which is far from what real-time applications require.
The single-shot detector (SSD) solves this issue by improving the frames per
second to almost five times more than the Faster R-CNN model. It removes the
use of the region proposal network and instead makes use of multi-scale features
and default boxes.
→ Overview of architecture
The single-shot multibox detector architecture can be broken down into mainly
three components. The first stage of the single-shot detector is the feature
extraction step, where all the crucial feature maps are selected. This architectural
region consists of only fully convolutional layers and no other layers. After
extracting all the essential feature maps, the next stage consists of the detection
heads, which are also built from fully convolutional layers.
However, in the second stage, the task of the detection heads is not to find the
semantic meaning of the images; instead, the primary goal is to produce the most
appropriate bounding boxes for all the feature maps. Once we have computed these
two stages, the final stage is to pass the result through non-maximum suppression
layers to reduce the error caused by repeated bounding boxes (a minimal NMS
sketch follows).
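A minimal sketch of non-maximum suppression is given below, reusing the iou() function sketched in the intersection-over-union section: boxes are visited in order of decreasing confidence, and any remaining box that overlaps a kept box beyond the threshold is discarded. Production implementations are vectorized, so this is purely illustrative.

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # boxes: list of (xmin, ymin, xmax, ymax); scores: matching confidence values.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Discard every remaining box that overlaps the kept box too much;
        # iou() is the helper sketched earlier in this report.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep                      # indices of the boxes to keep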
→ Limitations of SSD
1. The SSD, while boosting the performance significantly, suffers from decreasing
the resolution of the images to a lower quality.
2. The SSD architecture will typically perform worse than the Faster R-CNN for
small-scale objects.
→ Points to consider
1. When To Use SSD? – The single-shot detector is often the preferred method
when faster predictions are needed, typically for detecting larger objects where
accuracy is not an extremely important concern. However, for more accurate
predictions on smaller, more precise objects, other methods should be considered.
2. Example use cases – The Single-shot detector can be trained and experimented
on a multitude of datasets, such as PASCAL VOC, COCO, and ILSVRC datasets.
They can perform well on larger object detections like the detection of humans,
tables, chairs, and other similar entities.
5. YOLO (You Only Look Once)
You Only Look Once (YOLO) is one of the most popular model architectures and
algorithms for object detection. Usually, the first concept found in a Google
search for object detection algorithms is the YOLO architecture. There are
several versions of YOLO, which we will discuss in the upcoming sections. The
YOLO model uses one of the best neural network archetypes to produce high
accuracy and overall speed of processing. This speed and accuracy is the main
reason for its popularity.
The YOLO architecture utilizes three primary terminologies to achieve its goal of
object detection. Understanding these three techniques is quite significant to
know why exactly this model performs so quickly and accurately in comparison
to other object detection algorithms. The first concept in the YOLO model is
residual blocks. In the first architectural design, they have used 7×7 residual
blocks to create grids in the particular image.
Each of these grids acts as central points and a particular prediction for each of
these grids is made accordingly. In the second technique, each of the central
points for a particular prediction is considered for the creation of the bounding
boxes. While the classification tasks work well for each grid, it’s more complex
to segregate the bounding boxes for each of the predictions that are made. The
third and final technique is the use of intersection over union (IoU) to select the
best bounding boxes for the particular object detection task.
→ Advantages of YOLO
1. The computation and processing speed of YOLO is quite high, especially in
real-time compared to most of the other training methods and object detection
algorithms.
2. Apart from the fast computing speed, the YOLO algorithm also manages to
provide an overall high accuracy with the reduction of background errors seen in
other methods.
3. The architecture of YOLO allows the model to learn and develop an
understanding of numerous objects more efficiently.
→ Limitations of YOLO
1. Failure to detect smaller objects in an image or video because of the lower recall
rate.
2. Can’t detect two objects that are extremely close to each other due to the
limitations of bounding boxes.
→ Versions of YOLO
The YOLO architecture is one of the most influential and successful object
detection algorithms. With the introduction of the YOLO architecture in 2016,
their consecutive versions YOLO v2 and YOLO v3 arrived in 2017 and 2018.
While there was no new release in 2019, 2020 saw three quick releases: YOLO
v4, YOLO v5, and PP-YOLO. Each of the newer versions of YOLO slightly
improved on their previous ones. The tiny YOLO was also released to ensure that
object detection could be supported on embedded devices.
Chapter 4: Technical Architecture
Critical aspects of the technical architecture include the model training pipeline,
data preprocessing workflows, system infrastructure, and integration with
external platforms such as APIs, camera feeds, or cloud storage. Each component
is meticulously designed to contribute to the system’s overall performance,
flexibility, and maintainability. This chapter will provide a comprehensive
overview of these elements, illustrating how they collectively enable the Object
Detection System to deliver reliable and intelligent visual analysis.
4.1 Evolution from YOLO to YOLOv8:
One of the most, if not the most, well-known models in Artificial intelligence
(AI) is the “YOLO” model series.
YOLO (You Only Look Once) is a popular set of object detection models used
for real-time object detection and classification in computer vision.
In this chapter, we will focus on YOLOv8, the latest version of the YOLO
system developed by Ultralytics. We will discuss its evolution from YOLO to
YOLOv8, its network architecture, new features, and applications. Additionally,
we will outline how to use YOLOv8, and lastly how it can be used to create
model-assisted annotations with Encord Annotate.
Whether you’re a seasoned machine learning engineer or just starting out, this
guide will provide you with all the knowledge and tools you need to get started
with YOLOv8.
YOLOv1 was the first official YOLO model. It used a single convolutional neural
network (CNN) to detect objects in an image and was relatively fast compared to
other object detection models. However, it was not as accurate as some of the
two-stage models at that time.
YOLOv2 was released in 2016 and made several improvements over YOLOv1. It
used anchor boxes to improve detection accuracy and introduced the Upsample
layer, which improved the resolution of the output feature map.
YOLOv3 was introduced in 2018 with the goal of increasing the accuracy and
speed of the algorithm. The primary improvement in YOLOv3 over its
predecessors was the use of the Darknet-53 architecture, a variant of the ResNet
architecture specifically designed for object detection.
YOLO v3 also improved the anchor boxes, allowing different scales and aspect
ratios to better match the size and shape of the detected objects. The use
of Feature Pyramid Networks (FPN) and GHM loss function, along with a wider
range of object sizes and aspect ratios and improved accuracy and stability, were
also hallmarks of YOLO v3.
YOLOv5, introduced in 2020, builds upon the success of previous versions and
was released as an open-source project by Ultralytics. YOLOv5 used
the EfficientDet architecture, based on the EfficientNet network, and several new
features and improvements, to achieve improved object detection performance.
YOLOv5 became the world’s state-of-the-art repo for object detection back in
2020 given its flexible Pythonic structure and was also the first model we
incorporated for model-assisted learning at Encord.
YOLOv6 focused on making the system more efficient and reducing its memory
footprint. It made use of a new CNN architecture called SPP-Net (Spatial
Pyramid Pooling Network). This architecture is designed to handle objects of
different sizes and aspect ratios, making it ideal for object detection tasks.
The official YOLOv8 paper has not been released yet, but the creators of YOLOv8
have promised that it will come out soon (to avoid the controversy that surrounded
YOLOv5). As a result, we do not yet have a detailed overview of the methodologies
used during its creation, nor access to the ablation studies conducted by the team.
We won’t go too much into detail about the YOLOv8 architecture, but we will
cover some of the major differences from previous iterations.
The following layout was made by RangeKing on GitHub and is a great way of
visualizing the architecture.
Anchor-free Detections
Anchor boxes are a pre-defined set of boxes with specific heights and widths,
used to detect object classes with the desired scale and aspect ratio. They are
chosen based on the size of objects in the training dataset and are tiled across the
image during detection.
The network outputs probability and attributes like background, IoU, and offsets
for each tiled box, which are used to adjust the anchor boxes. Multiple anchor
boxes can be defined for different object sizes, serving as fixed starting points for
boundary box guesses.
There are a series of updates and new convolution modules in the YOLOv8
architecture, according to the introductory post from Ultralytics.
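As an illustration of how YOLOv8 is typically used, the sketch below runs a small pretrained model through the ultralytics Python package on a hypothetical image; the model size and file names are placeholders, not the configuration used in this project.

from ultralytics import YOLO

model = YOLO("yolov8n.pt")        # small pretrained YOLOv8 model (downloads on first use)
results = model("input.jpg")      # run inference on a hypothetical image

for result in results:
    for box in result.boxes:
        print(box.xyxy, box.conf, box.cls)   # box coordinates, confidence, class index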
Chapter 5: Results and Discussion
5.1 Coding:
This figure shows the image or video frame given as input to the Object Detection application.
Discussion
The feedback collected from the evaluation of the Object Detection project reveals
both the effectiveness of the current implementation and areas that warrant
further development. The model demonstrated strong performance in detecting
objects with high accuracy across diverse test scenarios, indicating the robustness
of the underlying architecture and data preprocessing techniques. The following
action points have been identified based on the results and user feedback:
● Model Refinement: While the current model performs well on standard objects,
certain classes showed lower precision and recall. Additional data augmentation,
class balancing, or fine-tuning with a more specialized dataset could improve
detection for underrepresented categories.
6.1 Conclusion
The Object Detection Project was initiated to address the increasing demand for
accurate, efficient, and scalable computer vision solutions capable of identifying
and localizing objects in real time. Through various stages of development,
training, testing, and validation, the system has demonstrated strong potential in
automating visual recognition tasks across multiple domains.
Key Achievements
• Domain Flexibility: The system was designed to be adaptable, with the ability
to retrain on different datasets for applications ranging from industrial inspection
to medical imaging, showcasing its broad applicability.
Summary:
In conclusion, the Object Detection Project has laid a strong foundation for the
development of intelligent visual recognition systems. Its high accuracy,
versatility, and real-time processing capabilities make it a promising solution for
a wide range of real-world applications. Guided by data-driven development and
user feedback, the project is well-positioned for future expansion and refinement.
Continued advancements in performance, customization, and usability will ensure
its relevance and value across industries embracing computer vision technologies.
The Object Detection Project aimed to build an intelligent system capable of
identifying and localizing objects within images.
The model achieved high accuracy in recognizing multiple object classes across
diverse environments. Robust preprocessing and data augmentation techniques
enhanced model generalization. Real-time inference was enabled through model
optimization and hardware acceleration.
The system demonstrated versatility across domains such as surveillance, retail,
and healthcare.
User-friendly deployment options were implemented via APIs and Docker
containers.
Edge-device compatibility ensured the system's applicability in low-latency
scenarios.
Custom training support allowed adaptation to specific industry use cases.
Challenges like detecting occluded or small objects were partially addressed and
noted for future work.
Feedback indicated strong performance but highlighted opportunities for UI and
speed improvements. Security and privacy concerns were acknowledged for
sensitive use cases.
Scalability was considered, with design support for high-volume image
processing.
The project sets a strong foundation for more advanced applications like instance
segmentation.
Overall, the system proves to be a reliable, adaptable, and promising solution for
modern visual detection needs.
6.2 Future Scope
While the Object Detection project has demonstrated promising results in
identifying and localizing objects across various environments, there remain
several avenues for further enhancement and exploration. The following outlines
key future directions that will help improve the system’s performance, scalability,
and applicability across diverse real-world scenarios.
Future Vision
The long-term vision for the Object Detection project is to evolve into a flexible,
real-time, and intelligent perception system capable of seamlessly integrating into
a wide range of applications, from autonomous systems to assistive
technologies, ensuring high accuracy, efficiency, and ethical reliability in every
deployment.
Conclusion
The Object Detection project has successfully demonstrated the capability to
accurately identify and localize multiple objects within diverse visual
environments. Through the application of advanced deep learning techniques, the
system has achieved strong performance in both controlled and real-world
scenarios. The model's effectiveness validates the robustness of the data
preprocessing, architecture selection, and training strategies employed.
Despite its achievements, the project also highlights opportunities for further
enhancement in areas such as real-time performance, precision in complex
scenes, and broader applicability across specialized domains. By continuing to
refine the model, expand its integration potential, and address deployment
challenges, the system can become a powerful tool in a variety of industries
including surveillance, healthcare, retail, and autonomous systems.
With a clear path for future development and a strong foundational framework,
the Object Detection project stands as a significant step toward building
intelligent, adaptable, and scalable computer vision solutions.
References