Yolo India
Object detection is a domain that has benefited immensely from recent developments in deep learning. Recent years have seen people develop many algorithms for object detection, some of which include YOLO, SSD, Mask R-CNN and RetinaNet.
For the past few months, I've been working on improving object detection at a research lab. One
of the biggest takeaways from this experience has been realizing that the best way to go about
learning object detection is to implement the algorithms by yourself, from scratch. This is exactly what we'll do in this tutorial.
We will use PyTorch to implement an object detector based on YOLO v3, one of the faster object detection algorithms out there.
The code for this tutorial is designed to run on Python 3.5 and PyTorch 0.4. It can be found in its entirety at this GitHub repo.
Prerequisites
You should understand how convolutional neural networks work. This also includes knowledge of residual blocks, skip connections, and upsampling.
Basic PyTorch usage. You should be able to create simple neural networks with ease.
I've provided the link at the end of the post in case you fall short on any front.
What is YOLO?
YOLO stands for You Only Look Once. It's an object detector that uses features learned by a
deep convolutional neural network to detect an object. Before we get our hands dirty with code, we must understand how YOLO works.
YOLO makes use of only convolutional layers, making it a fully convolutional network (FCN).
It has 75 convolutional layers, with skip connections and upsampling layers. No form of pooling
is used, and a convolutional layer with stride 2 is used to downsample the feature maps. This helps in preventing the loss of low-level features often attributed to pooling.
Being a FCN, YOLO is invariant to the size of the input image. However, in practice, we might
want to stick to a constant input size due to various problems that only show their heads when we are implementing the algorithm.
A big one amongst these problems is that if we want to process our images in batches (images in
batches can be processed in parallel by the GPU, leading to speed boosts), we need to have all
images of fixed height and width. This is needed to concatenate multiple images into a large batch (concatenating many PyTorch tensors into one).
The network downsamples the image by a factor called the stride of the network. For example, if the stride of the network is 32, then an input image of size 416 x 416 will yield an output of size 13 x 13. Generally, the stride of any layer in the network is equal to the factor by which the output of that layer is smaller than the input image to the network.
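As a quick sanity check, the relationship between input size, stride, and output size can be sketched in a few lines of Python (the sizes match the 416 x 416 example above):

```python
# The overall stride is the product of the per-layer downsampling
# factors. YOLO v3 downsamples the feature maps 5 times by a factor
# of 2 (via stride-2 convolutions), for an overall stride of 32.
def output_size(input_size, num_downsamples=5, factor=2):
    size = input_size
    for _ in range(num_downsamples):
        size //= factor
    return size

stride = 2 ** 5              # 32
print(output_size(416))      # 13, i.e. 416 // 32
```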
Typically (as is the case for all object detectors), the features learned by the convolutional layers are passed on to a classifier/regressor which makes the detection prediction (coordinates of the bounding boxes, the class label, etc.).
In YOLO, the prediction is done by using a convolutional layer with 1 x 1 convolutions.
Now, the first thing to notice is our output is a feature map. Since we have used 1 x 1
convolutions, the size of the prediction map is exactly the size of the feature map before it. In
YOLO v3 (and its descendants), the way you interpret this prediction map is that each cell can predict a fixed number of bounding boxes.
Though the technically correct term to describe a unit in the feature map would be
a neuron, calling it a cell makes it more intuitive in our context.
Depth-wise, we have (B x (5 + C)) entries in the feature map. B represents the number of
bounding boxes each cell can predict. According to the paper, each of these B bounding boxes
may specialize in detecting a certain kind of object. Each of the bounding boxes has 5 + C attributes, which describe the center coordinates, the dimensions, the objectness score and C class confidences for each bounding box. YOLO v3 predicts 3 bounding boxes for every
cell.
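To make the depth concrete, here is a small sketch of the prediction map shape for the COCO setting used later in this tutorial (C = 80 classes, B = 3 boxes per cell, stride 32):

```python
# Each cell holds B bounding boxes, each with 5 + C attributes
# (x, y, w, h, objectness score, and C class confidences).
B, C = 3, 80
depth = B * (5 + C)          # 255 channels
grid = 416 // 32             # 13 x 13 cells at stride 32
print((grid, grid, depth))   # (13, 13, 255)
```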
You expect each cell of the feature map to predict an object through one of its bounding boxes if the center of the object falls in the receptive field of that cell. (The receptive field is the region of the input image visible to the cell. Refer to the link on convolutional neural networks for a refresher.)
This has to do with how YOLO is trained, where only one bounding box is responsible for
detecting any given object. First, we must ascertain which of the cells this bounding box belongs
to.
To do that, we divide the input image into a grid of dimensions equal to that of the final feature
map.
Let us consider an example below, where the input image is 416 x 416, and stride of the network
is 32. As pointed out earlier, the dimensions of the feature map will be 13 x 13. We then divide the input image into 13 x 13 cells. The cell (on the input image) containing the center of the ground truth box of an object is chosen to be the one responsible for predicting the object. In the image, it is the cell marked red, which contains the center of the ground truth box (marked yellow).
Now, the red cell is the 7th cell in the 7th row on the grid. We now assign the 7th cell in the 7th row on the feature map (the corresponding cell on the feature map) as the one responsible for detecting the dog.
Now, this cell can predict three bounding boxes. Which one will be assigned to the dog's ground
truth label? In order to understand that, we must wrap our heads around the concept of anchors.
Note that the cell we're talking about here is a cell on the prediction feature map. We
divide the input image into a grid just to determine which cell of the prediction feature
map is responsible for prediction.
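Mapping a ground-truth center (in input-image pixels) to its responsible cell is just integer division by the stride. A minimal sketch, using a hypothetical center for the dog's box:

```python
# Hypothetical ground-truth center of the dog's box, in pixels.
gt_cx, gt_cy = 210.0, 195.0
stride = 32

# Integer division by the stride gives the (column, row) of the
# responsible cell on the 13 x 13 feature map (0-indexed, so
# index 6 corresponds to the 7th cell).
cell_col = int(gt_cx // stride)
cell_row = int(gt_cy // stride)
print(cell_col, cell_row)   # 6 6
```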
Anchor Boxes
It might make sense to predict the width and the height of the bounding box, but in practice, that
leads to unstable gradients during training. Instead, most of the modern object detectors predict
log-space transforms, or simply offsets to pre-defined default bounding boxes called anchors.
Then, these transforms are applied to the anchor boxes to obtain the prediction. YOLO v3 has
three anchors, which result in prediction of three bounding boxes per cell.
Coming back to our earlier question, the bounding box responsible for detecting the dog will be
the one whose anchor has the highest IoU with the ground truth box.
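Since anchors are matched by shape (they are compared as if centered on the same point), the IoU here reduces to an overlap of widths and heights. A sketch, using the three stride-32 anchors from the YOLO v3 config and a hypothetical ground-truth size:

```python
def wh_iou(w1, h1, w2, h2):
    # IoU of two boxes sharing the same center: the intersection is
    # the overlap of the widths times the overlap of the heights.
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union

# The three anchors used at stride 32 in the YOLO v3 config (pixels).
anchors = [(116, 90), (156, 198), (373, 326)]

gt_w, gt_h = 180, 210   # hypothetical ground-truth box size
best = max(range(len(anchors)),
           key=lambda i: wh_iou(gt_w, gt_h, *anchors[i]))
print(best)   # 1: the (156, 198) anchor fits this box best
```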
Making Predictions
The following formulae describe how the network output is transformed to obtain bounding box predictions.

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw * e^(tw)
bh = ph * e^(th)

bx, by, bw, bh are the x, y center co-ordinates, width and height of our prediction. tx, ty, tw, th is what the network outputs. cx and cy are the top-left co-ordinates of the grid. pw and ph are the anchor dimensions for the box.
Center Coordinates
Notice we are running our center coordinates prediction through a sigmoid function. This forces
the value of the output to be between 0 and 1. Why should this be the case? Bear with me.
Normally, YOLO doesn't predict the absolute coordinates of the bounding box's center. It predicts offsets which are:
Relative to the top left corner of the grid cell which is predicting the object.
Normalised by the dimensions of the cell from the feature map, which is 1.
For example, consider the case of our dog image. If the prediction for center is (0.4, 0.7), then
this means that the center lies at (6.4, 6.7) on the 13 x 13 feature map (since the top-left co-ordinates of the red cell are (6, 6)). But wait, what happens if the predicted x, y co-ordinates are greater than one, say (1.2, 0.7)? This means the center lies at (7.2, 6.7). Notice the center now lies in the cell just right of our red cell, or the 8th cell in the 7th row. This breaks the theory behind YOLO, because if we postulate that the red cell is responsible for predicting the dog, the center of the dog must lie in the red cell, and not in the one beside it.
Therefore, to remedy this problem, the output is passed through a sigmoid function, which
squashes the output into a range from 0 to 1, effectively keeping the center in the grid cell which is doing the predicting.
The dimensions of the bounding box are predicted by applying a log-space transform to the output and then multiplying with an anchor.
Credits: http://christopher5106.github.io/
The resultant predictions, bw and bh, are normalised by the height and width of the image.
(Training labels are chosen this way.) So, if the predictions bw and bh for the box containing the dog are (0.3, 0.8), then the actual width and height on the 13 x 13 feature map is (13 x 0.3, 13 x 0.8).
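Putting the two transforms together, decoding a single raw prediction can be sketched as follows. The raw outputs and the anchor size here are hypothetical; cell offsets and anchor dimensions are in feature-map units:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    bx = sigmoid(tx) + cx     # center x, kept inside the cell
    by = sigmoid(ty) + cy     # center y, kept inside the cell
    bw = pw * math.exp(tw)    # log-space transform of the width
    bh = ph * math.exp(th)    # log-space transform of the height
    return bx, by, bw, bh

# Hypothetical raw outputs for the red cell at grid offset (6, 6),
# with a hypothetical anchor of size (3.6, 2.8).
bx, by, bw, bh = decode_box(0.5, 0.2, 0.1, -0.2, 6, 6, 3.6, 2.8)
print(bx, by, bw, bh)
```

Note how the sigmoid guarantees the decoded center stays between 6 and 7, i.e. inside the predicting cell.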
Objectness Score
Object score represents the probability that an object is contained inside a bounding box. It
should be nearly 1 for the red and the neighboring grids, whereas almost 0 for, say, the grid at the
corners.
Class confidences represent the probabilities of the detected object belonging to a particular class
(Dog, cat, banana, car etc). Before v3, YOLO used to softmax the class scores.
However, that design choice has been dropped in v3, and authors have opted for using sigmoid
instead. The reason is that softmaxing class scores assumes that the classes are mutually exclusive. In simple words, if an object belongs to one class, then it's guaranteed it cannot belong to another class. This is true for the COCO database on which we will base our detector.
However, this assumption may not hold when we have classes like Women and Person. This is
the reason that authors have steered clear of using a Softmax activation.
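The difference is easy to see numerically. A small sketch with hypothetical logits for two non-exclusive classes such as Person and Women:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical class logits for the same box.
logits = [3.0, 2.5]   # "Person", "Women"

# Softmax forces the scores to compete: they must sum to 1.
print(softmax(logits))

# Sigmoid scores each class independently: both can be close to 1.
print([sigmoid(s) for s in logits])
```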
YOLO v3 makes predictions across 3 different scales. The detection layer is used to make detections at feature maps of three different sizes, having strides 32, 16 and 8 respectively. This means that, with an input of 416 x 416, we make detections on scales 13 x 13, 26 x 26 and 52 x 52.
The network downsamples the input image until the first detection layer, where a detection is
made using feature maps of a layer with stride 32. Further, layers are upsampled by a factor of 2
and concatenated with feature maps of previous layers having identical feature map sizes. Another detection is now made at the layer with stride 16. The same upsampling procedure is repeated, and a final detection is made at the layer of stride 8.
At each scale, each cell predicts 3 bounding boxes using 3 anchors, making the total number of anchors used 9 (the anchors are different for different scales). The authors report that this helps YOLO v3 get better at detecting small objects, a frequent complaint with the earlier versions of YOLO. Upsampling can help the network learn fine-grained features which are instrumental for detecting small objects.
For an image of size 416 x 416, YOLO predicts ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10647 bounding boxes. However, in the case of our image, there's only one object, a dog. How do we reduce the detections from 10647 to 1?
First, we filter boxes based on their objectness score. Generally, boxes having scores below a threshold are ignored.
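The 10647 figure is just the three grid sizes times 3 boxes per cell:

```python
# Three detection scales for a 416 x 416 input, 3 boxes per cell.
grids = (52, 26, 13)
total = sum(g * g * 3 for g in grids)
print(total)  # 10647
```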
Non-maximum Suppression
NMS intends to cure the problem of multiple detections of the same object. For example, all 3 bounding boxes of the red grid cell may detect the same box, or the adjacent cells may detect the same object.
If you don't know about NMS, I've provided a link to a website explaining the same.
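For reference, a minimal greedy NMS can be sketched as below. This is a plain-Python illustration under assumed inputs, not the tutorial's actual implementation; boxes are (x1, y1, x2, y2) corner coordinates:

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # Greedily keep the highest-scoring box, then drop every
    # remaining box that overlaps it by more than the threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

# Two overlapping detections of the same dog, plus one distant box.
boxes = [(100, 100, 200, 200), (105, 98, 205, 198), (300, 300, 380, 380)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```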
Our Implementation
YOLO can only detect objects belonging to the classes present in the dataset used to train the
network. We will be using the official weight file for our detector. These weights have been
obtained by training the network on the COCO dataset, and therefore we can detect 80 object
categories.
That's it for the first part. This post explains enough about the YOLO algorithm to enable you to implement the detector. However, if you want to dig deep into how YOLO works, how it's trained and how it performs compared to other detectors, you can read the original papers, linked at the end of the post.
In the next part, we implement the various layers required to put together the detector.
Part 2