
OBJECT DETECTION & TRACKING
MEET THE INSTRUCTOR

Nezar Ahmed
Machine Learning Lead
Synapse Analytics
Master’s Student
Computer Communication and Engineering
Cairo University
AI Instructor
ITI / Epsilon AI / AMIT
DISCLAIMER AND ACKNOWLEDGMENT
Some of the slides are taken from various courses, articles, and tutorials on computer vision, such as Tutorialspoint, PyImageSearch, Analytics Vidhya, Medium, and Towards Data Science.
OBJECT DETECTION
The classification task we have gone through is mainly about an image containing a
specific object or state, where your network tells you the probability of each of your
predefined classes for this image, and the class with the highest probability is the class
representing the image.
What if we want not just to say what the class of the object in the image is, but also to
detect its position in the image? Is that all? No: what if there are multiple objects in the
image and you want to detect the position of each object and then classify it? Here come
some terminologies like localization and detection.
Localization: refers to having a single object in the image, where the task is to detect its
position by drawing a bounding box around the object and classify it.
Detection: refers to having multiple objects in the same image (objects could be of the
same class or different classes), where the task is to determine the position of each
object by drawing a bounding box around it and classify it.
OBJECT DETECTION
OBJECT DETECTION
How do we solve the object detection problem?
Let us take the following example, start from the most naive idea, and keep building on
it until we reach an end-to-end solution.
Assume we are building a pedestrian detection system for an autonomous driving system:
the car captures the following image and our target is to detect those pedestrians:
OBJECT DETECTION
Approach 1: Naive way (divide and conquer)
The simplest approach is to divide our image into four parts as shown, then pass each
part to a classifier that tells whether it contains a pedestrian or not.
OBJECT DETECTION
If the classifier finds that one of the 4 parts contains a pedestrian, then the box is that
whole part (here, the whole upper part).

This is a good starting point, but we still need more precise bounding boxes around the
objects.
OBJECT DETECTION
How can we enhance the previous idea?
Approach 2: Increase the number and sizes of divisions
Instead of having only 4 patches of fixed size to give to our trained classifier, let us
generate several patches of different sizes, passing each one and storing its size and
position to act as a bounding box, as shown:
OBJECT DETECTION
A lot of boxes! But the good thing is that we now have some boxes closer to the two
objects than in the previous method, so we are somewhat closer to finding precise and
accurate bounding boxes.
Can you suggest the next step?
Approach 3: Structured divisions with boxes of different aspect ratios
For a more structured division, let us divide our image into a grid (say 10×10):
OBJECT DETECTION
Define the center of each grid cell, then for each center take several patches (say 3) of
different aspect ratios, as shown:
OBJECT DETECTION
Pass every patch of every grid cell to the classifier to get predictions, save the bounding
boxes of the classified patches, and see how they look on the image, as shown:

Fewer bounding boxes than in the previous approach, and closer to the right ones. We are
a few steps closer to precise bounding boxes.
Can you suggest the next enhancement?
OBJECT DETECTION
Approach 4: More structured divisions with more suggested patches
Instead of a 10x10 grid make it 20x20, and instead of 3 patches make them 9, for more
aspect ratios and more sizes, for example:

This leads to more computation, so we need a methodology that chooses only the best
candidates to pass to the classifier (covered later this session).
OBJECT DETECTION
Approach 5: Use deep learning to find an end-to-end approach
Deep learning has much potential in the object detection space. Can you recommend
where and how we can leverage it for our problem? A couple of methodologies are
listed below:
● Instead of taking patches from the original image, we can pass the original image
through a neural network to reduce the dimensions.
● We could also use a neural network to suggest selective patches.
● We can train a deep learning algorithm to give predictions as close to the
ground-truth bounding box as possible. This ensures the algorithm gives tighter
and finer bounding box predictions.
Now, instead of training different neural networks to solve each individual problem, we
can use a single deep neural network model that attempts to solve all the problems by
itself. The advantage of doing this is that each of the smaller components of the neural
network helps in optimizing the other parts of the same network. This lets us train the
entire deep model jointly.
FC LAYERS TO CONV LAYERS
Targeting the problem of the high computational cost of sliding windows:
The problem mentioned before is the high computational cost: we run several sliding
windows of different sizes, each window moves along the whole image, and each time we
pass the cropped part through a convnet, which is expensive.

Before getting into how to solve this problem, let us cover another idea first, then get
back to the solution that uses it.
FC LAYERS TO CONV LAYERS
Turning a fully connected (FC) layer into a convolutional layer
Assume we have a network as shown:

Now we want a fully connected layer, but in the form of a convolutional layer. What can
we do?
FC LAYERS TO CONV LAYERS
We convolve the last layer before the required FC layer with n filters, each of the same
spatial dimension (and number of channels) as that layer, where n is the length of the FC
layer. In our example, the layer before the first FC layer has dimension 5×5×16, so we
convolve it with 400 filters, each of dimension 5×5×16. Each convolution results in a 1×1
output (a single value), and since we have 400 filters, the final output has dimension
1×1×400, as shown:
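As an illustration (not taken from the slides), here is a minimal PyTorch sketch of this equivalence; the 5×5×16 feature map and the 400 filters follow the example above, while the tensor and variable names are arbitrary:

# Minimal sketch (PyTorch assumed): an FC layer over a 5x5x16 feature map
# is the same operation as a 5x5 convolution with 400 filters.
import torch
import torch.nn as nn

feature_map = torch.randn(1, 16, 5, 5)          # N x C x H x W, as in the slide

fc = nn.Linear(5 * 5 * 16, 400)                 # classic fully connected head
conv = nn.Conv2d(16, 400, kernel_size=5)        # the same head as a convolution

# Copy the FC weights into the conv filters to show the equivalence.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(400, 16, 5, 5))
    conv.bias.copy_(fc.bias)

out_fc = fc(feature_map.flatten(1))             # shape: (1, 400)
out_conv = conv(feature_map)                    # shape: (1, 400, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-5))  # True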
FC LAYERS TO CONV LAYERS
What about the next FC layer? Similarly, we look at the layer before it: its size is 1×1×400
and the size needed for the FC layer is also 400, so we convolve this layer with 400 filters,
each of size 1×1×400, so that the output is again of size 1×1×400, as shown:

Similarly for the softmax layer, which has 4 outputs (4 classes): to form this layer we
convolve the layer before it with 4 filters, each of dimension 1×1×400, ending up with an
output of dimension 1×1×4. This gives our final network, as shown in the next slide:
FC LAYERS TO CONV LAYERS
Final Network
OVERFEAT
Why would we need the previous idea?
Any object detection method has mainly 2 stages: the training stage and the inference
(testing) stage. In the training stage there is no problem; the network is trained normally,
except that we replace the fully connected layers with conv layers as explained earlier.
However, this alone would not bring any enhancement, so where is the enhancement
this idea proposes?
The enhancement is in the inference stage, where the sliding window loops over the
image, crops the region of interest, and passes it to the classifier trained as described
before. Converting the FC layers to conv layers lets us apply all the sliding windows at
once using convolutions, as we normally do, instead of passing each cropped part through
the classification network separately.
OVERFEAT
Let us assume that the sliding window we use at inference is 14×14 and that our test
image is of size 16×16, and we want to apply this 14×14 sliding window with stride=2
(meaning every time we slide the window over the test image, we shift it by 2 pixels).
Doing so, we find that there are 4 possible windows to apply, as shown on the test image:
OVERFEAT
So what we were doing is that each of these 4 windows crops the test image and passes
the cropped photo to the convnet to decide whether there is an object of the 4 classes or
not. This is computationally expensive, as we repeat it 4 times, once per window, and keep
in mind this is a simple example with only 4 possible windows; with stride=1 there would
be 16 possible windows, and if the image is larger and the sliding window is smaller there
would be many more.
What the OverFeat method says is to apply the 16×16 test image as it is, without any
cropping, to the same pre-trained network. The output then contains all 4 possible
outcomes of the 4 windows internally, so the final output is 2×2 instead of 1×1, and we get
4 output vectors, where each one represents the output of one of the windows, as shown
in the next slide.
OVERFEAT

Can you tell what this output means?

Each of the 2x2 values corresponds to one of the 4 windows shown before.
What does the number 4 mean in the 2x2x4 of the last layer?
This is the number of classes we classify between.

Think of it as if each 1x1x4 vector is the classifier output of one of the 4 windows.

OVERFEAT
Similarly, if the test image is 28×28 and we have the same 14×14 sliding window with
stride=2, we now have 64 possible windows. Instead of doing 64 passes through the
trained network, one per cropped window, we pass the image once and get the whole
output at once, as shown:
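A minimal sketch (PyTorch assumed; the layer sizes follow the classic 14×14 example above, so the exact architecture in the slides may differ) showing that the same fully convolutional network evaluates all windows in one forward pass:

# Sketch: a convnet "trained" on 14x14 crops whose FC layers are conv layers,
# applied directly to larger test images.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),    # 14x14x3 -> 10x10x16
    nn.MaxPool2d(2),                    # -> 5x5x16 (this pooling gives stride 2)
    nn.Conv2d(16, 400, kernel_size=5),  # "FC" layer as conv -> 1x1x400
    nn.ReLU(),
    nn.Conv2d(400, 400, kernel_size=1), # second "FC" layer -> 1x1x400
    nn.ReLU(),
    nn.Conv2d(400, 4, kernel_size=1),   # 4-class output -> 1x1x4
)

print(net(torch.randn(1, 3, 14, 14)).shape)  # (1, 4, 1, 1): one window
print(net(torch.randn(1, 3, 16, 16)).shape)  # (1, 4, 2, 2): the 4 windows at once
print(net(torch.randn(1, 3, 28, 28)).shape)  # (1, 4, 8, 8): all 64 windows at once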
OVERFEAT
Everything we have discussed so far is classification, where the bounding box is just the
spatial location of the sliding window. However, sliding windows are still not a very
efficient way to predict an accurate bounding box, as there may be no window that
perfectly matches the object. Can you suggest a way?
The answer is in this image:
OVERFEAT
If you remove the regressor box from the previous pipeline, it turns back into a
classification network. What we will do is add regression for localization, where regression
means returning numbers instead of a class; in our case we return 4 numbers (x, y, width,
height) that describe a bounding box. You train this system with an image and a
ground-truth bounding box, and use the L2 distance to calculate the loss between the
predicted bounding box and the ground truth.
How is this part trained? Training pipeline:
● The image classification network is trained first, as shown previously.
● The top part of the classification network (the convnets used instead of FC layers)
is replaced by regression layers, i.e. convnets whose last layer uses an activation
other than softmax.
● The weights of the first part are fixed and we train only the newly added regression
part, using an L2 loss (rather than the cross-entropy loss) between the prediction
the network produces and the ground truth we pass to the network.
OVERFEAT
Inference pipeline
● Apply classification at each location using the OverFeat method.
● Perform localization on all classified regions generated by the classifier.
● Merge bounding boxes with sufficient overlap from localization and pick the box
with the highest confidence (highest classification score).
What we have said here is missing a very important thing. Can you guess what it is?
The scale! If I keep passing the image at the same size through a network with the same
convolution and pooling sizes, I can only detect objects at this scale.
Can you suggest a solution for this problem?
OverFeat used the idea of image pyramids, where the input image is passed at 6 different
scales to be able to capture larger or smaller objects.

Check the following 2 slides.


OVERFEAT
OVERFEAT
OVERFEAT
As we have seen in the previous 2 slides, as the image scale gets larger, the convolution
covers a relatively smaller part of the image (detecting smaller objects), which results in a
different output size (we call it the spatial output), where each output corresponds to a
certain scale. Don't forget that we have 5 numbers as output: 1 for classification and 4 for
the bounding box, each obtained from its relevant part (classification part or regression
part).
Notes
Each element in the spatial output produces 1 localization (class + bounding box), which
results in an explosion of boxes as shown below:
- This is solved using IoU to keep the most prominent
boxes (will be discussed later).
- In detection, we also train on negative samples
(backgrounds).
RCNN
R-CNN was introduced after OverFeat, in 2014, by Girshick et al.
Algorithm
● The region proposal step.
● The feature extraction step using CNNs.
● The classification/regression step.

Concept
● The idea of selective search.

What is selective search?

It is the idea that each image does not need more than about 2000 proposals (windows)
on which to perform convolutional operations and extract features.

To illustrate how this is done, we need to talk about a basic image segmentation method.
RCNN
Image segmentation
This means that each pixel in the image is assigned to one region (class).
Felzenszwalb and Huttenlocher (2004) proposed an algorithm for segmenting an image
into similar regions using a graph-based approach. It is also the initialization method
for selective search (a popular region proposal algorithm) that we will discuss later.
There are 2 approaches to create a graph (like computational graphs) out of an image:
● Grid Graph: each pixel is only connected to its surrounding neighbours (8 other
cells in total). The edge weight is the absolute difference between the intensity
values of the pixels.
● Nearest Neighbor Graph: each pixel is a point in the feature space (x, y, L, a, b), in
which (x, y) is the pixel location and (L, a, b) is the color value in L*a*b space. The
edge weight is the Euclidean distance between the two pixels' feature vectors
((x, y, L, a, b)) ⇒ this is what will be used here.
The original method used R, G, B initially; however, the L*a*b space makes more sense.
RCNN
Connected Components
A connected component (or just component) of an undirected graph is a subgraph in
which any two vertices are connected to each other by paths, and which is connected to
no additional vertices in the supergraph. For example, the graph shown in the
illustration has three connected components. A vertex (pixel) with no incident edges is
itself a connected component. A graph that is itself connected has exactly one
connected component, consisting of the whole graph. The connected components of the
graph are taken to be the segments in the image segmentation (so we will refer to
segments and connected components interchangeably).
RCNN
Algorithm steps of the segmentation

Note that MInt is called the minimum internal difference, which is the threshold we
compare the edge weight to.
The weight is computed using any distance, e.g. the Euclidean distance.
RCNN
How does selective search work?
● Apply Felzenszwalb and Huttenlocher's graph-based image segmentation
algorithm to create regions to start with.
● Apply hierarchical clustering on the segmented image:
○ First, the similarities between all neighbouring regions are calculated.
○ The two most similar regions are grouped together, and new similarities
are calculated between the resulting region and its neighbours.
● The process of grouping the most similar regions (step 2) is repeated until the
whole image becomes a single region. From this hierarchy, the ~2000 region
proposals are generated, as shown in the figure in the next slide.
RCNN
RCNN
In the hierarchical clustering, what are the similarities used to merge regions?
Given two regions (ri, rj), selective search proposed four complementary similarity
measures:
● Color similarity.
● Feature/texture: use an algorithm that works well for material recognition, such as
SIFT/LBP.
● Size: small regions are encouraged to merge early.
● Shape: ideally, one region can fill the gap of the other.
What happens after the 2000 region proposals are generated?
Each region proposal is processed to get a bounding box around it covering the whole
segment. Affine image warping is then applied to compute a fixed-size box (224x224).
Input ⇒ 2000 proposed regions, which consist of background and the object classes.
Output ⇒ 2000 warped images of fixed size.
RCNN
After we extract our region proposal bounding boxes, we also have to label them for
training. Therefore, the authors label all the proposals having IOU of at least 0.5 with
any of the ground-truth bounding boxes with their corresponding classes. However, all
other region proposals that have an IOU of less than 0.3 are labelled as background.
What is IoU?
IoU (intersection over union) measures the degree of overlap between 2 bounding boxes
by dividing the intersection area of the two boxes (yellow box) by the union of their areas
(area of purple box + area of red box - area of yellow box).
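A minimal Python sketch of this computation, assuming each box is given by its corners (x1, y1, x2, y2) (a convention not stated in the slide):

# IoU between two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143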
RCNN
Feature extraction using CNN
This was done with AlexNet, where we pass each warped region through the whole
network except for the softmax layer. Hence a feature vector of size 4096 is generated to
represent the proposal region.

Classification and regression step

The generated 4096-dimensional vector is passed through an SVM classifier trained for
the classification part. The regressor used here is a simple bounding-box regressor. The
classifier and the regressor are trained independently, as 2 separate models.

Note: in the inference stage, after passing the proposals to the CNN and then passing the
feature vector through the classification and regression step, many bounding boxes exist,
and a lot of them are redundant, overlapping boxes that need to be removed. To
accomplish that, the non-maximum suppression algorithm is used.
RCNN
Non-Max Suppression
Non-max suppression is used to keep only one box for each object, as there will be many
bounding boxes pointing at the same object, as shown below:
Note that the value on each box is the highest probability among the classes of the
network (Pc).
RCNN
Algorithm
1- Discard all boxes associated with low probabilities (Pc < threshold, say 0.6).
2- While there are remaining boxes, loop over the following (see the sketch after this list):
● Take the box with the highest probability (Pc) among all the remaining boxes and
keep it.
● Compute the IoU between this kept box and all the remaining boxes, and discard
any remaining box with IoU > threshold (say 0.5), because these boxes are most
probably detecting the same object as the kept box.
● Repeat the 2 previous steps on the non-discarded boxes (IoU < 0.5), because these
boxes most probably belong to another object in the image.
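A minimal Python sketch of these steps (the 0.6 and 0.5 thresholds follow the example above; iou() is the function from the earlier IoU sketch):

# Non-max suppression over a list of detections, each given as (box, score).
def non_max_suppression(detections, score_threshold=0.6, iou_threshold=0.5):
    # Step 1: discard low-confidence boxes.
    boxes = [d for d in detections if d[1] >= score_threshold]
    # Sort by score so we always keep the most confident remaining box first.
    boxes.sort(key=lambda d: d[1], reverse=True)

    kept = []
    while boxes:
        best = boxes.pop(0)          # highest-probability box: keep it
        kept.append(best)
        # Discard remaining boxes that overlap it too much (same object).
        boxes = [d for d in boxes if iou(best[0], d[0]) < iou_threshold]
    return kept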
RCNN
Now let us work through the example we saw 2 slides ago. We pick the highest probability
among all boxes (0.9) and keep it (in white); then we compute the IoU with the remaining
4 boxes and find that 2 of them (the 0.6 and 0.7 boxes) have a high IoU with this white
box, so we discard them.
Now 2 non-discarded boxes remain (on the left), so we take the highest of these 2 (the 0.8
box, in white), then compute the IoU with the remaining box (the 0.7 box on the left) and
find that it has a high IoU, so we discard it. Eventually no boxes remain, so the final
output is the 2 white boxes with 0.9 and 0.8 class probability.
RCNN
Problems with RCNN
● It still takes a huge amount of time to train the network, as you have to classify
2000 region proposals per image and then pass them to the bounding box
regressor.
● It cannot run in real time, as inference takes around 40-50 seconds per test
image.
● The selective search algorithm is a fixed algorithm, so no learning happens at
that stage. This can lead to the generation of bad candidate region proposals.

The idea was enhanced by Fast R-CNN, which builds on spatial pyramid pooling (SPP-Net),
so let's go over SPP-Net first.
SPP-NET
SPP-Net
SPP-Net tried to remove the constraint of a fixed input size (224x224) so that the input
can have any size. What do we mean by this? Instead of cropping each proposal, fixing its
size, and then passing it through the CNN, we eliminate the resizing step: there is no need
to feed the 2000 proposals sequentially, as the image is processed all at once. This is done
by applying a new pooling operation, called spatial pyramid pooling, after the last conv
layer and before the fully connected layers (spatial pyramid pooling sits in between them).
Spatial pyramid pooling (aka RoI pooling)
It is a pooling applied to JUST the section of the feature maps of the last conv layer that
corresponds to the proposal region. The rectangular section of the conv layer
corresponding to a region can be calculated by projecting the region onto the conv layer,
taking into account the downsampling happening in the intermediate layers.

Check the following 2 slides to get the idea visually


SPP-NET
Each region proposal yields feature maps of a different size from the last conv layer, yet
we need the final output to always have size 4096. Note that the number of channels of
the last conv layer is 256. To output a 4096-dimensional vector with 256 channels (feature
maps), each map must be pooled to 4x4, so that flattening gives 4x4x256 = 4096. How is
this applied?
Let the feature map input to the SPP layer be of size (13x13).
We need the output size for a single map to be (4x4).
Pooling window = ceiling(13/4) ⇒ 4x4 pooling
Stride = floor(13/4) ⇒ 3

Now, applying the above window (4x4) with stride 3 on a (13x13) map, we get a (4x4)
output. This operation is applied to all 256 feature maps, so we get an output of
(4x4x256 ⇒ 4096), as in the sketch below.
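A small Python sketch of this window/stride arithmetic (generalized to any map size and output size; the 13×13 → 4×4 case reproduces the numbers above):

# Compute the pooling window and stride that map an h x w feature map
# to a fixed out_size x out_size grid.
import math

def spp_level(h, w, out_size):
    win = (math.ceil(h / out_size), math.ceil(w / out_size))      # window size
    stride = (math.floor(h / out_size), math.floor(w / out_size)) # stride
    return win, stride

print(spp_level(13, 13, 4))   # window (4, 4), stride (3, 3) as in the slide
print(spp_level(13, 13, 2))   # coarser pyramid level of the same map
print(spp_level(13, 13, 1))   # global pooling level (one value per map)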
SPP-NET
What if the region proposal is rectangular rather than square, so the input to the SPP
layer is not n×n but n×m (the most common case)?
We apply the ceiling (for the window) and the floor (for the stride) on the width and
height separately.

SPP-Net actually proposed a feature vector of size 5376, as shown in the next slide:
● Each feature map is pooled to a single value (grey), forming a 256-d vector.
● Then, each feature map is pooled to 4 values (green), forming a 4×256-d vector.
● Similarly, each feature map is pooled to 16 values (blue), forming a 16×256-d
vector.
● The above 3 vectors are concatenated to form one 1-d vector.
● Finally, this 1-d vector goes into the FC layers as usual.
Don't forget that the pooling window size and stride are computed separately for each of
the 3 levels.
SPP-NET
FAST R-CNN
Fast R-CNN is a successor of R-CNN which is much faster.

Concepts
● It uses the idea of spatial pyramid pooling to create a fixed-length feature vector
from variable-size input.
● Compared to R-CNN, which trains multiple separate stages for feature extraction,
classification and regression, Fast R-CNN builds a network that trains feature
extraction, classification and regression simultaneously in the same network.
● Fast R-CNN shares computation (i.e. the convolutional layer calculations) across all
proposals (i.e. RoIs) rather than doing the calculations for each proposal
independently. This is done using the new RoI pooling layer, which makes Fast
R-CNN faster than R-CNN.
FAST RCNN
How does it become a single stage combining all training together?
We no longer train the networks independently for classification and regression; instead,
the bounding box regression is added to the neural network training itself. So the network
now has two heads, a classification head and a bounding box regression head. This
multitask objective is a salient feature of Fast R-CNN, as it no longer requires training a
set of networks independently for classification (SVMs) and localization (bounding box
regressors). This change, along with the RoI pooling idea, reduces the overall training time
and increases the accuracy compared to R-CNN, because of the end-to-end learning of
the CNN.
FAST RCNN
Problems with Fast RCNN
It still uses selective search as the proposal method to find the regions of interest, which
is a slow and time-consuming process. It takes around 2 seconds per image to detect
objects, which is much better than R-CNN. But when we consider large real-life datasets,
even Fast R-CNN doesn't look so fast anymore. Let's see what Faster R-CNN does to solve
this problem.
FASTER RCNN
Faster R-CNN is the successor of Fast R-CNN; it mainly solves the problem of the selective
search algorithm, which proposes regions in ~2 seconds per image.
Concept
● Introduced the idea of Region Proposal Network (RPN) to generate the proposal
regions.
Algorithm
● We take an image as input and pass it to the ConvNet which returns the feature
maps for that image.
● Region proposal network is applied on these feature maps. This returns the object
proposals along with their objectness score.
● An RoI pooling layer is applied on these proposals to bring down all the proposals
to the same size.
● Finally, the proposals are passed to a fully connected layer which has a softmax
layer and a linear regression layer at its top, to classify and output the bounding
boxes for objects.
FASTER RCNN
FASTER RCNN
How does the RPN work?
● At the last layer of an initial CNN, a 3x3 sliding window moves across the feature
map and maps it to a lower dimension (e.g. 256-d).
● For each sliding-window location, it generates multiple possible regions based on
k fixed-ratio anchor boxes (default bounding boxes) ⇒ 9 anchor boxes at each
sliding-window location.
● Each region proposal consists of:
○ an "objectness" score for that region, telling whether there is an object or not;
○ coordinates representing the bounding box of the region.
FASTER RCNN
FASTER RCNN
Anchor boxes
The 9 anchor boxes per location cover wide (1:2), tall (2:1) and square (1:1) aspect ratios,
as shown below:
YOLO
You Only Look Once (YOLO) offers one of the best balances between accuracy and speed:
it is not as accurate as R-CNN and its variants, but it is much faster, which makes it a good
choice for real-time object detection. YOLO does all the learning in one shot by making
the network proposal-free (it does not need a region proposal step, neither an RPN nor
selective search). It can run at very high speed (for its time), reaching 45 FPS.
Algorithm
● Divide your image into SxS grid cells such that, for each object present in the
image, one grid cell is responsible for predicting it (based on the object's center).
● Each grid cell predicts B bounding boxes to cover objects in this cell, where each
box is composed of one confidence score and 4 numbers representing the box
(x, y, w, h).
● Non-max suppression is applied to remove highly overlapping boxes.
YOLO
Let us take this image as an input:
YOLO
YOLO then divides the image into an SxS grid such that each cell can predict only 1 object.
For each object present in the image, one grid cell is responsible for predicting it: the cell
into which the center of the object falls. For example, the yellow grid cell below tries to
predict the "person" object, whose center (the blue dot) falls inside that cell.
YOLO
Each grid cell predicts a fixed number of bounding boxes (say 2). In this example, the
yellow grid cell makes two bounding box predictions (blue boxes) to locate where the
person is.
YOLO
What is the output of each cell, assuming 2 predicted boxes per cell?
Each bounding box contains 5 elements: (x, y, w, h) and a box confidence score. Formally,
we define confidence as Pr(Object) × IoU(pred, truth). If no object exists in that cell, the
confidence score should be zero; if an object exists, it equals the IoU (since Pr(Object) = 1).
Each cell (not each bounding box) has 20 conditional class probabilities.
The conditional class probability is the probability that the detected object belongs to a
particular class (one probability per category for each cell), so if no object is present in
the grid cell, the loss function does not penalize it for a wrong class prediction. So,
YOLO's prediction has a shape of (S, S, B×5 + C) = (7, 7, 2×5 + 20) = (7, 7, 30).
Why did I bold the words each cell (not each bounding box)?
Because if we assume 2 bounding boxes per cell, they both share the same class
probabilities, which means that if 2 objects of different classes have their centers in the
same cell, we can only predict one of them, of one class. A cell detects one object only,
regardless of the number of boxes B. This is one of the main limitations of YOLOv1.
YOLO
How do the 4 numbers represent the box in YOLO?
We normalize the bounding box width w and height h by the image width and height.
x and y are offsets of the box center relative to the corresponding cell. Hence, x, y, w and
h are all between 0 and 1.
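A small Python sketch of this encoding (the grid size S=7 and the sample box values are illustrative; this follows the YOLOv1-style convention described above):

# Encode a box (center cx, cy and size w, h in pixels) into YOLO targets.
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    # The responsible cell is the one containing the box center.
    col, row = int(cx / img_w * S), int(cy / img_h * S)
    # x, y: offset of the center inside that cell, in [0, 1].
    x = cx / img_w * S - col
    y = cy / img_h * S - row
    # w, h: normalized by the full image size, in [0, 1].
    return row, col, (x, y, w / img_w, h / img_h)

print(encode_box(cx=350, cy=250, w=100, h=200, img_w=448, img_h=448))
# -> responsible cell (row 3, col 5), box ≈ (0.47, 0.91, 0.22, 0.45)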
YOLO
Architecture
YOLO
Notes
1-The architecture was crafted for use in the Pascal VOC dataset, where the authors used
S=7, B=2 and C=20. This explains why the final feature maps are 7x7, and also explains
the size of the output (7x7x(2*5+20)). Use of this network with a different grid size or
different number of classes might require tuning of the layer dimensions.
2- The authors mention that there is a fast version of YOLO with only 9 convolutional
layers, called Tiny-YOLO. The table above, however, displays the full version.
3- The sequences of 1x1 reduction layers and 3x3 convolutional layers were inspired by
the GoogLeNet (Inception) model.
4- The final layer uses a linear activation function. All other layers use a leaky ReLU.
5- Since each cell produces B bounding boxes of the same class, at inference we choose
the one with the highest confidence and discard the others.
YOLO
Loss function
The loss function in YOLO is based on the sum of squared errors and is composed of 3
parts:
● The classification loss.
● The localization loss (errors between the predicted bounding box and the ground
truth).
● The confidence loss (the objectness of the box). ⇒ This has 2 parts.

For the classification loss:

If an object is detected, the classification loss at each cell is the squared error of the
conditional class probabilities for each class.
YOLO
For the localization loss: the localization loss measures the errors in the predicted
bounding box locations and sizes. We only count the box responsible for detecting the
object (since each cell predicts 2 bounding boxes, only the box with the highest confidence
score, i.e. the highest IoU with the ground truth, is taken into consideration in the
equation).

Why take the square root of w and h?

YOLO
We do not want to weight absolute errors in large boxes and small boxes equally, i.e. a
2-pixel error in a large box should not count the same as in a small box. To address this,
YOLO predicts the square root of the bounding box width and height instead of the width
and height themselves.
If the previous paragraph is unclear, let us take an example:
Assume 2 ground-truth bounding boxes, one of size 4×4 and the other of size 100×100,
and assume the predicted bounding box is 2×2 in the first case and 98×98 in the second.
The error in the first case is 2 pixels and the error in the second case is also 2 pixels;
however, in the first case the 2 pixels represent a 50% error (as the ground-truth size is 4),
while in the second case they represent only 2% (as the ground-truth size is 100). Here
comes the role of the square root: it relates the error to the original size.
Error in the large box = (sqrt(100) - sqrt(98))² ≈ 0.0101
Error in the small box = (sqrt(4) - sqrt(2))² ≈ 0.343
This now makes sense, as the small box has more error relative to the large box.
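A quick check of this arithmetic in Python:

import math

large = (math.sqrt(100) - math.sqrt(98)) ** 2   # ≈ 0.0101
small = (math.sqrt(4) - math.sqrt(2)) ** 2      # ≈ 0.3431
print(large, small)  # the same 2-pixel error is penalized far more on the small box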
YOLO
For confidence loss
If an object is detected in the box, the confidence loss (measuring the objectness of the
box) is:

If an object is not detected in the box, the confidence loss is:


YOLO
Note:
Most boxes do not contain any objects. This causes a class imbalance problem, i.e. we
train the model to detect background more often than objects. To remedy this, we weight
this loss down by a factor λ_noobj (default: 0.5).
The Final Loss
YOLO
Revisiting non-max suppression
YOLO can make duplicate detections for the same object. To fix this, YOLO applies non-
maximal suppression to remove duplicates with lower confidence. It increases the mAP
by 2%.
Here is one possible non-maximal suppression implementation:
● Sort the predictions by confidence score.
● Starting from the top score, ignore the current prediction if any previously kept
prediction has the same class and IoU > 0.5 with it.
● Repeat the previous step until all predictions are checked.
YOLO
What do YOLOv2 and YOLOv3 add?
Batch normalization:
Add batch normalization to all convolution layers. This removes the need for dropout
and pushes mAP up by 2%.
High-resolution classifier:
After training on 224×224 images, YOLOv2 also fine-tunes the classification network on
448×448 ImageNet images for 10 epochs. This gives the network time to adjust its filters
to work better on higher-resolution input. We then fine-tune the resulting network on
detection. This makes the detector training easier and moves mAP up by 4%.
More complex networks:
YOLOv3 uses a more complex CNN architecture with 53 convolutional layers instead of
the 19 in the previous version.
YOLO
Anchor boxes
A much more impactful addition to the YOLO algorithm, as proposed by YOLOv2, was the
addition of anchor boxes. YOLO, as we know, predicts a single object per grid cell. While
this makes the built model simpler, it creates issues when a single cell has more than
one object, as YOLO can only assign a single class to the cell.
YOLOv2 gets rid of this limitation by allowing the prediction of multiple bounding boxes
from a single cell where each bounding box can have its own class. This is achieved by
making the network predict 5 bounding boxes for each cell.

The final output now is S x S x no. of anchor boxes x (5 + no. of classes C)


Notes:
● For the later versions of YOLO, the grid became 19x19 rather than the 7x7 of YOLOv1.
● There are more improvements, but the most important ones are stressed here.
MEAN AVERAGE PRECISION (mAP)
CHECK THIS PART FROM MY PDF THAT I UPLOADED BEFORE. IF YOU HAVE ANY QUESTION
CONTACT ME.
MULTI-OBJECT TRACKING
One of the important computer vision tasks that gained much attention in the last
decade is object tracking, which requires object detection as a preceding phase.
Object tracking is the algorithm that tracks the displacement of one or several objects
across a sequence of frames. This is mainly done by locating multiple objects in each
frame, maintaining their identities, and yielding their individual trajectories over the
sequence of frames.
MULTI-OBJECT TRACKING
MULTI-OBJECT TRACKING
MULTI-OBJECT TRACKING
MULTI-OBJECT TRACKING
Multi-object tracking can be categorized based on different criteria as follows:

● Initialization method
○ Detection-based tracking
○ Detection-free tracking
● Processing mode
○ Online tracking
○ Offline (Batch) tracking
● Output type
○ Deterministic ones
○ Probabilistic ones
KALMAN FILTER
Before demystifying the ideas of SORT and its improvement, Deep SORT, we first need to
understand some mathematical concepts to build our knowledge on.

One of the biggest challenges of tracking and control systems is providing accurate and
precise estimation of the hidden variables in the presence of uncertainty. In GPS
receivers, the measurement uncertainty depends on many external factors such as
thermal noise, atmospheric effects, slight changes in satellite positions, receiver clock
precision and many more.

The Kalman Filter is one of the most important and common estimation algorithms. The
Kalman Filter produces estimates of hidden variables based on inaccurate and uncertain
measurements. Also, the Kalman Filter provides a prediction of the future system state
based on past estimations.
KALMAN FILTER
The tracking radar sends a pencil beam in the direction of the target. Assume a track
cycle of 5 seconds. In other words, every 5 seconds, the radar revisits the target by
sending a dedicated track beam in the direction of the target.

After sending the beam, the radar estimates the current target position and velocity.
Also, the radar estimates (or predicts) the target position at the next track beam.
KALMAN FILTER
What is the first methodology that comes to mind to get the future target position?
Think of secondary school math.

Exactly: Newton's equations of motion:

KALMAN FILTER
Newton’s equation can be generalized in 3-D to form the same equation in x, y and z:
KALMAN FILTER
We call the position (x, y, z), velocity (vx, vy, vz) and acceleration (ax, ay, az) the system
state. The current state is the input to the prediction algorithm, and the next state (the
target parameters at the next time interval) is the output of the algorithm.

The equations mentioned in the previous slide are called the dynamic model or state
space model.

Hence we might say that predicting the next state is easily done by this set of equations:
just know the current state and you can predict the future state, right?

Wrong! The problem is not only predicting the future state, but predicting it accurately.
So why do you think those equations are not sufficient to accurately predict the next
target state in the real world? Think of the answer before proceeding.
KALMAN FILTER
This set of equations predicts the next state theoretically, which is rarely the case in the
real world, where we have two main types of noise:

● Measurement noise: the radar measurement, for example, is not absolute. It
includes a random error (or uncertainty). The error magnitude depends on many
parameters, such as the radar calibration, the beam width, the magnitude of the
return echo, etc.
● Process noise: the target motion is not strictly aligned with the motion equations
due to external factors such as wind, air turbulence, pilot maneuvers, etc.

Due to measurement noise and process noise, the estimated target position can be far
from the real target position. To improve the radar tracking performance, we need a
prediction algorithm that takes both process uncertainty and measurement uncertainty
into account, and this is where the Kalman filter comes into action.
BACKGROUND BREAK
Before proceeding, we need to recall the meaning of some terminology:

Mean and expected value are closely related terms; however, there is a difference. Can
you tell what the difference between them is?

The difference is in the state of the variable: if the variable is not hidden, and we use its
exact values over the entire population, we call it the mean. If the variable is hidden, we
call it the expected value.

For example, if we have 5 coins with 5 values and we want their average, we call it the
mean, as the values are known and taken from the whole population. However, if we have
5 different measurements of the weight of the same person (which differ due to random
measurement error), we compute the average as the expected value, as we do not know
the true weight.
BACKGROUND BREAK
Most of us know the definition of variance: it is the measure of the spread (dispersion)
of data around its mean. Two sets of points can have the same mean but different
variances, due to different dispersion of the data around that mean.
BACKGROUND BREAK
An Estimate is about evaluating the hidden state of the system. The aircraft true
position is hidden from the observer. We can estimate the aircraft position using
sensors, such as radar. The estimate can be significantly improved by using multiple
sensors and applying advanced estimation and tracking algorithms (such as the Kalman
Filter). Every measured or computed parameter is an estimate.

Estimates can be characterized by accuracy and precision, but first, what is the difference
between accuracy and precision?

Accuracy indicates how close the measurement is to the true value.

Precision describes how much variability there is in a number of measurements of the
same parameter. Accuracy and precision form the basis of the estimate.
BACKGROUND BREAK
BACKGROUND BREAK
Low-accuracy systems are called biased systems, since their measurements have a
built-in systematic error (bias).

The influence of the variance can be significantly reduced by averaging or smoothing
measurements. For example, if we measure temperature using a thermometer with a
random measurement error, we can make multiple measurements and average them.
Since the error is random, some of the measurements would be above the true value
and others below the true value. The estimate would be close to a true value. The more
measurements we make, the closer the estimate would be.

On the other hand, if the thermometer is biased, the estimate will include a constant
systematic error.

All examples in this presentation assume unbiased systems.


KALMAN FILTER
Let us now start with our very first simple example:
We are going to estimate the state of a static system. A static system is a system that
doesn't change its state over a reasonable period of time. For instance, the static system
could be a tower, and the state would be its height.
In this example, we are going to estimate the weight of a gold bar. We have unbiased
scales, i.e. the scales' measurements don't have a systematic error, but the
measurements do include random noise.
Here, the system is the gold bar and the system's state is the weight of the gold bar. The
system is a static system, since we assume that the weight doesn't change over short
periods of time.
KALMAN FILTER
KALMAN FILTER
At time N, the estimate x(N,N) would be the average of all measurements up to time N:
KALMAN FILTER
The dynamic model in this example is constant, therefore x(n+1,n) = x(n,n).
To compute x(N,N) directly, we would need to memorize all historical measurements.
Instead, we want to use the previous estimate and just add a small adjustment (in a
real-life application, we want to save computer memory). We can do that with a small
mathematical trick, shown in the next slide:
KALMAN FILTER
KALMAN FILTER

The above equation is one of the five Kalman filter equations.


The factor mentioned was 1/N in our previous example, however, we will make a specific
equation to get its value later. This factor is called Kalman Gain which is represented as
αn. Hence the equation can be stated as follows:
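The resulting equation appears on the slide as an image; as an illustration only, here is a minimal Python sketch of the state update x(n,n) = x(n,n-1) + αn (zn - x(n,n-1)) with αn = 1/n, applied to made-up scale readings (not the slide's numbers):

# Running estimate of a static state (the gold bar weight) using the
# state update equation with gain 1/n, which reduces to a plain average.
def running_estimate(measurements, initial_guess):
    estimate = initial_guess
    for n, z in enumerate(measurements, start=1):
        gain = 1.0 / n                                  # alpha_n for this example
        estimate = estimate + gain * (z - estimate)     # state update equation
    return estimate

readings = [996, 994, 1021, 1000, 1002, 1010, 983, 971, 993, 1023]  # made-up noisy scale readings
print(running_estimate(readings, initial_guess=1000))  # ≈ 999.3, the plain average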
KALMAN FILTER
What happens as we take more measurements (N increases)?
In the beginning, we don't have enough information about the current state, so we base
the estimate on the measurements. As we continue, each successive measurement has
less weight in the estimation process, since 1/N decreases. At some point, the contribution
of new measurements becomes negligible.

How can we make our first guess?

From prior knowledge of the system state: for our example, it could be the weight
written on the gold bar (the system state initial guess).
KALMAN FILTER
Estimation Algorithm:

Numerical example:
https://www.kalmanfilter.net/alphabeta.html#:~:text=and%20estimation%20process.-,THE%20NUMERICAL%20EXAMPLE,-ITERATION%20ZERO
KALMAN FILTER

We can see that our estimation algorithm has a smoothing effect on the measurements,
and it converges towards the true value.
KALMAN FILTER
The problem with the first example is that it is a bit too simple, as the state is static and
does not change over time. Let us now take an example where the state changes over
time: we are going to track a constant-velocity aircraft in one dimension.
KALMAN FILTER
We all know that velocity is the distance covered per unit time. Hence we can say that:

Therefore:
KALMAN FILTER
The previous system of equations is called the state extrapolation equation, which is the
second of the Kalman filter equations.
Similarly to what we did before with the state update equation, we can deduce that:

The only changes are β and Δt, where β plays the same role for velocity that α plays for
position in the static model, while Δt is required for the units to be consistent.
KALMAN FILTER
Hence the equations will be:
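The equations themselves appear on the slide as an image; the sketch below is an illustrative Python version of one α-β filter cycle (the α, β values, initial guesses and measurements are made up, not the slide's numbers):

# One alpha-beta filter cycle for the constant-velocity aircraft example.
def alpha_beta_step(x_pred, v_pred, z, dt, alpha=0.2, beta=0.1):
    # State update: correct the prediction with the new measurement z.
    residual = z - x_pred
    x_est = x_pred + alpha * residual
    v_est = v_pred + beta * residual / dt
    # State extrapolation: predict position and velocity for the next cycle.
    x_next = x_est + v_est * dt
    v_next = v_est
    return x_est, v_est, x_next, v_next

x_pred, v_pred, dt = 30200.0, 40.0, 5.0          # initial guesses, 5 s track cycle
for z in [30171.0, 30353.0, 30756.0, 30799.0]:   # illustrative range measurements
    x_est, v_est, x_pred, v_pred = alpha_beta_step(x_pred, v_pred, z, dt)
    print(round(x_est, 1), round(v_est, 2))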
KALMAN FILTER
KALMAN FILTER
Can you guess the effect of high or low α and β, and when to use each?

The values of α and β depend on the measurement precision. If we use very precise
equipment, like a laser radar, we prefer high α and β, which follow the measurements
closely. In this case, the filter responds quickly to a velocity change of the target. On the
other hand, if the measurement precision is low, we prefer low α and β. In this case, the
filter smooths out the uncertainty (errors) in the measurements; however, its reaction to
target velocity changes will be much slower.
KALMAN FILTER
Kalman Gain Equation:

By rewriting the state update equation:


KALMAN FILTER
What can we deduce from the previous state update equation?
When the measurement uncertainty is very large and the estimate uncertainty is very
small, the Kalman Gain is close to zero. Hence we give a big weight to the estimate and a
small weight to the measurement.
On the other hand, when the measurement uncertainty is very small and the estimate
uncertainty is very large, the Kalman Gain is close to one. Hence we give a small weight
to the estimate and a big weight to the measurement.
If the measurement uncertainty is equal to the estimate uncertainty, then the Kalman
gain equals 0.5.
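A tiny Python sketch of this behaviour, using the standard gain form K = p / (p + r), i.e. the estimate uncertainty divided by the sum of the two uncertainties:

def kalman_gain(estimate_uncertainty, measurement_uncertainty):
    return estimate_uncertainty / (estimate_uncertainty + measurement_uncertainty)

print(kalman_gain(0.01, 100.0))  # ≈ 0  -> trust the estimate
print(kalman_gain(100.0, 0.01))  # ≈ 1  -> trust the measurement
print(kalman_gain(1.0, 1.0))     # 0.5  -> equal weight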
HUNGARIAN ALGORITHM
The Hungarian algorithm (the Munkres assignment algorithm) is mainly used to find the
minimum cost in assignment problems, for example assigning people to different
activities. Assume the cost matrix is n×n.
Steps:
Step 1: Subtract row minima
For each row, find the lowest element and subtract it from each element in that row.
Step 2: Subtract column minima
Similarly, for each column, find the lowest element and subtract it from each element in
that column.
Step 3: Cover all zeros with a minimum number of lines
Cover all zeros in the resulting matrix using a minimum number of horizontal and
vertical lines. If n lines are required, an optimal assignment exists among the zeros. The
algorithm stops. If less than n lines are required, continue with Step 4.
HUNGARIAN ALGORITHM
Step 4: Create additional zeros
Find the smallest element (call it k) that is not covered by a line in Step 3. Subtract k
from all uncovered elements, and add k to all elements that are covered twice.
Let us now take an example to understand how these steps work:
We consider an example where four jobs (J1, J2, J3, and J4) need to be executed by four
workers (W1, W2, W3, and W4), one job per worker. The matrix below shows the cost of
assigning a certain worker to a certain job. The objective is to minimize the total cost of
the assignment.
HUNGARIAN ALGORITHM
Step1: Subtract row minima

Step2: Subtract column minima


HUNGARIAN ALGORITHM
Step3: Cover all 0’s with the minimum number of lines (horizontal or vertical).

Since the size of the matrix is 4 while only 3 lines are needed (fewer than the matrix size),
we continue to step 4.
HUNGARIAN ALGORITHM
Step4: Create additional zeros

Now repeat steps 3 and 4 iteratively until the number of lines covering the zeros equals
the size of the matrix, at which point the algorithm stops.
HUNGARIAN ALGORITHM
The following zeros represent the optimal workers for these jobs. Note that J2 and J4 have
only 1 zero in their columns, so those workers are assigned directly, but jobs J1 and J3
each have 2 candidates, so we eliminate the workers already taken by other jobs and
keep the remaining ones.
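In practice the same kind of assignment can be solved with SciPy's implementation of this algorithm; the cost matrix below is illustrative, not the one from the slides:

import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[82, 83, 69, 92],
                 [77, 37, 49, 92],
                 [11, 69,  5, 86],
                 [ 8,  9, 98, 23]])

rows, cols = linear_sum_assignment(cost)       # minimizes the total cost
print(list(zip(rows, cols)), cost[rows, cols].sum())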
INTERSECTION OVER UNION (IOU)
IoU (intersection over union) is a term used to describe the extent of overlap between two
boxes. The greater the region of overlap, the greater the IoU.

What is IoU mainly used for?

IoU is mainly used in applications related to object detection, where we train a model to
output a box that fits tightly around an object. For example, in the image on the next
slide, we have a green box and a blue box. The green box represents the correct box,
and the blue box represents the prediction from our model. The aim of the model is to
keep improving its prediction until the blue box and the green box perfectly overlap,
i.e. the IoU between the two boxes becomes equal to 1.
INTERSECTION OVER UNION (IOU)
INTERSECTION OVER UNION (IOU)
INTERSECTION OVER UNION (IOU)
SIMPLE ONLINE REAL-TIME TRACKING (SORT)
SORT is a detection-based, online, probabilistic multi-object tracker. Note that it relies
heavily on the detection quality of the previous stage: changing the detector can change
performance by up to 18.9%.
Have you noticed how we can combine the 3 ideas we have mentioned to perform
tracking?
An estimation model, the Kalman filter we have discussed, is used to propagate the
target's identity to the next frame. A simple version of the Kalman filter is used,
depending only on a linear velocity model and not taking camera motion or any other
objects into consideration. The Kalman state is a combination of 8 variables
[u, v, s, h, u*, v*, s*, h*], where u and v are the center of the target, s is the scale (aspect
ratio) and h is the height of the box, while the starred values are the respective velocities.
Data association is then done with the Hungarian algorithm, where the cost is the IoU
between the detected boxes and the boxes predicted by the Kalman filter (see the sketch
below).
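A minimal Python sketch of this association step (boxes as (x1, y1, x2, y2); iou() as in the earlier IoU sketch; IoU_min = 0.3 is an assumed value, not taken from the slides):

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(predicted_boxes, detected_boxes, iou_min=0.3):
    # Build the cost matrix: maximizing IoU == minimizing negative IoU.
    cost = np.zeros((len(predicted_boxes), len(detected_boxes)))
    for i, p in enumerate(predicted_boxes):
        for j, d in enumerate(detected_boxes):
            cost[i, j] = -iou(p, d)
    track_idx, det_idx = linear_sum_assignment(cost)

    matches, unmatched_tracks, unmatched_dets = [], [], []
    for t, d in zip(track_idx, det_idx):
        if -cost[t, d] >= iou_min:           # reject low-overlap "best" pairs
            matches.append((t, d))
        else:
            unmatched_tracks.append(t)
            unmatched_dets.append(d)
    # Anything left out of the assignment (non-square case) is also unmatched.
    unmatched_tracks += [i for i in range(len(predicted_boxes)) if i not in track_idx]
    unmatched_dets += [j for j in range(len(detected_boxes)) if j not in det_idx]
    return matches, unmatched_tracks, unmatched_dets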
SIMPLE ONLINE REAL-TIME TRACKING (SORT)
What if a target has no associated detection?
Its state is simply predicted without correction.
What if the associated boxes (detected and predicted) are actually far apart but are still
the best match in the Hungarian cost matrix?
An IoU_min threshold is used to make sure that, even with the best-cost assignment, we
only correlate boxes that truly belong together, not merely the best available pairs.
How do creation and deletion take place?
When objects enter and leave the image, unique identities need to be created or
destroyed accordingly. For creating trackers, we consider any detection with an overlap
less than IoU_min to signify the existence of an untracked object. The tracker is
initialised using the geometry of the bounding box, with the velocity set to zero.
SIMPLE ONLINE REAL-TIME TRACKING (SORT)
Tracks are terminated if they are not detected for n frames. This prevents an unbounded
growth in the number of trackers and localisation errors caused by predictions over long
durations without corrections from the detector.
PROBLEMS OF SORT ALGORITHM
Can you guess from the table shown what the problem of SORT would be?
PROBLEMS OF SORT ALGORITHM
The main problem with tracking algorithms in general is identity switching between
different objects, in addition to fragmenting an object into multiple tracks before and
after occlusions.
While achieving overall good performance in terms of tracking precision and accuracy,
SORT returns a relatively high number of identity switches. This is because the employed
association metric is only accurate when the state estimation uncertainty is low.
Therefore, SORT has a deficiency in tracking through occlusions, which typically appear
in frontal-view camera scenes.
PROBLEMS OF SORT ALGORITHM
DEEP ASSOCIATIVE METRIC (DEEP SORT)
Deep SORT was implemented as an enhancement over the simple SORT algorithm to
introduce deep learning into the methodology. How would you expect deep learning to be
used in such a case?
Instead of depending on motion information only, we also use appearance information,
overcoming this issue by replacing the association metric with a more informed metric
that combines motion and appearance information.

How can we get the appearance information? Body embeddings. Before proceeding, let
us take a look at the network architecture: they employ a wide residual network with two
convolutional layers followed by six residual blocks. The global feature map of
dimensionality 128 is computed in dense layer 10. A final batch normalization and l2
normalization project the features onto the unit hypersphere, to be compatible with the
cosine appearance metric.
DEEP ASSOCIATIVE METRIC (DEEP SORT)
IDENTIFICATION: MAIN PROTOCOLS

Closed-set: training and testing are on the same classes. No unknowns.

Open-set: training and testing are on different classes. Unknowns exist.
OBJECTIVE
Minimize Intra-class distance

Maximize Inter-class distance


SOFTMAX LOSS
Softmax loss can be defined as follows (written out after the symbol list below):

Where
● N → number of samples
● x_i → input features for sample i
● n → number of classes
● W_{y_i} → weights of the target class y_i for sample i
● W_j → weights of class j
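Written out in this notation (a reconstruction of the standard cross-entropy softmax form; the original slide shows it as an image):

L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_i + b_j}}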
DRAWBACKS OF SOFTMAX LOSS
● The softmax loss function does not explicitly optimise the feature embedding to
enforce higher similarity for intra-class samples and diversity for inter-class
samples, which results in a performance gap for deep recognition under large
intra-class appearance variations (e.g. pose variations and age gaps) and
large-scale test scenarios [1].
● The learned features are separable for the closed-set classification problem but
not discriminative enough for the open-set recognition task [1].

[1] ArcFace: Additive Angular Margin Loss for Deep Face Recognition.
WHY SOFTMAX FAILS?

Intrinsic angular distribution ⇒ angular losses: converting the recognition tasks from the
Euclidean space to the angular and cosine spaces for further improvements in minimizing
intra-class variations and maximizing inter-class distances.
REMEMBER: DOT PRODUCT OF VECTORS

The dot product between 2 vectors is based on the projection of one vector onto the other.

Remember: a unit vector is the vector divided by its magnitude.


REMEMBER: EUCLIDEAN vs COSINE SIMILARITY
Euclidean: similar to using a ruler to measure distances between data points from a
bird's-eye view.
Cosine: similar to using a goniometer to measure differences in rotation between data
points.

Cosine similarity is generally used as a metric for measuring distance when the magnitude
of the vectors does not matter.
LARGE-MARGIN SOFTMAX LOSS
The key intuition is that the separability between a sample and the class parameters can
be factorized into an amplitude component and an angular component, so we can rewrite
the softmax loss as follows:
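Reconstructed in LaTeX (the slide shows the formula as an image; the bias is taken as zero, as in the L-Softmax formulation), the logit factorizes as

W_j^{T} x_i = \|W_j\|\,\|x_i\|\cos(\theta_j)

so the per-sample loss becomes

L_i = -\log \frac{e^{\|W_{y_i}\|\,\|x_i\|\cos(\theta_{y_i})}}{\sum_{j} e^{\|W_j\|\,\|x_i\|\cos(\theta_j)}}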
LARGE-MARGIN SOFTMAX LOSS
In order to classify x into class 1, we need ||W1|| ||x|| cos(θ1) > ||W2|| ||x|| cos(θ2).

However, we want to make the classification more rigorous in order to produce a decision
margin, so we instead require ||W1|| ||x|| cos(mθ1) > ||W2|| ||x|| cos(θ2), with m > 1.

This new classification criterion is a stronger requirement for correctly classifying x,
producing a more rigorous decision boundary for class 1.
LARGE-MARGIN SOFTMAX LOSS
● The L-Softmax loss utilizes a simple modification over the original softmax loss,
achieving a classification angle margin between classes.

● By assigning different values for m, we define a flexible learning task with
adjustable difficulty for CNNs.

● With m = 1, the L-Softmax loss becomes identical to the original softmax loss.
SphereFace

SphereFace's (aka A-Softmax) contribution is mainly in normalizing the weights and
incorporating this with the large-margin softmax loss (L-Softmax) to produce a new
decision boundary as follows (assuming a binary case):
SphereFace

By normalizing the weights, the optimization (and thus the prediction) depends only on
the angle between the weights and the feature vector, rather than on the full dot product.

By adding the idea of the multiplicative angular margin, the constraint becomes as
follows:

Assuming class 1 is the target class ⇒ cos(mθ1) > cos(θ2) ⇒ θ1 < θ2/m

As seen from the constraint, a margin is added in the angle space.

SphereFace
To visualize the effect of converting to an angular distribution with margins, let’s
observe the following figure:

Note: the modified softmax loss refers to the softmax loss in its angular form, with the
weights normalized to 1 and biases = 0, and no multiplicative angular margin.
EFFECT OF MARGIN (m) HYPER-PARAMETER
What do you expect by increasing the value of m?
COSINE SOFTMAX CLASSIFIER
The cosine softmax classifier is the modification of the softmax classifier used for training
the CNN that produces the body embeddings. The only difference between it and the
A-Softmax we have covered is that we also normalize the features, not only the weights,
to ensure the representation has unit length. Hence the loss is defined purely in an
angular way, without any Euclidean component. The classifier loss equation is then:
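The loss equation appears on the slide as an image; as an illustration, here is a minimal PyTorch sketch of cosine-softmax logits (the scale factor kappa and the 128-d / 1261-class sizes follow the setup described in these slides, but the exact training code is an assumption):

import torch
import torch.nn.functional as F

def cosine_logits(features, weights, kappa=10.0):
    # features: (batch, 128) body embeddings; weights: (num_classes, 128)
    f = F.normalize(features, dim=1)   # unit-length representation
    w = F.normalize(weights, dim=1)    # unit-length class weights
    return kappa * f @ w.t()           # scaled cosine similarities as logits

logits = cosine_logits(torch.randn(8, 128), torch.randn(1261, 128))
loss = F.cross_entropy(logits, torch.randint(0, 1261, (8,)))
print(loss)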
TRAINING SETUP
The CNN has been trained on a large-scale person re-identification dataset [1] that
contains over 1,100,000 images of 1,261 pedestrians, making it well suited for deep
metric learning in a people tracking context.

In total, the network has 2,800,864 parameters and one forward pass of 32 bounding
boxes takes approximately 30 ms on an Nvidia GeForce GTX 1050 mobile GPU. Thus, this
network is well suited for online tracking, provided that a modern GPU is available.

[1] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, “MARS: A video benchmark for
large-scale person re-identification,” in ECCV, 2016.
THE ASSOCIATION METRIC IN DEEP SORT
A conventional way to solve the association between the predicted Kalman states and
newly arrived measurements is to build an assignment problem that can be solved using
the Hungarian algorithm. Into this problem formulation we integrate motion and
appearance information through a combination of two appropriate metrics.
The first metric is motion-based: to find the best candidate among the existing tracks for
each newly detected object, they use the (squared) Mahalanobis distance between the
jth bounding box detection and the ith track, with a threshold on how far the detection is
from the mean track location, which excludes unlikely associations.
The second metric, which is the more important one, deals with the body embedding
vectors: they use the cosine distance to measure this distance, remembering that the
descriptor vector is normalized (||rj|| = 1).
THE ASSOCIATION METRIC IN DEEP SORT
To make associating people by their descriptors more robust, for each track they keep the
last n descriptors (default 100) that were successfully associated with that tracker. When
associating a newly detected box, they take the smallest cosine distance over this gallery
of n past descriptors and compare it to a threshold, obtained by trial and error, that best
separates the descriptors of different people.
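A minimal Python sketch of this gallery lookup (the descriptor dimension 128 and gallery size 100 follow the slides; the random data is only for illustration):

import numpy as np

def appearance_distance(track_gallery, detection_descriptor):
    # track_gallery: (n, 128) past descriptors; detection_descriptor: (128,)
    # All descriptors are assumed L2-normalized, so the dot product is the cosine.
    cosine_similarities = track_gallery @ detection_descriptor
    return float(np.min(1.0 - cosine_similarities))  # smallest cosine distance

gallery = np.random.randn(100, 128)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
det = np.random.randn(128)
det /= np.linalg.norm(det)
print(appearance_distance(gallery, det))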
THE ASSOCIATION METRIC IN DEEP SORT
In combination, the two metrics complement each other by serving different aspects of
the assignment problem. On the one hand, the Mahalanobis distance provides information
about possible object locations based on motion, which is particularly useful for
short-term predictions. On the other hand, the cosine distance considers appearance
information, which is particularly useful for recovering identities after long-term
occlusions, when motion is less discriminative. To build the association problem we
combine both metrics using a weighted sum:
REFERENCES
● Simple Online and Realtime Tracking (https://arxiv.org/pdf/1602.00763.pdf)
● Simple Online and Realtime Tracking with a Deep Association Metric
(https://www.uni-koblenz.de/~agas/Documents/Wojke2017SOA.pdf)
● Deep Cosine Metric Learning for Person Re-Identification
(https://elib.dlr.de/116408/1/WACV2018.pdf)
● https://www.kalmanfilter.net/default.aspx
● https://www.hungarianalgorithm.com/hungarianalgorithm.php
GitHub repos:
● https://github.com/nwojke/deep_sort
● https://github.com/abewley/sort
● https://github.com/nwojke/cosine_metric_learni
● https://github.com/mikel-brostrom/Yolov5_DeepSort_Pytorch
Angular losses lecture:
https://drive.google.com/file/d/1jh_tqilJO7NVcPhzQZOeNOXBBUtfsuhL/view?usp=sharing
