Object Detection and Tracking
DETECTION &
TRACKING
MEET THE INSTRUCTOR
Nezar Ahmed
Machine Learning Lead
Synapse Analytics
Master’s Student
Computer Communication and Engineering
Cairo University
AI Instructor
ITI / Epsilon AI / AMIT
DISCLAIMER AND ACKNOWLEDGMENT
Some of the slides are adapted from various courses, articles, and tutorials on computer vision, such as TutorialsPoint, PyImageSearch, Analytics Vidhya, Medium, and Towards Data Science.
OBJECT DETECTION
The classification task we covered so far deals with an image containing a single specific object or state: the network outputs a probability for each of the predefined classes, and the class with the highest probability is taken as the label for the image.
What if we want not only to say what class the object belongs to, but also to detect its position in the image? And what if there are multiple objects in the image, and we want to find the position of every object and then classify each one? This is where terminology such as localization and detection comes in.
Localization: refers to having a single object in the image; we want to detect its position by drawing a bounding box around the object and then classify it.
Detection: refers to having multiple objects in the same image (of the same class or of different classes); the task is to determine the position of each object by drawing a bounding box around it and to classify each one.
OBJECT DETECTION
OBJECT DETECTION
How do we solve the object detection problem?
Let us take the following example, start from the most naive idea, and keep building on it until we reach an end-to-end solution.
Assume we are building a pedestrian detection system for autonomous driving; the car captures the following image, and our target is to detect the pedestrians:
OBJECT DETECTION
Approach 1: Naive way (Divide and conquer)
The simplest approach is to divide the image into four parts, as shown, and pass each part to a classifier that tells us whether it contains a pedestrian or not.
OBJECT DETECTION
If the classifier finds that one of the 4 parts contains a pedestrian, the resulting box is that whole upper part of the image.
This is a good starting point, but we still need more precise bounding boxes around the objects.
OBJECT DETECTION
How can we enhance the previous idea?
Approach 2: Increase the number and sizes of divisions
Instead of only 4 fixed-size patches given to our trained classifier, let us pass several patches of different sizes, storing each patch's size and position so that they can act as bounding boxes, as shown:
OBJECT DETECTION
That is a lot of boxes! The good thing is that some of them are now closer to the two objects than in the previous method, so we are somewhat closer to precise and accurate bounding boxes.
Can you suggest the next step?
Approach 3: Structured divisions with boxes of different aspect ratios
For a more structured division, let us divide our image into a grid (say 10×10):
OBJECT DETECTION
Define the center of each grid cell, then for each center take several patches (say 3) of different aspect ratios, as shown:
OBJECT DETECTION
Pass every patch of every grid cell to the classifier to get predictions, save the bounding boxes of the positively classified patches, and let us see how they look on the image:
There are fewer bounding boxes than in the previous approach, and they are closer to the desired ones. We are a few steps closer to precise bounding boxes.
Can you suggest the next enhancement?
OBJECT DETECTION
Approach 4: More structured divisions with more suggested patches
Instead of a 10×10 grid make it 20×20, and instead of 3 patches take 9, covering more aspect ratios and more sizes, for example:
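As an illustration only (not the course's reference code), here is a minimal Python sketch of generating such candidate patches around grid-cell centres; the grid size, scales, and aspect ratios are arbitrary assumptions:

```python
import itertools

def grid_patches(img_w, img_h, grid=20, scales=(64, 128, 192), ratios=(0.5, 1.0, 2.0)):
    """Yield candidate boxes (x1, y1, x2, y2) centred on every cell of a grid x grid division."""
    step_x, step_y = img_w / grid, img_h / grid
    for i, j in itertools.product(range(grid), range(grid)):
        cx, cy = (i + 0.5) * step_x, (j + 0.5) * step_y       # cell centre
        for s, r in itertools.product(scales, ratios):        # 3 sizes x 3 ratios = 9 patches
            w, h = s * r ** 0.5, s / r ** 0.5                  # vary aspect ratio, keep area ~ s^2
            yield (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Brute force: every patch would be cropped, resized and run through the classifier.
boxes = list(grid_patches(1280, 720))
print(len(boxes))   # 20 * 20 * 9 = 3600 classifier calls per image, which is very expensive
```

This brute-force cost is exactly what the next idea (turning FC layers into conv layers) will help avoid.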
Before getting into how to solve this problem end to end, let us cover another idea first and then come back to the solution in which this idea will be used.
FC LAYERS TO CONV LAYERS
Turning a Fully Connected (FC) layer into a convolutional layer
Assume we have a network as shown:
Now we want to implement the fully connected layer in the form of a convolutional layer. What can we do?
FC LAYERS TO CONV LAYERS
We will convolve the last layer before the required FC layer with n filters, each with the same spatial dimensions (and number of channels) as that layer, where n is the length of the FC layer. In our example the layer before the first FC layer has dimensions 5×5×16, so we convolve it with 400 filters, each of size 5×5×16. Each convolution produces a 1×1 result (a single value), and since we have 400 filters the final output has dimensions 1×1×400, as shown:
FC LAYERS TO CONV LAYERS
What about the next FC layer? Similarly, the layer before it has size 1×1×400 and the required FC size is also 400, so we convolve this layer with 400 filters, each of size 1×1×400, so that the output is again of size 1×1×400, as shown:
The same applies to the softmax layer, which has 4 outputs (4 classes): we convolve the layer before it with 4 filters, each of dimension 1×1×400, to end up with an output of dimension 1×1×4, giving our final network as shown in the next slide:
FC LAYERS TO CONV LAYERS
Final Network
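As a rough sketch (assuming PyTorch and the layer sizes used above), the FC head can be expressed purely with convolutions like this:

```python
import torch
import torch.nn as nn

# Convolutional replacement of the FC head described above:
# 5x5x16 feature map -> "FC(400)" -> "FC(400)" -> 4 class scores
fc_as_conv = nn.Sequential(
    nn.Conv2d(16, 400, kernel_size=5),   # 5x5x16 -> 1x1x400 (acts as the first FC layer)
    nn.ReLU(),
    nn.Conv2d(400, 400, kernel_size=1),  # 1x1x400 -> 1x1x400 (second FC layer)
    nn.ReLU(),
    nn.Conv2d(400, 4, kernel_size=1),    # 1x1x400 -> 1x1x4 (class scores before softmax)
)

x = torch.randn(1, 16, 5, 5)             # a dummy 5x5x16 feature map
print(fc_as_conv(x).shape)               # torch.Size([1, 4, 1, 1])
```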
OVERFEAT
Why would we need the previous idea?
Any object detection method has mainly two stages: the training stage and the testing (inference) stage. In the training stage there is no problem; the network is trained normally, except that we replace the fully connected layers with conv layers as explained earlier.
However, this alone is not an enhancement. So where is the enhancement that this idea proposes?
The enhancement is in the inference stage, where a sliding window loops over the image, crops each region of interest, and passes it to the classifier trained as described before. Converting the FC layers to conv layers lets us apply all the sliding windows at once using ordinary convolutions, instead of passing each cropped region through the classification network separately.
OVERFEAT
Let us assume that the sliding window we use at inference is 14×14 and that our test image is 16×16, and we want to apply this 14×14 sliding window with stride 2 (each time we slide the window over the test image by shifting 2 pixels). Doing so, we find there are 4 possible window positions on the test image, as shown:
OVERFEAT
Previously, each of these 4 windows would crop the test image and pass the cropped patch to the convnet to decide whether it contains an object of one of the 4 classes. This is computationally expensive, as we repeat the forward pass 4 times, once per window, and this is a simple example with only 4 possible windows: with stride 1 there would be 16 possible windows, and with a larger image and a smaller sliding window there would be far more.
What the OverFeat method proposes is to feed the 16×16 test image as it is, without any cropping, into the same pre-trained (now fully convolutional) network. The output then contains all 4 possible window outcomes internally: the final output is 2×2 instead of 1×1, so we get 4 output vectors, each representing the output of one of the windows, as shown in the next slide.
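Continuing the sketch above (same assumed layer sizes, PyTorch), feeding a larger image through the fully convolutional network yields one prediction per window position in a single forward pass:

```python
import torch
import torch.nn as nn

# A toy fully convolutional classifier matching the sizes in the example:
# 14x14x3 -> conv 5x5 (16) -> maxpool 2x2 -> conv 5x5 (400) -> 1x1 convs -> 4 scores
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 400, kernel_size=5), nn.ReLU(),
    nn.Conv2d(400, 400, kernel_size=1), nn.ReLU(),
    nn.Conv2d(400, 4, kernel_size=1),
)

print(net(torch.randn(1, 3, 14, 14)).shape)  # torch.Size([1, 4, 1, 1]) -> one window
print(net(torch.randn(1, 3, 16, 16)).shape)  # torch.Size([1, 4, 2, 2]) -> all 4 windows at once
```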
OVERFEAT
Concept
● The idea of selective search
To illustrate how this is done, we first need to talk about a basic image segmentation method.
RCNN
Image segmentation
This means that each point (pixel) in the image is classified as belonging to one class.
Felzenszwalb and Huttenlocher (2004) proposed an algorithm for segmenting an image into similar regions using a graph-based approach. It is also the initialization method for Selective Search (a popular region proposal algorithm) that we are going to discuss later.
There are 2 approaches to create a graph (like computational graphs) out of an image:
● Grid Graph: each pixel is connected only to its surrounding neighbours (8 other cells in total). The edge weight is the absolute difference between the intensity values of the two pixels.
● Nearest Neighbor Graph: each pixel is a point in the feature space (x, y, L, a, b), in which (x, y) is the pixel location and (L, a, b) are the colour values in L*a*b* space. The edge weight is the Euclidean distance between the two pixels' feature vectors ⇒ this is the variant used here.
The original method used R, G, B initially; however, the L*a*b* space makes more sense.
RCNN
Connected Components
A connected component (or just component) of an undirected graph is a subgraph in
which any two vertices are connected to each other by paths, and which is connected to
no additional vertices in the supergraph. For example, the graph shown in the
illustration has three connected components. A vertex (pixel) with no incident edges is
itself a connected component. A graph that is itself connected has exactly one
connected component, consisting of the whole graph. The connected components of the
graph are taken to be the segments in the image segmentation (so we will refer to
segments and connected components interchangeably).
RCNN
Algorithm steps of the segmentation
Note that MInt is called the minimum internal difference; it is the threshold against which the edge weight is compared.
The weight is computed using any distance measure, such as the Euclidean distance.
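For experimentation, scikit-image ships an implementation of this graph-based segmentation; a minimal sketch (the parameter values are arbitrary assumptions):

```python
from skimage import data, segmentation

img = data.astronaut()                      # any RGB image
# scale ~ preference for larger segments, sigma = pre-smoothing, min_size = minimum segment size
labels = segmentation.felzenszwalb(img, scale=100, sigma=0.8, min_size=50)
print(labels.shape, labels.max() + 1)       # per-pixel segment labels and number of segments
```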
RCNN
How does selective search work?
● Apply Felzenszwalb and Huttenlocher's graph-based image segmentation algorithm to create the initial regions.
● Apply hierarchical clustering on the segmented image:
○ First, the similarities between all neighbouring regions are calculated.
○ The two most similar regions are grouped together, and new similarities are calculated between the resulting region and its neighbours.
● The process of grouping the most similar regions (step 2) is repeated until the whole image becomes a single region; along the way, the ~2000 region proposals are produced, as shown in the figure in the next slide.
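For a quick, practical way to try selective search (a sketch, assuming opencv-contrib-python is installed; this is not the original paper's code):

```python
import cv2

img = cv2.imread("street.jpg")   # hypothetical input image path
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()             # or switchToSelectiveSearchQuality()
rects = ss.process()                         # array of (x, y, w, h) region proposals
print(len(rects), "proposals, e.g.", rects[0])
```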
RCNN
RCNN
In the hierarchical clustering, what similarities are used to merge regions?
Given two regions (ri, rj), selective search proposes four complementary similarity measures:
● Colour similarity.
● Texture: use features that work well for material recognition, such as SIFT-like or LBP descriptors.
● Size: small regions are encouraged to merge early.
● Shape: ideally, one region can fill the gap of the other.
What happens after the 2000 region proposals are generated?
Each region proposal is processed to get a bounding box around it covering the whole segment. Affine image warping is then applied to produce a fixed-size input (224×224).
Input ⇒ 2000 proposed regions, consisting of background and the object classes
Output ⇒ 2000 warped images of fixed size
RCNN
After we extract our region proposal bounding boxes, we also have to label them for training. The authors label every proposal that has an IoU of at least 0.5 with any of the ground-truth bounding boxes with the corresponding class, while all other region proposals that have an IoU of less than 0.3 are labelled as background.
What is IoU?
IoU (intersection over union) measures the percentage of overlap between two bounding boxes by dividing the intersection area of the two boxes (yellow box) by the area of their union (area of purple box + area of red box − area of yellow box).
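A small Python sketch of how IoU can be computed for two axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Coordinates of the intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```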
RCNN
Feature extraction using CNN
This is done with AlexNet: we pass each warped region through the whole network except the softmax layer, producing a feature vector of size 4096 that represents the proposal region.
Note: at inference time, after passing the proposals through the CNN and then passing the feature vectors through the classification and regression steps, many bounding boxes exist, and a lot of them are redundant, overlapping boxes that need to be removed. To accomplish that, the non-maximum suppression algorithm is used.
RCNN
Non-Max Suppression
Non-max suppression is used to keep only one box per object, since there will be many bounding boxes pointing at the same object, as shown below:
Note that the value on each box is the highest class probability among the classes of the network (Pc).
RCNN
Algorithm (a code sketch follows these steps)
1- Discard all boxes with low probabilities (Pc < threshold).
2- While there are remaining boxes (boxes with Pc > 0.6), loop over the following:
● Take the box with the highest probability (Pc) among all the remaining boxes and keep it.
● Compute the IoU between this kept box and all the remaining boxes, and discard any remaining box with IoU > threshold (say 0.5), because these boxes are most probably detecting the same object as the kept box.
● Repeat the two previous steps on the non-discarded boxes (IoU < 0.5), because these boxes most probably belong to another object in the image.
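A compact sketch of that procedure (pure Python, boxes as (x1, y1, x2, y2), reusing the iou helper from earlier):

```python
def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """Return the indices of the boxes kept after non-max suppression."""
    # 1- Discard boxes with low class probability
    candidates = [i for i, s in enumerate(scores) if s >= score_thresh]
    # Visit the remaining boxes in order of descending probability
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    while candidates:
        best = candidates.pop(0)          # highest-probability box among the remaining ones
        keep.append(best)
        # 2- Drop every remaining box that overlaps the kept box too much
        candidates = [i for i in candidates if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```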
RCNN
Now let us work through the example we saw two slides ago. We pick the highest probability among all boxes (0.9) and keep it (in white). We then compute the IoU between this kept box and the remaining 4 boxes and find that 2 of them (the 0.6 and 0.7 boxes) have a high IoU with the white box, so we discard them.
We are left with 2 non-discarded boxes (on the left), so we take the one with the highest probability (the 0.8 box, in white) and compute the IoU with the remaining box (the 0.7 box on the left). It has a high IoU, so we discard it. Eventually no boxes remain, so the final output is the 2 white boxes with class probabilities 0.9 and 0.8.
RCNN
Problems with RCNN
● It still takes a huge amount of time to train the network, as you have to classify 2000 region proposals per image and then pass them to the bounding box regressor.
● It cannot run in real time, as it takes around 40-50 seconds per test image.
● The selective search algorithm is a fixed algorithm, so no learning happens at that stage. This can lead to the generation of bad candidate region proposals.
The idea was enhanced by Fast RCNN, which uses spatial pyramid pooling (SPP-Net), so let us go through that first.
SPP-NET
SPP-Net
SPP-Net tried to remove the fixed input size constraint (224×224) so that the input can have any size. What do we mean by this? Instead of cropping each proposal, resizing it to a fixed size, and passing it through the CNN, we eliminate the resizing step, and hence there is no need to feed the 2000 proposals through the network sequentially: the convolutional part is run once on the whole image. This is achieved with a new pooling operation, called spatial pyramid pooling, applied after the last conv layer and before the fully connected layers (spatial pyramid pooling sits in between them).
Spatial pyramid pooling (aka RoI pooling)
It is a pooling operation applied to JUST the section of the feature maps of the last conv layer that corresponds to the proposal region. The rectangular section of the conv layer corresponding to a region can be calculated by projecting the region onto the conv layer, taking into account the downsampling happening in the intermediate layers.
For example, on a 13×13 map, a window of ceil(13/4) = 4 with a stride of floor(13/4) = 3 gives a 4×4 output. This operation is applied to all 256 feature maps, so we get an output of 4×4×256 ⇒ 4096 values.
SPP-NET
What if the region proposal is rectangular rather than square, so the input to the SPP layer is n×m instead of n×n (the most common case)?
We apply the ceiling and flooring of the window size and stride respectively, on the width and height separately.
SPP-Net proposes to build a feature vector of size 5376, as shown in the next slide (a code sketch follows the list below):
● Each feature map is pooled to a single value (grey), forming a 256-d vector.
● Then, each feature map is pooled into 4 values (green), forming a 4×256-d vector.
● Similarly, each feature map is pooled into 16 values (blue), forming a 16×256-d vector.
● The above 3 vectors are concatenated to form one 1-d vector.
● Finally, this 1-d vector goes into the FC layers as usual.
Do not forget that the pooling window size and stride are computed separately for each of the 3 levels.
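A rough PyTorch sketch of this three-level pooling, using adaptive pooling to obtain the per-level output sizes (4×4, 2×2, 1×1) regardless of the input feature-map size:

```python
import torch
import torch.nn as nn

def spatial_pyramid_pool(feature_map, levels=(4, 2, 1)):
    """feature_map: (N, C, H, W) with any H, W. Returns (N, C * (16 + 4 + 1))."""
    pooled = []
    for size in levels:
        # Adaptive pooling picks the window and stride so the output is exactly size x size
        p = nn.AdaptiveMaxPool2d(size)(feature_map)
        pooled.append(p.flatten(start_dim=1))
    return torch.cat(pooled, dim=1)

fmap = torch.randn(1, 256, 13, 17)            # a rectangular feature map (the n x m case)
print(spatial_pyramid_pool(fmap).shape)       # torch.Size([1, 5376]) = 256 * (16 + 4 + 1)
```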
SPP-NET
FAST R-CNN
Fast RCNN is a successor of RCNN which is much faster.
Concepts
● It uses the idea of spatial pyramid pooling to create a fixed-length feature vector from a variable-size input.
● Compared to R-CNN, which trains multiple stages for feature extraction, classification, and regression, Fast R-CNN builds one network that trains feature extraction, classification, and regression simultaneously, in the same network.
● Fast R-CNN shares computations (i.e. the convolutional layer calculations) across all proposals (i.e. RoIs) rather than doing the calculations for each proposal independently. This is done by using the new RoI pooling layer, which makes Fast R-CNN faster than R-CNN.
FAST RCNN
How did it become a single stage combining all the training together?
We no longer train separate networks for classification and regression; instead, the bounding box regression is added to the neural network training itself. So now the network has two heads: a classification head and a bounding box regression head. This multitask objective is a salient feature of Fast-RCNN, as it no longer requires training a set of networks independently for classification (SVMs) and localization (bounding box regressors). This change, along with the RoI pooling idea, reduces the overall training time and increases the accuracy compared to RCNN, thanks to the end-to-end learning of the CNN.
FAST RCNN
Problems with Fast RCNN
It still uses selective search as the proposal method to find the regions of interest, which is a slow and time-consuming process. It takes around 2 seconds per image to detect objects, which is much better than RCNN, but when we consider large real-life datasets, even Fast RCNN doesn't look so fast anymore. Let's see what Faster R-CNN does to solve this problem.
FASTER RCNN
Faster RCNN is the successor of Fast RCNN; it mainly solves the problem of the selective search algorithm, which takes about 2 seconds per image to propose regions.
Concept
● Introduced the idea of Region Proposal Network (RPN) to generate the proposal
regions.
Algorithm
● We take an image as input and pass it to the ConvNet which returns the feature
maps for that image.
● Region proposal network is applied on these feature maps. This returns the object
proposals along with their objectness score.
● An RoI pooling layer is applied on these proposals to bring down all the proposals
to the same size.
● Finally, the proposals are passed to a fully connected layer which has a softmax
layer and a linear regression layer at its top, to classify and output the bounding
boxes for objects.
FASTER RCNN
FASTER RCNN
How does the RPN work?
● At the last layer of an initial CNN, a 3×3 sliding window moves across the feature map and maps each position to a lower-dimensional feature (e.g. 256-d).
● For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes) ⇒ 9 anchor boxes at each sliding-window position.
● Each region proposal consists of:
○ an "objectness" score for that region, telling whether there is an object or not;
○ coordinates representing the bounding box of the region.
FASTER RCNN
FASTER RCNN
Anchor boxes
The 9 anchor boxes cover wide (1:2), tall (2:1), and square (1:1) aspect ratios at different scales, as shown below:
YOLO
You Only Look Once (YOLO) offers one of the best balances between accuracy and speed: it is not as accurate as RCNN and its variants, but it is much faster, which makes it a good choice for real-time object detection. YOLO does all the learning in one shot by making the network proposal-free (it does not need a region proposal step, whether an RPN or selective search). It can run at very high speed (for its time), reaching 45 FPS.
Algorithm
● Divide the image into S×S grid cells such that, for each object present in the image, one grid cell is said to be responsible for predicting it (based on its centre).
● Each grid cell predicts N bounding boxes to cover the objects in this cell, where each box is composed of one confidence score and 4 numbers representing the box (x, y, w, h).
● Non-max suppression is applied to remove highly overlapping boxes.
YOLO
Let us take this image as an input:
YOLO
YOLO then divides the image into S×S grid cells such that each cell can predict only 1 object. For each object present in the image, one grid cell is said to be responsible for predicting it: the cell into which the centre of the object falls. For example, the yellow grid cell below tries to predict the "person" object, whose centre (the blue dot) falls inside that grid cell.
YOLO
Each grid cell predicts a fixed number of bounding boxes (say 2). In this example, the yellow grid cell makes two bounding box predictions (blue boxes) to locate where the person is.
YOLO
What is the output of each cell, assuming 2 predicted boxes per cell?
Each bounding box contains 5 elements: (x, y, w, h) and a box confidence score. Formally, confidence is defined as Pr(Object) × IoU(pred, truth). If no object exists in that cell, the confidence score should be zero; if an object exists, the confidence equals the IoU (since Pr(Object) = 1). Each cell (not each bounding box) also has 20 conditional class probabilities. The conditional class probability is the probability that the detected object belongs to a particular class (one probability per category for each cell), so that if no object is present in the grid cell, the loss function does not penalize it for a wrong class prediction. So YOLO's prediction has a shape of (S, S, B×5 + C) = (7, 7, 2×5 + 20) = (7, 7, 30).
Why did I bold the words "each cell (not each bounding box)"?
Because if we assume 2 bounding boxes per cell, they both share the same class probabilities, which means that if 2 objects of different classes have their centres in the same cell, we will only be able to predict one of them, of one class. A cell detects one object only, regardless of the number of boxes B. This is one of the main limitations of YOLOv1.
YOLO
How do the 4 numbers represent the box in YOLO?
We normalize the bounding box width w and height h by the image width and height. x and y are offsets relative to the corresponding cell. Hence x, y, w, and h are all between 0 and 1.
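A sketch of decoding one predicted box from these cell-relative numbers back to absolute image coordinates, under the normalization described above (grid size S assumed to be 7):

```python
def decode_box(x, y, w, h, row, col, img_w, img_h, S=7):
    """Convert YOLO cell-relative (x, y, w, h) in [0, 1] to an absolute (x1, y1, x2, y2) box."""
    cx = (col + x) / S * img_w          # x is the offset of the box centre within cell column `col`
    cy = (row + y) / S * img_h          # y is the offset within cell row `row`
    bw, bh = w * img_w, h * img_h       # w, h are normalized by the whole image
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)

print(decode_box(0.5, 0.5, 0.2, 0.4, row=3, col=3, img_w=448, img_h=448))
# The box centre lands at (224, 224): the middle of the central cell of the 7x7 grid.
```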
YOLO
Architecture
YOLO
Notes
1- The architecture was crafted for the Pascal VOC dataset, where the authors used S=7, B=2 and C=20. This explains why the final feature maps are 7×7 and also explains the size of the output (7×7×(2×5+20)). Using this network with a different grid size or a different number of classes might require tuning of the layer dimensions.
2- The authors mention that there is a fast version of YOLO with only 9 convolutional layers, called Tiny-YOLO. The table above, however, displays the full version.
3- The sequences of 1×1 reduction layers and 3×3 convolutional layers were inspired by the GoogLeNet (Inception) model.
4- The final layer uses a linear activation function. All other layers use a leaky ReLU.
5- Since each cell produces B bounding boxes of the same class, at inference we choose the one with the highest confidence and discard the others.
YOLO
Loss function
The loss function in YOLO is based on the sum of squared errors and is composed of 3 parts:
● The classification loss.
● The localization loss (errors between the predicted bounding box and the ground truth).
● The confidence loss (the objectness of the box) ⇒ this one has 2 parts (cells with an object and cells without).
OBJECT TRACKING
Multi-object tracking methods are commonly categorized by:
● Initialization method
○ Detection-based tracking
○ Detection-free tracking
● Processing mode
○ Online tracking
○ Offline (Batch) tracking
● Output type
○ Deterministic ones
○ Probabilistic ones
KALMAN FILTER
Before demystifying the ideas of SORT and its improvement, Deep SORT, we first need to understand some mathematical concepts to build our knowledge on.
One of the biggest challenges of tracking and control systems is providing an accurate and precise estimation of the hidden variables in the presence of uncertainty. In GPS receivers, the measurement uncertainty depends on many external factors such as thermal noise, atmospheric effects, slight changes in satellite positions, receiver clock precision, and many more.
The Kalman filter is one of the most important and common estimation algorithms. It produces estimates of hidden variables based on inaccurate and uncertain measurements. It also provides a prediction of the future system state based on past estimations.
KALMAN FILTER
The tracking radar sends a pencil beam in the direction of the target. Assume a track
cycle of 5 seconds. In other words, every 5 seconds, the radar revisits the target by
sending a dedicated track beam in the direction of the target.
After sending the beam, the radar estimates the current target position and velocity.
Also, the radar estimates (or predicts) the target position at the next track beam.
KALMAN FILTER
What is the first methodology that comes to your mind for getting the future target position? Think of secondary-school math.
The equations mentioned in the previous slide are called the dynamic model or state space model.
Hence we can say that predicting the next state is easily done with this set of equations, which requires only the current state to predict the future state, right?
Wrong! The problem is not in predicting the future state only, but in predicting it accurately. So why do you think those equations are not sufficient to accurately predict the next target state in the real world? Think of the answer before proceeding.
KALMAN FILTER
This set of equations predicts the next state theoretically, which is not always (in fact, almost never) the case in the real world, where we have two main types of noise: measurement noise and process noise.
Due to measurement noise and process noise, the estimated target position can be far away from the real target position. In order to improve the radar tracking performance, we need a prediction algorithm that takes into account process uncertainty and measurement uncertainty, and this is where the Kalman filter comes into action.
BACKGROUND BREAK
Before proceeding, we need to recall the meaning of some terminology:
Mean and expected value are closely related terms; however, there is a difference. Can you tell what the difference between them is?
The difference is in the state of the variable: if the variable is not hidden and we use its exact values over the entire population, we call the average the mean. If the variable is hidden, we call it the expected value.
For example, if we have 5 coins with 5 values and we want their average, we call it the mean, as the values are known and taken from the whole population. However, if we have 5 different measurements of the weight of the same person, which differ due to random measurement error, we obtain the average as an expected value, since we do not know the true value of the weight.
BACKGROUND BREAK
Most of us know the definition of variance: it is the measure of spread (dispersion) of data around its mean. Two sets of points can have the same mean but different variances, because of how dispersed the data is around that mean.
BACKGROUND BREAK
An estimate is about evaluating the hidden state of the system. The aircraft's true position is hidden from the observer. We can estimate the aircraft position using sensors, such as radar. The estimate can be significantly improved by using multiple sensors and applying advanced estimation and tracking algorithms (such as the Kalman filter). Every measured or computed parameter is an estimate.
Estimates can be characterized by accuracy and precision, but first, what is the difference between accuracy and precision?
Accuracy indicates how close a measurement is to the true value, while precision describes how tightly repeated measurements of the same quantity cluster around each other. A precise but biased instrument, such as a thermometer with a constant offset, gives estimates that include a constant systematic error.
Numerical example:
https://www.kalmanfilter.net/alphabeta.html#:~:text=and%20estimation%20process.-,THE%20NUMERICAL%20EXAMPLE,-ITERATION%20ZERO
KALMAN FILTER
We can see that our estimation algorithm has a smoothing effect on the measurements,
and it converges towards the true value.
KALMAN FILTER
The problem with the first example is that it is a bit simplistic, as the state is static and does not change over time. Let us now take an example where the state changes over time: we are going to track a constant-velocity aircraft in one dimension.
KALMAN FILTER
We all know that velocity is the distance covered per unit time, v = Δx / Δt. Therefore, for a constant-velocity target, the next position is the current position plus the distance travelled during one cycle:
x(n+1) = x(n) + Δt · v(n)
v(n+1) = v(n)
KALMAN FILTER
The previous system of equations is called the State Extrapolation Equation, which is the second of the Kalman filter equations.
Similarly to what we did before with the state update equation, we can deduce that:
The only change is β and Δt, where β plays the same role as α in the static model but for the velocity, while Δt is necessary for the units to be consistent.
KALMAN FILTER
Hence the equations will be:
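Following the kalmanfilter.net derivation that these slides are based on, the α-β state update equations are:
x̂(n,n) = x̂(n,n−1) + α · (z(n) − x̂(n,n−1))
v̂(n,n) = v̂(n,n−1) + β · (z(n) − x̂(n,n−1)) / Δt
followed by the state extrapolation to the next track cycle:
x̂(n+1,n) = x̂(n,n) + Δt · v̂(n,n)
v̂(n+1,n) = v̂(n,n)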
KALMAN FILTER
KALMAN FILTER
Can you guess the effect of α and β when they are high or low, and when they should be high or low?
The values of α and β depend on the measurement precision. If we use very precise equipment, like a laser radar, we would prefer high α and β that follow the measurements. In this case the filter would quickly respond to a velocity change of the target. On the other hand, if measurement precision is low, we would prefer low α and β. In this case the filter will smooth out the uncertainty (errors) in the measurements, but its reaction to target velocity changes will be much slower.
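A minimal Python sketch of such an α-β tracker in one dimension (the values of α, β, Δt, the initial guesses, and the measurements are made up for illustration):

```python
def alpha_beta_track(measurements, x0, v0, dt=5.0, alpha=0.2, beta=0.1):
    """Track position and velocity from noisy range measurements with an alpha-beta filter."""
    x, v = x0, v0                    # initial guesses for position and velocity
    estimates = []
    for z in measurements:
        # Predict (state extrapolation)
        x_pred = x + dt * v
        v_pred = v
        # Update (state update using the measurement residual)
        residual = z - x_pred
        x = x_pred + alpha * residual
        v = v_pred + beta * residual / dt
        estimates.append((x, v))
    return estimates

# Noisy range readings of a receding aircraft, one per 5-second track cycle
readings = [30110, 30265, 30410, 30575, 30690, 30845]
for pos, vel in alpha_beta_track(readings, x0=30000, v0=30):
    print(round(pos, 1), round(vel, 2))
```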
KALMAN FILTER
Kalman Gain Equation:
Since the size of the matrix is 4 while the number of covering lines needed is only 3, which is less than the size of the matrix, we continue to step 4.
HUNGARIAN ALGORITHM
Step 4: Create additional zeros
Now repeat steps 3 and 4 iteratively until the number of lines covering the zeros equals the size of the matrix, at which point the algorithm stops.
HUNGARIAN ALGORITHM
The following zeros represent the optimal workers for these jobs. Note that J2 and J4 have only 1 zero in their columns, so their workers are assigned directly, while Job 1 and Job 3 each have 2 candidates, so we eliminate the workers that are already taken by other jobs and keep the remaining ones.
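In practice, this whole assignment step can be done with SciPy's implementation of the same algorithm; a sketch with a sample 4×4 cost matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = cost of assigning worker i to job j (sample values for illustration)
cost = np.array([[82, 83, 69, 92],
                 [77, 37, 49, 92],
                 [11, 69,  5, 86],
                 [ 8,  9, 98, 23]])

workers, jobs = linear_sum_assignment(cost)          # optimal one-to-one assignment
print(list(zip(workers, jobs)), cost[workers, jobs].sum())
```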
INTERSECTION OVER UNION (IOU)
IoU (Intersection over Union) is a term used to describe the extent of overlap between two boxes. The greater the region of overlap, the greater the IoU.
How can we get the appearance information? Body embeddings. Before proceeding, let us take a look at the network architecture: the authors employ a wide residual network with two convolutional layers followed by six residual blocks. The global feature map of dimensionality 128 is computed in dense layer 10. A final batch and ℓ2 normalization projects the features onto the unit hypersphere so that they are compatible with the cosine appearance metric.
DEEP ASSOCIATIVE METRIC (DEEP SORT)
IDENTIFICATION: MAIN PROTOCOLS
Closed-set: training and testing are on the same classes; no unknowns.
Open-set: training and testing are on different classes; unknowns exist.
OBJECTIVE
Minimize the intra-class distance.
The softmax loss being discussed is:
L = -(1/N) · Σ_i log( exp(W_{y_i}^T x_i) / Σ_j exp(W_j^T x_i) )
Where
● N → number of samples
● x_i → input features for sample i
● C → number of classes (j runs over 1 … C)
● W_{y_i} → weights of the target class y_i for input x_i
● W_j → weights of class j
DRAWBACKS OF SOFTMAX LOSS
● The softmax loss function does not explicitly optimise the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance gap for deep recognition under large intra-class appearance variations (e.g. pose variations and age gaps) and large-scale test scenarios [1].
● The learned features are separable for the closed-set classification problem, but not discriminative enough for the open-set recognition task [1].
[1] ArcFace: Additive Angular Margin Loss for Deep Face Recognition.
WHY SOFTMAX FAILS?
(Diagram: from the intrinsic angular distribution of the features to explicit angular losses.)
Converting the recognition task from the Euclidean space to the angular and cosine spaces brings further improvements in minimizing intra-class variations and maximizing inter-class distances.
REMEMBER: DOT PRODUCT OF VECTORS
Euclidean form: W·x ; cosine form: ‖W‖ ‖x‖ cos(θ).
Their contribution is that, by normalizing the weights, the optimization (and thus the prediction) depends only on the angle between the weights and the feature vector, rather than on the full dot product.
By adding the idea of a multiplicative angular margin m, the constraint becomes cos(m·θ_{y_i}) > cos(θ_j) for the correct class y_i against every other class j.
Note: the modified softmax loss refers to the softmax loss written in angular form, with the weights normalized to 1 and the biases set to 0, and without the multiplicative angular margin.
EFFECT OF MARGIN (m) HYPER-PARAMETER
What do you expect by increasing the value of m?
COSINE SOFTMAX CLASSIFIER
The cosine softmax classifier is the modification of the softmax classifier used to train the CNN that produces the body embeddings. The only difference from the A-Softmax we have covered is that we also normalize the features, not only the weights, to ensure the representation has unit length. Hence the loss is defined purely in angular terms, without any Euclidean component. The classifier loss equation then becomes:
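A hedged PyTorch sketch of such a cosine softmax head (the scale parameter kappa and its initialization are assumptions for illustration; see the Deep Cosine Metric Learning paper in the references for the exact formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineSoftmaxHead(nn.Module):
    """Logits = kappa * cos(angle between L2-normalized features and class weights)."""
    def __init__(self, feat_dim=128, num_classes=1261, kappa=10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.kappa = nn.Parameter(torch.tensor(kappa))      # learnable scale (assumed form)

    def forward(self, features):
        f = F.normalize(features, dim=1)                     # unit-length embeddings
        w = F.normalize(self.weight, dim=1)                  # unit-length class weights, no bias
        return self.kappa * f @ w.t()                        # scaled cosine similarities

head = CosineSoftmaxHead()
logits = head(torch.randn(8, 128))                           # 8 embeddings of dimension 128
loss = F.cross_entropy(logits, torch.randint(0, 1261, (8,)))
```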
TRAINING SETUP
The CNN has been trained on a large-scale person re-identification dataset [1] that
contains over 1,100,000 images of 1,261 pedestrians, making it well suited for deep
metric learning in a people tracking context.
In total, the network has 2,800,864 parameters and one forward pass of 32 bounding
boxes takes approximately 30 ms on an Nvidia GeForce GTX 1050 mobile GPU. Thus, this
network is well suited for online tracking, provided that a modern GPU is available.
[1] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, “MARS: A video benchmark for
large-scale person re-identification,” in ECCV, 2016.
THE ASSOCIATION METRIC IN DEEP SORT
A conventional way to solve the association between the predicted Kalman states and the newly arrived measurements is to build an assignment problem that can be solved using the Hungarian algorithm. Into this problem formulation, motion and appearance information are integrated through a combination of two appropriate metrics.
The first metric feeds the Hungarian assignment of existing tracks against newly detected objects, but this time using the (squared) Mahalanobis distance: a threshold is applied between detection box j and the i-th track to check how far the detection is from the mean track location, which excludes unlikely associations.
The second one, which is the most important, deals with the body embedding vectors: the cosine distance is used to measure the appearance distance, remembering that each descriptor vector is normalized (‖rj‖ = 1).
THE ASSOCIATION METRIC IN DEEP SORT
To make the association of people with the same descriptors more robust, for each track they keep the n most recent descriptors (default 100) that were successfully associated with this track. When associating a newly detected box, they take the smallest cosine distance over this gallery of n past descriptors and compare it to a threshold, obtained by trial and error, that best separates the descriptors of different people.
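A small sketch of this appearance distance (descriptors assumed to be L2-normalized 128-d vectors, as above):

```python
import numpy as np

def appearance_distance(track_gallery, detection_descriptor):
    """Smallest cosine distance between a detection and a track's gallery of past descriptors."""
    gallery = np.asarray(track_gallery)       # shape (n, 128), each row unit-length
    r = np.asarray(detection_descriptor)      # shape (128,), unit-length
    cosine_distances = 1.0 - gallery @ r      # 1 - cosine similarity, one value per stored descriptor
    return cosine_distances.min()

# A detection is a plausible appearance match if this distance falls below the learned threshold.
```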
THE ASSOCIATION METRIC IN DEEP SORT
In combination, both metrics complement each other by serving different aspects of the assignment problem. On the one hand, the Mahalanobis distance provides information about possible object locations based on motion, which is particularly useful for short-term predictions. On the other hand, the cosine distance considers appearance information, which is particularly useful for recovering identities after long-term occlusions, when motion is less discriminative. To build the association problem, we combine both metrics using a weighted sum:
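In the Deep SORT paper this weighted sum has the form c(i, j) = λ · d⁽¹⁾(i, j) + (1 − λ) · d⁽²⁾(i, j), where d⁽¹⁾ is the Mahalanobis (motion) distance, d⁽²⁾ is the cosine (appearance) distance, and λ is a hyper-parameter controlling their relative influence.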
REFERENCES
● SIMPLE ONLINE AND REALTIME TRACKING (https://arxiv.org/pdf/1602.00763.pdf)
● SIMPLE ONLINE AND REALTIME TRACKING WITH A DEEP ASSOCIATION METRIC
(https://www.uni-koblenz.de/~agas/Documents/Wojke2017SOA.pdf)
● Deep Cosine Metric Learning for Person Re-Identification
(https://elib.dlr.de/116408/1/WACV2018.pdf)
● https://www.kalmanfilter.net/default.aspx
● https://www.hungarianalgorithm.com/hungarianalgorithm.php
Github Repos:
● https://github.com/nwojke/deep_sort
● https://github.com/abewley/sort
● https://github.com/nwojke/cosine_metric_learni
● https://github.com/mikel-brostrom/Yolov5_DeepSort_Pytorch
Angular losses lecture:
https://drive.google.com/file/d/1jh_tqilJO7NVcPhzQZOeNOXBBUtfsuhL/view?usp=sharing