718 PDF
more complexity to the network.

1. Introduction

This project aims to solve the multi-object tracking problem assuming a perfect object detector. Traditionally, object detection is done using deep-learning approaches, while tracking is done using more traditional geometrical and mathematical analysis. This project addresses only the tracking problem and applies several deep-learning techniques to accurately track multiple objects in high-resolution surveillance videos with static cameras. The intuition behind deep learning was to see if some non-linear models could learn the mathematical and

frames that object i exists in the video [1].

This project employs a detection-based approach; that is, it assumes that a separate detector provides the hypotheses for the objects in the image itself and that this detector works at 100% accuracy.

This project also uses online tracking; that is, for the current time t, the input can only include frames and objects observed up to t and not any future observations. This requires gradually extending the object trajectory, in contrast to an offline approach, which would look at the entire video to propose the best trajectories across all frames combined.
1.2. Input/Output Definition

The system takes in a set of consecutive frames of a surveillance video along with all the objects detected in each frame, described by their bounding box coordinates.

The data is then preprocessed for the model being run, and the processed data is passed to that model (KNN, LSTM, or Siamese).

The model outputs a confidence score for how similar the objects are in consecutive frames. This score is then run through a post-processor, which assigns each object either an existing id, indicating that it continues the track of an old object, or a new id, indicating the presence of a new object in the video.

The final output is a file containing, for each frame, all the objects in the frame with their bounding boxes and the id assigned to each object by the post-processing unit, represented by unique colors.

2. Related Works

Several different approaches to Multiple Object Tracking have been employed. Most approaches can be categorized into two types: Detector Based Tracking (DBT) approaches and Detector Free Tracking (DFT) approaches [2]. In DBT approaches, a detector is employed to get object hypotheses [3][4] and then a tracker is used to correlate objects across frames into trajectories. Most DBT approaches use a deep learning approach for their detectors because the performance of a DBT tracker is highly dependent on the detector's performance. DFT trackers are initialized with starting locations of objects which they track in subsequent frames [5]. For our problem, we focus our work on DBT approaches.

Furthermore, tracking as a problem has been tackled in a sequential manner as well as in batch processing. Existing sequential methods [6][7][8], also referred to as online algorithms, handle frames step by step, where only information up to the current frame can be used to track objects. On the other hand, methods which employ both past and future frame knowledge [9][10][11] are known as offline methods. For our problem, we study only online methods of tracking.

Kalman Filter [12] and kernel tracking algorithms like a mean shift tracker [13] are examples of DBT trackers which require object detection in each frame. However, they are not precise enough to handle very dense frames with multiple objects simultaneously. A lot of domain knowledge and expertise goes into building trackers on the results of object detection to handle challenges like object occlusion, interaction and overlap, illumination changes, sensor noise, etc. Multi-Hypothesis Tracking (MHT) [14] and Joint Probabilistic Data Association Filters (JPDAF) [15] are two representative methods of tracking objects. Recent papers make use of contextual models [16][17][18] to avoid losing track of the object. However, there is often not enough training data for these models, and we often want to track objects in a variety of contexts. Another high-performing approach to avoiding such errors is the min-cost flow framework for globally optimal data association [9]. We distinguish our work from existing work by developing new deep learning trackers which do not require any prior domain knowledge or expertise of the environment.

3. Methods

The pipeline for a simple tracking system begins with a frame coming into the stream as an input. An object detector breaks the frame down into the objects we wish to track. A tracker then assigns the objects to existing trajectories from previous frames, or tags an object as newly detected and starts a fresh trajectory with it.

We assume perfect object detection, and consider three different approaches to correlating objects across frames into trajectories: k-Nearest Neighbor, an LSTM-based tracker, and a Siamese Network for Similarity Detection.

3.1. k-Nearest Neighbours

k-Nearest Neighbor takes as inputs a list of objects detected in the current frame and the last known locations of the objects being tracked. For every possible pair of current object location and previously detected object location, an Intersection over Union (IoU) score is calculated.

3.1.1 Intersection over Union

IoU, also known as the Jaccard index, is an evaluation metric which calculates the common area of two bounding boxes and divides it by the area of the union of the two.

The current object is assigned to the trajectory with which it has the highest IoU score for the last known location of the object of that trajectory, unless the IoU score with all trajectories is below a certain threshold. In that case, the object is tagged as a newly detected object, and a fresh trajectory is started with it.
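The IoU-based assignment rule described for the k-Nearest Neighbor tracker can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the (left, top, width, height) box layout, the function names `iou` and `assign_ids`, and the threshold value of 0.3 are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Assumed box layout: (left, top, width, height)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def assign_ids(tracks: Dict[int, Box], detections: List[Box],
               threshold: float = 0.3) -> List[int]:
    """Return one track id per detection and update `tracks` in place.

    `tracks` maps a track id to the last known box of that trajectory.
    The threshold value is a hypothetical choice for illustration.
    """
    next_id = max(tracks, default=0) + 1
    # Compare only against last known locations from earlier frames.
    previous = dict(tracks)
    assigned = []
    for box in detections:
        best_id, best_score = None, 0.0
        for track_id, last_box in previous.items():
            score = iou(box, last_box)
            if score > best_score:
                best_id, best_score = track_id, score
        if best_id is None or best_score < threshold:
            # No trajectory overlaps enough: start a fresh one.
            best_id = next_id
            next_id += 1
        tracks[best_id] = box
        assigned.append(best_id)
    return assigned
```

Note that this sketch lets two detections in the same frame pick the same trajectory; the text does not say how such ties are resolved, so no uniqueness constraint is enforced here.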
people across frames.

3.3.1 Transfer Learning

The bounding boxes around an object specify a crop of an image, which is resized and passed to a CNN. We use the VGG architecture [22] for the CNN to learn the representations of the image. However, since we do not have enough data to properly train such a deep network architecture for learning representations, we initialize the VGG network with weights trained on the ImageNet dataset [23], which are further fine-tuned by training on the MOTC dataset. This process is known as transfer learning [24]. Transfer learning helps improve performance on a new task by transferring the knowledge learnt in performing a similar task. In our case, the VGG network had learnt to extract features or representations from an image on the ImageNet dataset, on top of which a softmax classifier operated. In our problem domain, the VGG network will learn the image representations on the bounding box crops of people, and a sigmoid classifier will output whether two of these representations belong to the same person or not. The network architecture of the full siamese network is shown below.

follows:
● 7 Training Videos
● 7 Test Videos
● Total Training Frames: 3579
● Total Testing Frames: 4725

This is a high-density dataset with very high overlap. Almost all videos are very high resolution (1920x1080) and are mostly recorded at 30 frames per second.

In this dataset, each video is provided as a set of images, where each image represents a specific frame of the video sequence. Along with the images, it provides a ground truth (GT) file in the form of a CSV with the following columns:
● Frame Number
● Object Id
● Left Coordinate of Bounding Box
● Top Coordinate of Bounding Box
● Width of Bounding Box
● Height of Bounding Box
● Confidence Score (a flag in case of GT file)
● Class (Type of object)
● Visibility ratio

Only 6 videos were used for training and 6 for testing, as one of them had non-static cameras. Different models required different forms of feature extraction and preprocessing.

Image crop resizing was done in some cases, which required extracting each object from the frame and resizing it to a fixed size while retaining all three layers (color channels). This could then be run through a feature extractor like VGG. Objects were matched across frames using an IoU technique, where every object output by the model was correlated to the ground truth by finding exact/closest matches between the bounding boxes and ensuring that the object assigned by the model over consecutive frames was the same.
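To make the ground-truth layout described in this section concrete, the sketch below reads rows with the nine listed columns and groups them by frame. It assumes a comma-separated file with no header row and the columns in the listed order; the names `GTRow` and `load_ground_truth` are hypothetical, and the sample rows are invented for illustration, not taken from the dataset.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class GTRow(NamedTuple):
    frame: int
    object_id: int
    left: float
    top: float
    width: float
    height: float
    confidence: float   # acts as a flag in the GT file
    object_class: int
    visibility: float


def load_ground_truth(lines: Iterable[str]) -> Dict[int, List[GTRow]]:
    """Group ground-truth rows by frame number."""
    by_frame = defaultdict(list)
    for rec in csv.reader(lines):
        if not rec:
            continue  # skip blank lines
        row = GTRow(int(rec[0]), int(rec[1]),
                    float(rec[2]), float(rec[3]),
                    float(rec[4]), float(rec[5]),
                    float(rec[6]), int(rec[7]), float(rec[8]))
        by_frame[row.frame].append(row)
    return dict(by_frame)


# Invented sample rows, used only to exercise the parser.
sample = io.StringIO(
    "1,1,912,484,97,109,1,1,0.86\n"
    "1,2,1338,418,167,379,1,1,1.0\n"
    "2,1,914,484,97,109,1,1,0.85\n"
)
frames = load_ground_truth(sample)
```

Grouping by frame matches the online setting: a tracker can iterate over frame numbers in increasing order and only ever see the boxes observed so far.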
5. Experiments
We had 6 training files and 7 testing files available to
us. We used 5 of the 6 training files for training, and the
6th for validation purposes.
5.3 Siamese Network for similarity of objects

In training the siamese network, the objects from each image frame were extracted and reshaped to size 224 by 224. Positive training examples were created by taking the same objects from consecutive frames, and negative examples by taking different objects from consecutive frames. Since the number of negative examples was roughly 50 times higher than the number of positive examples, the negative examples were downsampled to create a 1:9 ratio of positive to negative examples for better training. The weights of the VGG model were initialized by training on the Imagenet dataset

The Multi-Object Tracking Accuracy (MOTA) is a standard metric used to account for all the object configuration errors. It combines the false positive, miss, and mismatch rates over all the frames.

For this system, since there is a perfect detector, the false positives and misses are all zero, as the detector gives exactly the same number of object hypotheses as the ground truth.

Mismatch rate is defined as follows [25]:

$$\text{Mismatch Rate} = 1 - \frac{\sum_{x=1}^{m} \max_{id \in ids(x)} \left( \sum_{i=1}^{t(x)} \mathbb{1}\left(o_i^x = id\right) \right)}{\sum_{y=1}^{m} t(y)}$$
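One plausible reading of the mismatch rate formula is sketched below. It assumes that m is the number of ground-truth objects, t(x) is the number of frames in which object x appears, and o_i^x is the id the tracker assigned to object x in its i-th frame; these interpretations, and the function name `mismatch_rate`, are assumptions rather than definitions taken from the text.

```python
from collections import Counter
from typing import Dict, List


def mismatch_rate(assigned_ids: Dict[int, List[int]]) -> float:
    """Compute 1 - (frames with the dominant id) / (all frames).

    `assigned_ids` maps each ground-truth object to the list of ids
    the tracker gave it, one entry per frame in which it appears.
    """
    total_frames = sum(len(ids) for ids in assigned_ids.values())
    if total_frames == 0:
        return 0.0
    # For each object, count the frames carrying its most frequent id.
    consistent = sum(
        Counter(ids).most_common(1)[0][1]
        for ids in assigned_ids.values()
        if ids
    )
    return 1.0 - consistent / total_frames
```

Under this reading, an object tracked with ids [7, 7, 7, 9] contributes 3 consistent frames out of 4, so a tracker that never changes an object's id scores a mismatch rate of zero.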
6.2. Result/Discussion

           ID Switches   MOTA
Testing    0.03          0.78

Plotting the outputs onto the frames showed that the classifier works very well because of the extremely high overlap between objects in consecutive frames. The performance did deteriorate when run on every 6th frame of the video (to match realistic real-time object detection speeds). Moreover, the mismatches that did occur were mostly when two individuals walking in opposite directions crossed paths. The classifier could not figure out whether they were the same or new objects, and either assigned new ids or switched their ids.

This made sense, as the KNN worked only on the bounding box coordinates of two consecutive frames and had no knowledge of the trajectory or the contents of the object in consideration.

6.2.2 LSTM

The LSTM was run with a sliding window of 10 frames at a time; however, it performed really poorly, with the results as follows:

Table 2. LSTM Results

We ran some experiments on synthetic data to see if the network was able to learn simple trajectories. We initialized the LSTM with an object moving in a straight line for 9 consecutive frames (shown as the blue line in Figure 7). Scores were checked for two possible future objects: one on the straight line (orange dot) and a random point (green dot). Intuitively, the orange dot should belong to the blue trajectory and therefore its score should be much higher. However, the network predicted very similar low scores for both points and hence would tag both of them as new objects.

The reason for the poor performance is not very clear, but training on the image crop or on VGG features of the crop might have performed better and would be worth testing [25].

6.2.3 Siamese Network

The Siamese network was run with every consecutive frame in the video sequence and had the following results:

Table 3. Siamese Network Results
ID Switches MOTA