
Abstract

This project aims at solving the multi-object tracking problem using a
deep-learning approach. It builds a detection-based online tracker that
assumes perfect object detection for each individual video frame and has no
access to any future frames, hence building the trajectory of the objects one
frame at a time. It uses a KNN classifier as a baseline against which the
deep-learning models tried for tracking are compared. Two different approaches
to tracking were taken: one involving an LSTM trained over the bounding boxes
of the objects over the consecutive frames of a video, and the other involving
a Siamese network trained on the VGG features of the crops of the objects in
every pair of consecutive frames in the video sequence. While the LSTM did not
perform well, the Siamese network came close to matching KNN and had the
potential for further improvement by adding some more complexity to the
network.

1. Introduction

This project aims to solve the multi-object tracking problem assuming a
perfect object detector. Traditionally, object detection is done using
deep-learning approaches while tracking is done using more traditional
geometrical and mathematical analysis. This project only picks the tracking
problem and runs several deep-learning techniques to accurately track multiple
objects in high resolution surveillance videos with static cameras. The
intuition behind deep learning was to see if some non-linear models could
learn the mathematical and geometrical approaches currently used without any
domain knowledge or expertise. The applications for this problem span from
simple ideas like security and surveillance to other commercial applications
like inventory management, automated checkouts, etc.

1.1. Problem Statement

Multi-object tracking is a multi-variable estimation problem where the
following is given:

● Sequence of frames $S_{1:t} = \{S_1, S_2, \ldots, S_t\}$
● Each frame $S_x = \{s_x^1, s_x^2, \ldots, s_x^m\}$
● $s_x^y$ represents object $y$ in frame $x$, described by its bounding box
  coordinates in the frame

The objective of multi-object tracking is to produce, for each object $i$,
$s^i_{i_s:i_e} = \{s^i_{i_s}, s^i_{i_s+1}, \ldots, s^i_{i_e-1}, s^i_{i_e}\}$,
that is, the sequence of frames in which object $i$ exists in the video [1].

This project employs a detection-based approach, that is, it assumes a
separate detector provides the hypotheses for the objects in the image itself
and that this detector works at 100% accuracy.

This project also uses online tracking, that is, for the current time t, the
input can only include frames and objects observed up to t and not any future
observations. This requires gradually extending the object trajectory,
contrary to an offline approach which would look at the entire video to
propose the best trajectories across all frames combined.
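The online, detection-based setting can be illustrated with a minimal Python sketch. All names here (`Track`, `track_online`, the `associate` callback) are hypothetical and introduced only for illustration; the sketch assumes boxes are given as (left, top, width, height) tuples and that some association rule decides whether a detection continues an existing trajectory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (left, top, width, height)


@dataclass
class Track:
    track_id: int
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box


# Hypothetical association hook: returns the id of the matching track, or None.
Associate = Callable[[Box, List[Track]], Optional[int]]


def track_online(frames: List[List[Box]], associate: Associate) -> List[Track]:
    """Extend trajectories one frame at a time, never looking at future frames."""
    tracks: List[Track] = []
    for t, detections in enumerate(frames):
        for box in detections:
            # Only tracks not yet extended in frame t are candidates, so a
            # track receives at most one box per frame.
            candidates = [tr for tr in tracks if t not in tr.boxes]
            match = associate(box, candidates)
            if match is None:
                # No existing trajectory fits: start a fresh one.
                tracks.append(Track(track_id=len(tracks), boxes={t: box}))
            else:
                # track_id equals the list index by construction.
                tracks[match].boxes[t] = box
    return tracks


def same_box(box: Box, candidates: List[Track]) -> Optional[int]:
    """Toy association rule: match a track whose latest box is identical."""
    for tr in candidates:
        if tr.boxes[max(tr.boxes)] == box:
            return tr.track_id
    return None


frames = [[(0, 0, 10, 10)], [(0, 0, 10, 10), (50, 50, 10, 10)]]
tracks = track_online(frames, same_box)
```

In this toy run the first object is followed across both frames and the second detection in frame 1 opens a new trajectory. Any of the models discussed in this report would take the place of the `same_box` rule.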

1.2. Input/Output Definition

The system takes in a set of consecutive frames of a surveillance video along
with all the objects detected in each frame, described by their bounding box
coordinates.

The data is then preprocessed for the model that is being run, and the
processed data is passed to one of the models (KNN, LSTM, Siamese).

The model outputs a confidence of how similar the objects are in consecutive
frames. This is then run through a post-processor, which assigns each object
either an existing id, meaning it continues the track of an old object, or a
new id, indicating the presence of a new object in the video.

The final output is a file containing, for each frame, all the objects in the
frame with their bounding boxes and the id assigned to each object by the
post-processing unit, represented by unique colors.

2. Related Works

Several different approaches to Multiple Object Tracking have been employed.
Most approaches can be categorized into two types: Detector Based Tracking
(DBT) approaches and Detector Free Tracking (DFT) approaches [2]. In DBT
approaches, a detector is employed to get object hypotheses [3][4] and then a
tracker is used to correlate objects across frames into trajectories. Most DBT
approaches use a deep learning approach for their detectors because the
performance of a DBT tracker is highly dependent on the detector's
performance. DFT trackers are initialized with starting locations of objects
which they track in subsequent frames [5]. For our problem, we focus our work
on DBT approaches.

Furthermore, tracking as a problem has been tackled in a sequential manner as
well as in batch processing. Existing sequential methods [6][7][8], also
referred to as online algorithms, handle frames step by step, where only
information up to the current frame can be used to track objects. On the other
hand, methods which employ both past and future frame knowledge [9][10][11]
are known as offline methods. For our problem, we study only online methods of
tracking.

The Kalman Filter [12] and kernel tracking algorithms like the mean shift
tracker [13] are examples of DBT trackers which require object detection in
each frame. However, they are not precise enough to handle very dense frames
with multiple objects simultaneously. A lot of domain knowledge and expertise
goes into building trackers on the results of object detection to handle
challenges like object occlusion, interaction and overlap, illumination
changes, sensor noise, etc. Multi-Hypothesis Tracking (MHT) [14] and Joint
Probabilistic Data Association Filters (JPDAF) [15] are two representative
methods of tracking objects. Recent papers make use of contextual models
[16][17][18] to avoid losing track of the object. However, there is often not
enough training data for these models, and we often want to track objects in a
variety of contexts. Another high-performing approach to avoiding such errors
is the min-cost flow framework for globally optimal data association [9]. We
distinguish our work from the existing work by developing new deep learning
trackers which do not require any prior domain knowledge or expertise of the
environment.

3. Methods

The pipeline for a simple tracking system consists of a frame coming into the
stream as an input. An object detector breaks the frame down into the objects
we wish to track. A tracker will then assign the objects to existing
trajectories from previous frames, or tag the object as a newly detected
object and start a fresh trajectory with it.

We assume perfect object detection, and consider three different approaches to
correlating objects across frames into trajectories: k-Nearest Neighbor, an
LSTM based tracker, and a Siamese Network for Similarity Detection.

3.1. k-Nearest Neighbours

k-Nearest Neighbor takes as inputs a list of objects detected in the current
frame and the last known locations of the objects being tracked. For every
possible pair of current object location and previously detected object
location, an Intersection over Union (IoU) score is calculated.

3.1.1 Intersection over Union

IoU, also known as the Jaccard index, is an evaluation metric which calculates
the common area of two bounding boxes and divides it by the area of the union
of the two.

The current object is assigned to the trajectory with which it has the highest
IoU score for the last known location of the object of that trajectory, unless
the IoU score with all trajectories is below a certain threshold. In that
case, the object is tagged as a newly detected object, and a fresh trajectory
is started with it.
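The IoU score and the threshold-based assignment described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the (left, top, width, height) box layout and the example threshold of 0.3 are assumptions. Each detection independently takes the trajectory with the highest IoU, as the text describes; no one-to-one constraint between detections and trajectories is enforced here.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (left, top, width, height)


def iou(a: Box, b: Box) -> float:
    """Area of overlap of two boxes divided by the area of their union."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union


def assign_ids(
    detections: List[Box],
    last_boxes: Dict[int, Box],  # trajectory id -> last known box
    threshold: float,            # hypothetical value; tuned on validation data
    next_id: int,                # first unused trajectory id
) -> Tuple[List[int], int]:
    """Give each detection the id of its best-IoU trajectory, or a new id."""
    ids: List[int] = []
    for box in detections:
        best_id, best_score = None, 0.0
        for track_id, last_box in last_boxes.items():
            score = iou(box, last_box)
            if score > best_score:
                best_id, best_score = track_id, score
        if best_id is None or best_score < threshold:
            # Below the threshold for every trajectory: newly detected object.
            best_id = next_id
            next_id += 1
        ids.append(best_id)
    return ids, next_id


ids, next_id = assign_ids(
    detections=[(1, 0, 10, 10), (100, 100, 5, 5)],
    last_boxes={7: (0, 0, 10, 10)},
    threshold=0.3,
    next_id=8,
)
```

Here the first detection overlaps trajectory 7 heavily and keeps its id, while the second has no overlap with any trajectory and receives the fresh id 8.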

Figure 3: Computing the Intersection over Union is as simple as dividing the
area of overlap between the bounding boxes by the area of union [19]

3.2 LSTM based Tracking

LSTMs are a standard deep learning method for learning time dependencies in
data. LSTMs improve over RNNs by introducing four new gates, which control how
much prior information is forgotten and retained. These gates make sure that
time dependencies are learnt over long periods of time as well, and are not
forgotten. The update equations of the four gates of the LSTM are shown in
Figure 4.

Figure 5: Smooth Trajectories of objects in the data

We can model our sequence of frames as a time-series through which the object
is moving. The intuition behind using an LSTM is that, with enough training
data, the model should learn what a smooth trajectory of an object moving in a
video over time looks like, and can therefore be used to determine pairwise
whether a detected object belongs to a trajectory or not. If the detected
object does not cross a predetermined threshold for any trajectory, it is
considered a newly detected object and a fresh trajectory is started for it.

The LSTM is initialized with the past known bounding box coordinates of a
particular trajectory in sequential order. The bounding box coordinates of the
current box are given to the LSTM, and the output of the LSTM is fed through a
sigmoid classifier to see if the current bounding box belongs to the specified
trajectory or not. This operation is done over all pairs of previously tracked
trajectories and detected objects in the current frame, to either assign the
object to a previous trajectory or classify it as a new object being tracked.
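As a hedged reference, a commonly used LSTM formulation can be written as follows. The symbols ($W$, $U$, $b$, $\sigma$, $\odot$) are assumptions and may differ from the notation of the cited course material. Counting the candidate memory $\tilde{c}_t$ alongside the input, forget and output gates gives the four gated quantities referred to in the text. The last line is a hypothetical form of the sigmoid classifier applied to the hidden state after the current bounding box $x_T$ has been fed in.

```latex
\begin{align*}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)} \\
p &= \sigma\left(w^{\top} h_T + b\right) && \text{(assumed match probability)}
\end{align*}
```

Here $x_t$ is the bounding box fed in at step $t$, and $p$ would be compared against the threshold that separates existing trajectories from new objects.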

Figure 4: LSTM update equations (Source: CS224N)

3.3 Siamese Network for similarity of objects

Various trackers use similarity measures such as chi-square and nearest
neighbor on the blobs produced by the object detector [21]. We propose running
the objects in the previous frames and the objects detected in the current
frame pairwise through a siamese network with a sigmoid classifier on top,
which learns a similarity measure for blobs that recognizes the same person
through multiple frames while distinguishing between different
people across frames.

3.3.1 Transfer Learning

The bounding boxes around an object specify a crop of an image which is
resized and passed to a CNN. We use the VGG architecture [22] for the CNN to
learn the representations of the image. However, since we do not have enough
data to properly train such a deep network architecture for learning
representations, we initialize the VGG network with weights trained on the
Imagenet dataset [23], which are further fine-tuned by training on the MOTC
dataset. This process is known as transfer learning [24]. Transfer learning
helps improve the performance on a new task by transferring the knowledge
learnt in performing a similar task. In our case, the VGG network had learnt
to extract features or representations from an image on the Imagenet dataset,
on top of which a softmax classifier operated. In our problem domain, the VGG
network will learn the image representations on the bounding box crops of
people, and a sigmoid classifier will output whether two of these
representations belong to the same person or not. The network architecture of
the full siamese network is shown in Figure 6.

Figure 6: Siamese network architecture

4. Data & Features

The Multi-Object Tracking Benchmark 2016 [20] dataset was used for this
project. The dataset details are as follows:

● 7 Training Videos
● 7 Test Videos
● Total Training Frames: 3579
● Total Testing Frames: 4725

This is a high density dataset with very high overlap. Almost all videos are
very high resolution (1920x1080) and are mostly recorded at 30 frames per
second.

In this dataset, each video is provided as a set of images where each image
represents a specific frame of the video sequence. Along with that, it
provides a ground truth (GT) file in the form of a csv with the following
columns:

● Frame Number
● Object Id
● Left Coordinate of Bounding Box
● Top Coordinate of Bounding Box
● Width of Bounding Box
● Height of Bounding Box
● Confidence Score (a flag in case of GT file)
● Class (Type of object)
● Visibility ratio

Only 6 videos were used for training and 6 for testing, as one of them had
non-static cameras. Different models required different forms of feature
extraction and pre-processing.

Image crop resizing was done in some cases. This required extracting each
object from the frame and resizing it to a fixed size while retaining all
three color channels. The crop could then be run through a feature extractor
like VGG. Objects were matched across frames using an IoU technique, where
every object output by the model was correlated to the ground truth by finding
exact/closest matches with the bounding boxes and ensuring the object assigned
by the model over consecutive frames was the same.

5. Experiments

We had 6 training files and 7 testing files available to us. We used 5 of the
6 training files for training, and the 6th for validation purposes.

5.1 k-Nearest Neighbor

The kNN algorithm simply remembered the data it was fed, and used it to
classify subsequently detected objects. There was no explicit “training” phase
of the algorithm. We used the validation file to tune our threshold parameter,
i.e. the threshold at which we assign an object to an existing trajectory vs.
tagging it as a new object.

5.1.1 Threshold Parameter
Threshold values were searched in the space of 0.0 to 1.0 in intervals of 0.1.
A finer search was then conducted within the interval which minimized the
switch rate on the validation file.

5.2 LSTM based Tracking

The LSTM network was passed the bounding boxes of the object in multiple
frames as a time-series. Positive examples were constructed by taking bounding
boxes of an object from successive frames up to length t, where the optimal
length t was decided by validation on the 6th file. Negative examples were
constructed similarly, except that the bounding box of the last frame belonged
to a different object than those of the first $t-1$ frames.

The Adam optimizer was used to train the model. Validation on the dev file was
conducted to find the optimal hidden size of the LSTM cell. 5 values for the
LSTM size were searched: 32, 64, 128, 256, 512, with the one with the lowest
switching rate on the dev set chosen for the final model. A threshold on the
confidence score to distinguish between new objects and existing objects was
determined in the same manner as for kNN described above.

Experiments on synthetic data to see if the network was learning simple
trajectories were also conducted. Their results are discussed in Section
6.2.2.

5.3 Siamese Network for similarity of objects

In training the siamese network, the objects from each image frame were
extracted and reshaped to size 224 by 224. Positive examples were created by
taking the same object from consecutive frames, and negative examples by
taking different objects from consecutive frames. Since the number of negative
examples was roughly 50 times higher than the number of positive examples, the
negative examples were downsampled to create a 1:9 ratio of positive to
negative examples for better training. The weights of the VGG model were
initialized by training on the Imagenet dataset and were shared across the two
parallel branches of the siamese network. The Adam optimizer was used to train
the network, with a decaying learning rate. The threshold on the confidence
score for distinguishing between a new object and an existing one was chosen
the same way as in the case of the kNN algorithm described above.

6. Results/Discussion

6.1. Error Evaluation Metrics

Three different error metrics are conventionally used to evaluate multi-object
tracking systems:

6.1.1 ID Switches

The ID switches metric counts the number of times the id for a single object
switches across all its frames of existence. This describes how many times the
system misses an object by tagging it with a new id or simply an incorrect
one.

● For all objects $O = \{O^1, O^2, \ldots, O^m\}$
● Each object $O^x = \{o^x_i, o^x_{i+1}, \ldots, o^x_j\}$, that is, all the
  frames $i$ through $j$ that the object exists for
● where $o^x_t$ is the object id of object $x$ at frame $t$

Then:

$$
\text{ID Switches} =
\frac{\sum_{x=1}^{m} \sum_{i=1}^{t(x)-1} \mathbb{1}\left(o^x_i \neq o^x_{i+1}\right)}
     {\sum_{y=1}^{m} t(y)}
\quad [25]
$$

Here, ID Switches is the weighted average of the ID switches across all
objects, where the ID switches of each object is defined as the number of
switches in the id of a given object divided by the number of frames the
object existed for in the entire video sequence. This simplifies to the above
formula, where $m$ is the total number of true objects and $t(x)$ represents
the number of frames object $x$ existed for in the given video sequence.

6.1.2 Multi-Object Tracking Accuracy (MOTA)

The Multi-Object Tracking Accuracy is a standard metric used to account for
all the object configuration errors. It combines the false positive, miss and
mismatch rates over all the frames.

For this system, since there is a perfect detector, the false positives and
misses are all zero, as the detector gives exactly the same number of object
hypotheses as the ground truth.

Mismatch rate is defined as follows:

$$
\text{Mismatch Rate} = 1 -
\frac{\sum_{x=1}^{m} \max_{id \in ids(x)} \sum_{i=1}^{t(x)} \mathbb{1}\left(o^x_i = id\right)}
     {\sum_{y=1}^{m} t(y)}
\quad [25]
$$

Here, mismatch rate is the weighted average of the mismatch rate of each
object, where the mismatch rate of each object is defined as one minus the
mismatch accuracy. Mismatch accuracy is the ratio of the largest number of
frames in which the object was given the same id to the total number of frames
the object existed for in the video sequence.

6.1.3 Robustness

Robustness is defined as the ability of the system to handle occlusion. Since
this system only compared object pairs, it was incapable of handling
occlusion.

6.2. Result/Discussion

6.2.1 K-Nearest Neighbours

This was supposed to be the base case to see how a simple classifier would
perform on a dense dataset. It did really well, with the following results.

Table 1. KNN Results

            ID Switches    MOTA
Training    0.03           0.78
Testing     0.03           0.78

Plotting the outputs onto the frames showed that the classifier works very
well because of the extremely high overlap between objects in consecutive
frames. The performance did deteriorate when it was run on every 6th frame of
the video (to match realistic real-time object detection speeds). Moreover,
the mismatches that did occur were mostly when two individuals walking in
opposite directions crossed paths. The classifier could not figure out whether
they were the same or new objects, and either assigned new ids or switched
their ids.

This made sense, as the KNN worked only on the bounding box coordinates of two
consecutive frames and had no knowledge of the trajectory or the contents of
the object in consideration.

6.2.2 LSTM

The LSTM was run with a sliding window of 10 frames at a time; however, it
performed really poorly, with the results as follows:

Table 2. LSTM Results

            ID Switches    MOTA
Training    0.51           0.27
Testing     0.57           0.22

Even though the model's training accuracy was fairly high (around 98%), it
never seemed to learn anything, as it mispredicted both the training and test
examples when run through the entire pipeline. The high training accuracy of
the model was due to class imbalance in favor of negative examples in the
training dataset. Downsampling the number of negative examples did not help
improve the final performance of the model.

Figure 7: Synthetic Data Experiment

We ran some experiments on synthetic data to see if the network was able to
learn simple trajectories. We initialized the LSTM with an object moving in a
straight line for 9 consecutive frames (shown as the blue line in Figure 7).
Scores were checked for two possible future objects: one on the straight line
(orange dot), and a random point (green dot). Intuitively, the orange dot
should belong to the blue trajectory and therefore its score should be much
higher. However, the network predicted very similar low scores for both these
points and hence would initialize both of them as new objects.

The reason for the poor performance is not very clear, but training on the
image crop or VGG features of the crop might have performed better and would
be something worth testing [25].

6.2.3 Siamese Network

The Siamese network was run on every pair of consecutive frames in the video
sequence and had the following results:

Table 3. Siamese Network Results

            ID Switches    MOTA
Training    0.11           0.81
Testing     0.15           0.76

(a) (b) (c)
Figure 8: Crop of object at Frames 1, 40 and 80 resized to 224x224x3
The accuracy was very similar to that of KNN; however, the reason for the
accuracy was very different. Unlike the KNN, the Siamese network looked at the
actual image and its VGG features instead of the bounding box coordinates.
This made it look for similarity and overlap in the image crop. Hence, it
found the images in Figure 8.b and Figure 8.c to be similar with very high
confidence, but failed to recognize Figure 8.a since the object was occluded
by a different object.

Moreover, the cost of extracting the crop of each object in a frame, resizing
it and running it through a VGG network even before running it through the
actual Siamese network makes this algorithm extremely slow and not feasible
for real-time environments.

7. Conclusion

Tracking with deep learning is still a relatively new problem, and even though
a simple KNN classifier achieved fairly good results, it has a lot of scope
for improvement. This is because even fast object detectors such as Faster
R-CNN operate at about 5 frames per second [26]. Hence, in real time on video
recorded at 30 frames per second, you can run your tracking algorithm only on
every 6th frame, instead of on every frame. And while the high pixel overlap
in consecutive frames makes kNN perform really well, its performance drops
significantly when it is run on every 6th frame. This is where deep learning
models can outperform simple geometric algorithms. Siamese networks do not
show a significant drop in accuracy when they are run on every 6th frame to
track objects, because they are based on matching similar objects.

The failure of the LSTM was quite unexpected and needs deeper analysis before
eliminating it as a viable option for tracking. Siamese networks came close to
KNN with a very simple yet computationally expensive pipeline and would have
the potential to solve this problem in a real world setting.

8. Future Work

Tracking with deep learning is a challenging problem with a lot more scope for
experimentation. We plan to try the following in the near future.

A deeper analysis needs to be performed on the LSTM to understand the reasons
for its poor performance. Additionally, we plan to increase the complexity of
the model by passing it both the bounding box coordinates and the object crop
features, to make the network learn both the image content and the position
and velocity over time.

The Siamese network performed well with just two parallel pipelines looking at
consecutive frames. We think it might do much better with 5-10 parallel
pipelines looking at a window of previous frames. This will be something to
try in the future.

Another challenge would be to predict the future bounding box instead of the
current approach of matching detected objects to previously known objects. Not
only would this help handle object occlusion, but it could potentially allow
the system to run in a detection-free manner where detection does not have to
happen every frame.

The overall robustness of the system also needs to be tested in the future by
connecting a real object detector to see how the errors from the detector
compound onto the predictions made by the tracker.

We also plan to create a better visualization pipeline to have better demos as
well as to increase our own understanding of the models' behaviour.

9. References
[1] B. Yang, C. Huang, and R. Nevatia, “Learning affinities and
dependencies for multi-target tracking using a CRF model,” in
Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit., 2011, pp. 1233–1240.
[2] B. Yang and R. Nevatia, “Online learned discriminative
part-based appearance models for multi-human tracking,” in
Proc. Eur. Conf. Comput. Vis., 2012, pp. 484–498.
[3] B. Bose, X. Wang, and E. Grimson, “Multi-class object
tracking algorithm that handles fragmentation and
grouping,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Pattern Recognit., 2007, pp. 1–8.
[4] B. Song, T.-Y. Jeng, E. Staudt, and A. K. Roy-Chowdhury,
“A stochastic graph evolution framework for robust
multi-target tracking,” in Proc. Eur. Conf. Comput. Vis.,
2010, pp. 605–619.
[5] M. Yang, T. Yu, and Y. Wu, “Game-theoretic multiple
target tracking,” in Proc. IEEE Int. Conf. Comput. Vis.,
2007, pp. 1–8.
[6] W. Hu, X. Li, W. Luo, X. Zhang, S. Maybank, and Z.
Zhang, “Single and multiple object tracking using
log-euclidean riemannian subspace and block-division
appearance model,” IEEE Trans. Pattern Anal. Mach. Intel.,
vol. 34, no. 12, pp. 2420–2440, Dec. 2012.
[7] L. Zhang and L. van der Maaten, “Structure preserving
object tracking,” in Proc. IEEE Comp
[8] J. Zhang, L. L. Presti, and S. Sclaroff, “Online multi-person
tracking by tracker hierarchy,” in Proc. IEEE Int. Conf.
Advanced Video Signal-Based Surveillance, 2012, pp.
379–385.
[9] D. Sugimura, K. M. Kitani, T. Okabe, Y. Sato, and A.
Sugimoto, “Using individuality to track individuals:
Clustering individual trajectories in crowds using local
appearance and frequency trait,” in Proc. IEEE Int. Conf.
Comput. Vis., 2009, pp. 1467–1474.
[10] C.-H. Kuo, C. Huang, and R. Nevatia, “Multi-target
tracking by on-line learned discriminative appearance
models,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis.
Pattern Recognit., 2010, pp. 685–692.
[11] J. F. Henriques, R. Caseiro, and J. Batista, “Globally
optimal solution to multi-object tracking with merged
measurements,” in Proc. IEEE Int. Conf. Comput. Vis.,
2011, pp. 2470–2477.
[12] Isard, M., Blake, A.: Condensation - Conditional Density
Propagation for Visual Tracking. International Journal of
Computer Vision (1998)
[13] Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based Object
Tracking. IEEE Trans. on Pattern Analysis and Machine
Intelligence (May 2003)
[14] Reid, D.: An Algorithm for Tracking Multiple Targets.
IEEE Trans. Automatic Control 24(6), 843–854 (1979)
[15] Bar-Shalom, Y., Fortmann, T.: Tracking and Data
Association. Academic Press, London (1988)
[16] Babenko, B., Yang, M., Belongie, S.: Visual Tracking with
Online Multiple Instance Learning. In: IEEE CVPR (2009)
[17] Li, Y., Huang, C., Nevatia, R.: Learning to Associate:
HybridBoosted Multi-Target Tracker for Crowded Scene.
In: IEEE CVPR (2009)
[18] Yang, M., Wu, Y., Hua, G.: Context-Aware Visual
Tracking. IEEE Trans. on Pattern Analysis and Machine
Intelligence (July 2009)
[19] "Intersection over Union (IoU) for Object Detection."
PyImageSearch. N.p., 27 Sept. 2016. Web. 11 June 2017.
<http://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/>.
[20] "MOT Challenge." MOT Challenge. N.p., n.d. Web. 11
June 2017. <https://motchallenge.net/>.
[21] Signal & Image Processing : An International Journal
(Sipij) Vol.7, No.3, June 2016. SURVEILLANCE VIDEO
USING COLOR AND HU MOMENTS (n.d.): n. pag. Web.
[22] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman.
Return of the devil in the details: Delving deep into
convolutional nets. In Proc. BMVC., 2014.
[23] Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause,
Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej
Karpathy, Aditya Khosla, Michael Bernstein, Alexander C.
Berg and Li Fei-Fei. (* = equal contribution) ImageNet
Large Scale Visual Recognition Challenge. International
Journal of Computer Vision, 2015.
[24] S. J. Pan and Q. Yang, "A Survey on Transfer Learning," in
IEEE Transactions on Knowledge and Data Engineering,
vol. 22, no. 10, pp. 1345-1359, Oct. 2010.
[25] Bernardin, Keni, Alexander Elbs, and Rainer Stiefelhagen.
"Multiple object tracking performance metrics and
evaluation in a smart room environment." Sixth IEEE
International Workshop on Visual Surveillance, in
conjunction with ECCV. Vol. 90. 2006.
[26] Ren, Shaoqing, et al. "Faster r-cnn: Towards real-time
object detection with region proposal networks." Advances
in neural information processing systems. 2015.
