00 (2020) Sun Chen - A Survey of Multiple Pedestrian Tracking Based On Tracking-By-Detection Framework
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSVT.2020.3009717, IEEE Transactions on Circuits and Systems for Video Technology
1051-8215 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
MOT. As this survey was one of its kind, there was no scope to discuss the TBD framework or the deep learning-based MOT algorithms that gained significant attention in later years.
• In 2015, Qiu et al. [2] presented a survey on motion-based MOT algorithms. Their study focused on the radar-related research area; the discussion of visual information was missing. Moreover, from the TBD framework point of view, the authors discussed only the data association module, which is one of the four steps in the TBD framework. The other steps, such as object localization, feature extraction, and track management, cannot be ignored in the TBD framework.
• Later, Camplani et al. [3] provided a survey on MPT algorithms associated with RGB-D data. This survey filled the gaps in the visual information that was not fully covered in [2]. However, the discussion of the TBD framework was still missing.
• One of the most closely related surveys outlining MPT algorithms is presented by Zhou et al. [4]. The four stages of MPT algorithms are discussed. However, a major limitation is that no detailed performance analysis was performed on the existing MPT algorithms.
• Note that none of the previous surveys [1]–[4] cover the deep-learning methods on this topic. In 2019, Xu et al. [5] summarized and analyzed deep learning-based MOT algorithms. In particular, they focused on the application of deep learning techniques in MOT algorithms, and the TBD framework was not thoroughly investigated. Besides, Ciaparrone et al. [6] provided a review of deep learning-based MOT algorithms. They mainly focused on the application of deep learning techniques in four steps of the MOT algorithm. However, they ignored the track management module, which is an important module in TBD-based algorithms. Although deep-learning techniques have gained significant attention in recent years, traditional algorithms still play a major role in MOT.

From the above discussion, it is evident that TBD is becoming the main framework in MPT. However, none of the surveys [1]–[6] particularly focused on it. Therefore, it is necessary to summarize and analyze the existing TBD-based MPT algorithms to pave the way for further study of TBD-based methods for MPT. In this paper, our aim is to provide a survey that introduces the TBD framework in MPT in detail. The main contributions of this survey are summarized as follows:
• We illustrate a timeline to introduce milestones of existing TBD-based algorithms and discuss the main steps in a TBD framework. These can help researchers understand the development of existing TBD-based algorithms and the main models used in each step of a TBD framework.
• Furthermore, we present the experimental results of TBD-based algorithms on publicly available MOT Challenge datasets and analyze the characteristics of each tracker in detail. We also discuss the major factors that affect tracking performance in MPT.
• Finally, this survey outlines the open issues and future directions in TBD-based algorithms for MPT. We aim to provide meaningful insight for the development of new tracking methods in TBD-based algorithms for MPT.

The rest of the paper is organized as follows. Section II reviews the milestones of existing TBD-based methods with a timeline. The four main steps and two processing techniques in the procedure of the TBD framework are described in Section III. Section IV introduces the evaluation metrics and publicly available datasets, and analyzes the performance of existing TBD-based algorithms on these datasets. Existing issues and future research directions of TBD-based algorithms are discussed in Section V. Finally, conclusions are drawn in Section VI.

II. MILESTONES OF EXISTING TBD-BASED METHODS

Through the unremitting efforts of researchers, TBD-based approaches have achieved remarkable success in different aspects. We have reviewed the existing TBD-based algorithms of the past decade and found that researchers mainly focused on the following four aspects: (a) designing association methods, (b) joining other vision tasks, (c) applying deep learning to MPT, and (d) multi-modality-based MPT. For ease of understanding, we select the first proposed work in each aspect to serve as a milestone and describe the motivation and principle of the proposed work. We illustrate a timeline introducing these milestones of the existing TBD-based methods in Fig. 2.

A. Association methods

As discussed above, the core of the TBD framework is data association. Researchers have proposed some classical association methods, many of which are still used as basic algorithms. For example, the Hungarian method was used to find the optimal association for an assignment problem in 2008 [7]. It is a locally optimal method that assigns an identity label to each detection in every frame. Due to its fast processing, the Hungarian method was widely used in MPT over the past decade [29]–[31]. Although the Hungarian method is fast, its accuracy is not up to the mark due to its locally optimal nature. Some researchers have therefore tried to model data association as a globally optimal solution. In 2008, Zhang et al. [19] introduced a network flow (NF) method for MPT, which models the association as disjoint flow paths in a cost-flow network. This is one of the earliest works that applied NF to MPT. With a globally optimal solution, this approach achieved good performance. Inspired by [19], many researchers have proposed improved algorithms, such as Lagrangian relaxation-based NF [22], pair-wise cost-based NF [24], and bi-level optimization-based NF [32].

In the same year, 2008, Shafique et al. [33] used the maximum weight independent set (MWIS) for the data association problem. They formulated data association as a maximum weight problem and obtained pedestrian trajectories via a globally optimal solution. This was the first time MWIS was used for solving data association in the TBD framework. However, the availability of reliable tracklets cannot be guaranteed in [33]. Thereafter, many researchers have tried to solve this problem by introducing several approaches, such as MWIS
[Fig. 2 milestones: NF-based, Zhang et al. [19]; Joint SOT and MPT, Xing et al. [47]; CRF-based, Yang et al. [35]; GMCP-based, Zamir et al. [42]; Deep learning-based, Kim et al. [54]; Multi-modality-based, Zhang et al. [59]]
Fig. 2: Milestones of existing TBD-based MPT algorithms over the past decade. We organize the TBD-based algorithms into four parts: data association methods (dotted rectangle), joint other vision tasks (dotted rounded rectangle), deep learning-based (solid rounded rectangle), and multi-modality-based (solid rectangle).
based on rank-constrained continuous relaxation [33] and polynomial-time MWIS [34].

It is worthwhile to note that both the NF and MWIS methods share the assumption that associations are independent of each other. However, association dependencies cannot be totally ignored. In 2011, Yang et al. [35] proposed a conditional random field (CRF) model for tracking multiple pedestrians. They formulated data association as a CRF with explicit use of association dependencies. Extensive experimental results have demonstrated the importance of association dependencies. On the basis of this idea, many improved CRF algorithms have been proposed, for example, pair-wise model-based CRF [14], mixed discrete-continuous CRF [36], and deep continuous CRF [37]. Besides, Andriyenko [38] observed that the number of possible trajectories over time is large in the NF method. The proposed continuous energy minimization (CEM) method considers the limitation of the state space. Since the CEM-based method is a globally optimal solution, good performance is obtained. Based on CEM, several improvements are discussed in discrete CEM [39], pair-wise label cost CEM [40], and sparse representation CEM [41].

As most of the existing association methods address all of the objects simultaneously, the computational complexity is very high. To deal with this issue, in 2012, Zamir et al. [42] proposed a global framework to track multiple pedestrians by utilizing generalized minimum clique graphs (GMCP). On the basis of GMCP, Dehghan et al. [43] formulated data association as a generalized maximum multi-clique problem (GMMCP). In fact, NF, MWIS, CRF, and GMCP are graph-based association methods that represent the pedestrian trajectories in a graph. In contrast, Tang et al. [44] formulated the association as a minimum cost subgraph multicut problem (MCSM) that jointly links and clusters the multiple plausible person detections over time and space. The number and size of tracks are not specified as constraints; rather, they are obtained from the solution. Based on MCSM, several research works have been proposed, such as DeepMatching [45] and the minimum cost lifted multicut formulation [46].

B. Joint other vision tasks

Several researchers have leveraged other vision tasks to improve the tracking performance. For example, some researchers argued that MPT is a generalized single object tracking (SOT) problem [47] where the locations of targets are estimated from multiple SOT tracking models. In 2009, Xing et al. [47] used an SOT to generate initial tracklets in the local stage. In addition, many researchers have used SOT to predict the location of pedestrians in MPT to handle missing detections in raw detection results in crowded scenes (e.g., KCF [48], SiamFC [49], and Siamese-RPN [50]).

Another vision task that is widely used in MPT is segmentation. In brief, segmentation can predict the location at the pixel level; for example, a level-set framework has been used to track the contours of multiple pedestrians with mutual occlusions in real time. In fact, tracking and segmentation are closely related, and they can help each other. For example, object segmentation would separate a person from other targets and the background, which is useful for locating the target in every frame. Many researchers pay attention to the multi-task topic that combines tracking and segmentation [51]–[53].

C. Deep learning in MPT

The appearance feature is an important cue for calculating the similarity between two detection boxes. Note that low-level handcrafted features were widely used in MPT before 2015. Interestingly, convolutional neural networks (CNN) have been applied in several vision tasks and have outperformed hand-engineered features. In 2015, Kim et al. [54] utilized a CNN to extract a 4096-dimensional feature for each detection box. This work was the first time a CNN was used to extract a high-level feature in MPT. Since then, CNNs have been widely used in MPT. Several CNN models have been used to design more robust and distinct features, such as VGGNet [46], [55] and GoogleNet [50], [56].

Obviously, the many modules in MPT result in high complexity. Some researchers have tried to design an end-to-end model in a single CNN framework. In 2017, Milan et al. [57] proposed, for the first time, an end-to-end model for online MPT. They cast the classical Bayesian state estimation, data association, as well as track initiation and termination tasks as a recurrent neural net, allowing for full end-to-end learning of the model. Inspired by this article, many researchers have proposed good end-to-end models, such as the end-to-end tracklet association module [32] and the end-to-end transportation network [58].

D. Multi-modality-based MPT

Generally, a single type of data has been used as input for tracking in a traditional TBD-based algorithm. However, this
[Fig. 3 pipeline: Video sequences → Object localization → Feature extraction → Data association → Track management → Tracking results, with pre-processing and post-processing stages]
Fig. 3: The main procedure of the TBD framework, which consists of four core components and two processing techniques. The four core components are pedestrian localization, feature extraction, data association, and track management. The processing techniques are pre-processing and post-processing.
method cannot preserve the reliability. To address the problem, Zhang et al. [59] introduced a multi-modality MOT framework for tracking objects. To solve the MOT problem with multi-modality, the authors used an image and point cloud feature extractor in the feature extraction phase. Through extensive experiments, it has been observed that multi-modality-based algorithms can improve both reliability and accuracy. Inspired by this work, some researchers have paid attention to multi-modality-based algorithms. For example, Gautam et al. [60] introduced a practical and lightweight tracking system, termed SDVTracker, a multi-sensor tracker with both LiDAR and detections as asynchronous input. Later, Kuang et al. [61] proposed a general multi-modality cascaded fusion framework which combines detection and LiDAR information.

III. THE MAIN PROCEDURE OF THE TBD FRAMEWORK

The goal of MPT is to detect multiple pedestrians in each frame and maintain their identity information across frames. However, few discussions have focused on the main procedure of the TBD framework. Despite the considerable variety of TBD-based approaches discussed in the literature, the majority of TBD-based algorithms consider either a part or all of the following steps: object localization, feature extraction, data association, and track management, as shown in Fig. 3. In addition to these four main steps, many TBD-based algorithms may contain pre-processing and post-processing techniques. In the following, we discuss the basic models and approaches involved in each of the steps of a TBD framework.

A. Pedestrian localization

As the detection result is used as input to the data association, it has a significant impact on the association results as well as the final tracking result. Generally, the detection results are provided by the publicly available MPT datasets for a fair comparison. However, some state-of-the-art detectors often miss detections due to occlusion in crowded scenes. Hence, to better track the pedestrians, many researchers have used other object localization methods to recover the missing detections. These localization methods can assist the public detection results. On the whole, in the object localization step, detection results are mainly obtained in two ways: provided by MPT datasets, or obtained by other localization methods.

1) Detection results from MPT datasets: Earlier, the histograms of oriented gradients (HOG) detector [62], the deformable part-based model (DPM) detection method [63], and the background subtraction method [64] were widely used to detect pedestrians in previous tracking datasets, such as PETS2009 (http://www.cvg.reading.ac.uk/PETS2009/a.html), TUD (https://www.d2.mpi-inf.mpg.de/node/428), and ETHMS (http://www.vision.ee.ethz.ch/en/datasets/). Later, the aggregated channel features (ACF) [65] detection algorithm was used to detect pedestrians in the images of the MOT2015 dataset [66] (https://motchallenge.net/data/2D_MOT_2015/) released by MOT Challenge in 2015. Interestingly, the deformable parts model v5 (DPM) [67], which outperforms other detectors, has been used on the MOT2016 dataset (https://motchallenge.net/data/MOT16/). Later, faster R-CNN (FRCNN) [68], DPM, and the scale-dependent pooling detector (SDP) [69] were used on the MOT2017 dataset (https://motchallenge.net/data/MOT17/). Most recently, the pedestrian detection results were obtained using an improved FRCNN [70] with a ResNet101 backbone for the CVPR19 training sequences on the MOT2019 dataset.

2) Other localization methods: In addition to the detection results provided by the public MPT datasets, many researchers have used other localization methods to locate pedestrians in TBD-based algorithms. The main localization methods include: filter-based, motion model-based, other computer vision task-based, and deep learning-based. We summarize these methods in Table I.
• Firstly, filter-based methods can be used to predict the location of a pedestrian in the next frame, where the current object state depends only on the previous states. If a pedestrian detection is missing, filter-based approaches can predict the position at the next time step. The most common filters used in MPT include the Kalman filter (KF) [30], the extended Kalman filter (EKF) [38], and the particle filter (PF) [88].
• Secondly, using the kinetic characteristics of motion to predict the location is simple and fast, under the assumption that the pedestrian moves consistently over a short time interval. Moreover, assuming constant velocity, a constant
velocity model was used to predict the location in the next frame [81], [82].
• Thirdly, other vision tasks can be used in MPT to locate the missing detections. Joint segmentation and tracking is one of the popular methods [51]–[53]. Specifically, object segmentation separates a person from other targets and the background, which is useful for locating the person in every frame. During occlusion, the pixel labels in the visible part of the pedestrian guide a tracker to find the correct location of the pedestrian. Another method is to use SOT to locate the pedestrian in the next frame. It uses the appearance template of the pedestrian from the previous frame to search the next frame for the position with the highest probability. Note that SOT not only can handle inaccurate detection results, but also reduces identity switches in MPT [29], [47], [89].
• At last, deep learning techniques can be used to predict the location of the pedestrian in the next frame. Some researchers used long short-term memory (LSTM) to learn a complex dynamic model for predicting the pedestrian in the next frame [57], [85]. The generative adversarial network (GAN) architecture has also been used to predict pedestrian localization, which overcomes issues related to occlusion and noisy detection [90]. In addition, the recurrent neural network (RNN) [86] and deep reinforcement learning (DRL) [87] can be used to predict the pedestrian location.

B. Feature extraction

To find the similarity between two pedestrians in MPT, the appearance feature is extracted before the data association phase in the TBD-based algorithm. It is an essential cue for affinity computation in MPT. Appearance features are broadly categorized into low-level handcrafted and high-level features.

1) Handcrafted features: Before the popularity of deep learning in MOT, handcrafted features were the main features used to distinguish pedestrians. A handcrafted feature is based on a raw pixel template representation for simplicity. There are four popular handcrafted features used in MPT: the color histogram, gradient-based features, optical flow, and the local binary pattern.

The color histogram (CH) is the most widely used visual feature in image retrieval [91]–[93]. The main reason is that color is often related to the pedestrian or scene contained in the image. It is the most commonly used color feature representation, as shown in Fig. 4(a). It is robust to photometric changes, but it ignores the spatial distribution of pixel values. Generally, the histogram of oriented gradients (HOG) is a feature descriptor used for pedestrian detection in computer vision [23], [88], [94], [95]. It forms the feature from the histogram of the gradient directions of a local area. As shown in Fig. 4(b), it is invariant to geometric and photometric deformations, but it is hard to handle occlusion. Moreover, optical flow (OF) can be regarded as a local feature if we take the image pixel as the unit [94], [96], [97]. It is suitable for crowded scenarios, however, with a higher computational complexity. An example of OF is illustrated in Fig. 4(c). Furthermore, the local binary pattern (LBP) is used to describe the local texture of images [98], [99]. As shown in Fig. 4(d), it is invariant to grayscale changes and rotation.

2) High-level features extracted by CNN: With the upsurge of deep learning and the attractive performance of features extracted by CNNs in visual fields, high-level features extracted by CNNs have been widely used in MPT (see Fig. 4(e)). To the best of our knowledge, Kim et al. [54] introduced high-level feature extraction in MPT. Afterward, several high-level features have been extracted by CNNs in MPT.

At present, the high-level features used in MPT can be grouped into two categories: spatial and spatial-temporal features. Basically, a spatial feature is the feature of a bounding box extracted by various CNN models in a single image, such as SiameseNet [49], [100], VGGNet [46], [55], [59], AlexNet [31], ResNet [25], [30], [32], [101], and GoogleNet [56], [102]. Besides, as another kind of high-level feature, spatial-temporal features have also become popular in recent years. Compared to spatial features, spatial-temporal features depict the appearance characteristics of pedestrians over multiple frames. They combine information in the time and space domains and are more robust. Note that ResNet [103], LSTM [73], [85], [104], VGGNet [72], FaceNet [105], and GoogleNet [50] are widely used CNN models for extracting spatial-temporal features.

C. Data association

Data association is one of the core components in MPT. The goal of data association is to identify the correspondence between a collection of new detections and previously detected pedestrians. According to the processing mode, data association can be further categorized into online and offline methods. An online association approach associates the detections of the incoming frame immediately to the existing
Fig. 4: An illustration of various visual features. (a) CH feature, (b) HOG feature, (c) OF feature, (d) LBP feature, and (e)
CNN feature.
trajectories; therefore, this approach is more appropriate for real-time applications. Bipartite graph matching (such as the Hungarian algorithm) is widely used for online data association. In contrast, the offline association approach considers the detections from all the frames of an entire sequence. It is worthwhile to note that global optimization methods, such as NF [19] and the maximum weight independent set [33], are widely used in offline data association.

1) Online data association: Online association methods aim to obtain a locally optimal solution and focus on exploring which features to use for calculating the similarity between a detection and an existing track. Denote D_t = {D_t^i}, i = 1, ..., n, as the set of n detections at time t and T^{t-1} = {T_j^{(t-1)}}, j = 1, ..., m, as the set of m trajectories at frame (t-1). Each trajectory consists of detections T_j = {D_{j,t}}, t = t_{j,s}, ..., t_{j,e}, where D_{j,t} is the detection of the jth trajectory in the tth frame, and t_{j,s} and t_{j,e} are the starting and ending frames of T_j, respectively.

Firstly, the affinity between a detection and a track is constructed. This affinity not only can contain a single cue (i.e., motion) [106], [107], but also can contain multiple cues (i.e., appearance and motion features) [41], [108]–[110]. Therefore, the affinity is expressed as

Λ(i, j) = Λ_A(i, j) Λ_M(i, j) Λ_S(i, j),   (1)

where Λ_A(i, j), Λ_M(i, j), and Λ_S(i, j) represent the affinities of the appearance, motion, and shape models between the ith detection and the jth track, respectively.

Secondly, after calculating the affinity between detection and track, the cost C is obtained as

C(i, j) = 1 − Λ(i, j),   (2)

where C(i, j) is the cost between detection i and track j. If detection i is similar to track j, then the cost C(i, j) is small. Finally, the Hungarian algorithm is used to compute an assignment which minimizes the total cost.

2) Offline data association: Offline association methods can obtain a globally optimal solution. The graph model is the more popular approach, where vertex nodes represent detection results in each frame (or short tracklets) and the relevant edges denote the cost measuring the similarity between two detections (or short tracklets). These graph-based methods include network flow (NF) [19], conditional random field (CRF) [35], generalized minimum clique graphs (GMCP) [43], maximum weight independent set (MWIS) [33], and minimum cost subgraph multicut (MCSM) [44]. Another popular global optimization method used in TBD-based algorithms is continuous energy minimization (CEM) [38]. The goal of CEM is to fit a set of trajectories to the data while satisfying constraints mimicking tracking in real-world scenarios. We have discussed these six main offline data association methods in Section II. Moreover, we provide brief descriptions in Table III, where G is the graph consisting of vertex set V and edge set E, both l and σ indicate the connection between two edges, re is the regularization term, w represents the weight of an edge, d keeps the solution close to the detections, and dn, ex, and pe are the motion, physical, and pedestrian persistence constraints, respectively.

D. Track management

After data association, a rule needs to be designed to manage the association results. Generally, track management contains three essential steps: track update, track termination, and track initialization.
• Track update: After data association, we update the state of each track that successfully matches a detection. If a track is associated with one and only one detection, we assume that the pedestrian is isolated and tracked correctly (but not necessarily precisely). Then, the state of the pedestrian is updated with the detection.
• Track termination: When a track does not associate with any detections, we consider that the pedestrian is
To reduce the potential false-positive tracks or remove false-positive detections from a trajectory, the matched frames threshold (MAF) [28], [46], [119] is used as a post-processing technique,
TABLE V: Results of TBD-based algorithms on MOT2015. Red values denote the best. ↑ indicates that higher is better and ↓ the opposite.
Year and Author Tracker MOTA(↑) MOTP(↑) IDF1(↑) MT(↑) ML(↓) FP(↓) FN(↓) IDS (↓) Frag(↓) Hz(↑)
2019 Bergmann et al. TWBW [100] 44.1 75.0 46.7 18.0 26.2 6477 26577 1318 1790 0.9
2019 Chu et al. IAT [121] 38.9 70.6 44.5 16.6 31.5 7321 29501 720 1440 0.3
2017 Chen et al. MTCN [122] 38.5 72.6 47.1 8.7 37.4 4005 33203 586 1263 6.7
2019 Xu et al. STRN [103] 38.1 72.1 46.6 11.5 33.4 5451 31571 1033 2665 13.8
2017 Sadeghian et al. TTU [123] 37.6 71.7 46 15.8 26.8 7933 29397 1026 2024 1.9
2018 Keuper et al. MSA [83] 35.6 71.9 45.1 23.20 39.3 10580 28508 457 969 0.6
2018 Fang et al. RAN [86] 35.1 70.9 45.4 13.0 42.3 6771 32717 381 1523 5.4
2017 Yang et al. HAD [81] 35.0 72.6 47.7 11.4 42.2 8455 31140 358 1267 4.6
2019 Wu et al. IARL [124] 34.7 70.7 42.1 12.5 30.0 9855 29158 1112 2848 2.6
2017 Chu et al. STAM [125] 34.3 70.5 48.3 11.4 43.4 5154 34848 348 1463 0.5
2016 Wang et al. CDE [126] 34.3 71.7 44.1 14.0 39.4 7869 31908 618 959 6.5
2017 Son et al. QCNN [31] 33.8 73.4 40.4 12.9 36.9 7898 32061 703 1430 3.7
2018 Zhou et al. DCCRF [37] 33.6 70.9 39.1 10.4 37.6 5917 34002 866 1566 0.1
2016 Yang et al. TDAM [82] 33.0 72.8 46.1 13.3 39.1 10064 30617 464 1506 5.9
2018 Bae et al. CBDA [108] 32.8 70.7 38.8 9.7 42.2 4983 35690 614 1583 2.3
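The MOTP and IDF1 columns in these tables follow the definitions given in this section (Eq. (6) and the identification F-score); a minimal sketch of both computations, with toy input values of our own:

```python
# Sketch of the MOTP (Eq. (6)) and IDF1 definitions used in the tables.
# `matches_per_frame` holds the IOU of every matched ground-truth/hypothesis
# pair in each frame; the counts passed to idf1 are toy values.

def motp(matches_per_frame):
    """MOTP = 1 - (sum of IOUs of all matched pairs) / (total matches)."""
    ious = [iou for frame in matches_per_frame for iou in frame]
    return 1.0 - sum(ious) / len(ious)

def idf1(id_true_positives, num_gt, num_pred):
    """Correctly identified detections over the average number of
    ground-truth and computed detections."""
    return id_true_positives / (0.5 * (num_gt + num_pred))

score = motp([[0.8, 0.6], [0.7, 0.9]])  # 1 - (sum of four IOUs) / 4
f = idf1(80, 100, 100)
```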
TABLE VI: Characteristics of TBD-based algorithms tested on MOT2015. CMC: camera motion compensation; Motion: constant velocity model; MAF: matched frames threshold; IPL: interpolation; DPS: det-pruning-subnet.
Year and Author Tracker Other object localisation Appearance Data association Pre-processing Post-processing MOTA(↑) Hz(↑) Open source
2019 Bergmann et al. TWBW [100] CMC Deep On NMS No 44.1 0.9 Yes
2019 Chu et al. IAT [121] SOT Deep MCSM No No 38.9 0.3 No
2017 Chen et al. MTCN [122] PF Deep On No No 38.5 6.7 No
2019 Xu et al. STRN [103] No Deep On No MAF 38.1 13.8 No
2017 Sadeghian et al. TTU [123] No Deep On No No 37.6 1.9 No
2018 Keuper et al. MSA [83] Segmentation Handcrafted MCSM CF No 35.6 0.6 No
2018 Fang et al. RAN [86] RNN Deep On No No 35.1 5.4 No
2017 Yang et al. HAD [81] Motion Deep NF No No 35.0 4.6 No
2019 Wu et al. IARL [124] MBN Deep On DPS No 34.7 2.6 No
2017 Chu et al. STAM [125] SOT Deep On NMS No 34.3 0.5 No
2016 Wang et al. CDE [126] No Handcrafted NF No IPL 34.3 6.5 No
2017 Son et al. QCNN [31] No Deep NF No No 33.8 3.7 No
2018 Zhou et al. DCCRF [37] No Deep CRF No No 33.6 0.1 No
2016 Yang et al. TDAM [82] Motion Handcrafted On No No 33 5.9 No
2018 Bae et al. CBDA [108] No Deep NF No No 32.8 2.3 Yes
second of the tracker. For the completeness metrics, mostly tracked targets (MT), mostly lost targets (ML), and fragmentation (Frag) are used to indicate how completely the ground-truth trajectories are tracked. In addition, another metric, called identification F-score (IDF1), is defined as the ratio of correctly identified detections over the average number of ground-truth and computed detections [129].

MOTP = 1 − (Σ_t Σ_i IOU(GT_t^i, H_t^i)) / (Σ_t M_t),   (6)

where IOU(GT_t^i, H_t^i) represents the intersection over union (IOU) of ground-truth pedestrian i and its associated tracking result in frame t, and M_t denotes the number of matched pairs in frame t.

B. MOT benchmark datasets

The publicly available MOT benchmark is a unified evaluation platform for pedestrian tracking. So far, four pedestrian datasets have been released in the MOT benchmark, as summarized in Table IV. In 2015, Leal-Taixé et al. [66] released the first dataset, MOT2015, which contains a total of 22 sequences; half of them are for training and the remaining are left for testing. The ACF detector [65] was used to obtain the detection results of the MOT2015 dataset. Later, in 2016, the benchmark team released the MOT2016 and MOT2017 datasets. To briefly explain, the MOT2016 dataset contains 14 video sequences, including 7 training sequences and 7 test sequences. The deformable part-based model (DPM) v5 [67] was used
to detect the pedestrians on MOT2016. The MOT2017 dataset includes the same videos as MOT2016, but contains three sets of detections for each video, namely FRCNN [68], DPM [67], and SDP [69]. The density of each sequence can reach up to 25 pedestrians per frame on average in both MOT2016 and MOT2017. Recently, the CVPR19 challenge dataset, released in 2019, consists of 8 new sequences, out of which 3 are very crowded scenes. Note that the average density of a sequence can reach a value of 179 pedestrians per frame.

C. Performance of TBD-based algorithms on MOT datasets

In this section, we first present the results of TBD-based algorithms on several MOT datasets, which were obtained from the benchmark before December 1st, 2019 and used the publicly available detection results. We selected the top fifteen tracking algorithms on MOT2015, MOT2016, and MOT2017. Since MOT2019 has been released only recently, the relevant evaluation results are not available on the benchmark, so we do not discuss and analyze the performance of trackers on MOT2019. Second, we analyze the characteristics of each tracker used in the evaluation. Then, we analyze the overall tracking performance on the different MOT datasets. Finally, we discuss which model in each step has the highest impact on the tracking performance.

1) The overall tracking performance on the challenge datasets: From Tables V, VII, and IX, we see that the overall performance of the trackers on the MOT2015 dataset is lower than on the other two datasets. The overall accuracy of the trackers on MOT2015 is almost always less than 45%, whereas on the other two datasets it is over 45%. Although MOT2017 includes the same videos as MOT2016, the overall performance of the trackers on the MOT2017 dataset is higher than on MOT2016, as shown in Table IX. There are two main reasons. The first reason is the improvement in detection performance. For example, the ACF [65] detector used on MOT2015 was the most advanced detector in 2015. Subsequently, more advanced detectors such as DPM [67], SDP [69], and FRCNN [68] were applied in MOT2017. In addition, compared to MOT2016, multiple detectors can be selected in MOT2017, and SDP and FRCNN outperform DPM. The second reason is the use of different-level features. To explain, the features used in some trackers on MOT2015 were handcrafted features; however, the features used in the vast majority of trackers on the other datasets were high-level features extracted by various CNNs.

2) Tracking performance with different detection results: As the detection result acts as an input to the MPT algorithm, it has a great impact on the performance of the tracking algorithm. We test eight state-of-the-art detectors, namely mask R-CNN (MASK) [138], Yolo v3 (YOLO) [139], cascade mask R-CNN (CAS) [140], Hybrid Task Cascade (HAC) [26], SDP [69], DPM [67], GT, and FRCNN [68], on the MOT2017 train dataset to obtain different detection results, as shown
TABLE X: Characteristics of TBD-based algorithms tested on MOT2017. JD denotes joint other detection, such as head detection; DSA denotes detection-scene analysis.
Year and Author Tracker Other object localisation Appearance Data association Pre-processing Post-processing MOTA(↑) Hz(↑) Open source
2019 Feng et al. SAC [50] SOT Deep On NMS No 54.7 1.5 No
2019 Bergmann et al. TWBW [100] CMC Deep On NMS No 53.5 1.5 Yes
2019 Henschel et al. BJD [28] JD Deep NF No MAF 52.6 5.4 No
2019 Chu et al. FAMA [49] SOT Deep NF No No 52.0 0 No
2019 Wang et al. ETC [105] EG Deep MCSM No No 51.9 0.7 No
2019 Sheng et al. HAGF [135] SOT Deep MWIS No No 51.8 0.7 No
2018 Sheng et al. AFN [32] KF Deep NF NMS No 51.5 1.8 No
2018 Henschel et al. FHFB [136] JD Handcrafted CRF No No 51.3 0.2 No
2019 Chen et al. ATA [130] No Deep On No No 51.3 17.8 No
2018 Keuper et al. MSA [83] Segmentation Handcrafted MCSM No No 51.2 1.8 No
2019 Xu et al. STRN [103] No Deep On No MAF 50.9 13.8 No
2018 Long et al. MOTDT [56] KF Deep On NMS No 50.9 18.3 Yes
2018 Sheng et al. IMHT [113] No Deep MWIS No No 50.6 2.6 No
2019 Yoon et al. DTAMA [8] KF Deep On NMS No 50.3 1.5 No
2017 Chen et al. EDM [137] KF Deep MWIS DSA No 50.0 0.6 No
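Several trackers in Table X apply NMS as a pre-processing step on the raw detections. A minimal sketch of greedy IOU-based NMS follows; the box format and threshold are our own illustration, not any particular tracker's implementation:

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop boxes overlapping it by more than `thresh`, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the two heavily overlapping boxes collapse to one
```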
association and select four tracking algorithms (IOU [106], MOTDT [56], SST [141], and SORT [30]) for the testing purpose. Among these four trackers, IOU and SORT do not use the appearance feature, whereas both the appearance and the motion features are used in MOTDT and SST. For fair

[Figure: MODP (%) of the different detectors on the MOT2017 train set.]
[Figure: MOTA (in %) of the four trackers with the different detection results on the MOT2017 train set; panels (a)-(d).]
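Of these four, the IOU tracker [106] relies on box overlap alone. A minimal sketch of that kind of greedy frame-to-frame association follows; the box format, threshold, and function names are our own illustration, not the reference implementation:

```python
def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(prev_boxes, new_boxes, sigma=0.5):
    """Greedily match each existing track's last box to the unmatched new
    detection with the highest IOU, accepting matches above `sigma`."""
    matches, used = {}, set()
    for tid, pb in prev_boxes.items():
        best, best_iou = None, sigma
        for j, nb in enumerate(new_boxes):
            if j in used:
                continue
            overlap = iou(pb, nb)
            if overlap > best_iou:
                best, best_iou = j, overlap
        if best is not None:
            matches[tid] = best
            used.add(best)
    return matches

tracks = {0: (0, 0, 10, 10), 1: (100, 100, 120, 120)}
dets = [(101, 102, 121, 122), (1, 0, 11, 10)]
m = associate(tracks, dets)
```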
features used in these trackers are high-level features extracted by the CNN. As we discussed earlier, the features extracted by a CNN are more robust. As a result, these high-level features can achieve better performance than traditional handcrafted features.

5) Tracking performance with different data association types: The performance of offline trackers as a whole is better than that of online trackers on MOT2017, as shown in Tables IX and X (seven of the top ten tracking algorithms are offline trackers). Basically, the offline trackers consider MOT as a global optimization problem and leverage various optimization methods, such as minimum cost subgraph multicut (MCSM). Unlike online trackers, which only consider the current and past frame information, both past and future frame information is used in the offline trackers. In fact, the solutions of the online methods are local optima, while those of the offline methods are global optima. Therefore, the tracking performance of offline methods is higher than that of online methods.

V. EXISTING ISSUES AND FUTURE RESEARCH DIRECTIONS

In this section, we endeavor to present some of the existing issues and outline the future research directions of TBD-based algorithms.

A. Existing Issues

1) Limited open source code. Unlike other computer vision tasks, MPT has limited open source code. Only a few algorithms in Tables VI, VIII, and X provide the source code. This phenomenon limits further advancement due to the difficulty of reproducing the results. Since there are many steps in a TBD framework, code reproduction is hard for researchers, especially for beginners.

2) The tracking performance highly depends on object detection results. As discussed above, the first step in TBD is to obtain the detection results. Note that an identical algorithm would produce different tracking results with significant performance differences by using different detection results while fixing the other components [76], [86], [89], [108], [122], [126].

3) Tradeoff between accuracy and speed. As discussed above, the balance between accuracy and speed is very important in MPT. As shown in Tables VI, VIII, and X, some algorithms focus on accuracy and thus use offline methods and other manipulations in their trackers. For example, the MOTA of the SAC tracker [50] can reach up to 54.7%, but its speed is only 1.5 Hz. It is hard to meet the requirements of real-time applications, such as autonomous driving [142]. In contrast, some researchers focus more on speed and design online methods to improve it, but the accuracy is not as good. For example, the speed of the EAG tracker [133] can reach up to 197.3 Hz on MOT2017, but its accuracy is low (i.e., 47.4%) compared to the others.

4) Other challenges. Note that the appearance feature is one of the important cues for calculating the similarity between two pedestrians in data association. At the same time, occlusion occurs frequently in crowded scenes. Although many approaches have already been proposed to solve the similar appearance [50], [72], [105], [143] and occlusion problems [83], [104], [120], the tracking performance is still poor.

B. Future research directions

Although many TBD-based algorithms have been proposed in recent years, there are still research gaps in MPT. In the following, we outline some possible research directions.

1) MPT with end-to-end model. The TBD framework involves multiple individual data processing steps that are optimized separately from each other, which results in complex method design and extensive parameter tuning to adapt to different target categories and tracking scenarios. At present, a few researchers have started to design end-to-end models to track multiple pedestrians [32], [57], [112], [141], [144]–[146]. Milan et al. [57] designed an end-to-end model that contains the four steps of the TBD framework in a single network. Shen et al. [32] proposed an end-to-end tracklet association module to associate tracklets. Sun et al. [141] designed an end-to-end fashion for the association by jointly modeling pedestrian appearances and their affinities between different frames.

2) Joint task-based MPT. Missed detections often occur in MPT in crowded scenes. Other vision tasks, such as SOT and segmentation, can help MPT to localize pedestrians better. A joint task combining MPT and other vision tasks not only can reduce the number of missed pedestrians, but also can improve the performance of the other vision tasks. Basically, SOT utilizes an appearance template to search for the pedestrian location in the next frame, so it is suitable for short-term prediction in MPT in crowded scenes [48]–[50], [71]. Tracking and segmentation are closely related, and they can help each other. Object segmentation would separate
pedestrians from other targets and the background, which will be useful for locating the person in every frame [52], [53], [89].

3) Multiple 3D pedestrian tracking. The significant challenges in MPT include noise in pedestrian detection, appearance change, and identity switches caused by pedestrian occlusion and similar appearance between pedestrians in a group. Because 2D tracking cannot obtain the spatial coordinate information of the pedestrian, it does not support shape-related measurements, such as thickness and discriminative features. With the increasing demand for precision in many applications (e.g., autonomous navigation, robotics, and sport), 3D tracking has become more popular [143], [147]. Compared with 2D tracking, 3D tracking can offer more compelling information, such as depth information that helps predict the object movement, and the scale can be more reliable. Besides, 3D geometry information can also be leveraged in the formulation of data association in the TBD framework [143].

VI. CONCLUSION

This survey presents a comprehensive review of the TBD framework. We first introduced the development of TBD-based algorithms with a timeline. Afterward, the main procedures of the TBD framework are summarized. We also presented the main approaches in each step in detail. Moreover, the evaluation metrics and publicly available datasets are discussed. Besides, the performance and characteristics of existing TBD-based methods on these datasets are analyzed. Finally, this article outlines important research issues that need to be solved in the TBD-based algorithms for MPT.

REFERENCES

[1] W. Luo, J. Xing, A. Milan, X. Zhang, W. Liu, X. Zhao, and T.-K. Kim, “Multiple object tracking: A literature review,” arXiv preprint arXiv:1409.7618, 2014.
[2] C. Qiu, Z. Zhang, H. Lu, and H. Luo, “A survey of motion-based multitarget tracking methods,” Prog. Electromagn. Res. B, vol. 62, pp. 195–223, 2015.
[3] M. Camplani, A. Paiement, M. Mirmehdi, D. Damen, S. Hannuna, T. Burghardt, and L. Tao, “Multiple human tracking in RGB-depth data: A survey,” IET Comput. Vis., vol. 11, no. 4, pp. 265–285, Jun. 2017.
[4] S. Zhou, M. Ke, J. Qiu, and J. Wang, “A survey of multi-object video tracking algorithms,” in Proc. Adv. Intell. Sys. Comput., Jul. 2018, pp. 351–369.
[5] Y. Xu, X. Zhou, S. Chen, and F. Li, “Deep learning for multiple object tracking: A survey,” IET Comput. Vis., vol. 13, no. 4, pp. 355–368, Jan. 2019.
[6] G. Ciaparrone, F. L. Sánchez, S. Tabik, L. Troiano, R. Tagliaferri, and F. Herrera, “Deep learning in video multi-object tracking: A survey,” Neurocomputing, vol. 381, pp. 61–88, Mar. 2020.
[7] V. K. Singh, B. Wu, and R. Nevatia, “Pedestrian tracking by associating tracklets using detection residuals,” in IEEE Workshop Motion Video Comput., Jan. 2008, pp. 1–8.
[8] Y.-C. Yoon, D. Y. Kim, K. Yoon, Y.-m. Song, and M. Jeon, “Online multiple pedestrian tracking using deep temporal appearance matching association,” arXiv preprint arXiv:1907.00831, 2019.
[9] T. Kimura, M. Ohashi, R. Okada, and H. Ikeno, “A new approach for the simultaneous tracking of multiple honeybees for analysis of hive behavior,” Apidologie, vol. 42, no. 5, pp. 607–617, Sept. 2011.
[10] H. Rahmani, A. Mian, and M. Shah, “Learning a deep model for human action recognition from novel viewpoints,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 667–681, Mar. 2017.
[11] W. Ruan, W. Liu, Q. Bao, J. Chen, Y. Cheng, and T. Mei, “Poinet: Pose-guided ovonic insight network for multi-person pose tracking,” in Proc. ACM Int. Conf. Multimed., Oct. 2019, pp. 284–292.
[12] W. Ruan, J. Chen, Y. Wu, J. Wang, C. Liang, R. Hu, and J. Jiang, “Multi-correlation filters with triangle-structure constraints for object tracking,” IEEE Trans. Multimedia, vol. 21, no. 5, pp. 1122–1134, May 2018.
[13] M. Ye, C. Liang, Y. Yu, Z. Wang, Q. Leng, C. Xiao, J. Chen, and R. Hu, “Person reidentification via ranking aggregation of similarity pulling and dissimilarity pushing,” IEEE Trans. Multimedia, vol. 18, no. 12, pp. 2553–2566, Dec. 2016.
[14] B. Yang and R. Nevatia, “Online learned discriminative part-based appearance models for multi-human tracking,” in Proc. Eur. Conf. Comput. Vis., Oct. 2012, pp. 484–498.
[15] L. Zhang and L. van der Maaten, “Preserving structure in model-free tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 4, pp. 756–769, Apr. 2013.
[16] H. Morimitsu, I. Bloch, and R. M. Cesar-Jr, “Exploring structure for long-term tracking of multiple objects in sports videos,” Comput. Vis. Image Underst., vol. 159, pp. 89–104, Jun. 2017.
[17] L. Zhang and L. van der Maaten, “Structure preserving object tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1838–1845.
[18] W. Hu, X. Li, W. Luo, X. Zhang, S. Maybank, and Z. Zhang, “Single and multiple object tracking using log-euclidean riemannian subspace and block-division appearance model,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 12, pp. 2420–2440, Dec. 2012.
[19] L. Zhang, Y. Li, and R. Nevatia, “Global data association for multi-object tracking using network flows,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[20] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, “Globally-optimal greedy algorithms for tracking a variable number of objects,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1201–1208.
[21] A. Dehghan, Y. Tian, P. H. S. Torr, and M. Shah, “Target identity-aware network flow for online multiple target tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1146–1154.
[22] A. A. Butt and R. T. Collins, “Multi-target tracking by lagrangian relaxation to min-cost network flow,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1846–1853.
[23] W. Choi and S. Savarese, “A unified framework for multi-target tracking and collective activity recognition,” in Proc. Eur. Conf. Comput. Vis., Oct. 2012, pp. 215–230.
[24] V. Chari, S. Lacoste-Julien, I. Laptev, and J. Sivic, “On pairwise costs for network flow multi-object tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 5537–5545.
[25] S. Schulter, P. Vernaza, W. Choi, and M. Chandraker, “Deep network flow for multi-object tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jan. 2017, pp. 6951–6960.
[26] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang et al., “Hybrid task cascade for instance segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 4974–4983.
[27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis., Sept. 2014, pp. 740–755.
[28] R. Henschel, Y. Zou, and B. Rosenhahn, “Multiple people tracking using body and joint detections,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 1–10.
[29] X. Yan, X. Wu, I. A. Kakadiaris, and S. K. Shah, “To track or to detect? An ensemble framework for optimal selection,” in Proc. Eur. Conf. Comput. Vis., Sept. 2012, pp. 594–607.
[30] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in Proc. Int. Conf. Image Process., Sept. 2017, pp. 3645–3649.
[31] J. Son, M. Baek, M. Cho, and B. Han, “Multi-object tracking with quadruplet convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jan. 2017, pp. 5620–5629.
[32] H. Shen, L. Huang, C. Huang, and W. Xu, “Tracklet association tracker: An end-to-end learning-based association approach for multi-object tracking,” arXiv preprint arXiv:1808.01562, 2018.
[33] K. Shafique, M. W. Lee, and N. Haering, “A rank constrained continuous formulation of multi-frame multi-target tracking problem,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[34] W. Brendel, M. Amer, and S. Todorovic, “Multiobject tracking as maximum weight independent set,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1273–1280.
[35] B. Yang, C. Huang, and R. Nevatia, “Learning affinities and dependencies for multi-target tracking using a CRF model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1233–1240.
[36] A. Milan, K. Schindler, and S. Roth, “Detection- and trajectory-level exclusion in multiple object tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 3682–3689.
[37] H. Zhou, W. Ouyang, J. Cheng, X. Wang, and H. Li, “Deep continuous conditional random fields with asymmetric inter-object constraints for online multi-object tracking,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 4, pp. 1011–1022, Apr. 2018.
[38] A. Andriyenko and K. Schindler, “Multi-target tracking by continuous energy minimization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1265–1272.
[39] A. Andriyenko, K. Schindler, and S. Roth, “Discrete-continuous optimization for multi-target tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1926–1933.
[40] A. Milan, K. Schindler, and S. Roth, “Multi-target tracking by discrete-continuous energy minimization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2054–2068, Oct. 2016.
[41] M. Thoreau and N. Kottege, “Improving online multiple object tracking with deep metric learning,” arXiv preprint arXiv:1806.07592, 2018.
[42] A. R. Zamir, A. Dehghan, and M. Shah, “GMCP-tracker: Global multi-object tracking using generalized minimum clique graphs,” in Proc. Eur. Conf. Comput. Vis., Oct. 2012, pp. 343–356.
[43] A. Dehghan, S. Modiri Assari, and M. Shah, “GMMCP tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 4091–4099.
[44] S. Tang, B. Andres, M. Andriluka, and B. Schiele, “Subgraph decomposition for multi-target tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 5033–5041.
[45] S. Tang, B. Andres, and M. Andriluka, “Multi-person tracking by multicut and deep matching,” in Proc. Eur. Conf. Comput. Vis., Oct. 2016, pp. 100–111.
[46] S. Tang, M. Andriluka, B. Andres, and B. Schiele, “Multiple people tracking by lifted multicut and person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2017, pp. 3539–3548.
[47] J. Xing, H. Ai, and S. Lao, “Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1200–1207.
[48] H. Wu and W. Li, “Robust online multi-object tracking based on KCF trackers and reassignment,” in IEEE Glob. Conf. Signal Inf. Process., Apr. 2016, pp. 124–128.
[49] P. Chu and H. Ling, “FAMNet: Joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2019, pp. 6172–6181.
[50] W. Feng, Z. Hu, W. Wu, J. Yan, and W. Ouyang, “Multi-object tracking with multiple cues and switcher-aware classification,” arXiv preprint arXiv:1901.06129, 2019.
[51] Q. Zhang and K. N. Ngan, “Segmentation and tracking multiple objects under occlusion from multiview video,” IEEE Trans. Image Process., vol. 20, no. 11, pp. 3308–3313, Nov. 2011.
[52] A. Milan, L. Leal-Taixé, K. Schindler, and I. Reid, “Joint tracking and segmentation of multiple targets,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 5397–5406.
[53] P. Voigtlaender, M. Krause, A. Osep, J. Luiten, B. B. G. Sekar, A. Geiger, and B. Leibe, “MOTS: Multi-object tracking and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 7942–7951.
[54] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg, “Multiple hypothesis tracking revisited,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 4696–4704.
[55] Y. Xu, L. Qin, X. Liu, J. Xie, and S.-C. Zhu, “A causal and-or graph model for visibility fluent reasoning in tracking interacting objects,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2178–2187.
[56] C. Long, A. Haizhou, Z. Zijie, and S. Chong, “Real-time multiple people tracking with deeply learned candidate selection and person re-identification,” in Proc. IEEE Int. Conf. Multimedia Expo., Jul. 2018, pp. 1–8.
[57] A. Milan, S. H. Rezatofighi, A. Dick, I. Reid, and K. Schindler, “Online multi-target tracking using recurrent neural networks,” in AAAI Conf. Artif. Intell., Feb. 2017, pp. 4225–4232.
[58] M. Ullah and F. A. Cheikh, “Deep feature based end-to-end transportation network for multi-target tracking,” in Proc. IEEE Int. Conf. Image Process., Oct. 2018, pp. 3738–3742.
[59] W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy, “Robust multi-modality multi-object tracking,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2019, pp. 2365–2374.
[60] S. Gautam, G. P. Meyer, C. Vallespi-Gonzalez, and B. C. Becker, “SDVTracker: Real-time multi-sensor association and tracking for self-driving vehicles,” arXiv preprint arXiv:2003.04447, 2020.
[61] H. Kuang, X. Liu, J. Zhang, and Z. Fang, “Multi-modality cascaded fusion technology for autonomous driving,” arXiv preprint arXiv:2002.03138, 2020.
[62] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2005, pp. 886–893.
[63] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sept. 2010.
[64] M. Piccardi, “Background subtraction techniques: A review,” in Proc. IEEE Int. Conf. Syst. Man Cybern., Oct. 2004, pp. 3099–3104.
[65] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Aggregate channel features for multi-view face detection,” in Proc. IEEE Int. Jt. Conf. Biom., Dec. 2014, pp. 1–8.
[66] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “MOTChallenge2015: Towards a benchmark for multi-target tracking,” arXiv preprint arXiv:1504.01942, 2015.
[67] M. A. Sadeghi and D. Forsyth, “30Hz object detection with DPM V5,” in Proc. Eur. Conf. Comput. Vis., Sept. 2014, pp. 65–79.
[68] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Adv. Neural Inf. Process. Syst., Jan. 2015, pp. 91–99.
[69] F. Yang, W. Choi, and Y. Lin, “Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 2129–2137.
[70] P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler, and L. Leal-Taixé, “CVPR19 tracking and detection challenge: How crowded can it get?” arXiv preprint arXiv:1906.04567, 2019.
[71] S. Pan, Z. Tong, Y. Zhao, Z. Zhao, F. Su, and B. Zhuang, “Multi-object tracking hierarchically in visual data taken from drones,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2019, pp. 135–143.
[72] Y.-m. Song, K. Yoon, Y.-C. Yoon, K.-C. Yow, and M. Jeon, “Online multi-object tracking framework with the GMPHD filter and occlusion group management,” arXiv preprint arXiv:1907.13347, 2019.
[73] X. Wan, J. Wang, and S. Zhou, “An online and flexible multi-object tracking framework using long short-term memory,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1230–1238.
[74] T. Badal, N. Nain, and M. Ahmed, “Online multi-object tracking: Multiple instance based target appearance model,” Multimed. Tools Appl., vol. 77, no. 19, pp. 25199–25221, Oct. 2018.
[75] J. Ju, D. Kim, B. Ku, D. K. Han, and H. Ko, “Online multi-object tracking based on hierarchical association framework,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 34–42.
[76] R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro, “Online multi-target tracking with strong and weak detections,” in Proc. Eur. Conf. Comput. Vis., Oct. 2016, pp. 84–99.
[77] J.-W. Choi, D. Moon, and J.-H. Yoo, “Robust multi-person tracking for real-time intelligent video surveillance,” ETRI J., vol. 37, no. 3, pp. 551–561, Jun. 2015.
[78] S. Tian, F. Yuan, and G.-S. Xia, “Multi-object tracking with inter-feedback between detection and tracking,” Neurocomputing, vol. 171, pp. 768–780, Jan. 2016.
[79] C. M. Bukey, S. V. Kulkarni, and R. A. Chavan, “Multi-object tracking using kalman filter and particle filter,” in Proc. IEEE Int. Conf. Power, Control, Signals Instrum. Eng., Sept. 2017, pp. 1688–1692.
[80] A. Milan, S. Roth, and K. Schindler, “Continuous energy minimization for multitarget tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 58–72, Jan. 2014.
[81] M. Yang, Y. Wu, and Y. Jia, “A hybrid data association framework for robust online multi-object tracking,” IEEE Trans. Image Process., vol. 26, no. 12, pp. 5667–5679, Dec. 2017.
[82] M. Yang and Y. Jia, “Temporal dynamic appearance modeling for online multi-person tracking,” Comput. Vis. Image Underst., vol. 153, pp. 16–28, Dec. 2016.
[83] M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele, “Motion segmentation & multiple object tracking by correlation co-clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 1, pp. 140–153, Jan. 2018.
[84] M. Keuper, S. Tang, Y. Zhongjie, B. Andres, T. Brox, and B. Schiele, “A multi-cut formulation for joint segmentation and tracking of multiple objects,” arXiv preprint arXiv:1607.06317, 2016.
[85] C. Kim, F. Li, and J. M. Rehg, “Multi-object tracking with neural gating using bilinear LSTM,” in Proc. Eur. Conf. Comput. Vis., Sept. 2018, pp. 200–215.
[86] K. Fang, Y. Xiang, X. Li, and S. Savarese, “Recurrent autoregressive networks for online multi-object tracking,” in Proc. IEEE Winter Conf. Appl. Comput. Vis., Mar. 2018, pp. 466–475.
[87] L. Ren, J. Lu, Z. Wang, Q. Tian, and J. Zhou, “Collaborative deep reinforcement learning for multi-object tracking,” in Proc. Eur. Conf. Comput. Vis., Sept. 2018, pp. 586–602.
[88] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, “Robust tracking-by-detection using a detector confidence particle filter,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2009, pp. 1515–1522.
[89] Y. Xiang, A. Alahi, and S. Savarese, “Learning to track: Online multi-object tracking by decision making,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 4705–4713.
[90] T. Fernando, S. Denman, S. Sridharan, and C. Fookes, “Tracking by prediction: A deep generative model for multi-person localisation and tracking,” in Proc. IEEE Workshop Appl. Comput. Vis., Dec. 2018, pp. 1122–1132.
[91] B. Yang and R. Nevatia, “Multi-target tracking by online learning a CRF model of appearance and motion patterns,” Int. J. Comput. Vis., vol. 107, no. 2, pp. 203–217, Apr. 2014.
[92] L. Wang, N. T. Pham, T.-T. Ng, G. Wang, K. L. Chan, and K. Leman, “Learning deep features for multiple object tracking by using a multi-task learning strategy,” in Proc. Int. Conf. Image Process., Jan. 2014, pp. 838–842.
[93] X. Wang and Q. Wang, “Coupled data association and L1 minimization for multiple object tracking under occlusion,” in Proc. SPIE Int. Soc. Opt. Eng., Oct. 2014, pp. 1–22.
[94] H. Izadinia, I. Saleemi, W. Li, and M. Shah, “(MP)²T: Multiple people multiple parts tracker,” in Proc. Eur. Conf. Comput. Vis., Oct. 2012, pp. 100–114.
[95] F. Zhao, J. Wang, Y. Wu, and M. Tang, “Adversarial deep tracking,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 7, pp. 1998–2011, Jul. 2018.
[96] M. Rodriguez, S. Ali, and T. Kanade, “Tracking in unstructured crowded scenes,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2009, pp. 1389–1396.
[97] S. Walk, N. Majer, K. Schindler, and B. Schiele, “New features and insights for pedestrian detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 1030–1037.
[98] C. Jia, Z. Wang, X. Wu, B. Cai, Z. Huang, G. Wang, T. Zhang, and D. Tong, “A tracking-learning-detection (TLD) method with local binary pattern improved,” in Proc. IEEE Int. Conf. Robotics Biomimetics, Dec. 2015, pp. 1625–1630.
[99] P. P. Dash, D. Patra, and S. K. Mishra, “Local binary pattern as a texture feature descriptor in object tracking algorithm,” in Proc. Adv. Intell. Sys. Comput., Jun. 2014, pp. 541–548.
[100] P. Bergmann, T. Meinhardt, and L. Leal-Taixé, “Tracking without bells and whistles,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2019, pp. 941–951.
[101] W. Li, J. Mu, and G. Liu, “Multiple object tracking with motion and appearance cues,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops, Oct. 2019, pp. 161–169.
[102] Y. Xu, X. Liu, L. Yang, and S. C. Zhu, “Multi-view people tracking via hierarchical trajectory composition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 4256–4265.
[103] J. Xu, Y. Cao, Z. Zhang, and H. Hu, “Spatial-temporal relation networks for multi-object tracking,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2019, pp. 3988–3998.
[104] X. Gao and T. Jiang, “OSMO: Online specific models for occlusion in multiple object tracking under surveillance scene,” in Proc. ACM Multimed. Conf., Oct. 2018, pp. 201–210.
[105] G. Wang, Y. Wang, H. Zhang, R. Gu, and J.-N. Hwang, “Exploit the connectivity: Multi-object tracking with TrackletNet,” in Proc. ACM Multimed. Conf., Oct. 2019, pp. 482–490.
[106] E. Bochinski, V. Eiselein, and T. Sikora, “High-speed tracking-by-detection without using image information,” in Proc. IEEE Int. Conf. Adv. Video Signal Based Surveill., Oct. 2017, pp. 1–6.
[107] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in Proc. Int. Conf. Image Process., Aug. 2016, pp. 3464–3468.
[108] S. H. Bae and K. J. Yoon, “Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 595–610, Mar. 2018.
[109] H. Yu, Q. Lei, Q. Huang, and H. Yao, “Online multiple object tracking via exchanging object context,” Neurocomputing, vol. 292, no. 31, pp. 28–37, May 2018.
[110] H. Kieritz, W. Hubner, and M. Arens, “Joint detection and online multi-object tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1540–1548.
[111] B. Wang, G. Wang, K. L. Chan, and L. Wang, “Tracklet association with online target-specific metric learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1234–1241.
[112] J. Xiang, M. Chao, G. Xu, and J. Hou, “End-to-end learning deep CRF models for multi-object tracking,” arXiv preprint arXiv:1907.12176, 2019.
[113] H. Sheng, J. Chen, Y. Zhang, W. Ke, Z. Xiong, and J. Yu, “Iterative multiple hypothesis tracking with tracklet-level association,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 8, pp. 1–13, Dec. 2019.
[114] L. Ma, S. Tang, M. J. Black, and L. Van Gool, “Customized multi-person tracker,” in Lect. Notes Comput. Sci., Dec. 2018, pp. 612–628.
[115] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, “Soft-NMS: Improving object detection with one line of code,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 5561–5569.
[116] U. Iqbal, A. Milan, and J. Gall, “PoseTrack: Joint multi-person pose estimation and tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2017, pp. 2011–2020.
[117] Y. Xu, Y. Ban, X. Alameda-Pineda, and R. Horaud, “DeepMOT: A differentiable framework for training multiple object trackers,” arXiv preprint arXiv:1906.06618, 2019.
[118] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele, “ArtTrack: Articulated multi-person tracking in the wild,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2017, pp. 6457–6465.
[119] A. Maksai, X. Wang, F. Fleuret, and P. Fua, “Non-Markovian globally consistent multi-object tracking,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2544–2554.
[120] L. Lan, X. Wang, S. Zhang, D. Tao, W. Gao, and T. S. Huang, “Interacting tracklets for multi-object tracking,” IEEE Trans. Image Process., vol. 27, no. 9, pp. 4585–4597, Sept. 2018.
[121] P. Chu, H. Fan, C. C. Tan, and H. Ling, “Online multi-object tracking with instance-aware tracker and dynamic model refreshment,” in Proc. IEEE Workshop Appl. Comput. Vis., Jan. 2019, pp. 161–170.
[122] L. Chen, H. Ai, C. Shang, Z. Zhuang, and B. Bai, “Online multi-object tracking with convolutional neural networks,” in Proc. Int. Conf. Image Process., Sept. 2017, pp. 645–649.
[123] A. Sadeghian, A. Alahi, and S. Savarese, “Tracking the untrackable: Learning to track multiple cues with long-term dependencies,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 300–311.
[124] H. Wu, Y. Hu, K. Wang, H. Li, L. Nie, and H. Cheng, “Instance-aware representation learning and association for online multi-person tracking,” Pattern Recognit., vol. 94, pp. 25–34, Oct. 2019.
[125] Q. Chu, W. Ouyang, H. Li, X. Wang, B. Liu, and N. Yu, “Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 4836–4845.
[126] B. Wang, G. Wang, K. L. Chan, and L. Wang, “Tracklet association by online target-specific metric learning and coherent dynamics estimation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 3, pp. 589–602, Feb. 2017.
[127] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang, “Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 319–336, Feb. 2009.
[128] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: The CLEAR MOT metrics,” EURASIP J. Image Video Process., vol. 2008, pp. 1–11, 2008.
[129] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in Proc. Eur. Conf. Comput. Vis., Oct. 2016, pp. 17–35.
[130] L. Chen, H. Ai, R. Chen, and Z. Zhuang, “Aggregate tracklet appearance features for multi-object tracking,” IEEE Signal Process. Lett., vol. 26, no. 11, pp. 1613–1617, Nov. 2019.
[131] C. Ma, C. Yang, F. Yang, Y. Zhuang, Z. Zhang, H. Jia, and X. Xie, “Trajectory factory: Tracklet cleaving and re-connection by deep siamese Bi-GRU for multiple object tracking,” in Proc. IEEE Int. Conf. Multimedia Expo., Jul. 2018, pp. 1–6.
[132] E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres, “Joint graph decomposition & node labeling: Problem, algorithms, applications,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2017, pp. 6012–6020.
[133] H. Sheng, X. Zhang, Y. Zhang, Y. Wu, J. Chen, and Z. Xiong, “Enhanced association with supervoxels in multiple hypothesis tracking,” IEEE Access, vol. 7, pp. 2107–2117, 2018.
[134] W. Tian, M. Lauer, and L. Chen, “Online multi-object tracking using joint domain information in traffic scenarios,” IEEE Trans. Intell. Transp. Syst., vol. 20, no. 2, pp. 1–11, Jan. 2019.
[135] H. Sheng, Y. Zhang, J. Chen, Z. Xiong, and J. Zhang, “Heterogeneous association graph fusion for target association in multiple object tracking,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 11, pp. 3269–3280, Nov. 2019.
[136] R. Henschel, L. Leal-Taixé, D. Cremers, and B. Rosenhahn, “Fusion of head and full-body detectors for multi-object tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1428–1437.
[137] J. Chen, H. Sheng, Y. Zhang, and Z. Xiong, “Enhancing detection model for multiple hypothesis tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2017, pp. 18–27.
[138] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2961–2969.
[139] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[140] Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6154–6162.
[141] S. Sun, N. Akhtar, H. Song, A. S. Mian, and M. Shah, “Deep affinity network for multiple object tracking,” IEEE Trans. Pattern Anal. Mach. Intell., 2019.
[142] A. Agarwal and S. Suryavanshi, “Real-time* multiple object tracking (MOT) for autonomous navigation,” Tech. Rep., 2017.
[143] Z. Tang and J.-N. Hwang, “MOANA: An online learned adaptive appearance model for robust multiple object tracking in 3D,” IEEE Access, vol. 7, pp. 31934–31945, 2019.
[144] T. Hu, L. Huang, and H. Shen, “Multi-object tracking via end-to-end tracklet searching and ranking,” arXiv preprint arXiv:2003.02795, 2020.
[145] S. Wang, Y. Sun, C. Liu, and M. Liu, “PointTrackNet: An end-to-end network for 3-D object detection and tracking from point clouds,” IEEE Robot. Autom. Lett., Apr. 2020.
[146] J. Xiang, G. Xu, C. Ma, and J. Hou, “End-to-end learning deep CRF models for multi-object tracking,” IEEE Trans. Circuits Syst. Video Technol., 2020.
[147] E. Baser, V. Balasubramanian, P. Bhattacharyya, and K. Czarnecki, “FANTrack: 3D multi-object tracking with feature association network,” arXiv preprint arXiv:1905.02843, 2019.

Chao Liang received the Ph.D. degree from the National Lab of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China, in 2012. He is currently working as an associate professor at the National Engineering Research Center for Multimedia Software (NERCMS), Computer School of Wuhan University, Wuhan, China. His research interests focus on multimedia content analysis and retrieval, computer vision, and pattern recognition, where he has published over 60 papers, including premier conferences such as CVPR, ACM MM, AAAI, and IJCAI and honorable journals like TNNLS, TMM, and TCSVT, and won the best paper award of PCM 2014.

Weijian Ruan received the B.E. degree from the Electronic Information School of Wuhan University in 2014. Since September 2014, he has been pursuing his Ph.D. degree in the School of Computer Science, Wuhan University. From March 2018 to September 2018, he served as an intern at the National Institute of Informatics, Tokyo, Japan. From January 2019 to June 2019, he was an intern at JD AI Research, China. His research interests focus on computer vision and multimedia analysis, where he has published over 10 papers including AAAI, ACM MM, TMM, TOMM, ICME, ICIP, etc. In addition, he is active in his research field and has served as a reviewer for related journals and conferences, such as TMM, TIP, AAAI 2020, ACM MM 2020, ICME 2020, and ICASSP 2020.