Article
Multi-Player Tracking for Multi-View Sports Videos
with Improved K-Shortest Path Algorithm
Qiaokang Liang 1,2 , Wanneng Wu 1,2,3, *, Yukun Yang 3 , Ruiheng Zhang 3 , Yu Peng 3,4
and Min Xu 3, *
1 College of Electrical and Information Engineering, Hunan University, Changsha 410082, China;
[email protected]
2 National Engineering Laboratory for Robot Vision Perception and Control Technology, Hunan University,
Changsha 410082, China
3 Faculty of Engineering and IT, University of Technology Sydney, Sydney NSW 2007, Australia;
[email protected] (Y.Y.); [email protected] (R.Z.);
[email protected] (Y.P.)
4 Tong Xing Technology, Guangzhou 510000, China
* Correspondence: [email protected] (W.W.); [email protected] (M.X.)
Received: 26 November 2019; Accepted: 19 January 2020; Published: 27 January 2020
Abstract: Sports analysis has recently attracted increasing research effort in computer vision. Among its subfields, basketball video analysis is very challenging due to severe occlusions and fast motions. As a typical tracking-by-detection method, the k-shortest paths (KSP) tracking framework has been widely used for multiple-person tracking. While effective and fast, its neglect of the appearance model easily leads to identity switches, especially when two or more players are intertwined with each other. This paper addresses this problem by taking appearance features into account within the KSP framework. Furthermore, we introduce a similarity measurement method that can fuse multiple appearance features together. In this paper, we select the jersey color and jersey number as two example features. Experiments indicate that extracting about 70% of jersey colors and 50% of jersey numbers over a whole sequence ensures that our proposed method preserves player identities better than the existing KSP tracking method.
1. Introduction
Multiple object tracking (MOT) is a computer vision task of great significance in terms of both commercial and academic potential. Most of its applications focus on driver assistance and visual surveillance, especially pedestrian and vehicle tracking. Recently, multiple-player tracking in team sports [1] has become another popular research field for MOT, driven by the worldwide passion for sports, especially basketball and soccer. The popularity of sports naturally creates huge market demand for sports analysis using computer vision methods; the potential applications include player or team performance analysis, technical-tactics analysis, motion capture, and novel applications in sports broadcasting [2,3]. To those ends, multiple-player tracking, which serves as the necessary foundation of the above applications, has been widely studied recently.
Compared with multiple pedestrian tracking, basketball player tracking is very challenging due to frequent occlusions and abrupt movements, so many existing tracking methods fail in this field. Moreover, players on the same team wear the same uniforms, which makes them more difficult to distinguish. In addition, body-shape variations, motion blur, spectator interference, and illumination changes make the players hard to track reliably.
Figure 1. An example of the identity switch problem, which easily occurs when the two players annotated with thicker lines are close to each other. (a,b) Blue player number 7 (annotated by a red bounding box) and yellow player number 4 (labeled with a green bounding box) approach each other from frame 1016 to 1025. (c) The two players separate and then swap their identities in frame 1030. (d) The wrong identities remain in frame 1035 and will not change until the two players meet again.
Inspired by KSP-based tracking, we further address the problem of identity switches by computing the similarity of players in consecutive frames and using it as the linking weight between two players. To calculate the similarity, appropriate features need to be carefully selected to distinguish the players. Generally, robust tracking algorithms rely on more than one feature to obtain accurate tracking results [12]. In this paper, both the jersey color and jersey number of each player, observed from multi-view cameras, are exploited for tracking. Generally speaking, these two features together are enough to determine a player's identity accurately. However, due to occlusions, they can be precisely extracted only in limited frames, especially the jersey number. For this reason, our proposed method is also designed to allow more features to be considered.
Overall, our major contributions are as follows. We improve the existing KSP multiple-player tracking method by taking appearance features and a similarity measurement into consideration. Although only two features are applied for similarity computation in this paper, our method can take multiple features into account while assigning different weights to them. Moreover, we carried out extensive experiments to quantify the appearance-feature requirements of our proposed approach.
Appl. Sci. 2020, 10, 864 3 of 22
2. Related Work
As stated in a detailed review [4], the objective of MOT is to find the optimal sequential states of all the objects, which can generally be modeled as maximum a posteriori (MAP) estimation over the conditional distribution of the sequential states given all the observations:
\[
\hat{S}_{1:t} = \arg\max_{S_{1:t}} P(S_{1:t} \mid O_{1:t}), \tag{1}
\]
where $S_{1:t} = \{S_1, S_2, \ldots, S_t\}$ denotes all the sequential states of all the objects from the first frame to the $t$-th frame, and $O_{1:t} = \{O_1, O_2, \ldots, O_t\}$ represents all the collected sequential observations of all the objects over the same frames.
Different MOT algorithms from previous works can be thought of as different approaches to solving the above MAP problem, either from a probabilistic-inference perspective or a deterministic-optimization perspective. The former usually solves the MAP problem with a two-step iterative procedure: a prediction step using a dynamic model followed by an update step using an observation model, as in traditional tracking methods such as mean-shift [13], the Kalman filter [14,15], the particle filter [16,17], and their variations [18–20]. Given the up-to-time observations, these dynamic approaches predict the current state by gradually extending the existing trajectories, which makes them well suited for online tracking. However, they easily lose a tracked object under occlusion, let alone under the severe occlusions of sports matches.
As for the MOT methods based on deterministic optimization, they solve the above problem by maximizing a likelihood function or, conversely, minimizing an energy function, an approach that is very popular nowadays. Because they obtain a globally optimal solution, they are usually more robust to false positives and occlusions. One main disadvantage of these approaches is the delay in outputting final results, as they need to process a batch of consecutive frames. They also depend on detection results for each frame. Fortunately, state-of-the-art detectors such as Faster R-CNN and YOLO [21] can run in real time, so the tracking output only needs to wait for one batch of frames to be processed, which can be kept within 5 to 10 s; this is acceptable in most applications, even a live sports broadcast.
In the team sports analysis field, global optimization methods are clearly more popular. As early as 2007, the MOT problem was formulated as an integer linear program to obtain a nearly optimal solution [22,23], but the number of targets in the scene had to be fixed a priori, which is restrictive. Some studies instead cast multiple-player tracking as a network flow problem, which can be solved in polynomial time [24,25]. Other global optimization methods [26] applied to MOT include graph-based approaches [27,28] and quadratic or linear objective optimization [29]. Even though most of these global approaches improve considerably on the earlier dynamic methods, they are still difficult to optimize and sometimes become trapped in local minima. In practice, such issues manifest as tracking errors like fragmented trajectories and identity switches. Compared with the above methods, MOT based on the KSP algorithm [9] is very fast and easily obtains the globally optimal solution. However, identities are still easily switched when two players are close to each other.
To address the problem of identity switches, appearance models [30] and motion models [31] have been explored. For example, J. Liu et al. [32] defined a set of game-context features to describe the current state of a match; the context-conditioned motion models implicitly incorporate complex inter-object correlations while remaining tractable using cost-flow networks. H. B. Shitrit et al. [33] also took image appearance into account on top of KSP tracking. These improved methods preserve identities over long sequences better than previous approaches, but they perform badly when most of the appearance or motion cues are unavailable, which is very common in team sports because of severe and frequent occlusions.
3. Methodology
Figure 2. The pipeline of the multiple-player tracking system based on the KSP-AF tracking method mainly consists of three parts: (1) 3D player detection [42], (2) feature extraction, and (3) tracking by k-shortest paths with appearance features (KSP-AF). The input of the whole system is multi-view basketball videos, and the output is multiple player trajectories. For KSP-AF tracking, the inputs are both the 3D detections represented as POM data and the extracted features for each detection.
Ideally, if all of the identity information were available, the obtained trajectories would be perfect, without any identity switches. However, due to the serious occlusions in team sports, only a few pieces of identity information can be accessed. In this paper, we therefore carry out extensive experiments to quantify the requirements on appearance features in KSP-based tracking.
\[
\rho_i^t = P(X_i^t = 1 \mid I^t), \tag{2}
\]
where $\rho_i^t$ is the estimated probability of location $i$ being occupied by a player at time $t$, $X_i^t$ is a random variable standing for the true occupancy of location $i$ at time $t$, and $I^t$ represents the original image at time $t$.
Moreover, our tracking algorithm needs to consider not only the occupancy probability of each location, but also more informative features for each grid cell occupied by a player. Thus, we further assume that we can compute an appearance feature model $Y_i^t$, which we use to estimate the identity similarity between node pairs in consecutive frames:
\[
s_{i,j\in N(i)}^{t} = P\big(Y_i^t = Y_{j\in N(i)}^{t+1} \mid I^t,\; X_i^t = 1\big), \tag{3}
\]
where $s_{i,j}^t$ is the similarity, i.e., the probability that the identity information in location $i$ at time $t$ is the same as that in location $j \in N(i)$ at time $t+1$; $N(i)$ is the neighborhood of location $i$; and $Y_i^t$ and $Y_j^{t+1}$ are the appearance features of location $i$ and location $j \in N(i)$ at times $t$ and $t+1$, respectively.
As our goal is to find a set of physically possible trajectories in which each trajectory contains only one player, we need, on the one hand, to link as many detections as possible and, on the other hand, to make the similarity between connected nodes as high as possible. We can formulate this as:
\[
\hat{m} = \arg\max_{m \in \Omega} \prod_{t=1}^{T}\prod_{i=1}^{n} P\big(Y_i^t = Y_{j\in N(i)}^{t+1},\; X_i^t = x_i^t \mid I^t\big), \tag{4}
\]
where $\hat{m}$ is a set of occupancy maps and $\Omega$ is the space of occupancy maps satisfying the constraints (7) to (10).
Assuming that the occupancy maps are conditionally independent given $I^t$, the objective function can be rewritten as follows by taking logarithms (the detailed derivation can be found in Appendix A of this paper):
\[
\hat{m} = \arg\max_{x \in \Omega} \sum_{t=1}^{T}\sum_{i=1}^{n} \log\!\left(\frac{\rho_i^t \cdot s_{i,j\in N(i)}^t}{1-\rho_i^t}\right) x_i^t. \tag{5}
\]
Here $x_i^t$ represents the actual occupancy of location $i$ at time $t$, and it is equal to the sum of the flows leaving that location ($\sum_{j\in N(i)} f_{i,j}^t$). Therefore, together with the constraints, our goal is to solve the following linear program:
\[
\max \sum_{t=1}^{T}\sum_{i=1}^{n} \log\!\left(\frac{\rho_i^t \cdot s_{i,j\in N(i)}^t}{1-\rho_i^t}\right) \sum_{j\in N(i)} f_{i,j}^t \tag{6}
\]
\[
\text{s.t.}\quad \forall t, i,\; f_{i,j\in N(i)}^t \geq 0 \tag{7}
\]
\[
\forall t, i,\; \sum_{j\in N(i)} f_{i,j}^t \leq 1 \tag{8}
\]
\[
\forall t, i,\; \sum_{j\in N(i)} f_{i,j}^t - \sum_{k:\, i\in N(k)} f_{k,i}^{t-1} \leq 0 \tag{9}
\]
\[
\sum_{j\in N(v_{source})} f_{v_{source},j} - \sum_{i:\, v_{sink}\in N(i)} f_{i,v_{sink}} = 0 \tag{10}
\]
Constraints (7) and (8) are reasonable because the flows must be nonnegative and a location cannot be occupied by more than one player at a time. Also, since a player cannot leave a location that no player has entered, we have constraint (9). Similarly, we impose constraint (10) because all of the flows departing from $v_{source}$ must end up in $v_{sink}$.
In the next step, we analyze how to solve the above linear program. Among various methods, the computational complexity can be drastically reduced by reformulating the problem as a k-shortest node-disjoint paths problem on a directed acyclic graph, as shown in Figure 2.
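The reformulation can be illustrated with a deliberately simplified sketch: single-pair shortest paths on a DAG computed in topological order (which tolerates the negative edge weights that Formula (11) can produce), with interior nodes deleted after each extraction to enforce node-disjointness. The actual KSP algorithm of Berclaz et al. [9] is more sophisticated, since it can reroute earlier paths; all names below are hypothetical.

```python
def dag_shortest_path(edges, order, source, sink):
    """Single shortest path on a DAG given a topological order.

    edges: dict mapping node -> list of (neighbor, weight).
    Negative weights are fine because each node is relaxed once,
    in topological order."""
    dist = {n: float("inf") for n in order}
    prev = {}
    dist[source] = 0.0
    for u in order:
        for v, w in edges.get(u, []):
            if dist[u] + w < dist.get(v, float("inf")):
                dist[v] = dist[u] + w
                prev[v] = u
    if dist.get(sink, float("inf")) == float("inf"):
        return None
    path, n = [], sink
    while n != source:
        path.append(n)
        n = prev[n]
    path.append(source)
    return path[::-1]

def k_node_disjoint_paths(edges, order, source, sink, k):
    """Greedy sketch: repeatedly extract the shortest path, then delete
    its interior nodes so later paths are node-disjoint."""
    edges = {u: list(vs) for u, vs in edges.items()}
    paths = []
    for _ in range(k):
        p = dag_shortest_path(edges, order, source, sink)
        if p is None:
            break
        paths.append(p)
        interior = set(p[1:-1])
        edges = {u: [(v, w) for v, w in vs if v not in interior]
                 for u, vs in edges.items() if u not in interior}
    return paths
```

On a toy two-frame graph this extracts one trajectory per player until no source-to-sink path remains.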
In order to apply the KSP to solve it, we convert the above maximization problem into an equivalent
minimization problem by negating the objective function. Therefore, a directed linking edge $e_{i,j}^t$ from node $i$ at time $t$ to node $j \in N(i)$ at time $t+1$ is assigned the weight
\[
w(e_{i,j\in N(i)}^t) = -\log\!\left(\frac{\rho_i^t \cdot s_{i,j}^t \cdot c}{1-\rho_i^t}\right), \tag{11}
\]
In Figure 2, the thin gray lines assume a depth of 2. Given a pair of virtual nodes, $v_{source}$ and $v_{sink}$, any path between them is considered a player's trajectory, such as the blue curve in Figure 2.
On the whole, the nodes of the graph are defined as the discrete grid cells, each of which corresponds to a location on the court. The graph's edges are of three types: (1) incoming edges, from the source node to the nodes in the first frame or the access nodes in other frames; (2) outgoing edges, from the nodes in the last frame or the access nodes in other frames to the terminal node; and (3) middle edges, from a node $i$ to its neighboring nodes $j \in N(i)$.
where $L_i^t$ and $L_j^{t+1}$ are the available feature labels for the two nodes ($m_i^t$, $m_j^{t+1}$), respectively, and $p_i^t$ and $p_j^{t+1}$ are their corresponding probabilities.
Figure 3. A simple flow model: a player in location i at time t can travel to location j ∈ N(i) at time t + 1. The green arrow denotes a strong match with a similarity near 1; the red dotted arrow denotes an impossible match with a similarity close to 0; the gray arrows are possible matches whose similarities are all set to 0.5.
However, due to occlusions, some features (especially the jersey number) are available in only very few frames. When the appearance feature at the current time t is unavailable, the similarity between node $m_i^t$ and node $m_j^{t+1}$ is set to the same value for all neighbors. This means that the probabilities of the player in location i at time t arriving at any of its neighboring locations j ∈ N(i) at time t + 1 are equal, as in the original KSP. In our proposed method, in order to make full use of the available features, we also consider the neighboring nodes in the previous frame to help discriminate players when the feature cannot be extracted in the current frame due to occlusion or low resolution. For example, when the jersey color in location i at time t is unavailable but is known for one of its neighboring nodes at the preceding time t − 1, as shown in Figure 4a,b, we can compute the similarity between the preceding and following frames instead. However, if more than one node with an available feature could arrive at location i (see Figure 4c), or if none of the neighboring nodes' features can be accessed at time t − 1 (see Figure 4d), we cannot match nodes i and j ∈ N(i), because the existing identity information is ambiguous or insufficient to judge whether they should be connected.
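The matching rules of Figures 3 and 4 can be sketched as a small decision function. The representation (a dictionary of feature labels keyed by (location, time), with missing entries standing for unextractable features) and all names are illustrative assumptions of ours; the paper's actual similarity also incorporates the feature probabilities of Formula (12), which are omitted here.

```python
def similarity(feat, i, j, t, predecessors):
    """Similarity s between node (i, t) and node (j, t+1).

    feat: dict mapping (location, time) -> feature label; a missing key
          means the feature could not be extracted for that node.
    predecessors: dict mapping location i -> locations k with i in N(k).
    """
    STRONG, IMPOSSIBLE, UNKNOWN = 1.0, 0.0, 0.5
    a, b = feat.get((i, t)), feat.get((j, t + 1))
    if a is not None and b is not None:
        return STRONG if a == b else IMPOSSIBLE
    # Fallback: feature at time t missing; consult the previous frame.
    if b is not None:
        known = [feat.get((k, t - 1)) for k in predecessors.get(i, [])]
        known = [v for v in known if v is not None]
        if len(known) == 1:          # Figure 4a,b: exactly one informative node
            return STRONG if known[0] == b else IMPOSSIBLE
        # Figure 4c (ambiguous) or 4d (no information): fall through.
    return UNKNOWN
```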
Figure 4. Illustration of the similarity measurement when the jersey color at the current frame (t) is unavailable: (a,b) one of the neighboring nodes (k) in the previous frame can access the jersey color; (c) the jersey color is available at more than one node (k1 and k2) at t − 1; (d) the jersey color cannot be obtained for any of the neighboring nodes at t − 1. A green arrow denotes a relatively strong match (s → 1); a red dotted arrow an impossible match (s → 0); the other gray arrows possible matches (s = 0.5).
Figure 5. Illustration of feature extraction and fusion: players 1, 2, and 3 are taken as three examples corresponding to three locations in the probability occupancy map. Each player can be mapped into several views to crop its image patches. The jersey color and jersey number are jointly determined from all the cropped images by the Boyer–Moore voting algorithm. The output of the color feature is represented as (L_c, p), where L_c ∈ {1 (yellow), 2 (blue), 3 (yellow and blue), 4 (none)} and p is its probability. The output of the number feature is denoted as (L_n, q), where L_n and q are the jersey number and its probability, respectively.
One location (i.e., one player) may correspond to several bounding boxes cropped from different cameras. Thus, the appearance feature can be jointly determined from these cropped images, which helps avoid some of the interference caused by occlusions. The appearance features used in this paper are the jersey color and jersey number. For the jersey color, assume the output of the color feature for one cropped image is represented as (L_c, p), where L_c ∈ {1 (yellow), 2 (blue), 3 (yellow and blue), 4 (none)} and p is its probability. We then need to fuse multiple color-feature candidates into one, which can be treated as a voting problem: deciding which team the player belongs to. This problem is easily solved by the classic Boyer–Moore voting algorithm [47]. The jersey number and its probability are obtained in the same way. It is worth mentioning, however, that before extracting the jersey number from each cropped image, we need to ensure that the player's color label is the same as the winning color label after voting. If not, the jersey number might be wrong due to occlusion (see player 3 in Cameras 2 and 4 in Figure 5).
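The fusion step reduces to the standard Boyer–Moore majority vote [47], sketched below for the color labels. The wrapper name and the plain integer labels follow the convention in the caption of Figure 5 but are otherwise our own assumptions.

```python
def boyer_moore_majority(labels):
    """Classic Boyer-Moore majority vote: one pass, O(1) extra memory.

    labels: list of hashable labels. Returns the majority element if one
    occurs in more than half of the votes, else None (after the
    verification pass)."""
    candidate, count = None, 0
    for x in labels:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # Verification pass: the surviving candidate is only valid if it
    # truly holds a strict majority.
    if candidate is not None and labels.count(candidate) * 2 > len(labels):
        return candidate
    return None

def fuse_color_labels(per_view_labels):
    """Fuse the per-view color labels of one player into a single label.
    Labels follow the paper's convention: 1 yellow, 2 blue,
    3 yellow-and-blue, 4 none."""
    return boyer_moore_majority(per_view_labels)
```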
Many existing methods are available for extracting the jersey color and jersey number from each cropped image. In this paper, we adopted a pre-trained person re-identification (re-ID) network [48] to obtain a descriptor for each bounding box (a 128-dimensional vector) and then classified the players into four classes with an SVM [49]: 1 (yellow player), 2 (blue player), 3 (yellow and blue player), and 4 (none). For the jersey number, we first trained a detection model to locate the number and then classified the number images with a six-layer convolutional neural network borrowed from a jersey-number recognition approach crafted for soccer players [50]. Other related work on jersey-number recognition can be found in [51,52]. Accurate feature extraction is very important for our tracking framework; further improving its accuracy is future work, while this paper focuses mainly on alleviating identity switches by taking full advantage of the appearance features. Thus, before our tracking method is applied, two prerequisites must be provided: (1) the player occupancy probability for each location on the court (provided by the POM algorithm [44]); and (2) the jersey color and jersey number labels with their corresponding probabilities (provided by re-ID [48] + SVM [49] and CNN-based jersey number recognition [50]).
4. Experiments
Figure 6. The APIDIS basketball dataset is captured by five ground cameras around the court and two fish-eye cameras above the court: (a) images captured from the seven views; (b) schematic diagram of the camera installation.
STU dataset: The dataset was collected by eight synchronized cameras with a resolution of 1280 × 720 at 25 fps, as shown in Figure 7. The cameras were evenly distributed around the basketball court. The dataset has better illumination conditions than the APIDIS dataset; however, occlusion is more serious since our cameras were installed at a lower position, although this installation helps us recognize more jersey numbers. Thus, this dataset is also challenging for tracking. We collected the STU dataset ourselves at a basketball match and divided the data into 10 periods of sequences. There is little difference between them except the frame length, which varies from 600 to 4400.
Figure 7. Our own STU basketball dataset is collected with eight ground cameras evenly
distributed around the court: (a) images captured from eight views; (b) the schematic diagram of
camera installation.
Figure 8. Connecting two tracking segments according to players' positions within a matching range of l frames. For example, suppose the batch size is L: both tracklet 1 and tracklet 2 contain L frames, with l frames overlapping. We call these overlapping l frames the matching range. We then calculate the distance between the two positions frame by frame within the matching range; if the distance is less than a given threshold, we call it a match, and otherwise a mismatch. Only if the proportion of matches exceeds 80% do we connect the two tracklets; otherwise, we consider the current tracklet segment a new trajectory.
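The 80% rule of Figure 8 can be sketched as a small decision function. The function name, the list-of-positions tracklet representation, and the use of Euclidean distance are illustrative assumptions.

```python
import math

def should_connect(tracklet1, tracklet2, overlap, threshold, min_ratio=0.8):
    """Decide whether two overlapping tracklets belong to one player.

    tracklet1, tracklet2: lists of (x, y) positions, one per frame; the
    last `overlap` frames of tracklet1 cover the same frames as the
    first `overlap` frames of tracklet2 (the matching range).
    Connect only if at least min_ratio of the per-frame distances fall
    below `threshold`."""
    tail = tracklet1[-overlap:]
    head = tracklet2[:overlap]
    matches = sum(
        1 for (x1, y1), (x2, y2) in zip(tail, head)
        if math.hypot(x1 - x2, y1 - y2) < threshold
    )
    return matches >= min_ratio * overlap
```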
Overall, compared with other related approaches, our proposed method depends on very few parameters: only the granularity of the grid cells, the depth of the KSP network, and the batch size. The batch size can be set according to the computing capability, while the other two parameters are determined by the players' moving speed and the frame rate. In our experiments, we divide the court into a discrete grid of size 128 × 72 × 1 and use a depth of 1. In view of the running speed, the batch size is set to 100 frames and the number of matching-range frames to 50. Thus, in addition to the virtual source and sink nodes, one batch contains 921,602 nodes in total. After the weights of all possible links are calculated with Formulas (11) and (12), a graph network is constructed, as shown in Figure 2. The number of linking edges in the graph mainly depends on the three parameters mentioned above: the KSP network depth, the grid size, and the batch size. More edges result in more time consumption during tracking.
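The node count quoted above follows directly from the chosen parameters; a quick arithmetic check, assuming one node per grid cell per frame plus the two virtual nodes:

```python
# Node count for one batch: a 128 x 72 ground grid per frame,
# 100 frames per batch, plus the virtual source and sink nodes.
GRID_W, GRID_H = 128, 72
BATCH = 100

nodes_per_frame = GRID_W * GRID_H          # 9216 grid locations
total_nodes = nodes_per_frame * BATCH + 2  # 921,602, as stated in the text
```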
4.3. Results
\[
\text{GMOTA} = 1 - \frac{\sum_t \left(c_1 \cdot fn_t + c_2 \cdot fp_t + c_3 \cdot gidsw_t\right)}{\sum_t g_t}, \tag{14}
\]
where the weighting factors are set to $c_1 = c_2 = c_3 = 1$; since $gidsw$ is already much larger than $idsw$, no extra penalty is needed.
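Formula (14) translates directly into code; the per-frame lists of false negatives, false positives, global identity switches, and ground-truth counts are a representational assumption of this sketch.

```python
def gmota(fn, fp, gidsw, gt, c1=1.0, c2=1.0, c3=1.0):
    """Global MOTA per Formula (14).

    fn, fp, gidsw, gt: per-frame lists of false negatives, false
    positives, global identity switches, and ground-truth object counts.
    """
    errors = sum(c1 * a + c2 * b + c3 * c for a, b, c in zip(fn, fp, gidsw))
    return 1.0 - errors / sum(gt)
```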
The first two experiments examine the trend of the tracking performance on the public APIDIS dataset. Finally, to further demonstrate the effectiveness of our method, we use our own STU dataset, totaling 10 periods of sequences whose lengths vary from 600 to 4400 frames.
(1) Tracking experiments with jersey color only (APIDIS dataset).
We first apply the original KSP tracking approach, without any appearance feature, to the 1500-frame APIDIS dataset, obtaining as many as 14 trajectories with MOTA and GMOTA scores of 0.83 and 0.71, respectively, as shown in Figure 9 (0%Color). Then, as we gradually feed more jersey color information into our KSP-AF tracking framework, the tracking performance improves very little or even degrades, as shown in Figure 9. Neither the MOTA nor the GMOTA score increases much until more than 70% of the color information is available. This is because players on the same team wear the same uniforms, so integrating only the jersey color into the tracking framework cannot completely distinguish players. Nevertheless, when more than 70% of the color feature is available over the sequence, the tracking performance of KSP-AF is clearly improved compared with the original KSP method (0%Color).
Figure 9. Using only jersey color information cannot greatly improve the tracking performance of KSP-AF: (1) 0%Color corresponds to the original KSP; (2) neither the MOTA nor the GMOTA score increases much as the jersey color cues escalate; (3) this experiment is implemented on the APIDIS dataset.
(2) Tracking experiments with both jersey color and jersey number (APIDIS dataset).
Generally, we have access to more jersey colors than jersey numbers. In both the APIDIS and STU datasets, we can commonly extract around 50% to 80% of the jersey colors and 40% to 60% of the jersey numbers with advanced extraction methods [51,52]. The available proportions of these two features depend on several factors, including the image resolution, lighting conditions, and extraction approach. More features in the tracking process result in obviously better tracking performance, as shown in Figure 10. More detailed comparison results are listed in Table 1.
Figure 10. The tracking performance of KSP-AF gradually improves as more jersey colors and jersey numbers are given: (1) 0%Color+0%Num corresponds to the original KSP; (2) under KSP-AF, both the MOTA and GMOTA scores gradually increase as more appearance information (e.g., jersey color and jersey number) is extracted; (3) this experiment is done on the APIDIS dataset.
In Figure 10 and Table 1, 0%Color+0%Num indicates that no appearance features are considered in KSP-AF, which corresponds exactly to the original KSP. 100%Color+100%Num means that the jersey color and jersey number are available for all players in all frames, which leads to the expected result (MOTA = 1.0, GMOTA = 1.0). Our KSP-AF tracking performance does not improve much, and even degrades to some extent, at the beginning, when the available features are very limited. Once the available proportions of jersey color and jersey number reach 60% and 40%, respectively, both the MOTA and GMOTA scores begin to outperform the original KSP. After several repeated experiments, we can report that more than 70% jersey color plus more than 50% jersey number over the whole sequence ensures that our KSP-AF outperforms the original KSP.
Table 1. Tracking performance of KSP-AF under different available proportions of jersey color and jersey number on the APIDIS dataset.
Figure 11. Comparison of MOTA and GMOTA scores between the original KSP and our proposed KSP-AF on the STU dataset: (1) 70%Color+50%Num is provided for KSP-AF tracking, while the original KSP considers neither; (2) both MOTA and GMOTA scores are clearly improved by KSP-AF compared with the original KSP; (3) the numbers in brackets are the frame lengths of each period.
Figure 12. Identity switches are successfully avoided by our KSP-AF tracking method on the APIDIS and STU datasets. (a–d) Part of the tracking results on the APIDIS dataset. Two players annotated with thicker lines (the yellow player with number 4 and the blue one with number 7) come close to each other from frame 350 to 379 and then separate at frame 380. After that, they still retain their own identities at frame 393, which is impossible with the original KSP approach. (e–h) The same effectiveness of our KSP-AF method on the STU dataset.
4.3.5. Discussion
To test the performance of our proposed method, we conducted three groups of experiments. The first two were performed on the public 1500-frame APIDIS dataset, while the third was implemented on our own STU dataset (10 periods of long sequences). Based on these experiments, we draw the following conclusions:
(1) Compared with the original KSP tracking, which ignores appearance features, the identity switch problem in multiple-player tracking is clearly alleviated by considering appearance cues (see Figures 9 and 10). It can also be seen that the combination of two features benefits KSP-AF tracking more than the color feature alone. Although the MOTA and GMOTA scores do not improve much, and even drop, at the beginning, when very few features are available, they surpass the original KSP method once more than 60% of jersey colors and 40% of jersey numbers are extracted. Ideally, both the MOTA and GMOTA scores of our KSP-AF can reach 1 if we can extract the jersey color and jersey number of every player in all frames, as shown in Figure 10 (100%Color+100%Num).
(2) It can also be seen from Table 2 that KSP-based methods adapt better to the APIDIS dataset than other related methods (e.g., MOT-CE [59] (MOTA = 0.799) and MOT-SA [58] (MOTA = 0.855)). Among them, KSP-AF (70%Color+50%Num) gains an obviously large improvement over the original KSP. At 90%Color+50%Num, it even surpasses the state-of-the-art method (T-MCNF [11]).
(3) Finally, we use longer sequences (our own STU dataset) to further verify the effectiveness of the proposed KSP-AF. The STU dataset contains 10 periods of sequences varying from 600 to 4400 frames, with more occlusions but higher image resolution than the public APIDIS dataset. The original KSP method and our KSP-AF with 70% color and 50% number were applied to these sequences. The experimental results show that our method (70% color + 50% number) maintains player identities more steadily than the original KSP method.
To sum up, our proposed KSP-AF method avoids more identity switches when given more features, and the basic requirement of extracting 70% of colors and 50% of numbers is not very demanding, considering the many advanced feature-extraction methods now emerging [50–52]. In addition, our method is flexible enough to take more features into consideration; more features would make it more robust to occlusions and missed detections.
Author Contributions: Conceptualization, Q.L. and Y.P.; formal analysis, W.W.; funding acquisition, M.X.;
investigation: Q.L.; methodology, W.W. and Y.P.; software, W.W., Y.Y., and R.Z.; project administration, Y.P.;
supervision, M.X.; validation, W.W., Y.Y., and R.Z.; writing—original draft, Q.L. and W.W.; writing—review and
editing, Q.L., W.W., and M.X. All authors have read and agreed to the published version of the manuscript.
Funding: This research was supported in part by the National Natural Science Foundation of China (NSFC
61673163), Chang-Zhu-Tan National Indigenous Innovation Demonstration Zone Project (2017XK2102), Hunan
Key Laboratory of Intelligent Robot Technology in Electronic Manufacturing (IRT2018003), and the Chinese
Scholarship Council (CSC Student ID 201706130020).
Conflicts of Interest: The authors declare no conflict of interest.
Appendix A
The derivation of Formula (5) proceeds as follows:
\[
\begin{aligned}
&\arg\max_{x\in\Omega} \sum_{t=1}^{T}\sum_{i=1}^{n} \Big[\log P\big(Y_i^t = Y_{j\in N(i)}^{t+1} \mid I^t, X_i^t = x_i^t\big) + \log P\big(X_i^t = x_i^t \mid I^t\big)\Big] \\
=\;&\arg\max_{x\in\Omega} \sum_{t=1}^{T}\sum_{i=1}^{n} \Big[(1-x_i^t)\log P\big(Y_i^t = Y_{j\in N(i)}^{t+1} \mid I^t, X_i^t = 0\big) \\
&\qquad + x_i^t \log P\big(Y_i^t = Y_{j\in N(i)}^{t+1} \mid I^t, X_i^t = 1\big) \\
&\qquad + (1-x_i^t)\log P\big(X_i^t = 0 \mid I^t\big) + x_i^t \log P\big(X_i^t = 1 \mid I^t\big)\Big] \\
=\;&\arg\max_{x\in\Omega} \sum_{t=1}^{T}\sum_{i=1}^{n} \Big[x_i^t \log P\big(Y_i^t = Y_{j\in N(i)}^{t+1} \mid I^t, X_i^t = 1\big) + x_i^t \log\frac{P(X_i^t = 1 \mid I^t)}{P(X_i^t = 0 \mid I^t)}\Big] \\
=\;&\arg\max_{x\in\Omega} \sum_{t=1}^{T}\sum_{i=1}^{n} \log\!\left(\frac{\rho_i^t \cdot s_{i,j\in N(i)}^t}{1-\rho_i^t}\right) x_i^t.
\end{aligned}
\]
References
1. Thomas, G.; Gade, R.; Moeslund, T.B.; Carr, P.; Hilton, A. Computer vision for sports: Current applications
and research topics. Comput. Vis. Image Underst. 2017, 159, 3–18. [CrossRef]
2. Chen, H.T.; Chou, C.L.; Fu, T.S.; Lee, S.Y.; Lin, B.S.P. Recognizing tactic patterns in broadcast basketball
video using player trajectory. J. Vis. Commun. Image Represent. 2012, 23, 932–947. [CrossRef]
3. Manafifard, M.; Ebadi, H.; Moghaddam, H.A. Appearance-based multiple hypothesis tracking: Application
to soccer broadcast videos analysis. Signal Process. Image Commun. 2017, 55, 157–170. [CrossRef]
4. Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Zhao, X.; Kim, T.K. Multiple object tracking: A literature
review. arXiv 2014, arXiv:1409.7618.
5. Zhang, L.; Van Der Maaten, L. Preserving structure in model-free tracking. IEEE Trans. Pattern Anal. Mach.
Intell. 2014, 36, 756–769. [CrossRef] [PubMed]
6. Li, M.; He, X.; Wei, Z.; Wang, J.; Mu, Z.; Kuijper, A. Enhanced Multiple-Object Tracking Using Delay
Processing and Binary-Channel Verification. Appl. Sci. 2019, 9, 4771. [CrossRef]
7. Liu, P.; Li, X.; Liu, H.; Fu, Z. Online Learned Siamese Network with Auto-Encoding Constraints for Robust
Multi-Object Tracking. Electronics 2019, 8, 595. [CrossRef]
8. Yang, B.; Nevatia, R. Online learned discriminative part-based appearance models for multi-human
tracking. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012;
pp. 484–498.
9. Berclaz, J.; Fleuret, F.; Turetken, E.; Fua, P. Multiple object tracking using k-shortest paths optimization. IEEE
Trans. Pattern Anal. Mach. Intell. 2011, 33, 1806–1819. [CrossRef]
10. Nishikawa, Y.; Sato, H.; Ozawa, J. Performance evaluation of multiple sports player tracking system based
on graph optimization. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data),
Boston, MA, USA, 11–14 December 2017; pp. 2903–2910.
11. Shitrit, H.B.; Berclaz, J.; Fleuret, F.; Fua, P. Multi-commodity network flow for tracking multiple people.
IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1614–1627. [CrossRef]
12. Al-Ali, A.; Almaadeed, S. A review on soccer player tracking techniques based on extracted features.
In Proceedings of the 2017 6th International Conference on Information and Communication Technology
and Accessibility (ICTA), Muscat, Oman, 19–21 December 2017; pp. 1–6.
13. Cai, Y.; de Freitas, N.; Little, J.J. Robust visual tracking for multiple targets. In Proceedings of the European
Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 107–118.
14. Fu, T.S.; Chen, H.T.; Chou, C.L.; Tsai, W.J.; Lee, S.Y. Screen-strategy analysis in broadcast basketball
video using player tracking. In Proceedings of the Visual Communications and Image Processing (VCIP),
Tainan, Taiwan, 6–9 November 2011; pp. 1–4.
15. He, M.; Luo, H.; Hui, B.; Chang, Z. Pedestrian Flow Tracking and Statistics of Monocular Camera Based on
Convolutional Neural Network and Kalman Filter. Appl. Sci. 2019, 9, 1624. [CrossRef]
16. Breitenstein, M.D.; Reichlin, F.; Leibe, B.; Koller-Meier, E.; Van Gool, L. Online multiperson
tracking-by-detection from a single, uncalibrated camera. IEEE Trans. Pattern Anal. Mach. Intell. 2011,
33, 1820–1833. [CrossRef] [PubMed]
17. Yang, Y.; Li, D. Robust player detection and tracking in broadcast soccer video based on enhanced particle
filter. J. Vis. Commun. Image Represent. 2017, 46, 81–94. [CrossRef]
18. Itoh, H.; Takiguchi, T.; Ariki, Y. 3D tracking of soccer players using time-situation graph in monocular
image sequence. In Proceedings of the 2012 21st International Conference on Pattern Recognition (ICPR),
Tsukuba, Japan, 11–15 November 2012; pp. 2532–2536.
19. Najafzadeh, N.; Fotouhi, M.; Kasaei, S. Multiple soccer players tracking. In Proceedings of the 2015
International Symposium on Artificial Intelligence and Signal Processing (AISP), Mashhad, Iran, 3–5 March
2015; pp. 310–315.
20. Baysal, S.; Duygulu, P. Sentioscope: A soccer player tracking system using model field particles. IEEE Trans.
Circuits Syst. Video Technol. 2016, 26, 1350–1362. [CrossRef]
21. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object
detection: A survey. arXiv 2018, arXiv:1809.02165.
22. Jiang, H.; Fels, S.; Little, J.J. A linear programming approach for multiple object tracking. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 18–23 June 2007;
pp. 1–8.
23. Berclaz, J.; Fleuret, F.; Fua, P. Multiple object tracking using flow linear programming. In Proceedings of
the 2009 Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance
(PETS-Winter), Snowbird, UT, USA, 7–9 December 2009; pp. 1–8.
24. Zhang, L.; Li, Y.; Nevatia, R. Global data association for multi-object tracking using network flows.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK,
USA, 23–28 June 2008; pp. 1–8.
25. Butt, A.A.; Collins, R.T. Multi-target tracking by lagrangian relaxation to min-cost network flow.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA,
23–28 June 2013; pp. 1846–1853.
26. Henriques, J.F.; Caseiro, R.; Batista, J. Globally optimal solution to multi-object tracking with merged
measurements. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV),
Barcelona, Spain, 6–13 November 2011; pp. 2470–2477.
27. Wu, Z.; Kunz, T.H.; Betke, M. Efficient track linking methods for track graphs using network-flow and
set-cover techniques. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 1185–1192.
28. Zamir, A.R.; Dehghan, A.; Shah, M. Gmcp-tracker: Global multi-object tracking using generalized minimum
clique graphs. In Proceedings of the 12th European Conference on Computer Vision, Florence, Italy,
7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 343–356.
29. Wu, Z.; Thangali, A.; Sclaroff, S.; Betke, M. Coupling detection and data association for multiple object
tracking. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition,
Providence, RI, USA, 16–21 June 2012; pp. 1948–1955.
30. Kang, T.; Mo, Y.; Pae, D.; Ahn, C.; Lim, M. Robust visual tracking framework in the presence of blurring by
arbitrating appearance-and feature-based detection. Measurement 2017, 95, 50–69. [CrossRef]
31. Li, Z.; Gao, S.; Nai, K. Robust object tracking based on adaptive templates matching via the fusion of
multiple features. J. Vis. Commun. Image Represent. 2017, 44, 1–20. [CrossRef]
32. Liu, J.; Carr, P.; Collins, R.T.; Liu, Y. Tracking sports players with context-conditioned motion models. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA,
23–28 June 2013; pp. 1830–1837.
33. Shitrit, H.B.; Berclaz, J.; Fleuret, F.; Fua, P. Tracking multiple people under global appearance
constraints. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain,
6–13 November 2011.
34. KC, A.K.; Delannay, D.; Jacques, L.; De Vleeschouwer, C. Iterative hypothesis testing for multi-object tracking
with noisy/missing appearance features. In Proceedings of the Asian Conference on Computer Vision,
Daejeon, Korea, 5–9 November 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 412–426.
35. Amit Kumar, K.; De Vleeschouwer, C. Discriminative label propagation for multi-object tracking with
sporadic appearance features. In Proceedings of the IEEE International Conference on Computer Vision,
Sydney, Australia, 1–8 December 2013; pp. 2000–2007.
36. Lu, W.L.; Ting, J.A.; Little, J.J.; Murphy, K.P. Learning to track and identify players from broadcast sports
videos. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1704–1716.
37. Sabirin, H.; Sankoh, H.; Naito, S. Automatic Soccer Player Tracking in Single Camera with Robust Occlusion
Handling Using Attribute Matching. IEICE Trans. Inf. Syst. 2015, 98, 1580–1588. [CrossRef]
38. Bozorgtabar, B.; Goecke, R. Efficient multi-target tracking via discovering dense subgraphs. Comput. Vis.
Image Underst. 2016, 144, 205–216. [CrossRef]
39. Milan, A.; Gade, R.; Dick, A.; Moeslund, T.B.; Reid, I. Improving global multi-target tracking with
local updates. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland,
6–12 September 2014; pp. 174–190.
40. Wu, C.; Sun, H.; Wang, H.; Fu, K.; Xu, G.; Zhang, W.; Sun, X. Online Multi-Object Tracking via Combining
Discriminative Correlation Filters With Making Decision. IEEE Access 2018, 6, 43499–43512. [CrossRef]
41. Schulter, S.; Vernaza, P.; Choi, W.; Chandraker, M. Deep network flow for multi-object tracking.
In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Honolulu, HI, USA, 21–26 July 2017; pp. 2730–2739.
42. Yang, Y.; Xu, M.; Wu, W.; Zhang, R.; Peng, Y. 3D Multiview Basketball Players Detection and Localization
Based on Probabilistic Occupancy. In Proceedings of the 2018 Digital Image Computing: Techniques and
Applications (DICTA), Canberra, Australia, 10–13 December 2018; pp. 2730–2739.
43. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the 2017 IEEE International
Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
44. Fleuret, F.; Berclaz, J.; Lengagne, R.; Fua, P. Multicamera people tracking with a probabilistic occupancy
map. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 267–282. [CrossRef] [PubMed]
45. Usamentiaga, R.; Garcia, D. Multi-camera calibration for accurate geometric measurements in industrial
environments. Measurement 2019, 134, 345–358. [CrossRef]
46. Eppstein, D. Finding the k shortest paths. SIAM J. Comput. 1998, 28, 652–673. [CrossRef]
47. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal
networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015;
pp. 91–99.
48. Boyer, R.S.; Moore, J.S. MJRTY—A fast majority vote algorithm. In Automated Reasoning; Springer:
Berlin/Heidelberg, Germany, 1991; pp. 105–117.
49. Gunn, S.R. Support vector machines for classification and regression. ISIS Tech. Rep. 1998, 14, 5–16.
50. Gerke, S.; Muller, K.; Schafer, R. Soccer jersey number recognition using convolutional neural networks.
In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile,
7–13 December 2015; pp. 17–24.
51. Li, G.; Xu, S.; Liu, X.; Li, L.; Wang, C. Jersey Number Recognition with Semi-Supervised Spatial Transformer
Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops,
Salt Lake City, UT, USA, 18–22 June 2018; pp. 1783–1790.
52. Gerke, S.; Linnemann, A.; Müller, K. Soccer player recognition using spatial constellation features and jersey
number recognition. Comput. Vis. Image Underst. 2017, 159, 105–115. [CrossRef]
53. Prior, J.-F.; Delmulle, P. APIDIS Dataset. Available online: https://sites.uclouvain.be/ispgroup/Softwares/APIDIS (accessed on 29 July 2016).
54. Osvald, L. K-th Shortest Path C++ Library. Available online: https://github.com/losvald/ksp (accessed on
21 March 2013).
55. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The CLEAR MOT metrics.
J. Image Video Process. 2008, 2008, 1. [CrossRef]
56. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking.
arXiv 2016, arXiv:1603.00831.
57. Kasturi, R.; Goldgof, D.; Soundararajan, P.; Manohar, V.; Garofolo, J.; Bowers, R.; Boonstra, M.; Korzhova, V.;
Zhang, J. Framework for performance evaluation of face, text, and vehicle detection and tracking in video:
Data, metrics, and protocol. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 319–336. [CrossRef]
58. Sekii, T. Robust, real-time 3d tracking of multiple objects with similar appearances. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016;
pp. 4275–4283.
59. Ghedia, N.S.; Vithalani, C.; Kothari, A. A Novel Approach for Monocular 3D Object Tracking in Cluttered
Environment. Int. J. Comput. Intell. Res. 2017, 13, 851–864.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).