
Received: 17 September 2022 | Revised: 22 February 2023 | Accepted: 4 March 2023

DOI: 10.1049/cvi2.12191

ORIGINAL RESEARCH | IET Computer Vision

Online multiple object tracking with enhanced Re‐identification

Wenyu Yang | Yong Jiang | Shuai Wen | Yong Fan

Department of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, China

Correspondence: Yong Fan. Email: [email protected]

Funding information: Sichuan Science and Technology Program, Grant/Award Number: 2021YFG0031

Abstract
In existing online multiple object tracking algorithms, schemes that combine the object detection and re-identification (ReID) tasks in a single model for simultaneous learning have drawn great attention due to their balanced speed and accuracy. However, the two tasks need to focus on different features, and learning them on the features extracted by the same model can lead to competition between the tasks, making it difficult to achieve optimal performance. To reduce this competition, a task-related attention network, which uses a self-attention mechanism to allow each branch to learn on feature maps related to its own task, is proposed. Besides, a smooth gradient-boosting loss function, which improves the quality of the extracted ReID features by gradually shifting the focus to the hard negative samples of each object during training, is introduced. Extensive experiments on the MOT16, MOT17, and MOT20 datasets demonstrate the effectiveness of the proposed method, which is also competitive with current mainstream algorithms.

KEYWORDS
computer vision, object tracking

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
© 2023 The Authors. IET Computer Vision published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.

1 | INTRODUCTION

The goal of Multi-object Tracking (MOT) is to track multiple objects of interest in a video and to get the correct position of each object in every frame in which it appears; it is one of the important tasks in computer vision [1]. It has been widely used in several fields, such as autonomous driving [2], security systems, and human activity recognition [3].
Multiple object tracking methods can be divided into online and offline methods [4]. Offline multi-object tracking can use the global information of a video sequence, while online methods can only use present and past information to make predictions for the current frame. Online methods have nevertheless received more attention because they match actual application scenarios.
Online MOT methods usually contain two subtasks, object detection and re-identification (ReID). Most methods [5–8] first obtain the bounding boxes of all objects in the current frame with a detector, then extract the ReID features of each bounding box and use them to match the candidate boxes with the existing trajectories. In this way, ReID features need to be extracted for each box, which brings a large computational overhead. Therefore, the Joint Detection and Embedding (JDE) method was proposed. JDE integrates the object detection and ReID feature extraction modules into a single network, which can predict the location of the object and extract ReID features simultaneously.
However, directly fusing the object detection and ReID tasks into one model can lead to competition between the two tasks and reduce the tracking accuracy [9]. The detection task expects objects in the same class to have the same semantics, so the network tries to reduce the intra-class variation. But the ReID task is more concerned with the differences between objects of the same class and expects the network to amplify such differences. The inconsistency of the optimisation goals of the two tasks hinders the network from optimising both of them.
To alleviate this competition, we design a Task-Related Attention Network (TRAN). Given the feature maps extracted by the backbone network, TRAN uses a self-attention mechanism [10] to reorganise the feature maps into two task-related feature maps fed to the corresponding branches, so that each branch can learn on its task-related features.




In addition, we also find that existing methods [11, 12] for training the ReID branch often use the cross-entropy (CE) loss, which allows all negative classes to participate equally in the update of the network. However, there are a large number of IDs in MOT, that is, a large number of classes. At the early stage of training, the network quickly learns to distinguish a large number of simple negative samples. At the later stage of training, simple samples no longer help the network learn discriminative features; it is the hard negative samples that the network finds difficult to distinguish.
So, we argue that focussing on the hard negative samples in the later stages of training helps the network learn to extract more discriminative ReID features. Therefore, we propose a smooth gradient-boosting cross-entropy loss that allows the network to gradually shift its focus to the hard negative samples of each object during training.
The contributions of this study are summarised as follows:

1) We propose a TRAN to reduce the competition between the object detection and re-identification tasks in one model.
2) We improve the quality of the final extracted ReID features by using a smooth gradient-boosting cross-entropy loss.
3) Extensive experiments on MOT16, MOT17, and MOT20 prove that our method enhances the association ability of ReID features and is competitive among existing mainstream algorithms.

2 | RELATED WORK

2.1 | Tracking-by-detection

With the development of object detection [13–15], a large number of MOT methods [5, 6, 16, 17] have adopted the tracking-by-detection paradigm. The bounding boxes of the objects are obtained by a high-performance detector, so these methods focus mainly on tracking, that is, associating the same objects across different frames. Early work used the motion information of an object to predict its position in the next frame with methods such as the Kalman Filter [18]. Sort [16] was the first to use the Kalman Filter to predict the position of all candidate boxes in the next frame, compute the IOU with the next candidate frame, and finally match them using the Hungarian algorithm [19]. The LSTM [17] is also used to predict the position in the current frame from the previous frame information of the object. MAT [20] and TMOH [21] focus on interpolation taking motion information into account.
However, due to the uncertainty of object movement, the object location is difficult to predict and vulnerable to crowded scenes, which easily leads to a large number of ID switches and poor tracking performance. Therefore, some works introduce ReID features to get an accurate appearance representation of the object. POI [6] uses a modified GoogleNet pre-trained on a large-scale ReID dataset to extract the appearance features of the detected object. DeepSort [5], as a modified version of Sort, introduces a ResNet network to extract appearance features.
Although the methods that introduce ReID features improve the performance of tracking, they all take additional networks to extract the appearance features. These ReID networks act on each box detected by the detector, which significantly increases the inference time of the network and brings additional overhead.

2.2 | Joint-detection-tracking

Joint-detection-tracking fuses some modules on the basis of tracking-by-detection, which reduces the complexity of the algorithm and improves the efficiency. It is common practice to add a tracking-related branch to the detector to predict object offsets or ReID features for data association. Tracktor [22] takes a regression head to directly predict the position of the object in the next frame. Centertrack [23] uses the information from the previous frame as additional input to predict the displacement of the object centre point of the previous frame in the current frame. Although these methods fuse detection and tracking information in the same network, it is still difficult for them to cope with complex scenarios.
Therefore, the JDE method was proposed to add an additional branch to the detector for the extraction of ReID features. Joint Detection and Embedding [11] embeds the appearance model into the object detection network YOLOv3 and shares the network weights, allowing the model to output both the detection results and their corresponding ReID features, improving the accuracy and speed of MOT. FairMOT [12] uses an anchor-free method to eliminate the ambiguity of ReID features caused by anchors. Although these methods achieve a balance of accuracy and speed, the competition between ReID and object detection still exists and hurts the performance of the network.
Besides, many Transformer-based methods also follow the joint-detection-tracking paradigm. TransTrack [24] takes the current frame and the previous frame as the inputs; two decoders transform the learnt object queries and the queries from the last frame into detection boxes and tracking boxes respectively. Trackformer [25] only takes the current frame as the encoder input, and the learnt object queries and the track queries from the last frame interact with each other in the decoder, which directly tracks the objects. Although the Transformer-based architecture is better suited for modelling long-range relationships, a larger model is needed to get good results, which reduces the speed of tracking.

2.3 | Attention mechanisms in multiple object tracking

The attention mechanism, which is inspired by human perception, aims to focus on important features and suppress irrelevant features. In video object segmentation, COSNet [26] uses a global co-attention mechanism to make the model focus on learning discriminative foreground representations.

Lu [27] uses neural attention to force the episodic memory module to choose the important parts as the inputs. Lu [28] uses a differentiable attention that thoroughly examines fine-grained semantic similarities between all the possible location pairs in two data instances. In video moment localisation, Teng [29] uses an attention mechanism to guide the model to capture the video segments that best match the text description. Besides, the self-attention mechanism [10] has also shown its power in many vision tasks [10, 30, 31].
In MOT, the attention mechanism has also been extensively used. The review [32] points out that a lot of visual attention is needed in MOT. DMAN [33] uses dual matching attention to better match targets to tracklets during the data association phase. MASS [34] focuses attention on measuring similarity using appearance, structure, motion and size. PCAN [35] employs cross-attention to retrieve rich spatio-temporal information from the past frames for online MOT and segmentation.

2.4 | Loss function of re-identification

In MOT, POI [6] uses the Triplet loss for the training of the ReID network, which shortens the distance between features of the same object while widening the difference between different objects. Joint Detection and Embedding [11] and FairMOT [12] have experimentally demonstrated that better results can be obtained by using the CE loss when adding a ReID branch to the detector. However, these losses treat all negative classes equally and do not exploit the fact that, in a MOT dataset, only a small number of classes are similar to a given object. It is therefore often difficult to extract discriminative features using the above losses.

3 | METHOD

This section explains how our method is organised. First of all, Subsection 3.1 describes the overall framework of our method. Then, Subsections 3.2 and 3.3 present the technical details of the modules which compose ERTrack. Finally, Subsections 3.4 and 3.5 introduce the details of training and online inference respectively.

3.1 | Overview

As shown in Figure 1, the input frame is first fed into the backbone (DLA-34 [36]) to extract its corresponding feature maps. Then, in the TRAN, in order to alleviate the competition between the detection task and the ReID task, the learnt features are reconstructed into the information required by the detection branch and the ReID branch. Afterwards, the detection branch's information is used to localise the objects, and the ReID branch's information is extracted as ReID features at the centres of the predicted objects. To extract more discriminative ReID features, a smooth gradient-boosting CE loss that focuses more on difficult classes is used.

FIGURE 1 The overall architecture of ERTrack (Conv: convolution).

3.2 | Task-related attention network

In this section, we introduce the details of TRAN, which reconstructs the features extracted by the backbone into two task-related feature maps. The structure of our network is illustrated in Figure 2, where the features extracted from the backbone are denoted as F ∈ R^(C×H×W).



First, we aggregate the spatial information of F using average-pooling and max-pooling operations, generating two different spatial context feature maps F_avg and F_max. The two feature maps are fused using element-wise summation to obtain the aggregated spatial information F' ∈ R^(C×H'×W'); using both features improves the representation power of the network [37]. Second, we use the self-attention mechanism [10] for each branch. Specifically, we pass F' through different convolution layers to generate tensors T_1 ∈ R^(C×H'×W') and T_2 ∈ R^(C×H'×W'), reshape them to C × N', where N' = H' × W', then perform a matrix multiplication between T_1 and the transpose of T_1, divide the result by \sqrt{d_c}, and apply a softmax function to obtain the channel attention weight map W_1 ∈ R^(C×C); T_2 undergoes the same operation to obtain W_2 ∈ R^(C×C). We compute the matrices as:

w_k^{ij} = \frac{\exp\left(m_k^i \cdot m_k^j / \sqrt{d_c}\right)}{\sum_{i=0}^{C} \exp\left(m_k^i \cdot m_k^j / \sqrt{d_c}\right)} \quad (1)

where m_k^i and m_k^j indicate the ith and jth rows of T_1/T_2, w_k^{ij} measures the ith channel's effect on the jth channel in W_k, and \sqrt{d_c} is the scaling factor with d_c = H × W. Then, we perform matrix multiplication on the learnt attention maps W_{1/2} and the reshaped feature map F to obtain new representations which focus more on task-related information. Each new representation is reshaped back to the same size as F ∈ R^(C×H×W) and added to the original F to prevent information loss. The final feature maps F_T1 and F_T2 are used in the different task branches.

FIGURE 2 Diagram of the Task-Related Attention Network (TRAN).
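To make the channel self-attention of Equation (1) concrete, a minimal PyTorch-style sketch is given below. The 1 × 1 convolutions, the pooling layout and the softmax normalisation axis are assumptions introduced for illustration; only the overall flow (pooling, fusion, channel self-attention, residual addition) follows the description above, and this is not the authors' released implementation.

```python
# Minimal sketch of the task-related attention (Equation (1)); layer choices are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskRelatedAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # one convolution per task branch, producing T1 (detection) and T2 (ReID)
        self.conv_det = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_reid = nn.Conv2d(channels, channels, kernel_size=1)

    def _branch(self, conv: nn.Module, f_agg: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        t = conv(f_agg).flatten(2)                                            # C x N' rows m_k^i
        attn = torch.softmax(t @ t.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # C x C weights
        out = (attn @ f.flatten(2)).view(b, c, h, w)                          # re-weight the channels of F
        return out + f                                                        # residual addition against information loss

    def forward(self, f: torch.Tensor):
        # aggregate spatial context with average- and max-pooling, fused by summation
        f_agg = F.avg_pool2d(f, 2) + F.max_pool2d(f, 2)
        return self._branch(self.conv_det, f_agg, f), self._branch(self.conv_reid, f_agg, f)
```

In this reading, the first output would feed the detection heads and the second the ReID head.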

3.3 | Smooth gradient-boosting cross entropy loss

After the TRAN feeds the task-related feature maps to the different branches, we use a smooth gradient-boosting loss function, which gradually focuses on confusing classes, to train the ReID branch. We elaborate the proposed loss function below.
In MOT tasks, the training of the ReID branch is often considered as a classification task [11, 12], where objects with the same ID in the dataset are considered as the same class, and their ReID features, extracted from the detected objects, are passed through a linear layer for classification. The loss function often used for training the ReID branch is therefore the CE loss, which is shown below:

L_{CE}(c, l) = -\log\frac{e^{c_l}}{\sum_{i \in J} e^{c_i}} \quad (2)

where l denotes the ground-truth label of the object ID and J denotes all classes. Here the CE loss treats all negative samples equally, and all negative samples contribute to the update of the network. However, in the ReID task, different IDs belong to the same class. When facing similar persons (e.g. wearing the same clothing), the network is more likely to misidentify them, so it is more important to focus on the subtle differences between these similar objects. A gradient-boosting cross entropy (GCE) loss function [38] that specifically focuses on confusing classes to avoid misclassifications has been proposed for fine-grained classification:

L_{GCE}(c, l) = -\log\frac{e^{c_l}}{e^{c_l} + \sum_{i \in J_k} e^{c_i}} \quad (3)

where J_k denotes the negative classes whose confidence ranks in the top k. In fine-grained classification, related subsets of classes tend to have a more similar appearance, and the GCE loss achieves better results by focussing only on those classes with high similarity. Although ReID should also focus more on such similar objects, there are a large number of IDs in a MOT dataset, that is, a large number of classes (MOT17 [39] has 1565 IDs). If the training process only focuses on the top-k negative samples from the start, these may not be true hard negative samples, and the network may spend most of the time differentiating simple samples, which eventually makes it difficult to obtain a discriminative ReID feature of the object.
To enable the network to find the truly hard negative samples during training, motivated by curriculum learning [40], we propose a simple smooth gradient-boosting cross-entropy (SGCE) loss, a loss function that smoothly shifts the training of the network from all negative classes to the top-k hard negative samples. The proposed SGCE loss is given by:

L_{SGCE} = \alpha L_{GCE} + (1 - \alpha) L_{CE}, \quad \alpha = e^{-L_{CE}} \quad (4)

where we take e^{-L_{CE}} as the value of the weight α. At the beginning of training, the value of L_{CE} is large, so α is small and the CE loss dominates the training; as the number of iterations increases, the value of L_{CE} gradually decreases, which leads to an increase in α. The GCE loss then dominates, making the network focus on distinguishing the top-k negative samples, which by then are real hard negative samples similar to the target. So the loss function adopted by the ReID branch can be formulated as follows:

L_{id} = \sum_{l=1}^{K} L_{SGCE}(c, l) \quad (5)

where K denotes the total number of categories.
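Read together, Equations (2)–(5) translate almost line by line into code. The sketch below is our own rendering under that reading (per-sample α, logits of shape batch × number of IDs); it is not the released training code, and the default k = 15 simply mirrors the value chosen later in Section 4.5.

```python
# Hedged sketch of the smooth gradient-boosting cross-entropy loss, Equations (2)-(5).
import torch
import torch.nn.functional as F


def sgce_loss(logits: torch.Tensor, labels: torch.Tensor, k: int = 15) -> torch.Tensor:
    """logits: (batch, num_ids) classification scores; labels: ground-truth ID per sample."""
    # Equation (2): plain cross entropy over all ID classes.
    ce = F.cross_entropy(logits, labels, reduction="none")

    # Equation (3): keep the target logit plus only the top-k negative logits.
    target = logits.gather(1, labels.unsqueeze(1))                    # (batch, 1)
    negatives = logits.scatter(1, labels.unsqueeze(1), float("-inf"))
    topk_neg = negatives.topk(k, dim=1).values                        # (batch, k) hardest negatives
    gce = torch.logsumexp(torch.cat([target, topk_neg], dim=1), dim=1) - target.squeeze(1)

    # Equation (4): alpha = exp(-L_CE) grows as CE shrinks, so training drifts
    # from plain CE towards the hard-negative-focused GCE term.
    alpha = torch.exp(-ce)
    # Equation (5) sums the per-object terms; here we simply average over the batch.
    return (alpha * gce + (1.0 - alpha) * ce).mean()
```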

3.4 | Training details

ERTrack contains two subtasks, each with a different optimisation goal. The training of the ReID branch has been introduced above; the optimisation goal of the object detection branch is introduced as follows.
Detection branch. Object detection takes a similar setup to CenterNet [41], locating the centre of the object and regressing its width and height. Therefore, three parallel head networks are set up on the backbone (DLA-34 [36]) to predict the object centre heatmap, the centre offset, and the width and height of the bounding box respectively. Denote the ith bounding box annotation in a frame as b^i = (x_1^i, y_1^i, x_2^i, y_2^i), where (x_1^i, y_1^i) indicates the coordinates of the upper-left corner of the bounding box and (x_2^i, y_2^i) indicates the coordinates of the bottom-right corner. The centre point of b^i can be described as (c_x^i, c_y^i), with c_x^i = (x_1^i + x_2^i)/2 and c_y^i = (y_1^i + y_2^i)/2. Since the final feature map obtained by the network is one-fourth of the original map, (\tilde{c}_x^i, \tilde{c}_y^i) = (\lfloor c_x^i/4 \rfloor, \lfloor c_y^i/4 \rfloor), and the heatmap ground truth can be described as

R_{xy} = \sum_{i=1}^{N} \exp\left(-\frac{(x-\tilde{c}_x^i)^2 + (y-\tilde{c}_y^i)^2}{2\sigma_c^2}\right)

where R_{xy} denotes the response at each coordinate (x, y) on the feature map, N denotes the number of all targets in the picture, and σ_c is the standard deviation. Therefore, the loss function of the heatmap is defined as a pixel-level logistic regression:

L_{heat} = -\frac{1}{N}\sum_{y=1}^{H}\sum_{x=1}^{W}
\begin{cases}
\left(1-\hat{R}_{xy}\right)^{\alpha}\log\hat{R}_{xy}, & R_{xy}=1,\\
\left(1-R_{xy}\right)^{\beta}\hat{R}_{xy}^{\alpha}\log\left(1-\hat{R}_{xy}\right), & \text{otherwise}
\end{cases} \quad (6)

where \hat{R}_{xy} denotes the predicted heatmap, and α and β are hyperparameters [42]. In the heatmap ground truth, we reduce the centre point and round down, which loses some accuracy. In order to accurately determine the area of the bounding box, we take another branch of the network to predict not only the width and height of the box but also the offset of the centre point, with offset \hat{o}^i = (c_x^i/4, c_y^i/4) - (\lfloor c_x^i/4 \rfloor, \lfloor c_y^i/4 \rfloor) and box size \hat{s}^i = (x_2^i - x_1^i, y_2^i - y_1^i). The l1 loss is used to optimise these two branches:

L_{box} = \sum_{i=1}^{N}\left(\left\| o^i - \hat{o}^i \right\|_1 + \gamma\left\| s^i - \hat{s}^i \right\|_1\right) \quad (7)

where γ is the weight parameter and is set to 0.1 in the same way as CenterNet.
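To make the detection targets and losses concrete, the sketch below builds a Gaussian heatmap ground truth and evaluates Equations (6) and (7). Taking the per-pixel maximum (instead of the sum written above), α = 2, β = 4 and the helper names are common CenterNet-style assumptions on our part, not details quoted from the paper.

```python
# Sketch of the CenterNet-style detection targets and losses, Equations (6) and (7).
# The per-pixel maximum, alpha=2, beta=4 and the helper names are assumptions.
import torch


def gaussian_heatmap(centres: torch.Tensor, h: int, w: int, sigma: float) -> torch.Tensor:
    """centres: (N, 2) centre coordinates already mapped onto the stride-4 feature map."""
    ys = torch.arange(h, dtype=torch.float32).view(h, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, w)
    heat = torch.zeros(h, w)
    for cx, cy in centres.tolist():
        g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heat = torch.maximum(heat, g)      # the paper writes a sum; max is the usual convention
    return heat


def heatmap_focal_loss(pred, gt, alpha: float = 2.0, beta: float = 4.0) -> torch.Tensor:
    # Equation (6): pixel-level logistic regression with focal-style weighting.
    pos = gt.eq(1).float()
    eps = 1e-6
    pos_term = ((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_term = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * (1 - pos)
    return -(pos_term + neg_term).sum() / pos.sum().clamp(min=1)


def box_loss(pred_off, gt_off, pred_size, gt_size, gamma: float = 0.1) -> torch.Tensor:
    # Equation (7): L1 on the centre offset plus a gamma-weighted L1 on the box size.
    return (pred_off - gt_off).abs().sum() + gamma * (pred_size - gt_size).abs().sum()
```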
Multi-task training loss function. In order to combine the losses of the two tasks, an uncertainty loss [43] is adopted, using two learnable parameters ω_1 and ω_2 that can dynamically balance the object detection and ReID tasks:

L_{det} = L_{heat} + L_{box} \quad (8)

L_{total} = \frac{1}{2}\left(\frac{1}{e^{\omega_1}} L_{det} + \frac{1}{e^{\omega_2}} L_{id} + \omega_1 + \omega_2\right) \quad (9)
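The uncertainty-weighted combination in Equations (8) and (9) can be written down in a few lines. Treating ω_1 and ω_2 as learnable scalars optimised jointly with the network follows [43]; the zero initialisation below is an assumption, not a value from the paper.

```python
# Sketch of the uncertainty-weighted multi-task loss, Equations (8) and (9).
import torch
import torch.nn as nn


class MultiTaskLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # learnable balancing parameters w1, w2; zero initialisation is an assumption
        self.w1 = nn.Parameter(torch.zeros(()))
        self.w2 = nn.Parameter(torch.zeros(()))

    def forward(self, l_heat, l_box, l_id):
        l_det = l_heat + l_box                                       # Equation (8)
        return 0.5 * (torch.exp(-self.w1) * l_det
                      + torch.exp(-self.w2) * l_id
                      + self.w1 + self.w2)                           # Equation (9)
```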
3.5 | Online association

We use an online association method consistent with FairMOT [12]. It starts by initialising all detected objects as new trajectories in the first frame. In the subsequent frames, all the detected boxes and ReID features in the current frame are first obtained through the network, and the cost matrix is calculated based on the cosine distance between the current detected objects and the existing trajectories. At the same time, the Kalman Filter [18] is used to predict the position of each existing trajectory in the current frame, the Mahalanobis distance is calculated between the predicted value and the current detection, and the more distant matches are excluded by the Mahalanobis distance before being incorporated into the cost matrix; finally, the Hungarian algorithm is used to complete the first matching. Then, for the remaining unmatched objects and trajectories, the IoU cost matrix is calculated for their bounding boxes and the Hungarian algorithm is used for the second matching. For the successfully matched trajectories, we update the trajectory-related information. For the unmatched targets, we assign new IDs and generate new trajectories, and for the unmatched trajectories, we record their unmatched times; if the unmatched count exceeds a certain threshold, we stop updating the trajectory and consider it ended.
tracking more challenging. In addition to training on the MOT
branches: dataset, we keep the same setting as FairMOT [12] and take
N �
X � � � additional datasets for training, including ETH [45], CityPerson
� i b i� � i�
Lbox ¼ �O − O � þ γ�si − bs � ð7Þ [46], CalTech [47], CUHK‐SYSU [48], PRW [49], and
1 1
i¼1 CrowdHuman [50]. ETH and CityPerson only provide

The other four datasets provide annotations for both detection and ID, allowing us to train both the detection and ReID tasks. We adopt the widely accepted CLEAR metrics [51]: Multiple Object Tracking Accuracy (MOTA) to evaluate overall performance and IDs (Identity Switch) to count the number of ID switches. IDF1 (ID F1 score) [52], which evaluates the ID recognition ability of the tracker, the Most Tracked ratio (MT), the ratio of mostly tracked (more than 80%) objects, the Most Lost ratio (ML), the ratio of mostly lost (less than 20%) objects, and the inference speed of the model in Frames Per Second (FPS) are also used. Among them, IDF1 and MOTA are the main metrics to evaluate the model performance.
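For reference, the standard definitions of the two headline metrics, as given in [51, 52] rather than restated in this paper, are

\mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{IDs}}{\mathrm{GT}}, \qquad \mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}

where GT is the number of ground-truth boxes and IDTP, IDFP and IDFN are the identity-level true positives, false positives and false negatives.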
4.2 | Implementation details

In our experiments, we use the DLA-34 [36] model pre-trained on the COCO dataset [53]. The Adam optimiser [54] is used, the initial learning rate is set to 10^-4, the batch size is set to 12, and the input image size is uniformly 1088 × 608. The experiments are run on 2 NVIDIA TITAN V GPUs and Xeon E5-2650 2.20 GHz CPUs.

4.3 | Ablation studies

In this section, we analyse the effectiveness of the TRAN and the SGCE loss on the validation set of MOT17. ERTrack is built on FairMOT, so we use FairMOT as the baseline. The results are listed in Table 1. It is clear from the table that the proposed TRAN module outperforms the baseline by 1.1% in the IDF1 score, demonstrating the need to mitigate the competition between the object detection and ReID tasks.
At the same time, it can be observed that training the ReID branch of the network with the SGCE loss outperforms the baseline trained with the CE loss by 1.2% on IDF1 and 0.2% on MOTA. We also apply the SGCE loss to the JDE tracker; as shown in Table 2, the SGCE loss outperforms the CE loss by 1.2% on IDF1 and 0.2% on MOTA. These results prove that the SGCE loss can extract better ReID features.
Combining the TRAN and the SGCE loss, the final ERTrack achieves a much better tracking performance, as shown in Table 1, where ERTrack outperforms the baseline by 2.5% and 0.2% on IDF1 and MOTA. In terms of FN, ERTrack gets the best result, which means the model tends to detect more objects, although this also brings an increase in FP. The results demonstrate that ERTrack greatly enhances the re-identification capability.

TABLE 1 The effectiveness of the Task-Related Attention Network (TRAN) and the smooth gradient-boosting cross-entropy (SGCE) loss on the validation set of MOT17. The best results are shown in bold.

Algorithm | IDF1 ↑ | MOTA ↑ | FP ↓ | FN ↓ | IDs ↓
Baseline | 70.7 | 67.9 | 2725 | 14,228 | 376
Baseline + TRAN | 71.8 | 67.9 | 3215 | 13,700 | 438
Baseline + SGCE loss | 71.9 | 68.1 | 3030 | 13,804 | 417
ERTrack | 73.2 | 68.1 | 3297 | 13,509 | 411

Abbreviations: IDs, Identity Switch; IDF1, ID F1 score; MOTA, Multiple Object Tracking Accuracy.

TABLE 2 The effectiveness of the SGCE loss on the Joint Detection and Embedding (JDE) tracker. Best results are marked bold.

Algorithm | IDF1 ↑ | MOTA ↑ | FP ↓ | FN ↓ | IDs ↓
JDE + CE loss | 63.6 | 66.0 | 7261 | 29,844 | 1284
JDE + SGCE loss | 64.8 | 66.2 | 6650 | 30,053 | 1269

Abbreviations: IDs, Identity Switch; IDF1, ID F1 score; MOTA, Multiple Object Tracking Accuracy.

4.4 | Comparison with preceding trackers

We compare ERTrack with other current competitive methods on MOT16 and MOT17, and the results are shown in Table 3 and Table 4. In Table 3 and Table 4, all of the listed methods are online multi-object tracking methods except the offline TubeTK [57] and ReMOT [59]. In terms of the IDF1 metric, ERTrack outperforms FairMOT by 2.9% and 2.4% on MOT16 and MOT17 respectively. Although the Transformer-based methods [24, 25] behave better in MOTA, their IDF1 scores are about 10% lower than ours, which indicates that Transformer-based methods have more power in object detection and detect more objects than ours, while their association performance is not as good. This result proves the competitiveness of ERTrack in tracking performance and demonstrates that the TRAN module and the SGCE loss enable the network to extract more discriminative ReID features, enhancing the performance of re-identification. It can also be seen from the ML metric and IDs that ERTrack reduces the number of tracked tracks that account for less than 20% of the real tracks and the number of ID switches, proving the stability of ERTrack in ID recognition.

TABLE 3 Comparison of different methods on MOT16. Best results are marked bold. * represents the offline method.

Method | IDF1 ↑ | MOTA ↑ | MT ↑ | ML ↓ | IDs ↓ | FPS ↑
JDE [11] | 55.8 | 64.4 | 35.4 | 20.0 | 1544 | 18.5
CTracker [56] | 57.2 | 67.6 | 32.9 | 23.1 | 1117 | 6.8
TubeTK [57]* | 59.4 | 64.0 | 33.5 | 19.4 | 4137 | 1.0
TraDeS [9] | 64.7 | 70.1 | 37.3 | 20.0 | 1144 | 17.5
QDTrack [55] | 67.1 | 69.8 | 41.6 | 19.8 | 1097 | 20.3
FairMOT [12] | 72.8 | 74.9 | 44.7 | 15.9 | 1074 | 21.3
ReMOT [59]* | 73.2 | 76.9 | 52.8 | 12.4 | 742 | 1.8
ERTrack (Ours) | 75.7 | 75.2 | 42.8 | 15.3 | 924 | 20.6

Abbreviations: IDs, Identity Switch; IDF1, ID F1 score; MOTA, Multiple Object Tracking Accuracy; MT, Most Tracked ratio; ML, Most Lost ratio; FPS, Frames Per Second.

Meanwhile, in terms of the model inference speed FPS, ERTrack is only 0.8 lower than FairMOT, and the Transformer-based methods are much slower, which proves that the proposed modules have little overhead and maintain the efficiency of the JDE paradigm. We also compare with the offline method ReMOT [59], which fixes some tracking errors after the videos have been processed; its IDF1 score is still 2.7% lower than ours on MOT17, and ReMOT is higher on some metrics only because of this refinement of the results. Except for the offline method ReMOT, our method is competitive with the other online methods.

TABLE 4 Comparison of different methods on MOT17. Best results are marked bold. * represents the offline method.

Method | IDF1 ↑ | MOTA ↑ | MT ↑ | ML ↓ | IDs ↓ | FPS ↑
CTracker [56] | 57.4 | 66.6 | 32.2 | 24.2 | 5529 | 6.8
TubeTK [57]* | 58.6 | 63.0 | 31.2 | 19.9 | 5529 | 3.0
CenterTrack [23] | 64.7 | 67.8 | 34.6 | 24.6 | 2583 | 17.5
TraDeS [9] | 63.9 | 69.1 | 36.4 | 21.5 | 3555 | 17.5
TransTrack [24] | 63.9 | 74.5 | 46.8 | 11.3 | 3663 | 10
QDTrack [55] | 66.3 | 68.7 | 40.6 | 21.9 | 3378 | 20.3
TrackFormer [25] | 68.0 | 74.1 | 47.3 | 10.4 | 3378 | 7.4
MOTR [58] | 68.6 | 73.4 | 50.3 | 13.1 | 2439 | 7.5
ReMOT [59]* | 72.0 | 77.0 | 51.7 | 13.8 | 2853 | 1.8
FairMOT [12] | 72.3 | 73.7 | 43.2 | 17.3 | 3303 | 21.3
ERTrack (Ours) | 74.7 | 73.8 | 41.7 | 16.8 | 2868 | 20.6

Abbreviations: IDs, Identity Switch; IDF1, ID F1 score; MOTA, Multiple Object Tracking Accuracy; MT, Most Tracked ratio; ML, Most Lost ratio; FPS, Frames Per Second.

We also verify ERTrack on the MOT20 dataset to further evaluate the proposed framework. As shown in Table 5, ERTrack behaves better on most metrics. Our method achieves 68.9 IDF1 and 63.0 MOTA, which surpass FairMOT by 1.6% on IDF1 and 1.2% on MOTA. Despite the slight increase in IDs, MT and ML outperform most methods, which proves that most of the tracks are tracked and are intact. For FPS, our method is also only 1.1 below FairMOT.
In summary, ERTrack outperforms FairMOT on IDF1 and MOTA while being very close in FPS, achieving a balance of accuracy and efficiency.

TABLE 5 Comparison of different methods on MOT20. Best results are marked bold.

Method | IDF1 ↑ | MOTA ↑ | MT ↑ | ML ↓ | IDs ↓ | FPS ↑
MLT [60] | 54.6 | 48.9 | 30.9 | 22.1 | 2187 | 3.7
TransTrack [24] | 59.2 | 64.5 | 49.1 | 13.6 | 3565 | 10
TrackFormer [25] | 65.7 | 68.6 | 44.4 | 12.1 | 1532 | 7.4
FairMOT [12] | 67.3 | 61.8 | 68.8 | 7.6 | 5243 | 13.2
ERTrack (Ours) | 68.9 | 63.0 | 68.9 | 6.9 | 5608 | 12.1

Abbreviations: IDs, Identity Switch; IDF1, ID F1 score; MOTA, Multiple Object Tracking Accuracy; MT, Most Tracked ratio; ML, Most Lost ratio; FPS, Frames Per Second.

4.5 | Further analysis

Visualisation of TRAN. We visualise the feature maps of the original network and the task-related feature maps obtained after the TRAN. As shown in Figure 3, in the original detection feature maps, the model without TRAN incorrectly concentrates on some irrelevant parts of the environment and does not focus well on the centre of the object. The ReID features of the object extracted by the original model are also overly focussed on other people and the unrelated environment.

FIGURE 3 Visualisation of the original and task-related feature maps.



On the contrary, when the task-related feature maps are obtained after TRAN, the detection-based features better focus on the centre of the people, and the ReID features extracted on the ReID-related feature maps are also more focussed on the corresponding object. These phenomena demonstrate that the TRAN successfully reorganises the original feature maps into task-related feature maps, better focuses on task-related features, and effectively alleviates the competition between the ReID and object detection tasks.
Robustness analysis. We also analyse the robustness of ERTrack in common occlusion cases. As shown in Figure 4, in FairMOT, two people are given different IDs across two frames as they walk behind the streetlight and the tree respectively. On the contrary, our method identifies the targets successfully. This example demonstrates the high quality of the ReID features extracted by ERTrack.

FIGURE 4 Robustness analysis of ERTrack compared with FairMOT. IDS indicates that the same person is given different IDs in consecutive frames. Better viewed zoomed in.

Impact of k on our loss. As introduced in Section 3.3, the k value is the number of hard negative samples (negative classes) that the SGCE loss focuses on, and it has an important influence on the quality of the final extracted ReID features. Therefore, in this section we run experiments with different values of k. As shown in Table 6, as the k value increases, the tracking evaluation metrics IDF1 and MOTA both decrease, although not by much, which proves that by reducing k our loss can focus more on the hard negative samples, that is, the hard negative classes. Considering the tracking performance, we choose a k value of 15, which also demonstrates that only a small number of samples for an ID are really hard negatives, and the quality of the final ReID features is improved by focussing on these hard negative samples.

TABLE 6 Comparison of different values of k. Best results are marked bold.

Loss | Num | IDF1 ↑ | MOTA ↑ | FP ↓ | FN ↓ | IDs ↓
SGCE | 15 | 71.9 | 68.1 | 3030 | 13,804 | 417
SGCE | 30 | 71.2 | 67.9 | 3000 | 13,810 | 397
SGCE | 60 | 71.4 | 67.8 | 2858 | 14,448 | 395
SGCE | 100 | 71.2 | 67.7 | 2538 | 14,507 | 384
SGCE | 200 | 71.2 | 67.9 | 2903 | 14,067 | 399

Abbreviations: IDs, Identity Switch; IDF1, ID F1 score; MOTA, Multiple Object Tracking Accuracy.

Convergence analysis. We compare the training curves of our method with the CE loss and the GCE loss in Figure 5. It shows that our SGCE loss converges faster than the GCE loss, while its final training loss decreases more slowly than that of the CE loss, which indicates that most of the loss in the later training phase comes from hard negative samples.

FIGURE 5 Training curves of the SGCE loss, the cross entropy (CE) loss and the GCE loss, where k is 15 in the SGCE loss.

Qualitative results. Figure 6 visualises some tracking results of ERTrack on the test set of MOT17 [39]. In MOT17-01 and MOT17-08, occlusion occurs, and the IDs are correctly assigned whether the objects are occluded or reappear after crossing over each other, which can be attributed to the high-quality ReID features. In MOT17-06 and MOT17-12, people are identified correctly even though they change in size from far to near. In MOT17-07 and MOT17-14, almost all small objects are detected and their IDs are correctly identified. These phenomena demonstrate that ERTrack enhances the re-identification capability while maintaining the advantages of the original network, which finally improves the tracking performance.

5 | CONCLUSION

In this paper, we propose an online multi-object tracking method, ERTrack. A TRAN is used to alleviate the competition between the object detection and ReID tasks, and a smooth gradient-boosting loss is used to train the ReID branch so that the network can obtain more discriminative ReID features. Experiments demonstrate that the proposed method greatly enhances the re-identification ability of the network, improves tracking performance, and achieves good results on MOT datasets with both speed and accuracy. However, our approach is still limited in crowded scenes, as people are heavily occluded, which affects the quality of the final extracted features. In the future, we will investigate feature extraction under occlusion more deeply.
FIGURE 6 Example tracking results of ERTrack on the test set of MOT17 [39]. The direction of the red arrow indicates the person who needs attention. Best viewed in colour and zoomed in.

AUTHOR CONTRIBUTION
Wenyu Yang: Investigation, Methodology, Validation, Writing – original draft; Yong Jiang: Conceptualisation, Formal analysis, Supervision, Writing – review & editing; Shuai Wen: Data curation, Investigation, Writing – review & editing; Yong Fan: Supervision, Writing – review & editing.

ACKNOWLEDGEMENTS
This study is supported by the Sichuan Science and Technology Program (NO. 2021YFG0031).

CONFLICT OF INTEREST STATEMENT
We declare that we have no conflict of interest.

PERMISSION TO REPRODUCE MATERIALS FROM OTHER SOURCES
None.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in MOT16, MOT17 and MOT20 at https://motchallenge.net/, reference numbers: [MOT16/MOT17] Milan, A., Leal-Taixé, L., Reid, I., et al.: MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016); [MOT20] Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixé, L.: MOT20: a benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003 (2020).

ORCID
Wenyu Yang https://orcid.org/0000-0002-2027-9112

REFERENCES
1. Luo, W., et al.: Multiple object tracking: a literature review. Artif. Intell. 293, 103448 (2021). https://doi.org/10.1016/j.artint.2020.103448
2. Manglik, A., et al.: Forecasting time-to-collision from monocular video: feasibility, dataset, and challenges. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 8081–8088. IEEE (2019)
3. Luo, C., et al.: Learning discriminative activated simplices for action recognition. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
4. Ciaparrone, G., et al.: Deep learning in video multi-object tracking: a survey. Neurocomputing 381, 61–88 (2020). https://doi.org/10.1016/j.neucom.2019.11.023
5. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. IEEE (2017)
6. Yu, F., et al.: Multiple object tracking with high performance detection and appearance feature. In: European Conference on Computer Vision, pp. 36–42. Springer (2016)
7. Zhou, Z., et al.: Online multi-target tracking with tensor-based high-order graph matching. In: 2018 24th International Conference on Pattern Recognition (ICPR), pp. 1809–1814. IEEE (2018)
8. Fang, K., et al.: Recurrent autoregressive networks for online multi-object tracking. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 466–475. IEEE (2018)
9. Wu, J., et al.: Track to detect and segment: an online multi-object tracker. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12352–12361 (2021)
10. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
11. Wang, Z., et al.: Towards real-time multi-object tracking. In: European Conference on Computer Vision, pp. 107–122. Springer (2020)
12. Zhang, Y., et al.: Fairmot: on the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 129(11), 3069–3087 (2021). https://doi.org/10.1007/s11263-021-01513-4
13. Cai, Z., Vasconcelos, N.: Cascade r-cnn: delving into high quality object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162 (2018)
14. Law, H., Deng, J.: Cornernet: detecting objects as paired keypoints. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750 (2018)
15. Yang, Z., et al.: Reppoints: point set representation for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9657–9666 (2019)
16. Bewley, A., et al.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468. IEEE (2016)
17. Babaee, M., Li, Z., Rigoll, G.: Occlusion handling in tracking multiple people using rnn. In: 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 2715–2719. IEEE (2018)
18. Kalman, R.E.: A new approach to linear filtering and prediction problems (1960)
19. Kuhn, H.W.: The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2(1-2), 83–97 (1955). https://doi.org/10.1002/nav.3800020109
20. Han, S., et al.: Mat: motion-aware multi-object tracking. Neurocomputing 476, 75–86 (2022). https://doi.org/10.1016/j.neucom.2021.12.104
21. Saleh, F., et al.: Probabilistic tracklet scoring and inpainting for multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14329–14339 (2021)
22. Bergmann, P., Meinhardt, T., Leal-Taixe, L.: Tracking without bells and whistles. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 941–951 (2019)
23. Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. In: European Conference on Computer Vision, pp. 474–490. Springer (2020)
24. Sun, P., et al.: Transtrack: multiple object tracking with transformer. arXiv preprint arXiv:2012.15460 (2020)
25. Meinhardt, T., et al.: Trackformer: multi-object tracking with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8844–8854 (2022)
26. Lu, X., et al.: See more, know more: unsupervised video object segmentation with co-attention siamese networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3623–3632 (2019)
27. Lu, X., et al.: Video object segmentation with episodic graph memory networks. In: European Conference on Computer Vision, pp. 661–679. Springer (2020)
28. Lu, X., et al.: Segmenting objects from relational visual data. IEEE Trans. Pattern Anal. Mach. Intell. 44(11), 7885–7897 (2021). https://doi.org/10.1109/tpami.2021.3115815
29. Teng, J., et al.: Regularized two granularity loss function for weakly supervised video moment retrieval. IEEE Trans. Multimed. 24, 1141–1151 (2021). https://doi.org/10.1109/tmm.2021.3120545
30. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
31. Carion, N., et al.: End-to-end object detection with transformers. In: European Conference on Computer Vision, pp. 213–229. Springer (2020)
32. Meyerhoff, H.S., Papenmeier, F., Huff, M.: Studying visual attention using the multiple object tracking paradigm: a tutorial review. Atten. Percept. Psychophys. 79(5), 1255–1274 (2017). https://doi.org/10.3758/s13414-017-1338-1
33. Zhu, J., et al.: Online multi-object tracking with dual matching attention networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 366–382 (2018)

34. Karunasekera, H., Wang, H., Zhang, H.: Multiple object tracking with attention to appearance, structure, motion and size. IEEE Access 7, 104423–104434 (2019). https://doi.org/10.1109/access.2019.2932301
35. Ke, L., et al.: Prototypical cross-attention networks for multiple object tracking and segmentation. Adv. Neural Inf. Process. Syst. 34, 1192–1203 (2021)
36. Yu, F., et al.: Deep layer aggregation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2403–2412 (2018)
37. Woo, S., et al.: Cbam: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
38. Sun, G., et al.: Fine-grained recognition: accounting for subtle differences between similar classes. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12047–12054 (2020)
39. Milan, A., et al.: Mot16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016)
40. Saputra, M.R.U., et al.: Learning monocular visual odometry through geometry-aware curriculum learning. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 3549–3555. IEEE (2019)
41. Duan, K., et al.: Centernet: keypoint triplets for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6569–6578 (2019)
42. Lin, T.-Y., et al.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
43. Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7482–7491 (2018)
44. Dendorfer, P., et al.: Mot20: a benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003 (2020)
45. Ess, A., et al.: A mobile vision system for robust multi-person tracking. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)
46. Zhang, S., Benenson, R., Schiele, B.: Citypersons: a diverse dataset for pedestrian detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3221 (2017)
47. Dollár, P., et al.: Pedestrian detection: a benchmark. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 304–311. IEEE (2009)
48. Xiao, T., et al.: Joint detection and identification feature learning for person search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3415–3424 (2017)
49. Zhong, Z., et al.: Camstyle: a novel data augmentation method for person re-identification. IEEE Trans. Image Process. 28(3), 1176–1190 (2018). https://doi.org/10.1109/tip.2018.2874313
50. Shao, S., et al.: Crowdhuman: a benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123 (2018)
51. Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing 2008, 1–10 (2008). https://doi.org/10.1155/2008/246309
52. Ristani, E., et al.: Performance measures and a data set for multi-target, multi-camera tracking. In: European Conference on Computer Vision, pp. 17–35. Springer (2016)
53. Lin, T.-Y., et al.: Microsoft coco: common objects in context. In: European Conference on Computer Vision, pp. 740–755. Springer (2014)
54. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
55. Pang, J., et al.: Quasi-dense similarity learning for multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 164–173 (2021)
56. Peng, J., et al.: Chained-tracker: chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In: European Conference on Computer Vision, pp. 145–161. Springer (2020)
57. Pang, B., et al.: Tubetk: adopting tubes to track multi-object in a one-step training model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6308–6318 (2020)
58. Zeng, F., et al.: Motr: end-to-end multiple-object tracking with transformer. In: European Conference on Computer Vision, pp. 659–675. Springer (2022)
59. Yang, F., et al.: Remot: a model-agnostic refinement for multiple object tracking. Image Vis Comput. 106, 104091 (2021). https://doi.org/10.1016/j.imavis.2020.104091
60. Zhang, Y., et al.: Multiplex labeling graph for near-online tracking in crowded scenes. IEEE Internet Things J. 7(9), 7892–7902 (2020). https://doi.org/10.1109/jiot.2020.2996609

How to cite this article: Yang, W., et al.: Online multiple object tracking with enhanced Re-identification. IET Comput. Vis. 17(6), 676–686 (2023). https://doi.org/10.1049/cvi2.12191
