
Multi-Camera Multi-Person Tracking in Surveillance System

Omprasad V. Hedde and Anand Kumar M.
Department of Information Technology, National Institute of Technology Karnataka, Surathkal, India
[email protected], [email protected]

2023 12th International Conference on Advanced Computing (ICoAC), DOI: 10.1109/ICoAC59537.2023.10249836
Abstract—Surveillance systems have become an integral part of modern security infrastructure. The ability to track multiple individuals across multiple cameras in real time is crucial for the effectiveness of such systems. In this paper, we propose a multi-camera multi-person tracking system capable of accurately tracking multiple individuals across a network of cameras. Our proposed system utilizes a combination of computer vision and machine learning techniques to perform robust tracking in complex environments with varying lighting conditions, occlusions, and camera views. The system employs a deep learning-based object detection algorithm to detect individuals in each camera view and a multi-object tracking algorithm to associate and track the individuals across the camera network. To evaluate the performance of our proposed system, we conducted experiments on a publicly available dataset, and the results show the effectiveness of our system in achieving high accuracy and efficiency in multi-camera multi-person tracking. The proposed system is scalable and can be easily integrated with existing surveillance systems to enhance their tracking capabilities. Overall, our proposed multi-camera multi-person tracking system can provide an effective solution for the real-time tracking of individuals across a network of cameras, which can significantly enhance security and safety in various domains, such as transportation, public spaces, and critical infrastructure.

Index Terms—computer vision, object detection, object tracking, person re-identification

I. INTRODUCTION

Surveillance systems are widely used in various environments to ensure safety and security. However, tracking multiple individuals across multiple cameras can be daunting for human operators, especially in large, crowded areas. To overcome this challenge, computer vision techniques can be utilized to automate the tracking of individuals in surveillance footage. Multi-camera multi-person tracking is a complex computer vision problem that involves detecting and tracking multiple people across multiple cameras. Object detection and tracking are fundamental steps in the multi-camera multi-person tracking pipeline. In recent years, deep learning-based object detection and tracking algorithms such as YOLO (You Only Look Once) [7] and DeepSORT (Simple Online and Realtime Tracking with a deep association metric) [8] have significantly improved real-time object detection and tracking.

However, re-identifying individuals across different cameras is crucial in surveillance scenarios to maintain continuous tracking. Person re-identification is the task of matching the same individual across different cameras with different views. Deep learning-based person re-identification models [9] have shown significant improvements in recent years, making it possible to track individuals accurately across multiple cameras.

Several recent examples demonstrate the importance of multi-camera multi-person tracking in real-world scenarios:
1) Times Square car crash: In 2017, a driver plowed into pedestrians in Times Square, killing one person and injuring several others. Multi-camera footage was used to track the vehicle's movements and identify the driver, and it was critical in helping law enforcement agencies piece together the events and apprehend the suspect.
2) COVID-19 social distancing enforcement: During the COVID-19 pandemic, many cities implemented social distancing guidelines to prevent the spread of the virus. Multi-camera multi-person tracking was used in some places to monitor public areas, detect crowded areas, and alert law enforcement agencies to enforce social distancing rules.

These examples demonstrate the crucial role of multi-camera multi-person tracking in enhancing public safety and security, investigating criminal activities, and improving urban management.

This paper proposes a multi-camera multi-person tracking system that utilizes YOLO for object detection, DeepSORT for object tracking, and a person re-identification model for re-identification. The proposed system aims to provide accurate and robust real-time tracking of multiple individuals across multiple cameras in real-world scenarios, and it can be used for various applications, such as crowd monitoring, behavior analysis, and suspicious activity detection. The proposed system can benefit society in several ways. First, it can enhance public safety and security in various environments, such as airports, train stations, stadiums, and shopping malls: the system can help identify suspicious individuals and track their movements in real time, which can aid in preventing potential threats and criminal activities. Second, the system can be used for crowd monitoring, especially during large public events such as concerts and protests.
The system can track the movement and behavior of crowds and detect anomalies, such as overcrowding or suspicious activities, which can help prevent accidents and ensure public safety. Third, the system can aid in investigating criminal activities by providing accurate and reliable tracking information of individuals across multiple cameras. This can help law enforcement agencies identify suspects, track their movements, and gather evidence. In summary, the proposed multi-camera multi-person tracking system can benefit society by enhancing public safety and security, aiding in crowd monitoring, and assisting in investigating criminal activities.

II. LITERATURE SURVEY

In past studies, the problem of multi-target tracking has been formulated differently and from multiple perspectives, and several surveys and literature studies have been undertaken on those studies. Some reviews are dedicated to specific fields, such as crowd analysis [2], intelligent visual surveillance, and behavior analysis. Some are dedicated to techniques, e.g., appearance models in visual tracking: Xi Li et al. [3] studied recent advancements in 2D appearance models for visual object tracking. Jiarui et al. [5] demonstrated how the hybrid features that Alreshidi [4] proposed for facial emotion recognition could be used for multi-object tracking. Nikolajs et al. [6] identified recent trends in multi-target tracking and determined the computer vision building blocks required for road traffic scenarios such as surveillance and planning.

Since it was first posed, the pedestrian tracking problem has attracted much scientific interest. The ideal situation is to have everyone monitored consistently in a single video; however, this is rarely the case due to occlusions, people entering and exiting the camera's field of view, and other problems. These difficulties significantly raise the barrier to monocular tracking. Setting up several synchronized cameras to ensure that pedestrian movements are noticed has been the most significant answer to these problems so far. However, this also introduces the re-identification problem across multiple cameras. Researchers have proposed many solutions to this issue [1].

Tracklets generated by modern single-camera trackers are used as proposals in the approach of Nguyen et al. [10]. They clean these tracklets up using a pre-clustering generated from 3D geometry projections, because the tracklets can include ID-switch mistakes. They thereby obtain a better tracking graph without ID switches and more accurate affinity costs for the data association phase. Next, tracklets are matched to multi-camera trajectories by solving a global lifted multi-cut formulation that includes short- and long-range temporal interactions on intra- and inter-camera tracklets.

The approach suggested by P. Chu et al. [11] is known as FAMNet. All of the layers in FAMNet were designed to be differentiable, allowing combined optimization to train the higher-order affinity model and discriminative features for robust MOT, directly supervised by the loss from the assignment ground truth. They also integrate a single-object-tracking approach and a dedicated target management scheme to recover false negatives and suppress noisy target candidates produced by the external detector.

III. METHODOLOGY

In this section, the proposed methodology is explained; Fig. 1 illustrates the flow of the experiment.

Fig. 1. Flowchart of proposed methodology

This is the flow of execution of the proposed approach. We follow the steps below (a minimal sketch of the pipeline is given after the list):
1) Combine all the given input videos into a single video file.
2) Give this video file to the YOLO model, which returns each person's position in every input frame.
3) Using DeepSORT, track all detected persons.
4) Using the re-identification model, give a unique ID to the same person in every frame.
5) Write the bounding box and the ID of each person in every frame to an output file.
6) Create a video from the resulting frames and provide it as the result.
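For illustration, a minimal sketch of this flow is given below. The detector, tracker, and reider objects, and their method names, are hypothetical stand-ins for the YOLO, DeepSORT, and TorchReid components described in the following subsections, not the exact interfaces we use.

```python
import cv2

def run_pipeline(video_paths, detector, tracker, reider, out_path="result.mp4"):
    """Hypothetical end-to-end sketch; `detector`, `tracker`, and `reider`
    stand in for the YOLO, DeepSORT, and TorchReid components below."""
    writer, rows, frame_no = None, [], 0
    for path in video_paths:                        # step 1: read the clips as one stream
        cap = cv2.VideoCapture(path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            detections = detector.detect_persons(frame)       # step 2: person boxes
            tracks = tracker.update(frame, detections)        # step 3: short-term tracks
            tracks = reider.assign_global_ids(frame, tracks)  # step 4: one ID per person
            for tid, (x, y, w, h) in tracks:                  # step 5: log box + ID
                rows.append((frame_no, tid, x, y, w, h))
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
                cv2.putText(frame, str(tid), (x, y - 5),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
            if writer is None:                                # step 6: annotated video
                h_img, w_img = frame.shape[:2]
                writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                                         30.0, (w_img, h_img))
            writer.write(frame)
            frame_no += 1
        cap.release()
    if writer is not None:
        writer.release()
    return rows
```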
A. YOLO for person detection

YOLO (You Only Look Once) is an object detection model introduced by Joseph Redmon in 2016 [7]. The model has undergone subsequent revisions since then, with steady improvements to object detection. We use a pre-trained model to detect persons in a given input frame; the output consists of the class of each object and the bounding box around it. YOLO is one of the most effective object detection models to date. Most object detection models use a two-step approach consisting of object classification and localization, whereas YOLO performs both simultaneously in a single step. This is the reason it is very fast as well as accurate for object detection purposes.

We give our input video to the model, which locates the persons in the frames and outputs a bounding box around every person in each frame. Fig. 2 [15] illustrates how YOLO works, and a minimal detection sketch is given below.

Fig. 2. How YOLO works [15]
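The sketch below shows one way to run a pre-trained YOLO person detector; since the paper does not fix a YOLO version, loading YOLOv5 through torch.hub is an assumption made only for illustration.

```python
import cv2
import torch

# Assumed setup: a small pre-trained YOLOv5 model from the ultralytics hub repo.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
model.classes = [0]  # COCO class 0 is "person"; all other classes are ignored

def detect_persons(frame_bgr):
    """Return (x1, y1, x2, y2, confidence) tuples for persons in a BGR frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # hub models expect RGB
    results = model(rgb)
    # results.xyxy[0] columns: x1, y1, x2, y2, confidence, class
    return [tuple(row[:5]) for row in results.xyxy[0].cpu().numpy()]
```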

B. DeepSORT

Tracking a person through a video stream is a complicated task. It requires predicting the next position of the object in each frame, and there are various difficulties along the way, such as occlusion, when the object is hidden by something else; at these times, tracking the object becomes difficult. DeepSORT addresses all these problems. Previous models used mean-shift and optical-flow approaches to track objects. Models built with these approaches achieved good accuracy, but several problems made them difficult to use: they are computationally expensive, prone to noise, and do not address the problem of occlusion.

DeepSORT uses a Kalman filter to predict the position of the tracked object from its velocity and a Gaussian state estimate. For example, suppose you are tracking a cat walking in a room, and the cat goes behind a box so that you can no longer see it; the cat is now occluded by the box. In this case, DeepSORT uses the Kalman filter to predict the cat's approximate position. Fig. 3 [16] gives a simple illustration of the working of the DeepSORT model; a minimal sketch of the prediction step is given after the figure.

Fig. 3. Working of DeepSORT [16]
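To make the occlusion argument concrete, the following is a simplified constant-velocity Kalman prediction step in NumPy. DeepSORT's actual filter uses an eight-dimensional state (box centre, aspect ratio, height, and their velocities), so this is only a sketch with assumed noise values.

```python
import numpy as np

# Simplified constant-velocity Kalman filter for one track:
# state x = [cx, cy, vx, vy] (box centre and its velocity).
dt = 1.0  # one frame
F = np.array([[1, 0, dt, 0],   # transition: position += velocity * dt
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
Q = np.eye(4) * 0.01           # process noise (assumed value)

def predict(x, P):
    """Predict the next state and its covariance (the Gaussian spread).
    While a person is occluded, this step keeps the track alive by
    extrapolating the last known position along the estimated velocity."""
    return F @ x, F @ P @ F.T + Q

x = np.array([100.0, 50.0, 2.0, 0.0])  # centre (100, 50), moving right
P = np.eye(4)
for _ in range(5):                      # five occluded frames
    x, P = predict(x, P)
print(x[:2])                            # predicted centre: [110., 50.]
```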
There are various alternatives to DeepSORT, but they come with different drawbacks. Tracktor++ is a tracking model with better accuracy, but it works very slowly. TrackRCNN does good segmentation but cannot handle real-time data well, processing frames at 1.6 FPS. JDE is another object tracking model, with a processing speed of 12 FPS, but it requires higher-resolution frames. Taking all these scenarios into consideration, DeepSORT works best for us.

C. TorchReid

TorchReid is a framework used to re-identify people across different views, developed specifically for person re-identification in video [9]. We use the built-in ResNet50 model provided by the framework, with its pre-trained weights, to compute a feature map for every detected person. By comparing the distances between feature maps, we measure the similarity between detections and decide whether they show the same person; this is how the TorchReid framework ensures that a person keeps a unique ID through to the end of the video. TorchReid is flexible with respect to person re-identification datasets and can be trained on any of them, and it provides recent state-of-the-art re-identification models. As a framework developed specifically for person re-identification, it is well suited to our project. A minimal sketch of this matching step is given below.
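This sketch assumes torchreid's FeatureExtractor utility with the ResNet50 backbone; the similarity threshold is illustrative, not a tuned value from our experiments.

```python
import numpy as np
from torchreid.utils import FeatureExtractor

# Assumed usage of torchreid's feature-extraction utility; pass `model_path`
# to load re-identification weights instead of the default initialization.
extractor = FeatureExtractor(model_name="resnet50", device="cpu")

def same_person(crop_a, crop_b, threshold=0.7):
    """Decide whether two RGB person crops show the same individual by the
    cosine similarity of their embeddings; the threshold is illustrative."""
    feats = extractor([crop_a, crop_b]).cpu().numpy()
    a, b = feats[0], feats[1]
    cos_sim = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos_sim > threshold
```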
D. Evaluation Metric

Multi-object tracking (MOT) detects and tracks multiple objects in a video sequence. There are several commonly used metrics for evaluating the performance of an MOT system, including the following (a short sketch of the first two quantities follows the list):
1) Intersection over Union (IoU): calculates the overlap between the predicted and true bounding boxes.
2) Multiple Object Tracking Accuracy (MOTA): measures the percentage of correctly tracked objects over the total number of objects in the sequence.
3) Mostly Tracked targets (MT): measures the percentage of objects tracked for at least 80% of their respective ground-truth track lengths.
4) Mostly Lost targets (ML): measures the percentage of objects tracked for less than 20% of their respective ground-truth track lengths.
5) Identity Switches (ID Sw): measures the number of times the tracker mistakenly switched the identity of two objects.
6) Fragmentation (Frag): measures the number of times the tracker lost and re-detected the same object.
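A minimal sketch of the two headline quantities, computed directly from their standard definitions (boxes in (x, y, w, h) form; MOTA from per-sequence error counts):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def mota(num_fn, num_fp, num_id_switches, num_gt):
    """Standard definition: MOTA = 1 - (FN + FP + IDSW) / GT over a sequence."""
    return 1.0 - (num_fn + num_fp + num_id_switches) / num_gt
```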

These metrics can be used together to evaluate an MOT system's overall performance. Beyond them, there are metrics such as IDF1 and HOTA, which are described below.

IDF1 is the Identification F1 score, which evaluates overall performance by combining identification precision and identification recall into a single metric. HOTA is the most recent evaluation metric used for MOT evaluation. It combines object detection and association accuracy, and it measures localization recall, localization precision, and temporal stability, providing a comprehensive evaluation of MOT system performance. It is commonly used in MOT benchmarks and competitions, and it matters for real-world applications such as video surveillance and autonomous driving. Track-mAP measures trajectories in a way that combines association, detection, and localization, making the error types indistinguishable and inseparable; biased toward measuring association, it performs both matching and association at the trajectory level and operates on confidence-ranked candidate tracking results. IDF1 emphasizes association accuracy rather than detection and is used as a secondary metric on the MOTChallenge benchmark due to its focus on measuring association accuracy over detection accuracy.
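Since these metrics are described only verbally here, the standard closed forms from the metric literature are reproduced for reference (α is the IoU localization threshold; TPA, FNA, and FPA count true positive, false negative, and false positive associations):

```latex
\[
\mathrm{DetA}_{\alpha} = \frac{|TP|}{|TP| + |FN| + |FP|}, \qquad
\mathrm{AssA}_{\alpha} = \frac{1}{|TP|}\sum_{c \in TP}
    \frac{|TPA(c)|}{|TPA(c)| + |FNA(c)| + |FPA(c)|}
\]
\[
\mathrm{HOTA}_{\alpha} = \sqrt{\mathrm{DetA}_{\alpha}\,\mathrm{AssA}_{\alpha}}, \qquad
\mathrm{HOTA} = \frac{1}{19}\sum_{\alpha \in \{0.05,\,0.10,\,\dots,\,0.95\}} \mathrm{HOTA}_{\alpha}
\]
\[
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
\]
```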
IV. TEST DATASET

The MOTChallenge dataset is a regularly used benchmark dataset in computer vision research for evaluating object-tracking algorithms, including person tracking. The dataset comprises video sequences containing many moving objects, such as humans, in diverse real-world circumstances. It is designed to be difficult, with variations in object appearance, scale, occlusion, motion patterns, lighting conditions, and camera views to simulate real-world tracking settings. The MOTChallenge dataset includes several kinds of video sequences, such as indoor and outdoor scenes, congested settings, metropolitan situations, and surveillance scenarios. The videos are often taken by several cameras or sensors, resulting in differences in resolution, frame rate, and image quality.

Annotations, or ground-truth labels specifying the correct positions of objects (e.g., humans) in each video frame, are also included in the dataset and may be used to train and evaluate tracking algorithms. The objects in the MOTChallenge dataset, usually individuals for person tracking, may walk, run, stand, interact with other items or people, and occlude or be occluded by other objects in the scene. Occlusions, camera movements, illumination changes, and complicated object interactions may all be included, making tracking more difficult.

We use the MOT16 and MOT20 datasets for the evaluation of the system. The MOT16 dataset contains video sequences captured in unconstrained environments with annotated ground-truth object trajectories, and it is widely used for benchmarking MOT algorithms. The MOT20 dataset is an updated version of MOT16, providing additional video sequences with diverse challenges such as occlusions, scale variations, and crowded scenes. Both datasets are commonly used to evaluate the performance of MOT algorithms in real-world scenarios and to compare the accuracy and robustness of different tracking methods.
V. EXPERIMENTS AND RESULTS

We gave two input videos to test the system we created. The videos show persons roaming around in circles; their lengths are 3 seconds and 7 seconds. The system combines these two videos into a single 10 s video, which was converted into 318 frames, meaning the system works at roughly 30 FPS (the combination step is sketched below).
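This combination step can be reproduced with OpenCV; the sketch assumes both clips share resolution and frame rate, and the file names are placeholders.

```python
import cv2

def combine_videos(paths, out_path="combined.mp4"):
    """Append several clips into one file, assuming they share resolution
    and frame rate; the file names used below are placeholders."""
    writer, frames = None, 0
    for path in paths:
        cap = cv2.VideoCapture(path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if writer is None:
                h, w = frame.shape[:2]
                writer = cv2.VideoWriter(out_path,
                                         cv2.VideoWriter_fourcc(*"mp4v"),
                                         fps, (w, h))
            writer.write(frame)
            frames += 1
        cap.release()
    if writer is not None:
        writer.release()
    return frames  # a 3 s + 7 s pair at about 30 FPS yields roughly 318 frames

print(combine_videos(["cam0.mp4", "cam1.mp4"]))
```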
Fig. 4 shows sample frames from the original videos, in which four persons roam around in a room; both videos were recorded from the same camera. The combined video is given as input to the YOLO model and is then further processed by DeepSORT and TorchReid, which give the final results.

Fig. 4. Before processing the video

Fig. 5 shows the tracking output. Here we see that the same person has been given different IDs in the output video. This problem is handled by the re-identification framework, which makes sure every person gets a unique ID: it further processes the frames, re-identifies the persons, and gives them their prior unique ID.

Fig. 5. Output of tracking the video

In Fig. 6, we can see that after processing through re-identification, the same person has the same ID. This is how we ensure every person has a unique ID throughout the video sequence. In addition, the system produces a text file that includes every person's details, with the bounding box and ID of that person for every frame; with it, all the scenes containing a given person can be extracted from the video.

Fig. 6. After the re-identification of the person

Without re-identification, the tracking model cannot recognize a person it has already seen, as can be observed in figure 7: the tracking model gives different IDs to the same person if that person reappears in the same or a different camera. After applying the re-identification model to the tracking of the videos, the model gives the smallest unique ID previously assigned to the person, so that each person keeps the same ID throughout the video stream; this can be visualized in figure 8. The results for the individual videos on different evaluation metrics are given below. These results were generated by an online evaluation website from which the dataset was downloaded.

Fig. 7. Same person being tracked in different cameras

Fig. 8. Re-identification model gives a single ID to a person

The terms used in Table II and Table IV are described below.
MT: Mostly tracked targets. The proportion of ground-truth trajectories covered by a track hypothesis for at least 80% of their respective life spans.
ML: Mostly lost targets. The proportion of ground-truth trajectories covered by a track hypothesis for at most 20% of their respective life spans.
FP: False positives. The total number of false positives.
FN: False negatives. The total number of false negatives.
R: Recall. The ratio of correct detections to the total number of ground-truth boxes.
P: Precision. The ratio of correct detections over all detections.
AssA: Association accuracy. The association Jaccard index averaged over all matching detections and then over localization thresholds.
DetA: Detection accuracy. The detection Jaccard index averaged over localization thresholds.
AssR: Association recall. TPA / (TPA + FNA), averaged over all matching detections and then over localization thresholds.
AssP: Association precision. TPA / (TPA + FPA), averaged over all matching detections and then over localization thresholds.
DetR: Detection recall. TP / (TP + FN), averaged over localization thresholds.
DetP: Detection precision. TP / (TP + FP), averaged over localization thresholds.
Frag: The total number of fragmentations in a trajectory (i.e., interruptions during tracking).
ID switches: The number of identity switches divided by the recall.
FAF: The average number of false alarms per frame.
LocA: Average localization similarity, averaged over all matching detections and localization thresholds.

TABLE I
MOT16 RESULTS
Sequence MOTA IDF1 HOTA MT ML FP FN R P
MOT16-01 58.2 62.9 50.0 10 3 329 2,303 64.0 92.6
MOT16-03 76.1 75.5 62.3 95 12 7,453 17,427 83.3 92.1
MOT16-06 55.7 57.2 45.1 80 39 1,331 3,635 68.5 85.6
MOT16-07 68.0 62.2 49.7 22 1 919 4,161 74.5 93.0
MOT16-08 49.6 46.9 41.2 21 2 2,549 5,560 66.8 81.4
MOT16-12 48.3 66.4 53.2 40 6 2,067 2,163 73.9 74.8
MOT16-14 51.3 61.2 43.4 26 35 853 7,939 57.0 92.5
MOT16 67.2 68.5 55.9 294 98 15,501 43,188 76.3 90.0

TABLE II
MOT16 RESULTS (ADDITIONAL METRICS)
Sequence AssA DetA AssR AssP DetR DetP LocA FAF ID Sw. Frag
MOT16-01 50.8 49.2 56.3 79.1 53.4 77.2 83.9 0.7 39 157
MOT16-03 59.5 65.3 67.3 74.7 71.2 78.7 83.3 5.0 132 847
MOT16-06 41.7 48.9 61.5 54.4 56.1 70.1 81.9 1.1 146 454
MOT16-07 44.1 56.1 48.3 74.5 61.4 76.6 82.9 1.8 144 295
MOT16-08 36.9 47.0 43.0 61.8 55.4 67.5 81.7 4.1 323 458
MOT16-12 57.7 49.2 63.9 76.9 62.8 63.5 83.4 2.3 61 214
MOT16-14 45.3 41.8 51.0 71.1 45.1 73.1 80.3 1.1 202 495
MOT16 54.1 57.9 61.9 72.6 64.3 75.8 82.9 2.6 1,047 2,920

Table I shows the tracking results for every testing video sample; every video sample from the testing dataset has been evaluated individually. Observing the results, we can see considerable variation across individual videos. After detailed observation, we found that the detection model cannot detect all the persons in the frame in densely populated videos, and because of that, those persons are not tracked; a model able to detect all the persons in a crowded video frame would have to be trained.

TABLE III
MOT20 RESULTS
Sequence MOTA IDF1 HOTA MT ML FP FN R P
MOT20-04 68.6 59.6 47.4 258 63 8,285 75,479 72.5 96.0
MOT20-06 54.7 55.2 43.8 74 80 2,490 56,643 57.3 96.8
MOT20-07 68.7 64.6 52.8 53 6 1,251 8,788 73.5 95.1
MOT20-08 48.1 54.5 42.0 35 64 1,654 38,009 50.9 96.0
MOT20 62.0 58.2 46.2 420 213 13,680 178,919 65.4 96.1

TABLE IV
MOT20 RESULTS (ADDITIONAL METRICS)
Sequence AssA DetA AssR AssP DetR DetP LocA FAF ID Sw. Frag
MOT20-04 40.6 55.6 46.7 64.3 59.3 78.5 82.7 4.0 2,259 7,813
MOT20-06 43.5 44.3 46.9 71.1 46.6 78.7 82.7 2.5 961 2,387
MOT20-07 49.7 56.7 57.2 67.3 61.5 79.6 84.6 2.1 323 651
MOT20-08 45.9 38.8 49.4 73.5 40.9 77.0 81.9 2.1 552 1,406
MOT20 42.7 50.3 48.0 67.3 53.4 78.5 82.7 3.1 4,095 12,257

After averaging all the results of the testing dataset, the final results are listed in the last row of Table I.

The results ranked 4th in the MOT16 dataset challenge, in a list of 100 results, according to the MOTA metric. We achieved the highest detection accuracy and recall for public detection in the MOT16 dataset. Considering the HOTA metric, we ranked second in public detection, with 55.9% accuracy. For the MOT20 dataset, we got the results shown in Table III. The MOT20 dataset contains high-resolution videos and very dense crowds. The average MOTA score is 62.0 for the testing dataset, ranked fifth in the public detection set of the MOTChallenge benchmarking site. The metrics reported in Tables I and III depend on those in Tables II and IV, as discussed above.

To further evaluate the system, we created our own dataset. Its three videos have different scenarios, different illumination changes, and occlusions, and we annotated the dataset manually. Two videos contain 4 persons each, and one video contains 5 persons. We used the TrackEval library to evaluate our results, in the MOTChallenge format; the format contains the frame, ID, bounding-box left, top, width, and height, the confidence, and the x, y, z coordinates (a minimal writer for this format is sketched below). Based on this format, the evaluation is done, and Table V shows the result of the evaluation on the HOTA metric.
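The sketch below writes tracker output in the standard MOTChallenge column order (frame, id, bb_left, bb_top, bb_width, bb_height, conf, x, y, z), with -1 for the unused world coordinates.

```python
def write_mot_results(rows, out_path="results.txt"):
    """Write tracking output in MOTChallenge format:
    frame, id, bb_left, bb_top, bb_width, bb_height, conf, x, y, z.
    The x, y, z world coordinates are unused here and set to -1."""
    with open(out_path, "w") as f:
        for frame, track_id, left, top, width, height, conf in rows:
            f.write(f"{frame},{track_id},{left:.2f},{top:.2f},"
                    f"{width:.2f},{height:.2f},{conf:.2f},-1,-1,-1\n")

# Example row: frame 1, ID 2, box at (10, 20) sized 50x120, confidence 0.9
write_mot_results([(1, 2, 10.0, 20.0, 50.0, 120.0, 0.9)])
```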
TABLE V
RESULTS ON OUR DATASET
Video seq HOTA DetA AssA DetRe DetPr AssRe AssPr LocA
cam0 88.691 95.533 82.34 96.249 99.214 88.367 89.053 99.569
cam1 92.702 91.42 94.003 99.563 91.788 94.118 97.398 99.976
cam2 97.093 97.065 97.122 97.065 100 97.122 100 100
Combined 93.552 93.987 93.119 98.123 95.708 94.189 96.894 99.916

Fig. 9. Frame from first video

Fig. 10. Frame from second video

Fig. 11. Errors in identification

VI. DISCUSSION

Even after rigorous training, there will be some problems. There are situations where it is difficult to track a person, for example when the light is very low, and sometimes the system identifies other structures as a person. Sometimes the same person is given a different identity, as shown in figures 9 and 10 (the figures are cropped to show the identification details); these two frames are from different videos that were processed together as one video. The person with ID 2 is given ID 5, while at the same time ID 2 is given to some other person. Sometimes, different people are given the same identity. The system is built to avoid these problems, but eliminating them entirely is impossible, so there is scope for improvement. Though the system effectively tracks persons in crowded scenes, it performs much better in less crowded areas, so it can be used for home security or for less crowded areas such as small offices. The system also works better in broad daylight; it is difficult to track people at night, so there is room for improvement there as well.

VII. CONCLUSION AND FUTURE SCOPE

The system can track the people present in a given frame. When a person arrives in the frame for the first time, he or she is given a track ID that is unique for the entire video stream. The output video contains a bounding box for everyone along with their track IDs. For very crowded scenes, detection of people is much weaker, which requires training the model on crowded datasets: on densely populated video sequences the model does not detect all the people, which leads to a loss of detection and tracking.

Since the model cannot detect all the persons in crowded places, the detection model will be trained on a crowded-human dataset in future work. By using such a dataset, we can improve the tracking accuracy of the system to a great extent.

REFERENCES

[1] Jian Xu, Chunjuan Bo, and Dong Wang, "A novel multi-target multi-camera tracking approach based on feature grouping," Computers & Electrical Engineering, vol. 92, 107153, 2021, ISSN 0045-7906.
[2] Sultan Daud Khan, Maqsood Mahmud, Habib Ullah, Mohib Ullah, and Faouzi Alaya Cheikh, "Crowd congestion detection in videos," Electronic Imaging, vol. 2020, no. 6, pp. 72-1, 2020.
[3] Xi Li, Weiming Hu, Chunhua Shen, Zhongfei Zhang, Anthony Dick, and Anton Van Den Hengel, "A survey of appearance models in visual object tracking," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 4, no. 4, pp. 1-48, 2013.
[4] Abdulrahman Alreshidi and Mohib Ullah, "Facial emotion recognition using hybrid features," in Informatics, Multidisciplinary Digital Publishing Institute, 2020, vol. 7, p. 6.
[5] Jiarui Xu, Yue Cao, Zheng Zhang, and Han Hu, "Spatial-temporal relation networks for multi-object tracking," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3988-3998.
[6] Nikolajs Bumanis, Gatis Vitols, Irina Arhipova, and Egons Solmanis, "Multi-object tracking for urban and multilane traffic: Building blocks for real-world application," 2021.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, real-time object detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 779-788, doi: 10.1109/CVPR.2016.91.
[8] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3645-3649.
[9] K. Zhou and T. Xiang, "Torchreid: A library for deep learning person re-identification in PyTorch," 2019.
[10] D. M. H. Nguyen, R. Henschel, B. Rosenhahn, D. Sonntag, and P. Swoboda, "LMGP: Lifted multicut meets geometry projections for multi-camera multi-object tracking," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 8856-8865, doi: 10.1109/CVPR52688.2022.00866.
[11] P. Chu and H. Ling, "FAMNet: Joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 6171-6180, doi: 10.1109/ICCV.2019.00627.

[12] J. T. S. Phang and K. H. Lim, "Real-time multi-camera multi-person action recognition using pose estimation," in Proceedings of the 3rd International Conference on Machine Learning and Soft Computing, 2019, pp. 175-180.
[13] Y. Guo, Z. Liu, H. Luo, H. Pu, and J. Tan, "Multi-person multi-camera tracking for live stream videos based on improved motion model and matching cascade," Neurocomputing, vol. 492, pp. 561-571, 2022.
[14] Y. Guo, X. Wang, H. Luo, H. Pu, Z. Liu, and J. Tan, "Real-time multi-person multi-camera tracking based on improved matching cascade," in Advances in Intelligent Systems and Computing: Proceedings of the 7th Euro-China Conference on Intelligent Data Analysis and Applications, May 29-31, 2021, Hangzhou, China, Singapore: Springer Nature Singapore, 2022, pp. 199-209.
[15] Grace Karimi, "Working of YOLO." Available online: https://www.section.io/engineering-education/introduction-to-yolo-algorithm-for-object-detection/
[16] Sanyam, "How DeepSORT works." Available online: https://learnopencv.com/understanding-multiple-object-tracking-using-deepsort/
