Monitoring COVID-19 Social Distancing With Person PDF
N. S. Punn, S. K. Sonbhadra, S. Agarwal, Indian Institute of Information Technology Allahabad, Jhalwa, Prayagraj, Uttar Pradesh, India; emails: {pse2017002, rsi2017502, sonali}@iiita.ac.in.

COVID-19 has brought a global crisis with its deadly spread to more than 180 countries, with about 3,519,901 confirmed cases and 247,630 deaths globally as of May 4, 2020. The absence of any active therapeutic agents and the lack of immunity against COVID-19 increase the vulnerability of the population. Since no vaccines are available, social distancing is the only feasible approach to fight this pandemic. Motivated by this notion, this article proposes a deep learning based framework for automating the task of monitoring social distancing using surveillance video. The proposed framework utilizes the YOLO v3 object detection model to segregate humans from the background and the Deepsort approach to track the identified people with the help of bounding boxes and assigned IDs. The results of the YOLO v3 model are further compared with other popular state-of-the-art models, e.g. faster region-based CNN (convolutional neural network) and single shot detector (SSD), in terms of mean average precision (mAP), frames per second (FPS) and loss values defined by object classification and localization. Later, the pairwise vectorized L2 norm is computed based on the three-dimensional feature space obtained by using the centroid coordinates and dimensions of the bounding box. The violation index term is proposed to quantify the non-adoption of the social distancing protocol. From the experimental analysis, it is observed that YOLO v3 with the Deepsort tracking scheme displayed the best results, with balanced mAP and FPS scores, to monitor social distancing in real time.

Index Terms: COVID-19, Video surveillance, Social distancing, Object detection, Object tracking.

I. INTRODUCTION

COVID-19 belongs to the family of coronavirus-caused diseases, initially reported in Wuhan, China, during late December 2019. On March 11, 2020, when it had spread over 114 countries with 118,000 active cases and 4,000 deaths, WHO declared it a pandemic [1], [2]. On May 4, 2020, over 3,519,901 cases and 247,630 deaths had been reported worldwide. Several healthcare organizations, medical experts and scientists are trying to develop proper medicines and vaccines for this deadly virus, but to date no success has been reported. This situation forces the global community to look for alternate ways to stop the spread of this infectious virus. Social distancing is claimed to be the best spread stopper in the present scenario, and all affected countries are locked down to implement social distancing. This research aims to support the mitigation of the coronavirus pandemic with minimum loss of economic endeavours, and proposes a solution to detect social distancing among people gathered at any public place.

Social distancing refers to best practices, pursued through a variety of means, that aim to minimize or interrupt the transmission of COVID-19. It aims at reducing the physical contact between possibly infected individuals and healthy persons. As per the WHO norms [3], it is prescribed that people should maintain at least 6 feet of distance from each other in order to follow social distancing.

A recent study indicates that social distancing is an important containment measure and essential to prevent the spread of SARS-CoV-2, because people with mild or no symptoms may unknowingly carry the infection and can infect others [4]. Fig. 1 indicates that proper social distancing is the best way to reduce infectious physical contact and hence the infection rate [5], [6]. This reduced peak may match the available healthcare infrastructure and help to offer better facilities to the patients battling the coronavirus pandemic.

Fig. 1: An outcome of social distancing as the reduced peak of the epidemic and matching with available healthcare capacity.

Epidemiology is the study of factors and reasons for the spread of infectious diseases. To study epidemiological phenomena, mathematical models are the most preferred choice. Almost all models descend from the classical SIR model of Kermack and McKendrick established in 1927 [7]. Various research works have been done on the SIR model and its extensions by the deterministic system [8], and consequently, many researchers studied stochastic biological systems and epidemic models [9].

Respiratory diseases are infectious, and the rate and mode of transmission of the causative virus are the most critical factors to be considered for treatment or for ways to stop the spread of the virus in the community. Several medical organizations and pandemic researchers are trying to develop vaccines for COVID-19, but there is still no well-known medicine available for treatment. Hence, precautionary steps
are taken by the whole world to restrict the spread of infection. Recently, Eksin et al. [8] proposed a modified SIR model with the inclusion of a social distancing parameter, $a(I, R)$, which can be determined with the help of the number of infected and recovered persons, represented as $I$ and $R$, respectively.

$$
\begin{aligned}
\frac{dS}{dt} &= -\beta S \frac{I}{N} a(I, R) \\
\frac{dI}{dt} &= -\delta I + \beta S \frac{I}{N} a(I, R) \\
\frac{dR}{dt} &= \delta I
\end{aligned}
\tag{1}
$$

where $\beta$ represents the infection rate and $\delta$ represents the recovery rate. The population size is computed as $N = S + I + R$. Here the social distancing term ($a(I, R) : \mathbb{R}^2 \to [0, 1]$) maps the transition rate from a susceptible state ($S$) to an infected state ($I$), which is calculated by $a \beta S I / N$.

The social distancing models are of two types. The first model is known as "long-term awareness", in which the occurrence of interaction of an individual with others is reduced proportionally with the cumulative percentage of affected (infectious and recovered) individuals (Eq. 2),

$$
a = \left( 1 - \frac{I + R}{N} \right)^k \tag{2}
$$

Meanwhile, the second model is known as "short-term awareness", where the reduction in interaction is directly proportional to the proportion of infectious individuals at a given instance (Eq. 3),

$$
a = \left( 1 - \frac{I}{N} \right)^k \tag{3}
$$

where $k$ is a behavior parameter defined as $k \geq 0$. A higher value of $k$ implies that individuals are becoming more sensitive to the disease prevalence.

In a similar background, on April 16, 2020, the company Landing AI [10], under the leadership of one of the most recognizable names in AI, Dr. Andrew Ng [11], announced the creation of an AI tool to monitor social distancing at the workplace. In a brief article, the company claimed that the upcoming tool could detect if people are maintaining a safe physical distance from each other by analyzing real-time video streams from the camera. It is also claimed that this tool can easily be integrated with existing security cameras available at different workplaces to maintain a safe distance among all workers. A brief demo was released that shows three steps, calibration, detection and measurement, to monitor social distancing. On April 21, 2020, Gartner, Inc. identified Landing AI as a Cool Vendor in AI Core Technologies to appreciate their timely initiative in this revolutionary area to support the fight against COVID-19 [12].

Motivated by this, in the present work the authors attempt to check and compare the performance of popular object detection and tracking schemes in monitoring social distancing. The rest of the paper is organized as follows: Section II presents the recent work proposed in this field of study, followed by the state-of-the-art object detection and tracking models in Section III. Later, in Section IV the deep learning based framework is proposed to monitor social distancing. In Section V experimentation and the corresponding results are discussed, accompanied by the outcome in Section VI. In Section VII the future scope and challenges are discussed, and lastly Section VIII presents the conclusion of the present research work.

II. BACKGROUND STUDY AND RELATED WORK

Social distancing is widely regarded as the most trustworthy technique to stop the spread of infectious disease. With this belief, after COVID-19 emerged in Wuhan, China, in December 2019, it was adopted as an unprecedented measure on January 23, 2020 [13]. Within one month, the outbreak in China reached a peak in the first week of February, with 2,000 to 4,000 new confirmed cases per day. Later, for the first time after this outbreak, there was a sign of relief, with no new confirmed cases for five consecutive days up to 23 March 2020 [14]. It is evident that social distancing measures, enacted initially in China, were later adopted worldwide to control COVID-19.

Prem et al. [15] aimed to study the effects of social distancing measures on the spread of the COVID-19 epidemic. The authors used synthetic location-specific contact patterns to simulate the ongoing trajectory of the outbreak using susceptible-exposed-infected-removed (SEIR) models. It was also suggested that premature and sudden lifting of social distancing could lead to an earlier secondary peak, which could be flattened by relaxing the interventions gradually [15]. As we all understand, social distancing is an essential but economically painful measure to flatten the infection curve. Adolph et al. [16] highlighted the situation of the United States of America, where, due to a lack of common consent among policymakers, it could not be adopted at an early stage, which is resulting in ongoing harm to public health. Although social distancing has impacted economic productivity, many researchers are trying hard to overcome the loss. Following from this context, Kylie et al. [17] studied the correlation between the strictness of social distancing and the economic status of the region. The study indicated that intermediate levels of activity could be permitted while avoiding a massive outbreak.

Since the novel coronavirus pandemic began, many countries have been taking the help of technology based solutions in different capacities to contain the outbreak [18], [19], [20]. Many countries, including India and South Korea, for instance, are utilising GPS to track the movements of suspected or infected persons to monitor any possibility of their exposure among healthy people. In India, the government is using the Arogya Setu App, which works with the help of GPS and Bluetooth to locate the presence of COVID-19 patients in the vicinity. It also helps others to keep a safe distance from the infected person [21]. On the other hand, some law enforcement departments have been using drones and other surveillance cameras to detect mass gatherings of people and take regulatory actions to disperse the crowd [22], [23]. Such manual intervention in these critical situations might help flatten the curve, but it also brings a unique set of threats to the public and is challenging to the workforce.
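Returning to the modified SIR model of Eq. 1 with the awareness terms of Eq. 2 and Eq. 3, the following is a minimal numerical sketch rather than the implementation of Eksin et al. It assumes a simple forward-Euler integrator and uses the infection term aβSI/N in both dS/dt and dI/dt so that N stays constant. The function names (`awareness`, `simulate`) and all parameter values (β = 0.5, δ = 0.2, N = 1000) are hypothetical and chosen only for illustration.

```python
def awareness(I, R, N, k, mode="long"):
    """Social distancing term a in [0, 1]: Eq. 2 ('long') or Eq. 3 ('short')."""
    affected = I + R if mode == "long" else I
    return (1.0 - affected / N) ** k


def simulate(beta, delta, k, mode="long", S0=999.0, I0=1.0, R0=0.0,
             days=200, dt=0.1):
    """Forward-Euler integration of Eq. 1 (illustrative, not tuned for accuracy).

    Returns the final (S, I, R) and the peak number of infected individuals.
    """
    S, I, R = float(S0), float(I0), float(R0)
    N = S + I + R
    peak = I
    for _ in range(round(days / dt)):
        a = awareness(I, R, N, k, mode)
        new_infections = a * beta * S * I / N   # transition rate from S to I
        recoveries = delta * I                  # transition rate from I to R
        S -= dt * new_infections
        I += dt * (new_infections - recoveries)
        R += dt * recoveries
        peak = max(peak, I)
    return S, I, R, peak


# Hypothetical parameters: k = 0 recovers the classical SIR model (a = 1).
S_a, I_a, R_a, peak_no_distancing = simulate(beta=0.5, delta=0.2, k=0)
S_b, I_b, R_b, peak_long_term = simulate(beta=0.5, delta=0.2, k=2, mode="long")
S_c, I_c, R_c, peak_short_term = simulate(beta=0.5, delta=0.2, k=2, mode="short")
```

Under these assumed values, comparing the three peaks gives a quick way to see the flattening effect sketched in Fig. 1: a larger k lowers the peak of I relative to the k = 0 baseline.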
Human detection using visual surveillance systems is an established area of research which relies upon manual methods of identifying unusual activities; however, it has limited capabilities [24]. In this direction, recent advancements advocate the need for intelligent systems to detect and capture human activities. Human detection remains an ambitious goal due to a variety of constraints such as low-resolution video, varying articulated pose, clothing, lighting and background complexities, and limited machine vision capabilities; prior knowledge of these challenges can improve the detection performance [25].

Detecting an object which is in motion incorporates two stages: object detection [26] and object classification [27]. The primary stage of object detection can be achieved by using background subtraction [28], optical flow [29] and spatio-temporal filtering techniques [30]. In the background subtraction method [31], the difference between the current frame and a background frame (first frame) is computed at pixel or block level. Adaptive Gaussian mixture, temporal differencing, hierarchical background models, warping background and non-parametric background are the most popular approaches to background subtraction [32]. In the optical flow-based object detection technique [29], flow vectors associated with the object's motion are characterised over a time span in order to identify regions in motion for a given sequence of images [33]. Researchers reported that optical flow based techniques incur computational overheads and are sensitive to various motion-related outliers such as noise, colour and lighting [34]. In another method of motion detection, Aslani et al. [30] proposed a spatio-temporal filter based approach in which the motion parameters are identified by using three-dimensional (3D) spatio-temporal features of the person in motion in the image sequence. These methods are advantageous due to their simplicity and lower computational complexity; however, they show limited performance because of noise and uncertainties in moving patterns [35].

Object detection problems have been efficiently addressed by recently developed advanced techniques. In the last decade, convolutional neural networks (CNN), region-based CNN [36] and faster region-based CNN [37] used region proposal techniques to generate the objectness score prior to classification and later generate the bounding boxes around the object of interest for visualization and other statistical analysis [38]. Although these methods are efficient, they suffer from larger training time requirements [39]. Since all these CNN based approaches utilize classification, another approach, YOLO, considers a regression based method to dimensionally separate the bounding boxes and interpret their class probabilities [40]. In this method, the designed framework efficiently divides the image into several portions representing bounding boxes along with the class probability scores for each portion to consider as an object. This approach offers excellent improvements in terms of speed, though the gained speed is traded against accuracy. The detector module exhibits powerful generalization capabilities of representing an entire image [41].

Based on the above concepts, many research findings have been reported in the last few years. Crowd counting emerged as a promising area of research, with many societal applications. Eshel et al. [42] focused on crowd detection and person count by proposing multiple height homographies for head top detection and solved the occlusion problem associated with video surveillance related applications. Chen et al. [43] developed an electronic advertising application based on the concept of crowd counting. In a similar application, Chih-Wen et al. [44] proposed a vision-based people counting model. Following this, Yao et al. [45] generated inputs from stationary cameras to perform background subtraction to train the model for the appearance and the foreground shape of the crowd in videos.

Once an object is detected, classification techniques can be applied to identify a human on the basis of shape, texture or motion-based features. In shape-based methods, the shape related information of moving regions such as points, boxes and blobs is determined to identify the human. This method performs poorly due to certain limitations in standard template-matching schemes [46], [47], and is further enhanced by applying a part-based template matching [48] approach. In another research, Dalal et al. [49] proposed texture-based schemes such as histograms of oriented gradients (HOG), which utilise high dimensional features based on edges along with the support vector machine (SVM) to detect humans.

According to recent research, further identification of a person through video surveillance can be done by using face [50], [51] and gait recognition [52] techniques. However, detection and tracking of people in a crowd are sometimes difficult due to partial or full occlusion problems. Leibe et al. [53] proposed a trajectory estimation based solution, while Andriluka et al. [54] proposed a solution to detect partially occluded people using tracklet-based detectors. Many other tracking techniques, including a variety of object and motion representations, are reviewed by Yilmaz et al. [55].

A large number of studies are available in the area of video surveillance. Among many publicly available datasets, the KTH human motion dataset [56] shows six categories of activities, whereas the INRIA XMAS multi-view dataset [57] and the Weizmann human action dataset [58] contain 11 and 10 categories of actions, respectively. Another dataset named performance evaluation of tracking and surveillance (PETS) is proposed by a group of researchers at the University of Oxford [59]. This dataset is available for vision based research, comprising a large number of datasets for varying tasks in the field of computer vision. In the present research, in order to fine-tune the object detection and tracking models for identifying the person, the open images datasets [60] are considered. It is a collection of 19,957 classes, out of which the models are trained for the identification of a person. The images are annotated with image-level labels and corresponding coordinates of the bounding boxes representing the person. Furthermore, the fine-tuned proposed framework is simulated on the Oxford town center surveillance footage [23] to monitor social distancing.

We believe that having a single dataset with unified annotations for image classification, object detection, visual relationship detection, instance segmentation, and multimodal image descriptions will enable us to study and perform object detection tasks efficiently and stimulate progress towards
genuine understanding of the scene. All explored literature and related research work clearly establish that the application of human detection can easily be extended to many applications to cater to the situation that arises presently, such as checking prescribed standards for hygiene, social distancing, work practices, etc.

III. OBJECT DETECTION AND TRACKING MODELS

As observed from Fig. 2, the successful object detection models like RCNN [61], fast RCNN [62], faster RCNN [38], SSD [63], YOLO v1 [40], YOLO v2 [64] and YOLO v3 [65], tested on PASCAL-VOC [66] and MS-COCO [67] datasets, undergo a trade-off between speed and accuracy of detection, which depends on various factors like backbone architecture (feature extraction network, e.g. VGG-16 [68], ResNet-101 [69], Inception v2 [70], etc.), input sizes, model depth, and varying software and hardware environments. A feature extractor encodes the model's input into a certain feature representation which aids in learning and discovering the patterns associated with the desired objects. In order to identify multiple objects of varying scale or size, it also uses predefined boxes covering an entire image, termed anchor boxes. Table I describes the performance in terms of accuracy for each of these popular and powerful feature extraction networks on the ILSVRC ImageNet challenge [71], along with the number of trainable parameters, which have a direct impact on the training speed and time. As highlighted in Table I, the ratio of accuracy to the number of parameters is highest for the Inception v2 model, indicating that Inception v2 achieved adequate classification accuracy with minimal trainable parameters in contrast to other models, and hence it is utilized as a backbone architecture for faster and efficient computations in the faster RCNN and SSD object detection models, whereas YOLO v3 uses a different architecture, Darknet-53, as proposed by Redmon et al. [65].

Fig. 2: Performance overview of the most popular object detection models on PASCAL-VOC and MS-COCO datasets.

A. Anchor boxes

From an exhaustive literature survey, it is observed that every popular object detection model utilizes the concept of anchor boxes to detect multiple objects in the scene [36]. These boxes are overlaid on the input image over various spatial locations (per filter) with varying sizes and aspect ratios. In this article, for an image of dimension breadth ($b$) × height ($h$), the anchor boxes are generated in the following manner. Consider the parameters size $p \in (0, 1]$ and aspect ratio $r > 0$; then the anchor boxes for a certain location in an image can be constructed with dimensions $bp\sqrt{r} \times hp\sqrt{r}$. Table II shows the values of $p$ and $r$ configured for each model.

Later, the object detection model is trained to predict, for each generated anchor box, the class it belongs to and an offset to adjust the dimensions of the anchor box to better fit the ground truth of the object, using the classification and regression losses. Since there are many anchor boxes for a spatial location, the object can get associated with more than one anchor box. This problem is dealt with by non-max suppression (NMS), which computes the intersection over union (IoU) parameter that limits the anchor boxes' association with the object of interest by calculating the score as the ratio of the overlapping region between the assigned anchor box and the ground truth to the union of the regions of the anchor box and the ground truth. The score value is then compared with the set threshold hyperparameter to return the best bounding box for an object.

1) Loss function: With each step of model training, the predicted anchor box 'a' is assigned a label as positive (1) or negative (0), based on its associativity with the object of interest having ground-truth box 'g'. The positive anchor box is then assigned a class label $y_o \in \{c_1, c_2, \ldots, c_n\}$, where $c_n$ indicates the category of the $n$th object, while also generating the encoding vector for box 'g' with respect to 'a' as $f(g_a|a)$, where $y_o = 0$ for negative anchor boxes. Consider an image $I$; for some anchor 'a', the model with trained parameters $\omega$ predicts the object class as $Y_{cls}(I|a; \omega)$ and the corresponding box offset as $Y_{reg}(I|a; \omega)$; then the loss for a single anchor prediction can be computed from the classification loss ($L_{cls}$) and the bounding box regression loss ($L_{reg}$), as given by Eq. 4.
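The IoU score and the NMS step described in the anchor box discussion can be sketched as follows. This is a minimal illustration, not the code used by any of the compared detectors. It assumes boxes in corner format (x1, y1, x2, y2) and the common greedy form of NMS, in which predicted boxes are ranked by confidence and a box is dropped when its IoU with an already kept box exceeds the threshold. The function names, the sample boxes and the 0.5 threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression; returns indices of the boxes that are kept."""
    # Visit boxes from the highest to the lowest confidence score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two boxes cover the same person.
boxes = [(10, 10, 50, 90), (12, 12, 52, 92), (100, 20, 140, 100)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy input the second box overlaps the first heavily and is suppressed, while the third box is disjoint from both and survives. The quadratic loop is kept for readability; production detectors typically use a vectorized routine instead.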
B. Faster RCNN

Proposed by Ren et al. [38], the faster RCNN is derived from its predecessors RCNN [61] and fast RCNN [62], which rely on an external region proposal approach based on selective search (SS) [73]. Many researchers [74], [75], [76] observed that instead of using SS, it is recommended to utilize the advantages of convolution layers for better and faster localization of the objects. Hence, Ren et al. proposed the Region Proposal Network (RPN), which uses CNN models, e.g. VGGNet, ResNet, etc., to generate the region proposals, which made faster RCNN 10 times faster than fast RCNN. Fig. 3 shows the schematic representation of the faster RCNN architecture, where the RPN module performs binary classification of an object or not an object (background), while the classification module assigns categories to each detected object (multi-class classification) by using region of interest (RoI) pooling [38] on the extracted feature maps with projected regions.

Fig. 3: Schematic representation of faster RCNN architecture

1) Loss function: The faster RCNN is the combination of two modules, RPN and the fast RCNN detector. The overall multi-task loss function is composed of classification loss and bounding box regression loss as defined in Eq. 4, with the $L_{cls}$ and $L_{reg}$ functions defined in Eq. 5:

$$
\begin{aligned}
L_{cls}(p_i, p_i^*) &= -p_i^* \log(p_i) - (1 - p_i^*) \log(1 - p_i) \\
L_{reg}(t^u, v) &= \sum_{i \in \{x, y, w, h\}} L_1^{smooth}(t_i^u - v_i) \\
L_1^{smooth}(q) &= \begin{cases} 0.5 q^2, & \text{if } |q| < 1 \\ |q| - 0.5, & \text{otherwise} \end{cases}
\end{aligned}
\tag{5}
$$

where $t^u$ is the predicted corrections of the bounding box, $t^u = \{t_x^u, t_y^u, t_w^u, t_h^u\}$. Here $u$ is a true class label, $(x, y)$ corresponds to the top-left coordinates of the bounding box with height $h$ and width $w$, $v$ is a ground-truth bounding box, $p_i$ is the predicted class probability and $p_i^*$ is the actual class label.

C. Single Shot Detector (SSD)

In this research, single shot detector (SSD) [63] is also used as another object identification method to detect people in a real-time video surveillance system. As discussed earlier, faster R-CNN works on region proposals to create boundary boxes to indicate objects and shows better accuracy, but has slow processing in frames per second (FPS). For real-time processing, SSD further improves the accuracy and FPS by using multi-scale features and default boxes in a single process. It follows the principle of the feed-forward convolution network, which generates bounding boxes of fixed sizes along with a score based on the presence of object class instances in those boxes, followed by an NMS step to produce the final detections. Thus, it consists of two steps: extracting feature maps and applying convolution filters to detect objects, using an architecture having three main parts. The first part is a base pretrained network to extract feature maps, whereas in the second part, multi-scale feature layers are used, in which a series of convolution filters are cascaded after the base network. The last part is a non-maximum suppression unit for eliminating overlapping boxes and keeping only one object per box. The architecture of SSD is shown in Fig. 4.

Fig. 4: Schematic representation of SSD architecture

1) Loss function: Similar to the above discussed faster RCNN model, the overall loss function of the SSD model is equal to the sum of multi-class classification loss ($L_{cls}$) and bounding box regression loss (localization loss, $L_{reg}$), as shown in Eq. 4, where $L_{reg}$ and $L_{cls}$ are defined by Eq. 6 and 7:

$$
\begin{aligned}
L_{reg}(x, l, g) &= \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p} \, \mathrm{smooth}_{L1}(l_i^m - \hat{g}_j^m), \\
\hat{g}_j^{cx} &= \frac{g_j^{cx} - a_i^{cx}}{a_i^{w}}, \quad \hat{g}_j^{cy} = \frac{g_j^{cy} - a_i^{cy}}{a_i^{h}}, \\
\hat{g}_j^{w} &= \log\left(\frac{g_j^{w}}{a_i^{w}}\right), \quad \hat{g}_j^{h} = \log\left(\frac{g_j^{h}}{a_i^{h}}\right), \\
x_{ij}^{p} &= \begin{cases} 1, & \text{if } IoU > 0.5 \\ 0, & \text{otherwise} \end{cases}
\end{aligned}
\tag{6}
$$

where $l$ is the predicted box, $g$ is the ground truth box, $x_{ij}^{p}$ is an indicator that matches the $i$th anchor box to the $j$th ground truth box, and $cx$ and $cy$ are offsets to the anchor box $a$.

$$
L_{cls}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{o}) \tag{7}
$$

where $\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$ and $N$ is the number of default matched boxes.

D. YOLO

For object detection, another competitor of SSD is YOLO [40]. This method can predict the type and location of an object by looking only once at the image. YOLO considers the object detection problem as a regression task instead of classification to assign class probabilities to the anchor boxes. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities. Majorly, there are three versions of YOLO: v1, v2 and v3. YOLO v1 is
inspired by GoogLeNet (Inception network), which is designed for object classification in an image. This network consists of 24 convolutional layers and 2 fully connected layers. Instead of the Inception modules used by GoogLeNet, YOLO v1 simply uses a reduction layer followed by convolutional layers. Later, YOLO v2 [64] was proposed with the objective of improving the accuracy significantly while making it faster. YOLO v2 uses Darknet-19 as a backbone network consisting of 19 convolution layers along with 5 max pooling layers and an output softmax layer for object classification. YOLO v2 outperformed its predecessor (YOLO v1) with significant improvements in mAP, FPS and object classification score. In contrast, YOLO v3 performs multi-label classification with the help of logistic classifiers instead of using softmax as in the case of YOLO v1 and v2. In YOLO v3, Redmon et al. proposed Darknet-53 as a backbone architecture that extracts feature maps for classification. In contrast to Darknet-19, Darknet-53 consists of residual blocks (short connections) along with upsampling layers for concatenation and added depth to the network. YOLO v3 generates three predictions for each spatial location at different scales in an image, which eliminates the problem of not being able to detect small objects efficiently [77]. Each prediction is monitored by computing objectness, boundary box regressor and classification scores. In Fig. 5 a schematic description of the YOLO v3 architecture is presented.

Fig. 5: Schematic representation of YOLO v3 architecture

1) Loss function: The overall loss function of YOLO v3 consists of localization loss (bounding box regressor), cross entropy and confidence loss for classification score, defined as follows:

$$
\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{i,j}^{obj} \left( (t_x - \hat{t}_x)^2 + (t_y - \hat{t}_y)^2 + (t_w - \hat{t}_w)^2 + (t_h - \hat{t}_h)^2 \right) \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{i,j}^{obj} \left( -\log(\sigma(t_o)) + \sum_{k=1}^{C} BCE(\hat{y}_k, \sigma(s_k)) \right) \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{i,j}^{noobj} \left( -\log(1 - \sigma(t_o)) \right)
\end{aligned}
\tag{8}
$$

where $\lambda_{coord}$ indicates the weight of the coordinate error, $S^2$ indicates the number of grids in the image, and $B$ is the number of generated bounding boxes per grid. $1_{i,j}^{obj} = 1$ describes that the object is confined in the $j$th bounding box in grid $i$; otherwise it is 0.

E. Deepsort

Deepsort is a deep learning based approach to track custom objects in a video [78]. In the present research, Deepsort is utilized to track individuals present in the surveillance footage. It makes use of patterns learned via detected objects in the images, which are later combined with temporal information for predicting associated trajectories of the objects of interest. It keeps track of each object under consideration by mapping unique identifiers for further statistical analysis. Deepsort is also useful for handling associated challenges such as occlusion, multiple viewpoints, non-stationary cameras and annotating training data. For effective tracking, the Kalman filter and the Hungarian algorithm are used. The Kalman filter is used recursively for better association, and it can predict future positions based on the current position. The Hungarian algorithm is used for association and ID attribution, identifying whether an object in the current frame is the same as the one in the previous frame. Initially, a faster RCNN is trained for person identification, and for tracking, a linear constant velocity model [79] is utilized to describe each target with an eight-dimensional space as follows:

$$
x = [u, v, \lambda, h, \dot{x}, \dot{y}, \dot{\lambda}, \dot{h}]^T \tag{9}
$$

where $(u, v)$ is the centroid of the bounding box, $\lambda$ is the aspect ratio and $h$ is the height of the bounding box. The other variables are the respective velocities of the variables. Later, the standard Kalman filter is used with constant velocity motion and a linear observation model, where the bounding coordinates $(u, v, \lambda, h)$ are taken as direct observations of the object state.

For each track $k$, starting from the last successful measurement association $a_k$, the total number of frames is counted. With each positive prediction from the Kalman filter, the counter is incremented, and later, when the track gets associated with a measurement, it resets its value to 0. Furthermore, if the identified tracks exceed a predefined maximum age, then those objects are considered to have left the scene and the corresponding track is removed from the track set. If there are no tracks available for some detected objects, then new track hypotheses are initiated for each unidentified track of novel detected objects that cannot be mapped to the existing tracks. For the first three frames the new tracks are classified as indefinite until a successful measurement mapping is computed. If the tracks are not successfully mapped with a measurement, then they are deleted from the track set. The Hungarian algorithm is then utilized in order to solve the mapping problem between the newly arrived measurements and the predicted Kalman states by considering the motion and appearance information with the help of the Mahalanobis distance computed between them, as defined in Eq. 10.

$$
d^{(1)}(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i) \tag{10}
$$

where the projection of the $i$th track distribution into measurement space is represented by $(y_i, S_i)$ and the $j$th bounding box detection by $d_j$. The Mahalanobis distance considers this uncertainty by estimating the count of standard deviations the detection is away from the mean track location. Further, using this metric, it is possible to exclude unlikely associations by thresholding the Mahalanobis distance. This decision is
denoted with an indicator that evaluates to 1 if the association between the $i$th track and $j$th detection is admissible (Eq. 11).

$$
b^{(1)}_{i,j} = 1[d^{(1)}(i, j) < t^{(1)}] \tag{11}
$$

Although the Mahalanobis distance performs efficiently, it fails in environments where camera motion is possible; therefore another metric is introduced for the assignment problem. This second metric measures the smallest cosine distance between the $i$th track and $j$th detection in appearance space as follows:

$$
d^{(2)}(i, j) = \min\{1 - r_j^T r_k^{(i)} \mid r_k^{(i)} \in \mathcal{R}_i\} \tag{12}
$$

Again, a binary variable is introduced to indicate if an association is admissible according to the following metric:

$$
b^{(2)}_{i,j} = 1[d^{(2)}(i, j) < t^{(2)}] \tag{13}
$$

and a suitable threshold is measured for this indicator on a separate training dataset. To build the association problem, both metrics are combined using a weighted sum:

$$
c_{i,j} = \lambda d^{(1)}(i, j) + (1 - \lambda) d^{(2)}(i, j) \tag{14}
$$

where an association is admissible if it is within the gating region of both metrics:

$$
b_{i,j} = \prod_{m=1}^{2} b^{(m)}_{i,j}. \tag{15}
$$

The influence of each metric on the combined association cost can be controlled through the hyperparameter $\lambda$.

IV. PROPOSED APPROACH

The emergence of deep learning has brought the best performing techniques for a wide variety of tasks and challenges including medical diagnosis [74], machine translation [75], speech recognition [76], and a lot more [80]. Most of these tasks are centred around object classification, detection, segmentation, tracking, and recognition [81], [82]. In recent years, convolution neural network (CNN) based architectures have shown significant performance improvements that are leading towards high quality object detection, as shown in Fig. 2, which presents the performance of such models in terms of mAP and FPS on standard benchmark datasets, PASCAL-VOC [66] and MS-COCO [67], and similar hardware resources.

In the present article, a deep learning based framework is proposed that utilizes object detection and tracking models to aid in the social distancing remedy for dealing with the escalation of COVID-19 cases. In order to maintain the balance of speed and accuracy, YOLO v3 [65] alongside Deepsort [78] are utilized as the object detection and tracking approaches, surrounding each detected object with a bounding box. Later, these bounding boxes are utilized to compute the pairwise L2 norm with a computationally efficient vectorized representation for identifying the clusters of people not obeying the order of social distancing. Furthermore, to visualize the clusters in the live stream, each bounding box is color-coded based on its association with the group, where people belonging to the same group are represented with the same color. Each surveillance frame is also accompanied with the streamline plot depicting the statistical count of the number of social groups and an index term (violation index) representing the ratio of the number of people to the number of groups. Furthermore, estimated violations can be computed by multiplying the violation index with the total number of social groups.

A. Workflow

This section includes the necessary steps undertaken to compose a framework for monitoring social distancing.

1. Fine-tune the trained object detection model to identify and track the persons in the footage.
2. The trained model is fed with the surveillance footage. The model generates a set of bounding boxes and an ID for each identified person.
3. Each individual is associated with a three-dimensional feature space $(x, y, d)$, where $(x, y)$ corresponds to the centroid coordinates of the bounding box and $d$ defines the depth of the individual as observed from the camera.

$$
d = \frac{2 \times 3.14 \times 180}{w + h \times 360} \times 1000 + 3 \tag{16}
$$

   where $w$ is the width of the bounding box and $h$ is the height of the bounding box [83].
4. For the set of bounding boxes, the pairwise L2 norm is computed as given by the following equation.

$$
\|D\|_2 = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \tag{17}
$$

   where in this work $n = 3$.
5. The dense matrix of L2 norms is then utilized to assign the neighbors for each individual that satisfies the closeness sensitivity. With extensive trials the closeness threshold is updated dynamically based on the spatial location of the person in a given frame, ranging between (90, 170) pixels.
6. Any individual that meets the closeness property is assigned a neighbour or neighbours, forming a group represented in a different color coding in contrast to other people.
7. The formation of groups indicates the violation of the practice of social distancing, which is quantified with the help of the following:
   – Consider $n_g$ as the number of groups or clusters identified, and $n_p$ as the total number of people found in close proximity.
   – $v_i = n_p / n_g$, where $v_i$ is the violation index.

V. EXPERIMENTS AND RESULTS

The above discussed object detection models are fine tuned for binary classification (person or not a person) with Inception v2 as a backbone network on the Nvidia GTX 1060 GPU, using the dataset acquired from the open image dataset (OID) repository [73] maintained by the Google open source community. The diverse images with a class label as Person are
Fig. 6: Data samples showing (a) true samples and (b) false samples of a “Person” class from the open image dataset.

Fig. 7: Losses per iteration of the object detection models during the training phase on the OID validation set for detecting the person in an image.

Fig. 8: Sample output of the proposed framework for monitoring social distancing on surveillance footage of Oxford Town Center.
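Workflow steps 3 to 7 of Section IV-A can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `(x_min, y_min, width, height)` box layout, the single fixed threshold (the paper adjusts it dynamically between 90 and 170 pixels), and the grouping by connected components are all assumptions made for illustration.

```python
import numpy as np

def depth_from_box(w, h):
    # Eq. 16 as printed in the text (heuristic credited to [83]).
    return (2 * 3.14 * 180) / (w + h * 360) * 1000 + 3

def features(boxes):
    # boxes: (N, 4) rows of (x_min, y_min, width, height); hypothetical layout.
    boxes = np.asarray(boxes, dtype=float)
    cx = boxes[:, 0] + boxes[:, 2] / 2          # centroid x
    cy = boxes[:, 1] + boxes[:, 3] / 2          # centroid y
    d = depth_from_box(boxes[:, 2], boxes[:, 3])
    return np.stack([cx, cy, d], axis=1)        # (N, 3) feature space (x, y, d)

def pairwise_l2(feats):
    # Vectorized Eq. 17 with n = 3: broadcasting yields an (N, N) distance matrix.
    diff = feats[:, None, :] - feats[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def find_groups(dist, threshold):
    # Assumption: a group is a connected component of the "closer than
    # threshold" graph; people with no close neighbour form no group.
    n = dist.shape[0]
    close = (dist < threshold) & ~np.eye(n, dtype=bool)
    visited = np.zeros(n, dtype=bool)
    groups = []
    for start in range(n):
        if visited[start] or not close[start].any():
            continue
        stack, members = [start], []
        visited[start] = True
        while stack:
            i = stack.pop()
            members.append(int(i))
            for j in np.flatnonzero(close[i]):
                if not visited[j]:
                    visited[j] = True
                    stack.append(j)
        groups.append(sorted(members))
    return groups

def violation_index(groups):
    # v_i = n_p / n_g: people in close proximity divided by number of groups.
    n_g = len(groups)
    n_p = sum(len(g) for g in groups)
    return n_p / n_g if n_g else 0.0
```

For a frame, one would call `features` on the detected boxes, pass the result to `pairwise_l2`, and feed the distance matrix with a chosen threshold to `find_groups`; `violation_index` then gives the ratio described in step 7. Because Eq. 16 mixes pixel coordinates with a heuristic depth value, the threshold would need tuning per camera.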
downloaded via OIDv4 toolkit [84] along with the annotations. Fig. 6 shows sample images of the obtained dataset, consisting of 800 images obtained by manually filtering to only contain the true samples. The dataset is then divided into training and testing sets in an 8:2 ratio. In order to make the testing robust, the testing set is also accompanied by the frames of surveillance footage of the Oxford town center [23]. Later this footage is also utilized to simulate the overall approach for monitoring the social distancing. In the case of faster RCNN, the images are resized to P pixels on the shorter edge, with 600 and 1024 for low and high resolution, while in SSD and YOLO the images are scaled to the fixed dimension P × P with P value as 416. During the training phase, the performance of the models is continuously monitored using the mAP along with the localization, classification and overall loss in the detection of the person, as indicated in Fig. 7. Table III summarizes the results of each model obtained at the end of the training phase with the training time (TT), number of iterations (NoI), mAP, and total loss (TL) value. It is observed that the faster RCNN model achieved minimal loss with maximum mAP; however, it has the lowest FPS, which makes it unsuitable for real-time applications. Furthermore, as compared to SSD, YOLO v3 achieved better results with balanced mAP, training time, and FPS score. The trained YOLO v3 model is then utilized for monitoring the social distancing on the surveillance video.

VI. OUTPUT

The proposed framework outputs (as shown in Fig. 8) the processed frame with the identified people confined in the bounding boxes, while also simulating the statistical analysis showing the total number of social groups displayed by the same color encoding and a violation index term computed as the ratio of the number of people to the number of groups. The frames shown in Fig. 8 display violation indices of 3, 2, 2, and 2.33. The frames with detected violations are recorded with the timestamp for future analysis.

VII. FUTURE SCOPE AND CHALLENGES

Since this application is intended to be used in any working environment, accuracy and precision are highly desired to serve the purpose. A higher number of false positives may cause discomfort and panic among the people being observed. There may also be genuinely raised concerns about privacy and individual rights, which can be addressed with some additional measures such as prior consent for such working environments, hiding a person's identity in general, and maintaining transparency about its fair uses within limited stakeholders.

VIII. CONCLUSION

The article proposes an efficient real-time deep learning based framework to automate the process of monitoring social distancing via object detection and tracking approaches, where each individual is identified in real time with the help of bounding boxes. The generated bounding boxes aid in identifying the clusters or groups of people satisfying the closeness property computed with the help of the pairwise vectorized approach. The number of violations is confirmed
by computing the number of groups formed and the violation index term, computed as the ratio of the number of people to the number of groups. Extensive trials were conducted with popular state-of-the-art object detection models: Faster RCNN, SSD, and YOLO v3, where YOLO v3 showed efficient performance with balanced FPS and mAP scores. Since this approach is highly sensitive to the spatial location of the camera, the same approach can be fine tuned to better adjust to the corresponding field of view.

ACKNOWLEDGMENT

The authors gratefully acknowledge the helpful comments and suggestions of colleagues. The authors are also indebted to the Interdisciplinary Cyber Physical Systems (ICPS) Programme, Department of Science and Technology (DST), Government of India (GoI) vide Reference No.244 for their financial support to carry out the background research, which helped significantly in the implementation of the present research work.

REFERENCES

[1] W. H. Organization, “WHO corona-viruses (COVID-19),” https://www.who.int/emergencies/diseases/novel-corona-virus-2019, 2020, [Online; accessed May 02, 2020].
[2] WHO, “Who director-generals opening remarks at the media briefing on covid-19-11 march 2020.” https://www.who.int/dg/speeches/detail/, 2020, [Online; accessed March 12, 2020].
[3] L. Hensley, “Social distancing is out, physical distancing is in - here's how to do it,” Global News–Canada (27 March 2020), 2020.
[4] ECDPC, “Considerations relating to social distancing measures in response to COVID-19 second update,” https://www.ecdc.europa.eu/en/publications-data/considerations, 2020, [Online; accessed March 23, 2020].
[5] M. W. Fong, H. Gao, J. Y. Wong, J. Xiao, E. Y. Shiu, S. Ryu, and B. J. Cowling, “Nonpharmaceutical measures for pandemic influenza in nonhealthcare settings - social distancing measures,” 2020.
[6] F. Ahmed, N. Zviedrite, and A. Uzicanin, “Effectiveness of workplace social distancing measures in reducing influenza transmission: a systematic review,” BMC public health, vol. 18, no. 1, p. 518, 2018.
[7] W. O. Kermack and A. G. McKendrick, “Contributions to the mathematical theory of epidemics–i. 1927.” 1991.
[8] C. Eksin, K. Paarporn, and J. S. Weitz, “Systematic biases in disease forecasting–the role of behavior change,” Epidemics, vol. 27, pp. 96–105, 2019.
[9] M. Zhao and H. Zhao, “Asymptotic behavior of global positive solution to a stochastic sir model incorporating media coverage,” Advances in Difference Equations, vol. 2016, no. 1, pp. 1–17, 2016.
[10] P. Alto, “Landing AI Named an April 2020 Cool Vendor in the Gartner Cool Vendors in AI Core Technologies,” https://www.yahoo.com/lifestyle/landing-ai-named-april-2020-152100532.html, 2020, [Online; accessed April 21, 2020].
[11] A. Y. Ng, “Curriculum Vitae,” https://ai.stanford.edu/~ang/curriculum-vitae.pdf.
[12] L. AI, “Landing AI Named an April 2020 Cool Vendor in the Gartner Cool Vendors in AI Core Technologies,” https://www.prnewswire.com/news-releases/, 2020, [Online; accessed April 22, 2020].
[13] B. News, “China coronavirus: Lockdown measures rise across Hubei province,” https://www.bbc.co.uk/news/world-asia-china51217455, 2020, [Online; accessed January 23, 2020].
[14] N. H. C. of the Peoples Republic of China, “Daily briefing on novel coronavirus cases in China,” http://en.nhc.gov.cn/2020-03/20/c_78006.htm, 2020, [Online; accessed March 20, 2020].
[15] K. Prem, Y. Liu, T. W. Russell, A. J. Kucharski, R. M. Eggo, N. Davies, S. Flasche, S. Clifford, C. A. Pearson, J. D. Munday et al., “The effect of control strategies to reduce social mixing on outcomes of the covid-19 epidemic in wuhan, china: a modelling study,” The Lancet Public Health, 2020.
[16] C. Adolph, K. Amano, B. Bang-Jensen, N. Fullman, and J. Wilkerson, “Pandemic politics: Timing state-level social distancing responses to covid-19,” medRxiv, 2020.
[17] K. E. Ainslie, C. E. Walters, H. Fu, S. Bhatia, H. Wang, X. Xi, M. Baguelin, S. Bhatt, A. Boonyasiri, O. Boyd et al., “Evidence of initial success for china exiting covid-19 social distancing policy after achieving containment,” Wellcome Open Research, vol. 5, no. 81, p. 81, 2020.
[18] S. K. Sonbhadra, S. Agarwal, and P. Nagabhushan, “Target specific mining of covid-19 scholarly articles using one-class approach,” 2020.
[19] N. S. Punn and S. Agarwal, “Automated diagnosis of covid-19 with limited posteroanterior chest x-ray images using fine-tuned deep neural networks,” 2020.
[20] N. S. Punn, S. K. Sonbhadra, and S. Agarwal, “Covid-19 epidemic analysis using machine learning and deep learning algorithms,” medRxiv, 2020. [Online]. Available: https://www.medrxiv.org/content/early/2020/04/11/2020.04.08.20057679
[21] O. website of Indian Government, “Distribution of the novel coronavirus-infected pneumoni Aarogya Setu Mobile App,” https://www.mygov.in/aarogya-setu-app/, 2020.
[22] M. Robakowska, A. Tyranska-Fobke, J. Nowak, D. Slezak, P. Zuratynski, P. Robakowski, K. Nadolny, and J. R. Ładny, “The use of drones during mass events,” Disaster and Emergency Medicine Journal, vol. 2, no. 3, pp. 129–134, 2017.
[23] N. S. Punn and S. Agarwal, “Crowd analysis for congestion control early warning system on foot over bridge,” in 2019 Twelfth International Conference on Contemporary Computing (IC3), 2019, pp. 1–6.
[24] N. Sulman, T. Sanocki, D. Goldgof, and R. Kasturi, “How effective is human video surveillance performance?” in 2008 19th International Conference on Pattern Recognition. IEEE, 2008, pp. 1–3.
[25] X. Wang, “Intelligent multi-camera video surveillance: A review,” Pattern recognition letters, vol. 34, no. 1, pp. 3–19, 2013.
[26] K. A. Joshi and D. G. Thakore, “A survey on moving object detection and tracking in video surveillance system,” International Journal of Soft Computing and Engineering, vol. 2, no. 3, pp. 44–48, 2012.
[27] O. Javed and M. Shah, “Tracking and object classification for automated surveillance,” in European Conference on Computer Vision. Springer, 2002, pp. 343–357.
[28] S. Brutzer, B. Höferlin, and G. Heidemann, “Evaluation of background subtraction techniques for video surveillance,” in CVPR 2011. IEEE, 2011, pp. 1937–1944.
[29] S. Aslani and H. Mahdavi-Nasab, “Optical flow based moving object detection and tracking for traffic surveillance,” International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering, vol. 7, no. 9, pp. 1252–1256, 2013.
[30] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse spatio-temporal features,” in 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance. IEEE, 2005, pp. 65–72.
[31] M. Piccardi, “Background subtraction techniques: a review,” in 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), vol. 4. IEEE, 2004, pp. 3099–3104.
[32] Y. Xu, J. Dong, B. Zhang, and D. Xu, “Background modeling methods in video analysis: A review and comparative evaluation,” CAAI Transactions on Intelligence Technology, vol. 1, no. 1, pp. 43–60, 2016.
[33] H. Tsutsui, J. Miura, and Y. Shirai, “Optical flow-based person tracking by multiple cameras,” in Conference Documentation International Conference on Multisensor Fusion and Integration for Intelligent Systems. MFI 2001 (Cat. No. 01TH8590). IEEE, 2001, pp. 91–96.
[34] A. Agarwal, S. Gupta, and D. K. Singh, “Review of optical flow technique for moving object detection,” in 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I). IEEE, 2016, pp. 409–413.
[35] S. A. Niyogi and E. H. Adelson, “Analyzing gait with spatiotemporal surfaces,” in Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects. IEEE, 1994, pp. 64–69.
[36] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, “Object detection with deep learning: A review,” IEEE transactions on neural networks and learning systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[38] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
[39] X. Chen and A. Gupta, “An implementation of faster rcnn with study for region sampling,” arXiv preprint arXiv:1702.02138, 2017.
[40] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[41] M. Putra, Z. Yussof, K. Lim, and S. Salim, “Convolutional neural network for person and car detection using yolo framework,” Journal of Telecommunication, Electronic and Computer Engineering (JTEC), vol. 10, no. 1-7, pp. 67–71, 2018.
[42] R. Eshel and Y. Moses, “Homography based multiple camera detection and tracking of people in a dense crowd,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[43] D.-Y. Chen, C.-W. Su, Y.-C. Zeng, S.-W. Sun, W.-R. Lai, and H.-Y. M. Liao, “An online people counting system for electronic advertising machines,” in 2009 IEEE International Conference on Multimedia and Expo. IEEE, 2009, pp. 1262–1265.
[44] C.-W. Su, H.-Y. M. Liao, and H.-R. Tyan, “A vision-based people counting approach based on the symmetry measure,” in 2009 IEEE International Symposium on Circuits and Systems. IEEE, 2009, pp. 2617–2620.
[45] J. Yao and J.-M. Odobez, “Fast human detection from joint appearance and foreground feature subset covariances,” Computer Vision and Image Understanding, vol. 115, no. 10, pp. 1414–1426, 2011.
[46] B. Wu and R. Nevatia, “Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors,” International Journal of Computer Vision, vol. 75, no. 2, pp. 247–266, 2007.
[47] F. Z. Eishita, A. Rahman, S. A. Azad, and A. Rahman, “Occlusion handling in object detection,” in Multidisciplinary Computational Intelligence Techniques: Applications in Business, Engineering, and Medicine. IGI Global, 2012, pp. 61–74.
[48] M. Singh, A. Basu, and M. K. Mandal, “Human activity recognition based on silhouette directionality,” IEEE transactions on circuits and systems for video technology, vol. 18, no. 9, pp. 1280–1292, 2008.
[49] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 1. IEEE, 2005, pp. 886–893.
[50] P. Huang, A. Hilton, and J. Starck, “Shape similarity for 3d video sequences of people,” International Journal of Computer Vision, vol. 89, no. 2-3, pp. 362–381, 2010.
[51] A. Samal and P. A. Iyengar, “Automatic recognition and analysis of human faces and facial expressions: A survey,” Pattern recognition, vol. 25, no. 1, pp. 65–77, 1992.
[52] D. Cunado, M. S. Nixon, and J. N. Carter, “Using gait as a biometric, via phase-weighted magnitude spectra,” in International Conference on Audio-and Video-Based Biometric Person Authentication. Springer, 1997, pp. 93–102.
[53] B. Leibe, E. Seemann, and B. Schiele, “Pedestrian detection in crowded scenes,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1. IEEE, 2005, pp. 878–885.
[54] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-detection-by-tracking,” in 2008 IEEE Conference on computer vision and pattern recognition. IEEE, 2008, pp. 1–8.
[55] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” Acm computing surveys (CSUR), vol. 38, no. 4, pp. 13–es, 2006.
[56] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: a local svm approach,” in Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., vol. 3. IEEE, 2004, pp. 32–36.
[57] D. Weinland, R. Ronfard, and E. Boyer, “Free viewpoint action recognition using motion history volumes,” Computer vision and image understanding, vol. 104, no. 2-3, pp. 249–257, 2006.
[58] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, vol. 2. IEEE, 2005, pp. 1395–1402.
[59] O. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “The oxford-iiit pet dataset,” 2012.
[60] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., “The open images dataset v4,” International Journal of Computer Vision, pp. 1–26, 2020.
[61] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[62] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[63] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
[64] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
[65] J. R. A. Farhadi and J. Redmon, “Yolov3: An incremental improvement,” Retrieved September, vol. 17, p. 2018, 2018.
[66] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
[67] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
[68] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[69] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[70] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
[71] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
[72] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-first AAAI conference on artificial intelligence, 2017.
[73] Google, “Open image dataset v6,” https://storage.googleapis.com/openimages/web/index.html, 2020, [Online; accessed 25-February-2020].
[74] N. S. Punn and S. Agarwal, “Inception u-net architecture for semantic segmentation to identify nuclei in microscopy cell images,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 1, pp. 1–15, 2020.
[75] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar et al., “Tensor2tensor for neural machine translation,” arXiv preprint arXiv:1803.07416, 2018.
[76] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International conference on machine learning, 2016, pp. 173–182.
[77] A. Sonawane, “Medium : YOLOv3: A Huge Improvement,” https://medium.com/@anand_sonawane/yolo3, 2020, [Online; accessed Dec 6, 2019].
[78] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE international conference on image processing (ICIP). IEEE, 2017, pp. 3645–3649.
[79] N. Wojke and A. Bewley, “Deep cosine metric learning for person re-identification,” in 2018 IEEE winter conference on applications of computer vision (WACV). IEEE, 2018, pp. 748–756.
[80] S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. P. Reyes, M.-L. Shyu, S.-C. Chen, and S. Iyengar, “A survey on deep learning: Algorithms, techniques, and applications,” ACM Computing Surveys (CSUR), vol. 51, no. 5, pp. 1–36, 2018.
[81] A. Brunetti, D. Buongiorno, G. F. Trotta, and V. Bevilacqua, “Computer vision and deep learning techniques for pedestrian detection and tracking: A survey,” Neurocomputing, vol. 300, pp. 17–33, 2018.
[82] N. S. Punn and S. Agarwal, “Crowd analysis for congestion control early warning system on foot over bridge,” in 2019 Twelfth International Conference on Contemporary Computing (IC3). IEEE, 2019, pp. 1–6.
[83] Pias, “Object detection and distance measurement,” https://github.com/paul-pias/Object-Detection-and-Distance-Measurement, 2020, [Online; accessed 01-March-2020].
[84] J. Harvey, Adam. LaPlace. (2019) Megapixels: Origins, ethics, and privacy implications of publicly available face recognition image datasets. [Online]. Available: https://megapixels.cc/