Deep Edge Computing For Videos
ABSTRACT This paper provides a modular architecture with deep neural networks as a solution for real-time video analytics in an edge-computing environment. The modular architecture consists of two networks, a Front-CNN (Convolutional Neural Network) and a Back-CNN, where we adopt a Shallow 3D CNN (S3D) as the Front-CNN and a pre-trained 2D CNN as the Back-CNN. The S3D (i.e., the Front-CNN) is in charge of condensing a sequence of video frames into a feature map with three channels. That is, the S3D takes a set of sequential frames in the video shot as input and yields a learned 3-channel feature map (3CFM) as output. Since the 3CFM is compatible with the three-channel RGB color image format, we can use the output of the S3D (i.e., the 3CFM) as the input to a pre-trained 2D CNN of the Back-CNN for transfer learning. This serial connection of the Front-CNN and the Back-CNN is end-to-end trainable and learns both the spatial and the temporal information of videos. Experimental results on the public UCF-Crime and UR-Fall Detection datasets show that the proposed S3D-2DCNN model outperforms the existing methods and achieves state-of-the-art performance. Moreover, since the Front-CNN and Back-CNN modules are a shallow S3D and a lightweight 2D CNN, respectively, the model is suitable for real-time video recognition in edge-computing environments. We have implemented our CNN model on the NVIDIA Jetson Nano Developer Kit as an edge-computing device to demonstrate its real-time execution.
INDEX TERMS Edge Computing, CNN, IoT, Anomaly Detection, Video Recognition.
information. Since the dimension of a 2D CNN with the SG3I is only of R^(c×d×d), without an extension to the time domain, it is much lighter than C3D and I3D. Also, with the SG3I, no optical flow computations are required, enabling fast training and testing for real-time applications.

Recently, the integration between IoT (Internet of Things) and video surveillance has led to a steep rise in the demand for IP (Internet Protocol) cameras. However, since IP cameras have limited computing power, it may be necessary to send the video to the cloud in order to execute a C3D neural network for the recognition task (see Fig. 1-A). In this case, the network traffic may hamper the timely detection of an anomaly at the cloud. To solve this problem, we can cut the amount of video data at the edge-computing device before the transmission. For example, we can adopt the SG3I scheme [10] at the edge-computing side, which needs only a simple process of frame selection in a video shot. However, as addressed in [10], multiple SG3Is are required at the inference step to guarantee results comparable to the state-of-the-art performance. Specifically, in [10], the CNN outputs for all 10 SG3I inputs were fused to make the final decision for each video clip. So, as shown in Fig. 1-B, this requires sending multiple SG3Is to the cloud, which may also cause some delays in the recognition process. Certainly, the best solution is to complete the anomaly detection at the edge-side (see Fig. 1-C and 1-D). To this end, in this paper, we propose a joint CNN combining a shallow 3D CNN, which has fewer convolutional layers and network parameters compared to conventional deep neural networks [11], with a lightweight 2D CNN for fast anomaly detection in the edge-computing environment.

Since a typical CNN model has stacked layers of sub-networks, we may separate a trained CNN into two parts, a Front-CNN and a Back-CNN. Also, we can combine two CNNs as the Front-CNN and the Back-CNN, where the Front-CNN is in charge of the pre-processing for the input of the Back-CNN. In this paper, as the Front-CNN, we use a Shallow 3D CNN (S3D), which is trained to condense multiple video frames into a single frame with three channels of feature maps. As a result, the amount of video data is reduced and the output of the S3D becomes compatible with the input of conventional 2D CNNs. This allows us to use any pre-trained CNN for the Back-CNN with no optical flow computations, enabling fast video recognition. Specifically, as shown in Fig. 1-D, we can embed the S3D (Shallow 3D) CNN into an edge-computing device, where multiple video frames are fed into the S3D to produce a learned 3-channel feature map (3CFM) as the output. Note that, like the SG3I, the 3CFM fits the input format of a pre-trained CNN with three RGB channels. This enables us to fine-tune any pre-trained 2D CNN for video recognition problems without resorting to a C3D neural network.

The contributions of this paper are summarized as follows.
1) We propose a Shallow 3D CNN (S3D) as the Front-CNN. The S3D is trained to condense multiple video frames into a single 3-channel feature map (3CFM).
2) Treating the 3CFM as an image with RGB channels, we use the 3CFM as the input of a pre-trained 2D CNN. This naturally forms a cascade network with the Front-CNN (S3D) and the Back-CNN (2D CNN), solving a video recognition problem without 3D filters and optical flow computations.
3) We can use the S3D as a stand-alone network to condense multiple video frames, reducing the transmission cost in a client-server framework (a back-of-envelope sketch follows this list).
4) We have evaluated the real-time performance of our cascade S3D-MobileNet network on the NVIDIA Jetson Nano Developer Kit.
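As a back-of-envelope illustration of the data reduction behind contribution 3, the snippet below compares the raw size of one sub-shot with that of a single 3CFM. Only T = 16 is taken from the paper; the 224 × 224 resolution, the byte-per-value assumption, and the absence of a video codec are our simplifications.

```python
# Back-of-envelope: raw bytes sent with and without condensing a sub-shot into a 3CFM.
# Assumptions: T = 16 frames (as in the experiments), 224x224 resolution (ours),
# 1 byte per value, and no video codec.
T, W, H, C = 16, 224, 224, 3

clip_bytes = T * W * H * C      # sending the raw frames of one sub-shot
cfm_bytes = 1 * W * H * C       # sending only the single 3-channel feature map

print(clip_bytes // 1024, "KiB vs", cfm_bytes // 1024, "KiB")  # 2352 KiB vs 147 KiB
print("reduction factor:", clip_bytes // cfm_bytes)            # 16
```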
II. BACKGROUND AND RELATED WORK
A. EDGE COMPUTING
Traditional video surveillance systems [12] demand human intervention to some extent. However, as the number of IP cameras increases explosively, a fully automatic video recognition framework becomes essential, replacing manual monitoring. Many algorithms [4], [5], [13]-[15] have been developed to handle vast amounts of data automatically. These algorithms can be used for video recognition in a cloud server. As an example, violence detection [16] was performed by transmitting the video data obtained from a drone camera to the cloud server. Also, by transmitting road video obtained from a camera to the cloud server, the license plate of a vehicle was extracted [17].

In the above scenarios, the video data captured by the camera are transmitted to the cloud server for the entire recognition process, which may hamper real-time video recognition due to transmission delays through the communication channel. Alternatively, to send only key information, a simple pre-processing technique can be applied to the video acquired from the camera before transmitting it to the cloud server. SWEETCAM, proposed in [18], has an image processing module in the camera and can perform pre-processing tasks such as background subtraction, contour detection, and object classification. Also, one can determine occupancy from the data acquired from multiple cameras using a local binary pattern with a support vector machine classifier [19].

Video pre-processing tasks can be done at the edge-computing device [20], [21], which is located in-between the camera and the cloud server (see Fig. 2). Without video pre-processing at the edge-side, the cloud server has to take all computational loads. Therefore, the purpose of the video pre-processing is to reduce not only the burden of data transmission but also the computational load at the cloud. Since the edge-computing device is inferior to the cloud in terms of computing power, we can embed only a lightweight DNN at the edge-side to maximize the effect of the pre-processing tasks [22], [23].

B. NEURAL NETWORKS FOR VIDEOS
Recently, a remarkable performance improvement has been achieved by applying DNNs for video recognition prob-
FIGURE 1. Four scenarios of incorporating neural networks in an edge-computing environment. A: 3D CNN at the cloud. B: Multiple SG3Is formed at the edge-side and 2D CNN at the cloud. C: A single SG3I formed and 2D CNN at the edge-side. D: A single 3CFM formed by the S3D and 2D CNN at the edge-side.
Back-CNN for transfer-learning. So, the Front-CNN of the S3D CNN enables us to use any pre-trained 2D CNN for the video recognition problem via transfer learning. Note that the Front-CNN can be used stand-alone, or it can be connected with the Back-CNN as a single network for end-to-end training.

We introduce two user scenarios for utilizing the Front-CNN as a stand-alone network. First, we can train a classifier with the output of the Front-CNN (i.e., the 3CFM) to pre-screen videos. Upon detecting any meaningful motion, we can feed the 3CFM into the input of the Back-CNN to obtain more detailed information. So, the additional work of the Back-CNN is required only when meaningful motion is detected. In this way, we can avoid unnecessary computations at the edge-computing side. As a second scenario, we can use the Front-CNN as a means of compressing video data. Since multiple video frames can be condensed into only 1 frame with 3 channels by the Front-CNN, we can reduce the transmission cost in a client-server computing environment. In this scenario, the video data captured at the camera are transmitted to the server for video analytics, and the role of the Front-CNN is to reduce the amount of data at the client-side.

In the following sub-sections, we explain the Front-CNN, the Back-CNN, and their serial connection for the end-to-end training in more detail.

A. FRONT-CNN FOR VIDEO CONDENSATION
In this subsection, we introduce the Front-CNN that can condense a video shot with multiple frames into only three channels of feature maps using the S3D. The main element of our Front-CNN is the 3D convolution filter, which can deal with multiple video frames. As shown in Fig. 3, the Front-CNN receives video frames of W × H × 3@T as input, which are sampled from a video shot of w × h × 3@t frames, with t ≥ T, w ≠ W, and h ≠ H, and resized. Passing through the three 3D convolution layers, the S3D of the Front-CNN outputs a 3CFM of W × H × 3@1, where the T frames are now condensed into only 1 frame. The specific elements of our S3D are listed in Table 1. As shown in the table, to reduce the number of frames of the input video gradually, we set the numbers of intermediate frames, Tout1 and Tout2, such that T > Tout1 > Tout2 > 1, where Tout1 and Tout2 should be factors of T.

The output of the S3D (i.e., the 3CFM) has only three channels of feature maps, which are compatible with the RGB channels of a color image. This implies that the 3CFM can be used as the input of any pre-trained 2D CNN. So, we can directly connect the S3D of the Front-CNN and the pre-trained 2D CNN of the Back-CNN for the end-to-end fine-tuning. Note that the network parameters in both the S3D (i.e., the Front-CNN) and the last layers of the pre-trained CNN (i.e., the Back-CNN) are updated during the fine-tuning. That is, the network parameters in the early layers of the pre-trained 2D CNN are fixed without updating. Then, since the Back-CNN has been pre-trained on general 2D images but not on feature maps like the 3CFM, the 3CFM input to the Back-CNN may deteriorate the performance of the end-to-end fine-tuning. To solve this problem, we employ a skip connection in the Front-CNN, which adds the flavor of a color image selected from the video frames to the feature map of the 3CFM (see Fig. 3). Specifically, for a sequence of input frames X = {X1, X2, ..., XT}, we have the input of the Back-CNN, Y(Xi), as follows:

Y(Xi) = f(X1, X2, ..., XT) + Xi,   (1)

where f denotes the output of the S3D and Xi, i ∈ {1, ..., T}, is a randomly selected frame from X. So, the input of the Back-CNN is the sum of the 3CFM and a selected RGB frame in the video shot.

Fig. 4 shows some examples of the 3CFM and Y(Xi). Rows 1 through 3 are video shots of Abuse007_x264, Arrest002_x264, and Shooting002_x264, respectively, in the UCF-Crime dataset. As shown in the figures, the regions of interest (RoI) for the anomalies tend to be ruddy in the 3CFM. Then, more specific regions become brighter in Y(Xi), which is expected to help the following Back-CNN learn the anomalies.

TABLE 1. Details of the Shallow 3D convolutional neural network (S3D CNN) architecture used in the Front-CNN. In our S3D, with the strides of 4, 2, and 2 in order, we set T = 16, Tout1 = 4, and Tout2 = 2.

Layers        Number of Filters   Filter Size   Stride                 Output Dimension
Convolution   16                  3×3×3         (T/Tout1)×1×1          W×H×3@Tout1
BatchNorm
ReLU
Convolution   32                  3×3×3         (Tout1/Tout2)×1×1      W×H×3@Tout2
BatchNorm
ReLU
Convolution   3                   3×3×3         Tout2×1×1              W×H×3@1
BatchNorm
ReLU
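To make the Front-CNN concrete, the following is a minimal PyTorch-style sketch of the S3D in Table 1 combined with the skip connection of Eq. (1). It is an illustration under assumptions rather than the authors' implementation: the framework, the padding, the 224 × 224 input size, and the reading of the filter counts (16, 32, 3) in Table 1 as the intermediate channel widths are ours.

```python
import torch
import torch.nn as nn

class S3D(nn.Module):
    """Shallow 3D CNN: condenses a clip (N, 3, T, H, W) into a 3-channel map."""

    def __init__(self, T=16, Tout1=4, Tout2=2):
        super().__init__()

        def block(c_in, c_out, t_stride):
            # 3x3x3 convolution with a temporal-only stride, as in Table 1.
            # padding=1 keeps the W x H size and is our assumption.
            return nn.Sequential(
                nn.Conv3d(c_in, c_out, kernel_size=3,
                          stride=(t_stride, 1, 1), padding=1),
                nn.BatchNorm3d(c_out),
                nn.ReLU(inplace=True),
            )

        self.s3d = nn.Sequential(
            block(3, 16, T // Tout1),       # T frames     -> Tout1 frames
            block(16, 32, Tout1 // Tout2),  # Tout1 frames -> Tout2 frames
            block(32, 3, Tout2),            # Tout2 frames -> 1 frame, 3 channels
        )

    def forward(self, x, frame_index=0):
        # x: (N, 3, T, H, W). f(X) is the 3CFM; Xi is a selected frame (Eq. (1)).
        f_x = self.s3d(x).squeeze(2)     # (N, 3, H, W), the learned 3CFM
        x_i = x[:, :, frame_index]       # (N, 3, H, W), the selected RGB frame
        return f_x + x_i                 # Y(Xi) = f(X1, ..., XT) + Xi

clip = torch.randn(1, 3, 16, 224, 224)   # one 16-frame sub-shot, 224x224 assumed
y = S3D()(clip)                          # (1, 3, 224, 224): RGB-compatible output
# y can now be fed to any pre-trained 2D CNN (the Back-CNN), e.g.
# torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1")(y).
```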
B. BACK-CNN FOR VIDEO RECOGNITION
The two-stream CNN structure with spatial and temporal streams has proven to be effective for video action recognition problems. Here, the temporal-stream CNN is trained with the motion information of consecutive video frames. For example, optical flow [24], dynamic images [25], and the SG3I [10] can be used as the input for the temporal-stream CNN. On the other hand, for the spatial-stream CNN, a representative single frame of the video shot is selected as the input. Then, the two-stream CNN combines the features obtained in both the spatial-stream and the temporal-stream CNNs.

For our Back-CNN, we basically follow the two-stream CNN model with two pre-trained 2D CNNs in [10]. The first stream of 2D CNN is for learning the spatial information via
FIGURE 3. The overall structure of the serially connected Front-CNN and Back-CNN.
the early layers of MobileNet are frozen, while all the layers of the S3D and the last layers of MobileNet are trained.

IV. EXPERIMENTS
A. DATASETS AND IMPLEMENTATION DETAILS
For the performance evaluation of the proposed network, the UCF-101 [28], HMDB-51 [29], UCF-Crime [15], and UR-Fall Detection [30] video datasets were used, where UCF-101 and HMDB-51 have been widely used for the performance evaluation of video action recognition. The details of the above datasets are as follows.
• UCF-101 has 13320 video clips with 101 categories in 5 action groups (i.e., Sports, Playing Musical Instrument, Human-Object Interaction, Body-Motion, and Human-Human Interaction).
• HMDB-51 has 6766 video clips with 51 categories, which are from 5 action groups (i.e., general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction, and body movements for human interaction).
• The UCF-Crime dataset provides 800 normal and 810 anomalous video shots (video clips) for training. Also, there are 150 normal and 140 anomalous videos for testing. Although the anomalous videos contain 13 real-world anomaly types, our task is general anomaly detection. So, we consider all anomalies as one group and all normal activities as another. The UCF-Crime training dataset has weakly supervised labels in the sense that a common label is given for each video shot but not for every individual frame. In our method, each labeled video shot is divided into multiple sub-shots, where each sub-shot is used as the basic unit for the classification. Therefore, we simply assign the label of a video shot to all of its sub-shots for training. Similarly, a sequential set of frames is grouped to form a sub-shot for testing. Fig. 5 shows two examples of the anomaly / normal situations in the UCF-Crime dataset.
• The UR Fall Detection dataset has 40 activities of daily living (ADL) and 30 fall videos. Each fall video has a camera-0 version taken horizontally and a camera-1 version taken vertically, whereas the ADL videos have only the camera-0 version. In addition, depth information recorded with Microsoft's Kinect is provided, but only the RGB images were used in our experiments. For the Fall / Not-Fall labels, the frames are categorized as (a) pre-fall, (b) critical, (c) fall, and (d) recovery, following the criteria used in [31], [32]; (a), (b), and (d) were assigned to Not Fall and (c) to Fall. Since no train / validation / test split is provided, our experiments were conducted by dividing the dataset into 5 folds. Fig. 6 shows examples of the Fall and the ADL situations of the UR-Fall Detection dataset. The main difference between the Fall and the ADL is whether a person lies down completely, as shown in Fig. 6, or not.

FIGURE 5. Example of the anomaly / normal situations in the UCF-Crime dataset. Red bounding boxes indicate the anomalies.

FIGURE 6. Example of the Fall / ADL situations in the UR-Fall Detection dataset.

In our experiments, since each video shot in the datasets usually has hundreds of video frames, we first divide the entire video shot into N sub-shots with N = 10. Then, each sub-shot has t frames, where t varies depending on the total number of frames in the video shot. Among the t frames in a sub-shot, we uniformly sub-sample T frames such that T < t (see Fig. 3). In our experiments, we set T = 16.
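A minimal sketch of this sub-shot sampling is given below, assuming the decoded frames of one video shot are available as a list; the function name and the use of NumPy are ours, not taken from the paper.

```python
import numpy as np

def make_subshots(frames, N=10, T=16):
    # Split one video shot into N sub-shots and uniformly pick T frames from each.
    # `frames` is assumed to be the full list of decoded frames of the shot (t > T).
    subshots = []
    for chunk in np.array_split(np.arange(len(frames)), N):
        picks = np.linspace(chunk[0], chunk[-1], num=T).round().astype(int)
        subshots.append([frames[i] for i in picks])
    return subshots  # each entry: T frames, i.e., one input clip for the S3D (Fig. 3)
```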
In fine-tuning the network, we set the training hyper-parameters as follows: the stochastic gradient descent (SGD) optimizer with momentum 0.9, a minibatch size of 16, and an initial learning rate of 1e-3. We used the average fusion [24] of the results of the spatial stream (i.e., the result for Xi) and the temporal stream (i.e., the result for Y(Xi)). To simulate the edge-computing environment, the inference of the Front- and Back-CNN model was implemented on the NVIDIA Jetson Nano Developer Kit [33].
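The sketch below illustrates these settings, again assuming a PyTorch implementation; the helper names are ours, and whether the fusion averages softmax scores or raw logits is not specified in the text.

```python
import torch
import torch.nn.functional as F

def make_optimizer(model):
    # SGD with momentum 0.9 and an initial learning rate of 1e-3, as stated above.
    # (The minibatch size of 16 is configured in the DataLoader, not here.)
    return torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def fused_prediction(spatial_cnn, temporal_cnn, x_i, y_xi):
    # Average fusion [24]: the spatial stream sees the selected frame Xi and the
    # temporal stream sees Y(Xi) from Eq. (1); class scores are then averaged.
    # Averaging softmax scores (rather than logits) is our assumption.
    p_spatial = F.softmax(spatial_cnn(x_i), dim=1)
    p_temporal = F.softmax(temporal_cnn(y_xi), dim=1)
    return (p_spatial + p_temporal) / 2
```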
In evaluating the anomaly detection methods, we adopted widely used performance measures such as AUC (Area Under the Curve), Precision, Recall, and Accuracy, defined as follows:

Precision = TP / (TP + FP),   (3)
Recall = TP / (TP + FN),   (4)

Accuracy = (TP + TN) / (TP + FP + FN + TN),   (5)

where TP, TN, FP, and FN are True Positive, True Negative, False Positive, and False Negative, respectively.
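For completeness, a small helper computing Eqs. (3)-(5) from raw confusion-matrix counts (the function name and the example counts are ours):

```python
def detection_metrics(tp, tn, fp, fn):
    # Precision, Recall, and Accuracy of Eqs. (3)-(5) from confusion-matrix counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

print(detection_metrics(tp=90, tn=85, fp=10, fn=15))   # example counts (ours)
```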
the network parameters in the S3D and the last layers of the MobileNet are updated.

Table 4 shows the anomaly detection results on the UCF-Crime dataset. Since most of the previous results for UCF-Crime were reported in terms of the AUC, we also used the AUC as the performance measure in Table 4. As shown in the table, our method achieved the best performance in both AUC and fps. To be specific, our method is about 2% and 6~10% superior to the BI-LSTM [36] and the methods in [15], [37], respectively. Our method is better by about 1-3% than the SG3I [10] in the AUC and is much faster in terms of the fps performance on the NVIDIA Jetson Nano. Note that, since the SG3I method in [10] needs to fuse all the results for 10 SG3I inputs, it may suffer from latency. In the case of C3D, it is impossible to execute the C3D inference on the Jetson Nano due to insufficient memory.

TABLE 4. Comparison of the anomaly detection performance for UCF-Crime. The speeds in fps (frames per second) were measured on the Jetson Nano.

Method                Detector               AUC    fps
Sultani et al. [15]   SVM Baseline           50.0   -
Hasan et al. [38]     Deep Autoencoder       50.6   -
Sultani et al. [15]   C3D                    75.4   N/A
Zhu & Newsam [37]     C3D                    79.0   N/A
Ullah et al. [36]     BI-LSTM                85.5   -
Kim & Won [10]        MobileNet-v2           84.5   33.2
Kim & Won [10]        MobileNet-v3 (Small)   84.3   29.4
Kim & Won [10]        MobileNet-v3 (Large)   85.9   27.2
S3D+2DCNN             MobileNet-v2           87.4   76.6
S3D+2DCNN             MobileNet-v3 (Small)   85.0   89.2
S3D+2DCNN             MobileNet-v3 (Large)   87.1   79.9
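As a hedged illustration of how throughput figures of the kind listed in Table 4 can be obtained, the sketch below times repeated forward passes of a cascade model on a CUDA device such as the Jetson Nano. The warm-up loop, the single-clip batch, and the convention of counting T frames per forward pass are our assumptions, not the authors' measurement protocol.

```python
import time
import torch

def measure_fps(model, T=16, size=224, runs=50, device="cuda"):
    # Rough frames-per-second estimate for a cascade that consumes one
    # (1, 3, T, size, size) sub-shot per forward pass.
    model = model.to(device).eval()
    clip = torch.randn(1, 3, T, size, size, device=device)
    with torch.no_grad():
        for _ in range(10):                  # warm-up passes
            model(clip)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(clip)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return runs * T / elapsed                # T frames consumed per pass
```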
Table 5 shows the performance of the proposed method on the UR-Fall Detection dataset, where the results are averaged over all 5 folds of the train/test splits. Since most of the previous works used the accuracy as the performance measure, we also used the accuracy as well as the precision and the recall for the UR-Fall dataset. As one can see in Table 5, the proposed method achieved state-of-the-art results under the real-time constraints with an average accuracy of 100%. Since a 'Fall' includes only the situation where the person is lying on the floor, it is much easier to detect the falls in the UR-Fall than the crime actions in the UCF-Crime dataset. That is, the crime situations include much more complex anomalies and actions. This is why the non-DNN methods such as [39] and [40] also show an excellent 'Fall' detection performance of 96.6%, and the DNN-based methods such as [41], [31], and [15], including the proposed method, make only slight performance improvements.

TABLE 5. Comparison of the anomaly detection performance for the UR Fall Detection dataset. The speed of the proposed method was measured on the Jetson Nano in frames per second (fps).

Method                Detector               Precision   Recall   Accuracy   fps
Harrou et al. [39]    MEWMA(+SVM)            93.6        100.0    96.6       -
Harrou et al. [40]    GLR(+SVM)              94.0        100.0    96.6       -
Feng et al. [41]      BI-LSTM                94.8        91.4     -          -
Li et al. [31]        CNN                    -           -        99.9       -
Kim & Won [10]        MobileNet-v2           96.9        100.0    97.1       33.2
Kim & Won [10]        MobileNet-v3 (Small)   96.7        99.8     96.7       29.4
Kim & Won [10]        MobileNet-v3 (Large)   100.0       98.1     98.3       27.2
S3D+2DCNN             MobileNet-v2           100.0       100.0    100.0      76.6
S3D+2DCNN             MobileNet-v3 (Small)   99.4        99.8     99.5       89.2
S3D+2DCNN             MobileNet-v3 (Large)   100.0       100.0    100.0      79.9
V. CONCLUSION
The goal of this paper is to construct a deep neural network that can handle video data under real-time constraints. Our strategy to achieve this goal is to avoid the computationally expensive optical flow computation and to minimize the 3D convolution operations. Accordingly, the proposed solution is a modular architecture with the Front-CNN and the Back-CNN. The Front-CNN is a shallow 3D (S3D) CNN with only three layers of 3D convolution blocks, which is trained to condense multiple frames of the video into only three channels of feature maps (3CFM). Then, since the 3CFM is compatible with a 2D image with RGB channels, we can employ any pre-trained 2D CNN for the following Back-CNN. Our modular S3D-2D CNN architecture was applied to anomaly detection on the Jetson Nano Developer Kit as an edge-computing environment. Experimental results confirm real-time execution with state-of-the-art performance for the anomaly detection problem on the UCF-Crime and UR Fall datasets. Although more 3D convolutional layers could be added to the S3D for more challenging video tasks, they would certainly limit the real-time execution and increase the hardware cost of the edge-computing device.

REFERENCES
[1] M. P. J. Ashby, "The value of CCTV surveillance cameras as an investigative tool: An empirical analysis," Eur. J. Criminal Policy Res., vol. 23, no. 3, pp. 441–459, Sep. 2017.
[2] E. L. Piza, B. C. Welsh, D. P. Farrington, and A. L. Thomas, "CCTV surveillance for crime prevention: A 40-year systematic review with meta-analysis," Criminol. Public Policy, vol. 18, no. 1, pp. 135–159, Feb. 2019.
[3] H. Heng, D. Jazayeri, L. Shaw, D. Kiegaldie, A. M. Hill, and M. E. Morris, "Hospital falls prevention with patient education: A scoping review," BMC Geriatrics, vol. 20, no. 1, pp. 1–12, Dec. 2020.
[4] R. Parvathy, S. Thilakan, M. Joy, and K. M. Sameera, "Anomaly detection using motion patterns computed from optical flow," in Proc. 3rd Int. Conf. Advances in Computing and Communications, pp. 58–61, Aug. 2013.
[5] J. F. Kooij, M. C. Liem, J. D. Krijnders, T. C. Andringa, and D. M. Gavrila, "Multi-modal human aggression detection," Computer Vision and Image Understanding, vol. 144, pp. 106–120, Mar. 2016.
[6] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013.
[7] G. Varol, I. Laptev, and C. Schmid, "Long-term temporal convolutions for action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1510–1517, 2017.
[8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 4489–4497, Dec. 2015.
[9] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proc. CVPR, pp. 4724–4733, Jul. 2017.
[10] J.-H. Kim and C. S. Won, "Action recognition in videos using pre-trained 2D convolutional neural networks," IEEE Access, vol. 8, pp. 60179–60188, Mar. 2020.
[11] F. Lei, X. Liu, Q. Dai, and B. W. K. Ling, "Shallow convolutional neural network for image classification," SN Applied Sciences, vol. 2, no. 1, pp. 1–8, 2020.
[12] F. F. Chamasemani and L. S. Affendey, "Systematic review and classification on video surveillance systems," International Journal of Information Technology and Computer Science (IJITCS), vol. 5, no. 7, p. 87, Jun. 2013.
[13] W. Luo, W. Liu, and S. Gao, "Remembering history with convolutional LSTM for anomaly detection," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), pp. 439–444, Jul. 2017.
[14] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X.-S. Hua, "Spatio-temporal autoencoder for video anomaly detection," in Proc. 25th ACM Int. Conf. Multimedia, pp. 1933–1941, 2017.
[15] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6479–6488, 2018.
[16] A. Singh, D. Patil, and S. N. Omkar, "Eye in the sky: Real-time drone surveillance system (DSS) for violent individuals identification using ScatterNet hybrid deep learning network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), pp. 1629–1637, Jun. 2018.
[17] R. Polishetty, M. Roopaei, and P. Rad, "A next-generation secure cloud-based deep learning license plate recognition for smart cities," in Proc. 15th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), pp. 286–293, Dec. 2016.
[18] K. Abas, C. Porto, and K. Obraczka, "Wireless smart camera networks for the surveillance of public spaces," Computer, vol. 47, no. 5, pp. 37–44, 2014.
[19] S. Vitek and P. Melničuk, "A distributed wireless camera system for the management of parking spaces," Sensors, vol. 18, no. 1, p. E69, Dec. 2018.
[20] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet Things J., vol. 3, no. 5, pp. 637–646, Oct. 2016.
[21] A. E. Eshratifar, A. Esmaili, and M. Pedram, "BottleNet: A deep learning architecture for intelligent mobile cloud computing services," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design (ISLPED), Lausanne, Switzerland, pp. 1–6, Jul. 2019.
[22] S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B.-Y. Choi, and T. R. Faughnan, "Real-time human detection as an edge service enabled by a lightweight CNN," in Proc. IEEE Int. Conf. Edge Comput. (EDGE), pp. 125–129, Jul. 2018.
[23] K. Huang, X. Liu, S. Fu, D. Guo, and M. Xu, "A lightweight privacy-preserving CNN feature extraction framework for mobile sensing," IEEE Trans. Dependable Secure Comput., Apr. 2019.
[24] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proc. Adv. Neural Inf. Process. Syst., pp. 568–576, 2014.
[25] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, "Dynamic image networks for action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3034–3042, Jun. 2016.
[26] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 4510–4520, Jun. 2018.
[27] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, and V. Vasudevan, "Searching for MobileNetV3," arXiv preprint arXiv:1905.02244, 2019.
[28] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild." [Online]. Available: https://arxiv.org/abs/1212.0402, 2012.
[29] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: A large video database for human motion recognition," in Proc. Int. Conf. Comput. Vis., pp. 2556–2563, Nov. 2011.
[30] B. Kwolek and M. Kepski, "Human fall detection on embedded platform using depth maps and wireless accelerometer," Comput. Methods Programs Biomed., vol. 117, no. 3, pp. 489–501, Dec. 2014.
[31] X. Li, T. Pang, W. Liu, and T. Wang, "Fall detection for elderly person care using convolutional neural networks," in Proc. 10th Int. Congr. Image Signal Process., Biomed. Eng. Informat. (CISP-BMEI), Shanghai, China, pp. 1–6, Oct. 2017.
[32] N. Noury et al., "Fall detection—Principles and methods," in Proc. IEEE 29th Annu. Int. Conf. Eng. Med. Biol. Soc., pp. 1663–1666, Aug. 2007.
[33] NVIDIA Developer. NVIDIA Jetson—Hardware For Every Situation. [Online]. Available: https://developer.nvidia.com/embedded/develop/hardware, 2019.
[34] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1251–1258, Jun. 2017.
[35] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 248–255, Jun. 2009.
[36] W. Ullah, A. Ullah, I. U. Haq, K. Muhammad, M. Sajjad, and S. W. Baik, "CNN features with bi-directional LSTM for real-time anomaly detection in surveillance networks," Multimedia Tools Appl., pp. 1–17, Aug. 2020.
[37] Y. Zhu and S. Newsam, "Motion-aware feature for improved video anomaly detection," arXiv preprint arXiv:1907.10211, 2019.
[38] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, "Learning temporal regularity in video sequences," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 733–742, Jun. 2016.
[39] F. Harrou, N. Zerrouki, Y. Sun, and A. Houacine, "Vision-based fall detection system for improving safety of elderly people," IEEE Instrum. Meas. Mag., vol. 20, no. 6, pp. 49–55, Dec. 2017.
[40] F. Harrou, N. Zerrouki, Y. Sun, and A. Houacine, "An integrated vision-based approach for efficient human fall detection in a home environment," IEEE Access, vol. 7, pp. 114966–114974, 2019.
[41] Q. Feng, C. Gao, L. Wang, Y. Zhao, T. Song, and Q. Li, "Spatio-temporal fall event detection in complex scenes using attention guided LSTM," Pattern Recognit. Lett., vol. 130, pp. 242–249, Feb. 2020.