
Deep Edge Computing for Videos


JUN-HWA KIM, NAMHO KIM, AND CHEE SUN WON
Department of Electronics and Electrical Engineering, Dongguk University-Seoul, Seoul 04620, South Korea
Corresponding author: Chee Sun Won ([email protected]).
This work was supported by the Basic Science Research Program of the National Research Foundation of Korea (NRF) funded by the
Ministry of Education under Grant NRF-2018R1D1A1B07043542.

ABSTRACT This paper provides a modular architecture with deep neural networks as a solution for real-
time video analytics in an edge-computing environment. The modular architecture consists of two networks
of Front-CNN (Convolutional Neural Network) and Back-CNN, where we adopt Shallow 3D CNN (S3D)
as the Front-CNN and a pre-trained 2D CNN as the Back-CNN. The S3D (i.e., the Front CNN) is in charge
of condensing a sequence of video frames into a feature map with three channels. That is, the S3D takes
a set of sequential frames in the video shot as input and yields a learned 3 channel feature map (3CFM)
as output. Since the 3CFM is compatible with the three-channel RGB color image format, we can use
the output of the S3D (i.e., the 3CFM) as the input to a pre-trained 2D CNN of the Back-CNN for the
transfer-learning. This serial connection of Front-CNN and Back-CNN architecture is end-to-end trainable
to learn both spatial and temporal information of videos. Experimental results on the public datasets of UCF-
Crime and UR-Fall Detection show that the proposed S3D-2DCNN model outperforms the existing methods
and achieves state-of-the-art performance. Moreover, since our Front-CNN and Back-CNN modules use a shallow S3D and a light-weight 2D CNN, respectively, the combined model is suitable for real-time video recognition in edge-computing environments. We have implemented our CNN model on the NVIDIA Jetson Nano Developer Kit as an edge-computing device to demonstrate its real-time execution.

INDEX TERMS Edge Computing, CNN, IoT, Anomaly Detection, Video Recognition.

I. INTRODUCTION
Surveillance cameras have been increasingly deployed in public places for the purpose of monitoring abnormal events such as criminal activities and medical emergencies [1], [2]. In reality, falls commonly occur among the elderly and hospital inpatients, generally ranging from 3 to 11 falls per 1,000 bed days [3]. Approximately 25% of inpatient falls result in injury, including fracture, subdural hematoma, excessive bleeding, and even death [3].

Anomaly detection in a video is one of the challenging problems and has been studied for a long time in the computer vision community. Traditional anomaly detection mainly relied on the motion information between two consecutive frames, extracted by optical flow [4] or a dynamic Bayesian network (DBN) [5]. Recently, deep neural networks (DNNs) have been exploited to learn motion information in a video. For videos, a Convolutional 3D (C3D) neural network architecture with 3D convolutional filters is considered a natural extension of the 2D filters of the 2D CNN [6]-[9]. The C3D learns the temporal motions as well as the spatial features from video frames. This requires the C3D to execute complex 3D convolutions with a kernel dimension of R^{c x d x d x T}, where c is the number of channels, d is the spatial size (i.e., d x d) of the filter, and T is the number of frames in the video clips. That is, each filter in the inner layers of the C3D takes a 3D volume input and produces another 3D volume output, which requires a lot of computations and memory space. This motivates researchers to simplify the C3D structure. A plausible approach is to expand already-trained 2D CNN (Convolutional Neural Network) coefficients into 3D space. For example, in the I3D (Inflated 3D) CNN [9], the pre-trained filter coefficients of 2D CNNs are copied over the temporal direction to form a 3D CNN structure. However, the filter size of the I3D is still R^{c x d x d x T}, which makes no change in the inference complexity. Recently, in [10], it has been shown that video recognition can be done by using pre-trained 2D CNNs only. That is, a pre-trained CNN is fine-tuned by 3 grayscale frames, which are sub-sampled from a video shot. The selected 3 grayscale images among multiple video frames form an SG3I (Stacked Grayscale 3-channel Image) [10], which is compatible with a color image with RGB (Red, Green, Blue) channels. Then, the SG3Is formed from the training videos are used to fine-tune the pre-trained 2D CNN to learn the motion information.


Since the dimension of the 2D CNN with SG3I is only R^{c x d x d}, without the extension to the time domain, it is much lighter than C3D and I3D. Also, with the SG3I, no optical flow computations are required, enabling fast training and testing for real-time applications.

Recently, the integration between IoT (Internet of Things) and video surveillance has led to a steep rise in the demand for IP (Internet Protocol) cameras. However, since IP cameras have limited computing power, it may be necessary to send the video to the cloud in order to execute a C3D neural network for the recognition task (see Fig. 1-A). In this case, the network traffic may hamper the timely detection of any anomaly at the cloud. To solve this problem, we can cut the amount of video data at the edge-computing device before the transmission. For example, we can adopt the SG3I scheme [10] at the edge-computing side, which needs only a simple frame-selection process in a video shot. However, as addressed in [10], multiple SG3Is are required at the inference step to guarantee results comparable to the state-of-the-art performance. Specifically, in [10], the CNN outputs for all 10 SG3I inputs were fused to make the final decision for each video clip. So, as shown in Fig. 1-B, this requires sending multiple SG3Is to the cloud, which may also cause some delays in the recognition process. Certainly, the best solution is to complete the anomaly detection at the edge-side (see Fig. 1-C and 1-D). To this end, in this paper, we propose a joint CNN with a shallow 3D CNN, which has fewer convolutional layers and network parameters compared to conventional deep neural networks [11], and a light-weight 2D CNN for fast anomaly detection in the edge-computing environment.

Since a typical CNN model has stacked layers of sub-networks, we may separate a trained CNN into two parts, Front-CNN and Back-CNN. Also, we can combine two CNNs as the Front-CNN and the Back-CNN, where the Front-CNN is in charge of the pre-processing for the input of the Back-CNN. In this paper, as the Front-CNN, we use a Shallow 3D CNN (S3D), which is trained to condense multiple video frames into a single frame with three channels of feature maps. As a result, the amount of video data is reduced and the output of the S3D becomes compatible with the input of conventional 2D CNNs. This allows us to use any pre-trained CNN for the Back-CNN with no optical flow computations, making for fast video recognition. Specifically, as shown in Fig. 1-D, we can embed the S3D (Shallow 3D) CNN into an edge-computing device, where multiple video frames are fed into the S3D to produce a learned 3-channel feature map (3CFM) as the output. Note that, like the SG3I, the 3CFM fits the input format of a pre-trained CNN with three RGB channels. This enables us to fine-tune any pre-trained 2D CNN for video recognition problems without resorting to a C3D neural network.

The contributions of this paper are summarized as follows.
1) We propose a Shallow 3D CNN (S3D) as the Front-CNN. The S3D is trained to condense multiple video frames into a single 3-channel feature map (3CFM).
2) Treating the 3CFM as an image with RGB channels, we use the 3CFM as the input of a pre-trained 2D CNN. This naturally forms a cascade network with the Front-CNN (S3D) and the Back-CNN (2D CNN), solving a video recognition problem without 3D filters and optical flow computations.
3) We can use the S3D as a stand-alone network to condense multiple video frames, reducing the transmission cost in a client-server framework.
4) We have evaluated the real-time performance of our cascade S3D-MobileNet network on the NVIDIA Jetson Nano Developer Kit.

II. BACKGROUND AND RELATED WORK
A. EDGE COMPUTING
Traditional video surveillance systems [12] demand human intervention to some extent. However, as the number of IP cameras increases explosively, a fully automatic video recognition framework becomes essential, replacing manual monitoring. Many algorithms [4], [5], [13]-[15] have been developed to handle vast amounts of data automatically. These algorithms can be used for video recognition in a cloud server. As an example, violence detection [16] was performed by transmitting the video data obtained from a drone camera to the cloud server. Also, by transmitting road video obtained from a camera to the cloud server, the license plate of a vehicle was extracted [17].

In the above scenarios, the video data captured by the camera are transmitted to the cloud server to do the entire recognition process, which may hamper real-time video recognition due to transmission delays through the communication channel. Alternatively, to send only key information, a simple pre-processing technique can be applied to the video acquired from the camera before transmitting it to the cloud server. SWEETCAM, proposed in [18], has an image processing module in the camera and can perform pre-processing tasks such as background subtraction, contour detection, and object classification. Also, one can determine occupancy from data acquired by multiple cameras using a local binary pattern with a support vector machine classifier [19].

Video pre-processing tasks can be done at the edge-computing device [20], [21], which is located in between the camera and the cloud server (see Fig. 2). Without video pre-processing at the edge-side, the cloud server has to take the entire computational load. Therefore, the purpose of the video pre-processing is to reduce not only the burden of data transmission but also the computational load at the cloud. Since the edge-computing device is inferior to the cloud in terms of computing power, we can embed only a light-weight DNN at the edge-side to maximize the effect of the pre-processing tasks [22], [23].

FIGURE 2. Edge computing with IoT devices.

B. NEURAL NETWORKS FOR VIDEOS
Recently, a remarkable performance improvement has been achieved by applying DNNs to video recognition problems.

FIGURE 1. Four scenarios of incorporating neural networks in an edge-computing environment. A: 3D CNN at the cloud, B: multiple SG3Is formed at the edge-side and 2D CNN at the cloud, C: a single SG3I formation and 2D CNN at the edge-side, D: a single 3CFM formed by the S3D and 2D CNN at the edge-side.

For example, in [24], a two-path CNN model was proposed for video action recognition. The two-path CNN has a spatial-stream CNN for extracting spatial features and a temporal-stream CNN for learning temporal motions. For the input of the spatial-stream CNN, a representative frame is selected from several video frames. On the other hand, the input of the temporal stream takes a sequence of optical flows obtained from consecutive video frames. Here, the problem is that the optical flow demands a lot of computations. To avoid the optical flow computation, in [25], the dynamic image based on a rank-pooling concept, which is a summarized single frame of the motion information in the multiple frames of the original video clip, was proposed. Similarly, in [10], three gray-scale images are extracted from video frames and stacked in the R, G, and B channels to form an SG3I image. The SG3I, which is compatible with the input of pre-trained 2D CNNs, requires a very small amount of computation and is suitable for real-time video recognition problems.
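As a rough illustration of this input formation (the frame-selection rule here, first/middle/last, and the simple luminance conversion are our assumptions and may differ from the exact procedure in [10]), an SG3I-like image can be built as follows:

import numpy as np

def make_sg3i(frames: np.ndarray) -> np.ndarray:
    """Build one SG3I-like image: pick three frames from a shot, convert each
    to grayscale, and stack them as the R, G, and B channels.
    frames: (T, H, W, 3) uint8 RGB frames of a single shot.
    Returns an (H, W, 3) uint8 image compatible with a 2D CNN input."""
    t = frames.shape[0]
    picks = [0, t // 2, t - 1]                      # assumed selection policy
    gray = [frames[i].mean(axis=2) for i in picks]  # simple luminance proxy for grayscale
    return np.stack(gray, axis=2).astype(np.uint8)  # three grayscale frames -> RGB-like channels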
Previously, the optical flows were computed directly from the video frames [4]. However, recently, the optical flow computations have been replaced by neural networks. In [5], a DBN (Dynamic Bayesian Network) was adopted to learn the fused clues of video (motion trajectory, arm and leg direction information) and audio (scream, yell). Also, in [13], an LSTM (Long Short-Term Memory) network was adopted to learn temporal information, while a CNN was used to extract spatial information in the video. Also, in [14], a spatio-temporal auto-encoder extracts temporal and spatial information in the video by designing a 3D CNN with an auto-encoder structure. The C3D and ranking models were also employed in [15] for detecting video anomalies.

III. MODULAR NEURAL NETWORK
Fig. 3 shows the overall structure of our Front-CNN and Back-CNN modules. The Front-CNN is a shallow 3D convolutional neural network (S3D CNN) with only 3 layer-blocks of 3D convolutional filters, batch normalization (BN), and ReLU activation functions. The Front-CNN is responsible for condensing a group of video frames into a three-channel feature map (3CFM) with the learned motion information. Then, since the 3CFM is compatible with the input of pre-trained 2D CNNs, we can adopt any pre-trained 2D CNN for the Back-CNN. That is, we can feed the 3CFM into the Back-CNN for transfer-learning.

So, the Front-CNN of the S3D CNN enables us to use any pre-trained 2D CNN for the video recognition problem via transfer-learning. Note that the Front-CNN can be used stand-alone, or it can be connected with the Back-CNN as a single network for end-to-end training.

We introduce two user-scenarios for utilizing the Front-CNN as a stand-alone network. First, we can train a classifier with the output of the Front-CNN (i.e., the 3CFM) to pre-screen videos. Upon detecting any meaningful motion, we can feed the 3CFM into the input of the Back-CNN to obtain more detailed information. So, the additional work of the Back-CNN is required only when meaningful motion is detected. In this way, we can avoid unnecessary computations at the edge-computing side. As a second scenario, we can use the Front-CNN as a means of compressing video data. Since multiple video frames can be condensed into only 1 frame with 3 channels by the Front-CNN, we can reduce the transmission cost in a client-server computing environment. In this scenario, the video data captured at the camera are transmitted to the server for video analytics, and the role of the Front-CNN is to reduce the amount of data at the client-side.

In the following sub-sections, we explain the Front-CNN, the Back-CNN, and their serial connection for end-to-end training in more detail.

A. FRONT-CNN FOR VIDEO CONDENSATION
In this subsection, we introduce the Front-CNN, which can condense a video shot with multiple frames into only three channels of feature maps using the S3D. The main element of our Front-CNN is the 3D convolution filter, which can deal with multiple video frames. As shown in Fig. 3, the Front-CNN receives video frames of W * H * 3 @ T as input, which are sampled from a video shot of w * h * 3 @ t (with t >= T, w != W, and h != H) and resized. Passing through the three 3D convolution layers, the S3D of the Front-CNN outputs a 3CFM of W * H * 3 @ 1, where the T frames are now condensed into only 1 frame. The specific elements of our S3D are listed in Table 1. As shown in the table, to reduce the number of frames of the input video gradually, we set the numbers of intermediate frames, Tout1 and Tout2, such that T > Tout1 > Tout2 > 1, where Tout1 and Tout2 should be factors of T.

The output of the S3D (i.e., the 3CFM) has only three channels of feature maps, which are compatible with the RGB channels of a color image. This implies that the 3CFM can be used as the input of any pre-trained 2D CNN. So, we can directly connect the S3D of the Front-CNN and the pre-trained 2D CNN of the Back-CNN for end-to-end fine-tuning. Note that the network parameters in both the S3D (i.e., the Front-CNN) and the last layers of the pre-trained CNN (i.e., the Back-CNN) are updated during the fine-tuning. That is, the network parameters in the early layers of the pre-trained 2D CNN are fixed without updating. Then, since the Back-CNN has been pre-trained on general 2D images rather than feature maps like the 3CFM, the 3CFM input to the Back-CNN may deteriorate the performance of the end-to-end fine-tuning. To solve this problem, we employ a skip connection in the Front-CNN, which adds the flavor of a color image selected from the video frames to the feature map of the 3CFM (see Fig. 3). Specifically, for a sequence of input frames X = {X1, X2, ..., XT}, we have the input of the Back-CNN, Y(Xi), as follows:

Y(Xi) = f(X1, X2, ..., XT) + Xi,    (1)

where f denotes the output of the S3D and Xi, i in {1, ..., T}, is a randomly selected frame from X. So, the input of the Back-CNN is the sum of the 3CFM and a selected RGB frame of the video shot.

Fig. 4 shows some examples of the 3CFM and Y(Xi). Rows 1 through 3 are video shots of Abuse007_x264, Arrest002_x264, and Shooting002_x264, respectively, in the UCF-Crime dataset. As shown in the figures, the regions of interest (RoI) for the anomalies tend to be ruddy in the 3CFM. Then, more specific regions become brighter in Y(Xi), which is expected to help the following Back-CNN learn the anomalies.

FIGURE 4. Visualization of 3CFM and Y(Xi) for the UCF-Crime dataset.

TABLE 1. Details of the Shallow 3D convolutional neural network (S3D CNN) architecture used in the Front-CNN. In our S3D, with the strides of 4, 2, and 2 in order, we set T = 16, Tout1 = 4, and Tout2 = 2.

Layers                           Number of Filters   Filter Size   Stride                  Output Dimension
Convolution + BatchNorm + ReLU   16                  3x3x3         (T/Tout1) x 1 x 1       W x H x 3 @ Tout1
Convolution + BatchNorm + ReLU   32                  3x3x3         (Tout1/Tout2) x 1 x 1   W x H x 3 @ Tout2
Convolution + BatchNorm + ReLU   3                   3x3x3         Tout2 x 1 x 1           W x H x 3 @ 1
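To make the structure of Table 1 and the skip connection of (1) concrete, the following is a minimal PyTorch sketch. It is our own illustration rather than the authors' code: the padding of 1 and the reading of the "Number of Filters" column as the per-block output channels are assumptions.

import torch
import torch.nn as nn

class S3D(nn.Module):
    """Sketch of the shallow 3D Front-CNN of Table 1.
    Input:  frames of shape (N, 3, T=16, H, W) and one RGB frame (N, 3, H, W).
    Output: Y(Xi) of shape (N, 3, H, W), i.e., 3CFM plus the skip connection of Eq. (1)."""
    def __init__(self, T=16, T1=4, T2=2):
        super().__init__()
        def block(cin, cout, t_stride):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=3, stride=(t_stride, 1, 1), padding=1),
                nn.BatchNorm3d(cout),
                nn.ReLU(inplace=True),
            )
        self.net = nn.Sequential(
            block(3, 16, T // T1),    # temporal 16 -> 4
            block(16, 32, T1 // T2),  # temporal 4  -> 2
            block(32, 3, T2),         # temporal 2  -> 1, back to 3 channels (the 3CFM)
        )

    def forward(self, x, xi):
        cfm3 = self.net(x).squeeze(2)   # (N, 3, H, W) learned 3CFM
        return cfm3 + xi                # Eq. (1): Y(Xi) = f(X1, ..., XT) + Xi

# quick shape check
s3d = S3D()
x = torch.randn(1, 3, 16, 224, 224)
y = s3d(x, x[:, :, 0])                  # here Xi is the first frame; the paper picks Xi at random
print(y.shape)                          # torch.Size([1, 3, 224, 224])

Because the spatial stride is 1 with padding 1, only the temporal dimension is condensed (16 to 4 to 2 to 1), so the output keeps the spatial resolution and can be fed to any RGB-pretrained 2D CNN.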
B. BACK-CNN FOR VIDEO RECOGNITION
The two-stream CNN structure with spatial and temporal streams has proven to be effective for video action recognition problems. Here, the temporal-stream CNN is trained with the motion information in consecutive video frames. For example, optical flow [24], dynamic images [25], and the SG3I [10] can be used as the input of the temporal-stream CNN. On the other hand, for the spatial-stream CNN, a representative single frame of the video shot is selected as the input. Then, the two-stream CNN combines the features obtained from both the spatial-stream and the temporal-stream CNNs.

For our Back-CNN, we basically follow the two-stream CNN model with two pre-trained 2D CNNs in [10]. The first stream of the 2D CNN is for learning the spatial information via a single frame chosen from the video shot.

The second one is for learning the temporal motion by the SG3I, which is formed by stacking three gray-scale images chosen from the video shot. On the other hand, in this paper, the input of the temporal 2D CNN is the learned Y(Xi) from the Front-CNN, whereas the input of the spatial 2D CNN is the original frame Xi of the video shot. Note that the two-stream model needs to keep two 2D CNNs, which can be a burden on memory complexity in edge-computing environments. Therefore, we use only one of them and share it for learning both spatial and temporal information. Specifically, as in (2), the loss function L for the parameter update is a weighted sum of the two losses, L_Y(Xi) for the temporal learning and L_Xi for the spatial learning:

L = L_Y(Xi) + λ L_Xi,    (2)

where λ is a weighting factor. The parameter λ is determined experimentally and set to 0.1 for all our experiments. Now, before we perform a back-propagation for the parameter update by L, our coordinated end-to-end training method needs to execute two forward computations to obtain L_Xi from Xi and L_Y(Xi) from Y(Xi).

FIGURE 3. The overall structure of the serially connected Front-CNN and Back-CNN.

The limited computing power of edge-computing devices forces us to choose a light 2D CNN model for the Back-CNN. This leads us to consider the pre-trained MobileNet [26], [27] as the Back-CNN.
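A compact sketch of one such coordinated update is given below. It is an illustration under our assumptions (a cross-entropy classification loss and the S3D/backbone modules sketched above); the authors do not publish their training code.

import torch
import torch.nn.functional as F

LAMBDA = 0.1  # weighting factor λ of Eq. (2)

def coordinated_step(s3d, backbone, frames, xi, labels, optimizer):
    """One training step of Eq. (2): two forward passes through the *shared*
    Back-CNN, followed by a single weighted backward pass.
    frames: (N, 3, T, H, W) sampled clip;  xi: (N, 3, H, W) selected RGB frame."""
    optimizer.zero_grad()

    # temporal branch: Y(Xi) = S3D(frames) + Xi, as in Eq. (1)
    y = s3d(frames, xi)
    loss_temporal = F.cross_entropy(backbone(y), labels)   # L_Y(Xi)

    # spatial branch: the same (shared) Back-CNN sees the raw frame Xi
    loss_spatial = F.cross_entropy(backbone(xi), labels)   # L_Xi

    loss = loss_temporal + LAMBDA * loss_spatial            # L = L_Y(Xi) + λ L_Xi
    loss.backward()
    optimizer.step()
    return loss.item()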
C. END-TO-END TRAINING FOR FRONT-CNN AND BACK-CNN
As shown in Fig. 3, the Front-CNN and the Back-CNN are connected in series to complete the network. Again, a shallow 3D CNN (S3D) is employed for the Front-CNN, and a light-weight 2D CNN such as MobileNet-v2 [26], MobileNet-v3(small), or MobileNet-v3(large) [27] is used for the Back-CNN. Note that the three versions of MobileNet, which were pre-trained on ImageNet, have relatively few parameters. So, they are suitable for real-time applications.

For the training of the end-to-end network with Front-CNN and Back-CNN, all network parameters except those of the early layers of the Back-CNN (i.e., the pre-trained MobileNet) are updated. That is, the pre-trained parameters of the early layers of MobileNet are frozen, but those of the whole S3D and the last layers of MobileNet are trained.
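A minimal sketch of this partial freezing, using torchvision's MobileNet-v2 as an example backbone, is shown below; the cut-off of ten frozen blocks follows the "say, up to 10 layers" remark in Section IV, but the exact boundary is otherwise our assumption.

import torch.nn as nn
from torchvision import models

backbone = models.mobilenet_v2(pretrained=True)
backbone.classifier[1] = nn.Linear(backbone.last_channel, 2)  # e.g., anomaly vs. normal

# freeze the early feature blocks; train the remaining blocks and the classifier
N_FROZEN = 10                       # assumed cut-off for the "early layers"
for block in backbone.features[:N_FROZEN]:
    for p in block.parameters():
        p.requires_grad = False

trainable = [p for p in backbone.parameters() if p.requires_grad]
# an SGD optimizer (momentum 0.9, initial learning rate 1e-3, as in Section IV-A)
# would then be built over the S3D parameters together with `trainable`.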

IV. EXPERIMENTS
A. DATASETS AND IMPLEMENTATION DETAILS
For the performance evaluation of the proposed network, the UCF-101 [28], HMDB-51 [29], UCF-Crime [15], and UR-Fall Detection [30] video datasets were used, where UCF-101 and HMDB-51 have been widely used for performance evaluation of video action recognition. The details of the above datasets are as follows.
• UCF-101 has 13,320 video clips with 101 categories in 5 action groups (i.e., Sports, Playing Musical Instrument, Human-Object Interaction, Body-Motion, and Human-Human Interaction).
• HMDB-51 has 6,766 video clips with 51 categories, which are from 5 action groups (i.e., general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction, and body movements for human interaction).
• The UCF-Crime dataset provides 800 normal and 810 anomalous video shots (video clips) for training. Also, there are 150 normal and 140 anomalous videos for testing. Although the anomalous videos contain 13 real-world anomalies, our task is general anomaly detection. So, we consider all anomalies as one group and all normal activities as another. The UCF-Crime training dataset has weakly supervised labels in the sense that a common label is given for each video shot but not for every individual frame. In our method, each labeled video shot is divided into multiple sub-shots, where each sub-shot is used as the basic unit for classification. Therefore, we can simply assign the label of a video shot to all its sub-shots for training. Similarly, a sequential set of frames is grouped to form a sub-shot for testing. Fig. 5 shows two examples of the anomaly / normal situations in the UCF-Crime dataset.

FIGURE 5. Example of the anomaly / normal situations in the UCF-Crime dataset. Red bounding boxes indicate the anomalies.

• The UR Fall Detection dataset has 40 activities-of-daily-living (ADL) videos and 30 fall videos. Each fall video has a camera-0 version taken horizontally and a camera-1 version taken vertically, while the ADL videos have only the camera-0 version. In addition, depth information recorded with Microsoft's Kinect is provided, but only RGB images were used in our experiments. Following the criteria used in [31], [32], each video is divided into (a) pre-fall, (b) critical, (c) fall, and (d) recovery phases; (a), (b), and (d) were assigned to 'Not Fall', and (c) to 'Fall'. Since no train / validation / test split was provided, our experiments were conducted by dividing the dataset into 5 folds. Fig. 6 shows examples of the Fall and ADL situations of the UR-Fall Detection dataset. The main difference between the Fall and the ADL is whether a person lies down completely, as shown in Fig. 6, or not.

FIGURE 6. Example of the Fall / ADL situations in the UR-Fall Detection dataset.

In our experiments, since each video shot in the datasets usually has hundreds of video frames, we first divide the entire video shot into N sub-shots with N = 10. Then, each sub-shot has t frames, which vary depending on the total number of frames in the video shot. Among the t frames in the sub-shot, we uniformly sub-sample T frames such that T < t (see Fig. 3). In our experiments, we set T = 16.

In fine-tuning the network, we set the training hyper-parameters as follows: stochastic gradient descent (SGD) optimization with momentum 0.9, a minibatch size of 16, and an initial learning rate of 1e-3. We used the average fusion [24] of the results of the spatial stream (i.e., the result of Xi) and the temporal stream (i.e., the result of Y(Xi)). To simulate the edge-computing environment, the inference of the Front- and Back-CNN model was implemented on the NVIDIA Jetson Nano Developer Kit [33].
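The sub-shot preparation can be summarized by a short sketch; the exact index arithmetic below is our assumption, since the paper only fixes N = 10 sub-shots and T = 16 uniformly sampled frames.

import numpy as np

def sample_subshots(num_frames: int, n_subshots: int = 10, t: int = 16):
    """Split a video shot of `num_frames` frames into `n_subshots` sub-shots and
    uniformly sub-sample `t` frame indices from each sub-shot (assuming each
    sub-shot is longer than `t` frames, i.e., T < t holds)."""
    boundaries = np.linspace(0, num_frames, n_subshots + 1, dtype=int)
    subshots = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        idx = np.linspace(start, end - 1, t).round().astype(int)  # uniform sampling
        subshots.append(idx)
    return subshots  # list of N index arrays, each of length T

# e.g., a 1,200-frame shot -> 10 sub-shots of 120 frames, 16 sampled indices each
print(sample_subshots(1200)[0])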

In evaluating the anomaly detection methods, we adopted widely used performance measures such as AUC (Area Under the Curve), Precision, Recall, and Accuracy, defined as follows:

Precision = TP / (TP + FP)                        (3)
Recall = TP / (TP + FN)                           (4)
Accuracy = (TP + TN) / (TP + FP + FN + TN),       (5)

where TP, TN, FP, and FN are True Positive, True Negative, False Positive, and False Negative, respectively.
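For reference, (3)-(5) translate directly into code:

def precision(tp, fp):            # Eq. (3)
    return tp / (tp + fp)

def recall(tp, fn):               # Eq. (4)
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):     # Eq. (5)
    return (tp + tn) / (tp + fp + fn + tn)

# illustrative counts only, not values from the paper
print(precision(87, 10), recall(87, 13), accuracy(87, 90, 10, 13))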

B. PERFORMANCE COMPARISON FOR ACTION RECOGNITION
Note that the proposed method adopts a coordinated training of the spatial stream (i.e., the result of Xi) and the temporal stream (i.e., the result of Y(Xi)) with a shared 2D CNN. So, by setting λ = 0 in the loss function (2), only a temporal-stream network remains. This can be used to evaluate the performance of the proposed S3D for temporal motion learning. That is, for the performance comparison between the 3CFM (i.e., the output of the S3D) and the SG3I [10], we trained the temporal-stream CNN with the SG3I and the 3CFM, separately, on the UCF-101 and HMDB-51 datasets (see Fig. 7). Since Xception [34] was used in [10] for the SG3I, we also trained the Back-CNN with the same model. Also, with the same spatial stream, the result of each temporal stream is fused with the result of the spatial stream. As shown in Table 2, after fusing the results of the two streams, our method with the 3CFM yields a higher mAP (mean AP) than the SG3I, by 3.3% on UCF-101 and 4.3% on HMDB-51.

FIGURE 7. Two-stream CNNs for action recognition with (a) SG3I and (b) S3D as the temporal-stream CNN.

It is interesting to note that, for the mAP performance of the temporal stream only, the results of the SG3I are better than those of the 3CFM in Table 2. On the other hand, the 3CFM is better than the SG3I for the final fused results. We believe that this is due to the different motion information contained in the SG3I and the 3CFM. That is, since the SG3I is formed by just stacking 3 gray-scale images of the video shot, it has a higher correlation with the input image of the spatial-stream CNN. On the other hand, the 3CFM is the output of the learnt S3D and is less correlated with the input of the spatial-stream CNN, making it complementary to the spatial-stream CNN.

TABLE 2. Comparison between 3CFM and SG3I as the temporal-stream CNN for action recognition on the UCF-101 and HMDB-51 datasets.

Method      2D-CNN     Dataset    mAP (Spatial Stream)   mAP (Temporal Stream)   mAP (Fusion)
SG3I [10]   Xception   UCF-101    86.1                   84.7                    87.7
SG3I [10]   Xception   HMDB-51    65.5                   64.8                    67.5
3CFM        Xception   UCF-101    86.1                   79.2                    91.0
3CFM        Xception   HMDB-51    65.5                   55.4                    71.8

C. ANOMALY DETECTION IN EDGE-COMPUTING ENVIRONMENT
In this sub-section, we applied the proposed network with Front-CNN and Back-CNN to anomaly detection in an edge-computing environment. For the Front-CNN, the S3D with three layers of 3D convolution blocks is used. As mentioned already, the Back-CNN can adopt any existing 2D CNN. Considering the real-time constraints in the edge-computing environment, however, a relatively light-weight CNN is required. Therefore, for the Back-CNN, we used MobileNet-V2 [26], MobileNet-V3(Small) [27], and MobileNet-V3(Large) [27], which were pre-trained on still-image datasets such as ImageNet [35]. Table 3 shows the inference speeds for various combinations of different numbers of S3D layers (Front-CNN) and different versions of MobileNet [26], [27] (Back-CNN). As shown in the table, our modular network of Front-CNN and Back-CNN is fast enough to support real-time applications on the Jetson Nano for all versions of MobileNet with 3 layers of S3D.

TABLE 3. Comparison of inference speeds in terms of fps (frames per second) for various combinations of the end-to-end Front-CNN and Back-CNN network implemented on NVIDIA Jetson Nano. The UCF-Crime dataset [15] with 256 x 256 x 3 @ 16 inputs was used for the speed comparison.

S3D        MobileNet-v2 (fps)   MobileNet-v3(small) (fps)   MobileNet-v3(Large) (fps)   S3D Channels
3 Layers   76.6                 89.2                        79.9                        16, 32, 3
4 Layers   42.2                 50.1                        43.9                        16, 32, 32, 3
4 Layers   34.4                 39.4                        35.4                        16, 32, 64, 3
5 Layers   35.4                 40.6                        36.5                        16, 32, 32, 32, 3
5 Layers   16.9                 18.0                        17.1                        16, 32, 64, 128, 3
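Inference speed of this kind can be obtained with a simple timing loop. The sketch below is our own measurement harness, not the authors' script; it assumes a CUDA-capable device such as the Jetson Nano and a wrapper module that maps one clip tensor to class scores.

import time
import torch

@torch.no_grad()
def measure_fps(model, clip_shape=(1, 3, 16, 256, 256), n_runs=100, device="cuda"):
    """Average inference speed (clips per second) of a callable that takes an
    (N, C, T, H, W) clip, e.g., a wrapper around the S3D + MobileNet pipeline."""
    model = model.to(device).eval()
    clip = torch.randn(*clip_shape, device=device)
    for _ in range(10):                  # warm-up iterations
        model(clip)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(clip)
    torch.cuda.synchronize()
    return n_runs / (time.perf_counter() - start)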

The serially connected Front-CNN and Back-CNN are fine-tuned via the coordinated end-to-end training method. Note that the S3D of the Front-CNN is untrained, but the MobileNet of the Back-CNN is pre-trained. Therefore, the pre-trained weights in the early layers (say, up to 10 layers) of the MobileNet are frozen during the fine-tuning, and only the network parameters in the S3D and the last layers of the MobileNet are updated.

Table 4 shows the anomaly detection results on the UCF-Crime dataset. Since most of the previous results for UCF-Crime were reported in terms of the AUC, we also used the AUC as the performance measure in Table 4. As shown in the table, our method achieved the best performance in both AUC and fps. To be specific, our method is about 2% and 6∼10% superior to BI-LSTM [36] and the methods in [15], [37], respectively. Our method is better by about 1-3% than the SG3I [10] in AUC and is much faster in terms of fps on the NVIDIA Jetson Nano. Note that, since the SG3I method in [10] needs to fuse the results of all 10 SG3I inputs, it may suffer from latency. In the case of C3D, it is impossible to execute the C3D inference on the Jetson Nano due to insufficient memory.

TABLE 4. Comparison of the anomaly detection performance for UCF-Crime. The speeds in fps (frames per second) were measured on the Jetson Nano.

Method                Detector               AUC (UCF-Crime)   fps
Sultani et al. [15]   SVM Baseline           50.0              -
Hasan et al. [38]     Deep Autoencoder       50.6              -
Sultani et al. [15]   C3D                    75.4              N/A
Zhu & Newsam [37]     C3D                    79.0              N/A
Ullah et al. [36]     BI-LSTM                85.5              -
Kim & Won [10]        MobileNet-v2           84.5              33.2
Kim & Won [10]        MobileNet-v3 (Small)   84.3              29.4
Kim & Won [10]        MobileNet-v3 (Large)   85.9              27.2
S3D+2DCNN             MobileNet-v2           87.4              76.6
S3D+2DCNN             MobileNet-v3 (Small)   85.0              89.2
S3D+2DCNN             MobileNet-v3 (Large)   87.1              79.9

Table 5 shows the performance of the proposed method on the UR-Fall Detection dataset, where the results are averaged over all 5 folds of train/test splits. Since most of the previous works used accuracy as the performance measure, we also used accuracy, as well as precision and recall, for the UR-Fall dataset. As one can see in Table 5, the proposed method achieved state-of-the-art results under real-time constraints with an average accuracy of 100%. Since 'Fall' includes only the situation when the person is lying on the floor, it is much easier to detect the falls in the UR-Fall dataset than the crime actions in the UCF-Crime dataset. That is, the crime situations include much more complex anomalies and actions. This is why the 'Fall' detection performance of the non-DNN methods such as [39] and [40] also reaches an excellent detection rate of 96.6%, and the DNN-based methods such as [41], [31], and [15], including the proposed method, make only slight performance improvements.

V. CONCLUSION
The goal of this paper is to construct a deep neural network that can handle video data under real-time constraints. Our strategy to achieve this goal is to avoid the computationally expensive optical flow computation and to minimize the 3D convolution operations. Accordingly, the proposed solution is a modular architecture with the Front-CNN and the Back-CNN. The Front-CNN is a shallow 3D (S3D) CNN with only three layers of 3D convolution blocks, which is trained to condense multiple frames of the video into only three channels of feature maps (3CFM). Then, since the 3CFM is compatible with a 2D image with RGB channels, we can employ any pre-trained 2D CNN for the following Back-CNN. Our modular S3D-2D CNN architecture was applied to anomaly detection with the Jetson Nano Developer Kit as an edge-computing device. Experimental results confirm the real-time execution with state-of-the-art performance for the anomaly detection problem on the UCF-Crime and UR Fall datasets. Although more 3D convolutional layers can be added to the S3D for more challenging video tasks, they would certainly limit the real-time execution and increase the hardware cost of the edge-computing device.


TABLE 5. Comparison of the anomaly detection performance for the UR Fall Detection dataset. The speed of the proposed method was measured on the Jetson Nano in frames per second (fps).

Method               Detector               Precision   Recall   Accuracy   fps
Harrou et al. [39]   MEWMA(+SVM)            93.6        100.0    96.6       -
Harrou et al. [40]   GLR(+SVM)              94.0        100.0    96.6       -
Feng et al. [41]     BI-LSTM                94.8        91.4     -          -
Li et al. [31]       CNN                    -           -        99.9       -
Kim & Won [10]       MobileNet-v2           96.9        100.0    97.1       33.2
Kim & Won [10]       MobileNet-v3 (Small)   96.7        99.8     96.7       29.4
Kim & Won [10]       MobileNet-v3 (Large)   100.0       98.1     98.3       27.2
S3D+2DCNN            MobileNet-v2           100.0       100.0    100.0      76.6
S3D+2DCNN            MobileNet-v3 (Small)   99.4        99.8     99.5       89.2
S3D+2DCNN            MobileNet-v3 (Large)   100.0       100.0    100.0      79.9

REFERENCES
[1] M. P. J. Ashby, "The value of CCTV surveillance cameras as an investigative tool: An empirical analysis," Eur. J. Criminal Policy Res., vol. 23, no. 3, pp. 441-459, Sep. 2017.
[2] E. L. Piza, B. C. Welsh, D. P. Farrington, and A. L. Thomas, "CCTV surveillance for crime prevention: A 40-year systematic review with meta-analysis," Criminol. Public Policy, vol. 18, no. 1, pp. 135-159, Feb. 2019.
[3] H. Heng, D. Jazayeri, L. Shaw, D. Kiegaldie, A. M. Hill, and M. E. Morris, "Hospital falls prevention with patient education: A scoping review," BMC Geriatrics, vol. 20, no. 1, pp. 1-12, Dec. 2020.
[4] R. Parvathy, S. Thilakan, M. Joy, and K. M. Sameera, "Anomaly detection using motion patterns computed from optical flow," in Proc. 3rd Int. Conf. Advances in Computing and Communications, pp. 58-61, Aug. 2013.
[5] J. F. Kooij, M. C. Liem, J. D. Krijnders, T. C. Andringa, and D. M. Gavrila, "Multi-modal human aggression detection," Computer Vision and Image Understanding, vol. 144, pp. 106-120, Mar. 2016.
[6] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221-231, Jan. 2013.
[7] G. Varol, I. Laptev, and C. Schmid, "Long-term temporal convolutions for action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1510-1517, 2017.
[8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 4489-4497, Dec. 2015.
[9] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proc. CVPR, pp. 4724-4733, Jul. 2017.
[10] J.-H. Kim and C. S. Won, "Action recognition in videos using pre-trained 2D convolutional neural networks," IEEE Access, vol. 8, pp. 60179-60188, Mar. 2020.
[11] F. Lei, X. Liu, Q. Dai, and B. W. K. Ling, "Shallow convolutional neural network for image classification," SN Applied Sciences, vol. 2, no. 1, pp. 1-8, 2020.
[12] F. F. Chamasemani and L. S. Affendey, "Systematic review and classification on video surveillance systems," International Journal of Information Technology and Computer Science (IJITCS), vol. 5, no. 7, p. 87, Jun. 2013.
[13] W. Luo, W. Liu, and S. Gao, "Remembering history with convolutional LSTM for anomaly detection," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), pp. 439-444, Jul. 2017.
[14] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X.-S. Hua, "Spatio-temporal autoencoder for video anomaly detection," in Proc. 25th ACM Int. Conf. Multimedia, pp. 1933-1941, 2017.
[15] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6479-6488, 2018.
[16] A. Singh, D. Patil, and S. N. Omkar, "Eye in the sky: Real-time drone surveillance system (DSS) for violent individuals identification using ScatterNet hybrid deep learning network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), pp. 1629-1637, Jun. 2018.
[17] R. Polishetty, M. Roopaei, and P. Rad, "A next-generation secure cloud-based deep learning license plate recognition for smart cities," in Proc. 15th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), pp. 286-293, Dec. 2016.
[18] K. Abas, C. Porto, and K. Obraczka, "Wireless smart camera networks for the surveillance of public spaces," Computer, vol. 47, no. 5, pp. 37-44, 2014.
[19] S. Vitek and P. Melničuk, "A distributed wireless camera system for the management of parking spaces," Sensors, vol. 18, no. 1, p. E69, Dec. 2018.
[20] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet Things J., vol. 3, no. 5, pp. 637-646, Oct. 2016.
[21] A. E. Eshratifar, A. Esmaili, and M. Pedram, "BottleNet: A deep learning architecture for intelligent mobile cloud computing services," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design (ISLPED), Lausanne, Switzerland, pp. 1-6, Jul. 2019.
[22] S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B.-Y. Choi, and T. R. Faughnan, "Real-time human detection as an edge service enabled by a lightweight CNN," in Proc. IEEE Int. Conf. Edge Comput. (EDGE), pp. 125-129, Jul. 2018.
[23] K. Huang, X. Liu, S. Fu, D. Guo, and M. Xu, "A lightweight privacy-preserving CNN feature extraction framework for mobile sensing," IEEE Transactions on Dependable and Secure Computing, Apr. 2019.
[24] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proc. Adv. Neural Inf. Process. Syst., pp. 568-576, 2014.
[25] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, "Dynamic image networks for action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 3034-3042, Jun. 2016.
[26] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 4510-4520, Jun. 2018.
[27] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, and V. Vasudevan, "Searching for MobileNetV3," arXiv preprint arXiv:1905.02244, 2019.
[28] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," 2012. [Online]. Available: https://arxiv.org/abs/1212.0402
[29] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: A large video database for human motion recognition," in Proc. Int. Conf. Comput. Vis., pp. 2556-2563, Nov. 2011.
[30] B. Kwolek and M. Kepski, "Human fall detection on embedded platform using depth maps and wireless accelerometer," Comput. Methods Programs Biomed., vol. 117, no. 3, pp. 489-501, Dec. 2014.
[31] X. Li, T. Pang, W. Liu, and T. Wang, "Fall detection for elderly person care using convolutional neural networks," in Proc. 10th Int. Congr. Image Signal Process., Biomed. Eng. Informat. (CISP-BMEI), Shanghai, China, pp. 1-6, Oct. 2017.
[32] N. Noury et al., "Fall detection—Principles and methods," in Proc. IEEE 29th Annu. Int. Conf. Eng. Med. Biol. Soc., pp. 1663-1666, Aug. 2007.
[33] NVIDIA Developer, "NVIDIA Jetson—Hardware for every situation," 2019. [Online]. Available: https://developer.nvidia.com/embedded/develop/hardware
[34] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1251-1258, Jun. 2017.
[35] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 248-255, Jun. 2009.
[36] W. Ullah, A. Ullah, I. U. Haq, K. Muhammad, M. Sajjad, and S. W. Baik, "CNN features with bi-directional LSTM for real-time anomaly detection in surveillance networks," Multimedia Tools Appl., pp. 1-17, Aug. 2020.
[37] Y. Zhu and S. Newsam, "Motion-aware feature for improved video anomaly detection," arXiv preprint arXiv:1907.10211, 2019.
[38] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, "Learning temporal regularity in video sequences," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 733-742, Jun. 2016.
[39] F. Harrou, N. Zerrouki, Y. Sun, and A. Houacine, "Vision-based fall detection system for improving safety of elderly people," IEEE Instrum. Meas. Mag., vol. 20, no. 6, pp. 49-55, Dec. 2017.
[40] F. Harrou, N. Zerrouki, Y. Sun, and A. Houacine, "An integrated vision-based approach for efficient human fall detection in a home environment," IEEE Access, vol. 7, pp. 114966-114974, 2019.
[41] Q. Feng, C. Gao, L. Wang, Y. Zhao, T. Song, and Q. Li, "Spatio-temporal fall event detection in complex scenes using attention guided LSTM," Pattern Recognit. Lett., vol. 130, pp. 242-249, Feb. 2020.


JUN-HWA KIM received the B.S. and M.S. degrees in electronics engineering from Dongguk University, Seoul, South Korea, in 2019 and 2020, respectively. He is currently pursuing the Ph.D. degree in electronics engineering with Dongguk University. His current research interests include image/video processing, generative models, facial expression recognition, object detection/tracking, and edge computing.

NAMHO KIM received the B.S. degree in electronics engineering from Dongguk University, Seoul, South Korea, in 2021. He is currently pursuing the master's degree in electronics engineering with Dongguk University. His current research interests include image processing, object detection, and edge computing.

CHEE SUN WON received the B.S. degree in


electronics engineering from Korea University,
Seoul, Korea, in 1982 and the M.S. and Ph.D.
degrees in electrical and computer engineering
from the University of Massachusetts, Amherst,
MA, USA, in 1986 and 1990, respectively.
From 1989 to 1992, he was a Senior Engineer
with GoldStar Co., Ltd. (LG Electronics), Seoul,
Korea. In 1992, he joined Dongguk University,
Seoul, where he is currently a Professor in the
Division of Electrical and Electronics Engineering. He has been a Visiting
Professor at Stanford University, Stanford, CA, USA, and at McMaster
University, Hamilton, ON, Canada. His research interests include computer
vision, deep neural networks, video signal processing, and image feature
detection and matching.
