Computer Vision3
Authorized licensed use limited to: National University Fast. Downloaded on September 17,2021 at 04:20:14 UTC from IEEE Xplore. Restrictions apply.
Fig. 1: Architecture of the proposed model.
• Experiments are conducted on a well-known benchmark, the UCF-Anomaly detection dataset. Results indicate significant performance gains when compared with existing approaches.

The rest of the paper is organized as follows: Section 2 reviews existing methodologies and how they compare with our model. Section 3 provides a detailed view of our proposed method. In Section 4, we elaborate on the implementation process and quantitatively assess and compare our results with state-of-the-art approaches. We conclude in Section 5.

2. RELATED WORK

Deep learning technology has seen much success in recent years, owing to the exploitation of non-linear relations in high-dimensional data, for object detection, image classification, pose estimation, etc. [14, 4]. CNN-based LSTM, whose ability to learn sequential data is crucial for encoding temporal information in videos, is also used in autoencoders and Generative Adversarial Networks (GANs) [15] for anomaly detection. Hasan et al. employed an autoencoder-based architecture to detect anomalies using reconstruction error [5]. Similarly, Chong et al. proposed a CNN-based autoencoder for spatial feature learning and an LSTM-based autoencoder for temporal patterns [6]. Along the same path, Ionescu et al. employed autoencoders to extract features from frames after detecting objects and then used an SVM for anomaly detection [7]. Zhang et al. employ weakly supervised anomaly detection using Multiple Instance Learning (MIL) with inner bag loss (IBL) of both positive and negative bags [16]. Ensemble methods have been utilized to make robust predictions, especially in semi-supervised anomaly detection, to boost base classifier output; bagging and boosting are mostly used with traditional methodologies or neural networks such as autoencoders [8, 9].

The approaches discussed above are trained on normal data, in a semi-supervised or unsupervised manner, with outlier detection at the inference phase for anomaly prediction. Recent work by Sultani et al. developed a dataset containing both anomalous and normal instances, on which classification is performed using CNNs [10]. They use the 3D convolutional network proposed by Tran et al. [11] to extract segment-level video features and perform Multiple Instance Learning (MIL) on both anomalous and normal data to localize anomalies in surveillance videos. Video feature extraction plays a crucial part in anomaly detection. Inspired by the YouTube-8M challenge and the GoogLeNet Inception-v3 network pre-trained on ImageNet [17], we make use of its capability to enhance feature extraction in our model.

3. PROPOSED METHODOLOGY

Figure 1 illustrates the proposed model for anomaly detection. There are two main steps: feature extraction using Inception-v3 and a bagging-based deep learning classification model.

3.1. Feature Representation using Inception-v3

The state-of-the-art Inception-v3 network [17] offers both frame-level and video-level feature representations. We have implemented unsupervised feature extraction via the publicly available Inception-v3 network (https://github.com/google/youtube-8m/tree/master/feature_extractor). In this research, we use the video-level features (2048-D), aggregated from frame-level features using simple average pooling, for their standard dimensions. Each video is decoded at 1 frame per second from the beginning up to the first 6 minutes (360 seconds) of its length. Decoded frames are input into the network and the output of the
last hidden layer with Rectified Linear Unit (ReLU) activation is used. The resulting 1024-D vector, obtained after applying Principal Component Analysis (PCA), is then L2-normalized and quantized (1 byte per coefficient).

3.2. Classification using Ensemble of Fully Connected 3-Layer Neural Network

Ensemble techniques are becoming increasingly significant as they have repeatedly exhibited the ability to improve upon the performance of a single model. Ensemble learning provides higher accuracy than individual classifiers if the member classifiers are accurate and diverse. Bagging is one of the most popular types of ensemble learning, in which different models are generated by randomly selecting data points from the training data. These models are trained traditionally using the same classifier. In this study, a fully connected 3-layer neural network [18] is used as the base model in our parallel bagging ensemble. Confidence scores obtained from the various models are then combined by averaging.

The method is described in Algorithm 1: for a dataset D with m instances of samples x with ground-truth values y, we create T base classifiers using the 3-layer fully connected neural network. For each model we bootstrap instances and obtain a vector $P_k = [p_1, p_2, \ldots, p_k]$ with the predicted scores of the k test samples, which is then aggregated using simple averaging for an unseen instance x as $\frac{1}{T}\sum_{t=1}^{T} h_t(x)$.

Algorithm 1 Bagging Ensemble
Input: Data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
    Base learning classifier L;
    Number of learning rounds T;
Process:
    for t = 1, ..., T:
        $D_t$ = Bootstrap(D);
        $h_t$ = L($D_t$);
    end.
Output: $H(x) = \frac{1}{T}\sum_{t=1}^{T} h_t(x)$

Training and testing on 10 models are performed using a 0.632 bootstrap sampling technique, which samples 500 instances of each class with replacement. We evaluate its performance with both feature extraction methodologies: C3D [11] and Inception-v3 [17].

[...] instead of the original 4096-D. The videos are randomly selected at every iteration with batch size 60. The objective function is modified to calculate error using hinge loss, as shown in Equation 3.

$$h(z) = \frac{1}{1 + \exp(-z)} \tag{1}$$

Given $x_i \in \mathbb{R}^{1024}$ to be the features, extracted using the Inception network, that are input into the training neural network with associated weight $w_i$, the objective function we try to minimize thus becomes:

$$\lambda \lVert w \rVert_2^2 + \sum_{i=1}^{N} L(y_i, h(w_i x_i)) \tag{2}$$

where N is the total number of videos and the learning rate λ is kept at 0.001. For the loss function L, we employ the hinge loss used in binary SVM classification. The formula is given as:

$$L(y, \hat{y}) = \max(0, 1 - y \cdot \hat{y}) \tag{3}$$

where $y \in \{0, 1\}$ is the provided ground-truth value of each video and $\hat{y}$ is the predicted score calculated as $h(wx)$ by the model, as defined in Equation 1.

4. EXPERIMENTS AND RESULTS

We perform all experiments using NVIDIA GTX 1050 Ti GPUs, on a system with 16 GB RAM. We use TensorFlow for feature extraction on the Inception-v3 network and Theano for classification.

4.1. Dataset and Evaluation Metric

We use the UCF-Anomaly-Dataset² published by Sultani et al. [10], which consists of 1900 real-world videos, both anomalous and normal. Anomalous videos span events such as Abuse, Arrest, Arson, Assault, Fighting, Robbery, Shooting, Stealing, etc. We use the same training/test split ratio (75/25) as mentioned in [10].

We use the area under the ROC curve (AUC) as a performance metric, which is standard for anomaly detection with binary predictions and allows us to compare our results with those previously published by [10].
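To illustrate how the averaged ensemble scores of Section 3.2 feed the AUC metric, the following is a minimal pure-Python sketch rather than the implementation used in the experiments. It assumes one-dimensional toy features, and the hypothetical `fit_toy_scorer` merely stands in for the 3-layer network; `stratified_bootstrap`, `bagged_scores`, and `auc` are likewise illustrative names. AUC is computed as the fraction of (anomalous, normal) pairs that are ranked correctly, with ties counted as one half.

```python
import math
import random


def stratified_bootstrap(data, per_class, rng):
    """Draw `per_class` (x, y) pairs of each label, with replacement."""
    sample = []
    for label in (0, 1):
        pool = [pair for pair in data if pair[1] == label]
        sample.extend(rng.choice(pool) for _ in range(per_class))
    return sample


def fit_toy_scorer(sample):
    """Hypothetical base learner: a sigmoid centred midway between the class means.

    It only stands in for the 3-layer network so that the sketch stays runnable.
    """
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    mid = (mu_pos + mu_neg) / 2.0
    sign = 1.0 if mu_pos >= mu_neg else -1.0  # orient scores so anomalies score higher
    return lambda x: 1.0 / (1.0 + math.exp(-sign * (x - mid)))


def bagged_scores(train, test_x, rounds=10, per_class=3, seed=0):
    """Train `rounds` base learners on bootstrap samples and average their scores."""
    rng = random.Random(seed)
    learners = [fit_toy_scorer(stratified_bootstrap(train, per_class, rng))
                for _ in range(rounds)]
    return [sum(h(x) for h in learners) / rounds for x in test_x]


def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: label 1 marks an anomalous video, label 0 a normal one.
train = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
test_x, test_y = [1.5, 8.5, 0.5, 9.5], [0, 1, 0, 1]
scores = bagged_scores(train, test_x)
print("AUC on toy data:", auc(test_y, scores))
```

The pairwise form is quadratic in the number of test videos, which is acceptable for a test split of this size; a sort-based rank computation would be the usual choice for larger sets.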
Fig. 2: Graphs depicting performance evaluation measures of bagging ensembles with a single classifier: (a) precision-recall curve, (b) sensitivity curve, (c) specificity curve.
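The measures named in Fig. 2 can be computed from the ensemble scores once a decision threshold is fixed. The sketch below is a minimal illustration under that assumption; the axes of the original plots are not reproduced here, and the function name `threshold_measures` and the toy scores are hypothetical.

```python
def threshold_measures(labels, scores, threshold):
    """Return precision, sensitivity (recall) and specificity at one threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0  # at or above threshold -> anomalous
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        # Convention: precision is 1.0 when no video is flagged as anomalous.
        "precision": tp / (tp + fp) if tp + fp else 1.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Toy ground truth (1 = anomalous) and scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for threshold in (0.2, 0.5):
    print(threshold, threshold_measures(labels, scores, threshold))
```

Sweeping the threshold over the range of observed scores and recording these values at each step yields the points of such curves.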
5. CONCLUSION
[2] “Crime rate by country 2020,” http://worldpopulationreview.com/countries/crime-rate-by-country, Accessed: 2020-01-07.
[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[4] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[5] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis, “Learning temporal regularity in video sequences,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 733–742.
[6] Yong Shean Chong and Yong Haur Tay, “Abnormal event detection in videos using spatiotemporal autoencoder,” in International Symposium on Neural Networks. Springer, 2017, pp. 189–196.
[7] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao, “Object-centric auto-encoders and dummy anomalies for abnormal event detection in video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7842–7851.
[8] Ngoc Tu Pham, Ernest Foo, Suriadi Suriadi, Helen Jeffrey, and Hassan Fareed M Lahza, “Improving performance of intrusion detection system using ensemble methods and feature selection,” in Proceedings of the Australasian Computer Science Week Multiconference. ACM, 2018, p. 2.
[9] Bingjun Guo, Lei Song, Taisheng Zheng, Haoran Liang, and Hongfei Wang, “Bagging deep autoencoders with dynamic threshold for semi-supervised anomaly detection,” in 2019 International Conference on Image and Video Processing, and Artificial Intelligence. International Society for Optics and Photonics, 2019, vol. 11321, p. 113211Z.
[10] Waqas Sultani, Chen Chen, and Mubarak Shah, “Real-world anomaly detection in surveillance videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.
[12] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
[13] Zhi-Hua Zhou, “Ensemble learning,” Encyclopedia of biometrics, vol. 1, pp. 270–273, 2009.
[14] Karishma Pawar and Vahida Attar, “Deep learning approaches for video-based anomalous activity detection,” World Wide Web, vol. 22, no. 2, pp. 571–601, 2019.
[15] Lin Wang, Fuqiang Zhou, Zuoxin Li, Wangxia Zuo, and Haishu Tan, “Abnormal event detection in videos using hybrid spatio-temporal autoencoder,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 2276–2280.
[16] Jiangong Zhang, Laiyun Qing, and Jun Miao, “Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 4030–4034.
[17] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan, “Youtube-8m: A large-scale video classification benchmark,” arXiv preprint arXiv:1609.08675, 2016.
[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[19] Cewu Lu, Jianping Shi, and Jiaya Jia, “Abnormal event detection at 150 fps in matlab,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2720–2727.
[20] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici, “Beyond short snippets: Deep networks for video classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4694–4702.