
ENSEMBLE LEARNING USING BAGGING AND INCEPTION-V3 FOR ANOMALY DETECTION IN SURVEILLANCE VIDEOS

Yumna Zahid, Muhammad Atif Tahir and Muhammad Nouman Durrani

School of Computer Science


National University of Computer and Emerging Sciences, Karachi Campus, Pakistan
{yumna.zahid, atif.tahir, muhammad.nouman}@nu.edu.pk

ABSTRACT

The prevalent use of surveillance cameras in public places and advancements in computer vision have made anomalous activity detection one of the most sought-after research domains. Several approaches have been proposed for the detection of anomalies in videos. Spatio-temporal features extracted with a 3D Convolutional Network (C3D) form a state-of-the-art approach to this problem, in which a deep multiple instance ranking framework is investigated. However, this approach requires the segmentation of videos before feature extraction, which can produce unstable segmentation results and has a large memory footprint. In this paper, we extract video features using the Inception-v3 deep learning network, which eliminates segmentation. Moreover, to improve the robustness of the backbone classifier, we propose a homogeneous bagging ensemble of the 3-Layer Fully Connected (FC) Network. Experiments are conducted on the UCF-Anomaly detection dataset and exhibit improved performance over existing approaches.

Index Terms— Anomaly Detection, Feature Learning, Bagging Ensemble

1. INTRODUCTION

Increased security risks in public places have instigated the proliferation of closed-circuit television (CCTV) camera installation and monitoring. According to the video surveillance market outlook 2020 [1], the burgeoning global surveillance market is expected to reach $87,361.8 million by 2025. These security risks comprise multifarious types of public crimes such as Arrest, Robbery, Burglary, Automobile Theft, etc., and remain persistently high, with Venezuela topping the world at a crime index of 84.86 [2]. This necessitates the use of technology such as computer vision to automate crime detection. An anomaly is generally characterized as any event deviating from normal activity. However, anomalous event detection is a complex problem due to the variability of its context-specific definition. Formally, it can be classified into two main categories: local anomaly and global anomaly [14]. This paper focuses on global anomaly detection, which incorporates holistic anomalous interaction between multiple entities (such as a car colliding with a person), as compared to an abnormality concerning a single entity (for instance, a vehicle moving in the wrong direction).

Existing research has focused mainly on unsupervised methodologies for anomaly detection [3, 4, 5, 6, 7, 8, 9], which implicate training on normal data and inferring abnormality during testing using reconstruction error or outlier detection. Although this excludes the requirement for extensive abnormal data collection, it does not yield a precise classification of the anomaly, especially for suspicious activity. Moreover, fine-grained classification cannot be performed using a model that has been learned only on normal data. Sultani et al. [10] proposed Multiple Instance Learning (MIL), which uses the spatio-temporal features extracted by a 3D Convolutional Network (C3D) [11] and classification by a 3-Layer Fully Connected (FC) network, and thus makes use of both normal and abnormal videos during training. Video segmentation and scalability are the two main problems in this study, since the method requires meticulous segmentation of videos before feature extraction and has a larger memory footprint. Our proposed homogeneous ensemble learning method based on bagging alleviates the stringent requirement of segmentation and focuses on deep feature learning for video classification using the Inception-v3 network [12], with a 3-Layer Fully Connected Network used at the classification stage. Bagging, the most popular homogeneous ensemble learning technique, has been shown to reduce uncertainty in models and increase generalization capability [13]. Moreover, sub-sampling can optimize resource consumption by breaking the data into smaller subsets and training them separately using a single base classifier [8]. Our main contributions in this paper are:

• Holistic video-level deep features are extracted using the Inception-v3 network, removing the requirement of segmenting videos before feature learning.

• A homogeneous bagging ensemble is proposed to improve the generalization capability of the 3-Layer Fully Connected (FC) Network.
Fig. 1: Architecture of the proposed model.

• Experiments are conducted on the well-known benchmark UCF-Anomaly detection dataset. Results indicate significant performance gains when compared with existing approaches.

The rest of the paper is organized as follows: Section 2 reviews existing methodologies and how they compare with our model. Section 3 provides a detailed view of our proposed method. We then elaborate on the implementation process and quantitatively assess and compare our results with state-of-the-art approaches in Section 4. We conclude in Section 5.

2. RELATED WORK

Deep learning technology has seen much success in recent years for object detection, image classification, pose estimation, etc., owing to its exploitation of non-linear relations in high-dimensional data [14, 4]. CNN-based LSTM, with its ability to learn sequential data that is crucial for encoding temporal information in videos, is also used in autoencoders and Generative Adversarial Networks (GAN) [15] for detecting anomalies. Hasan et al. employed an autoencoder-based architecture to detect anomalies using reconstruction error [5]. Similarly, Chong et al. proposed to use a CNN-based autoencoder for spatial feature learning and an LSTM-based autoencoder for temporal patterns [6]. Along the same path, Ionescu et al. employed autoencoders to extract features from frames after detecting objects and then used an SVM for anomaly detection [7]. Zhang et al. employ weakly supervised anomaly detection using Multiple Instance Learning (MIL) with an inner bag loss (IBL) on both positive and negative bags [16]. Ensemble methods have been utilized to make robust predictions, especially in semi-supervised anomaly detection, to boost base classifier output; bagging and boosting are mostly used with traditional methodologies or neural networks such as autoencoders [8, 9].

The approaches discussed above are trained on normal data in a semi-supervised or unsupervised manner, with outlier detection at the inference phase for anomaly prediction. Recent work published by Sultani et al. developed a dataset containing instances of both anomalous and normal data, on which classification is performed using CNNs [10]. They use the 3D convolutional network proposed by Tran et al. [11] to extract segment-level video features and perform Multiple Instance Learning (MIL) on both anomalous and normal data to localize anomalies in surveillance videos. Video feature extraction plays a crucial part in anomaly detection. Inspired by the YouTube-8M challenge and the GoogLeNet Inception-v3 network pre-trained on ImageNet [17], we make use of its capability to enhance feature extraction in our model.

3. PROPOSED METHODOLOGY

Figure 1 illustrates the proposed model for anomaly detection. There are two main steps: feature extraction using Inception-v3 and a bagging-based deep learning classification model.

3.1. Feature Representation using Inception-v3

The state-of-the-art Inception-v3 network [17] offers both frame-level and video-level feature representations. We have implemented unsupervised feature extraction via the publicly available Inception-v3 network¹. In this research, we use the video-level features (2048-D), aggregated from frame-level features using simple average pooling, for their standard dimensions. Each video is decoded at 1 frame-per-second from the beginning to the first 6 minutes (360 seconds) of its length. Decoded frames are input into the network and the output of the last hidden layer with Rectified Linear Unit (ReLU) activation is used. The resulting vector is reduced to 1024-D with Principal Component Analysis (PCA), then L2 normalized and quantized (1 byte per coefficient).

¹ https://github.com/google/youtube-8m/tree/master/feature_extractor
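To make this pipeline concrete, the following is a minimal sketch of the video-level feature step, assuming Keras' pre-trained Inception-v3 with global average pooling and a pre-fitted scikit-learn PCA; the helper names, frame preprocessing, and quantization scheme are illustrative assumptions rather than the YouTube-8M extractor itself.

```python
# Illustrative sketch only (assumptions: frames already decoded at 1 fps and resized
# to 299x299; `pca` is a pre-fitted sklearn.decomposition.PCA mapping 2048-D -> 1024-D).
import numpy as np
import tensorflow as tf

# Pre-trained Inception-v3 backbone; pooling="avg" yields one 2048-D vector per frame.
backbone = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")

def video_level_feature(frames, pca):
    """frames: float array of shape (num_frames, 299, 299, 3), at most 360 frames."""
    x = tf.keras.applications.inception_v3.preprocess_input(frames.astype("float32"))
    frame_feats = backbone.predict(x, verbose=0)            # (num_frames, 2048)
    video_feat = frame_feats.mean(axis=0, keepdims=True)    # average pooling over frames
    reduced = pca.transform(video_feat)                     # (1, 1024) after PCA
    reduced = reduced / (np.linalg.norm(reduced) + 1e-8)    # L2 normalization
    # Simple 8-bit quantization (1 byte per coefficient); the exact scheme of the
    # reference extractor may differ.
    quantized = np.clip((reduced + 1.0) * 127.5, 0, 255).astype(np.uint8)
    return quantized
```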



3.2. Classification using Ensemble of Fully Connected 3-Layer Neural Network

Ensemble techniques are becoming increasingly significant, as they have repeatedly exhibited the ability to improve upon the performance of a single model. Ensemble learning provides higher accuracy than individual classifiers if the member classifiers are accurate and diverse. Bagging is one of the most popular types of ensemble learning, in which different models are generated by randomly selecting data points from the training data. These models are trained traditionally using the same classifier. In this study, a fully connected 3-Layer Neural Network [18] is used as the base model in our parallel bagging ensemble. Confidence scores obtained from the various models are then combined by averaging. The method is described in Algorithm 1: for a dataset D with m instances of samples x and ground-truth values y, we create T base classifiers using the 3-Layer fully connected neural network. In each model we bootstrap instances and obtain a vector P_k = [p_1, p_2, ..., p_k] with the predicted scores of the k test samples, which is then aggregated by simple averaging for an unseen instance x as H(x) = (1/T) Σ_{t=1}^{T} h_t(x). Training and testing on 10 models are performed using the 0.632 bootstrap sampling technique, which samples 500 instances of each class with replacement. We evaluate its performance with both feature extraction methodologies: C3D [11] and Inception-v3 [17].

Algorithm 1: Bagging Ensemble
  Input:  Data set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)};
          Base learning classifier L;
          Number of learning rounds T.
  Process:
    for t = 1, ..., T:
        D_t = Bootstrap(D);
        h_t = L(D_t);
    end
  Output: H(x) = (1/T) Σ_{t=1}^{T} h_t(x)
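For illustration, a minimal sketch of Algorithm 1 follows, assuming features and labels are NumPy arrays and that `base_learner_factory` is a hypothetical callable returning a fresh classifier with fit/predict methods; a plain bootstrap of size m is shown, whereas the 0.632 scheme sampling 500 instances per class would replace the sampling line.

```python
# Sketch of Algorithm 1 (bagging): bootstrap T training sets, fit one base learner
# on each, and average the predicted confidence scores at test time.
import numpy as np

def bagging_ensemble(X, y, base_learner_factory, T=10, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    learners = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)   # D_t = Bootstrap(D): sample with replacement
        h_t = base_learner_factory()       # fresh base classifier L
        h_t.fit(X[idx], y[idx])            # h_t = L(D_t)
        learners.append(h_t)

    def H(X_test):
        # H(x) = (1/T) * sum_t h_t(x): average the T confidence scores
        return np.mean([h.predict(X_test) for h in learners], axis=0)

    return H
```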
3.2.1. Fully Connected 3-Layer Neural Network

The fully connected base classifier uses ReLU activation in the first layer and a Sigmoid (Equation 1) in the last layer to calculate the prediction score. The Adagrad optimizer is used with a 0.001 learning rate. In our binary classifier, a high probability score indicates the presence of an anomalous event and a low score indicates normality for the entire video. This eliminates the need to segment videos for feature extraction. We modify the base classifier to use an input of a 1024-D feature vector instead of the original 4096-D. The videos are randomly selected at every iteration with a batch size of 60. The objective function is modified to calculate the error using the hinge loss shown in Equation 3.

h(z) = 1 / (1 + exp(−z))    (1)

Given features x_i ∈ R^1024, extracted using the Inception network and input into the training neural network with associated weights w_i, the objective function we try to minimize becomes:

λ‖w‖₂² + Σ_{i=1}^{N} L(y_i, h(w_i x_i))    (2)

where N is the total number of videos and the learning rate λ is kept at 0.001. For the loss function L, we employ the hinge loss used in binary SVM classification, given as:

L(y, ŷ) = max(0, 1 − y·ŷ)    (3)

where y ∈ {0, 1} is the provided ground-truth value of each video and ŷ is the predicted score calculated as h(wx) by the model, as defined in Equation 1.
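A minimal Keras sketch of such a base classifier is given below; it is not the authors' exact code. The hidden-layer widths are assumptions, the λ term of Equation 2 is expressed as L2 weight regularization, and Keras' built-in hinge loss expects targets remapped from {0, 1} to {−1, +1}, a common convention assumed here.

```python
# Illustrative 3-layer fully connected classifier over 1024-D video features,
# trained with Adagrad (lr = 0.001) and a hinge-style loss (cf. Equations 1-3).
import tensorflow as tf

def build_fc3(input_dim=1024, weight_decay=1e-3):
    reg = tf.keras.regularizers.l2(weight_decay)   # plays the role of the lambda*||w||^2 term
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(512, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(32, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(1, activation="sigmoid", kernel_regularizer=reg),  # h(z), Eq. (1)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.001),
                  loss=tf.keras.losses.Hinge())    # targets remapped as y' = 2y - 1
    return model
```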


4. EXPERIMENTS AND RESULTS

We perform all experiments using NVIDIA GTX 1050 Ti GPUs on a system with 16 GB RAM. We use TensorFlow for feature extraction with the Inception-v3 network and Theano for classification.

4.1. Dataset and Evaluation Metric

We use the UCF-Anomaly-Dataset² published by Sultani et al. [10], which consists of 1900 real-world anomalous as well as normal videos. Anomalous videos span events such as Abuse, Arrest, Arson, Assault, Fighting, Robbery, Shooting, Stealing, etc. We use the same training/test split ratio (75/25) as mentioned in [10].

We use the area under the ROC curve (AUC) as the performance metric, which is standard for anomaly detection with binary predictions and allows us to compare our results with those previously published in [10].

² https://www.crcv.ucf.edu/projects/real-world/
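As a brief usage note, the metric can be computed directly from the video-level scores; a small sketch with placeholder labels and scores (not actual results) follows.

```python
# Sketch: area under the ROC curve over video-level anomaly scores.
from sklearn.metrics import roc_auc_score

y_true  = [1, 0, 1, 1, 0]                  # 1 = anomalous video, 0 = normal (placeholders)
y_score = [0.91, 0.20, 0.65, 0.83, 0.35]   # ensemble-averaged prediction scores H(x)
print(f"AUC = {roc_auc_score(y_true, y_score):.4f}")
```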
4.2. Results and Discussion

Table 1 shows a comparison between our proposed approach and the existing methods. We specifically compare our proposed model with the state-of-the-art approach of Sultani et al. [10] and the baseline SVM. Notably, we achieve an increase of ≈1% with C3D features and ≈16% using Inception-v3 features.



Fig. 2: Performance evaluation measures of bagging ensembles compared with a single classifier: (a) precision-recall curve, (b) sensitivity curve, (c) specificity curve.

Table 1: Comparison of Approaches.

Method                                      AUC
C3D + SVM                                   65.98
Hasan et al. [5]                            50.60
Lu et al. [19]                              65.51
C3D + 3-Layer FC Net. [10]                  75.41
C3D + 3-Layer FC Net. (Bagging)             76.49
Inception-v3 + 3-Layer FC Net. (Single)     91.28
Inception-v3 + 3-Layer FC Net. (Bagging)    92.06

Fig. 3: ROC curve showing performance comparison between the different methods.

Applying a bagging ensemble to the model originally proposed by [10] gives an improved result of 76.49, and a significantly better result with Inception-v3 as the feature extractor (92.06). In both cases, we infer that using bagging gives better performance than a single model. This is due to the fact that ensemble learning increases the generalization of the base classifier and reduces variance, thereby performing well on unseen test data. We can also deduce that Inception-v3 spatial feature extraction is more powerful than C3D spatio-temporal features. Temporal features tend to show poor performance on long videos and when paired with a simple classifier, as opposed to a sequence-preserving architecture [20]. SVM has shown detrimental effects with C3D feature extraction due to the sparsity of anomaly occurrence in segmented video frames.

Figure 3 shows the ROC curve for all the methods. We can easily distinguish high true positive rates (0.9-1.0) at low false-positive rates (0.2-0.3) in the case of Inception-v3, compared with C3D (0.6-0.7). Moreover, the bagging ensemble displays high ROC curves with both feature extraction methodologies. With reference to the precision-recall curve in Figure 2a, bagging provides substantially better results with Inception-v3, giving good precision at high recall values. Bagging has also shown greater sensitivity (Figure 2b) in accurately identifying anomalous events in the testing videos, as well as in detecting normal situations, evident from the increasing specificity (Figure 2c).

5. CONCLUSION

Computer vision and deep learning technology enable extensive research opportunities in the realm of solving real-world problems such as anomaly detection. We have empirically demonstrated binary classifier robustness using a bagging ensemble. We have also shown how significantly feature learning contributes to a model's accuracy by integrating Inception-v3 features into a 3-Layer Fully Connected (FC) neural network. All experiments were performed on the publicly available UCF-Anomaly Detection dataset, which includes both real-world anomalous and normal video instances. We believe incorporating both classes in training is intrinsic for real-world application, and this work can be further extended to perform fine-grained classification.

Acknowledgments. This research work was funded by the Higher Education Commission (HEC) Pakistan and the Ministry of Planning Development and Reforms under the National Center in Big Data and Cloud Computing.

6. REFERENCES

[1] "Video surveillance market outlook - 2025," https://www.alliedmarketresearch.com/Video-Surveillance-market, Accessed: 2020-01-07.



[2] "Crime rate by country 2020," http://worldpopulationreview.com/countries/crime-rate-by-country, Accessed: 2020-01-07.

[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[4] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.

[5] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis, "Learning temporal regularity in video sequences," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 733–742.

[6] Yong Shean Chong and Yong Haur Tay, "Abnormal event detection in videos using spatiotemporal autoencoder," in International Symposium on Neural Networks. Springer, 2017, pp. 189–196.

[7] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao, "Object-centric auto-encoders and dummy anomalies for abnormal event detection in video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7842–7851.

[8] Ngoc Tu Pham, Ernest Foo, Suriadi Suriadi, Helen Jeffrey, and Hassan Fareed M Lahza, "Improving performance of intrusion detection system using ensemble methods and feature selection," in Proceedings of the Australasian Computer Science Week Multiconference. ACM, 2018, p. 2.

[9] Bingjun Guo, Lei Song, Taisheng Zheng, Haoran Liang, and Hongfei Wang, "Bagging deep autoencoders with dynamic threshold for semi-supervised anomaly detection," in 2019 International Conference on Image and Video Processing, and Artificial Intelligence. International Society for Optics and Photonics, 2019, vol. 11321, p. 113211Z.

[10] Waqas Sultani, Chen Chen, and Mubarak Shah, "Real-world anomaly detection in surveillance videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.

[11] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, "Learning spatiotemporal features with 3d convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.

[12] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.

[13] Zhi-Hua Zhou, "Ensemble learning," Encyclopedia of Biometrics, vol. 1, pp. 270–273, 2009.

[14] Karishma Pawar and Vahida Attar, "Deep learning approaches for video-based anomalous activity detection," World Wide Web, vol. 22, no. 2, pp. 571–601, 2019.

[15] Lin Wang, Fuqiang Zhou, Zuoxin Li, Wangxia Zuo, and Haishu Tan, "Abnormal event detection in videos using hybrid spatio-temporal autoencoder," in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 2276–2280.

[16] Jiangong Zhang, Laiyun Qing, and Jun Miao, "Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection," in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 4030–4034.

[17] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan, "Youtube-8m: A large-scale video classification benchmark," arXiv preprint arXiv:1609.08675, 2016.

[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[19] Cewu Lu, Jianping Shi, and Jiaya Jia, "Abnormal event detection at 150 fps in matlab," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2720–2727.

[20] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici, "Beyond short snippets: Deep networks for video classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4694–4702.



