Real-World Anomaly Detection in Video Using Spatio-Temporal Features Analysis for Weakly Labelled Data with Auto Label Generation

Original Scientific Paper

Rikin J. Nayak
V T Patel Dept of E & C Engg, Chandubhai S Patel Institute of Technology, Charotar University of Science and Technology, Changa, Ta-Petlad, Anand, Gujarat 388421, India
[email protected], [email protected]

Jitendra P. Chaudhari
Charusat Space Research and Technology Center, V T Patel Dept of E & C Engg, Chandubhai S Patel Institute of Technology, Charotar University of Science and Technology, Changa, Ta-Petlad, Anand, Gujarat 388421, India
[email protected]

Abstract - Detecting anomalies in videos is a complex task due to diverse content, noisy labeling, and a lack of frame-level labeling. To address these challenges in weakly labeled datasets, we propose a novel custom loss function in conjunction with the multi-instance learning (MIL) algorithm. Our approach uses the UCF Crime and ShanghaiTech datasets for anomaly detection. The UCF Crime dataset includes labeled videos depicting a range of incidents such as explosions, assaults, and burglaries, while the ShanghaiTech dataset is one of the largest anomaly datasets, with over 400 video clips featuring 13 different scenes and 130 abnormal events. We generated pseudo labels for videos using the MIL technique to detect frame-level anomalies from video-level annotations and to train the network to distinguish between normal and abnormal classes. We conducted extensive experiments on the UCF Crime dataset using C3D and I3D features to test our model's performance. For the ShanghaiTech dataset, we used I3D features for training and testing. Our results show that with I3D features, we achieve an 84.6% frame-level AUC score for the UCF Crime dataset and a 92.27% frame-level AUC score for the ShanghaiTech dataset, which are comparable to other methods used for similar datasets.

Keywords: anomaly detection, spatio-temporal analysis, 3D convolutional neural network, multi-instance learning

1. INTRODUCTION

Anomaly detection, or the identification and classification of data patterns that deviate from normal patterns, is a crucial aspect of intelligent visual surveillance systems. The deployment of CCTV cameras has become widespread and more affordable, which has resulted in increased research attention on video-based anomaly detection. CCTV coverage is particularly important for ensuring security in public areas such as railway stations, hospitals, and military bases. With the increasing availability of powerful computing resources, Artificial Intelligence and Deep Learning have been integrated into smart video surveillance systems to efficiently process and analyze vast amounts of video data. In this paper, we employ a multi-instance learning technique to address the challenges of anomaly detection in videos, using a custom loss function under weakly supervised learning. We assess the effectiveness of our approach on two different datasets, and we compare various feature extraction techniques and performance metrics.

Artificial Intelligence and Deep Learning have significantly enhanced smart video surveillance systems. The effectiveness of these approaches relies on substantial processing power, large datasets, and advanced resources, which have become increasingly accessible due to powerful GPUs and high-RAM systems. Although convolutional neural networks (CNNs) excel at processing spatial information in images, they face limitations when analyzing temporal information in videos. Recurrent neural networks, such as Long Short-Term Memory (LSTM) networks, can address this challenge by modeling sequence information in video data, where each frame depends on its predecessors.
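As a minimal illustration of this pairing (not the architecture used in this paper), the following PyTorch sketch encodes each frame with a 2D CNN and models the temporal order of the frame embeddings with an LSTM; the ResNet-18 backbone, layer sizes, and clip shape are all our own assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnLstmScorer(nn.Module):
    """Illustrative pairing: a 2D CNN encodes each frame spatially,
    an LSTM models the temporal dependence between frame embeddings."""

    def __init__(self, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=None)  # spatial feature extractor
        backbone.fc = nn.Identity()               # expose the 512-d pooled features
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)     # per-frame anomaly score

    def forward(self, clip):                      # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))      # (batch*time, 512)
        feats = feats.view(b, t, -1)              # restore the time axis
        hidden, _ = self.lstm(feats)              # (batch, time, hidden)
        return torch.sigmoid(self.head(hidden)).squeeze(-1)

scores = CnnLstmScorer()(torch.randn(2, 8, 3, 224, 224))  # (2, 8) scores in [0, 1]
```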
Various studies have explored anomaly detection [1-5], employing two main approaches based on the availability of labeled data. The traditional method, suited to situations where labeled data is unavailable, trains the model using known normal data. Alternatively, if labeled data is available, it can be used to train the model and predict abnormal classes for future test data. According to D. Elliott (2010), after 12 hours, a single person can miss up to 80% of the activity between two cameras. This highlights the importance of effective anomaly detection systems. Deep learning outperforms other methods when the available dataset is large [6]. Anomaly detection, or outlier detection, is useful in various applications, including detecting illegal traffic flow [7], retinal damage [8], and IoT big-data anomaly detection [9]. However, deep learning-based methods often face difficulties in anomaly detection because of the complex structure of the data and the narrow boundary between normal and abnormal data.

In this paper, we make the following contributions:

1. We utilize a multi-instance learning technique with an auto label generation loss to tackle the challenge of anomaly detection in videos, particularly when video-level labels are available but anomalies occur at the frame level.

2. We introduce a custom loss function for use in weakly supervised learning, designed to enable more effective extraction of discriminative features and thereby improve anomaly detection performance.

3. We incorporate a mean squared error function on auto-generated labels, which aids in separating interclass features and increasing intraclass feature closeness.

4. Our experiments, conducted without sparsity and temporal smoothness constraints, show that our proposed model is robust and effective. We evaluate our model on two benchmark datasets, UCF Crime and ShanghaiTech, using various feature extraction techniques and comparing the proposed loss functions in different environments.

5. We demonstrate that the I3D feature extractor outperforms the C3D feature extractor in our experiments, and we assess the model's performance using the area under the curve (AUC) metric.

The paper is organized as follows: Section 2 reviews related work and presents the problem statement, Section 3 discusses our proposed approach, Section 4 discusses experimental results, and Section 5 concludes the paper.
2. RELATED WORK

Deep learning has proven to be a superior approach compared to traditional machine learning in several areas, particularly in image and video processing. Nevertheless, detecting anomalies in images and videos remains a challenging task, with many researchers making significant contributions to this field [10-14]. In [10], particle trajectories were utilized to model normal motion, and deviations from the norm were defined as anomalies. The author in [15] provides an in-depth analysis of deep anomaly detection in the medical domain. Researchers have also explored violence and aggression detection [16-18].

Feature learning is a conventional approach for inferring normality from data. However, due to difficulties associated with tracking objects in videos, many researchers have employed alternative methods such as motion pattern analysis using a histogram-based method [19], kernel density estimation methods [20], social force models [21], context-driven methods [22], and hidden Markov models [23]. These techniques offer different ways to address the difficulties of understanding motion and detecting deviations from normal patterns. In such approaches, normal videos are used for training, and during the testing phase videos with lower probability are classified as anomalies. In [24], researchers focused on the problem of online detection of unusual events in videos using dynamic sparse coding. The main idea is that sparse representation can help us learn about normal behaviour in videos, which can then be used to detect unusual or abnormal events. Developing a video action classification model using deep learning has been proposed in [25]. However, video classification is more challenging than deep learning-based image classification due to the difficulty of obtaining annotations for training the model and the extensive effort required to generate frame-level labels. To address the challenges posed by weakly labeled datasets, researchers have explored various approaches, as discussed in [26-29].

The RTFM (Robust Temporal Feature Magnitude learning) method [27] enhances detection by training a specialized function to recognize rare events and consider their timing, resulting in better accuracy and efficiency for detecting subtle anomalies. The MIST framework [28] focuses on using video-level annotations to refine important features, making the anomaly detection process more effective. Furthermore, the authors in [29] introduced the LAD database, a comprehensive collection of video sequences for anomaly detection, along with a multi-task deep neural network that leverages spatiotemporal features, achieving superior performance compared to existing methods in the field. The author in [26] utilized a multi-instance learning (MIL) model to address the issue of weakly labeled datasets. Similar approaches have been employed for detecting anomalies, as discussed in [30-33]. The author in [31] proposes the Anomaly Regression Net (ARNet) framework for video anomaly detection, which only requires video-level labels in training and uses a multiple-instance learning loss and a centre loss for discriminative features. [32] proposes a weakly supervised deep temporal encoding-decoding solution using multiple instance learning for anomaly detection in surveillance videos and employs a new smoother loss function. [33] focuses on reducing false alarms in abnormal activity detection using 3D ResNet and deep multiple instance learning with a new ranking loss function, achieving the best performance on the UCF-Crime benchmark dataset. All three papers present novel approaches for video anomaly detection and achieve advanced results on challenging benchmark datasets.

Detecting anomalies with accuracy is a challenging task, primarily due to its subjective nature, which varies based on location and individual perspectives.
Researchers have approached anomaly detection as a means of identifying low-probability patterns, as evident in studies conducted by [34-36]. In this research, we address the problem of anomaly detection as a regression issue and propose a customized loss function coupled with multi-instance learning techniques.

Our proposed loss function aims to increase the gap between the normal and abnormal frames while minimizing computational complexity. This is achieved by removing the sparsity and temporal smoothness constraints typically present in similar techniques. The proposed methodology section will detail our approach to addressing the challenge of detecting anomalies with high accuracy.

2.1. PROBLEM STATEMENT

Our research tackles the challenge of frame-level anomaly detection in videos using the UCF Crime and ShanghaiTech datasets. These datasets provide anomaly labels at the video level, complicating frame-level detection. To address this, we employ multi-instance learning (MIL) and split the dataset into two parts: one with normal frames and another with a mixture of normal and abnormal frames grouped under a single anomaly class. Our aim is to effectively detect anomalies at the frame level by utilizing MIL and a custom loss function that minimizes false anomaly detections. We will detail our techniques, their application to the datasets, and our experimental results in the subsequent sections. By enhancing frame-level anomaly detection, our research contributes to the field of video surveillance and has potential applications in security systems and public safety measures.

3. PROPOSED METHOD

This section of the paper aims to define the problem of anomaly detection in video, describe the feature extraction method, and provide a detailed description of the proposed loss function. To detect anomalies in video, we utilize the UCF Crime and ShanghaiTech datasets, which contain a range of videos of different lengths categorized as normal, explosion, burglary, fighting, and arrest. Similar to [26], anomaly detection is treated as a regression problem, where a sequence of frames serves as the input and an anomaly score between 0 and 1 is the output for each frame.

In this work, we present a deep learning-based approach for detecting anomalies. We begin by converting the input video into a fixed-size array and then extract features using both three-dimensional convolutional (C3D) features [37] and inflated three-dimensional (I3D) features [38]. Each video is then segmented into a fixed number of non-overlapping temporal segments, and each segment is treated as a "bag" instance for feature extraction. We extract C3D and I3D features from each video segment.

We utilized two types of pre-processed video data, namely C3D and I3D features, to extract features for our model. These models were chosen due to their efficiency in learning spatiotemporal features, which are crucial for further processing. C3D features consist of two-stream pre-processed video data with a feature dimension of 4096. On the other hand, I3D features are composed of RGB and optical-flow features, with a feature dimension of 2048 for each. During the training process, we concatenated the RGB and optical-flow features to create a unified input. To visualize our proposed approach, we have included a diagram of the model with the custom loss function in Fig. 1.

Fig. 1. Model with two different loss functions
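Before turning to the loss function, the bag construction just described can be sketched as follows. This is a minimal illustration under our own assumptions (NumPy, mean-pooling of per-clip features into 32 segments, illustrative stream dimensions), not the paper's released preprocessing:

```python
import numpy as np

def video_to_bag(clip_features, num_segments=32):
    """Pool a video's per-clip features (num_clips, feat_dim) into a fixed
    number of non-overlapping temporal segments: one bag of 32 instances."""
    num_clips, feat_dim = clip_features.shape
    bounds = np.linspace(0, num_clips, num_segments + 1, dtype=int)
    bag = np.empty((num_segments, feat_dim), dtype=clip_features.dtype)
    for s in range(num_segments):
        lo, hi = bounds[s], bounds[s + 1]
        if hi <= lo:                      # very short video: reuse nearest clip
            lo = min(lo, num_clips - 1)
            hi = lo + 1
        bag[s] = clip_features[lo:hi].mean(axis=0)  # average clips in the span
    return bag

# Example: concatenate I3D RGB and optical-flow streams per clip, then build
# the 32-instance bag (the 1024-d stream size here is an assumption).
rgb = np.random.rand(120, 1024).astype(np.float32)
flow = np.random.rand(120, 1024).astype(np.float32)
bag = video_to_bag(np.concatenate([rgb, flow], axis=1))   # shape (32, 2048)
```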
The loss function of the support vector machine model is

L(w) = \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\, 1 - y_i \left(w^T x_i + b\right)\right) + \lambda \lVert w \rVert_2^2    (1)

where w is the weight vector, b is the bias term, and T denotes the transpose of the weight vector. The loss is accumulated over all training data points and is often combined with a regularization term to prevent overfitting. The loss function used in the model has two components: the hinge loss and the regularization term. During training, the learning parameter w is adjusted to minimize the hinge loss, which generates a positive loss for incorrectly classified features. In this supervised learning approach, the labels y_i and features x_i are used along with the bias b to determine the loss. However, as the video frames lack annotation, this approach is not applicable. To address this issue, the MIL approach was adopted, as discussed in [26]. Under this approach, each video is divided into a bag, with the positive bag containing both normal and abnormal frames and the negative bag containing only normal frames. Similar to [26, 39], the maximum (w^T x_i + b) is considered for both types of bags. This approach allows the model to learn to identify abnormal frames without the need for individual frame annotations, which modifies equation (1) to

L(w) = \frac{1}{n} \sum_{j=1}^{n} \max\left(0,\, 1 - y_{B_j} \max_{i \in B_j}\left(w^T x_i + b\right)\right) + \lambda \lVert w \rVert_2^2    (2)

where y_{B_j} is the bag-level label (+1 for positive bags and -1 for negative bags), B_j is the set of instances in the bag, and \max_{i \in B_j}(w^T x_i + b) is the maximum predicted score for any instance in the bag.

Our proposed loss function aims to maximize the distance between positive and negative bags, with only the maximum-distance feature considered for each bag. The selection of the maximum feature is based on the assumption that each abnormal bag should contain at least one abnormal instance, while a normal bag should contain only normal instances. Building on this approach, we developed a custom loss function that combines multi-instance learning with the residual difference between actual and predicted labels to train the network. In this case, the actual label is determined through maximum selection in the MIL process. This can be explained by equation (3):

L_{MIL} = \frac{1}{n} \sum_{j=1}^{n} \max_{i \in B_j^{n}}\left(w^T x_i + b\right) - \frac{1}{m} \sum_{k=1}^{m} \max_{i \in B_k^{a}}\left(w^T x_i + b\right) + \lambda \lVert w \rVert_2^2    (3)

Here, the first term represents the average of the maximum-distance feature from each normal video, the second term represents the average of the maximum-distance feature from each abnormal video, and the third term is the regularization hyperparameter term. The objective of this loss function is to maximize the difference between the abnormal and normal features, as represented by the first two terms of the equation.
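A sketch of this objective, following the reconstruction of equation (3) above, is given below in PyTorch. The batch shapes, the function name mil_loss, and the regularization weight lam are our own assumptions, not the authors' released implementation:

```python
import torch

def mil_loss(scores_normal, scores_abnormal, model, lam=8e-5):
    """Sketch of equation (3): scores_* hold per-instance anomaly scores with
    shape (num_bags, num_segments); each bag is represented by its maximum."""
    max_n = scores_normal.max(dim=1).values     # top instance of each normal bag
    max_a = scores_abnormal.max(dim=1).values   # top instance of each abnormal bag
    # Minimizing this pushes normal maxima down and abnormal maxima up.
    gap = max_n.mean() - max_a.mean()
    # Illustrative weight decay term standing in for lambda * ||w||^2.
    l2 = lam * sum((p ** 2).sum() for p in model.parameters())
    return gap + l2
```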
To further refine this approach, we introduce a custom loss function that combines multi-instance learning with the residual difference between the actual and predicted labels. The following equation explains how pseudo-labels are generated for each instance:

L_{MSE} = \sum_{i} \left(Y_{auto_i} - Y_{pred_i}\right)^2    (4)

Here, Y_{auto} is the label generated based on the distance of the feature from the decision boundary, measured by Y_{pred}. Y_{auto} assigns a label of 1 (abnormal) to an instance if the maximum absolute value of the weighted sum for all instances in the bag is greater than a certain threshold; otherwise, it assigns a label of 0 (normal). This method helps identify the most representative instances within each bag, which, in turn, assists in training the network to maximize the difference between normal and abnormal features. Y_{auto} is calculated as follows:

Y_{auto} = \begin{cases} 1, & \text{if } \max_{i \in B}\left(\lvert w^T x_i + b \rvert\right) > \theta \\ 0, & \text{otherwise} \end{cases}    (5)

where \theta is the threshold. Y_{pred} is the actual distance calculated for each feature in the bag. The final loss function is the sum of equations (3) and (4):

L_{MIL+MSE} = L_{MIL} + L_{MSE}    (6)

The custom loss function incorporates both the multi-instance learning component (L_{MIL}) and the residual difference between actual labels and pseudo-labels (L_{MSE}). This combination allows the model to better learn the relationship between the features and the labels, resulting in improved anomaly detection. Using Y_{auto} as a pseudo-label helps the model learn better decision boundaries by leveraging the information from the most representative instances. This aids in training the model to effectively distinguish between normal and abnormal instances, improving its overall anomaly detection capability.
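Equations (4)-(6) can be sketched in the same style. Here we read equation (4) at the bag level, comparing each bag's pseudo-label with its maximum predicted score (a per-instance variant is an equally plausible reading); the 0.5 threshold is an illustrative choice, and mil_loss is reused from the previous sketch:

```python
import torch

def mse_loss(bag_scores, threshold=0.5):
    """Sketch of equations (4) and (5): a bag's pseudo-label Y_auto is 1 when
    its maximum instance score exceeds the threshold, else 0; the squared
    residual against the predicted bag score Y_pred is then accumulated."""
    y_pred = bag_scores.max(dim=1).values             # max score per bag
    y_auto = (y_pred > threshold).float().detach()    # pseudo-label, held constant
    return ((y_auto - y_pred) ** 2).sum()

def total_loss(scores_normal, scores_abnormal, model):
    """Sketch of equation (6): L_MIL+MSE = L_MIL + L_MSE over both bag types."""
    return (mil_loss(scores_normal, scores_abnormal, model)
            + mse_loss(scores_normal) + mse_loss(scores_abnormal))
```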
4. EXPERIMENTAL RESULTS

This section describes the use of C3D and I3D as feature extractors for video anomaly detection on the UCF Crime dataset and the ShanghaiTech dataset. C3D is a neural network that extracts spatiotemporal features from videos, while I3D inflates 2D convolutional filters into 3D and achieves advanced results in video recognition tasks. The proposed approach extracts features using pre-trained C3D and I3D networks and uses a one-class SVM classifier with a custom loss for anomaly detection. The one-class SVM classifier is a popular choice for anomaly detection, as it is designed to distinguish between normal and abnormal instances. The experimental results show that I3D outperforms C3D in all evaluation metrics, and the system's performance improves with an increase in the number of frames used in feature extraction. The proposed approach achieves competitive results compared to state-of-the-art methods on the UCF Crime dataset and the ShanghaiTech dataset.

The UCF Crime dataset and the ShanghaiTech dataset are both challenging and widely used benchmark datasets for video anomaly detection. The UCF Crime dataset consists of 1,900 real-world surveillance videos that encompass various crime types, such as theft, robbery, vandalism, and fights. This diverse dataset poses a challenge for models to accurately detect and classify different types of anomalous behaviours in realistic settings. On the other hand, the ShanghaiTech dataset contains 437 high-resolution surveillance videos from diverse environments like streets, parks, and commercial areas, featuring anomalies such as jaywalking, loitering, and illegal parking. Its difficulty arises from the high variability in video content, camera angles, and lighting conditions, making it a robust dataset for evaluating video anomaly detection model performance across different scenarios.

4.1. C3D NETWORK

This section introduces a video anomaly detection approach utilizing C3D features extracted from the UCF Crime dataset, as outlined in [26]. The C3D features capture both the appearance and dynamics of moving objects for video action recognition. Each video is segmented into non-overlapping fixed-size segments to create a 4096x32 feature matrix. A neural network with four fully connected layers of 256, 64, and 16 neurons and a single output neuron is employed, trained with an Adagrad optimizer and a learning rate of 0.01. The performance is assessed by the area under the receiver operating characteristic (AUC-ROC) curve, enabling fair comparisons. This approach computes the ROC curve based on the frame-level anomaly score.

4.2. I3D NETWORK

This experiment adopts the Inflated 3D (I3D) model, pre-trained on the Kinetics dataset, as the feature extraction network. The I3D network output for each video includes RGB and optical-flow features, which are concatenated, producing a 2048x32 feature output size. A four-layer fully connected neural network with 128, 32, and 16 units and a single output layer is used. Training is conducted with the Adagrad optimizer and a 0.01 learning rate. Tables 1 and 2 display the results of our custom loss function.

Table 1 highlights the effectiveness of incorporating I3D features into the model for video anomaly detection. The I3D features-based approach achieves an AUC score of 84.66, surpassing other methods in the comparison, thus demonstrating its superiority. Tests were also conducted using C3D features and I3D with only RGB features. Table 2 summarizes the corresponding AUC, F1, and EER scores, providing insights into the performance of different feature sets in video anomaly detection and emphasizing the advantages of I3D features.

Our experiments, conducted using the open-source code by Sultani et al. [26], are based on established research and methods. A confusion matrix in Table 3 adds context and understanding to our findings, detailing the rates of true and false predictions, enabling readers to evaluate the model's effectiveness in detecting video anomalies comprehensively.

Overall, our results in Tables 1, 2, and 3 strongly support I3D features for video anomaly detection. The high AUC score, F1 score, and EER emphasize the effectiveness of our approach compared to others. Incorporating I3D features yields the best performance, as indicated by the highest AUC score. These findings have important implications for future research. Figs. 2 and 3 display results for various test dataset videos. Fig. 3 illustrates the anomaly score graph for abnormal frames, where the model generates higher scores compared to normal frames. This figure presents the results for two spe-
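To make the scoring networks of Sections 4.1 and 4.2 concrete, the sketch below implements the C3D-variant regressor and a frame-level AUC computation. The layer sizes and Adagrad settings follow the text; the sigmoid output, scikit-learn evaluation, frames-per-segment count, and dummy data are our own illustrative assumptions:

```python
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

# Fully connected regressor over 4096-d C3D segment features, following the
# 256-64-16-1 layer sizes and Adagrad (lr = 0.01) given in Section 4.1.
model = nn.Sequential(
    nn.Linear(4096, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid(),          # anomaly score in [0, 1]
)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)

# Frame-level AUC-ROC: segment scores are broadcast to their frames and
# compared against frame-level ground truth (dummy tensors shown here).
segment_scores = model(torch.randn(32, 4096)).squeeze(1)   # one video, 32 segments
frame_scores = segment_scores.repeat_interleave(16)        # e.g. 16 frames/segment
frame_labels = torch.randint(0, 2, (32 * 16,))             # placeholder annotations
auc = roc_auc_score(frame_labels.numpy(), frame_scores.detach().numpy())
print(f"frame-level AUC: {auc:.4f}")
```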
