
IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, VOL. 11, NO. 1, FEBRUARY 2024

Captionomaly: A Deep Learning Toolbox for Anomaly Captioning in Social Surveillance Systems

Adit Goyal, Murari Mandal, Vikas Hassija, Moayad Aloqaily, Senior Member, IEEE, and Vinay Chamola, Senior Member, IEEE

Abstract— Real-time video stream monitoring is gaining huge attention lately, with an effort to fully automate this process. On the other hand, reporting can be a tedious task, requiring manual inspection of several hours of daily clippings. Errors are likely to occur because of the repetitive nature of the task, which causes mental strain on operators. There is a need for an automated system that is capable of monitoring real-time video streams in social systems and reporting on them. In this article, we provide a tool aiming to automate the process of anomaly detection and reporting. We combine anomaly detection and video captioning models to create a pipeline for anomaly reporting in descriptive form. A new set of labels has been formulated by creating descriptive captions for the videos collected from the UCF-Crime (University of Central Florida-Crime) dataset. The anomaly detection model is trained on UCF-Crime, and the captioning model is trained with the newly created labeled set, UCF-Crime video description (UCFC-VD). The tool is used to perform the combined task of anomaly detection and captioning. Automated anomaly captioning would be useful for the efficient reporting of video surveillance data in different social scenarios. Several testing and evaluation techniques were performed. Source code and dataset: https://github.com/Adit31/Captionomaly-Deep-Learning-Toolbox-for-Anomaly-Captioning.

Index Terms— Anomaly detection, deep learning, surveillance, toolbox, video captioning, UCF-Crime.

Manuscript received 6 July 2022; revised 8 November 2022; accepted 12 December 2022. Date of publication 18 January 2023; date of current version 31 January 2024. (Corresponding author: Vinay Chamola.)
Adit Goyal is with the Department of Computer Science, Northwestern University, Evanston, IL 60201 USA (e-mail: [email protected]).
Murari Mandal and Vikas Hassija are with the School of Computer Engineering, Kalinga Institute of Industrial Technology, Bhubaneswar 751024, India (e-mail: [email protected]; [email protected]).
Moayad Aloqaily is with the Machine Learning Department, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), Masdar City, Abu Dhabi, United Arab Emirates (e-mail: [email protected]).
Vinay Chamola is with the Department of Electrical and Electronics Engineering and APPCAIR, BITS-Pilani, Pilani 333031, India (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSS.2022.3230262

I. INTRODUCTION

The increasing need to ensure safety and privacy through surveillance has led to the rapid deployment of visual sensing devices in both public and private spaces such as offices, traffic intersections, housing apartments, shopping malls, and airports. The deployed closed-circuit television (CCTV) cameras continuously generate a large amount of visual data. This has given rise to the need for efficient video analytics solutions for the large volume of recorded or live surveillance videos [1], [2]. The task of video surveillance can be tedious, with a huge margin of error for human operators. Different agencies use surveillance data to solve problems such as criminal investigation, insurance checks, illegal activities, and security [3]. Monitoring a video stream manually for hours and spotting any suspicious activity is quite inefficient. To automate this process, numerous anomaly detection methods have been proposed in the literature [4], [5], [6], [7]. There has been steady progress in the algorithmic development of visual anomaly detection over the years [8], [9]. More recently, deep learning methods have further improved performance in diverse in-the-wild scenarios as well [4], [10].

Image or video captioning is another field of research that holds a lot of significance in real-world applications [11], [12]. The task is to automatically generate an English description or caption for an image or video clip. It involves analyzing the visual content and producing a description in natural language that captures the essence of the image or video. This could be very helpful in commerce, the military, and education. For instance, visually impaired people can better understand the content of images if a description is available. Advances in deep learning have led to improved performance in image captioning as well [11], [13], [14].

Anomalous events usually have rare occurrences, and a raised false alarm can lead to a loss of time and energy for the concerned stakeholders. The aim of introducing Captionomaly is to report any unusual activity along with a one-line description of that event to the concerned entity so that it can be cross-checked and acted upon in a timely manner. The task of anomaly detection involves video understanding, including differentiating normal events from abnormal ones and detecting the time-stamp-aware duration (start and end time) of the abnormal segment. Several classification techniques have been developed for anomaly detection. However, due to the wide range of anomalous events, class labels alone might not be sufficient for the operator to determine the seriousness of the event. To alleviate such concerns, a short description of the flagged video segment can be useful.

A. Motivation

Anomaly detection and video captioning have been explored in great depth in the literature. Several frameworks have been proposed in an attempt to improve performance. However, combining the two frameworks into an end-to-end surveillance application has not been proposed yet. Recently, the UCF-Crime dataset, along with its baseline model [15], presented 13 different classes of anomaly data in video surveillance. This made it possible to analyze the anomaly detection problem in the wild. However, the textual description for these anomalous events is not available.
In order to learn the descriptions or captions for the individual anomaly types, a new set of captioning labels is needed. Furthermore, a single pipeline to train and infer the anomaly captioning problem has not been proposed in the literature.

Video captioning models require large datasets and a huge amount of time and computation power to train. The model proposed in [16] took 6 h to train on the microsoft research video to text (MSR-VTT) dataset using a GeForce GTX 1080 Ti. For anomaly captioning, however, datasets like MSR-VTT [18] are not ideal. For this use case, UCFC-VD is introduced in this article. UCFC-VD takes 20 min to train on a Tesla V100 x4 and provides optimal results on the same model as [16].

B. Contributions

Motivated by the above-mentioned discussions and the need to address these issues, the contributions of our work are as follows.
1) The proposed Captionomaly is a first-of-its-kind tool offering a pipeline to train and deploy anomaly captioning models for video surveillance applications.
2) A new set of labels for the UCF-Crime dataset videos has been formulated. We denote this captioned video set as UCF-Crime video description (UCFC-VD). UCFC-VD provides extensive and comprehensive labels for anomaly captioning.
3) A framework has been proposed to be used as a toolbox to automate the process of video surveillance for anomaly detection and reporting. The model requires less than 2.5% of the captioning data used by some of the most widely used captioning datasets, such as MSR-VTT [18], and still generates reasonably good results.

II. RELATED WORK

Several works have been carried out in the literature toward the improvement of surveillance systems. For top-view surveillance, an edge-based person detection system using transfer learning was proposed by Ahmed et al. [19]. A framework for anomaly explanation with random forests is presented in [20]; its Explainer explains the anomalous sample using a set of classification rules. Anomalies can also be divided into groups and explained by characterizing subspace rules [21]. Image captioning of some specific classes of anomalies was proposed in [13]. The anomaly detection (AD) dataset proposed in [13] consists of more than 1000 captioned images and contains anomalies belonging to the classes of broken windows, car accidents, domestic violence, fights, fire, guns, and injured people. The model was trained by combining their AD dataset with the IAPR-2012 (International Association of Pattern Recognition) dataset [22]. Pang et al. [23] propose a regression neural network to learn whether an instance pair consists of anomalies or not. A self-trained deep ordinal regression was applied on videos for anomaly detection [24]. Similarly, there have been several attempts to propose frameworks that are efficient while taking care of problems like the high cost of false negatives [4]. Furthermore, anomaly detection could be assisted using social media data [25]. By examining the impact of changes in the external environment on the check-in patterns of users, data from social media could be used in the process of anomaly detection to understand human mobility and people's behavior. Social media data can be used to identify emergency or emergency-like events in an area, which can help isolate anomalies with higher accuracy.

On the other hand, several deep-learning models have been used for video captioning. During video captioning, it is important to maintain the relation between the sentence semantics and the video content. To this end, Pan et al. [26] proposed a long short-term memory (LSTM) network with visual-semantic embedding, and [27] introduced an LSTM with transferred semantic attributes. Baraldi et al. [28] also modified the LSTM framework by adding a special time boundary-aware LSTM cell to detect the hierarchical structure of the video, which helps during captioning. A sparse boundary-aware transformer-based model was proposed by Jin et al. [29] to reduce redundancy for video captioning. Zheng et al. [30] used a syntax-aware action targeting (SAAT) model for video captioning, and a video-to-commonsense dataset along with a captioning model was introduced in [31]. Chen et al. [32] proposed an encoder-decoder-based reinforcement learning framework that rewards picking frames with more diversity and less discrepancy between the generated captions and the ground truth. Attention mechanisms have been used for video captioning in several recent frameworks, such as [33]. However, attention mechanisms have a tendency to overfit on the training set [16]. Self-attentive encoder-decoder frameworks also would not provide any improvement in performance, given the nature of our task, wherein we have small clips and corresponding one-line descriptions are sufficient.

Sultani et al. [15] proposed a weakly supervised multiple instance learning (MIL) method for anomaly detection. They generate anomaly scores from a ranking loss function. The segment-wise scores can be used to locate the temporal segments that show any unusual incident. We adopt this approach and use the detected video segments to train a video captioning model [16]. The captioning model uses an encoder-decoder framework. For the encoder part, video-level feature extraction, word embeddings, and a tagging network are used. In the decoder, variational dropout [34] and layer normalization [35] are used. A hybrid learning scheme, explained in Section III, is used on top of the normal training method for improved accuracy.

III. CAPTIONOMALY FRAMEWORK

The Captionomaly framework can be divided into two parts, as depicted in Fig. 1 and explained in Algorithm 1. Moreover, an average of the bilingual evaluation understudy (BLEU), metric for evaluation of translation with explicit ORdering (METEOR), consensus-based image description evaluation (CIDEr), and recall-oriented understudy for gisting evaluation with longest common subsequence (ROUGE-L) metrics is measured, and the results are reported.

A. Anomaly Detection

We combine the anomaly detection model proposed in [15] with a video captioning model to generate captions for anomalous video clips.
Fig. 1. Framework used for the Captionomaly tool [15], [16], [17].

Fig. 2. Distribution of the videos in the training set of the UCF-Crime dataset on the basis of their duration [15].

A MIL framework is used in [15]. For the MIL approach, a weakly labeled dataset is considered for training the model. Labeling is done at the video level rather than through temporal annotations: the entire video clip is labeled as normal or as anomalous (at least one segment contains an anomaly). Each surveillance clip can be considered as a bag and the 32 segments of the clip as instances in the bag. The objective function can be optimized based on the maximum-scored instance in each bag [36]:

$$\min_{\mathbf{w}}\ \frac{1}{z}\sum_{j=1}^{z}\max\Big(0,\,1-Y_{B_j}\big(\max_{i\in B_j}(\mathbf{w}\cdot\phi(x_i))-b\big)\Big)+\frac{1}{2}\|\mathbf{w}\|^{2} \qquad (1)$$

where $\mathbf{w}$ is the classifier to be learned, $b$ is the bias, $z$ is the number of bags, $\phi(x_i)$ is the feature representation of a segment, and $Y_{B_j}$ is the bag-level label.

A deep MIL ranking method is used to generate scores for each segment. To ensure that the scores across segments of a video vary in a smooth manner, a smoothness constraint, given by $\lambda_1\sum_{i}^{n-1}\big(f(V_a^i)-f(V_a^{i+1})\big)^2$, is included in the loss function. Second, a sparsity constraint, given by $\lambda_2\sum_{i}^{n}f(V_a^i)$, is required because an anomaly lasts only for a small duration in a video. The loss function is given by the following equation:

$$l(B_a,B_n)=\max\Big(0,\,1-\max_{i\in B_a}f(V_a^i)+\max_{i\in B_n}f(V_n^i)\Big)+\lambda_1\sum_{i}^{n-1}\big(f(V_a^i)-f(V_a^{i+1})\big)^2+\lambda_2\sum_{i}^{n}f(V_a^i) \qquad (2)$$

where $V_a^i$ is the $i$th segment of the anomalous video, $f(V_a^i)$ is the generated score between 0 and 1, and a similar representation follows for the corresponding normal ($n$) video segments.

The objective function to predict anomaly scores is given by the following equation:

$$\mathcal{L}(\mathcal{W})=l(B_a,B_n)+\lambda_3\|\mathcal{W}\|_F \qquad (3)$$

where $\mathcal{W}$ represents the model weights.
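As a concrete illustration of (2) and (3), the ranking loss for one pair of bags can be written in a few lines. The following PyTorch sketch assumes each video has already been split into 32 segments with scores in [0, 1]; it mirrors the terms described above but is an illustrative reconstruction, not the authors' released implementation.

```python
import torch

def mil_ranking_loss(scores_anom, scores_norm, lambda1=8e-5, lambda2=8e-5):
    """Deep MIL ranking loss of (2) for one anomalous and one normal bag.

    scores_anom, scores_norm: tensors of shape (32,) with predicted anomaly
    scores f(V^i) in [0, 1] for the 32 segments of each video.
    """
    # Hinge term: the top-scored anomalous segment should outrank
    # the top-scored segment of the normal video.
    hinge = torch.clamp(1.0 - scores_anom.max() + scores_norm.max(), min=0.0)

    # Temporal smoothness: scores of adjacent segments should vary smoothly.
    smoothness = ((scores_anom[:-1] - scores_anom[1:]) ** 2).sum()

    # Sparsity: an anomaly occupies only a small part of the video.
    sparsity = scores_anom.sum()

    return hinge + lambda1 * smoothness + lambda2 * sparsity


def anomaly_objective(scores_anom, scores_norm, model, lambda3=0.01):
    """Objective of (3): ranking loss plus a penalty on the model weights
    (the squared Frobenius norm is used here as a common stand-in for ||W||_F)."""
    weight_penalty = sum(p.pow(2).sum() for p in model.parameters())
    return mil_ranking_loss(scores_anom, scores_norm) + lambda3 * weight_penalty
```

The default λ values match the settings reported later in Section V-B (λ1 = λ2 = 8 × 10^−5 and λ3 = 0.01).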

TABLE I
DESCRIPTION OF THE PROPOSED UCFC-VD DATASET

Fig. 3. Graphical representation of the statistics for the UCFC-VD dataset.

B. Anomaly Captioning

Traditional recurrent neural networks (RNNs) often suffer from overfitting, and variational dropout can help solve this problem [34]. The slowdown that variational dropout causes in the training process can be taken care of by layer normalization, which stabilizes the internal state dynamics of RNNs [35]. A gated recurrent unit (GRU) is capable of learning temporal dependencies across various scales. Based on these considerations, Chen et al. [16] developed the variational-normalized semantic GRU (VNS-GRU). Two VNS-GRU layers are stacked together, and the internal embedding dimension, $n_f$, is reduced. This enhances the decoder's ability while having fewer parameters than a single-layer structure.

Instead of using the normal learning method of optimizing the losses calculated over all training items equally, this model switches to a "professional learning" scheme after 32 epochs. Professional learning is defined in terms of the way the cross-entropy loss function is calculated. $n$ human annotations $A^{(k)}$ are sampled for a video, and the probability distribution $P^{(k)}$ for each token in the caption $C^{(k)}$ is calculated. The cross-entropy loss is given by the following equation:

$$l^{(k)}=\mathrm{mean}\big(-a_i^{(k)}\log p_i^{(k)}\big) \qquad (4)$$

The weighted loss is formulated by combining (4) with weights $\beta^{(k)}$ to optimize the captioning model:

$$\mathrm{Loss}(A,S,V;\theta)=\frac{1}{bs}\sum_{i=0}^{bs-1}\beta^{(i)}l^{(i)} \qquad (5)$$

where $S$ represents semantic information, $V$ represents visual features, and $\theta$ denotes the model parameters. A small value of the loss $l^{(i)}$ indicates a strength of the model and is given more weight to put emphasis on it. A large value indicates a weakness of the model, and less weight is given to those captions, effectively ignoring them. The weight $\beta$ consists of two parts:

$$\beta^{(i)}=(1-\gamma)\,\mathrm{softmax}\big(-\mathrm{abs}\big(\mathrm{len}^{(i)}-\mathrm{avglen}\big)\big)+\gamma\,\mathrm{softmax}\big(-l^{(i)}\big) \qquad (6)$$

where $\gamma$ is a hyper-parameter that balances the length-related and cross entropy-related probability distributions. The first part of (6) encourages the generation of captions of the average length (avglen) set by us, and the second part ensures that samples generating a high loss value receive a lower probability, and vice versa.
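For clarity, the per-sample weights of (6) and the weighted loss of (5) can be computed as in the sketch below. The variable names (losses, lengths, avglen, gamma) mirror the symbols above; the snippet is a minimal reconstruction from the equations, not the original training code.

```python
import torch
import torch.nn.functional as F

def professional_weights(losses, lengths, avglen, gamma=0.8):
    """Per-sample weights beta^(i) of (6).

    losses : (bs,) cross-entropy loss l^(i) of each caption in the batch, as in (4).
    lengths: (bs,) token length len^(i) of each sampled caption.
    avglen : target average caption length.
    """
    # Length part: captions close to the target average length get larger weights.
    length_term = F.softmax(-(lengths.float() - avglen).abs(), dim=0)
    # Loss part: captions the model already handles well (small loss) get larger
    # weights; losses are detached so the weights act as constants.
    loss_term = F.softmax(-losses.detach(), dim=0)
    return (1.0 - gamma) * length_term + gamma * loss_term


def weighted_caption_loss(losses, lengths, avglen, gamma=0.8):
    """Weighted loss of (5): mean of beta^(i) * l^(i) over the batch."""
    beta = professional_weights(losses, lengths, avglen, gamma)
    return (beta * losses).mean()
```

In the general-learning phase (the first 32 epochs), the same code path can be used with beta^(i) fixed to 1 for every sample, which reduces (5) to the ordinary average cross-entropy loss.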

Fig. 4. A few video-caption pairs from the UCFC-VD dataset.

Fig. 5. Results generated by Captionomaly on a few random videos (from YouTube) belonging to different classes.

IV. PROPOSED DATASET

In this section, we discuss a few datasets that were introduced earlier for the tasks of anomaly detection or captioning. Further, we give a brief description along with statistical information regarding the videos used for Captionomaly.

A. Relevant Datasets

Arriaga et al. [13] created an image captioning and classification dataset for dangerous situations like fire and car accidents. A traffic anomaly detection dataset (DoTA) was introduced in [37] with 4677 videos, along with temporal, categorical, and spatial annotations. The University of California San Diego (UCSD) anomaly detection dataset was recorded by a surveillance camera installed at a single location [38]; however, its videos are relatively straightforward, with simple anomalies like jaywalking. Some of the most widely used video captioning datasets include MSR-VTT [18] and microsoft research video description (MSVD) [39]. Although these datasets are extensive, the samples are generalized for everyday activities. Similarly, none of the available datasets can be used for combined anomaly detection and captioning.

B. UCFC-VD Dataset

We use the videos from UCF-Crime and create the anomaly description labels for the relevant video clips. The newly created labels and video clips are named the UCFC-VD dataset.

1) UCF-Crime: The UCF-Crime dataset consists of 1900 surveillance videos, with a total duration of 128 h of real-life video footage [15]. The dataset is divided into 950 normal videos and 950 anomaly videos, which are further divided into 13 classes of anomalies: abuse, arson, arrest, assault, explosion, burglary, fighting, robbery, road accident, shooting, stealing, shoplifting, and vandalism. Fig. 2 is a graphical representation of the distribution of the videos in the training set of the UCF-Crime dataset on the basis of their duration in minutes [15]. A total of 800 normal and 810 anomaly videos constitute the training set, and 150 normal and 140 anomaly videos make up the testing set.

Both the training and testing sets contain videos from all 13 classes of anomalies. For evaluating the performance, temporal annotations indicating the starting and ending points of the anomalous parts are given.

2) Anomaly Caption Generation for UCFC-VD: We select only the anomaly video clips from the UCF-Crime dataset, cropping each to a maximum duration of 30 s. We have observed that 30 s is a sufficient duration to clearly depict the anomaly. The average duration of the videos in the dataset is kept at approximately 20 s to ensure optimal performance of the captioning model in terms of speed and accuracy. Fig. 3 is a graphical representation of the average duration of the videos and the average sentence length for the different classes of anomalies present in the UCFC-VD dataset. For each video, we write five captions and save them in a comma separated value (CSV) file for each class of anomaly. Fig. 4 shows a few examples from our dataset. Similar to the UCF-Crime dataset, the newly created UCFC-VD is divided into 13 classes, namely, abuse, arson, arrest, assault, explosion, burglary, fighting, robbery, road accident, shooting, stealing, shoplifting, and vandalism.

3) Experiment Setting: We divide the UCFC-VD dataset into three parts: 4085 captions (817 videos) for training, 285 captions (57 videos) for validation, and 380 captions (76 videos) for testing. The average length of the captions in UCFC-VD is approximately nine words per sentence. Table I provides a statistical summary of the UCFC-VD dataset used for the captioning model. The dataset comprises a total vocabulary of 1589 words. Annotation of the dataset has been carefully done to maintain the average word length while describing the event appropriately. The set of labels is cross-checked among multiple annotators. However, to further improve the performance of the model, UCFC-VD can be extended by adding more captioning labels. All the computations are done on a Tesla V100 (32 GB) graphical processing unit (GPU).

Algorithm 1 Captionomaly
// Anomaly Detection
Data: v videos from UCF-Crime, v_t number of videos, batch_size batch size, num_iters number of training iterations
// Feature Extraction
idx_curr ← 0;
while idx_curr < v_t do
    run the C3D feature extractor;
    divide the features into 32 segments and store them in txt files;
    idx_curr += 1;
end
// Labeling for MIL
i_v ← 0;
while i_v < 32 * batch_size do
    if i_v < 16 * batch_size then
        label abnormal videos as 0;
    if i_v > (16 * batch_size) − 1 then
        label normal videos as 1;
    i_v += 1;
end
// Deep MIL Ranking Model
total_iter ← 0;
while total_iter < num_iters do
    train the model;
    calculate the MIL ranking loss with sparsity and smoothness constraints;
    total_iter += 1;
end
// Get the Anomalous Segment
i_v ← 0;
while i_v < v_t do
    clip the anomalous segment from the entire video and save the file;
    i_v += 1;
end
// Anomaly Captioning
Data: i inputs, n_a annotations, av annotation-video pairs, total number of epochs epoch_t, and switch point from general learning to professional learning epoch_s
Extract ResNeXt-101 features for UCFC-VD into npy files;
// Tagging Network
i_av ← 0;
while i_av < n_a do
    create tags for all av;
    i_av += 1;
end
Train the tagging network;
// Professional Learning Algorithm
idx_curr ← 0;
while idx_curr < epoch_t do
    if idx_curr < epoch_s then
        optimize the model by giving equal weights to all video-caption pairs;
    else
        optimize the model by giving more weight to samples producing a small loss, and vice versa;
    end
    idx_curr += 1;
end

V. EXPERIMENTS, RESULTS, AND ANALYSIS

A. Evaluation Metric

We use the standard metrics BLEU-4, METEOR, CIDEr, and ROUGE-L to evaluate the proposed anomaly captioning method. Each metric measures how well the candidate sentence matches a set of five reference sentences written by humans. We propose a unified single-score metric that combines all four metrics by taking their average, as given in the following equation:

$$\mathrm{capscore}=\frac{1}{4}\left(\frac{B4_i}{B4_b}+\frac{C_i}{C_b}+\frac{M_i}{M_b}+\frac{R_i}{R_b}\right) \qquad (7)$$

where $x_i/x_b$ denotes the ratio of the checkpoint score to the best score on that metric. The sampling number $n$ for the annotations of each video is fixed at 5.
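The unified score of (7) is simple to compute once the metric values of a checkpoint and the best values observed over all checkpoints are available. A small Python sketch with hypothetical dictionary keys:

```python
def capscore(checkpoint, best):
    """Unified caption score of (7): average ratio of the checkpoint's
    BLEU-4, CIDEr, METEOR, and ROUGE-L values to the best observed values."""
    metrics = ("bleu4", "cider", "meteor", "rougeL")
    return sum(checkpoint[m] / best[m] for m in metrics) / len(metrics)


# A checkpoint that matches the best score on every metric gives capscore = 1.0;
# the values below are placeholders, not results from the paper.
best = {"bleu4": 0.50, "cider": 1.00, "meteor": 0.35, "rougeL": 0.70}
assert abs(capscore(best, best) - 1.0) < 1e-9
```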

TABLE II
COMPARISON OF STATISTICAL DATA INCLUDING VOCABULARY SIZE AND THE NUMBER OF DISTINCT SENTENCES IN THE TEST SET OF THREE DATASETS TRAINED ON THE SAME MODEL

B. Network Architecture for Anomaly Detection

We use the C3D-v1.0 feature extractor for the UCF-Crime dataset [40]. Videos are divided into segments, and 3-D convolution features are extracted for each segment using a pretrained 3-D ConvNet, taken from the fully connected FC6 layer. The frames are of dimensions 240 × 320 pixels, and the frame rate is set to 30 frames/s [15]. Features are calculated for 16-frame clips at a time, followed by l2 normalization, and the mean over all 16-frame clips in a segment is taken as that segment's features. The 4096-D features are given as input to a three-layer fully connected neural network. A 60% dropout regularization is used between the layers. ReLU and sigmoid activations are used for the first and last layers, respectively. The Adagrad optimizer is employed with an initial learning rate of 0.001. λ1 and λ2 are set to 8 × 10^−5, and λ3 is set to 0.01. 30 positive and 30 negative bags are picked at random as a mini-batch. The receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) are used for evaluating the performance in [15].
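For reference, the scoring network just described can be sketched as a small PyTorch module. The input dimension, dropout rate, and the ReLU/sigmoid placement come from the text; the two hidden sizes (512 and 32) are assumptions added for illustration, since the paragraph above does not state them.

```python
import torch
import torch.nn as nn

class AnomalyScorer(nn.Module):
    """Three-layer FC network mapping a 4096-D C3D segment feature to an
    anomaly score in [0, 1], with 60% dropout between the layers."""

    def __init__(self, in_dim=4096, h1=512, h2=32, dropout=0.6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, h1),
            nn.ReLU(),          # ReLU on the first layer, as stated above
            nn.Dropout(dropout),
            nn.Linear(h1, h2),
            nn.Dropout(dropout),
            nn.Linear(h2, 1),
            nn.Sigmoid(),       # sigmoid on the last layer -> score in [0, 1]
        )

    def forward(self, x):
        # x: (batch, 4096) segment features; returns (batch, 1) anomaly scores.
        return self.net(x)


model = AnomalyScorer()
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.001)  # optimizer reported above
```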

Fig. 6. Comparison between the captions generated by our framework and the ground-truth captions in UCFC-VD.

TABLE III
QUANTITATIVE RESULTS ON CAPTIONOMALY AND ITS ABLATIVE VARIATIONS

C. Network Architecture for Anomaly Captioning

The ResNeXt-101 image-level feature extractor is used on our UCFC-VD dataset [17]. It has 64 paths in each block and is pretrained on the ImageNet dataset. The 2048-dimensional feature map is taken from the global pooling layer of ResNeXt [16]. The features are scaled to the range 0–1, and average pooling is applied over the frames of each video. The averaged probability distribution for each video, a 1000-dim vector, is used as the semantic feature. 300 keywords are chosen from the vocabulary as tags for every video for the tagging network, which is used to predict tags as a 300-dim vector. GloVe-840B-300d embeddings are used for the captions in our dataset.
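To make the encoder inputs concrete, the sketch below assembles the three feature streams described above for a single video: the averaged 2048-D ResNeXt-101 visual feature, the averaged 1000-D class-probability vector used as the semantic feature, and the 300-D tag vector. The function and variable names are ours, and the min-max scaling is one possible reading of "scaled to the range 0–1"; only the dimensions and the pooling step come directly from the text.

```python
import numpy as np

def encode_video(frame_features, frame_probs, tag_scores):
    """Build the encoder inputs for one video.

    frame_features: (T, 2048) ResNeXt-101 global-pooling features, one row per frame.
    frame_probs:    (T, 1000) ImageNet class probabilities, one row per frame.
    tag_scores:     (300,)    tagging-network scores for the 300 vocabulary tags.
    """
    # Scale the visual features to the range 0-1, then average-pool over frames.
    feats = frame_features.astype(np.float32)
    feats = (feats - feats.min()) / (feats.max() - feats.min() + 1e-8)
    visual = feats.mean(axis=0)           # (2048,) visual feature

    semantic = frame_probs.mean(axis=0)   # (1000,) averaged probability distribution
    tags = tag_scores                     # (300,) tag vector

    return visual, semantic, tags
```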

D. Implementation Details

For the UCFC-VD dataset, the embeddings are 300-dim vectors. The values of the other parameters are: n_h (hidden state dimension) = 512, n_f (mid-input dimension) = 64, n_v (vocabulary size) = 1589, n_t (tagging dimension) = 300, n_z2 (ResNeXt feature dimension) = 2048, and γ = 0.8. The model is trained for 55 epochs (32 epochs of general training, followed by the professional learning scheme). The Adam algorithm is used for optimization with an initial learning rate lr of 2 × 10^−3. A weight decay wd of 0.9 every 1000 steps is set for this model.
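The hyper-parameters listed above can be gathered into a single configuration object; a possible layout is sketched below (the field names are ours, the values are the ones reported in this subsection).

```python
from dataclasses import dataclass

@dataclass
class CaptionomalyConfig:
    # Decoder and embedding dimensions.
    n_h: int = 512        # hidden state dimension
    n_f: int = 64         # mid-input dimension
    n_v: int = 1589       # vocabulary size
    n_t: int = 300        # tagging dimension
    n_z2: int = 2048      # ResNeXt feature dimension
    embed_dim: int = 300  # GloVe-840B-300d caption embeddings
    gamma: float = 0.8    # balance between length and loss terms in (6)

    # Training schedule.
    epochs_total: int = 55     # 32 general epochs + professional learning
    epochs_general: int = 32   # switch point to professional learning
    lr: float = 2e-3           # initial Adam learning rate
    weight_decay: float = 0.9  # decay factor applied every 1000 steps
    decay_every: int = 1000
```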
E. Quantitative Results

The quantitative results for anomaly captioning are presented in Table III, where we show the results of the proposed method for several parameters such as the learning rate and threshold. Our method achieves the best capscore of 0.9602. The best results were obtained with a learning rate of 2 × 10^−3 and a threshold of 32. We also compare this with an anomaly detection + vanilla LSTM-based image captioning model; our method comfortably outperforms the LSTM baseline. We show a comparative analysis of the results produced by the Captionomaly framework in Table II. It shows that the model performs significantly better on UCFC-VD as compared to MSR-VTT and MSVD, even without the video-level efficient convolutional network (ECN) features [41]. After factoring in the performance differences between the systems used for training the models by Chen et al. [16], our framework takes significantly less time to train while producing better results. The model gives a score (7) of 0.96 on UCFC-VD, which can be improved further by increasing the size of the training set.

F. Qualitative Results on UCFC-VD and YouTube Videos

We collect some random anomaly-related videos from YouTube and test the model on these videos. Fig. 5 depicts the results of Captionomaly on videos belonging to different classes such as abuse, burglary, shooting, and stealing. Fig. 6 shows how our model performs on the UCFC-VD test set: the captions generated by our framework are compared with the ground truth. It can be seen that the generated captions utilize the vocabulary learned from the annotations of other videos, and the words are used in sentences with appropriate grammar. However, some of the results had grammatical errors, for example, "a bus is fight in the bus." The occurrence of such mistakes was rare and can be alleviated by an even more extensive training set.

VI. CONCLUSION

In this article, we borrow existing methods for anomaly detection [15] and video captioning [16] to create a single framework for anomaly captioning. Since the relevant labeled data was not available, a new set of labels was produced for the videos collected from the UCF-Crime dataset, and a new dataset named UCFC-VD is presented. The proposed framework can be used to produce automated anomaly description reports from surveillance videos. The segment-wise anomaly scores are thresholded to obtain the anomalous clips from the video. The detected video clip is further used to produce the relevant description of the event. In the future, we plan to extend the UCFC-VD dataset to include more anomaly categories and further improve the model's sentence-formation capabilities in the wild.

REFERENCES

[1] J. Yang, C. Wang, B. Jiang, H. Song, and Q. Meng, "Visual perception enabled industry intelligence: State of the art, challenges and prospects," IEEE Trans. Ind. Informat., vol. 17, no. 3, pp. 2204–2219, Mar. 2021.
[2] R. Nawaratne, S. Kahawala, S. Nguyen, and D. De Silva, "A generative latent space approach for real-time road surveillance in smart cities," IEEE Trans. Ind. Informat., vol. 17, no. 7, pp. 4872–4881, Jul. 2021.
[3] T. G. Nguyen, T. V. Phan, D. T. Hoang, T. N. Nguyen, and C. So-In, "Federated deep reinforcement learning for traffic monitoring in SDN-based IoT networks," IEEE Trans. Cognit. Commun. Netw., vol. 7, no. 4, pp. 1048–1065, Dec. 2021.
[4] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, "Deep learning for anomaly detection: A review," ACM Comput. Surv., vol. 54, no. 2, pp. 1–38, Mar. 2021.
[5] F. Kong, J. Li, B. Jiang, H. Wang, and H. Song, "Integrated generative model for industrial anomaly detection via bidirectional LSTM and attention mechanism," IEEE Trans. Ind. Informat., vol. 19, no. 1, pp. 541–550, Jan. 2023.
[6] K. Agrawal, T. Alladi, A. Agrawal, V. Chamola, and A. Benslimane, "NovelADS: A novel anomaly detection system for intra-vehicular networks," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 11, pp. 22596–22606, Nov. 2022.
[7] T. Alladi, B. Gera, A. Agrawal, V. Chamola, and F. R. Yu, "DeepADV: A deep neural network framework for anomaly detection in VANETs," IEEE Trans. Veh. Technol., vol. 70, no. 11, pp. 12013–12023, Nov. 2021.
[8] A. A. Sodemann, M. P. Ross, and B. J. Borghetti, "A review of anomaly detection in automated surveillance," IEEE Trans. Syst., Man, Cybern., C (Appl. Rev.), vol. 42, no. 6, pp. 1257–1272, Nov. 2012.
[9] T. D. Ngo, T. T. Bui, T. M. Pham, H. T. B. Thai, G. L. Nguyen, and T. N. Nguyen, "Image deconvolution for optical small satellite with deep learning and real-time GPU acceleration," J. Real-Time Image Process., vol. 18, no. 5, pp. 1697–1710, Oct. 2021.
[10] N. T. Le, J.-W. Wang, C.-C. Wang, and T. N. Nguyen, "Novel framework based on HOSVD for ski goggles defect detection and classification," Sensors, vol. 19, no. 24, p. 5538, Dec. 2019.
[11] S. Islam, A. Dash, A. Seum, A. H. Raj, T. Hossain, and F. M. Shah, "Exploring video captioning techniques: A comprehensive survey on deep learning methods," Social Netw. Comput. Sci., vol. 2, no. 2, pp. 1–28, Apr. 2021.
[12] C. Yan et al., "STAT: Spatial-temporal attention mechanism for video captioning," IEEE Trans. Multimedia, vol. 22, no. 1, pp. 229–241, Jan. 2020.
[13] O. Arriaga, P. Plöger, and M. Valdenegro-Toro, "Image captioning and classification of dangerous situations," 2017, arXiv:1711.02578.
[14] A. Khamparia, B. Pandey, S. Tiwari, D. Gupta, A. Khanna, and J. J. Rodrigues, "An integrated hybrid CNN–RNN model for visual description and generation of captions," Circuits, Syst., Signal Process., vol. 39, no. 2, pp. 776–788, 2020.
[15] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6479–6488.
[16] H. Chen, J. Li, and X. Hu, "Delving deeper into the decoder for video captioning," in Proc. ECAI. Amsterdam, The Netherlands: IOS Press, 2020, pp. 1079–1086.
[17] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1492–1500.
[18] J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: A large video description dataset for bridging video and language," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5288–5296.
[19] I. Ahmed, M. Ahmad, J. J. P. C. Rodrigues, and G. Jeon, "Edge computing-based person detection system for top view surveillance: Using CenterNet with transfer learning," Appl. Soft Comput., vol. 107, Aug. 2021, Art. no. 107489.
[20] M. Kopp, T. Pevný, and M. Holeňa, "Anomaly explanation with random forests," Expert Syst. Appl., vol. 149, Jul. 2020, Art. no. 113187.
[21] M. Macha and L. Akoglu, "Explaining anomalies in groups with characterizing subspace rules," Data Mining Knowl. Discovery, vol. 32, no. 5, pp. 1444–1480, Sep. 2018.
[22] M. Grubinger, P. Clough, H. Müller, and T. Deselaers, "The IAPR TC-12 benchmark: A new evaluation resource for visual information systems," in Proc. Int. Workshop Image, vol. 2, 2006, pp. 1–13.
[23] G. Pang, C. Shen, H. Jin, and A. van den Hengel, "Deep weakly-supervised anomaly detection," 2019, arXiv:1910.13601.
[24] G. Pang, C. Yan, C. Shen, A. van den Hengel, and X. Bai, "Self-trained deep ordinal regression for end-to-end video anomaly detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 12173–12182.
[25] C. Comito, C. Pizzuti, and N. Procopio, "Online clustering for topic detection in social data streams," in Proc. IEEE 28th Int. Conf. Tools with Artif. Intell. (ICTAI), Nov. 2016, pp. 362–369.
[26] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, "Jointly modeling embedding and translation to bridge video and language," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 4594–4602.
[27] Y. Pan, T. Yao, H. Li, and T. Mei, "Video captioning with transferred semantic attributes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6504–6512.

[28] L. Baraldi, C. Grana, and R. Cucchiara, "Hierarchical boundary-aware neural encoder for video captioning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1657–1666.
[29] T. Jin, S. Huang, M. Chen, Y. Li, and Z. Zhang, "SBAT: Video captioning with sparse boundary-aware transformer," 2020, arXiv:2007.11888.
[30] Q. Zheng, C. Wang, and D. Tao, "Syntax-aware action targeting for video captioning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13096–13105.
[31] Z. Fang, T. Gokhale, P. Banerjee, C. Baral, and Y. Yang, "Video2Commonsense: Generating commonsense descriptions to enrich video captioning," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2020, pp. 1–21.
[32] Y. Chen, S. Wang, W. Zhang, and Q. Huang, "Less is more: Picking informative frames for video captioning," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 358–373.
[33] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y.-W. Tai, "Memory-attended recurrent network for video captioning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 8339–8348.
[34] Y. Gal and Z. Ghahramani, "A theoretically grounded application of dropout in recurrent neural networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 29, 2016, pp. 1019–1027.
[35] J. Lei Ba, J. Ryan Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450.
[36] S. Andrews, I. Tsochantaridis, and T. Hofmann, "Support vector machines for multiple-instance learning," in Proc. NIPS, vol. 2, 2002, pp. 561–568.
[37] Y. Yao, X. Wang, M. Xu, Z. Pu, E. Atkins, and D. Crandall, "When, where, and what? A new dataset for anomaly detection in driving videos," 2020, arXiv:2004.03044.
[38] W. Li, V. Mahadevan, and N. Vasconcelos, "Anomaly detection and localization in crowded scenes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 18–32, Jan. 2014.
[39] D. Chen and W. B. Dolan, "Collecting highly parallel data for paraphrase evaluation," in Proc. 49th Annu. Meeting Assoc. Comput. Linguistics: Hum. Lang. Technol., 2011, pp. 190–200.
[40] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4489–4497.
[41] M. Zolfaghari, K. Singh, and T. Brox, "ECO: Efficient convolutional network for online video understanding," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 695–712.

Adit Goyal received the B.Tech. degree from the Computer Science Department, Jaypee Institute of Information Technology (JIIT), Noida, India, in 2022. He is currently pursuing the M.S. degree in computer science with Northwestern University, Evanston, IL, USA. His research interests include machine learning, data science, and quantum computing.

Murari Mandal received the B.E. degree from the Birla Institute of Technology and Science-Pilani, Pilani, India, in 2011, and the M.E. degree from Thapar University, Patiala, India, in 2015. He is an Assistant Professor at the School of Computer Engineering, Kalinga Institute of Industrial Technology, Bhubaneswar, India. He was a Post-Doctoral Research Fellow at the National University of Singapore (NUS), Singapore. His current research is in privacy and security in machine learning, machine unlearning, data privacy, synthetic data generation, deep learning, and computer vision.

Vikas Hassija received the M.E. degree from the Birla Institute of Technology and Science-Pilani, Pilani, India, in 2014. He is an Associate Professor at the School of Computer Engineering, Kalinga Institute of Industrial Technology, Bhubaneswar, India. He was a Post-Doctoral Research Fellow at the National University of Singapore (NUS), Singapore. His current research is in blockchain, non fungible tokens, the IoT, privacy and security, and distributed networks.

Moayad Aloqaily (Senior Member, IEEE) received the Ph.D. degree in computer engineering from the University of Ottawa, Ottawa, ON, USA, in 2016. He is currently working with the Machine Learning Department, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), Masdar City, United Arab Emirates. He is a Professional Engineer Ontario (P.Eng.). His current research interests include the applications of AI and ML, connected and autonomous vehicles, blockchain solutions, and sustainable energy and data management.

Vinay Chamola (Senior Member, IEEE) received the B.E. degree in electrical and electronics engineering and the master's degree in communication engineering from the Birla Institute of Technology and Science-Pilani (BITS-Pilani), Pilani, India, in 2010 and 2013, respectively, and the Ph.D. degree in electrical and computer engineering from the National University of Singapore, Singapore, in 2016. In 2015, he was a Visiting Researcher with the Autonomous Networks Research Group (ANRG), University of Southern California, Los Angeles, CA, USA. He also worked as a Post-Doctoral Research Fellow at the National University of Singapore. He is currently an Assistant Professor with the Department of Electrical and Electronics Engineering, BITS-Pilani, where he heads the Internet of Things Research Group/Laboratory. His research interests include IoT security, blockchain, UAVs, VANETs, 5G, and healthcare. Dr. Chamola is a fellow of the IET. He is listed in the World's Top 2% Scientists identified by Stanford University. He is a Co-Founder and the President of the healthcare startup Medsupervision Pvt. Ltd. He serves as the Co-Chair of various reputed workshops like the IEEE GLOBECOM Workshop 2021, IEEE INFOCOM 2022 workshop, IEEE ANTS 2021, and IEEE ICIAfS 2021, to name a few. He serves as an Area Editor for the Ad Hoc Networks journal (Elsevier) and the IEEE Internet of Things Magazine. He also serves as an Associate Editor for the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, IEEE NETWORKING LETTERS, IEEE Consumer Electronics Magazine, IET Quantum Communications, IET Networks, and several other journals.
