Human Activity Recognition: Fight Detection using Long-term Recurrent Convolutional Network (LRCN)
Ngrddy Jhnvi
School of Computer Science and Engineering
Lovely Professional University
Phagwara, Punjab, India
[email protected]

.S.K.Bhrdwj
School of Computer Science and Engineering
Lovely Professional University
Phagwara, Punjab, India
[email protected]

Ajay Sharma
Delivery and Student Success
upGrad Education Private Limited
Bangalore, Karnataka, India
[email protected]

Shamneesh Sharma
Delivery and Student Success
upGrad Education Private Limited
Bangalore, Karnataka, India
[email protected]

Ankur Sodhi
Enterprise Business and Career Services
upGrad Education Private Limited
Bangalore, Karnataka, India
[email protected]

Saikat Gochhait
Symbiosis Institute of Digital and Telecom Management
Constituent of Symbiosis International Deemed University
Pune, India
[email protected]

Abstract—Human Activity Recognition for Fight Detection is an important research domain aimed at automatically identifying patterns indicative of physical altercations. Leveraging deep learning models like CNNs and RNNs, this approach extracts features from video frames to recognize fight-related behavioural patterns. The primary objective is to develop a machine learning model capable of autonomously detecting instances of fights in video footage, using techniques such as the Long-term Recurrent Convolutional Network (LRCN). Training involves a comprehensive dataset encompassing examples of both fight and non-fight activities, with model performance evaluated using standard metrics. Each frame undergoes individual analysis by the model to predict the presence of any indications of a fight. The model predicts the action with an accuracy of 98.03%, and the movements given as input can be categorized as fights or no fights.

Index Terms—Human Activity Recognition, Fight Detection, Deep Learning, Security, Public Safety, Video Processing, TensorFlow, Keras, ANN, Motion Detection.

I. INTRODUCTION

CCTV surveillance stands as a paramount and impactful security measure for various premises, including hospitals, universities, malls, and more. It serves as the most prevalent means of preventing and detecting unfavorable activities. Human activity recognition proves invaluable in numerous scenarios, particularly in identifying abnormal behavior within security systems [1]. With the escalating demand for security, surveillance cameras have become ubiquitous for analyzing video footage. Many establishments have integrated CCTV cameras to monitor individuals and their actions. However, a significant challenge lies in discerning unusual occurrences within surveillance videos. Human behavior identification in real-world settings holds vast applicability, aiding endeavors such as smart security cameras and understanding consumer behavior in retail settings. Surveillance cameras are instrumental in ensuring public safety, both indoors and outdoors. Constant manual monitoring of these cameras proves impractical, however, and locating specific video clips after an incident demands a substantial time investment. Fortunately, advanced technology now facilitates automated video and audio analysis, swiftly identifying unusual events and enhancing safety measures [2]. Detection of human behavior in video surveillance systems is an intelligent approach to identifying suspicious activities, and various efficient algorithms exist for automatically detecting human behavior in public areas such as bus stations, railway stations, banks, offices, airports, and colleges [3]. Automated surveillance systems play a major role in security by detecting and tracking moving objects, thereby identifying potential security threats. Current research in computer vision mainly concentrates on creating automatic video surveillance systems capable of handling dynamic scenes, as well as on the use of AI with machine learning and deep learning [4]. A notable capability of AI is its power to mimic human-like thinking processes without being bound to rigid, predefined rules, whereas ML focuses on learning from samples of past data to shape the algorithm's outcomes for an unknown future. Presently, with the availability of GPU processors and extensive datasets, deep learning principles are extensively employed. Deep Neural Networks (DNNs) represent one of the most effective architectures
for tackling challenging learning tasks. Convolutional Neural Networks can help in learning visual patterns directly from image pixels, while Long Short-Term Memory models are adept at capturing long-term dependencies, possessing the ability to retain information over time [5]. The proposed system aims to utilize security camera footage to monitor activities on a college campus. Unusual behavior triggers alerts, with a primary focus on promptly identifying anomalies and understanding typical human behavior patterns such as aggression or swift reflexes.

This paper is organized in seven sections. Section II provides an in-depth review of the current progress in automated human activity recognition, with a focus on the detection of fights using deep learning methodologies, particularly Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. Section III outlines the proposed system's architecture. Section IV presents the research methodology. Section V details the workflow, including data collection, preprocessing techniques, model design, training, evaluation, and deployment of the fight detection model. Section VI discusses the results obtained from the developed model, including its performance metrics and deployment considerations. Section VII concludes the paper by summarizing key findings and examining the future scope and potential applications of the model in various domains, such as public safety and security monitoring.

II. LITERATURE REVIEW

Many researchers have developed models and theories to detect human activities. One among them is fight detection, which addresses a serious and yet frequently occurring phenomenon. The primary objective is to develop a comprehensive system capable of performing real-time video analysis to detect the presence of physical altercations and promptly notify the relevant authorities. Although CNNs have a reputation for poor temporal scalability, when combined with a well-organized architecture and real-time video capture and processing logic, they act as spatial feature extractors whose outputs feed LSTM networks with stochastic node connections [6]. This strategy specifically deals with knowledge transfer, including fine-tuning of the CNN that extracts spatial features through feature transfer.

In some research studies, streamlined yet robust bottleneck units for learning motion patterns are incorporated into enhanced internal designs, leveraging the DenseNet architecture to facilitate feature reusability and channel interaction [7]. This method has demonstrated excellence in capturing spatiotemporal features while requiring relatively few parameters. The performance of the proposed model is thoroughly assessed on three standard datasets, demonstrating superior recognition accuracy compared to other state-of-the-art approaches, and additional experiments evaluate its effectiveness and computational efficiency, with the final results highlighting the advantages of the proposed model in both recognition accuracy and computational performance [8], [9]. The primary objective of generating a model is to identify instances of physical altercations, a critical task with applications in various industries such as security, sports, and healthcare, driven by the rapid advancement of smart devices. Convolutional neural networks (CNNs) and long short-term memory (LSTM) networks emerge as two promising deep learning methodologies for accurately detecting physical altercations, and research studies combine these approaches to achieve heightened recognition accuracy [10]. To capture the temporal dynamics inherent in fights, one methodology initially extracts features from raw sensor data using a CNN, which are then fed into an LSTM network for further analysis [11]. Differentiating abnormal incidents in video surveillance, especially physiological deviations, is a difficult but important function for ensuring the effectiveness of security systems [12]. However, this is not a simple process, and identification of these unusual occurrences is clearly a priority. Wang proposes a computer program that deals with this challenge efficiently: the researcher devised an algorithm that applies a movement descriptor system along with classification techniques [13]. The identification of abnormal events can also be performed with the help of an anomaly indicator constructed from a hidden Markov model [14]. This model evaluates the similarity between observed frames and normal frames using histograms of optical flow orientations, and extensive video testing has verified the efficacy of this approach.

A few research studies focused on developing Convolutional Neural Networks (CNNs) specifically designed to detect fights in surveillance videos [15]. Their work, outlined in IEEE Transactions on Video Technology, underscores the significance of utilizing CNNs to accurately detect instances of physical altercations, contributing to advancements in video surveillance technology. Researchers have also investigated real-time human activity recognition from videos with limited training data, using transfer learning methods to address data scarcity [16]. The study, presented at the Workshops on Computer Vision and Pattern Recognition, emphasizes the potential of transfer learning in enhancing the robustness of activity recognition systems. Foundational studies established fundamental understandings of human dynamics in video sequences, laying the groundwork for modern activity recognition systems [17], and other researchers further expanded this field by investigating audio-visual fusion techniques, emphasizing the multifaceted approach required for effective violence detection [18].
III. PROPOSED SYSTEM

Long-term Recurrent Convolutional Networks (LRCNs) form a deep learning architecture that merges convolutional and recurrent neural networks to effectively process sequential data with spatial and temporal dependencies. Here is an overview of LRCN; a hedged Keras sketch of such a network follows the list.

• Convolutional Layers: LRCN begins with convolutional layers, typically pre-trained on image datasets like ImageNet. These layers are responsible for extracting spatial features from individual frames of sequential data, such as images or video frames.
• Recurrent Layers: Following the convolutional layers, recurrent layers are employed to capture temporal dependencies across sequential data. Different types of recurrent neural networks can be employed here, such as LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) cells [2]. They enable the model to remember information from previous frames and learn long-term dependencies.
• Combination of Convolutional and Recurrent Layers: The present study proposes a model that learns spatial and temporal features jointly, effectively capturing complex patterns within sequential data. Using LRCNs is advantageous because a single end-to-end network covers both aspects efficiently.
• Fully Connected Layers: After the recurrent layers, fully connected layers may be added to the architecture for further processing and classification tasks. These layers integrate the features learned by both the convolutional and recurrent layers and provide the final output prediction of the developed deep learning model.
• Training and Optimization: LRCN is trained using gradient-based optimization algorithms like SGD or Adam. During the training stage, the model minimizes its loss function (typically cross-entropy loss for classification tasks) by updating the parameters of its convolutional, recurrent, and fully connected layers.
• Applications: LRCN has been used in various domains including video activity recognition, video captioning, and human activity recognition. Its ability to capture both spatial and temporal characteristics makes it well suited to tasks that require sequential data analysis.
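To make the layer stack concrete, the following is a minimal sketch of such an LRCN in Keras. It assumes the 20-frame, 64x64 RGB input configured in Section V and infers the convolution and pooling sizes from the layer listing in Table I; the paper does not show the authors' exact code, so treat every hyperparameter here as an assumption.

```python
# Hedged LRCN sketch (assumed hyperparameters; see lead-in above).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (TimeDistributed, Conv2D, MaxPooling2D,
                                     Flatten, LSTM, Dense)

SEQ_LEN, H, W = 20, 64, 64  # sequence length and frame size from Section V

model = Sequential([
    # Spatial feature extraction, applied to each frame independently
    TimeDistributed(Conv2D(32, (3, 3), padding="same", activation="relu"),
                    input_shape=(SEQ_LEN, H, W, 3)),
    TimeDistributed(MaxPooling2D((4, 4))),   # 64x64 -> 16x16
    TimeDistributed(Conv2D(64, (3, 3), padding="same", activation="relu")),
    TimeDistributed(MaxPooling2D((4, 4))),   # 16x16 -> 4x4
    TimeDistributed(Conv2D(128, (3, 3), padding="same", activation="relu")),
    TimeDistributed(MaxPooling2D((2, 2))),   # 4x4 -> 2x2
    TimeDistributed(Conv2D(256, (3, 3), padding="same", activation="relu")),
    TimeDistributed(MaxPooling2D((2, 2))),   # 2x2 -> 1x1
    TimeDistributed(Flatten()),              # (SEQ_LEN, 256) feature sequence
    LSTM(32),                                # temporal modelling over frames
    Dense(2, activation="softmax"),          # fight / no-fight probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```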
IV. RESEARCH METHODOLOGY

The purpose of fight detection is to develop a machine learning model that can identify physical altercations automatically, matching human recognition capabilities. A dataset for training the model could comprise video recordings extracted from surveillance cameras or other sources, and the output would be a binary classification indicating the presence or absence of a fight. The model undergoes training on a comprehensive dataset containing examples of both fight and non-fight behaviors, with its performance evaluated using standard evaluation metrics such as accuracy and validation loss. The objective is to attain high certainty in detecting fights while minimizing false positive and false negative results. The dataset that we use contains both sporting and non-sporting videos, so when we analyze our dataset we can distinguish between the two [1].

Fig. 1. Workflow diagram showing the CNN and LSTM based neural network for the fight detection.

V. WORKFLOW

A. Data Collection

• Record video data of fight and non-fight activities by performing actions clearly, or collect data from open sources.
• Capture corresponding body motion data and movements.
• Collect various sequences of actions, including fighting and non-fighting activities like walking, running, playing, etc.
• Annotate start and end times of each activity in the synchronized video data streams.
• The data collected should be well-lit to avoid complications while training the model.
• Split the collected data into different classes to effectively train the model.

B. Data Pre-processing

The videos from the dataset are normalized during data preprocessing, which includes resizing the video frames. The next steps involve changing the color format from BGR (Blue, Green, Red) to RGB (Red, Green, Blue). The frame dimensions were changed to (224, 224) pixels, and the current work took 25 frames per second.

Resizing the pixels of video frames to a standard dimension of (64, 64) ensures consistency in the data format and serves as the first step in structuring the data for video processing. Furthermore, we configured the sequence length to include 20 frames per sequence, which enables a full temporal information representation.
The first step in the frame extraction phase is to retrieve the total number of frames in the video, which provides information needed for the processing steps that follow. After that, the skip window is computed, which establishes the interval at which frames will be sampled for analysis. Finally, frame normalization is carried out to guarantee consistency in pixel values across frames.
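A minimal sketch of this extraction loop, assuming OpenCV and the settings above (20-frame sequences of 64x64 RGB frames), is given below; the function and variable names are illustrative rather than taken from the paper.

```python
# Illustrative frame-extraction sketch (assumed names; see lead-in above).
import cv2
import numpy as np

def extract_frames(video_path, sequence_length=20, size=(64, 64)):
    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))  # total frame count
    skip_window = max(total // sequence_length, 1)      # sampling interval
    frames = []
    for i in range(sequence_length):
        capture.set(cv2.CAP_PROP_POS_FRAMES, i * skip_window)
        ok, frame = capture.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # BGR -> RGB
        frame = cv2.resize(frame, size)                  # standardize size
        frames.append(frame.astype(np.float32) / 255.0)  # normalize pixels
    capture.release()
    return np.asarray(frames)
```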
Prior to training, several prerequisites must be addressed. First, video frames need to be labeled for training, providing the ground-truth annotations essential for supervised learning. Next, a one-hot encoding scheme is developed, enabling efficient representation of categorical data. Subsequently, architectures for both 2D and 3D convolutional neural networks (CNNs) are designed to model video data effectively. Additionally, LSTM and GRU networks are constructed to model temporal sequences from sensor data accurately. Experimentation with attention models and temporal convolutional networks (TCNs) is conducted to enhance model performance. Finally, multimodal fusion models are architected to seamlessly integrate both video and ambient data sources, ensuring comprehensive data utilization for improved insights and predictions.

C. Model Training

• Train independent models on video frames and body motion for violence detection.
• Optimize the models using stochastic gradient descent and Adam algorithms.
• Jointly train multimodal networks using divided learning and supervised learning.
• The ReLU (Rectified Linear Unit) activation function is a standard choice in neural networks because it offers simplicity and computational efficiency:

f(x) = max(0, x)    (1)

• The main function of validation accuracy is to measure how accurately the model processes test data. The validation set is a specific portion of the data that is not used for model training:

Accuracy = N_correct / N_total    (2)

where N_correct refers to the total number of examples for which the model's predicted class label matches the actual class label, and N_total represents the number of data points used in the calculation.
• The validation accuracy is given by:

Accuracy_val = (N_correct / N_total) × 100    (3)

where N_correct is the number of correctly classified samples in the validation set and N_total is the total number of validation samples. The validation loss, in turn, is concerned only with the held-out data in the validation set; such a split between training and testing allows us to assess the efficiency of the model without overfitting on the training data:

Loss = −Σ_i y_i log(ŷ_i)    (4)

where Σ_i represents summation over all classes:
– y_i: the actual probability for class i (one-hot encoded).
– ŷ_i: the predicted probability for class i.
• The validation loss can be calculated as:

Loss_val = (1/N) Σ_{i=1}^{N} Loss(ŷ_i, y_i)    (5)

where Loss is the chosen loss function:
– ŷ_i represents the predicted output for the i-th sample.
– y_i is the true label for the i-th sample.
– N is the total number of validation samples.
A small numeric illustration of Eqs. (2)-(5) follows this list.
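The toy example below reproduces Eqs. (2)-(5) with NumPy; the label and prediction values are made up for illustration and are not results from the paper.

```python
# Numeric illustration of Eqs. (2)-(5) with toy values (not paper results).
import numpy as np

y_true = np.array([[1, 0], [0, 1], [0, 1]])              # one-hot labels
y_pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])  # model outputs

correct = (y_pred.argmax(1) == y_true.argmax(1)).sum()   # N_correct = 2
accuracy_val = correct / len(y_true) * 100               # Eq. (3): 66.67 %

losses = -np.sum(y_true * np.log(y_pred), axis=1)        # Eq. (4) per sample
loss_val = losses.mean()                                 # Eq. (5): ~0.415
print(accuracy_val, loss_val)
```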
TABLE I
ARCHITECTURE OF SEQUENTIAL MODEL

Type of Layer                         | Shape of Output        | Params #
time_distributed (TimeDistributed)    | (None, 20, 64, 64, 32) | 896
time_distributed_1 (TimeDistributed)  | (None, 20, 16, 16, 32) | 0
time_distributed_2 (TimeDistributed)  | (None, 20, 16, 16, 64) | 18,496
time_distributed_3 (TimeDistributed)  | (None, 20, 4, 4, 64)   | 0
time_distributed_4 (TimeDistributed)  | (None, 20, 4, 4, 128)  | 73,856
time_distributed_5 (TimeDistributed)  | (None, 20, 2, 2, 128)  | 0
time_distributed_6 (TimeDistributed)  | (None, 20, 2, 2, 256)  | 295,168
time_distributed_7 (TimeDistributed)  | (None, 20, 1, 1, 256)  | 0
time_distributed_8 (TimeDistributed)  | (None, 20, 256)        | 0
lstm (LSTM)                           | (None, 32)             | 36,992
dense (Dense)                         | (None, 2)              | 66
D. Evaluation

1) Evaluate classification accuracy per activity and calculate average precision/recall.
2) Visualize confusion matrices to analyze model performance across activity classes.
3) Compare to benchmarks and prior state-of-the-art methods.
4) Perform ablation studies to identify the most predictive modalities and network components.

E. Deployment

1) Package optimized models into apps and services for real-time recognition of violent or physically aggressive movements (a hedged inference sketch follows this list).
2) Develop APIs and interfaces to utilize predictions and interact with the systems.
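Item 1 above implies running the trained network on incoming clips. A sketch of that inference step is shown below; it reuses the extract_frames helper sketched in Section V-B, and the model filename and class names are illustrative assumptions rather than artifacts from the paper.

```python
# Illustrative inference wrapper (assumed filenames and labels; see lead-in).
import numpy as np
from tensorflow.keras.models import load_model

CLASS_NAMES = ["no_fight", "fight"]  # assumed output ordering

def classify_clip(video_path, model):
    frames = extract_frames(video_path)                # (20, 64, 64, 3)
    probs = model.predict(frames[np.newaxis, ...])[0]  # add batch dimension
    idx = int(np.argmax(probs))
    return CLASS_NAMES[idx], float(probs[idx])

model = load_model("lrcn_fight_detector.h5")           # hypothetical file
label, score = classify_clip("cctv_clip.mp4", model)
print(label, score)
```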
F. Streamlit

Deploying a machine learning model with Streamlit involves several straightforward steps. Initially, you prepare your model, training it with appropriate libraries. Then, you install Streamlit via pip. Next, you craft a Python script to create the Streamlit web application, integrating your trained model for predictions, user interaction, and result visualization. Once your app is ready, you select a deployment platform such as Streamlit Sharing. Through this process, the model can be effectively deployed using Streamlit; moreover, Streamlit offers a user-friendly interface to deploy and run the model over the web. A minimal sketch of such a script follows.
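The sketch below assumes the classify_clip helper from Section V-E and a saved model file; the widget flow (file upload, prediction, display) follows the steps just described, and all names are illustrative.

```python
# app.py — illustrative Streamlit front end (run: streamlit run app.py).
import tempfile
import streamlit as st
from tensorflow.keras.models import load_model

st.title("Fight Detection with LRCN")
uploaded = st.file_uploader("Upload a video clip", type=["mp4", "avi"])

if uploaded is not None:
    # Write the upload to a temporary file so OpenCV can read it by path
    with tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as tmp:
        tmp.write(uploaded.read())
    model = load_model("lrcn_fight_detector.h5")   # hypothetical model file
    label, score = classify_clip(tmp.name, model)  # helper from Section V-E
    st.video(tmp.name)
    st.write(f"Prediction: {label} ({score:.2%} confidence)")
```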
VI. RESULTS AND DISCUSSION

The model is trained, developed, and deployed on the web. The purpose of the project includes developing a deep learning system that is trained on a diverse dataset and allows physical altercations as well as aggressive behavior to be detected within video data. Upon deployment, the mechanism integrates seamlessly into surveillance systems, delivering real-time detection and alerting decision-makers to potential hazards. The model provides an accuracy of 98.03%.

VII. CONCLUSION AND FUTURE SCOPE

Human activity recognition for fight detection presents a double-edged sword. On the positive side, it offers the potential to bolster public safety through faster intervention, optimize resource allocation, and provide valuable data for prevention strategies. However, concerns regarding privacy, potential biases, and over-reliance on technology necessitate careful consideration and responsible implementation to ensure its benefits outweigh the potential drawbacks. This model can be linked to surveillance cameras to detect, alert, and prevent fights. Moreover, it can be used to monitor patients in hospitals as well as athletes during health training. This research represents a significant advancement in enhancing security, improving public safety, and optimizing surveillance efficiency through the automated detection of fights and aggressive behavior. By leveraging deep learning techniques, the developed model contributes to creating safer environments, supporting forensic analysis, and deterring potential aggressors. Overall, its implementation promises to have a transformative impact on security measures and community well-being. However, there are still challenges associated with violence detection using human action recognition, such as data availability and privacy concerns.

REFERENCES

[1] V. P. Chaubey, S. Sharma, A. Kumar, A. Malik, P. Malik, and I. Batra, "Activity Recognition in Video Frames for Enhancing Human-Robot Collaboration: A Machine Learning Perspective," in 2023 3rd International Conference on Advancement in Electronics & Communication Engineering (AECE), 2023, pp. 554–558.
[2] S. S. Sonit Singh and A. Saini, "Evaluation of Feature Extraction Techniques in Content Based Image Retrieval (CBIR) System," Int. J. Electron. Commun. Technol., vol. 4, no. SPL-2, pp. 81–83, 2013. [Online]. Available: http://iject.org/vol4/spl2/sonit.pdf
[3] M. Hanzl and S. Ledwon, "Analyses of human behaviour in public spaces," ISOCARP, Portland, Oregon, USA, 2017.
[4] S. Sharma, B. Sudharsan, S. Naraharisetti, V. Trehan, and K. Jayavel, "A fully integrated violence detection system using CNN and LSTM," Int. J. Electr. & Comput. Eng., vol. 11, no. 4, 2021.
[5] X. Yin, W. Shen, J. Samarabandu, and X. Wang, "Human activity detection based on multiple smart phone sensors and machine learning algorithms," in 2015 IEEE 19th International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2015, pp. 582–587.
[6] D. Thakur and S. Biswas, "Smartphone based human activity monitoring and recognition using ML and DL: a comprehensive survey," J. Ambient Intell. Humaniz. Comput., vol. 11, no. 11, pp. 5433–5444, 2020.
[7] C. Chatzaki, M. Pediaditis, G. Vavoulas, and M. Tsiknakis, "Human daily activity and fall recognition using a smartphone's acceleration sensor," in Information and Communication Technologies for Ageing Well and e-Health: Second International Conference, ICT4AWE 2016, Rome, Italy, April 21-22, 2016, Revised Selected Papers 2, 2017, pp. 100–118.
[8] H. E. Azzag, I. E. Zeroual, and A. Ladjailia, "Real-Time Human Action Recognition Using Deep Learning," Int. J. Appl. Evol. Comput., vol. 13, no. 2, pp. 1–10, 2022.
[9] S. Sharma and S. Charbathia, "Multimedia Technologies: An Integration of Precedent, Existing & Inevitable Systems," Int. J. Emerg. Res. Manag. & Technology, vol. 4, no. 12, pp. 70–75, 2015.
[10] T. Wang et al., "Abnormal event detection based on analysis of movement information of video sequence," Optik (Stuttg.), vol. 152, pp. 50–60, 2018.
[11] M. O. Gani et al., "A light weight smartphone based human activity recognition system with high accuracy," J. Netw. Comput. Appl., vol. 141, pp. 59–72, 2019.
[12] M. O. Gani, "A novel approach to complex human activity recognition," Marquette University, 2017.
[13] V. P. Chaubey, A. Sharma, T. Sharma, S. Sharma, and A. Kumar, "Link Prediction for Social Network Analysis Using Random Forest and XG-Boost Algorithm," in 2023 Second International Conference on Informatics (ICI), 2023, pp. 1–6.
[14] L. Bao and S. S. Intille, "Activity recognition from user-annotated acceleration data," in International Conference on Pervasive Computing, 2004, pp. 1–17.
[15] J. Yang, "Toward physical activity diary: motion recognition using simple acceleration features with mobile phones," in Proceedings of the 1st International Workshop on Interactive Multimedia for Consumer Electronics, 2009, pp. 1–10.
[16] S.-W. Lee and K. Mase, "Activity and location recognition using wearable sensors," IEEE Pervasive Comput., vol. 1, no. 3, pp. 24–32, 2002.
[17] A. Mannini and A. M. Sabatini, "Machine learning methods for classifying human physical activity from on-body accelerometers," Sensors, vol. 10, no. 2, pp. 1154–1175, 2010.
[18] J. Parkka, M. Ermes, P. Korpipaa, J. Mantyjarvi, J. Peltola, and I. Korhonen, "Activity classification using realistic data from wearable sensors," IEEE Trans. Inf. Technol. Biomed., vol. 10, no. 1, pp. 119–128, 2006.
[19] T. Gunasekar and P. Raghavendran, "Applications of the R-Transform for Advancing Cryptographic Security," in Driving Transformative Technology Trends With Cloud Computing, IGI Global, 2024, pp. 208–223.
[20] P. Raghavendran and T. Gunasekar, "Advancing Cryptographic Security With Kushare Transform Integration," in Driving Transformative Technology Trends With Cloud Computing, IGI Global, 2024, pp. 224–242.
[21] S. Gochhait, H. Patil, T. Hasarmani, V. Patin, and O. Maslova, "Automated Solar Plant using IoT Technology," in Proc. 2022 4th Int. Conf. Electrical, Control and Instrumentation Engineering (ICECIE), Nov. 2022, pp. 1–6.
[22] I. Paliwal and S. Gochhait, "Classification of Machine Learning and Power Requirements for HAI (Human Activity Identification)," in Proc. 2023 Int. Conf. Inventive Computation Technologies (ICICT), Apr. 2023, pp. 199–204.
