Video Based Fight Detection Using Deep Learning
Project Report
Submitted by:
Dawa (02190109), Kuenzang Lhaden (02190117), Sangay Thinley (02190126), Sonam Drukpa (02180319)
Project Guide: Ms. Karma Kelzang Eudon
Co-guide: Mr. Duk Bdr Powdyel
CERTIFICATE
This is to certify that the B.E. project titled “Video-Based Fight Detection using Deep Learning”, which is being submitted by Mr. Dawa (02190109), Ms. Kuenzang Lhaden (02190117), Mr. Sangay Thinley (02190126) and Mr. Sonam Drukpa (02180319), students of the final year of B.E. in Electronics and Communication Engineering, during the academic year 2019-2023, in partial fulfilment of the requirements for the award of “Bachelor of Engineering in Electronics and Communication Engineering”, is a record of the students' work carried out at the College of Science and Technology, Phuentsholing, under my supervision and guidance.
Acknowledgement

Firstly, we express our gratitude to the College of Science and Technology and the Royal University of Bhutan for giving us an incredibly valuable opportunity to gain hands-on experience through practical experiments and theoretical knowledge. We owe the accomplishment of our project's objectives to the unwavering guidance and support provided by our esteemed mentors. We would like to extend our gratitude to Madam Karma Kelzang Eudon, Lecturer in the Electronics and Communication Engineering Department (ECED), and Sir Duk Bdr. Powdyel, Assistant Lecturer in the same department. Their constant assistance and invaluable support throughout the project were instrumental in our success.
We would also like to extend our appreciation to the Electronics and Communication
Engineering Department, as well as the members of the review panel. Their guidance,
constructive feedback, and valuable suggestions played a significant role in shaping our
project and enabling its successful completion.
We extend our deep appreciation to Mr. Kuenzang Thinley, the project coordinator, for his consistent reminders, timely recommendations, and provision of all essential resources.
Finally, we express our gratitude and appreciation to the CST FabLab for their invaluable
assistance in fabricating the system case using 3D printing technology.
Group Members:
Dawa (02190109)
Kuenzang Lhaden (02190117)
Sangay Thinley (02190126)
Sonam Drukpa (02180319)
Abstract
The project “Video-Based Fight Detection using Deep Learning” aimed to address the
limitations of traditional surveillance systems in identifying and preventing violent
incidents in real-time. The current reliance on human operators to monitor multiple cameras
has proven to be inefficient and error-prone, resulting in the possibility of missing crucial
instances of violent behavior. This project proposed the use of deep learning techniques, specifically the Long-term Recurrent Convolutional Network (LRCN) model, to automate the detection of fights in video surveillance, providing an intelligent system that enhances public safety and security. By leveraging the LRCN model, which combines the power of convolutional neural networks (CNNs) for spatial analysis and recurrent neural networks (RNNs) for temporal modeling, the system can analyze video
footage and identify patterns of behavior associated with fighting, even in crowded or busy
environments. The project also included implementing an alert mechanism using a GSM
module to notify the relevant authorities upon detecting a fight. Additionally, an on-site alarm was incorporated to provide immediate audible alerts when a fight is detected. By
automating the detection process, this project can make a valuable contribution to public
safety efforts, enabling timely intervention and proactive measures to combat violence and
maintain security.
Nomenclature / Terminology
1. Deep Learning Model: A computational model that utilizes deep neural networks to learn patterns from data and make predictions.
2. LRCN (Long-term Recurrent Convolutional Network): A deep learning architecture that is a fusion of the Convolutional Neural Networks (CNNs) and the Long Short-Term Memory (LSTM) networks.
3. Fight Detection: The automatic identification of physical altercations or fights in video footage.
4. Video Classification: The task of assigning a video to a category, such as "fight" or "non-fight".
5. Convolutional Neural Network (CNN): A neural network architecture used for analyzing and extracting features from visual data, such as images or videos.
6. Prediction: The output or inference made by the deep learning model regarding the class of the input video.
7. Training: The process of optimizing the parameters and weights of the model using a labeled dataset.
8. Dataset: A collection of labeled examples used for training and evaluating the deep learning model.
9. Loss Function: A function that measures the difference between predicted outputs and the true label in the training process.
10. Activation Function: A function applied to the output of a layer in a neural network to make the model more flexible and better at learning. It adds curves and bends to the data, allowing the model to capture more complex patterns and relationships. This helps the neural network understand and process information in a way that is closer to how humans do.
11. Hyperparameters: Configurable settings and parameters that are not learned by the model during training, but are set by the user to control the learning process.
12. Epoch: One complete iteration over the entire training dataset during the training phase.
13. Batch Size: The number of training samples grouped together and processed at one time.
14. Overfitting: A situation in which the deep learning model performs well on the training data but poorly on unseen data.
15. Evaluation Metrics: The metrics employed to evaluate the model's performance, such as accuracy.
16. LSTM: Short for Long Short-Term Memory, a specific architecture within recurrent neural networks designed to capture long-term dependencies in sequential data.
17. Kernel: A small matrix used for convolutional operations in a CNN.
Abbreviations – Epithets
Abbreviation Description
AT Attention
DL Deep Learning
FC Fully Connected
IP Internet Protocol
ML Machine Learning
OS Operating System
Pi Raspberry Pi
List of Figures

List of Tables
Contents
Acknowledgement i
Abstract ii
Nomenclature / Terminology iii
Abbreviations – Epithets v
List of Figures vi
List of Tables vii
CHAPTER 1: INTRODUCTION 1
1.1 Introduction 1
1.2 Problem Statement: 1
1.3 Motivation and Need of the Project 2
1.4 Aim 2
1.5 Project Objectives 2
CHAPTER 2: LITERATURE SURVEY 3
2.1 Introduction 3
2.1.1 Related Work: 3
2.2 Artificial Intelligence (AI) 5
2.2.1 Deep Learning (DL) 5
2.2.2 Convolutional Neural Network (CNN) 6
2.2.3 Layers in CNN 7
2.2.4 Recurrent Neural Networks (RNN) 10
2.2.5 LSTM 11
CHAPTER 3: PROJECT METHODOLOGY 12
3.1 Introduction 12
3.2 Project Methodology 13
CHAPTER 4: DESIGN OF THE PROPOSED SYSTEM 15
4.1 Introduction 15
4.2 System Architecture 15
4.2.1 Long-term Recurrent Convolutional Network-LRCN 16
4.3 Hardware Requirements 17
4.3.1 Raspberry Pi 4B 18
4.3.2 Pi Camera 19
4.3.3 GSM Module 19
4.4 Software Requirement 20
4.4 Development Process and Implementation Details 22
4.4.1 System Flowchart 22
4.4.2 Development Process 23
4.4.3 Implementation 25
4.5 Case Design 27
CHAPTER 5: RESULTS AND ANALYSIS 29
5.1 Introduction 29
5.2 Statistics of Datasets 29
5.3 Comparative Study of Models 30
5.4 Model Evaluation 31
5.5 Hyper-parameter Tuning Analysis 31
5.6 System Testing Result 33
5.7 System Performance Analysis 33
5.8 System Performance and Reliability for the GSM communication 35
5.9 Cost Analysis 36
CHAPTER 6: CONCLUSION AND FUTURE WORK 38
6.1 Conclusion 38
6.2 Future Work and Recommendation 38
REFERENCES 40
CHAPTER 1: INTRODUCTION
1.1 Introduction
Closed Circuit Television (CCTV) is mainly utilized for observation and monitoring in order to combat crime. Its primary objective is to reduce criminal activity and social misconduct while also enhancing security. A CCTV system consists of remotely mounted cameras operating without human presence and an operator. The cameras record video footage and send it to a central monitoring station, where the operator watches a television screen to detect any suspicious activities or gather evidence. Nevertheless, the operator's capacity to detect suspicious behavior is restricted by the attention they can dedicate to each video feed displayed on the screen. Given the limited ratio of operators to screens, it is impractical for a CCTV operator to consistently and fully focus on every video feed, thereby increasing the risk of overlooking abnormal activities.
After considering video processing as a potential solution to the problem, it was determined that utilizing deep learning for video classification and recognition would be a more effective approach. Video-based fight detection using deep
learning is an emerging technology that aims to enhance security surveillance by detecting
and alerting security personnel to potential altercations or violent behavior. Traditional
surveillance systems rely on human operators to monitor video feeds, which can be tedious
and error-prone, especially in busy environments. However, with deep learning algorithms
and computer vision techniques, it is now possible to automate the detection of violent
behavior in real-time, providing a more efficient and reliable security solution. This
technology utilizes complex neural networks to analyze video footage and detect patterns
of behavior associated with fighting. By integrating this technology with existing
surveillance systems, security personnel can quickly identify potential threats and take
action to prevent violent incidents from occurring.
In general, the implementation of deep learning for video-based fight detection shows great
promise as a technology that can greatly enhance public safety and security. It has the
potential to become an indispensable tool in combating crime and violence.
1.2 Problem Statement
Advancements in technology do not ensure the prevention of crime around the world. The rise in physical altercations and violent incidents in public spaces has become a major concern for law enforcement agencies and public safety organizations worldwide.
Traditional surveillance systems, which rely on human operators to monitor multiple
cameras, have proven to be inefficient in identifying and alerting authorities to potential
fights or acts of violence in real-time. According to the Statistical Yearbook by the Royal Bhutan Police, battery accounted for 81% of all crimes committed against persons in 2020 and 78% in 2022. In 2022, 4,327 persons were arrested for various crimes; of these, 1,038 were arrested for battery. Similarly, 995 individuals in 2020 and 981 in 2019 were arrested for battery (RBP STATISTICAL, 2022). Hence, in
order to enhance the effectiveness of CCTV monitoring and surveillance, it is necessary to
automate the process of detecting suspicious activity in video surveillance. To that end, this project aims at the automatic, video-based detection of fights using deep learning technology.
1.4 Aim
To design a fight detection and alert system using deep learning.
CHAPTER 2: LITERATURE SURVEY
2.1 Introduction
Given the increasing emphasis on safety and security, the exploration of intelligent systems
to identify violent behavior has become a critical domain of investigation. In this review of
relevant literature, we will examine the latest progress in utilizing deep learning techniques
for fight detection in security surveillance.
Several researchers have studied the problem of fight detection using different approaches.
Some of the notable works in this area are discussed below:
One suggested approach involved employing quadcopter surveillance and video streaming to identify anomalies within received video streams using deep learning models. The researchers made adjustments to the widely recognized Faster R-CNN algorithm to streamline the initial feature extraction process and facilitate rapid learning. They assessed the performance of four distinct CNNs, namely GoogleNet, ResNet-18, ResNet-50, and SqueezeNet, for detecting relevant objects in surveillance images. The Faster R-CNN algorithm based on ResNet-50 attained the highest average accuracy, establishing itself as a solution for threat detection. The system achieved an average accuracy of 79% across all categories.
“Human Violence Detection Using Deep Learning Techniques” by Arun Akash et al.
(2022)
The detection of moving objects from CCTV footage was regarded as a highly impactful computer vision task. This research employed deep learning techniques as a computer
vision methodology to forecast and identify actions and attributes from videos. The study
utilized the Inception-v3 and Yolo-v5 models to identify instances of violence, count the
number of individuals involved, and detect the presence of weapons in a specific situation.
These deep learning models were employed to create a video detection system as part of
the research. The results of the study indicated that the proposed model achieved a 74%
accuracy rate.
The study compared the performance of the MobileNet model proposed in that research with the AlexNet, VGG-16, and GoogleNet Convolutional Neural Network (CNN) models. Simulations were conducted in Python, and
accuracy and loss values were assessed for each model. AlexNet demonstrated an accuracy
of 88.99% and a loss of 2.480%. VGG-16 achieved an accuracy of 96.49% with a loss of
0.1669%, while GoogleNet achieved 94.99% accuracy and a loss of 2.92416%. In contrast,
the proposed MobileNet model achieved an accuracy of 96.66% and a loss of 0.1329%.
When applied to the hockey fight dataset, the MobileNet model proposed in this study
exhibited exceptional performance in terms of accuracy, loss, and computational time.
“Real-Time Violent Action Recognition Using Key Frames Extraction and Deep
Learning” by Ahmed et al. (2021)
The objective of the research was to investigate the application of convolutional neural
networks (CNNs) and Inception V4 for the detection and recognition of violence in video
data. The proposed framework involved extracting key frames to eliminate redundant
consecutive frames, thereby reducing the training data size and computational
requirements. For feature selection and classification, a sequential CNN with a single
kernel size was utilized, while the Inception V4 CNN employed multiple kernels across
different layers of its architecture. The study performed empirical analysis using four
standard datasets that encompassed various activities. The results showcased that the
proposed approach achieved a 98% accuracy, significantly reduced computational costs,
and surpassed existing techniques in violence detection and recognition.
This section encompasses the concepts, applications, and various related terms associated
with deep learning and artificial intelligence (AI).
2.2 Artificial Intelligence (AI)
John McCarthy first put forward the concept of Artificial Intelligence (AI) in 1955. AI
is described as the science and engineering of creating intelligent machines. AI is known
for its ability to provide reliability, cost-effectiveness, and solutions to complex problems
while also preventing data loss. In today's world, artificial intelligence (AI) is being used
in many different areas, such as business and engineering. One useful technique in AI is reinforcement learning, which involves learning by trial and error in real-life situations to see what works and what does not. This helps make AI applications more dependable and reliable.
Figure 2.1 shows the relationship between deep learning and machine learning. It shows
how these two areas of AI are related to each other.
2.2.1 Deep Learning (DL)
Deep learning is a subset of machine learning that focuses on training artificial neural networks (ANNs) to learn features and perform tasks directly from data. These ANNs are designed to imitate the way our brains work (LeCun et al., 2015). An artificial neural network (ANN) is composed of interconnected nodes known as neurons, which analyze and acquire knowledge from input data.
Figure 2.2: Fully connected artificial neural network
Figure 2.2 shows a fully connected ANN. A fully connected deep neural network comprises an input layer and consecutive hidden layers. In this architecture, each
neuron in the hidden layers receives input from either the preceding layer or the input layer.
The output from one neuron in a layer serves as input for the neurons in the subsequent
layer, continuing this pattern until the final layer produces the network's output. By
applying a series of nonlinear transformations, the layers of a neural network modify the
input data, enabling the network to comprehend intricate representations of the data.
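As a small illustration of the fully connected architecture described above, the following sketch builds a tiny ANN in Keras (one of the libraries listed later in this report); the layer sizes and the 16-feature input are placeholder values chosen only for the example.

```python
# Minimal sketch of a fully connected ANN in Keras (placeholder sizes, not project code).
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(32, activation="relu", input_shape=(16,)),  # hidden layer 1: nonlinear transformation
    Dense(32, activation="relu"),                     # hidden layer 2
    Dense(2, activation="softmax"),                   # output layer, e.g. two classes
])
model.summary()  # prints the layer-by-layer structure and parameter counts
```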
2.2.2 Convolutional Neural Network (CNN)
A widely used type of DNN is the convolutional neural network (CNN), which can process data with a known grid-like topology and is particularly popular for computer vision tasks such as object classification. Unlike other neural networks, CNNs do not require manual feature extraction, as they automate the feature extraction process (Goodfellow et al., 2016).
A convolutional neural network (CNN) comprises various layers, including convolution, pooling, and fully connected layers. The network implements a backpropagation algorithm to learn spatial hierarchies of features autonomously and adjust to new data (Patil & Rane, 2021). The system comprises three primary components: an input layer, a feature extraction module, and a classification module. The feature extraction component is composed of multiple layers that perform operations such as convolution, pooling, and the ReLU function. These operations help identify and differentiate various features within the input images during the network's training process. The latter stages of the network include the fully connected layer and the output layer, which play a crucial role in classifying the input images. The CNN architecture is shown in Figure 2.3.
2.2.3 Layers in CNN

Convolutional Layer
In a Convolutional Neural Network (CNN), the initial layer is responsible for extracting
diverse features from input images. This is accomplished through a mathematical operation
called convolution, wherein the input image is convolved with a filter of a specific size
denoted as KxK. By sliding the filter across the input image, the dot product is computed
between the filter and corresponding portions of the image, based on the filter size (KxK).
The resulting output is referred to as a feature map, which contains information about
various aspects of the image, such as its edges and corners. Afterwards, this feature map is
passed on to subsequent layers to learn additional features of the input image.
Once the convolution operation is applied to the input, the resulting output is transmitted
to the subsequent layer in the Convolutional Neural Network (CNN). The convolutional
layers in CNN have a vital role in preserving the spatial relationship between the pixels in
the input image, ensuring the integrity of the image's spatial information as it progresses
through the network.
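The convolution operation described above can be illustrated with a minimal Keras sketch; the 3x3 kernel, 16 filters, and 64x64 input resolution here are assumptions for illustration rather than the project's exact settings.

```python
# Sketch of a single convolutional layer: 16 filters of size 3x3 slid over a 64x64 RGB image.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D

model = Sequential([
    Conv2D(filters=16, kernel_size=(3, 3), activation="relu",
           padding="same", input_shape=(64, 64, 3)),
])
print(model.output_shape)  # (None, 64, 64, 16): one feature map per filter
```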
Pooling Layer
Following the Convolutional Layer is a Pooling Layer that serves the purpose of down
sampling the feature map after convolution and decreasing the computational requirements.
The reduction in connections between the layers is accomplished to minimize the
complexity, and the pooling layer performs its operations on each feature map
independently. Max Pooling selects the largest element within a specific region of the
feature map, while Average Pooling computes the average of the elements in that region.
Likewise, Sum Pooling calculates the total sum of the elements within the defined region.
The Pooling Layer serves as a connecting link between the Convolutional Layer and the
FC (Fully Connected) Layer.
Figure 2.5: Illustration of max pooling
Figure 2.5 illustrates max pooling. Max pooling is a method used in deep learning and convolutional neural networks (CNNs) to reduce the data size. It works by dividing
the input into smaller sections and choosing the biggest value from each section. By picking
the biggest value, max pooling keeps the most important features while making the data
smaller. This helps make the computations faster, capture strong characteristics, and ensure
that the network can recognize objects regardless of their position in an image.
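A short sketch of the max pooling operation, assuming a dummy feature map: with a 2x2 pool, each spatial dimension of the feature map is halved.

```python
# Max pooling sketch: a 2x2 pool halves each spatial dimension of a dummy feature map.
import numpy as np
from tensorflow.keras.layers import MaxPooling2D

feature_map = np.random.rand(1, 64, 64, 16).astype("float32")  # dummy CNN output
pooled = MaxPooling2D(pool_size=(2, 2))(feature_map)
print(pooled.shape)  # (1, 32, 32, 16)
```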
Fully Connected (FC) Layer
The FC layer in a CNN is made up of neurons, weights, and biases and functions as a
connection between two distinct layers. In a typical CNN design, the FC layers are
positioned towards the end, just before the output layer. The input image, after going
through preceding layers, is flattened and transmitted to the FC layer. Subsequently, it
traverses a series of FC layers where mathematical operations are commonly performed.
This stage marks the initiation of the classification process. Connecting two fully connected
layers is often preferred over a single connected layer since it tends to yield better results,
reducing the reliance on human supervision in CNNs.
Dropout
Connecting all the features to the FC layer can lead to overfitting in the training dataset,
where the model performs well on the training data but poorly on unseen data. To mitigate
this issue, a dropout layer can be utilized. During training, the dropout layer randomly
excludes certain neurons from the neural network, effectively reducing the model's size.
For example, if the dropout value is set at 0.3, 30% of the nodes are randomly eliminated
from the neural network. This technique improves the performance of the machine learning
model by streamlining the network and preventing overfitting.
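The dropout behaviour described above can be seen in a small sketch; the 0.3 rate matches the example in the text, while the input values are arbitrary.

```python
# Dropout sketch: with rate=0.3, roughly 30% of activations are zeroed during training
# (the remaining values are scaled up by 1/(1-rate)); at inference nothing is dropped.
import numpy as np
from tensorflow.keras.layers import Dropout

layer = Dropout(rate=0.3)
activations = np.ones((1, 10), dtype="float32")
print(layer(activations, training=True).numpy())   # some entries become 0
print(layer(activations, training=False).numpy())  # unchanged at inference time
```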
Activation Functions
To make a CNN model work well, it is necessary to choose the right activation function.
An activation function helps the model learn complicated relationships between different
parts of the network. It decides which information should be used to make predictions and
which should not. There are different types of activation functions, and each has its own
purpose. Some are better for binary classification, while others are better for multi-class
classification. The activation function uses math to decide which information is important
for making predictions.
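A brief sketch of common activation functions, using arbitrary example values: ReLU is typically used inside the network, sigmoid for binary outputs, and softmax for multi-class outputs.

```python
# Sketch of common activation functions applied to arbitrary example values.
import tensorflow as tf

x = tf.constant([-2.0, 0.0, 3.0])
print(tf.keras.activations.relu(x).numpy())                         # [0. 0. 3.]: negatives clipped
print(tf.keras.activations.sigmoid(x).numpy())                      # squashed into (0, 1)
print(tf.keras.activations.softmax(tf.reshape(x, (1, 3))).numpy())  # probabilities summing to 1
```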
2.2.4 Recurrent Neural Networks (RNN)
Recurrent Neural Networks (RNNs) are a kind of neural network best suited for working with data that comes in a sequence, such as natural language or time series
data. The main idea behind RNNs is that they have the ability to store and recall past inputs,
which they can then use to make predictions. RNNs use previous outputs as new inputs and
have hidden states that help them remember previous information.
Figure 2.6 represents an RNN. An RNN has a repeating module that can process a sequence
of inputs one by one. At each step, the module takes in an input and a hidden state from the
previous step, and it calculates a new hidden state and an output. The hidden state keeps
track of information from previous inputs, allowing the network to remember patterns in
the sequence. This process repeats for each step in the sequence, producing a sequence of
outputs. Different types of RNNs have additional mechanisms to help them remember long-
term dependencies and avoid problems with training.
The memory of an RNN, also known as the hidden state, maintains all the information that
has been processed up to a certain time step.
2.2.5 LSTM
LSTM is an abbreviation for Long Short-Term Memory. It is a type of neural network that helps computers understand sequences of information, like sentences or music. It is special
because it can remember important things from the past and decide which things to keep or
forget. This helps the computer make better predictions about what comes next in the
sequence. Think of it like a person trying to remember a long story - they only remember
the important parts and forget the less important details, which helps them understand the
story better. Similarly, LSTMs help computers understand and remember important details
in sequences, which is useful for many different applications.
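A minimal LSTM sketch in Keras: the layer reads a sequence of feature vectors and carries an internal memory across time steps. The sequence length of 20 and feature size of 8 are placeholder values.

```python
# LSTM sketch: one summary vector is produced for a whole sequence of feature vectors.
import numpy as np
from tensorflow.keras.layers import LSTM

sequence = np.random.rand(1, 20, 8).astype("float32")  # (batch, time steps, features)
output = LSTM(units=32, return_sequences=False)(sequence)
print(output.shape)  # (1, 32): the memory-based summary of the 20 time steps
```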
CHAPTER 3: PROJECT METHODOLOGY
3.1 Introduction
The chapter includes an in-depth explanation of the methodology chosen for the project. It
introduces the specific approach or framework that was used to manage the project from
start to finish. The project methodology is depicted in Figure 3.1 and includes the following
steps: problem statement, literature review on machine learning and video classification,
design of block diagram and flowchart, model training in Google Colab, testing of the
trained network, deployment of the system, hardware implementation and analysis, and
documentation.
Figure 3.1: Project methodology (Problem Statement → Literature Review → System Design → Model Training → Hardware Implementation → Documentation)
3.2 Project Methodology
The methodology for the project is divided into several stages, each of which includes various activities. The project began by identifying and clearly defining a problem
statement, which served as the foundation for the project's focus and objectives. This
problem statement precisely outlined the specific issue or challenge that the project aimed
to address, providing a clear direction for the project's activities.
To gain a comprehensive understanding of the field and relevant techniques, a thorough
literature review was conducted. This review focused specifically on machine learning,
deep learning, and convolutional neural networks, with a particular emphasis on various
video classification papers. Through this literature review, the project team gained valuable
insights, acquired knowledge about existing approaches, and identified gaps or areas where
the project could contribute.
Next, the project moved into the design phase. During this phase, a block diagram and
flowchart were created to visually outline the structure and sequence of operations within
the system. This includes determining the arrangement and connections between the
Raspberry Pi camera, the GSM module, and the video classification model, outlining how
they interact with each other. These visual representations played a vital role in conceptualizing and planning the architecture of the project, helping the team understand the
various components and their interactions.
Model training encompassed the creation and refinement of a video classification model designed to identify fight scenes. This process entailed choosing a suitable deep learning model, such as a Convolutional Neural Network or a Recurrent Neural Network, developing the model's structure, curating the dataset for training, and fine-tuning the model's parameters through iterative optimization. The model was then trained using labeled videos containing both fight and non-fight scenes from the dataset. Google Colab, a popular platform for machine learning development, was utilized for implementing and training the model.
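The Colab training step can be illustrated with the hedged sketch below, which uses randomly generated dummy clips and a deliberately tiny stand-in model; the actual LRCN architecture, dataset, and hyper-parameter values used in the project are described in Chapters 4 and 5.

```python
# Hedged illustration of the training step with dummy data and a tiny stand-in model.
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.utils import to_categorical

X_train = np.random.rand(8, 20, 64, 64, 3).astype("float32")   # 8 dummy clips of 20 frames each
y_train = to_categorical(np.random.randint(0, 2, size=8), 2)   # dummy fight / non-fight labels

model = Sequential([
    Flatten(input_shape=(20, 64, 64, 3)),        # stand-in for the real LRCN layers
    Dense(2, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=2, batch_size=4, validation_split=0.25, shuffle=True)
model.save("fight_detection_model.h5")           # saved in .h5 format, as noted in Chapter 4
```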
Following the training phase, the output or the trained network underwent thorough testing
to assess its accuracy, efficiency, and overall performance. The testing phase involved
evaluating the model against various test datasets or real-world scenarios to measure its
effectiveness and reliability in different contexts.
To ensure the practicality and viability of the solution, the project included deploying the
system in a real-world setting. This deployment allowed the team to observe and validate
the system's functionality and performance in practical scenarios, ensuring that it aligned
with the intended objectives and requirements.
CHAPTER 4: DESIGN OF THE PROPOSED SYSTEM
4.1 Introduction
This chapter presents the system architecture and the components used in the project. In the subsequent sections, the detailed implementation and integration of each component are discussed, including software and hardware requirements, as well as the challenges and considerations encountered during the development process.

4.2 System Architecture

The deep learning project focused on fight and non-fight video detection and utilized a
GSM module, Pi camera, Raspberry Pi, and speaker for alerts. The system implemented
the Long-term Recurrent Convolutional Network (LRCN) model to successfully train and
classify videos into two categories: "fight" and "non-fight". The Raspberry Pi serves as the
core processing unit, while the Pi camera captures high-resolution video footage. The GSM
module enables communication for alerts, and the speaker provides audible notifications.
4.2.1 Long-term Recurrent Convolutional Network-LRCN
Jeff Donahue and colleagues introduced the Long-term Recurrent Convolutional Network,
or LRCN, in 2016 (Donahue et al., 2017). LRCN is particularly useful for tasks that require
large-scale visual understanding, such as action recognition, image captioning, and video
classification. The LRCN combines both the Convolutional Neural Networks (CNNs) and
the Long Short-Term Memory (LSTM) networks. Its purpose is to analyze video frames
and capture both the spatial and temporal information, which makes it ideal for analyzing
videos.
In the LRCN architecture, the CNN component acts as a visual feature extractor. It applies
convolutional layers to the input video frames, detecting visual patterns and features.
Techniques like batch normalization and max pooling are often utilized to enhance
performance and reduce overfitting. The CNN produces a sequence of high-level visual
features as its output. The LSTM component takes the sequence of visual features generated
by the CNN and analyzes them to capture the temporal dynamics within the video. LSTM
cells have a memory state that enables them to retain information over time, making them capable of learning long-term dependencies. The LSTM processes the sequence of visual
features, updating its memory state, and generating an output at each time step.
To obtain a final prediction or classification, the LRCN architecture typically incorporates
a fully connected layer on top of the LSTM. This layer maps the LSTM outputs to the
desired output classes, such as “fight” and “non-fight”, utilizing appropriate activation
functions. The LRCN architecture combines the strengths of CNNs in extracting
meaningful visual features and the sequential modeling capabilities of LSTMs. This
combination allows the model to comprehend both the static visual content and the
temporal evolution of videos, making it suitable for various tasks like action recognition,
video captioning, and video classification.
For this project, the LRCN (Long-term Recurrent Convolutional Network) model was found
to be highly suitable for the task of classifying videos into two categories: "fight" and "non-
fight." The LRCN architecture provided a robust framework for effectively analyzing both
visual and temporal information in videos. The CNN component of the LRCN model
extracted crucial visual features from video frames, capturing significant visual cues. On
the other hand, the LSTM component effectively modeled the temporal dependencies
between these frames, facilitating the recognition of patterns specific to fight or non-fight
sequences.
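The following is a minimal sketch of an LRCN-style model in Keras, consistent with the description above: a small CNN applied to every frame through TimeDistributed wrappers, followed by an LSTM and a softmax classifier. The layer sizes, the 20-frame sequence length, and the 64x64 frame size are assumptions for illustration, not necessarily the exact architecture that was deployed.

```python
# Sketch of an LRCN-style network in Keras (illustrative sizes, not the deployed model).
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (TimeDistributed, Conv2D, MaxPooling2D,
                                     Dropout, Flatten, LSTM, Dense)

SEQUENCE_LENGTH, HEIGHT, WIDTH = 20, 64, 64      # assumed clip length and frame size
CLASSES = ["non-fight", "fight"]

model = Sequential([
    # CNN part: extract spatial features from each frame independently
    TimeDistributed(Conv2D(16, (3, 3), padding="same", activation="relu"),
                    input_shape=(SEQUENCE_LENGTH, HEIGHT, WIDTH, 3)),
    TimeDistributed(MaxPooling2D((4, 4))),
    TimeDistributed(Dropout(0.25)),
    TimeDistributed(Conv2D(32, (3, 3), padding="same", activation="relu")),
    TimeDistributed(MaxPooling2D((4, 4))),
    TimeDistributed(Flatten()),
    # LSTM part: model the temporal order of the per-frame feature vectors
    LSTM(32),
    # Fully connected classifier: map the LSTM output to the two classes
    Dense(len(CLASSES), activation="softmax"),
])
model.summary()
```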
4.3 Hardware Requirements

Materials Specifications
4.3.1 Raspberry Pi 4B
The Raspberry Pi 4 Model B (Pi4B) stands as the latest iteration in the Raspberry Pi series,
offering significant improvements in performance, memory capacity, and connectivity.
Released by the Raspberry Pi Foundation in June 2019, The Pi4B is a cost-effective,
compact, and highly capable single-board computer that has gained significant popularity
for its power and versatility.
With its GPIO pins, Raspberry Pi 4B enables hardware interfacing and expansion, making
it ideal for IoT projects and prototyping. It supports various operating systems, with
Raspberry Pi OS (formerly Raspbian) being the recommended and officially supported
option. Additionally, it is compatible with popular Linux distributions such as Ubuntu and
third-party operating systems tailored for specific use cases.
To supply power to the Pi4B, a reliable USB-C power supply capable of delivering 5V at
3A is required. Nevertheless, if the USB devices connected to the Pi4B consume less than
500mA, a 5V, 2.5A power supply is adequate.
4.3.2 Pi Camera
The Pi Camera module is designed specifically for Raspberry Pi boards, including the
Raspberry Pi 4B, to provide a dedicated camera functionality. It provides a compact and
user-friendly solution for capturing images and videos.
For this project, the Pi camera is used to capture video footage for analysis and
classification. It is connected to a Raspberry Pi board and configured to capture live video
streams or record video clips. This involves initializing the camera module, setting up
parameters such as resolution and frame rate, and starting the video capture.
Specifications:
Model : 5 Megapixel OmniVision OV5647 Camera Module
Picture resolution : 2592 by 1944 Pixels
Size : 25mm by 23mm by 8mm
Weight : 3 grams
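Capturing frames from the camera for the model can be sketched with OpenCV as below; this assumes the camera is exposed as a standard video device (index 0), and the resolution and frame-size values are illustrative.

```python
# Sketch of grabbing and preprocessing a frame from the camera with OpenCV.
import cv2

capture = cv2.VideoCapture(0)                    # assumes the camera appears as device 0
capture.set(cv2.CAP_PROP_FRAME_WIDTH, 640)       # requested capture width (illustrative)
capture.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)      # requested capture height (illustrative)

ok, frame = capture.read()                       # one BGR frame from the live stream
if ok:
    frame = cv2.resize(frame, (64, 64)) / 255.0  # resize and normalize for the model
capture.release()
```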
4.3.3 GSM Module
A GSM module is a compact electronic device that incorporates GSM functionality into a
device or system, enabling it to connect to GSM networks and facilitate wireless data
transfer. It comprises essential components such as a GSM modem, an antenna for signal
reception and transmission, a SIM card slot for authentication and identification, and
interface circuitry. The primary purpose of GSM modules is to establish connectivity and
enable communication over GSM networks, allowing devices to send and receive data,
make voice calls, and exchange SMS messages. These modules are known for their
lightweight design, user-friendly operation, and low power consumption.
The SIM900 GSM module with an SMA antenna has been chosen for use. This module
enables GSM/GPRS communication across four frequency bands: 850 MHz, 900 MHz,
1800 MHz, and 1900 MHz. It offers reliable voice call support, SMS messaging
capabilities, TCP/IP connectivity, and an AT command interface for easy control and
configuration. The SMA antenna connector allows for improved signal reception, making
it suitable for applications requiring robust GSM communication in various locations.
In this project, the primary application of the GSM module is to enable mobile communication, facilitating voice calls and text messages (SMS) to mobile phones and smartphones. It plays a vital role in establishing connectivity, ensuring continuous communication between the system and the relevant authorities during emergencies.
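Sending an alert SMS through the SIM900 module can be sketched with standard GSM AT commands over a serial link (using pyserial); the serial port name, recipient number, and message text below are placeholders, and the exact commands and timings used in the project may differ.

```python
# Hedged sketch: sending an alert SMS with standard GSM AT commands over a serial link.
import time
import serial  # pyserial

gsm = serial.Serial("/dev/ttyS0", baudrate=9600, timeout=1)  # assumed UART port for the SIM900

def send_command(command, delay=1.0):
    """Write one AT command (CR/LF terminated) and return whatever the module replies."""
    gsm.write((command + "\r\n").encode())
    time.sleep(delay)
    return gsm.read_all().decode(errors="ignore")

send_command("AT")                        # check that the module responds
send_command("AT+CMGF=1")                 # switch the module to SMS text mode
send_command('AT+CMGS="+97512345678"')    # placeholder recipient number
gsm.write(b"Fight detected at the monitored location.\x1a")  # message body, Ctrl+Z terminates it
time.sleep(3)                             # give the module time to transmit
gsm.close()
```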
The software packages required for this project are listed in Table 4.2.

Table 4.2: Software requirements
No.  Software      Version
2    VNC Viewer    7.1.0
3    PuTTY         0.76
5    Fritzing      0.9.10
6    Python        3.7.9
7    TensorFlow    2.11
8    NumPy         1.23.5
9    Keras         2.12.0
10   OpenCV        4.7.0
11   AT commands   2.2.0
13   Thonny        3.2
14   Minicom       3.0
Model training was done on Google Colab since it provides the use of free GPU and
preinstalled libraries. VNC stands for Virtual Network Computing; the VNC Viewer is used to enable the sharing of a remote desktop over a network connection. When both a laptop and
Pi4B are connected to the same network, the laptop can be utilized to control the Pi4B.
PuTTY is a Windows-based software that serves as an implementation of SSH and Telnet
protocols. It is employed to establish a connection between a laptop and Pi4B using SSH,
enabling communication and remote access between the two devices.
The Raspberry Pi Imager is a software tool that allows users to easily install operating
systems on Raspberry Pi devices. It provides a user-friendly interface to select and write
different operating system images onto an SD card or other storage media. This enables
users to quickly set up their Raspberry Pi with the desired operating system without the
need for complex manual installation procedures. The Fritzing app is a software tool that
assists in the design and documentation of electronic circuits. OpenCV is a programming
function library focused on real-time computer vision and is utilized for image processing.
TensorFlow is a free, open-source library used for numerical computation and large-scale machine learning tasks. Additionally, it can be used to train and run deep neural
networks (DNNs). Python is the language used in TensorFlow due to its simplicity in
learning and implementation processes. NumPy, another Python library, is utilized for
working with arrays. Keras is a Python-based open-source software library used to build
artificial neural networks, acting as a bridge between users and the TensorFlow library.
Thonny, on the other hand, is an integrated development environment for Python
programming.
Raspbian Bullseye was selected as the operating system. Raspbian Bullseye is an operating system specifically built for Raspberry Pi devices, utilizing a 64-bit architecture. It is an upgraded version of the Raspbian OS, tailored to provide improved performance and compatibility with newer Raspberry Pi models.
The AT command set is a set of commands utilized for managing and establishing
communication with devices that are compatible with AT commands. Minicom, meanwhile, is terminal emulation software that enables users to interact with devices over
a serial connection, such as modems, routers, or embedded systems.
4.4.1 System Flowchart

The flowchart guided the development and implementation of the project. It enabled efficient decision-making and helped streamline the implementation of the various components, such as the Pi camera, Raspberry Pi, and GSM module. The main steps are described below, and a minimal code sketch of the resulting detection loop is given after the list.
Real-Time Video Input: The flowchart begins with the Pi camera capturing real-time
videos. This ensures that the system is continuously monitoring the surroundings for
any potential fight incidents.
Video Processing: The Raspberry Pi then processes the captured videos. Using
appropriate algorithms, the system analyzes the video frames to determine whether they
contain a fight or not.
Fight Detection: Based on the video processing results, the system identifies whether
the video depicts a fight or a non-fight scenario. If it is determined to be a fight video,
the flow moves to the next step.
GSM Activation: Upon detecting a fight video, the flowchart triggers the activation of
the GSM module. The GSM module allows the Raspberry Pi to communicate via
cellular networks. This activation enables the system to send notifications or alerts
regarding the detected fight, providing a means to inform relevant authorities about the
incident.
Speaker Activation: Along with the GSM activation, the flowchart includes the
activation of a speaker. This can be utilized to emit an audible alert or warning in the
vicinity, notifying nearby individuals or authorities about the ongoing fight and
possibly deterring further escalation.
Continuous Monitoring: If a fight is not detected in the current video, the flow returns
to the beginning, and the camera continues capturing videos for further analysis. This
ensures that the system remains vigilant and continues to monitor the environment until
a fight video is detected.
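The flow described above could be sketched roughly as follows; the model file name is an assumption, and send_gsm_alert() and play_alarm() are hypothetical stand-ins for the project's GSM and speaker routines.

```python
# Rough sketch of the detection loop: capture, preprocess, classify, alert.
from collections import deque
import cv2
import numpy as np
from tensorflow.keras.models import load_model

SEQUENCE_LENGTH, SIZE = 20, 64
CLASSES = ["non-fight", "fight"]

def send_gsm_alert():                       # hypothetical stand-in for the GSM routine
    print("ALERT: notifying authorities via GSM")

def play_alarm():                           # hypothetical stand-in for the speaker routine
    print("ALERT: sounding local alarm")

model = load_model("fight_detection_lrcn.h5")   # assumed name of the trained model file
frames = deque(maxlen=SEQUENCE_LENGTH)          # rolling window of the most recent frames
capture = cv2.VideoCapture(0)

while True:
    ok, frame = capture.read()
    if not ok:
        break
    frames.append(cv2.resize(frame, (SIZE, SIZE)) / 255.0)
    if len(frames) == SEQUENCE_LENGTH:
        probabilities = model.predict(np.expand_dims(np.array(frames), axis=0))[0]
        if CLASSES[int(np.argmax(probabilities))] == "fight":
            send_gsm_alert()
            play_alarm()
```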
4.4.2 Development Process
The development process began with collecting and labeling a dataset of fight and non-
fight videos. Colab, a cloud-based Jupyter notebook environment, was used for training the
deep learning model due to its computational resources and pre-installed libraries. Once
trained, the model was deployed and executed on the Raspberry Pi.
After conducting thorough research on existing methods and technologies, these were the
key aspects of development:
Data Collection: Gathered a diverse dataset of videos containing both fight and non-
fight scenarios. This dataset served as the foundation for training and evaluating the detection model. A collection of 900 videos was obtained, with 450 videos in each
category: fight and non-fight.
Model Training: To train video datasets using LRCN in Colab, we set up our
environment, installed necessary libraries, and preprocessed the data. Next, we defined
the model architecture, created data generators, and compiled and trained the model.
After training, we evaluated the model's performance and fine-tuned it if necessary to
improve results.
Model Evaluation: This process included assessing the performances of the trained
model using evaluation metrics like accuracy. Next, the validation stage involves
testing the model's capability to correctly detect videos that depict fights and non-fights.
The following constraints were encountered during development:
Limited GPU resources: While Google Colab provides access to GPUs, the allocated resources are limited. Training a video dataset using LRCN is computationally intensive and time-consuming.
Limited computational power: Running complex deep learning models like LRCN was
resource-intensive, which led to slower performance.
Limited dataset availability: Developing a robust deep learning model for our project
required a diverse and sufficiently large dataset.
Model optimization for real-time processing: Real-time processing of video streams
from a Pi Camera requires careful optimization of the LRCN model to efficiently handle
the incoming video frames in real-time.
Apart from the constraints mentioned above, there were additional challenges to consider,
such as managing dependencies, handling connectivity issues, addressing interruptions in
Colab sessions, ensuring proper version control, and dealing with the labor-intensive task
of labeling and annotation.
To overcome these issues, we meticulously fine-tuned the training parameters by tweaking hyper-parameters such as epochs, batch size, and regularization techniques to achieve
optimal performance. Additionally, we spent a lot of time and effort assembling a larger
and more diversified dataset, ensuring it covered a variety of important occurrences, edge cases, and potential challenges. By using this technique, we were able to train the
model on a sizable and varied set of data, increasing its ability to generalize and handle a
range of situations.
4.4.3 Implementation
The implementation involved installing the necessary software libraries and dependencies
on the Raspberry Pi, ensuring compatibility and smooth execution. Integration with the Pi
camera, GSM module, and speaker required establishing appropriate connections and
interfaces based on hardware specifications.
Figure 4.7: Circuit design for hardware implementation
The above figure 4.7 depicts the circuit design done using the Fritzing software for
hardware implementation of the project.
1. Downloaded the trained model files from Colab to the local machine in .h5 format.
2. Transferred the model files to the Raspberry Pi.
We had to conduct rigorous testing of the integrated system to verify its performance and
reliability by iterating and refining the algorithms. Following the system implementation,
video frames captured by the Pi camera undergo preprocessing before being inputted into
the deep learning model. The model's output activates alerts and notifications through the
GSM module and speaker, according to predetermined conditions.
4.5 Case Design
For the compact enclosure of the system, a case was designed. After specifying the dimensions and shape, the design was done in Fusion 360 and 3D printed in the FabLab. The case was designed to make the system compact and portable while providing access to the necessary ports and functionality. The dimensions of the case are 10 cm by 10 cm by 11 cm.
Figure 4.10: System case
CHAPTER 5: RESULTS AND ANALYSIS
5.1 Introduction
This chapter comprises the results obtained from training and testing the models, which
were conducted to replicate a real scenario. Additionally, the analysis of the system through
testing in various scenarios and cases is included. The chapter also covers a discussion on
the performance of the model and the methods used to improve its performance.
Furthermore, the results of the system's performance are analyzed using various
parameters. The chapter concludes with a discussion on the cost analysis of the system.
5.2 Statistics of Datasets
The pie chart below represents the datasets collected for training the model. The datasets were collected from various sources such as GitHub, Kaggle, and custom-made videos.
The dataset statistics indicate that it contains 900 video samples, with an equal distribution
of 450 fight videos and 450 non-fight videos. Each video sample has an average duration
of 10 seconds, resulting in a substantial amount of data for training and testing the fight scene detection system. The dataset was split into training and testing sets in an 80:20 ratio.
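Preparing the clips and performing the 80:20 split could be sketched as below, assuming scikit-learn is available and that the videos are stored in "fight" and "non_fight" folders; the folder layout, 20-frame sequence length, and 64x64 frame size are assumptions for illustration.

```python
# Sketch of dataset preparation and the 80:20 split (assumed folder layout and sizes).
import os
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

SEQUENCE_LENGTH, SIZE = 20, 64

def extract_frames(video_path):
    """Return SEQUENCE_LENGTH evenly spaced, resized, normalized frames from one clip."""
    video = cv2.VideoCapture(video_path)
    total = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // SEQUENCE_LENGTH, 1)
    frames = []
    for i in range(SEQUENCE_LENGTH):
        video.set(cv2.CAP_PROP_POS_FRAMES, i * step)
        ok, frame = video.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (SIZE, SIZE)) / 255.0)
    video.release()
    return frames

features, labels = [], []
for label, folder in enumerate(["non_fight", "fight"]):           # assumed folder names
    for name in os.listdir(os.path.join("dataset", folder)):
        clip = extract_frames(os.path.join("dataset", folder, name))
        if len(clip) == SEQUENCE_LENGTH:
            features.append(clip)
            labels.append(label)

X_train, X_test, y_train, y_test = train_test_split(
    np.array(features), np.array(labels), test_size=0.2, shuffle=True)   # 80:20 split
```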
5.3 Comparative Study of Models
The table below provides a comparative analysis of five different models namely LRCN,
CNN+RNN, MobileNetV2, ConvLSTM, and MobileNet+LSTM. These models were evaluated based on the accuracy obtained during model training and on the real-time processing time per batch measured when they were tested on the system.
These results indicate that the LRCN and MobileNetV2 models achieved the highest
accuracy while maintaining real-time processing capabilities. However, the LRCN model was deployed in the system, considering its trade-off between accuracy and processing time.
5.4 Model Evaluation
The model evaluation was based on the accuracy of the LRCN model deployed in the system to detect fights.
The above graph represents the accuracy of the trained LRCN model. The accuracy obtained while training the model was 88.12%, which shows that the model has high accuracy.
5.5 Hyper-parameter Tuning Analysis
Hyper-parameter tuning was carried out for the LRCN (Long-term Recurrent Convolutional Network) architecture. The hyper-parameters that were tuned include the number of frames per sequence and the number of epochs for training.
Frame Sequence Length   Epochs   Accuracy   Processing Time per Batch
20                      45       88.12%     25 ms
25                      50       86.23%     27 ms
From the analysis, it can be observed that the LRCN model achieves its highest accuracy of 88.12% when trained with a 20-frame sequence for 45 epochs. The real-time testing results show that the processing time per batch ranges from 23 ms to 27 ms, indicating that the model can process video frames in real time.
Therefore, it shows that by tuning the hyper-parameters of the LRCN model, it was possible
to optimize its performance and achieve high accuracy in detecting fights. Furthermore,
these findings provide insights into the ideal configuration for the fight detection system,
enabling effective deployment and real-time monitoring.
5.6 System Testing Result
To evaluate the system accuracy, the system was tested on real-time input video. The following figures show the results of the system testing.
The detection accuracy, or confidence score, for non-fight incidents is 94%, while for fight incidents it is 99.6%. These results indicate that the model was effective in accurately identifying and detecting fights.
5.7 System Performance Analysis
The system performance is affected by various parameters such as the camera angle, the range of the camera, and the light intensity of the scene being captured and analyzed. The analysis of the system performance was based on these parameters, and system testing was conducted for these parameters in different scenarios. The table below shows the experimental data collected during the system analysis.
Table 5.3: Data obtained from system analysis
Parameter        Performance Time (ms)   Accuracy (%)
Medium (4-6 m)   26                      89.7
Long (7-10 m)    30                      82.1
Narrow (2°)      30                      83.2
Figure 5.8: Response time versus various parameters
The table presents an overview of the system's performance based on different parameters,
including light intensity, distance, and camera angle. It demonstrates that as the light
intensity increases from low to medium and high, the system's accuracy improves to 91.6%
and 95.8%, respectively, from an initial value of 85.2%. Similarly, as the distance increases
from short to medium and long, the accuracy decreases to 89.7% and 82.1%, respectively,
from an initial value of 93.4%. The camera angle also influences accuracy, with wider
angles (180°) achieving a higher accuracy of 92.7% compared to moderate angles (88.4%)
and narrow angles (83.2%). The corresponding performance times in milliseconds indicate
the speed at which the system processes the data. These findings emphasize the significance
of considering environmental factors and parameter settings for optimizing the
performance of the fight detection system.
5.8 System Performance and Reliability for the GSM Communication
The table below presents experimental results for violence detection, specifically focusing on communication delay and response time. These results offer valuable insights into the performance of the communication system, highlighting the delay in transmitting information and the time required to generate a response.
Table 5.4: Reliability of GSM communication
Test No.   Communication Delay   Response Time
1          10                    2
2          20                    3
3          15                    2.5
4          12                    2.2
5          18                    3.1
6          25                    2.8
The recorded values for communication delay and response time reflect the performance
and reliability of the system in transmitting alerts and receiving timely responses. A lower
communication delay signifies efficient communication between the GSM module and
relevant authorities, enabling faster alert transmission. Similarly, a faster response time
demonstrates the responsiveness and effectiveness of the concerned authorities in
addressing detected fight scenes.
Analyzing the system based on communication delay and response time provides valuable
insights into its performance and reliability in facilitating timely communication and
response to fight scenes. By optimizing the system to minimize communication delays and
improve response times, it can significantly enhance public safety by enabling prompt
actions from the authorities.
5.9 Cost Analysis

Table 5.5: Cost analysis
Item      Quantity   Cost
Speaker   1          Nu. 1,000
SD Card   1          Nu. 999
The cost analysis above includes the prices for all the materials required for the project.
The Raspberry Pi processor, Pi Camera, Speaker, GSM Module, SD Card, Connecting
Wires, and Power Adapter have been listed along with their respective quantities and costs.
The total cost of all the materials combined is Nu. 18,995.
CHAPTER 6: CONCLUSION AND FUTURE WORK
6.1 Conclusion
The primary objective of this project was to employ a deep learning model to develop a
fight detection system. By putting in place a reliable system using a Raspberry Pi 4B, a Pi
camera, a speaker, and a GSM module, the goal was accomplished. In order to categorize
the incoming video feeds from the camera as fight or non-fight, the Long-term Recurrent
Convolutional Networks (LRCN) architecture, which combines the Convolutional Neural
Networks (CNN) and the Recurrent Neural Networks (RNN) was trained and deployed.
A custom dataset along with datasets from Kaggle and GitHub was utilized for this. The system efficiently detects fights by analyzing video feeds in real-time and is designed to be installed in various locations. Upon detection, it automatically places a call to the police, sending them the particular location, and plays an alert at the scene of the incident.
The project focused on creating a reliable and effective fight detection system and assessing
its performance in practical situations. By providing an automated and proactive method for recognizing and responding to altercations in public places, the system is an effective tool for lowering crime and increasing public safety and security. It has the advantage of real-time fight detection over conventional CCTV systems, removing the need for constant human monitoring and speeding up response times.
6.2 Future Work and Recommendation
Future work for improving the system's performance includes addressing challenges related to lighting conditions, camera angles, distance variations, and the use of multiple cameras. To improve classification in varied lighting conditions, methods such as image enhancement, adaptive thresholding, and dynamic lighting adjustment can be explored. Preprocessing video frames to equalize lighting can help maintain accuracy across different lighting scenarios. Enhancing performance across various camera angles involves training the model on
a diverse dataset with videos captured from different viewpoints. This enables the system
to handle varying camera angles and classify conflicts more precisely. Additionally,
investigating changes to the dataset or synthesizing different camera angles can further
improve the system's ability to handle different viewpoints.
To address differences in distance from the camera, future research can focus on developing
techniques to calculate distances between objects or people in video frames. This can
enhance the system's accuracy in scenarios where fights occur at various distances. By
utilizing distance estimation algorithms, the system can adjust its categorization strategy
based on the proximity of the fight, improving overall performance and accuracy. Another
potential avenue for improvement is using multiple cameras to capture the scene from
different angles. This approach provides a more comprehensive view of the incident,
enhancing the accuracy of fight detection and classification.
REFERENCES
Ahmed, M., Ramzan, M., Khan, H. U., Iqbal, S., Khan, M. A., Choi, J. I., Nam, Y., &
Kadry, S. (2021). Real-time violent action recognition using key frames extraction and
deep learning. Computers, Materials and Continua, 69(2), 2217–2230.
https://doi.org/10.32604/cmc.2021.018103
Akti, S., Tataroglu, G. A., & Ekenel, H. K. (2019). Vision-based Fight Detection from
Surveillance Cameras. 2019 9th International Conference on Image Processing
Theory, Tools and Applications, IPTA 2019.
https://doi.org/10.1109/IPTA.2019.8936070
Arun Akash, S. A., Sri Skandha Moorthy, R., Esha, K., & Nathiya, N. (2022). Human
Violence Detection Using Deep Learning Techniques. Journal of Physics: Conference
Series, 2318(1). https://doi.org/10.1088/1742-6596/2318/1/012003
Bhagya Divya, P., Shalini, S., Deepa, R., & Reddy, B. S. (2017). Inspection of Suspicious
Human Activity in the Crowdsourced Areas Captured in Surveillance Cameras.
International Research Journal of Engineering and Technology. www.irjet.net
Donahue, J., Hendricks, L. A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko,
K., & Darrell, T. (2017). Long-Term Recurrent Convolutional Networks for Visual
Recognition and Description. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 39(4), 677–691. https://doi.org/10.1109/TPAMI.2016.2599174
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
http://www.deeplearningbook.org
Iqbal, M. J., Iqbal, M. M., Ahmad, I., Alassafi, M. O., Alfakeeh, A. S., & Alhomoud, A.
(2021). Real-Time Surveillance Using Deep Learning. Security and Communication
Networks, 2021. https://doi.org/10.1155/2021/6184756
Irfanullah, Hussain, T., Iqbal, A., Yang, B., & Hussain, A. (2022). Real time violence
detection in surveillance videos using Convolutional Neural Networks. Multimedia
Tools and Applications, 81(26), 38151–38173. https://doi.org/10.1007/s11042-022-13169-4
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
https://doi.org/10.1038/nature14539
Lim, F. J. (2019). Smart Security Camera Using Machine Learning. January, 54.
Patil, A., & Rane, M. (2021). Convolutional Neural Networks: An Overview and Its
Applications in Pattern Recognition. Smart Innovation, Systems and Technologies,
195, 21–30. https://doi.org/10.1007/978-981-15-7078-0_3
Shiranthika, C., Premakumara, N., Chiu, H. L., Samani, H., Shyalika, C., & Yang, C. Y.
(2020). Human Activity Recognition Using CNN & LSTM. Proceedings of ICITR
2020 - 5th International Conference on Information Technology Research: Towards
the New Digital Enlightenment, January.
https://doi.org/10.1109/ICITR51448.2020.9310792
Tiwari, R. K., & Verma, G. K. (2015). A Computer Vision based Framework for Visual
Gun Detection Using Harris Interest Point Detector. Procedia Computer Science, 54,
703–712. https://doi.org/10.1016/j.procs.2015.06.083
Velasco-Mata, A., Ruiz-Santaquiteria, J., Vallez, N., & Deniz, O. (2021). Using human
pose information for handgun detection. Neural Computing and Applications, 33(24),
17273–17286. https://doi.org/10.1007/s00521-021-06317-8
Zong Chen, D. J. I. (2020). Smart Security System for Suspicious Activity Detection in
Volatile Areas. Journal of Information Technology and Digital World, 02(01), 64–72.
https://doi.org/10.36548/jitdw.2020.1.006