CHAPTER 1
INTRODUCTION
Several real-time sensing applications are being developed, particularly in the fitness
and health-tracking fields. To better understand human behaviour, these applications use the
mobile sensors built into smartphones to identify human activity. The HAR system developed
here trains a supervised learning model, an LSTM-2D CNN, on data obtained from the
accelerometer sensor and uses its predictions to detect six fundamental human activities:
walking, standing, walking upstairs, walking downstairs, lying and sitting.
Many useful mobile applications have taken advantage of wearable sensors, including
abnormal-driving detection, healthcare systems for remotely monitoring elderly persons, sport
performance tracking, and mobile assistance systems for individuals with vision problems.
Because of improvements in healthcare, the proportion of elderly individuals in the global
population is higher than it has ever been. As a result, there is a growing need for social
support of the physical and emotional health of those who live alone. There are many reasons
to believe that machine learning and AI will be able to detect such activities automatically.
For seniors who want to age in place, activity recognition (AR) might be used to keep
track of their well-being, detect any worrying changes in routine, and notify responders right
away in case of an emergency. Depending on the hardware used to gather data, activity
recognition may be split into three categories: camera video, wearable technology, and binary
sensors. Due to concerns about privacy invasion and practical issues, such as discomfort from
the device and higher maintenance requirements, cameras and wearable technology are less than
ideal solutions. This research developed a device-free, privacy-protecting way to investigate
data-driven AR based on deep learning. The binary sensor-based method provides a solution to
the problem of long-term activity monitoring in the real world.
The representation and extraction of features are necessary for the AR process to be
complete. In order to effectively classify and identify actions that are frequently conflated,
such as standing, sitting, lying down and walking, this study set out to extract meta-actions
by evaluating the causal influence between sets of sensor activations. Each person's activities
reflect their unique values, customs and routines, which makes human activity highly variable.
Even if the activity areas are comparable, a user's habits and lifestyle may influence the
specific sequence or characteristics of sensor activation in any given activity. This variation
may be described as a causality between sensors.
Furthermore, machine learning approaches can be used to improve a wearable activity detection
model's ability to evaluate a variety of activities more accurately. Standard machine learning
approaches, however, generally rely on heuristic, manual feature extraction and are therefore
limited by human domain knowledge. As a result, the classification accuracy and other evaluation
metrics of systems built on standard machine learning are constrained. Deep learning (DL)-based
techniques are used in this study to overcome these limitations.
1.2.4 Gyroscope:
A gyroscope sensor is a tool that can measure and keep track of an object's rotation and
angular velocity. While accelerometers can only monitor linear motion, gyroscopes can measure
the tilt and lateral orientation of an object. The terms "angular rate sensor" and "angular
velocity sensor" are also used to refer to gyroscope sensors. These sensors are used in
situations where it is challenging for humans to determine an object's orientation.
1.2.5 Pedometer:
A pedometer is a device that tracks and counts the steps a person takes while walking.
Pedometers are increasingly used by fitness enthusiasts for fitness-related activities.
Since the majority of modern smartphones include an integrated accelerometer, we were able to
implement pedometer functionality on a smartphone in this research. We used this smartphone-based
pedometer in our project to avoid the cost of dedicated trackers such as Fitbit devices, which
currently cost between 5k and 6k.
1.2.6 Magnetic Field Sensors:
By detecting the planar magnetic field, the magnetic field sensor can identify the
direction and strength of the magnetic field. It is frequently used in conventional compass or
map navigation to help mobile phone users obtain accurate positioning.
The magnetic field sensor may be used to measure the mobile phone's magnetic field
intensity along the x, y and z directions. If the phone is rotated so that the value in only one
direction is non-zero, that direction indicates south. Numerous mobile phone compass applications
use the information from this sensor.
Deep learning is performed by artificial neural networks, which include numerous layers.
Such networks include deep neural networks (DNNs), where each layer is capable of carrying
out complicated operations like representation and abstraction to make sense of text, sound, and
image data. Deep learning, often regarded as the machine learning area with the greatest rate of
growth, is being employed by more and more firms to develop novel economic models.
Similar to how the human brain is composed of neurons, neural networks are composed of
layers of nodes. Nodes in one layer are connected to nodes in neighbouring layers. The more
layers a network has, the deeper it is considered to be. In the human brain, a single neuron
receives hundreds of impulses from other neurons. In an artificial neural network, signals
travel between nodes and are assigned corresponding weights. A node with a higher weight will
have a greater impact on the nodes in the layer below it. The weighted inputs are combined to
create an output in the final layer. Deep learning algorithms need advanced hardware because
they process a lot of data and perform several intricate mathematical calculations. However,
training a neural network can be challenging even with such sophisticated hardware.
Large data sets are fed into deep learning systems because they need a lot of information
to produce correct results. While processing the data, artificial neural networks classify it
using the answers received from a series of binary true-or-false questions involving extremely
complicated mathematical calculations.
For human activity recognition, deep learning algorithms are used to obtain better results.
Several deep learning algorithms are available for activity recognition: the Convolutional
Neural Network (CNN), the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM),
Radial Basis Function Networks (RBFN) and Multilayer Perceptrons (MLP). Among these models,
LSTM-2D CNN is the best for activity recognition.
Due to its unique memory cells, LSTM outperforms convolutional neural networks at
extracting features from sequence data. In order to extract the temporal characteristics of the
sequence data more effectively, the input data in this work first passes through two layers of
LSTMs, with 32 memory cells per layer. To control the functioning of each memory cell, the
inputs are passed through different gates: input gates, forget gates and output gates.
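A minimal Keras sketch of these two stacked LSTM layers is given below; the input shape (128 time steps of 9 sensor channels) is an illustrative assumption rather than a value taken from this report.

from tensorflow.keras import layers, models

# Sketch of the two stacked LSTM layers described above (32 memory cells each).
# The window length (128) and channel count (9) are illustrative assumptions.
lstm_extractor = models.Sequential([
    layers.LSTM(32, return_sequences=True, input_shape=(128, 9)),
    layers.LSTM(32, return_sequences=True),
])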
CNNs have driven many of the advances in deep learning. A CNN is a deep learning
algorithm that can assign importance to parts of an image, process images and differentiate
them from one another. CNNs differ from ordinary neural networks in that they are far more
efficient for this task. At its core, a convolutional neural network is simply the application
of filters to an input, which results in activations. CNNs are the branch of deep learning that
deals primarily with image recognition and image processing.
As the proposed model deals with human activity recognition using hybrid deep learning
networks, this approach was chosen for efficient and accurate results. Such networks are designed
to classify specific arrays and images and contain multiple layers, including hidden layers.
This technique is therefore well suited to the task and has proven efficient in several previous
models.
A general CNN architecture consists of basic layers: a convolutional layer, a max
pooling layer, dropout, a fully connected layer and activation functions. This basic structure
can be modified by adding or removing layers. Our model is constructed with four convolutional
2D layers, batch normalization, max pooling layers, a flatten layer and a dense layer stacked on
top of one another. It is trained end to end on a dataset of six activities: sitting, standing,
walking, lying, walking upstairs and walking downstairs.
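A minimal Keras sketch of this convolutional stack is shown below; the filter counts, kernel sizes and input shape are illustrative assumptions rather than the exact values used in this work.

from tensorflow.keras import layers, models

# Illustrative sketch of the stacked CNN classifier described above: four Conv2D
# layers with batch normalization and max pooling, then a flatten layer and a
# dense softmax layer over the six activity classes.
cnn_classifier = models.Sequential([
    layers.Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(128, 9, 1)),
    layers.BatchNormalization(),
    layers.MaxPooling2D(pool_size=(2, 1)),
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    layers.BatchNormalization(),
    layers.MaxPooling2D(pool_size=(2, 1)),
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    layers.BatchNormalization(),
    layers.Conv2D(128, (3, 3), padding='same', activation='relu'),
    layers.Flatten(),
    layers.Dense(6, activation='softmax'),  # six activity classes
])
cnn_classifier.compile(optimizer='adam',
                       loss='categorical_crossentropy',
                       metrics=['accuracy'])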
1.5.1 Convolution 2D Layer
The convolutional layer is a major building block of a CNN and is generally used at
the beginning, as the first layer. The convolution layer extracts high-level features, such as
edges, from the input images, and the convolved images are dimensionally reduced or increased.
The purpose of the convolution layer is to construct outputs for the given inputs. The difference
between a normal (1D) convolution layer and a convolution 2D layer is that the latter takes
two-dimensional inputs, whereas a 1D convolution layer takes linear inputs. It also decreases the
image size. Four convolution 2D layers are used to meet the requirements of this experiment with
the dataset.
In deep learning, the batch normalization layer standardizes the inputs to each layer
within a batch. The dropout rate is closely tied to batch normalization: without a regularizer,
batch normalization is less effective and the model is more likely to show poor performance and
outcomes.
With the use of batch normalization, the number of epochs in the training period may be
decreased while the learning process of the deep network is stabilized. Because it also acts as
a regularizer, it can decrease the internal covariate shift in the layers as well as instability
between the layers, so the overfitting problem becomes less severe as the network grows deeper.
Batch normalization is usually beneficial in combination with other regularization strategies.
Batch normalization is a technique for speeding up and stabilizing deep neural networks
by adding additional normalization layers. It also enables each layer to learn more
independently, continuously normalizing the output of the preceding layer before it is passed to
the next neural network layer.
The max pooling layer is used to reduce the dimensions of the feature maps; it summarizes the
feature maps and reduces the number of parameters. It also reduces learning time and
computational cost. Down-sampling in this way is very useful for avoiding the overfitting problem
commonly seen in CNNs, and it makes the model invariant to certain distortions.
1.5.6 Activation Functions
The activation function applied in this architecture is ReLU. The rectified linear unit
(ReLU) is an activation function that is applied alongside the convolutional layers during
training.
ReLU outputs the input directly when it is positive and zero otherwise, i.e. f(x) = max(0, x).
The vanishing gradient problem is the main disadvantage of activation functions such as the
hyperbolic tangent and sigmoid. This problem can be avoided by using the ReLU function, letting
models learn faster and perform better. ReLU is the default activation function used in most
CNNs and multilayer perceptrons. In a CNN, the ReLU layer is applied after the convolution layer
and before the max pooling layer. The major purpose of using this function is to increase the
non-linearity in the images being processed for training. In the CNN, the ReLU function is
applied to the convolved image.
Transfer learning focuses on storing the knowledge gained while solving one problem so
that it can be applied to a completely different problem, which is why it is also treated as a
research problem in its own right. For example, the knowledge gained while learning image
recognition of two-wheelers can be reused for image recognition of four-wheelers. Transfer
learning is a technique of using pre-trained models for other algorithms and functions: the
knowledge stored in a previous model is reused, with some modifications, in the current model.
Transfer learning combined with Python is the preferred approach for the LSTM-2D
CNN technique. The major advantage of transfer learning is that it helps avoid the overfitting
problem that commonly occurs in image processing and convolutional neural networks.
When modelling a second task, transfer learning improves performance and keeps training progress
stable. Knowledge from previously learned tasks improves the new model and helps in building the
algorithm seamlessly.
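As a simple sketch of this idea in Keras, a previously trained model can be frozen and reused as a feature extractor while only a new classification head is trained; the file name and layer sizes below are placeholders, not artefacts of this project.

from tensorflow.keras import layers, models

# Hypothetical example: reuse a model trained on an earlier task as a frozen base.
base_model = models.load_model('pretrained_har_model.h5')  # placeholder path
base_model.trainable = False                               # keep the stored knowledge fixed

transfer_model = models.Sequential([
    base_model,
    layers.Dense(64, activation='relu'),
    layers.Dense(6, activation='softmax'),  # new task: six activity classes
])
transfer_model.compile(optimizer='adam',
                       loss='categorical_crossentropy',
                       metrics=['accuracy'])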
1.4 Motivation
Human activity recognition is an important and challenging research area with many applications
in healthcare, smart environments, and surveillance and security. It is a field that deals with
this issue through the integration of sensing and reasoning, in order to deliver context-aware
data that can be employed to provide personalized support in many applications.
As a simple example, imagine a smart home equipped with ambient sensors able to detect
people's presence and the activation of household appliances.
Before delving into the project work, we must first understand the technical aspects,
system requirements, and organization of the project report, which are detailed further below.
The project report is divided into five chapters and references.
CHAPTER 2
LITERATURE SURVEY
The literature study is primarily performed to evaluate the history of the present project,
which aids in identifying flaws in the current system and guidelines for resolving unresolved
problems. The following work discusses the project's history as well as the challenges and
limitations that led to the proposal of remedies and the work of this project.
I. N. Yulita et al. note that human activity recognition (HAR) is a rapidly expanding area of
study with several uses. A wearable-based HAR system called the Magnetic Induction-based Human
Activity Recognition System (MI-HAR) has been proposed for collecting human movements and
identifying activities based on the gathered data. The study mainly concentrated on the
performance examination of several machine learning classifiers using artificial MI-motion data
(signals based on magnetic induction). Its primary goal was to assess how well six popular
classifiers perform in HAR applications. Additionally, the classification performance obtained
from MI-motion data was compared with results from similar research employing accelerometer
data. According to the findings, Random Forest had the best performance on MI-motion data,
scoring 91.5%.
Smartphone and smartwatch sensors may be used to extract information about the user's
context, notably their activities. Machine learning algorithms can classify human behaviours
using raw data gathered from the sensors. Studies that concentrate on identifying such
activities typically employ motion sensors like the accelerometer and gyroscope. M. C. Sorkun
et al. investigated the effectiveness of activity categorization when various sensors are
applied individually or collectively. Numerous features are extracted from raw data using a
dataset gathered from fifteen people performing six distinct activities, and supervised machine
learning techniques are then used to train and validate the findings. Performance is analysed
using five distinct classifiers and several validation techniques.
Yu Zhao et al. proposed to use residual bidirectional long short-term memory (LSTM)
cells in a deep network design. One advantage of the new network is that a bidirectional link may
combine the forward state of positive time and the reverse state of negative time (backward
state). Second, residual connections between stacked cells act as gradient highways, allowing
them to transmit underlying data straight to the top layer and therefore circumvent the gradient
vanishing problem. In general, the suggested network displays improvements on the spatial
(deeply stacked residual connections) and temporal (using bidirectional cells) dimensions,
aiming to increase the recognition rate. The accuracy was improved by 4.78% and 3.68%,
respectively, when evaluated using the Opportunity data set and the public domain UCI data set,
in comparison to earlier results. The public domain UCI data set's confusion matrix was then
analysed.
Sakorn Mekruksavanich et al. proposed a HAR framework for smartphone sensor data
based on time-series domains of Long Short-Term Memory (LSTM) networks. To examine the
effects of using various types of smartphone sensor data, four baseline LSTM networks are
compared. Additionally, a 4-layer CNN-LSTM hybrid network is suggested to enhance
recognition performance. On the public smartphone-based UCI-HAR dataset, the HAR
technique is assessed using several configurations of sample generation methods (OW and
NOW) and validation protocols (10-fold and LOSO cross-validation). Additionally, Bayesian
optimization methods are employed in that study since they are useful for fine-tuning the
hyperparameters of each LSTM network. Compared to earlier state-of-the-art methods, the
experimental findings show that the proposed 4-layer CNN-LSTM network performs well in
activity recognition, increasing the average accuracy by up to 2.24%.
Pei Tang et al. applied the minimum redundancy maximum relevance (mRMR) measure to
recognize human activity in smart home environments. The mRMR algorithm (under the D-R and
D/R criteria) was used to select features and build various feature subsets from observed
motion sensor events. The chosen feature subsets were then assessed, and two probabilistic
algorithms, the hidden Markov model (HMM) and the naive Bayes (NB) classifier, were used to
compare activity identification accuracy rates. The experimental results demonstrate that not
all features are helpful for recognizing human activity, and different feature subsets provide
varied accuracy rates. Additionally, even the same feature subset has a different impact on the
accuracy rate for different activity classifiers. It is crucial for researchers working on human
activity recognition to take into account both the relation between features and actions and the
redundancy among features. Generally speaking, both the maximal relevance measure and the mRMR
method can be used for feature selection and contribute positively to activity recognition.
Hong Yang et al. presented a prediction model based on multi-task learning for
forecasting the daily activity category and its occurrence time mutually and iteratively. First,
a feature space of everyday activity is formed by pre-processing raw sensor signals. A
convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM) units
are then combined in a simultaneous multi-task learning model that serves as the forecasting
model. Finally, the suggested model is assessed on five different datasets. According to the
experimental findings, this model outperforms the most recent single-task learning models in
accuracy by at least 2.22% and in the NMAE, NRMSE and R2 metrics by at least 1.542%,
7.79% and 1.69%, respectively. The average accuracy is 84%.
Smart Homes are typically seen as the ultimate answer to all livability issues, particularly those
involving the care of the elderly and disabled, energy conservation, etc. The secret to home
automation in smart homes is human activity recognition, which enables the smart services to
operate automatically in accordance with human thought.
Although a lot of recent research has been done in this area, much of it can only identify
default actions, which is probably not what smart home services need. Furthermore, because of
insufficient scalability, such research cannot be used outside of the lab. Yegang Du et al.
addressed this problem and proposed a novel framework to not only identify but also anticipate
human behaviour. The framework has three stages: activity prediction in advance, activity
recognition during the activity, and activity recognition after the activity. The hardware cost
of the framework, which uses passive RFID tags, is also low enough for it to be widely
deployed. Additionally, the experimental results show that the framework is highly scalable and
achieves good performance in both activity detection and prediction.
Agarwal et al. proposed a lightweight deep learning model for HAR and deployed it on a
Raspberry Pi 3. A shallow RNN combined with the LSTM algorithm was used to build this model.
Although only one dataset with six activities was evaluated, the recommended model is fairly
accurate and has a straightforward design, though this does not demonstrate how well it may
generalise. In the study [16], recurrent neural networks, neural networks, and a deep learning
combination of Inception modules and recurrent layers (the InnoHAR model) are used to categorise
activities. The authors used separable convolution in place of traditional convolution, which
proved effective for its intended use in the model settings. The results are good; however, the
model was slow to converge during the learning phase.
Davide Buffelli and Fabio Vandin proposed an innovative deep learning framework,
TrASenD, based entirely on an attention mechanism, which outperforms the previous state of the
art. The proposed attention-based architecture improves average accuracy by more than 7% over
the previous best-performing model. They also consider the problem of adapting HAR deep
learning models, which is essential in many applications. The average accuracy is 84%.
CHAPTER 3
PROPOSED METHODOLOGY
This chapter describes the methodology followed in the project. In the first stage we
work with a dataset that is used to train a weapon detection model. The YOLO (You Only Look
Once) series is one of the most advanced families of object detection models. In contrast to
other region-proposal-based techniques, it divides the input image into an S x S grid and then
predicts the class probabilities and bounding boxes for an object whose centre falls into a grid
cell.
The suggested system's main goal is to detect weapons in CCTV footage using a deep
learning technique. These techniques make use of labelled data to train and validate the
classifier. Finally, the developed model is used to detect weapons in CCTV footage. The main
steps are:
1. Collect the data needed to train and test the model.
2. Preprocess the data to eliminate unnecessary information and split it into training and test
sets.
3. Determine the structure of the learned function and the associated learning algorithm.
4. Design the model.
5. Evaluate the accuracy, precision, recall and F1 score of the estimator.
YOLO, which stands for "You Only Look Once", is an object detection technique that
divides images into a grid. Each grid cell is responsible for detecting the objects whose
centres fall inside it. Due to its efficiency and precision, YOLO is among the most well-known
object detection algorithms. In this thesis, the YOLOv5 algorithm is used for weapon detection.
YOLOv5's architecture contains three components, as shown in Figure 3.1: the Model
Backbone, the Model Neck and the Model Head. The Model Backbone's main goal is to take a source
image and extract meaningful, significant features from it. In YOLOv5, Cross Stage Partial
Networks (CSPDarknet) serve as the backbone for extracting detailed information from the images.
The Model Neck's main objective is to produce feature pyramids, which generally help models
scale well across images and make it possible to detect the same object at various scales and
sizes; feature pyramid models also perform well on unseen data. Other models, such as the
Feature Pyramid Network (FPN), BiFPN and the Path Aggregation Network (PANet), can be used for
the feature pyramid stage. The final detection stage is carried out by the Model Head, which
applies anchor boxes to the features and generates the final output vectors with bounding boxes,
objectness scores and class confidence scores.
We will now discuss the software that we used. In deep learning it is typical to use
Python as the primary programming language; that is the first tool we use, and for YOLOv5,
PyTorch is used.
3.3.1 Python
Python offers a large number of libraries, many of which are related to AI and
machine learning. TensorFlow (a neural network framework), Scikit-Learn (for data mining, data
analysis and machine learning) and others are among the most popular, and the list goes on.
Python also offers a simple OpenCV interface. Python's popularity stems from its powerful yet
simple implementations: with other languages, students and researchers must first master the
language before attempting ML or AI with it, whereas Python is not like this. TensorFlow is one
of the most crucial Python libraries that we will use.
3.3.2 PyTorch
PyTorch, a machine learning framework based on the Torch library, was created by Meta AI and is
now part of the Linux Foundation. It is used for applications such as computer vision and
natural language processing. It is open-source software, available for free under a modified
BSD licence. PyTorch features a C++ interface, even though the Python interface is more refined
and the main focus of development.
Many deep learning applications, including Tesla Autopilot, Uber's Pyro, Hugging Face's
Transformers, PyTorch Lightning and Catalyst, are built on top of PyTorch. PyTorch offers the
following two high-level features: tensor computing (like NumPy) with strong acceleration via
graphics processing units (GPUs), and deep neural networks built on a tape-based automatic
differentiation system.
3.3.3 Tensorflow
3.3.4 Keras
Keras is a Python-based open-source neural network library. It can run on top of
TensorFlow as well as alternative backends. It is user-friendly, modular and extensible, with
the goal of enabling rapid experimentation with deep neural networks. It does not itself perform
low-level operations such as tensor products and convolutions; that work is delegated to the
backend (such as TensorFlow, which does the job perfectly).
3.3.5 Numpy, Matplotlib and Scikit-Learn
There are other additional libraries that we may import and use. Numpy, as you may or
may not be aware, is one of the most popular Machine Learning libraries. Numpy includes
support for huge, multi-dimensional arrays and matrices, as well as a wide set of high-level
mathematical functions for working with these arrays.
Matplotlib is one of the most popular and capable frameworks for data visualisation.
Matplotlib is a Python 2D plotting package that generates high-quality figures in a range of
hardcopy and interactive formats across platforms. Matplotlib is a Python library that may be
used in Python scripts, Python and IPython shells, Jupyter notebooks, web application servers,
and four graphical user interface toolkits. With just a few lines of code, you could make plots,
histograms, power spectra, bar charts, error charts, scatterplots, and so on.
The major reason we use Google Colaboratory for this work is that it provides access to a
powerful graphics card, which allows deep neural networks to run faster. The NVIDIA Tesla K80
is used as the graphics card, and it allows code to be executed continuously for up to twelve
hours.
3.4 Hardware Requirements
In deep learning, the hardware determines performance through its computational
parameters. The following hardware was considered for implementing the system.
RAM : 8 GB or higher
In this work the YOLOv5 algorithm, which is implemented in PyTorch, is used for object
detection. The YOLOv5 repository can be downloaded from https://github.com/ultralytics/yolov5,
the repository's official home page. It is used to train the model on particular object classes
in images and videos. Once training is over, the system detects any weapons present in images
and videos, depending on the confidence value.
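For reference, the repository and its dependencies can be set up with the commands documented on that page; the following is only a sketch of the workflow:

git clone https://github.com/ultralytics/yolov5
cd yolov5
pip install -r requirements.txt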
The learning rate of a system can be described as the parameter that controls how much
the model is modified in response to the errors observed each time the weights are updated. The
choice of an optimal learning rate is crucial for a model, as illustrated in Figure 3.2.
Figure 3.2 Effect of Learning Rate on Deep Learning
Too small a learning rate can result in a very time-consuming training process, while
too large a learning rate can make the process unstable or cause it to move too quickly. As a
result, when training the model and obtaining the final result, our model adjusted the learning
rate, starting from an initial learning rate of 0.001.
The majority of annotation platforms can output one text file per image in the YOLO
labelling format. For each object in the image, a bounding-box (BBox) annotation appears on a
separate line of that text file. The annotations are normalized to the size of the image and
therefore range from 0 to 1. The format in which they are given is as follows:
< object-class-ID> <X center> <Y center> <Box width> <Box height>
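For example, a hypothetical label line for an object of class 0 would look like the following, with all values invented purely for illustration:

0 0.416 0.530 0.210 0.340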
Three YAML files that are included with the repo contain the configurations for the
training. Depending on the work, we will modify these files to meet our specific requirements.
The dataset parameters are described in the data-configurations file. The paths to the
training, validation and test (optional) datasets, the number of classes (nc), and the class
names in the same order as their indices must all be added to this file, since we are training
on our own custom dataset.
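A minimal sketch of such a data-configurations file for the three weapon classes used here might look as follows; the paths are placeholders that depend on where the dataset is stored:

train: ../weapon_dataset/images/train  # placeholder path
val: ../weapon_dataset/images/val      # placeholder path
nc: 3
names: ['pistol', 'knife', 'axe']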
The model architecture is determined by the model-configurations file. The P5 models that
Ultralytics supports include the following YOLOv5 architectures: YOLOv5n (nano), YOLOv5s
(small), YOLOv5m (medium), YOLOv5l (large), and YOLOv5x (extra large). These architectures work
well for training with 640x640 pixel images. An additional series, known as P6, is designed for
training with larger 1280x1280 images (YOLOv5n6, YOLOv5s6, YOLOv5m6, YOLOv5l6, YOLOv5x6). P6
models include an additional output layer for the detection of larger objects. They benefit the
most from training at higher resolution and deliver superior results. For each of the
aforementioned architectures, Ultralytics offers built-in model configuration files in the
'models' directory. If you are training a model from scratch, select the model-configurations
YAML file for the architecture you want (here, "YOLOv5s6.yaml"), then change the number of
classes (nc) parameter to reflect the number of classes present in your custom data.
The learning rate, momentum, losses and augmentations are all defined in the
hyperparameters-configurations file along with other training-related hyperparameters. The
directory "data/hyp/hyp.scratch.yaml" contains a default hyperparameters file provided by
Ultralytics. For the most part, starting your training with the default hyperparameters is
advised to establish a performance baseline.
3.5.4 Training
The model will perform best when it is trained entirely from scratch with a sufficiently
large dataset. In this thesis the dataset contains 2536 images with three classes. By giving the
weights argument an empty string (''), the weights are initialised at random. For training, we
provide the number of epochs, the batch size, the dataset path, the initial weights and the
image size. The script performs the training and then reports the results as accuracy, precision
and recall scores. In this thesis the number of epochs is 50, the batch size is 24 and the image
size is 640 x 640.
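With these settings, a representative training invocation would look like the following; the name of the data-configurations file (data.yaml) is assumed, and the empty weights string requests random initialisation:

python train.py --img 640 --batch 24 --epochs 50 --data data.yaml --cfg yolov5s6.yaml --weights ''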
3.5.5 Validation
The validation script is used to assess our model. The 'task' option controls whether
performance is evaluated on the training, validation or test split of the dataset.
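A sketch of such an evaluation, assuming the best weights were saved under the default runs directory, is:

python val.py --weights runs/train/exp/weights/best.pt --data data.yaml --img 640 --task test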
It is anticipated that transfer learning will lead to better outcomes than training from scratch.
Ultralytics' default models were pre-trained on the COCO dataset, although models pre-trained on
other datasets (VOC, Argoverse, VisDrone, GlobalWheat, xView, Objects365, SKU-110K) are also
supported. COCO is an object detection dataset that contains pictures of everyday scenes and has
80 classes. By giving the name of the pre-trained COCO model to the 'weights' argument, our
model is initialized with weights from that model, and the pre-trained model is downloaded
automatically.
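For instance, training could be started from the COCO-pretrained YOLOv5s6 checkpoint instead of random weights (a sketch, with the same illustrative file names as above):

python train.py --img 640 --batch 24 --epochs 50 --data data.yaml --weights yolov5s6.pt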
3.5.7 Feature Extraction
The backbone layer, which acts as a feature extractor, and the head layer, which computes
the output predictions, make up the two fundamental components of a model. To further
compensate for a small dataset size, we’ll utilize the same backbone as the pretrained COCO
model, and simply train the model’s head. The "freeze" parameter will fix the 12 layers that make
up the YOLOv5s6 backbone.
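A sketch of this feature-extraction setup, freezing the first 12 layers (the backbone) while training only the head, might be:

python train.py --img 640 --batch 24 --epochs 50 --data data.yaml --weights yolov5s6.pt --freeze 12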
The last potential training phase, called fine-tuning, consists of unfreezing the entire
model obtained previously and retraining it on our data at a very low learning rate. By gradually
adapting the pretrained features to the new data, this has the potential to produce significant
improvements. The learning rate is adjusted in the hyperparameters-configurations file. We use
the hyperparameters from the built-in "hyp.finetune.yaml" file, which specify a much lower
learning rate than the default. The weights saved in the previous stage are used as the initial
weights.
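As an illustrative sketch, fine-tuning then resumes from the weights saved in the previous stage with the low-learning-rate hyperparameters file referenced above; the paths are assumptions that depend on the repository version and run directory:

python train.py --img 640 --batch 24 --epochs 50 --data data.yaml --weights runs/train/exp/weights/best.pt --hyp data/hyp/hyp.finetune.yaml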
3.7 Summary
This chapter focused on a few key requirements, such as the system architecture of the
proposed system, the dataset used for the project, the hardware and software requirements,
training, validation, transfer learning and fine-tuning.
CHAPTER 4
Experimental Evaluation
4.1 Dataset
The dataset used for this study is a weapon dataset with 2536 images, containing
images of the knife, pistol and axe classes. The dataset consists of two parts: images and
labels. The information needed to detect weapons was gathered from publicly accessible websites,
CCTV videos on YouTube, GitHub repositories, and the imfdb.org online library of movie firearms.
Noisy data was removed from the dataset using image restoration, and the images were resized
according to the YOLO format. The dataset is divided into training and testing sets, with 80%
used for training and 20% for testing. In YOLOv5 the dataset consists of images and labels,
where the labels contain the bounding box coordinates. YOLOv5 also uses a YAML file, to which
the dataset information such as the number of classes, the class names and the dataset path is
provided.
In this thesis, four evaluation metrics are mainly used to assess weapon detection in CCTV
footage: accuracy, precision, recall and F1-score.
Accuracy = (TP + TN) / (TP + FP + TN + FN)    (1)
Precision = TP / (TP + FP)    (2)
Recall = TP / (TP + FN)    (3)
F1 Score = (2 × Precision × Recall) / (Precision + Recall)    (4)
All of the experiments in this thesis were performed using 4 GB of RAM, a 5th-generation
Intel Core i5 CPU, and a Google Colaboratory GPU with 4 GB of memory. The YOLOv5 system was
trained for 50 epochs with a batch size of 24 and a learning rate of 0.001 to identify weapons
in videos and images.
The mean average precision of the weapon detection system using the YOLOv5
algorithm is shown in Figure 4.1. It is clear that the accuracy improves over training; the
system's accuracy was close to 96.6%. Figure 4.2 shows the system's precision; the model
achieved 98% precision in our work. Figure 4.3 depicts the system recall, which is almost 95.7%.
Figure 4.2 Precision
Figures 4.4, 4.5 and 4.6 show the training box, class and object losses. The box loss
decreases as the number of epochs increases, there is no class loss in the model, and the object
loss also decreases with increasing epochs.
Figure 4.5 Training class loss
Figures 4.7, 4.8 and 4.9 show the validation box, class and object losses. The box loss
decreases and remains bounded as the number of epochs increases, there is no class loss in
validation, and the object loss also decreases with increasing epochs. The model is therefore
fit for detecting weapons in images and videos.
Figure 4.7 Validating box loss
Figure 4.9 Validating object loss
Figure 4.10 Output for Video
Figure 4.12 Output for Axe
Figure 4.10 shows the output obtained when a video input is processed, with a confidence
of 91%. Figure 4.11 shows the output of pistol detection in images with a confidence of 95%.
Figure 4.12 shows the output of axe detection in images with a confidence of 77%. Figure 4.13
shows the output of knife detection in images with a confidence of 79%.
4.5 Summary
This chapter provides a comprehensive, step-by-step examination of the results obtained
on the specified dataset. The experimental findings of the deep learning technique, the YOLOv5
algorithm, are shown in the figures above; they demonstrate strong performance metrics.
CHAPTER 5
REFERENCES
[9] T. S. S. Hashmi, N. U. Haq, M. M. Fraz and M. Shahzad, "Application of Deep Learning
for Weapons Detection in Surveillance Videos," 2021 International Conference on Digital
Futures and Transformative Technologies (ICoDT2), 2021, pp. 1-6, doi:
10.1109/ICoDT252288.2021.9441523.
[10] Singh, T. Anand, S. Sharma and P. Singh, "IoT Based Weapons Detection System for
Surveillance and Security Using YOLOV4," 2021 6th International Conference on
Communication and Electronics Systems (ICCES), 2021, pp. 488-493, doi:
10.1109/ICCES51350.2021.9489224.
[11] K. Ding, X. Li, W. Guo and L. Wu, "Improved object detection algorithm for drone-
captured dataset based on yolov5," 2022 2nd International Conference on Consumer
Electronics and Computer Engineering (ICCECE), 2022, pp. 895-899, doi:
10.1109/ICCECE54139.2022.9712813.
[12] L. Xiaomeng, F. Jun and C. Peng, "Vehicle Detection in Traffic Monitoring Scenes
Based on Improved YOLOV5s," 2022 International Conference on Computer Engineering
and Artificial Intelligence (ICCEAI), 2022, pp. 467-471, doi:
10.1109/ICCEAI55464.2022.00103.
[13] M. Jindal, N. Raj, P. Saranya and S. V, "Aircraft Detection from Remote Sensing Images
using YOLOV5 Architecture," 2022 6th International Conference on Devices, Circuits
and Systems (ICDCS), 2022, pp. 332-336, doi: 10.1109/ICDCS54290.2022.9780777.
[14] J. Zhou, M. Yan, C. Luo and X. Xing, "Underwater Sonar Target Detection Based on
YOLOv5," 2021 International Conference on Electronic Information Engineering and
Computer Science (EIECS), 2021, pp. 729-732, doi: 10.1109/EIECS53707.2021.9588050.
[15] Kisaezehra, M. U. Farooq, M. A. Bhutto and A. K. Kazi, "Real-time safety helmet
detection using yolov5 at construction sites," Intelligent Automation & Soft Computing,
vol. 36, no.1, pp. 911–927, 2023.
[16] Z. Li, J. Song, K. Qiao, C. Li, Y. Zhang and Z. Li, "Research on efficient feature
extraction: Improving YOLOv5 backbone for facial expression detection in live streaming
scenes," Frontiers in Computational Neuroscience, vol. 16, 980063, 2022, doi:
10.3389/fncom.2022.980063.
[17] W. Liu, Y. Hu and D. Fan, "Safety Helmet Wearing Recognition Based on Improved
YOLOv5," 2022 11th International Conference of Information and Communication
Technology (ICTech)), 2022, pp. 466-470, doi: 10.1109/ICTech55460.2022.00099.
CONFERENCE CERTIFICATE