
Volume 6, Issue 6, June – 2021 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Deep Neural Network Approaches for Video Based Human Activity Recognition

Chaitanya Yeole
School of Electronics and Communication
MIT World Peace University
Pune, India

Hricha Singh
School of Electronics and Communication
MIT World Peace University
Pune, India

Hemal Waykole
School of Electronics and Communication
MIT World Peace University
Pune, India

Anagha Deshpande
Assistant Professor, School of Electronics and Communication
MIT World Peace University
Pune, India

Abstract:- In this paper we explain, implement, and test methods of human action recognition for the application of video surveillance. The paper provides a method for automatically recognizing human activities in video sequences captured by a single wide-view camera in outdoor locations. The dataset, consisting of videos recorded at a resolution of 720x480, is described in detail. The methods we implement are the CNN-VGG16 model and the single-frame CNN model. We demonstrate our techniques on real-world video data, distinguishing normal behaviors from suspicious ones in a playground setting using films of continuous performances of six types of human-human interaction: handshaking, pointing, hugging, pushing, kicking, and punching. From our observations, we conclude that the single-frame CNN model shows much better results than CNN-VGG16. The implementation was done in Python. The paper describes how a simple convolutional neural network classifier proved efficient for predicting the activity using a single-frame method; the working of this method is described briefly in the methodology. The differences and drawbacks of these methods for human activity recognition can be seen clearly in their respective outputs and results.

Keywords:- Artificial intelligence (AI) models are created to perceive the movement of humans from the provided dataset.

I. INTRODUCTION

Human activity recognition has great importance in many applications, including video surveillance, human-human interaction, and human-computer interaction. Automatically recognizing activities of interest plays a vital part in many present video surveillance systems. Human activity recognition is a significant technology for monitoring the dynamism of an individual, and it can be accomplished with the help of machine learning methods. It can be used for security purposes and, most importantly, to detect criminal activity within minutes. It has a wide range of applications. Therefore, it is always desirable to develop new activity recognition algorithms and to study tried and tested methods by which we can obtain higher accuracy and stronger capability for handling various scenarios.

This study attempts to provide a comprehensive review of video-based human activity recognition, as well as an overview of various methodologies and their evolution, covering both typical classic works of literature and theories of possible solutions. One of the methods used in activity recognition is CNN-LSTM; this method not only enhances the accuracy of predicting human actions from raw data, but also decreases the model's complexity and eliminates the need for advanced feature engineering [1]. CNN is increasingly being used as a feature learning method for human activity recognition [2]. Extracting significant temporal features from raw data is critical. The majority of HAR techniques need a significant amount of feature engineering and data pre-processing, which necessitates domain expertise [3]. Many applications, such as video surveillance, health care, and human-computer interaction (HCI), are founded on vision-based HAR research [4]. HAR faces numerous difficulties, such as the large variation within a given activity, the closeness between classes, time consumption, and the high proportion of the Null class. These difficulties have driven researchers to build systematic feature extraction techniques and efficient recognition strategies to effectively address these issues [5]. Many authors have used sequential techniques and space-time volume approaches to express actions and recognize action sets directly from images. Then, for anomalous action states, they employed hierarchical recognition approaches. For hierarchical recognition, statistics-based approaches, syntactic methodologies, and description-based methodologies are all used [6]. Anomaly detection is one of the most well-known applications of human activity recognition [7].

II. DATASET AND PREPROCESSING

Hand-shaking, pointing, hugging, pushing, kicking, and punching are the human-human interactions that can be seen in the videos. A total of 20 video sequences with a length of roughly one minute make up the dataset. Each video comprises at least one execution per interaction, resulting in an average of eight human activity executions per video. The videos feature volunteers dressed in around 15 different ways. The videos were shot at a resolution of 720x480 pixels, with a person's height in the video being about 200 pixels.

The videos are divided into two sets for pre-processing, which makes it easier to work with the dataset. Set 1 is made up of ten video clips shot on a playground; its videos were shot at different zoom rates. Set 2, which consists of the remaining 10 sequences, was shot on a green lawn in a breezy environment. In sequences 1 to 4 and 11 to 13, only two interacting volunteers with different clothing appear in each video. Sequences 5 to 8 and 14 to 17 contain bystanders as well as the interacting people. Sequences 9, 10, 18, 19, and 20 show pairs of interacting volunteers performing the activities at the same time. The background and scale of each set are distinct.

For the implementation of the dataset in the methods, we had to pre-process this data further. We converted the dataset into frames using various Python libraries. These frames were then segregated into train and test datasets for the application of the two techniques, i.e., VGG-16 and CNN; a minimal sketch of this split is given below.
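The paper does not show the code for this step; the sketch below is only a minimal illustration of the train/test segregation, assuming the frames and labels have already been collected into NumPy arrays. The file names and the 80/20 split ratio are our illustrative choices, not fixed by the paper.

import numpy as np
from sklearn.model_selection import train_test_split

# Assumed shapes: frames -> (num_frames, height, width, 3), labels -> (num_frames,)
frames = np.load("frames.npy")   # hypothetical file of pre-extracted frames
labels = np.load("labels.npy")   # one integer class per frame (6 activity classes)

# Hold out 20% of the frames for testing; shuffle to mix sequences and classes.
train_x, test_x, train_y, test_y = train_test_split(
    frames, labels, test_size=0.2, shuffle=True, random_state=42
)
print(train_x.shape, test_x.shape)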

III. PROPOSED METHOD

The dataset required for both techniques was the same. The dataset was converted into frames and further divided into train and test categories, which were then used with the models. More elaborate information about the models follows.

1.1 VGG-16 (Visual Geometry Group)

VGG-16 is a large-scale image recognition architecture based on deep convolutional neural networks. The University of Oxford's K. Simonyan and A. Zisserman proposed this approach. In the VGG-16 architecture, the input to the network, a pre-processed frame, is an image of dimensions 224x224x3. The first two layers are convolution layers with 64 filters of size 3x3 and the same padding. After a 2x2 max pool layer, two convolution layers with 128 filters of size 3x3 follow, ending in another 2x2 max-pooling layer of stride 2, the same as the previous one. Next comes a block of three 3x3 convolution layers with 256 filters and a max pool layer.

Later, two sets of three convolution layers and a max pool layer are formed; each convolution layer has 512 filters of 3x3 size with the same padding. All the filters we utilize in these convolution layers are 3x3 in size, and all max-pooling windows are 2x2. After the final convolution and max-pooling layers, a 7x7x512 feature map is obtained. This output is flattened into a 1x25088 feature vector and passed through three fully connected layers, after which a Soft-max layer normalizes the classification vector. In this architecture, the activation function used is ReLU, as it is fast and more efficient and also reduces the probability of the vanishing gradient problem. In the case of human activity recognition, however, this model does not show the best results, and it has various speed limitations. A minimal sketch of such a model is given below.
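The paper reports a Python implementation but does not include code, so the following Keras sketch is only our illustration of how a VGG-16 based frame classifier of this kind could be assembled. The ImageNet weights, the frozen base, and the 256-unit dense layer are assumptions; the 224x224x3 input, the flattened 25088-vector, the Soft-max output, and the six classes come from the text above.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# VGG-16 convolutional base; the input matches the 224x224x3 frames described above.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # assumption: keep the pre-trained convolutional features fixed

model = models.Sequential([
    base,
    layers.Flatten(),                       # 7x7x512 feature map -> 1x25088 vector
    layers.Dense(256, activation="relu"),   # illustrative size, not from the paper
    layers.Dense(6, activation="softmax"),  # six human-human interaction classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()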

1.2 CNN Model
Convolutional Neural Networks are a type of neural network which can identify and classify features from frames. Analyzing visual images is one of their most widely used functions. Video and picture identification, image classification, medical image analysis, computer vision, and natural language processing (NLP) are just a few of their applications.

A CNN architecture is divided into two parts:-
• Feature extraction: a process in which a convolution tool identifies and separates the various features of the pre-processed frames for analysis.
• Classification: based on the characteristics extracted in the previous phase, a fully connected layer takes the output of feature extraction and predicts the class of the frames.

Fig.1 CNN Architecture

Convolution Layer: The first layer in feature extraction. The mathematical operation of convolution is executed between the input frame and a filter of a particular size. By sliding the filter over the input frame, the dot product between the filter and each filter-sized patch of the input image is taken. Features such as the corners and edges of the frame are captured by the resulting output of this layer, which is also called a feature map. The output obtained is sent to the next layer.

Max-Pooling Layer: This layer is the second in the feature extraction process, and its major goal is to lower the size of the convolved feature map (the previous layer's output) in order to reduce computational costs. This is accomplished by lowering the number of connections between layers. As a result, each feature map is processed independently.

In max pooling, the largest element is taken from each section of the feature map derived from the previous layer. In mean pooling, the mean of the elements in a section of predefined output size is calculated. Sum pooling computes the total sum of the components in the predefined section. The pooling layer is typically used to connect the convolutional and fully connected layers. A small numeric example of these operations is sketched below.
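To make these two operations concrete, the following NumPy sketch (ours, not code from the paper) convolves a small 4x4 frame with an illustrative 2x2 filter and then applies 2x2 max pooling.

import numpy as np

frame = np.array([[1, 2, 0, 1],
                  [3, 1, 2, 0],
                  [0, 2, 1, 3],
                  [1, 0, 2, 1]], dtype=float)
kernel = np.array([[1, 0],
                   [0, -1]], dtype=float)  # illustrative 2x2 filter

# Convolution (stride 1, no padding): slide the filter and take dot products.
h = frame.shape[0] - kernel.shape[0] + 1
w = frame.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        patch = frame[i:i + 2, j:j + 2]
        feature_map[i, j] = np.sum(patch * kernel)

# 2x2 max pooling with stride 2, applied to the 4x4 input for simplicity
# (the 3x3 feature map above does not tile evenly into 2x2 sections).
pooled = frame.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(feature_map)
print(pooled)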

Fully Connected Layer: It contains the biases and weights, as well as the neurons, and is used to connect the neurons between two different layers. These layers are placed before the output layer and make up the final few layers of the CNN architecture. The input from the previous layers is flattened and sent to the fully connected layer. After that, the flattened vector proceeds through a few additional fully connected layers, where the mathematical operations usually take place.

Activation Function: Activation functions are utilized to approximate any form of continuous and complex relationship between variables of the network. In simple words, they decide which information of the model should move forward toward the end of the neural network and which should not. Our model uses the ReLU activation function, as it lowers the risk of vanishing gradients, as the short comparison below illustrates.
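As a brief illustration of this point (our example, not from the paper): ReLU's gradient is exactly 1 for every positive input, while the sigmoid's gradient never exceeds 0.25 and vanishes for inputs of large magnitude.

import numpy as np

x = np.array([-4.0, -1.0, 0.5, 3.0])

relu = np.maximum(0.0, x)
relu_grad = (x > 0).astype(float)          # gradient is either 0 or 1

sigmoid = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sigmoid * (1.0 - sigmoid)   # peaks at 0.25, vanishes in the tails

print(relu_grad)     # [0. 0. 1. 1.]
print(sigmoid_grad)  # small values at the extremes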

IV. METHODOLOGY

Fig.2 Implementation of Model

Extraction of the dataset into the code via Python libraries and tools. Visualize the dataset with labels for a better understanding of the further procedure: random classes are picked from the dataset and labelled with their respective activities.

Pre-processing the data: the frame extraction function reads the video file frame by frame, resizes each frame, normalizes the resized frame, appends the normalized frame to a list, and finally returns the list. A new dataset is created from the extracted frames, and this dataset is further split into training and testing sets. A sketch of such an extraction function is given below.
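The paper describes this function but does not list its code, so the following OpenCV sketch is our reconstruction of the steps named above (read, resize, normalize, append, return). The 64x64 target size, the function name, and the file name are illustrative assumptions.

import cv2
import numpy as np

IMG_SIZE = 64  # assumed resize target; the paper does not specify one

def extract_frames(video_path):
    """Read a video file frame by frame, resize and normalize each frame,
    and return the frames as a list of float arrays in [0, 1]."""
    frames = []
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:                 # stop at the end of the video
            break
        frame = cv2.resize(frame, (IMG_SIZE, IMG_SIZE))
        frames.append(frame.astype(np.float32) / 255.0)
    capture.release()
    return frames

frames = extract_frames("sequence_01.avi")  # hypothetical file name
print(len(frames))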

Construct the model: the model is built with 2 convolutional neural network layers using the ReLU activation function. Compile and train the model: the model is trained to an accuracy of about 90%. Plot the model's loss and accuracy curves using Python libraries and tools. Make predictions with a random video. Single-frame method for prediction: this method entails developing a function that generates a single prediction for the entire video. The function creates predictions on a set number of frames from the video; finally, the mean of the predictions over those frames is used to determine the final activity class for that video. A sketch of this prediction step follows.
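As a minimal sketch of this single-frame prediction step (our reconstruction, reusing the hypothetical extract_frames helper from the earlier sketch; the trained model, the class ordering, and the choice of n = 20 frames are assumptions), per-frame class probabilities are averaged and the largest mean probability decides the activity.

import numpy as np

CLASSES = ["handshake", "pointing", "hugging", "pushing", "kicking", "punching"]

def predict_video(model, video_path, n_frames=20):
    """Average the model's predictions over n_frames evenly spaced frames
    and return the most likely activity class with the mean probabilities."""
    frames = extract_frames(video_path)                 # from the earlier sketch
    idx = np.linspace(0, len(frames) - 1, n_frames).astype(int)
    batch = np.stack([frames[i] for i in idx])          # (n_frames, H, W, 3)
    probs = model.predict(batch)                        # (n_frames, 6)
    mean_probs = probs.mean(axis=0)                     # average the forecasts
    return CLASSES[int(np.argmax(mean_probs))], mean_probs

# Usage (hypothetical): label, probs = predict_video(model, "test_video.avi")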

V. RESULT AND DISCUSSION

Fig.3 Loss Curves

Fig.4 Accuracy Curves

Fig.5 Recognition result of the UT-Interaction dataset

Fig.6 Recognition video playing in the notebook
Figures 3 and 4 show the plot of loss vs. validation loss and the plot of accuracy vs. validation accuracy of the CNN model, respectively. The accuracy of this model is 90.64%, whereas the validation accuracy is approximately 90.34%. The model was trained on 6 activities. Figure 5 shows the model successfully predicting the probability of the respective activity being tested. Using the single-frame CNN method, we can predict the activity being performed in a video: the model averages the n frame predictions and then gives us the final activity class for that video in the form of a likelihood. Though the probabilities are approximate, the probability of the tested activity stands out and is easily identified.
VI. CONCLUSION

In this paper, a VGG-16 and CNN-based technique for human activity recognition is proposed. To evaluate the performance of the suggested techniques, extensive experiments were carried out on the UT-Interaction dataset. The experimental results showed that the average accuracies of our two approaches were 60% and 90.64%, respectively. VGG-16 therefore proved to be unreliable when it comes to raw videographic datasets, and the models take more training time, as observed. The CNN model, on the other hand, proved to be very efficient for human activity recognition.
REFERENCES

[1]. Chih-Ta Yen, Jia-Xian Liao, Yi-Kai Huang, "Human Daily Activity Recognition Performed Using Wearable Inertial Sensors Combined With Deep Learning Algorithms", IEEE Access, vol. 8, pp. 174105-174114, 2020.
[2]. Cruciani, F., Vafeiadis, A., Nugent, C., et al., "Feature learning for Human Activity Recognition using Convolutional Neural Networks", CCF Transactions on Pervasive Computing and Interaction, 2, 18-32 (2020).
[3]. Dua, N., Singh, S.N., Semwal, V.B., "Multi-input CNN-GRU based human activity recognition using wearable sensors", 2021.
[4]. Shugang Zhang, Zhiqiang Wei, Jie Nie, Lei Huang, Shuang Wang, Zhen Li, "A Review on Human Activity Recognition Using Vision-Based Method", Journal of Healthcare Engineering, vol. 2017, Article ID 3090343, 31 pages, 2017.
[5]. Jian Sun, Yongling Fu, Shengguang Li, Jie He, Cheng Xu, Li Tan, "Sequential Human Activity Recognition Based on Deep Convolutional Network and Extreme Learning Machine Using Wearable Sensors", Journal of Sensors, vol. 2018, Article ID 8580959, 10 pages, 2018.
[6]. A. Deshpande and K. K. Warhade, "An Improved Model for Human Activity Recognition by Integrated Feature Approach and Optimized SVM", 2021 International Conference on Emerging Smart Computing and Informatics (ESCI), 2021, pp. 571-576, doi: 10.1109/ESCI50559.2021.9396914.
[7]. Shreyas, D.G., Raksha, S., Prasad, B.G., "Implementation of an Anomalous Human Activity Recognition System", SN Computer Science, 1, 168 (2020).
[8]. Golestani, N., Moghaddam, M., "Human activity recognition using magnetic induction-based motion signals and deep recurrent neural networks", Nature Communications, 11, 1551 (2020).
[9]. Vrigkas, M., Nikou, C., Kakadiaris, I.A., "A Review of Human Activity Recognition Methods", Frontiers in Robotics and AI, vol. 2, 2015.
[10]. C. Dhiman, D.K. Vishwakarma, Engineering Applications of Artificial Intelligence, 77 (2019), pp. 21-45.
[11]. Abdellaoui, M., Douik, A., "Human action recognition in video sequences using deep belief networks", Traitement du Signal, vol. 37, no. 1, pp. 37-44 (2020).
[12]. Liu, C., Ying, J., Yang, H., et al., "Improved human action recognition approach based on two-stream convolutional neural network model", The Visual Computer, 37, 1327-1341 (2021).
[13]. Ryoo, M.S. and Aggarwal, J.K., "UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA)", 2010.
[14]. J.K. Aggarwal, Lu Xia, "Human activity recognition from 3D data: A review", Pattern Recognition Letters, vol. 48, 2014, pp. 70-80, ISSN 0167-8655.
[15]. N. Oliver, E. Horvitz and A. Garg, "Layered representations for human activity recognition", Proceedings, Fourth IEEE International Conference on Multimodal Interfaces, 2002, pp. 3-8, doi: 10.1109/ICMI.2002.1166960.
