TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING

HIMALAYA COLLEGE OF ENGINEERING

[CODE: CT-455]
A
FINAL YEAR PROJECT REPORT
ON
HUMAN ACTIVITY RECOGNITION USING
CONV-LSTM
BY:
BALKRISHNA RAY (HCE076BCT007)

BIKRAM BHUSAL (HCE076BCT008)

BISWAS PANDIT (HCE076BCT010)

SAURABH KARKI (HCE076BCT036)

A PROJECT REPORT SUBMITTED TO DEPARTMENT

OF ELECTRONICS AND COMPUTER ENGINEERING
IN PARTIAL FULFILLMENT OF THE REQUIREMENT
FOR BACHELOR’S DEGREE IN COMPUTER
ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMPUTER

ENGINEERING

LALITPUR, NEPAL
March 2024
HUMAN ACTIVITY RECOGNITION USING

CONV-LSTM

BY:
BALKRISHNA RAY (HCE076BCT007)

BIKRAM BHUSAL (HCE076BCT008)

BISWAS PANDIT (HCE076BCT010)

SAURABH KARKI (HCE076BCT036)

PROJECT SUPERVISOR

Er. HASINA SHAKYA

A report submitted in partial fulfillment of

the requirements for the degree of Bachelor
in Computer Engineering

Department of Electronics and Computer Engineering

HIMALAYA COLLEGE OF ENGINEERING

Tribhuvan University

Lalitpur, Nepal

March, 2024
ACKNOWLEDGEMENT
We express deep gratitude to the Institute of Engineering, Pulchowk, for
including major projects in the BCT IV/I syllabus, which has greatly enhanced our
academic journey and allowed us to apply practical knowledge. We are thankful
to the management of Himalaya College of Engineering (HCOE) for providing
us with this exceptional opportunity and assembling a team of experts to assist us
during our proposal defense.
We are very thankful to our respected Head of the Department, Er. Ashok GM,
and the Deputy Head of the Department, Er. Devendra Kathayat, for their
invaluable advice, unwavering support, and exceptional guidance throughout our
project. We are also grateful to our project supervisor, Er. Hasina Shakya, for her
unwavering commitment, motivation, and insightful contributions. We are
indebted to our friends and colleagues for their support and constructive feedback
in selecting our project topic. Their encouragement has fostered an environment of
growth and inspiration, fueling our passion for excellence. We humbly
acknowledge all those who have contributed to the realization of our ideas,
transforming them into tangible achievements. In conclusion, we extend our
heartfelt appreciation to everyone involved. We recognize the significant impact
of each individual and institution mentioned here, as their support and guidance
have paved the way for our success. We are eternally grateful for their
contributions.

GROUP MEMBERS:
Balkrishna Ray (HCE076BCT007)
Bikram Bhusal (HCE076BCT008)
Biswas Pandit (HCE076BCT010)
Saurabh Karki (HCE076BCT036)

ABSTRACT
This study introduces a methodology for human activity
recognition that integrates Convolutional Neural Networks (CNNs) with
Convolutional Long Short-Term Memory networks (ConvLSTMs). In the landscape of machine learning,
particularly within the realm of deep learning, this approach is designed to
tackle the intricate challenge of precisely identifying and categorizing a
diverse array of human activities. These activities encompass a wide
spectrum, ranging from fundamental actions such as walking and sitting, to
more intricate motions like dancing and cooking. The significance of this
endeavor reverberates across multifarious domains, including healthcare,
sports analysis, and surveillance, where accurate activity recognition holds
immense value. Conventional techniques often grapple with the complexity of
capturing both spatial intricacies and the nuanced temporal patterns inherent
within sequences of activities. To overcome these challenges, we advocate for
an innovative hybrid architecture that seamlessly amalgamates the strengths of
CNNs and ConvLSTMs. CNNs excel at extracting spatial features from raw
sensor data, creating a robust foundation for comprehending various activities.
On the other hand, ConvLSTMs specialize in modeling temporal
dependencies within sequential data, enabling the seamless comprehension of
intricate temporal dynamics embedded in human motions. By synergizing
these two powerful deep learning paradigms, our proposed framework not
only elevates the potential for accurate and holistic human activity recognition
but also contributes to the advancement of real-time activity understanding.

Keywords: Convolutional Long Short-Term Memory networks (ConvLSTMs),

Deep learning, Human activity recognition, Hybrid architecture

TABLE OF CONTENTS

ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
1. INTRODUCTION
1.1 OBJECTIVE
1.2 SCOPE
1.3 PROBLEM STATEMENT
2. LITERATURE REVIEW
3. REQUIREMENT ANALYSIS
3.1 Functional Requirements
3.2 Non-Functional Requirements
3.3 Feasibility Study
3.3.1 Technical Feasibility
3.3.2 Financial Feasibility
3.3.3 Operational Feasibility
4. SYSTEM DIAGRAM
4.1 System Flow diagram
4.2 Sequence Diagram
4.3 Data Flow Diagram
5. METHODOLOGY
5.1 Download and visualize the data with its labels
5.2 Pre-process the dataset
5.3 Split the data into train and test set
5.4 Implement the ConvLSTM approach
Step 5.4.1: Construct the Model
Step 5.4.2: Compile & Train the Model
Step 5.4.3: Evaluating the trained Model
Step 5.4.4: Plot Model’s Loss & Accuracy Curves
6. RESULT AND ANALYSIS
6.2 Model Architecture
6.3 Confusion Matrix
6.4 Classification Report
7. CONCLUSION
REFERENCES

LIST OF FIGURES
Fig 3.1: Use Case Diagram
Fig 4.2: System Flow Diagram
Figure 4.3: Sequence Diagram
Fig 4.4: Data flow diagram level 0
Fig 5.1: Representation of an LSTM cell
Fig 5.1.1: Train Prediction Workflow
Figure 5.4.4.1: Loss vs Validation
Figure 5.4.4.2: Accuracy vs Validation
Fig 5.5: Model Summary
Fig 5.5.1: Confusion Matrix Heatmap
Fig 5.5.2: Classification Report

LIST OF ABBREVIATIONS
HAR: Human Activity Recognition
CNNs: Convolutional Neural Networks
ConvLSTMs: Convolutional Long Short-Term Memory networks
HMMs: Hidden Markov Models
SVMs: Support Vector Machines

1. INTRODUCTION
Human Activity Recognition (HAR) stands as a pivotal research field, finding
applications in healthcare monitoring, sports analysis, and surveillance systems. In
an increasingly data-driven era, the automatic identification, classification, and
comprehension of human activities from sensor data have captured significant
interest. Integrating Convolutional Neural Networks (CNNs) and Long Short-
Term Memory networks (LSTMs) offers a promising avenue to enhance HAR
model accuracy.

Activities span a wide spectrum, from basic motions like walking and sitting to
complex gestures like dancing and cooking. Accurate activity recognition holds
substantial value across domains. Traditional methods often grapple with
capturing spatial intricacies and nuanced temporal patterns inherent in activity
sequences.

Addressing these challenges, we advocate a hybrid architecture that
combines CNNs' spatial feature extraction with ConvLSTMs' temporal
dependency modeling. CNNs excel at extracting spatial features from raw sensor
data, forming a robust basis for understanding activities. ConvLSTMs, in turn,
specialize in modeling sequential dependencies, which helps capture
intricate temporal dynamics.

By combining these deep learning paradigms, our proposed framework not
only enhances human activity recognition accuracy but also supports
applications that require fine-grained, real-time activity understanding. It
offers a path towards comprehending and classifying diverse human activities,
opening the door to new applications and insights.

1.1 OBJECTIVE

• To create a web application that can accurately identify and
classify human activities based on input data.

1.2 SCOPE

• Fitness Tracking: HAR can be used to automatically recognize and track
various fitness activities, such as running, walking, cycling, or
weightlifting. This information can be used to monitor a person's daily
physical activity and provide insights into their fitness progress.
• Healthcare: In healthcare settings, HAR can be used to monitor patients'
movements and activities, helping in rehabilitation programs, elderly care,
and detecting anomalies in motion patterns that might indicate health
issues.

1.3 PROBLEM STATEMENT

The task of human activity recognition presents numerous challenges yet holds
immense potential for applications spanning healthcare monitoring, sports
analysis, and surveillance. While Convolutional Neural Networks (CNNs) excel at
extracting spatial features from raw data, they often struggle to inherently capture
the temporal relationships between frames in activity sequences. Conversely,
Long Short-Term Memory networks (LSTMs) possess the capability to model
temporal dependencies but may overlook crucial spatial contexts essential for
accurate activity recognition. To address these challenges and harness the
complementary strengths of both CNNs and Convolutional Long Short-Term
Memory networks (ConvLSTMs), this research aims to propose a hybrid
approach. By integrating CNNs' spatial feature extraction prowess with
ConvLSTMs' temporal modeling abilities, the proposed hybrid model seeks to
enhance the accuracy and robustness of human activity recognition systems across
various domains.

2. LITERATURE REVIEW

Human Activity Recognition (HAR) has emerged as a critical field of research
with significant applications in domains such as healthcare monitoring, sports
analysis, and surveillance systems. As the world becomes increasingly data-
driven, the ability to automatically identify, classify, and understand human
activities from sensor data has garnered substantial attention. In recent years, the
integration of Convolutional Neural Networks (CNNs) and Convolutional Long
Short-Term Memory networks (ConvLSTMs) has emerged as a promising
approach to enhance the accuracy and robustness of HAR models.
Historically, HAR methodologies encompassed a range of approaches, including
traditional machine learning algorithms and handcrafted feature engineering.
Techniques like Hidden Markov Models (HMMs) and Support Vector Machines
(SVMs) demonstrated success in capturing sequential patterns but struggled to
accommodate the complexities of real-world activities [1] [2]. The growing need
for models that can adapt to intricate patterns inherent in human actions led to the
exploration of deep learning techniques.
Deep learning, particularly CNNs, revolutionized numerous domains, including
image recognition and natural language processing. In the context of HAR,
researchers extended CNNs to process sensor data by transforming it into image-
like representations. These CNN-based models demonstrated remarkable
capabilities in capturing spatial patterns inherent in activities, although they often
fell short in accounting for temporal dependencies [3].
To address the temporal limitations of pure CNN models, ConvLSTMs were
introduced. ConvLSTMs, an extension of traditional LSTMs, fuse the strengths of
CNNs with the sequential modeling capabilities of LSTMs. They allow for the
seamless incorporation of spatial and temporal features, effectively capturing both
fine-grained spatial information and complex temporal dynamics [4] [5]. The
application of ConvLSTMs in HAR has showcased substantial improvements in
recognizing nuanced and context-dependent activities [6].

Evaluation of HAR models hinges on the availability of high-quality datasets.
Datasets like "HumanActivityNet" have become benchmarks for testing model
performance [7]. These datasets encompass a wide array of activities, allowing
researchers to comprehensively assess model accuracy and generalizability [8][9].
The diversity of these datasets ensures that models trained on them are well-
prepared to handle real-world scenarios.
Despite the potential of the CNN-ConvLSTM hybrid, challenges persist. The
intricate architecture requires meticulous hyperparameter tuning to prevent
overfitting. Addressing domain adaptation and real-time performance concerns
remains an ongoing endeavor, as models must exhibit adaptability across diverse
contexts and provide real-time insights [8].
In conclusion, the integration of CNNs and ConvLSTMs marks a pivotal
advancement in the field of HAR. This literature review highlights the
evolutionary journey from traditional techniques to the innovative integration of
deep learning architectures. The hybrid approach not only bolsters the accuracy of
activity recognition but also paves the way for real-time and context-aware
understanding of human actions in a myriad of applications.
Recent Advances in HAR

Recent studies have focused on refining temporal modeling for HAR. Time-aware
attention mechanisms [9] and Temporal Convolutional Networks (TCNs)
[10] have been proposed as novel approaches to enhance the temporal
understanding of human activities, providing a more nuanced perspective on
temporal dynamics.

Domain Adaptation and Real-World Deployment

In the realm of domain adaptation, recent works emphasize its importance in HAR
models, especially in real-world scenarios with varying sensor configurations and
data distributions [11].

Explainability in Deep Learning Models

The interpretability of CNN-ConvLSTM models has garnered attention, with
recent research exploring attention mechanisms [12] and saliency maps [13] to
shed light on decision-making processes.

Multi-Modal Approaches for Comprehensive Recognition

The integration of data from diverse sensors, such as inertial sensors, video
cameras, and social interactions, has shown promising results in improving the
robustness and accuracy of HAR models [14] [15].

Ethical Considerations and Bias in HAR

Ethical considerations and potential biases in HAR models are gaining
prominence. Recent literature underscores the importance of fairness, especially in
sensitive contexts such as healthcare or law enforcement [16].

Future Directions and Emerging Technologies

Looking ahead, the integration of edge computing for real-time processing
[17] and the collaboration of HAR with Internet of Things (IoT) devices
[18] represent emerging technologies shaping the future of HAR.

In conclusion, the integration of CNNs and ConvLSTMs stands as a pivotal
advancement in HAR. This literature review traces the evolutionary journey from
traditional techniques to the innovative integration of deep learning architectures,
highlighting not only the bolstered accuracy of activity recognition but also the
potential for real-time and context-aware understanding of human actions.

3. REQUIREMENT ANALYSIS

3.1 Functional Requirements

Fig 3.1: Use Case Diagram

• Upload the video
The user should be able to upload videos through a user-friendly
interface on the website.
• Recognize activity
The system should classify the human activity in the uploaded video and
predict the action.

3.2 Non-Functional Requirements

• Accuracy
The HAR system should achieve high accuracy in classifying human
activities, ensuring reliable and trustworthy results for end users. The
system should be able to process data and make predictions within a
specified time frame.

• Usability and maintainability
The user interface should be intuitive, easy to navigate, and accessible to
users. The system should be easy to maintain and update over time.

3.3 Feasibility Study

3.3.1 Technical Feasibility

Assess the availability of suitable datasets for training and testing the model. Look
for publicly available datasets, or consider collecting new data if necessary.
Evaluate the availability of computational resources, such as GPUs and cloud
computing, required for training and inference. Determine the
feasibility of implementing the ConvLSTM approach using existing deep learning
frameworks such as TensorFlow and Keras.

3.3.2 Financial Feasibility

Estimate the costs associated with data collection, preprocessing, model training,
and deployment. Consider expenses related to hardware, software, personnel, or
any potential licensing. Compare the projected costs with the available budget and
funding sources to ensure financial viability.

3.3.3 Operational Feasibility

Assess the feasibility of integrating the HAR system into existing workflows or
applications, such as healthcare monitoring systems or fitness trackers. Also
evaluate the ease of use and user acceptance of the HAR system by potential end
users.

4. SYSTEM DIAGRAM
4.1 System Flow diagram

Fig 4.2 System Flow Diagram

Our input comprises a video, which undergoes segmentation into multiple images.
These images are then forwarded to a CNN (Convolutional Neural Network) to
extract visual features. Subsequently, the extracted visual features are fed into an
LSTM (Long Short-Term Memory) network to generate predictions. The CNN's
role is to learn spatial information, while the LSTM specializes in learning
temporal patterns.

4.2 Sequence Diagram

Figure 4.3 Sequence Diagram

The diagram features an actor labeled 'User' and a lifeline representing the 'HAR'
(Human Activity Recognition). The interaction begins as the user initiates a call
message to upload the video into the system. This action is depicted by a thin
rectangle, symbolizing the activation bar. Upon receiving the video, the system
processes it and responds with a return message, indicating successful recognition.
Finally, the recognized video is displayed to the user.

4.3 Data Flow Diagram

Fig 4.4 Data flow diagram level 0

This diagram depicts an entity labeled 'user' and a process named 'human activity
recognition' (HAR). The arrows signify the flow of data. Initially, the user uploads
a video, which is then processed by the human activity recognition system to
recognize the activity. Subsequently, the identified activity is displayed to the
user.

5. METHODOLOGY
Several deep learning techniques can be used to build a Human Activity
Recognition (HAR) system, including Convolutional Neural Networks (CNNs),
Recurrent Neural Networks (RNNs), and hybrid architectures. Each technique
has its own working mechanism and level of accuracy. CNNs are good at
extracting spatial features from images, while RNNs are used for sequential
data such as time-series sensor readings. For HAR, a hybrid architecture,
commonly a combination of a CNN and an RNN, is expected to be more accurate.
Long Short-Term Memory (LSTM) is a type of RNN that can be used alongside a
CNN in HAR models to predict activities from sequential data.

LSTM (Long Short-Term Memory):

Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN)
architecture designed to address the vanishing gradient problem, which often
occurs when training traditional RNNs on long sequences of data. LSTM
networks are capable of learning and remembering long-term dependencies in
sequential data, making them well-suited for tasks such as time series prediction,
natural language processing, and speech recognition.

The LSTM’s key components are as follows:

1. Memory Cells:

The core component of LSTM is the memory cell, which maintains a
cell state vector that can store information over long periods. Unlike
traditional RNNs, which overwrite their hidden state at each time step, LSTM
networks have mechanisms to selectively add or remove information from
the memory cell, allowing them to retain important information while
discarding irrelevant details.

2. Gates:

LSTM networks employ three types of gates to control the flow of
information: the input gate, the forget gate, and the output gate. Each gate
is implemented using a sigmoid activation function, producing values
between 0 and 1 that determine how much information should be let
through.

3. Forget Gate:

The forget gate decides which information from the previous cell state
should be discarded. It takes as input the concatenation of the current input
and the previous hidden state and produces a forget vector. The forget
vector is then multiplied element-wise with the previous cell state,
effectively "forgetting" irrelevant information.

4. Input Gate:

The input gate determines which new information should be stored in the
memory cell. It takes as input the concatenation of the current input and
the previous hidden state and produces an input vector. A candidate cell
state is computed from the same concatenated input using a tanh activation
function, and the input vector is multiplied element-wise with this
candidate. This produces the new values to be added to the cell state.

5. Update Cell State:

The forget gate and input gate outputs are combined to update the cell
state. The forget gate output is used to scale the previous cell state to
forget irrelevant information, and the input gate output is used to add new
information to the cell state. The resulting updated cell state serves as the
memory for the current time step.

6. Output Gate:

The output gate determines which information from the current cell state
should be exposed as the next hidden state. It takes as input the
concatenation of the current input and the previous hidden state and
produces an output vector. The output vector is then multiplied
element-wise with the tanh of the updated cell state, and the resulting
value is the current hidden state.

By selectively updating and passing information through its gates, LSTM
networks can effectively learn and remember long-term dependencies in
sequential data, making them powerful tools for a wide range of tasks
requiring sequence modeling and prediction.
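
The gate descriptions above correspond to the standard LSTM update equations, written here as a compact reference. The symbols are notation introduced for this summary and do not appear in the report's figures: x_t is the current input, h_{t-1} the previous hidden state, c_{t-1} the previous cell state, W and b are learned weights and biases, \sigma is the sigmoid function, and \odot is element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

Because f_t and i_t lie between 0 and 1, the cell state update is a gated blend of the old memory and the new candidate values.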

Fig 5.1: Representation of an LSTM cell

Fig 5.1.1: Train Prediction Workflow

Train Prediction Workflow

Data Sets: Inputs raw training videos.

Pre-processing: Splits videos into frames and resizes them for uniformity.

Train/Test Split: Divides the processed data into training and testing sets. The
training set is used to train the model, while the testing set is used to evaluate its
performance.

Build Model: Defines the architecture of the model, including the layers and their
parameters.

Train Model: Trains the model on the training data. The model learns to identify
patterns and relationships within the data.

Evaluate: Assesses the model's performance on the testing data using metrics like
accuracy, precision, recall, or F1-score.

Desired Accuracy Met?: Checks if the achieved accuracy meets the predefined
threshold.

Yes: Training is complete. The model can be used for predictions on new data.

No: If the desired accuracy is not met, the model might need further training or
adjustments. This could involve:

• Tuning hyperparameters of the model.

• Going back to the pre-processing stage to improve data quality.
• Modifying the model architecture.

5.1 Download and visualize the data with its labels.

First, we install the required libraries, such as pafy, youtube-dl, and moviepy,
which help us download videos from YouTube. We also use OpenCV, which
provides a wide range of functionality for processing and analyzing images
and videos, and TensorFlow for building and training machine learning
models, including deep learning models. We mostly use the Keras API running
on top of TensorFlow, which focuses on enabling fast experimentation and
prototyping of deep learning models.

In the first step, we download and visualize the data along with its labels to get
an idea of what we will be dealing with. We use the UCF50 Action
Recognition Dataset, which consists of realistic videos taken from YouTube. This
differentiates it from most other available action recognition datasets,
which are not realistic and are staged by actors. The dataset contains:

• 50 Action Categories
• 25 Groups of Videos per Action Category
• 133 Average Videos per Action Category
• 199 Average Number of Frames per Video
• 320 Average Frame Width per Video
• 240 Average Frame Height per Video
• 26 Average Frames per Second per Video

5.2 Pre-process the dataset

Next, we preprocess the dataset. We read the video files and resize each
frame to a fixed width and height (64 x 64) to reduce computation. We then
normalize the pixel values to the range [0, 1] by dividing them by 255,
which makes convergence faster while training the network.

We create a function that returns a list containing the resized and
normalized frames of a video whose path is passed to it as an argument. The
function reads the video file frame by frame, but not all frames are added
to the list, as we only need an evenly distributed sequence of frames of a
fixed length.

We then create a function that iterates through all the classes specified in
the class list (CLASSES_LIST), calls the frame-extraction function on every
video file of the selected classes, and returns the frames (features), class
indices (labels), and video file paths (video_files_paths).
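
The frame sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's exact code: the function names, the sequence length of 20, and the choice to stop on the first unreadable frame are all hypothetical. The index helper is plain Python, and OpenCV is imported only inside the function that actually decodes the video.

```python
def select_frame_indices(total_frames, sequence_length):
    """Return evenly spaced frame indices covering the video."""
    # Gap between sampled frames; at least 1 so short videos still advance.
    skip_window = max(total_frames // sequence_length, 1)
    return [i * skip_window for i in range(sequence_length)]


def frames_extraction(video_path, sequence_length=20, height=64, width=64):
    """Read a video and return a list of resized, normalized frames.

    sequence_length=20 is an assumed value; the 64 x 64 size and the
    division by 255 follow the preprocessing described in the text.
    """
    import cv2  # OpenCV; imported here so the index helper works without it

    frames = []
    reader = cv2.VideoCapture(video_path)
    total_frames = int(reader.get(cv2.CAP_PROP_FRAME_COUNT))
    for index in select_frame_indices(total_frames, sequence_length):
        # Jump to the selected frame and decode it.
        reader.set(cv2.CAP_PROP_POS_FRAMES, index)
        success, frame = reader.read()
        if not success:
            break  # video is shorter than expected or unreadable
        resized = cv2.resize(frame, (width, height))
        frames.append(resized / 255.0)  # scale pixel values to [0, 1]
    reader.release()
    return frames
```

For a video with 199 frames (the dataset average) and 20 samples, the helper picks every ninth frame. A caller would typically keep a video only if the returned list has exactly sequence_length frames, so that every sample has the same shape.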

5.3 Split the data into train and test set

At this point, we have the required features and one_hot_encoded_labels. We
split the data into training and testing sets, shuffling the dataset before
the split to avoid any bias and to obtain splits that represent the overall
distribution of the data.
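
A shuffled split of this kind can be sketched with the standard library alone. The helper names, the 25% test fraction, and the seed are assumptions made for illustration; the report does not state the split ratio it used.

```python
import random


def one_hot(label, num_classes):
    """Encode an integer class index as a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def shuffled_split(features, labels, test_fraction=0.25, seed=27):
    """Shuffle paired samples, then split them into train and test sets.

    test_fraction and seed are assumed values chosen for this sketch.
    """
    pairs = list(zip(features, labels))
    random.Random(seed).shuffle(pairs)  # seeded so the split is repeatable
    n_test = int(round(len(pairs) * test_fraction))
    test, train = pairs[:n_test], pairs[n_test:]
    x_train = [p[0] for p in train]
    y_train = [p[1] for p in train]
    x_test = [p[0] for p in test]
    y_test = [p[1] for p in test]
    return x_train, x_test, y_train, y_test
```

Shuffling the feature and label pairs together keeps each video attached to its own label. In practice the same result is usually obtained with scikit-learn's train_test_split with shuffle=True, which also accepts NumPy arrays directly.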

5.4 Implement the ConvLSTM approach

In this step, we implement the approach using a stack of ConvLSTM cells. A
ConvLSTM cell is a variant of an LSTM network that contains convolution
operations in the network. It is an LSTM with convolution embedded in the
architecture, which makes it capable of identifying spatial features of the
data while taking the temporal relation into account. For video
classification, this approach captures the spatial relation within
individual frames and the temporal relation across frames. As a result of
this convolutional structure, the ConvLSTM can take in 3-dimensional input
(width, height, num_of_channels) at each time step.

Step 5.4.1: Construct the Model

To construct the model, we use Keras ConvLSTM2D recurrent layers. The ConvLSTM2D
layer takes the number of filters and the kernel size required for the
convolution operations. The output of the last layer is flattened and fed to a
Dense layer with softmax activation, which outputs the probability of each action
category. We also use MaxPooling3D layers to reduce the dimensions of the frames
and avoid unnecessary computation, and Dropout layers to prevent the model from
overfitting the data.

The create_convlstm_model function constructs the ConvLSTM model. It begins by
initializing a Sequential model, which allows layers to be added one after
another. The architecture includes multiple ConvLSTM2D layers, which apply
convolutions over the spatial dimensions inside an LSTM recurrence over the
temporal dimension. Each ConvLSTM2D layer is followed by a MaxPooling3D layer for
spatial pooling and a TimeDistributed layer with Dropout for regularization. After
these layers, a Flatten layer flattens the output into a 1D array. Finally, a
Dense layer with softmax activation classifies the input into the classes
specified in CLASSES_LIST. The function displays the model's summary, showing the
architecture and the number of parameters of each layer, and then returns the
constructed ConvLSTM model. This architecture is suited to learning spatiotemporal
features from video data, which is essential for tasks that require analyzing
video sequences.
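
To make the effect of each layer on the tensor shape concrete, the sketch below traces output shapes and parameter counts through such a stack without building a Keras model. The input size (20 frames of 64x64 RGB), the filter counts (4, 8, 14, 16), 'valid' padding in ConvLSTM2D, and 'same' padding in MaxPooling3D are all assumptions chosen for illustration; only the (3,3) kernel and the (1,2,2) pool size come from the report.

```python
import math
from typing import List, Tuple

Shape = Tuple[int, int, int, int]  # (time_steps, height, width, channels)


def conv_lstm2d(shape: Shape, filters: int, kernel: int = 3) -> Tuple[Shape, int]:
    """Output shape and trainable parameters of a ConvLSTM2D layer.

    Assumes 'valid' padding, stride 1, and return_sequences=True.
    """
    t, h, w, c = shape
    out_shape = (t, h - kernel + 1, w - kernel + 1, filters)
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    params = 4 * filters * (kernel * kernel * (c + filters) + 1)
    return out_shape, params


def max_pooling3d(shape: Shape, pool: Tuple[int, int, int] = (1, 2, 2)) -> Shape:
    """Output shape of MaxPooling3D with 'same' padding (assumed)."""
    t, h, w, c = shape
    return (math.ceil(t / pool[0]), math.ceil(h / pool[1]), math.ceil(w / pool[2]), c)


def trace_model(input_shape: Shape, filter_counts: List[int], num_classes: int):
    """Return per-layer (name, output shape, parameter count) rows for the stack."""
    rows, shape = [], input_shape
    for filters in filter_counts:
        shape, params = conv_lstm2d(shape, filters)
        rows.append(("ConvLSTM2D", shape, params))
        shape = max_pooling3d(shape)
        rows.append(("MaxPooling3D", shape, 0))
        rows.append(("TimeDistributed(Dropout)", shape, 0))  # shape unchanged
    flat = shape[0] * shape[1] * shape[2] * shape[3]
    rows.append(("Flatten", (flat,), 0))
    rows.append(("Dense(softmax)", (num_classes,), (flat + 1) * num_classes))
    return rows


# Hypothetical configuration: 20 frames of 64x64 RGB, four classes.
for name, shape, params in trace_model((20, 64, 64, 3), [4, 8, 14, 16], 4):
    print(f"{name:26s} {str(shape):20s} {params}")
```

With a pool size of (1,2,2) the time dimension is never reduced, so every frame still contributes to the flattened vector that reaches the Dense layer.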

Step 5.4.2: Compile & Train the Model

Next, we compile the model and start training. The loss function is categorical
crossentropy, which suits a multiclass classification task, and the optimizer is
Adam. The model is trained for at most 25 epochs. The training data is shuffled
before each epoch so that the model does not learn the order of the samples, and
20% of the training data is held out for validation. To prevent overfitting, we
add an early stopping callback that monitors the validation loss and stops
training if it does not improve for 10 consecutive epochs.
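
The early stopping rule can be illustrated with a small stand-alone simulation. This is a sketch of the stopping logic only, not the Keras callback itself, and the function name and loss values are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def simulate_early_stopping(
    val_losses: Sequence[float], patience: int = 10
) -> Tuple[Optional[int], int]:
    """Return (epoch at which training stops, or None; epoch with the best loss).

    Epochs are numbered from 1. Training stops once the validation loss has
    failed to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    return None, best_epoch  # every epoch ran without triggering the rule


# Hypothetical validation losses: improvement stops after epoch 3.
losses = [1.0, 0.8, 0.7] + [0.75] * 12
print(simulate_early_stopping(losses, patience=10))
```

In Keras, the corresponding behavior is provided by the EarlyStopping callback with monitor='val_loss' and patience=10.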

Step 5.4.3: Evaluating the trained Model

Here we extract the loss and accuracy values from model_evaluation_history, which
holds the results of evaluating the model on the separate test set. We define the
format of the date and time string (date_time_format), obtain the current date and
time (current_date_time_dt), and format it as a string according to that format.
We then define a descriptive name for the saved model file that incorporates the
date and time of the model's creation as well as the evaluation loss and accuracy.
Finally, the trained ConvLSTM model is saved to disk under that file name.
Including the timestamp and evaluation metrics in the name makes each model file
uniquely identifiable and records its performance.
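
A possible way to build such a file name is sketched below. The timestamp layout, the convlstm_model prefix, and the .h5 extension are assumptions for illustration, and the loss and accuracy passed in are placeholder values rather than the project's results.

```python
from datetime import datetime

# Assumption: a sortable timestamp layout; the report does not give the exact format.
DATE_TIME_FORMAT = "%Y_%m_%d__%H_%M_%S"


def build_model_file_name(loss: float, accuracy: float, when: datetime) -> str:
    """Build a model file name that embeds the timestamp, loss, and accuracy."""
    timestamp = when.strftime(DATE_TIME_FORMAT)
    # The "convlstm_model" prefix and ".h5" extension are illustrative choices.
    return f"convlstm_model__{timestamp}__loss_{loss:.4f}__accuracy_{accuracy:.4f}.h5"


# Placeholder evaluation values, used only to show the shape of the resulting name.
print(build_model_file_name(0.3125, 0.9, datetime(2024, 3, 1, 14, 30, 0)))
```

Passing the timestamp in as an argument, instead of calling datetime.now() inside the function, keeps the naming logic easy to test.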

The confusion matrix heatmap visualizes the performance of the model in
classifying samples. It compares the actual labels of the data (rows) with the
labels predicted by the model (columns). The color intensity represents the number
of samples in each cell; darker squares indicate more samples. Ideally, most
samples fall on the diagonal, signifying correct predictions, while off-diagonal
elements represent errors where the model predicted the wrong class. Precision is
the proportion of predicted positives that were actually correct, and recall is
the proportion of actual positives that were correctly identified. The F1-score is
the harmonic mean of precision and recall, combining both metrics. Support
indicates the total number of data points in each class.

Step 5.4.4: Plot Model’s Loss & Accuracy Curves

Here we plot the training loss (loss) and validation loss (val_loss) over
successive epochs of the ConvLSTM model training. This visualization lets us assess
how well the model is learning from the training data and whether it is overfitting
or underfitting. The plot includes two lines, one for the training loss (in blue)
and one for the validation loss (in red).

A second plot illustrates the training accuracy (accuracy) and validation accuracy
(val_accuracy) of the ConvLSTM model across successive epochs. This visualization
likewise lets us evaluate the model's learning progress and generalization
performance. The plot features two lines, one for the training accuracy (in blue)
and one for the validation accuracy (in red).

Figure 5.4.4.1: Loss vs Validation loss (x-axis: Epochs; y-axis: Loss).

Figure 5.4.4.2: Accuracy vs Validation accuracy (x-axis: Epochs; y-axis: Accuracy).

6. RESULT AND ANALYSIS
6.1 Model Summary

Fig 5.5: Model Summary

6.2 Model Architecture
In our model we have used five types of layers: ConvLSTM2D, MaxPooling3D,
TimeDistributed, Flatten, and Dense. The kernel size is (3,3), and the pool size
for max pooling is (1,2,2). The ConvLSTM2D layers use the tanh activation function
with a dropout rate of 0.2, and the Dense layer uses the softmax activation
function.

• ConvLSTM2D layers
The ConvLSTM2D layers are responsible for extracting spatiotemporal
features from the input data. These layers use convolutional LSTM units,
which combine convolutional and LSTM operations. Their parameters are
the number of filters, kernel size, activation function, and recurrent
dropout.
• MaxPooling3D layers
The MaxPooling3D layers downsample the spatial dimensions of the
feature maps while preserving the temporal dimension. This helps reduce
computational complexity and extract the most relevant features.
• TimeDistributed layer
The TimeDistributed layer applies dropout regularization independently to
each time step in the input sequence, helping prevent overfitting.
• Flatten layer
The Flatten layer flattens the output of the previous layers into a
one-dimensional vector, preparing it for the fully connected layer.
• Dense layer
The Dense layer has softmax activation and serves as the output layer for
multiclass classification. The number of units in this layer corresponds to
the number of classes in the dataset.

6.3 Confusion Matrix

Fig 5.5.1: Confusion Matrix Heatmap

Confusion Matrix Heatmap

This heatmap visualizes the performance of a machine learning model in
classifying samples. It compares the actual labels of the data (rows) with the labels
predicted by the model (columns).

Color intensity: Represents the number of samples in each category. Darker
squares indicate more samples.

Diagonal: Ideally, most samples should fall here, signifying correct predictions.

Off-diagonal elements: Represent errors where the model predicted the wrong
class.

Specific values:

36: Correctly classified "JumpRope" samples.

1: "JumpRope" samples mistakenly classified as "HorseRace".

35: Correctly classified "HorseRace" samples.

And so on...

Overall, the heatmap helps identify strengths and weaknesses of the model in
classifying different categories.
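
The way such a matrix is assembled from actual and predicted labels can be sketched as follows. The label lists are toy data made up for illustration, not the project's test set.

```python
from typing import List, Sequence


def confusion_matrix(
    actual: Sequence[str], predicted: Sequence[str], classes: Sequence[str]
) -> List[List[int]]:
    """matrix[i][j] counts samples whose actual class is i and predicted class is j."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Toy labels (hypothetical, not the project's test set).
classes = ["JumpRope", "HorseRace"]
actual = ["JumpRope", "JumpRope", "JumpRope", "HorseRace", "HorseRace"]
predicted = ["JumpRope", "JumpRope", "HorseRace", "HorseRace", "HorseRace"]

matrix = confusion_matrix(actual, predicted, classes)
correct = sum(matrix[i][i] for i in range(len(classes)))  # diagonal = correct predictions
print(matrix, correct)
```

Rows correspond to actual classes and columns to predicted classes, matching the orientation of the heatmap, so the off-diagonal entry in the first row counts "JumpRope" samples predicted as "HorseRace".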

6.4 Classification Report

Fig 5.5.2: Classification Report

The table summarizes the performance of a machine learning model on a
classification task. It shows precision, recall, F1-score, and support for each class.

Precision: Proportion of predicted positives that were actually correct.

Recall: Proportion of actual positives that were correctly identified.

F1-score: Harmonic mean of precision and recall, combining both metrics.

Support: Total number of data points in each class.

Specific values in the table:

JumpRope: Precision of 1.00, recall of 0.97, F1-score of 0.99, and support of 37.

HorseRace: Precision of 0.85, recall of 0.97, F1-score of 0.91, and support of 36.

JavelinThrow: Precision of 0.75, recall of 0.72, F1-score of 0.73, and support of 25.

TennisSwing: Precision of 0.97, recall of 0.90, F1-score of 0.94, and support of 42.

Overall, the classification report provides insights into the model's effectiveness at
identifying different classes.
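
The definitions above can be checked with a short calculation. The true-positive counts for JumpRope (36) and HorseRace (35) come from the confusion matrix; the remaining counts are assumptions inferred from the reported precision, recall, and support values, not read from the figure.

```python
from typing import Dict, Tuple


def class_metrics(tp: int, fp: int, fn: int) -> Tuple[float, float, float, int]:
    """Return (precision, recall, F1-score, support) for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of actual samples of the class
    return precision, recall, f1, support


# (true positives, false positives, false negatives) per class.
# Assumption: counts inferred from the reported metrics and supports.
counts: Dict[str, Tuple[int, int, int]] = {
    "JumpRope": (36, 0, 1),
    "HorseRace": (35, 6, 1),
    "JavelinThrow": (18, 6, 7),
    "TennisSwing": (38, 1, 4),
}

for name, (tp, fp, fn) in counts.items():
    p, r, f1, support = class_metrics(tp, fp, fn)
    print(f"{name:13s} precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
```

Under these inferred counts, the rounded values agree with the figures listed in the report for all four classes.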

7. CONCLUSION
In this project, we explored the application of Convolutional Long Short-Term
Memory (ConvLSTM) networks to human activity recognition (HAR), combining
ConvLSTM2D, MaxPooling3D, TimeDistributed, Flatten, and Dense layers. Leveraging
the spatiotemporal features learned by ConvLSTM architectures, we aimed to
accurately classify activities such as jump rope, horse race, tennis swing, and
javelin throw. The project has reached a point where all of its initial objectives
have been met, thanks to all of the changes, new learning, and difficult decisions.
Our system takes a video as input and classifies the activity as one of four
types.
