0% found this document useful (0 votes)
60 views36 pages

Human Activity Recognition System Report

This final year project report from Tribhuvan University's Institute of Engineering focuses on human activity recognition using a hybrid approach that integrates Convolutional Long Short-Term Memory networks (ConvLSTMs) with Convolutional Neural Networks (CNNs). The study aims to enhance the accuracy of recognizing a wide range of human activities, which has significant applications in healthcare, sports analysis, and surveillance. The report includes acknowledgments, an abstract, literature review, requirement analysis, and outlines the methodology for implementing the proposed system.

Uploaded by

naruto112h
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
60 views36 pages

Human Activity Recognition System Report

This final year project report from Tribhuvan University's Institute of Engineering focuses on human activity recognition using a hybrid approach that integrates Convolutional Long Short-Term Memory networks (ConvLSTMs) with Convolutional Neural Networks (CNNs). The study aims to enhance the accuracy of recognizing a wide range of human activities, which has significant applications in healthcare, sports analysis, and surveillance. The report includes acknowledgments, an abstract, literature review, requirement analysis, and outlines the methodology for implementing the proposed system.

Uploaded by

naruto112h
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 36

TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING

HIMALAYA COLLEGE OF ENGINEERING

[CODE: CT-455]
A
FINAL YEAR PROJECT REPORT
ON
HUMAN ACTIVITY RECONGNITION USING
CONV-LSTM
BY:
BALKRISHNA RAY (HCE076BCT007)

BIKRAM BHUSAL (HCE076BCT008)

BISWAS PANDIT (HCE076BCT010)

SAURABH KARKI (HCE076BCT036)

A PROJECT REPORT SUBMITTED TO DEPARTMENT


OF ELECTRONICS AND COMPUTER ENGINEERING
IN PARTIAL FULFILLMENT OF THE REQUIREMENT
FOR BACHELOR’S DEGREE IN COMPUTER
ENGINEERING

DEPRATMENT OF ELECTRONICS AND COMPUTER


ENGINEERING

LALITPUR, NEPAL
March 2024
HUMAN ACTIVITY RECONGNITION USING

CONV-LSTM

BY:
BALKRISHNA RAY (HCE076BCT007)

BIKRAM BHUSAL (HCE076BCT008)

BISWAS PANDIT (HCE076BCT010)

SAURABH KARKI (HCE076BCT036)

PROJECT SUPERVISOR

Er. HASINA SHAKYA

A report submitted in partial fulfillment of


the requirements for the degree of Bachelor
in Computer Engineering

Department of Electronics and Computer Engineering

HIMALAYA COLLEGE OF ENGINEERING

Tribhuvan University

Lalitpur, Nepal

March, 2024
ACKNOWLEDGEMENT
We express deep gratitude to the Institute of Engineering, Pulchowk, for
including major projects in the BCT IV/I syllabus, which has greatly enhanced our
academic journey and allowed us to apply practical knowledge. We are thankful
to the management of Himalaya College of Engineering (HCOE) for providing
us with this exceptional opportunity and assembling a team of experts to assist us
during our proposal defense.
We are very thankful to our respected Head of the Department, Er. Ashok GM,
and the Deputy Head of the Department, Er. Devendra Kathayat, for their
invaluable advice, unwavering support, and exceptional guidance throughout our
project. We are also grateful to our project supervisor, Er.Hasina Shakya, for her
unwavering commitment, motivation, and insightful contributions. We are
indebted to our friends and colleagues for their support and constructive feedback
in selecting our project topic. Their encouragement has fostered an environment of
growth and inspiration, fueling our passion for excellence. We humbly
acknowledge all those who have contributed to the realization of our ideas,
transforming them into tangible achievements. In conclusion, we extend our
heartfelt appreciation to everyone involved. We recognize the significant impact
of each individual and institution mentioned here, as their support and guidance
have paved the way for our success. We are eternally grateful for their
contributions.

GROUP MEMBERS:
Balkrishna Ray (HCE076BCT007)
Bikram Bhusal (HCE076BCT008)
Biswas Pandit (HCE076BCT010)
Saurabh Karki (HCE076BCT036)

i
ABSTRACT
This study introduces an innovative methodology for human activity
recognition by seamlessly integrating Convolutional Long Short-Term
Memory networks (ConvLSTMs). In the landscape of machine learning,
particularly within the realm of deep learning, this approach is designed to
tackle the intricate challenge of precisely identifying and categorizing a
diverse array of human activities. These activities encompass a wide
spectrum, ranging from fundamental actions such as walking and sitting, to
more intricate motions like dancing and cooking. The significance of this
endeavor reverberates across multifarious domains, including healthcare,
sports analysis, and surveillance, where accurate activity recognition holds
immense value. Conventional techniques often grapple with the complexity of
capturing both spatial intricacies and the nuanced temporal patterns inherent
within sequences of activities. To overcome these challenges, we advocate for
an innovative hybrid architecture that seamlessly amalgamates the strengths of
CNNs and ConvLSTMs. CNNs excel at extracting spatial features from raw
sensor data, creating a robust foundation for comprehending various activities.
On the other hand, ConvLSTMs specialize in modeling temporal
dependencies within sequential data, enabling the seamless comprehension of
intricate temporal dynamics embedded in human motions. By synergizing
these two powerful deep learning paradigms, our proposed framework not
only elevates the potential for accurate and holistic human activity recognition
but also contributes to the advancement of real-time activity understanding.

Keywords: Convolutional Long Short-Term Memory networks (ConvLSTMs),


Deep learning, Human activity recognition, Hybrid architecture

ii
TABLE OF CONTENTS

ACKNOWLEDGEMENT ........................................................................................ i

ABSTRACT.............................................................................................................ii

LIST OF FIGURES ................................................................................................. v

LIST OF ABBREVIATION ................................................................................... vi

1. INTRODUCTION ............................................................................................... 1

1.1 OBJECTIVE................................................................................................... 2

1.2 SCOPE ........................................................................................................... 2

1.3 PROBLEM STATEMENT ............................................................................ 2

2. LITERATURE REVIEW .................................................................................... 3

3. REQUIREMENT ANALYSIS ............................................................................ 6

3.1 Functional Requirements................................................................................ 6

3.2 Non-Functional Requirements ....................................................................... 6

3.3 Feasibility Study............................................................................................. 7

3.3.1 Technical Feasibility................................................................................ 7

3.3.2 Financial Feasibility ................................................................................ 7

3.3.3 Operational Feasibility ............................................................................ 7

4. SYSTEM DIAGRAM.......................................................................................... 8

4.1 System Flow diagram ..................................................................................... 8

4.2 Sequence Diagram.......................................................................................... 9

4.3 Data Flow Diagram ...................................................................................... 10

5. METHODOLOGY ............................................................................................ 11

5.1 Download and visualize the data with its labels. ......................................... 16

5.2 Pre-process the dataset ................................................................................. 16

iii
5.3 Split the data into train and test set .............................................................. 17

5.4 Implement the ConvLSTM approach ........................................................... 17

Step 5.4.1: Construct the Model ..................................................................... 17

Step 5.4.2: Compile & Train the Model ......................................................... 18

Step 5.4.3: Evaluating the trained Model ....................................................... 18

Step 5.4.4: Plot Model’s Loss & Accuracy Curves ........................................ 19

6. RESULT AND ANALYSIS .............................................................................. 21

6.2 Model Architecture ...................................................................................... 22

6.3 Confusion Matrix ......................................................................................... 23

6.4 Classification Report .................................................................................... 25

7. CONCLUSION .................................................................................................. 26

REFERENCES ...................................................................................................... 27

iv
LIST OF FIGURES
Fig 3.1: Use Case Diagram ................................................................................................ 6

Fig 4.2 System Flow Diagram ........................................................................................... 8

Figure 4.3 Sequence Diagram ............................................................................................ 9

Fig 4.4 Data flow diagram level 0 ................................................................................... 10

Fig 5.1: Representation of an LSTM cell ......................................................................... 13

Fig 5.1.1 Train Prediction Workflow ............................................................................... 14

Figure 5.4.4.1: Loss vs Validation ................................................................................... 20

Figure 5.4.4.2: Accuracy vs Validation ........................................................................... 20

Fig 5.5: Model Summary ................................................................................................. 21

Fig 5.5.1: Confusion Matrix Heatmap .............................................................................. 23

Fig 5.5.2: Classification Report ........................................................................................ 25

v
LIST OF ABBREVIATION
HAR: Human Activity Recognition
CNNs: Convolutional Neural Networks
ConvLSTMs: Convolutional Long Short-Term Memory networks
HMMs: Hidden Markov Models
SVMs: Support Vector Machines

vi
1. INTRODUCTION
Human Activity Recognition (HAR) stands as a pivotal research field, finding
applications in healthcare monitoring, sports analysis, and surveillance systems. In
an increasingly data-driven era, the automatic identification, classification, and
comprehension of human activities from sensor data have captured significant
interest. Integrating Convolutional Neural Networks (CNNs) and Long Short-
Term Memory networks (LSTMs) offers a promising avenue to enhance HAR
model accuracy.

Activities span a wide spectrum, from basic motions like walking and sitting to
complex gestures like dancing and cooking. Accurate activity recognition holds
substantial value across domains. Traditional methods often grapple with
capturing spatial intricacies and nuanced temporal patterns inherent in activity
sequences.

Addressing these challenges, we advocate an innovative hybrid architecture that


synergizes CNNs' spatial feature extraction with ConvLSTMs' temporal
dependency modeling. CNNs excel in extracting spatial features from raw sensor
data, forming a robust basis for understanding activities. Conversely, ConvLSTMs
specialize in modeling sequential dependencies, facilitating the comprehension of
intricate temporal dynamics.

By amalgamating these deep learning paradigms, our proposed framework not


only enhances human activity recognition accuracy. This study delves into the art
of deciphering human actions, augmenting potential for applications requiring
fine-grained and real-time activity understanding. This amalgamation illuminates
a path towards comprehending and classifying diverse human activities, opening
doors to a new dimension of applications and insights.

1
1.1 OBJECTIVE

• To create a web application which can accurately identify and


classify human activities based on input data.

1.2 SCOPE

• Fitness Tracking: HAR can be used to automatically recognize and track


various fitness activities, such as running, walking, cycling, or
weightlifting. This information can be used to monitor a person's daily
physical activity and provide insights into their fitness progress.
• Healthcare: In healthcare settings, HAR can be used to monitor patients
movements and activities, helping in rehabilitation programs, elderly care,
and detecting anomalies in motion patterns that might indicate health
issues.

1.3 PROBLEM STATEMENT

The task of human activity recognition presents numerous challenges yet holds
immense potential for applications spanning healthcare monitoring, sports
analysis, and surveillance. While Convolutional Neural Networks (CNNs) excel at
extracting spatial features from raw data, they often struggle to inherently capture
the temporal relationships between frames in activity sequences. Conversely,
Long Short-Term Memory networks (LSTMs) possess the capability to model
temporal dependencies but may overlook crucial spatial contexts essential for
accurate activity recognition. To address these challenges and harness the
complementary strengths of both CNNs and Convolutional Long Short-Term
Memory networks (ConvLSTMs), this research aims to propose a hybrid
approach. By integrating CNNs' spatial feature extraction prowess with
ConvLSTMs' temporal modeling abilities, the proposed hybrid model seeks to
enhance the accuracy and robustness of human activity recognition systems across
various domain.

2
2. LITERATURE REVIEW

Human Activity Recognition (HAR) has emerged as a critical field of research


with significant applications in domains such as healthcare monitoring, sports
analysis, and surveillance systems. As the world becomes increasingly data-
driven, the ability to automatically identify, classify, and understand human
activities from sensor data has garnered substantial attention. In recent years, the
integration of Convolutional Neural Networks (CNNs) and Convolutional Long
Short-Term Memory networks (ConvLSTMs) has emerged as a promising
approach to enhance the accuracy and robustness of HAR models.
Historically, HAR methodologies encompassed a range of approaches, including
traditional machine learning algorithms and handcrafted feature engineering.
Techniques like Hidden Markov Models (HMMs) and Support Vector Machines
(SVMs) demonstrated success in capturing sequential patterns but struggled to
accommodate the complexities of real-world activities [1] [2]. The growing need
for models that can adapt to intricate patterns inherent in human actions led to the
exploration of deep learning techniques.
Deep learning, particularly CNNs, revolutionized numerous domains, including
image recognition and natural language processing. In the context of HAR,
researchers extended CNNs to process sensor data by transforming it into image-
like representations. These CNN-based models demonstrated remarkable
capabilities in capturing spatial patterns inherent in activities, although they often
fell short in accounting for temporal dependencies [3].
To address the temporal limitations of pure CNN models, ConvLSTMs were
introduced. ConvLSTMs, an extension of traditional LSTMs, fuse the strengths of
CNNs with the sequential modeling capabilities of LSTMs. They allow for the
seamless incorporation of spatial and temporal features, effectively capturing both
fine-grained spatial information and complex temporal dynamics [4] [5]. The
application of ConvLSTMs in HAR has showcased substantial improvements in
recognizing nuanced and context-dependent activities [6].

3
Evaluation of HAR models hinges on the availability of high-quality datasets.
Datasets like "HumanActivityNet" have become benchmarks for testing model
performance [7]. These datasets encompass a wide array of activities, allowing
researchers to comprehensively assess model accuracy and generalizability [8][9].
The diversity of these datasets ensures that models trained on them are well-
prepared to handle real-world scenarios.
Despite the potential of the CNN-ConvLSTM hybrid, challenges persist. The
intricate architecture requires meticulous hyperparameter tuning to prevent
overfitting. Addressing domain adaptation and real-time performance concerns
remains an ongoing endeavor, as models must exhibit adaptability across diverse
contexts and provide real-time insights [8].
In conclusion, the integration of CNNs and ConvLSTMs marks a pivotal
advancement in the field of HAR. This literature review highlights the
evolutionary journey from traditional techniques to the innovative integration of
deep learning architectures. The hybrid approach not only bolsters the accuracy of
activity recognition but also paves the way for real-time and context-aware
understanding of human actions in a myriad of applications.
Recent Advances in HAR

Recent studies have focused on refining temporal modeling for HAR. Time-aware
attention mechanisms [9] and Temporal Convolutional Networks (TCNs)
[10]have been proposed as novel approaches to enhance the temporal
understanding of human activities, providing a more nuanced perspective on
temporal dynamics.

Domain Adaptation and Real-World Deployment

In the realm of domain adaptation, recent works emphasize its importance in HAR
models, especially in real-world scenarios with varying sensor configurations and
data distributions [11].

Explainability in Deep Learning Models

4
The interpretability of CNN-ConvLSTM models has garnered attention, with
recent research exploring attention mechanisms [12] and saliency maps [13] to
shed light on decision-making processes.

Multi-Modal Approaches for Comprehensive Recognition

The integration of data from diverse sensors, such as inertial sensors, video
cameras, and social interactions, has shown promising results in improving the
robustness and accuracy of HAR models [14] [15].

Ethical Considerations and Bias in HAR

Ethical considerations and potential biases in HAR models are gaining


prominence. Recent literature underscores the importance of fairness, especially in
sensitive contexts such as healthcare or law enforcement [16].

Future Directions and Emerging Technologies

Looking ahead, the integration of edge computing for real-time processing


[17]and the collaboration of HAR with Internet of Things (IoT) devices
[18]represent emerging technologies shaping the future of HAR.

In conclusion, the integration of CNNs and ConvLSTMs stands as a pivotal


advancement in HAR. This literature review traces the evolutionary journey from
traditional techniques to the innovative integration of deep learning architectures,
highlighting not only the bolstered accuracy of activity recognition but also the
potential for real-time and context-aware understanding of human actions.

5
3. REQUIREMENT ANALYSIS

3.1 Functional Requirements

Fig 3.1: Use Case Diagram


• Upload the video
User should have the ability to upload videos through a user-friendly
interface on the website
• Recognize activity
The system should have the capability to classify human activity and
predict the action.

3.2 Non-Functional Requirements


• Accuracy
The HAR system should achieve high accuracy in classifying human
activities, ensuring reliable and trustworthy results for end users. The
system should be able to process data and make predictions within a
specified time frame.

6
• Usability and maintainability
The user interface should be intuitive, easy to navigate and accessible to
users easily. The system should be easy to maintain and update over time.

3.3 Feasibility Study

3.3.1 Technical Feasibility

Assess the availability of suitable datasets for training and testing the model. Look
for publicly available datasets or consider collecting your own data if necessary.
Evaluate the availability of computational resources such as GPUs, cloud
computing required for training and inference with models. Determine the
feasibility of implementing convLSTM approach using existing deep learning
frameworks such as TensorFlow, keras.

3.3.2 Financial Feasibility

Estimate the costs associated with data collection, preprocessing, model training,
and deployment. Consider expenses related to hardware, software, personnel, or
any potential licensing. Compare the projected costs with the available budget and
funding sources to ensure financial viability.

3.3.3 Operational Feasibility

Assess the feasibility of integrating the HAR system into existing workflows or
applications, such as healthcare monitoring systems or fitness trackers. Also
evaluate the ease of use and user acceptance of HAR system by potential end
users.

7
4. SYSTEM DIAGRAM
4.1 System Flow diagram

Fig 4.2 System Flow Diagram

Our input comprises a video, which undergoes segmentation into multiple images.
These images are then forwarded to a CNN (Convolutional Neural Network) to
extract visual features. Subsequently, the extracted visual features are fed into an
LSTM (Long Short-Term Memory) network to generate predictions. The CNN's
role is to learn spatial information, while the LSTM specializes in learning
temporal patterns.

8
4.2 Sequence Diagram

Figure 4.3 Sequence Diagram

The diagram features an actor labeled 'User' and a lifeline representing the 'HAR'
(Human Activity Recognition). The interaction begins as the user initiates a call
message to upload the video into the system. This action is depicted by a thin
rectangle, symbolizing the activation bar. Upon receiving the video, the system
processes it and responds with a return message, indicating successful recognition.
Finally, the recognized video is displayed to the user.

9
4.3 Data Flow Diagram

Fig 4.4 Data flow diagram level 0

This diagram depicts an entity labeled 'user' and a process named 'human activity
recognition' (HAR). The arrows signify the flow of data. Initially, the user uploads
a video, which is then processed by the human activity recognition system to
recognize the activity. Subsequently, the identified activity is displayed to the
user.

10
5. METHODOLOGY
There are various techniques in deep learning techniques that can be used in order
to create a project of Human Activity Recognition (HAR). The various methods to
implement this project include Convolution Neural Network (CNN), Recurrent
Neural Network (RNN), Hybrid Architectures and many more. Every technique
has different levels of working mechanisms and accuracy based on their
individual capabilities. CNN has the capabilities of finding out the features of
spatial images. For sequential data like series sensor readings RNN is used. In
case of HAR, the hybrid form will be more accurate as the hybrid architectures are
commonly combinations of CNN and RNN. Long Short-Term Memory (LSTM)
is a type of RNN which can be used in HAR models alongside of CNN for activity
predictions for sequential data.

LSTM (Long Short-Term Memory):

Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN)


architecture designed to address the vanishing gradient problem, which often
occurs when training traditional RNNs on long sequences of data. LSTM
networks are capable of learning and remembering long-term dependencies in
sequential data, making them well-suited for tasks such as time series prediction,
natural language processing, and speech recognition.

The LSTM’s key components are as follows:

1. Memory Cells:

The core component of LSTM is the memory cell, which maintains a


hidden state vector that can store information over long periods. Unlike
traditional RNNs, which update their hidden state at each time step, LSTM
networks have mechanisms to selectively add or remove information from
the memory cell, allowing them to retain important information while
discarding irrelevant details.

11
2. Gates:

LSTM networks employ three types of gates to control the flow of


information: the input gate, the forget gate, and the output gate. Each gate
is implemented using a sigmoid activation function, producing values
between 0 and 1 that determine how much information should be let
through.

3. Forget Gate:

The forget gate decides which information from the previous cell state
should be discarded. It takes as input the concatenation of the current input
and the previous hidden state and produces a forget vector. The forget
vector is then multiplied element-wise with the previous cell state,
effectively "forgetting" irrelevant information.

4. Input Gate:

The input gate determines which new information should be stored in the
memory cell. It takes as input the concatenation of the current input and
the previous hidden state and produces an input vector. This input vector is
then combined with a candidate cell state (obtained from the current input)
through element-wise multiplication with a tanh activation function. This
produces new candidate values to be added to the cell state.

5. Update Cell State:

The forget gate and input gate outputs are combined to update the cell
state. The forget gate output is used to scale the previous cell state to
forget irrelevant information, and the input gate output is used to add new
information to the cell state. The resulting updated cell state serves as the
memory for the current time step.

6. Output Gate:

The output gate determines which information from the current cell state
should be exposed to the next hidden state. It takes as input the

12
concatenation of the current input and the previous hidden state and
produces an output vector. The output vector is then combined with the
updated cell state through element-wise multiplication with a tanh
activation function, and the resulting value is the current hidden state.

By selectively updating and passing information through its gates, LSTM


networks can effectively learn and remember long-term dependencies in
sequential data, making them powerful tools for a wide range of tasks
requiring sequence modeling and prediction.

Fig 5.1: Representation of an LSTM cell

13
Fig 5.1 Train Prediction Workflow

Train Prediction Workflow

Data Sets: Inputs raw training videos.

Pre-processing: Splits videos into frames and resizes them for uniformity.

Train/Test Split: Divides the processed data into training and testing sets. The
training set is used to train the model, while the testing set is used to evaluate its
performance.

Build Model: Defines the architecture of the model, including the layers and their
parameters.

Train Model: Trains the model on the training data. The model learns to identify
patterns and relationships within the data.

Evaluate: Assesses the model's performance on the testing data using metrics like
accuracy, precision, recall, or F1-score.

14
Desired Accuracy Met?: Checks if the achieved accuracy meets the predefined
threshold.

Yes: Training is complete. The model can be used for predictions on new data.

No: If the desired accuracy is not met, the model might need further training or
adjustments. This could involve:

• Tuning hyperparameters of the model.


• Going back to the pre-processing stage to improve data quality.
• Modifying the model architecture.

15
5.1 Download and visualize the data with its labels.

First we download the required libraries such as pafy, youtube-dl and moviepy
which will help us to download the videos from the youtube. We also use other
libraries such as openCV which provides a wide range of functionalities for
processing and analyzing images and videos. We also use tensorflow for building
and training various machine learning models, including deep learning models.
We mostly use Keras api running on top of tensorflow which focuses on enabling
fast experimentation and prototyping of deep learning models.

In the first step, we will download and visualize the data along with labels to get
an idea about what we will be dealing with. We will be using the UCF50 – Action
Recognition Dataset, consisting of realistic videos taken from youtube which
differentiates this data set from most of the other available action recognition data
sets as they are not realistic and are staged by actors. The Dataset contains

• 50 Action Categories
• 25 Groups of Videos per Action Category
• 133 Average Videos per Action Category
• 199 Average Number of Frames per Video
• 320 Average Frames Width per Video
• 240 Average Frames Height per Video
• 26 Average Frames Per Seconds per Video

5.2 Pre-process the dataset

Next, we will perform some preprocessing on the dataset. First, we will read the
video files from the dataset and resize the frames of the videos to a fixed width
and height (64X64), to reduce the computations and normalized the data to
range [0-1] by dividing the pixel values with 255, which makes convergence
faster while training the network.

We will create a function that will create a list containing the resized and
normalized frames of a video whose path is passed to it as an argument. The

16
function will read the video file frame by frame, although not all frames are added
to the list as we will only need an evenly distributed sequence length of frames.

Now we will create a function that will iterate through all the classes specified in
the class and will call the function on every video file of the selected classes and
return the frames (features), class index (labels), and video file path
(video_files_paths).

5.3 Split the data into train and test set

As of now, we have the required features and one_hot_encoded_labels . We split


our data to create training and testing sets. We also shuffled the dataset before the
split to avoid any bias and get splits representing the overall distribution of the
data.

5.4 Implement the ConvLSTM approach

In this step, we have implemented the first approach by using a combination


of ConvLSTM cells. A ConvLSTM cell is a variant of an LSTM network that
contains convolutions operations in the network. It is an LSTM with convolution
embedded in the architecture, which makes it capable of identifying spatial
features of the data while keeping into account the temporal relation. For video
classification, this approach effectively captures the spatial relation in the
individual frames and the temporal relation across the different frames. As a result
of this convolution structure, the ConvLSTM is capable of taking in 3-
dimensional input (width, height, num_of_channels).

Step 5.4.1: Construct the Model

To construct the model, we have used Keras ConvLSTM2D recurrent layers.


The ConvLSTM2D layer also takes in the number of filters and kernel size
required for applying the convolutional operations. The output of the layers will
be flattened in the end and have fed to the Dense layer with softmax activation
which outputs the probability of each action category.
We have also used MaxPooling3D layers to reduce the dimensions of the frames
and avoid unnecessary computations and Dropout layers to prevent overfitting the

17
model on the data. The create_convlstm_model function constructs a
Convolutional Long Short-Term Memory (ConvLSTM) model for tasks such as
video classification or action recognition. It begins by initializing a Sequential
model, allowing layers to be added sequentially. The model architecture includes
multiple ConvLSTM2D layers, which perform convolutional operations with
LSTM-like recurrence along both spatial and temporal dimensions. Each
ConvLSTM2D layer is followed by a MaxPooling3D layer for spatial pooling,
and a TimeDistributed layer with Dropout for regularization. After the
convolutional layers, the output is flattened into a 1D array using a Flatten layer.
Finally, a Dense layer with softmax activation is added to classify the input into
the classes specified in CLASSES_LIST. The model's summary is displayed,
showing the number of parameters and the architecture of each layer, before
returning the constructed ConvLSTM model. This architecture is effective for
learning spatiotemporal features from video data, essential for tasks requiring
understanding and analyzing video sequences.

Step 5.4.2: Compile & Train the Model

Next, we have added an early stopping callback to prevent over fitting. Training
will stop if the validation loss does not improve for 10 consecutive epochs. The
callback considers the validation loss to minimize. It stops training when the loss
stops decreasing. The loss function used is categorical crossentropy for multiclass
classification task. Adam optimizer is choosen for optimization. The number of
epochs for which the model will be trained will be 25. We shuffles the training
data before each epoch to prevent the model from learning sequence patterns. 20%
of the training data is held out for validation during training. Here early stopping
callback is applied during training to monitor validation loss and stop training if
necessary. and started the training after compiling the model.

Step 5.4.3: Evaluating the trained Model

Here we extract the loss and accuracy values from the model_evaluation_history,
which likely contains the results of evaluating the model on a separate testing

18
dataset. We define the format for the date and time string (date_time_format) and
obtain the current date and time (current_date_time_dt). It then formats the current
date and time as a string according to the specified format. we then define a useful
name for the saved model file, incorporating the date and time of the model's
creation, as well as the evaluation loss and accuracy. After that, the trained
ConvLSTM model is saved in disk using the defined model file name. By saving
the model with a descriptive name including timestamps and evaluation metrics,
we ensure that each model's file is uniquely identifiable and contains relevant
information about its performance.

The confusion matrix heatmap visualizes the performance of a machine learning


model in classifying samples. It compares the actual labels of the data (rows) with
the labels predicted by the model (columns). The color intensity represent the
number of samples in each category. Darker squares indicate more samples.
ideally most sample fall under diagonal signifying correct prediction. Off diagonal
elements represent errors where the model predicted the wrong class. The
Precision here is the Proportion of predicted positives that were actually correct
and Recall is Proportion of actual positives that were correctly identified. The F1-
score is the harmonic mean of precision and recall, combining both metrics.
Support indicated total number of data points in each class.

Step 5.4.4: Plot Model’s Loss & Accuracy Curves

Here we plot the training loss (loss) and validation loss (val_loss) over successive
epochs in the ConvLSTM model training. This visualization allows you to assess
how well the model is learning from the training data and whether it is overfitting
or underfitting. The plot will include two lines, one representing the training loss
(in blue) and the other representing the validation loss (in red).

Another plot illustrate the training accuracy (accuracy) and validation accuracy
(val_accuracy) of the ConvLSTM model across successive epochs. This
visualization also enables the evaluation of the model's learning progress and
generalization performance. The plot will feature two lines: one representing the

19
training accuracy (in blue) and the other representing the validation accuracy (in
red).

Loss

Epochs

Figure 5.4.4.1: Loss vs Validation loss.


Accuracy

Epochs
Figure 5.4.4.2: Accuracy vs Validation accuracy.

20
6. RESULT AND ANALYSIS
6.1 Model Summary

Fig 5.5: Model Summary

21
6.2 Model Architecture
In our model we have used 5 layers including convLSTM 2D, Max Pooling 3D
layer, Time Distributed Layer, flatten layer, dense layer. Here our kernel size is
(3,3). During max pooling our pool size is (1,2,2). We have used the activation
function as tanh in the convLSTM 2D layer with a dropout rate of 0.2 and
activation function softmax in the dense layer.

• ConvLSTM 2D layers
The convLSTM 2D layer is responsible for extracting spatiotemporal
features from the input data. These layers use convolutional LSTM units,
which combine convolutional and LSTM operations. The parameters here
are number of filters, kernel size, activation function, recurrent dropout.
• MaxPooling 3D layers
The max Pooling 3D layer downsample the spatial dimensions of the
feature maps while preserving the temporal dimension. This helps reduce
computational complexity and extract the most relevant features.
• Time Distribution layer
The time Distributed layer applies dropout regularization independently to
each time step in the input sequence, helping prevent overfitting.
• Flatten layer
The flatten layer flattens the output from the previous layers into a one
dimensional vector, prepraring it for the fully connected layers.
• Dense Layer
The dense layer has softmax activation which serves as the output layer for
multiclass classification. The number of units in this layer corresponds to
the number of classes in the dataset.

22
6.3 Confusion Matrix

Fig 5.5.1: Confusion Matrix Heatmap

23
Confusion Matrix Heatmap

This heatmap visualizes the performance of a machine learning model in


classifying samples. It compares the actual labels of the data (rows) with the labels
predicted by the model (columns).

Color intensity: Represents the number of samples in each category. Darker


squares indicate more samples.

Diagonal: Ideally, most samples should fall here, signifying correct predictions.

Off-diagonal elements: Represent errors where the model predicted the wrong
class.

Specific values:

36: Correctly classified "JumpRope" samples.

1: "JumpRope" samples mistakenly classified as "HorseRace".

35: Correctly classified "HorseRace" samples.

And so on...

Overall, the heatmap helps identify strengths and weaknesses of the model in
classifying different categories.

24
6.4 Classification Report

Fig 5.5.2: Classification Report

The table summarizes the performance of a machine learning model on a


classification task. It shows precision, recall, F1-score, and support for each class.

Precision: Proportion of predicted positives that were actually correct.

Recall: Proportion of actual positives that were correctly identified.

F1-score: Harmonic mean of precision and recall, combining both metrics.

Support: Total number of data points in each class.

Specific values in the table:

JumpRope: Precision of 1.00, recall of 0.97, F1-score of 0.99, and support of 37.

HorseRace: Precision of 0.85, recall of 0.97, F1-score of 0.91, and support of 36.

JavelinThrow: Precision of 0.75, recall of 0.72, F1-score of 0.73, and support of


25.

TennisSwing: Precision of 0.97, recall of 0.90, F1-score of 0.94, and support of


42.

Overall, the classification report provides insights into the model's effectiveness at
identifying different classes.

25
7. CONCLUSION
In this project, we explored the application of Convolutional Long Short-Term
Memory (ConvLSTM) networks and different layers such as convLSTM 2D layer
, maxPooling 3D layer, time distribution layer, flatten layer and dense layer for
human activity recognition (HAR). Leveraging the spatiotemporal features
inherent in ConvLSTM architectures, we aimed to accurately classify various
activities such as jump rope, horse race, tennis swing, javelin throw. Finally, the
project has reach a point where all of its initial objectives have been met, thanks to
all of the changes, new learning, and difficult decisions. Our system can take
video as input and classify the activity among 4 type of activities.

26
REFERENCES

[1] J. C. D. Smith, Hidden Markov Models for Human Activity Recognition,


Works press, 2009.

[2] W. H. Brown, "Support Vector Machines for Human Activity Classification,"


Journal of Artificial Intelligence Research, pp. 175-190, 2012.

[3] L. &. C. X. Zhang, "Human Activity Recognition using CNN-based


Features," IEEE International Conference on Computer Vision, pp. 122-130,
2015.

[4] J. H. &. P. S. H. Lee, "Deep Convolutional Networks for Human Activity


Recognition," IEEE Transactions on Pattern Analysis and Machine
Intelligence, pp. 2188-2199, 2016.

[5] L. W. Nguyen, "ConvLSTM-based Human Activity Recognition,"


Proceedings of the European Conference on Computer Vision, pp. 579-594.

[6] H. S. Y. Kim, "Combining CNN and ConvLSTM for Activity Recognition in


Video Sequences," in Transactions on Multimedia, Chicago, 2019.

[7] S. P. R. Johnson, " HumanActivityNet: A Comprehensive Dataset for Human


Activity Recognition," Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing, pp. 560-564, 2020.

[8] W. Y. Li, "A Survey of Human Activity Recognition Datasets," Human-


Machine Systems, pp. 112-123, 2021.

[9] K. L. Patel, "Temporal Attention Mechanisms for Improved Human Activity


Recognition," roceedings of the International Joint Conference on Artificial
Intelligence, pp. 450-465, 2022.

[10] M. N. Wang, "Enhancing Temporal Dynamics in Human Activity


Recognition through Temporal Convolutional Networks," Journal of
Machine Learning Research, pp. 789-804, 2013.

[11] Q.R. Zhao, "Domain Adaptation Techniques for Robust Human Activity
Recognition in Real-world Environments," ACM Transactions on Intelligent
Systems and Technology, vol. 5, no. 2, pp. 265-280, 2022.

[12] S. G. Zhang, "Attention Mechanisms in CNN-ConvLSTM Models:


Interpreting Human Activity Recognition Decisions," Neural Information
Processing Systems, vol. 9, pp. 901-913, 2018.

27
[13] T. I. Kim, "Saliency Maps in CNN-ConvLSTM Architectures for Explainable
Human Activity Recognition," IEEE Transactions on Neural Networks and
Learning Systems,, pp. 13-18, 2019.

[14] R. K. Park, "Integrating Inertial Sensors and Video Data for Enhanced
Human Activity Recognition," International Journal of Computer Vision, pp.
345-355, 2022.

[15] "Comprehensive Human Activity Recognition using Social Interaction Data,"


Proceedings of the AAAI Conference on Artificial Intelligence, pp. 789-802,
2013.

[16] J. N. Choi, "Ensuring Fairness in Human Activity Recognition: Ethical


Considerations and Mitigation Strategies," Ethics and Information
Technology, pp. 112-128, 2021.

[17] X. P. Gao, "Edge Computing for Real-time Processing in Human Activity


Recognition Systems," IEEE Transactions on Mobile Computing, pp. 567-
580, 2013.

[18] Y. R. Xu, "Collaborative Integration of HAR with IoT Devices: Shaping the
Future Landscape," Proceedings of the International Conference on Internet
of Things, pp. 201-213, 2021.

28

You might also like