Human Activity Recognition System Report
INSTITUTE OF ENGINEERING
[CODE: CT-455]
A
FINAL YEAR PROJECT REPORT
ON
HUMAN ACTIVITY RECOGNITION USING
CONV-LSTM
BY:
BALKRISHNA RAY (HCE076BCT007)
LALITPUR, NEPAL
March 2024
HUMAN ACTIVITY RECOGNITION USING
CONV-LSTM
BY:
BALKRISHNA RAY (HCE076BCT007)
PROJECT SUPERVISOR
Tribhuvan University
Lalitpur, Nepal
March, 2024
ACKNOWLEDGEMENT
We express deep gratitude to the Institute of Engineering, Pulchowk, for
including major projects in the BCT IV/I syllabus, which has greatly enhanced our
academic journey and allowed us to apply practical knowledge. We are thankful
to the management of Himalaya College of Engineering (HCOE) for providing
us with this exceptional opportunity and assembling a team of experts to assist us
during our proposal defense.
We are very thankful to our respected Head of the Department, Er. Ashok GM,
and the Deputy Head of the Department, Er. Devendra Kathayat, for their
invaluable advice, unwavering support, and exceptional guidance throughout our
project. We are also grateful to our project supervisor, Er. Hasina Shakya, for her
unwavering commitment, motivation, and insightful contributions. We are
indebted to our friends and colleagues for their support and constructive feedback
in selecting our project topic. Their encouragement has fostered an environment of
growth and inspiration, fueling our passion for excellence. We humbly
acknowledge all those who have contributed to the realization of our ideas,
transforming them into tangible achievements. In conclusion, we extend our
heartfelt appreciation to everyone involved. We recognize the significant impact
of each individual and institution mentioned here, as their support and guidance
have paved the way for our success. We are eternally grateful for their
contributions.
GROUP MEMBERS:
Balkrishna Ray (HCE076BCT007)
Bikram Bhusal (HCE076BCT008)
Biswas Pandit (HCE076BCT010)
Saurabh Karki (HCE076BCT036)
ABSTRACT
This study presents a methodology for human activity recognition built on
Convolutional Long Short-Term Memory networks (ConvLSTMs). The approach
targets the problem of identifying and categorizing a diverse range of human
activities, from fundamental actions such as walking and sitting to more
intricate motions like dancing and cooking. Accurate activity recognition is
valuable in domains including healthcare, sports analysis, and surveillance.
Conventional techniques often struggle to capture both the spatial detail and
the temporal patterns inherent in sequences of activities. To overcome these
challenges, we propose a hybrid architecture that combines the strengths of
CNNs and ConvLSTMs. CNNs excel at extracting spatial features from raw
sensor data, providing a foundation for distinguishing activities, while
ConvLSTMs model temporal dependencies within sequential data, capturing the
temporal dynamics of human motion. By combining these two deep learning
paradigms, the proposed framework aims to improve the accuracy of human
activity recognition and to contribute to real-time activity understanding.
TABLE OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
1. INTRODUCTION
1.1 OBJECTIVE
4. SYSTEM DIAGRAM
5. METHODOLOGY
5.1 Download and visualize the data with its labels.
5.3 Split the data into train and test set
7. CONCLUSION
REFERENCES
LIST OF FIGURES
Fig 3.1: Use Case Diagram
LIST OF ABBREVIATIONS
HAR: Human Activity Recognition
CNNs: Convolutional Neural Networks
ConvLSTMs: Convolutional Long Short-Term Memory networks
HMMs: Hidden Markov Models
SVMs: Support Vector Machines
1. INTRODUCTION
Human Activity Recognition (HAR) stands as a pivotal research field, finding
applications in healthcare monitoring, sports analysis, and surveillance systems. In
an increasingly data-driven era, the automatic identification, classification, and
comprehension of human activities from sensor data have captured significant
interest. Integrating Convolutional Neural Networks (CNNs) and Long Short-
Term Memory networks (LSTMs) offers a promising avenue to enhance HAR
model accuracy.
Activities span a wide spectrum, from basic motions like walking and sitting to
complex gestures like dancing and cooking. Accurate activity recognition holds
substantial value across domains. Traditional methods often grapple with
capturing spatial intricacies and nuanced temporal patterns inherent in activity
sequences.
1.1 OBJECTIVE
1.2 SCOPE
The task of human activity recognition presents numerous challenges yet holds
immense potential for applications spanning healthcare monitoring, sports
analysis, and surveillance. While Convolutional Neural Networks (CNNs) excel at
extracting spatial features from raw data, they often struggle to inherently capture
the temporal relationships between frames in activity sequences. Conversely,
Long Short-Term Memory networks (LSTMs) possess the capability to model
temporal dependencies but may overlook crucial spatial contexts essential for
accurate activity recognition. To address these challenges and harness the
complementary strengths of both CNNs and Convolutional Long Short-Term
Memory networks (ConvLSTMs), this research aims to propose a hybrid
approach. By integrating CNNs' spatial feature extraction prowess with
ConvLSTMs' temporal modeling abilities, the proposed hybrid model seeks to
enhance the accuracy and robustness of human activity recognition systems across
various domains.
2. LITERATURE REVIEW
Evaluation of HAR models hinges on the availability of high-quality datasets.
Datasets like "HumanActivityNet" have become benchmarks for testing model
performance [7]. These datasets encompass a wide array of activities, allowing
researchers to comprehensively assess model accuracy and generalizability [8][9].
The diversity of these datasets ensures that models trained on them are well-
prepared to handle real-world scenarios.
Despite the potential of the CNN-ConvLSTM hybrid, challenges persist. The
intricate architecture requires meticulous hyperparameter tuning to prevent
overfitting. Addressing domain adaptation and real-time performance concerns
remains an ongoing endeavor, as models must exhibit adaptability across diverse
contexts and provide real-time insights [8].
In conclusion, the integration of CNNs and ConvLSTMs marks a pivotal
advancement in the field of HAR. This literature review highlights the
evolutionary journey from traditional techniques to the innovative integration of
deep learning architectures. The hybrid approach not only bolsters the accuracy of
activity recognition but also paves the way for real-time and context-aware
understanding of human actions in a myriad of applications.
Recent Advances in HAR
Recent studies have focused on refining temporal modeling for HAR. Time-aware
attention mechanisms [9] and Temporal Convolutional Networks (TCNs)
[10] have been proposed as novel approaches to enhance the temporal
understanding of human activities, providing a more nuanced perspective on
temporal dynamics.
In the realm of domain adaptation, recent works emphasize its importance in HAR
models, especially in real-world scenarios with varying sensor configurations and
data distributions [11].
The interpretability of CNN-ConvLSTM models has garnered attention, with
recent research exploring attention mechanisms [12] and saliency maps [13] to
shed light on decision-making processes.
The integration of data from diverse sensors, such as inertial sensors, video
cameras, and social interactions, has shown promising results in improving the
robustness and accuracy of HAR models [14] [15].
3. REQUIREMENT ANALYSIS
• Usability and maintainability
The user interface should be intuitive, easy to navigate, and readily
accessible to users. The system should be easy to maintain and update over time.
Assess the availability of suitable datasets for training and testing the model. Look
for publicly available datasets or consider collecting your own data if necessary.
Evaluate the availability of computational resources, such as GPUs and cloud
computing, required for training and inference with the models. Determine the
feasibility of implementing the ConvLSTM approach using existing deep learning
frameworks such as TensorFlow and Keras.
Estimate the costs associated with data collection, preprocessing, model training,
and deployment. Consider expenses related to hardware, software, personnel, or
any potential licensing. Compare the projected costs with the available budget and
funding sources to ensure financial viability.
Assess the feasibility of integrating the HAR system into existing workflows or
applications, such as healthcare monitoring systems or fitness trackers. Also
evaluate the ease of use and user acceptance of the HAR system by potential end
users.
4. SYSTEM DIAGRAM
4.1 System Flow diagram
Our input comprises a video, which undergoes segmentation into multiple images.
These images are then forwarded to a CNN (Convolutional Neural Network) to
extract visual features. Subsequently, the extracted visual features are fed into an
LSTM (Long Short-Term Memory) network to generate predictions. The CNN's
role is to learn spatial information, while the LSTM specializes in learning
temporal patterns.
4.2 Sequence Diagram
The diagram features an actor labeled 'User' and a lifeline representing the 'HAR'
(Human Activity Recognition). The interaction begins as the user initiates a call
message to upload the video into the system. This action is depicted by a thin
rectangle, symbolizing the activation bar. Upon receiving the video, the system
processes it and responds with a return message, indicating successful recognition.
Finally, the recognized video is displayed to the user.
4.3 Data Flow Diagram
This diagram depicts an entity labeled 'user' and a process named 'human activity
recognition' (HAR). The arrows signify the flow of data. Initially, the user uploads
a video, which is then processed by the human activity recognition system to
recognize the activity. Subsequently, the identified activity is displayed to the
user.
5. METHODOLOGY
Various deep learning techniques can be used to build a Human Activity
Recognition (HAR) system, including Convolutional Neural Networks (CNNs),
Recurrent Neural Networks (RNNs), and hybrid architectures. Each technique
has its own working mechanism and achieves a different level of accuracy. CNNs
are well suited to extracting spatial features from images, while RNNs are used
for sequential data such as series of sensor readings. For HAR, a hybrid
architecture, commonly a combination of a CNN and an RNN, is expected to be
more accurate because it captures both kinds of information. Long Short-Term
Memory (LSTM) is a type of RNN that can be used alongside a CNN in HAR models
to predict activities from sequential data.
1. Memory Cells:
2. Gates:
3. Forget Gate:
The forget gate decides which information from the previous cell state
should be discarded. It takes as input the concatenation of the current input
and the previous hidden state and produces a forget vector. The forget
vector is then multiplied element-wise with the previous cell state,
effectively "forgetting" irrelevant information.
4. Input Gate:
The input gate determines which new information should be stored in the
memory cell. It takes as input the concatenation of the current input and
the previous hidden state and produces an input vector. A candidate cell
state is computed from the same inputs through a tanh activation function,
and the input vector is multiplied element-wise with this candidate. This
produces the new candidate values to be added to the cell state.
5. Cell State Update:
The forget gate and input gate outputs are combined to update the cell
state. The forget gate output is used to scale the previous cell state to
forget irrelevant information, and the input gate output is used to add new
information to the cell state. The resulting updated cell state serves as the
memory for the current time step.
6. Output Gate:
The output gate determines which information from the current cell state
should be exposed to the next hidden state. It takes as input the
concatenation of the current input and the previous hidden state and
produces an output vector. The updated cell state is passed through a tanh
activation function and multiplied element-wise with the output vector, and
the resulting value is the current hidden state.
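The gate descriptions above correspond to the standard LSTM update equations, sketched below. The symbols are notational assumptions that do not appear elsewhere in this report: x_t is the current input, h_{t-1} the previous hidden state, c_{t-1} the previous cell state, \sigma the logistic sigmoid, \odot element-wise multiplication, [h_{t-1}, x_t] the concatenation of the two vectors, and W and b the learned weights and biases of each gate.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

In a ConvLSTM, the matrix products with the weights are replaced by convolutions, so the input, hidden state, and cell state are feature maps that keep their spatial layout.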
Fig 5.1 Train Prediction Workflow
Pre-processing: Splits videos into frames and resizes them for uniformity.
Train/Test Split: Divides the processed data into training and testing sets. The
training set is used to train the model, while the testing set is used to evaluate its
performance.
Build Model: Defines the architecture of the model, including the layers and their
parameters.
Train Model: Trains the model on the training data. The model learns to identify
patterns and relationships within the data.
Evaluate: Assesses the model's performance on the testing data using metrics like
accuracy, precision, recall, or F1-score.
Desired Accuracy Met?: Checks if the achieved accuracy meets the predefined
threshold.
Yes: Training is complete. The model can be used for predictions on new data.
No: If the desired accuracy is not met, the model might need further training or
adjustments. This could involve:
5.1 Download and visualize the data with its labels.
First, we install the required libraries, such as pafy, youtube-dl, and moviepy,
which help us download videos from YouTube. We also use OpenCV, which
provides a wide range of functionality for processing and analyzing images and
videos, and TensorFlow for building and training machine learning models,
including deep learning models. We mostly use the Keras API running on top of
TensorFlow, which focuses on enabling fast experimentation and prototyping of
deep learning models.
In the first step, we download and visualize the data along with its labels to get
an idea of what we will be dealing with. We use the UCF50 – Action
Recognition Dataset, which consists of realistic videos taken from YouTube. This
differentiates it from most other available action recognition datasets, which
are not realistic and are staged by actors. The dataset contains:
• 50 Action Categories
• 25 Groups of Videos per Action Category
• 133 Average Videos per Action Category
• 199 Average Number of Frames per Video
• 320 Average Frame Width per Video
• 240 Average Frame Height per Video
• 26 Average Frames per Second per Video
Next, we perform some preprocessing on the dataset. We read the video files
from the dataset and resize the frames to a fixed width and height (64×64) to
reduce computation. We then normalize the data to the range [0, 1] by dividing
the pixel values by 255, which makes convergence faster while training the
network.
We create a function that returns a list containing the resized and
normalized frames of the video whose path is passed to it as an argument. The
function reads the video file frame by frame, but not all frames are added
to the list, as we only need a fixed-length sequence of evenly spaced frames.
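The frame selection described above can be sketched as follows. This is a minimal illustration and not necessarily the project's exact code: the sequence length of 20 frames and the names sampled_frame_indices and frames_extraction are assumptions, while the 64×64 frame size and the division by 255 come from the text.

```python
IMAGE_HEIGHT, IMAGE_WIDTH = 64, 64  # frame size stated in the text
SEQUENCE_LENGTH = 20                # assumed number of frames kept per video


def sampled_frame_indices(video_frames_count, sequence_length=SEQUENCE_LENGTH):
    """Return evenly spaced frame indices covering the video."""
    # Distance between two sampled frames; at least 1 for very short videos.
    skip_frames_window = max(video_frames_count // sequence_length, 1)
    indices = [i * skip_frames_window for i in range(sequence_length)]
    # Drop indices that fall beyond the end of a video that is too short.
    return [i for i in indices if i < video_frames_count]


def frames_extraction(video_path):
    """Read, resize, and normalize the sampled frames of one video."""
    import cv2  # OpenCV, imported here so the index helper works without it

    video_reader = cv2.VideoCapture(video_path)
    video_frames_count = int(video_reader.get(cv2.CAP_PROP_FRAME_COUNT))
    frames_list = []
    for frame_index in sampled_frame_indices(video_frames_count):
        # Jump to the sampled position and decode that single frame.
        video_reader.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
        success, frame = video_reader.read()
        if not success:
            break
        # cv2.resize expects the target size as (width, height).
        resized_frame = cv2.resize(frame, (IMAGE_WIDTH, IMAGE_HEIGHT))
        frames_list.append(resized_frame / 255.0)  # scale pixels to [0, 1]
    video_reader.release()
    return frames_list
```

For a video with the dataset's average of 199 frames, the helper keeps every ninth frame, starting at frame 0 and ending at frame 171.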
Next, we create a function that iterates through all the classes specified in
CLASSES_LIST, calls the frame extraction function on every video file of the
selected classes, and returns the frames (features), class indices (labels), and
video file paths (video_files_paths).
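A dataset-building loop of this kind might look like the sketch below. It assumes a directory layout with one sub-folder per class, which the text does not state, and it takes the frame extraction function as a parameter (extract_frames, a hypothetical name) so that the loop itself has no video dependencies. Skipping videos that yield fewer frames than the sequence length is also an assumption.

```python
import os


def create_dataset(dataset_dir, classes_list, extract_frames, sequence_length=20):
    """Collect frames, class indices, and paths for the selected classes."""
    features, labels, video_files_paths = [], [], []
    for class_index, class_name in enumerate(classes_list):
        class_dir = os.path.join(dataset_dir, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            video_file_path = os.path.join(class_dir, file_name)
            frames = extract_frames(video_file_path)
            # Assumed rule: keep only videos that yield a full-length sequence.
            if len(frames) == sequence_length:
                features.append(frames)
                labels.append(class_index)
                video_files_paths.append(video_file_path)
    return features, labels, video_files_paths
```

The returned lists can then be converted to arrays and split into training and testing sets.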
model on the data. The create_convlstm_model function constructs a
Convolutional Long Short-Term Memory (ConvLSTM) model for tasks such as
video classification or action recognition. It begins by initializing a Sequential
model, allowing layers to be added sequentially. The model architecture includes
multiple ConvLSTM2D layers, which apply convolutions over the spatial dimensions
of each frame inside an LSTM-style recurrence over the temporal dimension. Each
ConvLSTM2D layer is followed by a MaxPooling3D layer for spatial pooling,
and a TimeDistributed layer with Dropout for regularization. After the
convolutional layers, the output is flattened into a 1D array using a Flatten layer.
Finally, a Dense layer with softmax activation is added to classify the input into
the classes specified in CLASSES_LIST. The model's summary is displayed,
showing the number of parameters and the architecture of each layer, before
returning the constructed ConvLSTM model. This architecture is effective for
learning spatiotemporal features from video data, essential for tasks requiring
understanding and analyzing video sequences.
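The construction described above can be sketched in Keras as follows. The layer pattern, kernel size (3,3), pool size (1,2,2), tanh and softmax activations, and dropout rate of 0.2 follow the report. The filter counts (4, 8, 14, 16), the sequence length of 20, the three colour channels, and the "same" pooling padding are assumptions, so this should be read as one plausible instance rather than the exact model.

```python
def create_convlstm_model(num_classes, sequence_length=20,
                          image_height=64, image_width=64,
                          filters=(4, 8, 14, 16)):
    """Build a ConvLSTM classifier following the layer pattern described above."""
    # Imported inside the function so the sketch can be loaded even on a
    # machine where TensorFlow is not installed.
    from tensorflow.keras import Input, Sequential
    from tensorflow.keras.layers import (ConvLSTM2D, Dense, Dropout, Flatten,
                                         MaxPooling3D, TimeDistributed)

    model = Sequential()
    # Input: a sequence of frames with channels last (3 channels assumed).
    model.add(Input(shape=(sequence_length, image_height, image_width, 3)))
    for n_filters in filters:  # filter counts are assumed, not from the report
        model.add(ConvLSTM2D(filters=n_filters, kernel_size=(3, 3),
                             activation="tanh", recurrent_dropout=0.2,
                             return_sequences=True))
        # Pool only over height and width; the time axis is left untouched.
        model.add(MaxPooling3D(pool_size=(1, 2, 2), padding="same"))
        # Apply the same dropout independently to every time step.
        model.add(TimeDistributed(Dropout(0.2)))
    model.add(Flatten())
    model.add(Dense(num_classes, activation="softmax"))
    model.summary()
    return model
```

A typical call would be create_convlstm_model(num_classes=len(CLASSES_LIST)), so that the output layer has one unit per selected class.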
Next, we add an early stopping callback to prevent overfitting. The callback
monitors the validation loss, which should be minimized, and stops training if it
does not improve for 10 consecutive epochs. The loss function is categorical
cross-entropy, as this is a multiclass classification task, and the Adam optimizer
is chosen for optimization. The model is trained for at most 25 epochs. We shuffle
the training data before each epoch so that the model does not learn from the
order of the samples, and 20% of the training data is held out for validation
during training. After compiling the model, we start the training with the early
stopping callback applied.
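To make the stopping rule concrete, the sketch below reproduces the patience logic in plain Python. It illustrates the behaviour only and is not the training code: in Keras the same rule is configured with EarlyStopping(monitor="val_loss", patience=10, mode="min") and passed to model.fit through its callbacks argument. The function name and the example loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10, max_epochs=25):
    """Return how many epochs run before training stops."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for val_loss in val_losses[:max_epochs]:
        epochs_run += 1
        if val_loss < best_loss:
            # New best validation loss: reset the patience counter.
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return epochs_run


# Hypothetical validation losses: improvement stops after the third epoch.
example_losses = [1.0, 0.8, 0.7] + [0.75] * 30
print(early_stopping_epoch(example_losses))  # 13
```

With a patience of 10 and a limit of 25 epochs, training ends early only when the best validation loss is reached by epoch 14 at the latest and never improved on afterwards; otherwise all 25 epochs run.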
Here we extract the loss and accuracy values from model_evaluation_history,
which holds the results of evaluating the model on the separate testing
dataset. We define the format for the date and time string (date_time_format) and
obtain the current date and time (current_date_time_dt), which we then format
as a string according to the specified format. We then define a descriptive
name for the saved model file, incorporating the date and time of the model's
creation as well as the evaluation loss and accuracy. After that, the trained
ConvLSTM model is saved to disk under this file name. Saving
the model with a descriptive name that includes a timestamp and evaluation metrics
ensures that each model file is uniquely identifiable and carries relevant
information about its performance.
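The naming scheme can be sketched as below. The variable names date_time_format and current_date_time_dt follow the text. The format string, the file name prefix, the four-decimal rounding, and the .h5 extension are assumptions, as is the helper name build_model_file_name.

```python
import datetime as dt


def build_model_file_name(evaluation_loss, evaluation_accuracy, now=None):
    """Compose a model file name from a timestamp and evaluation metrics."""
    date_time_format = "%Y_%m_%d__%H_%M_%S"  # assumed timestamp layout
    current_date_time_dt = now or dt.datetime.now()
    current_date_time_string = current_date_time_dt.strftime(date_time_format)
    return (f"convlstm_model___Date_Time_{current_date_time_string}"
            f"___Loss_{evaluation_loss:.4f}"
            f"___Accuracy_{evaluation_accuracy:.4f}.h5")


# Fixed timestamp and hypothetical metric values, for a reproducible example.
print(build_model_file_name(0.3457, 0.9, now=dt.datetime(2024, 3, 1, 12, 30, 5)))
```

The resulting string would then be passed to the model's save method.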
Here we plot the training loss (loss) and validation loss (val_loss) over successive
epochs of the ConvLSTM model training. This visualization shows
how well the model is learning from the training data and whether it is overfitting
or underfitting. The plot includes two lines, one representing the training loss
(in blue) and the other representing the validation loss (in red).
Another plot illustrates the training accuracy (accuracy) and validation accuracy
(val_accuracy) of the ConvLSTM model across successive epochs. This
visualization likewise enables evaluation of the model's learning progress and
generalization performance. The plot features two lines: one representing the
training accuracy (in blue) and the other representing the validation accuracy (in
red).
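[Plot: training loss and validation loss; x-axis: Epochs, y-axis: Loss]
[Plot: training accuracy and validation accuracy; x-axis: Epochs]
Figure 5.4.4.2: Accuracy vs Validation accuracy.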
6. RESULT AND ANALYSIS
6.1 Model Summary
6.2 Model Architecture
Our model uses five types of layers: ConvLSTM2D, MaxPooling3D,
TimeDistributed, Flatten, and Dense. The kernel size is
(3,3), and the pool size for max pooling is (1,2,2). The ConvLSTM2D layers use
the tanh activation function with a dropout rate of 0.2, and the Dense layer
uses the softmax activation function.
• ConvLSTM2D layers
The ConvLSTM2D layers are responsible for extracting spatiotemporal
features from the input data. These layers use convolutional LSTM units,
which combine convolutional and LSTM operations. Their parameters
are the number of filters, kernel size, activation function, and recurrent dropout.
• MaxPooling3D layers
The MaxPooling3D layers downsample the spatial dimensions of the
feature maps while preserving the temporal dimension. This helps reduce
computational complexity and extract the most relevant features.
• TimeDistributed layers
The TimeDistributed layers apply dropout regularization independently to
each time step in the input sequence, helping prevent overfitting.
• Flatten layer
The Flatten layer flattens the output of the previous layers into a
one-dimensional vector, preparing it for the fully connected layer.
• Dense layer
The Dense layer uses softmax activation and serves as the output layer for
multiclass classification. The number of units in this layer corresponds to
the number of classes in the dataset.
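The effect of these layers on the feature map size can be traced with simple arithmetic, as sketched below. The kernel size (3,3) and pool size (1,2,2) come from the text. The sequence length of 20, the 64×64 input, the filter counts (4, 8, 14, 16), "valid" padding in ConvLSTM2D (the Keras default), and "same" padding in MaxPooling3D are assumptions. The TimeDistributed dropout layers are omitted because they do not change shapes.

```python
import math


def trace_feature_map_shapes(sequence_length=20, height=64, width=64,
                             filters=(4, 8, 14, 16), kernel_size=3, pool_size=2):
    """List (layer name, output shape) pairs for the assumed layer stack."""
    shapes = []
    for n_filters in filters:
        # "valid" convolution: each spatial side shrinks by kernel_size - 1.
        height, width = height - (kernel_size - 1), width - (kernel_size - 1)
        shapes.append(("ConvLSTM2D", (sequence_length, height, width, n_filters)))
        # Pool (1, 2, 2) with "same" padding: time kept, space halved (rounded up).
        height, width = math.ceil(height / pool_size), math.ceil(width / pool_size)
        shapes.append(("MaxPooling3D", (sequence_length, height, width, n_filters)))
    # Flatten multiplies all remaining dimensions together.
    flattened = sequence_length * height * width * filters[-1]
    shapes.append(("Flatten", (flattened,)))
    return shapes


for layer_name, shape in trace_feature_map_shapes():
    print(layer_name, shape)
```

Under these assumptions the spatial size shrinks from 64×64 to 3×3 over four blocks, and the Flatten layer hands a vector of 20 × 3 × 3 × 16 = 2880 values to the Dense layer.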
6.3 Confusion Matrix
Confusion Matrix Heatmap
Diagonal: Ideally, most samples should fall here, signifying correct predictions.
Off-diagonal elements: Represent errors where the model predicted the wrong
class.
Specific values:
And so on...
Overall, the heatmap helps identify strengths and weaknesses of the model in
classifying different categories.
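As a minimal sketch of how such a matrix is built, the function below counts (true class, predicted class) pairs, with rows for true classes and columns for predicted classes. The function names and the toy labels are illustrative assumptions and are not the project's results.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count samples per (true class, predicted class) pair."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(true_labels, predicted_labels):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Share of samples on the diagonal, i.e. correct predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels for three classes (hypothetical, not the project's data).
toy_true = [0, 0, 1, 1, 2, 2]
toy_predicted = [0, 1, 1, 1, 2, 0]
toy_matrix = confusion_matrix(toy_true, toy_predicted, num_classes=3)
print(toy_matrix)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```

In the toy example, four of the six samples lie on the diagonal, so the accuracy is 4/6.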
6.4 Classification Report
JumpRope: Precision of 1.00, recall of 0.97, F1-score of 0.99, and support of 37.
HorseRace: Precision of 0.85, recall of 0.97, F1-score of 0.91, and support of 36.
Overall, the classification report provides insights into the model's effectiveness at
identifying different classes.
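The reported figures can be related to raw prediction counts as in the sketch below. The counts used here are hypothetical: they were chosen to be consistent with the rounded values and supports reported above and are not taken from the project's confusion matrix.

```python
def class_metrics(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1-score, support) for one class."""
    predicted_positive = true_positives + false_positives
    support = true_positives + false_negatives  # samples truly in the class
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / support if support else 0.0
    denominator = precision + recall
    f1_score = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1_score, support


# Hypothetical counts consistent with the reported, rounded values.
jump_rope = class_metrics(true_positives=36, false_positives=0, false_negatives=1)
horse_race = class_metrics(true_positives=35, false_positives=6, false_negatives=1)
print([round(value, 2) for value in jump_rope])   # [1.0, 0.97, 0.99, 37]
print([round(value, 2) for value in horse_race])  # [0.85, 0.97, 0.91, 36]
```

Read this way, a precision of 0.85 for HorseRace would mean that a few videos of other activities were labelled as HorseRace, while the recall of 0.97 would mean that nearly all true HorseRace videos were found.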
7. CONCLUSION
In this project, we explored the application of Convolutional Long Short-Term
Memory (ConvLSTM) networks, built from ConvLSTM2D, MaxPooling3D,
TimeDistributed, Flatten, and Dense layers, for
human activity recognition (HAR). Leveraging the spatiotemporal features
learned by ConvLSTM architectures, we aimed to accurately classify
activities such as jump rope, horse race, tennis swing, and javelin throw. The
project has reached a point where all of its initial objectives have been met, thanks to
all of the changes, new learning, and difficult decisions along the way. Our system
takes a video as input and classifies the activity into one of four types.
REFERENCES
[11] Q.R. Zhao, "Domain Adaptation Techniques for Robust Human Activity
Recognition in Real-world Environments," ACM Transactions on Intelligent
Systems and Technology, vol. 5, no. 2, pp. 265-280, 2022.
[13] T. I. Kim, "Saliency Maps in CNN-ConvLSTM Architectures for Explainable
Human Activity Recognition," IEEE Transactions on Neural Networks and
Learning Systems, pp. 13-18, 2019.
[14] R. K. Park, "Integrating Inertial Sensors and Video Data for Enhanced
Human Activity Recognition," International Journal of Computer Vision, pp.
345-355, 2022.
[18] Y. R. Xu, "Collaborative Integration of HAR with IoT Devices: Shaping the
Future Landscape," Proceedings of the International Conference on Internet
of Things, pp. 201-213, 2021.