
Human Activity Recognition using DL methods


1. Introduction
Human Activity Recognition (HAR) has become one of the most widely studied research areas [1].
It has attracted considerable attention because of its numerous applications in fields including
medical care, disease prediction, robotics, sports, and video surveillance. Automatically
identifying and understanding human actions from sensory data is a critical task in artificial
intelligence, and it also has the potential to transform several industries. HAR-enabled
individualized health monitoring, behavioral analysis, and real-time activity monitoring foster a
deeper understanding of human behavior and an improvement in quality of life. According to a
report published by the United Nations (UN) [2], there are projected to be 2 billion elderly
people worldwide by 2050.

Elderly individuals require extra care and attention, since most of them live with multiple
diseases. An essential component of smart healthcare is the real-time monitoring of individuals’
physical activities, especially their daily living activities (DLAs) [3], which can significantly
improve eldercare and medical rehabilitation. Daily activities have a significant impact on many
serious diseases, so monitoring day-to-day physical activity also provides a crucial health
indicator. More generally, classifying and identifying human physical activities is common
practice for tracking, evaluating, and understanding different postures across a wide range of
systems and applications [4].

HAR recognizes human activities such as running, walking, sleeping, sitting, and standing.
Experimental data can be acquired from sensors (wearable, wireless, etc.) such as accelerometers,
or from images and video frames. Existing sensor-based HAR frameworks rely on smartphone sensors,
audio/video data, or body-worn sensors [5]. Among these, body-worn sensors may be uncomfortable
for users, as they must be placed at several locations on the body. Collecting audio or video
input raises privacy issues. Moreover, signals from both body-worn devices and audio/video need
complex pre-processing to remove unwanted noise. In most cases, long-range audio signals are
corrupted by white noise or background noise, so an audio input at a given moment may fail to
provide valuable insight, and differentiating between two audio signals also becomes difficult
[6]. Audio input alone is therefore not always sufficient for distinctly identifying even basic
human activities. Collecting video data, especially in crowded locations, may also be problematic
because of physical obstacles or low brightness [7].

Sensor data for inferring transfer modes and human activities can also be acquired from
smartphones. Smartphone-based HAR systems are motivated by the discretion, ubiquity, low cost,
usefulness, and noninvasive nature of these devices [8], [9]. With smartphones, input can be
gathered continuously while any physical action is performed. Thanks to the variety of built-in
mobile sensors, tracking health-related data has also become more accurate and convenient. These
built-in sensors are used to collect the data on which HAR models rely, and among them gyroscopes
and accelerometers are the most widely used [10], [11]. Datasets derived from smartphone sensors
are multivariate time series, whose basic feature is local dependency. Furthermore, human activity
signals are translation-invariant and hierarchical, and they carry dynamic information about the
underlying systems [12]. The need to model these high-dimensional datasets accurately is therefore
increasing. Because physical activities have several unique characteristics, HAR also involves
various methodological issues, including imbalanced datasets, interclass similarity, intraclass
variability, and the empty class problem [13].
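
To make the idea of local dependency in multivariate sensor streams concrete, the sketch below
cuts a recording into overlapping fixed-length windows, a common preparation step in sensor-based
HAR. It is only an illustration under stated assumptions: the function name `sliding_windows`,
the six-channel layout (3-axis accelerometer plus 3-axis gyroscope), and the window and step
sizes are hypothetical and are not taken from this paper.

```python
import numpy as np


def sliding_windows(signal, window_size, step):
    """Split a (time, channels) array into overlapping fixed-length windows.

    Returns an array of shape (num_windows, window_size, channels).
    Trailing samples that do not fill a whole window are dropped.
    """
    signal = np.asarray(signal)
    n_samples = signal.shape[0]
    if n_samples < window_size:
        # Not enough samples for a single window: return an empty batch.
        return np.empty((0, window_size) + signal.shape[1:], dtype=signal.dtype)
    starts = range(0, n_samples - window_size + 1, step)
    return np.stack([signal[s:s + window_size] for s in starts])


# Hypothetical recording: 10 time steps, 6 channels
# (assumed layout: 3-axis accelerometer followed by 3-axis gyroscope).
recording = np.arange(60, dtype=float).reshape(10, 6)
windows = sliding_windows(recording, window_size=4, step=2)
print(windows.shape)
```

Each window keeps neighbouring samples together, so a model that looks at one window at a time
sees exactly the short-range structure that the paragraph above calls local dependency.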

Currently, HAR systems based on smartphone sensor data use both traditional machine learning (ML)
algorithms and deep learning (DL) methods to recognize human activities efficiently [14]. In
conventional ML methods, feature engineering is one of the most decisive phases, since it extracts
the underlying features that differentiate distinct activity patterns. The performance of such
HAR systems relies heavily on efficient feature engineering of the raw input signals: the
extracted features are fed to classifiers that identify the activities, and without relevant
features traditional classifiers cannot identify human activities accurately [15]. Complex
pre-processing techniques are therefore needed to bring sensory data into a proper form, and
extracting hand-crafted features from sensory data requires a high level of domain expertise.
However, several studies have noted that handcrafted features do not work for all models and can
perform poorly in recognition models [16]. In addition, different research domains require
distinct handcrafted feature vectors to handle classification problems properly. For these
reasons, most researchers currently prefer DL algorithms.

HAR has a wide range of applications. It is most extensively applied in the medical domain and in
caring for and tracking elderly people to help them lead a better and more secure life [17]. HAR
can also be applied to monitoring and controlling crime. Recognition of everyday activities can
support smart home technology, and detecting driving behaviors helps promote safe transportation.
HAR can also be used to identify military operations. Other application domains include
entertainment, autonomous driving, surveillance and security, and human-robot interaction [18].
The basic objective of HAR is to recognize different kinds of human actions and activities in both
controlled and uncontrolled settings.

1.1. Motivation

Several research works note that most HAR models now use DL algorithms rather than traditional ML
algorithms, since ML algorithms require handcrafted features [19]. DL models, by contrast, are
capable of automatic feature extraction and learning, which is an additional advantage in HAR
systems: they can extract important features efficiently without manual intervention while
recognizing human actions at the same time [19], [20]. DL methods have proved highly effective in
various domains such as intelligent gaming systems, image and speech recognition, and natural
language processing (NLP) [21]. They have also made an outstanding contribution to the HAR
literature, and most current HAR research applies and investigates different types of DL
algorithms. Recognizing distinct complex human activities requires several steps to identify all
the relevant features accurately so that the HAR model can achieve its desired outcome. One of the
most important of these steps is the extraction of relevant features. Human activities consist of
two types of features, spatial and temporal, and identifying both is equally important for
recognizing a specific activity [22]. Extracting these spatio-temporal characteristics from
smartphone-based sensory input is therefore essential. This requires proper feature extraction
procedures together with data pre-processing techniques that convert raw, noisy input signals into
clean, usable data.

Numerous DL methods are applied in HAR architectures to obtain the spatial and temporal
characteristics needed to identify human actions. Among them, the Inception module is widely used
and helps extract spatial and local trends in the data. In the HAR domain, Inception modules offer
several advantages that improve the computational cost and accuracy of activity classifiers.
Inception modules combine convolutional filters with varying receptive field sizes within the same
layer, which allows the model to capture multi-scale features [23]. This leads to more efficient
use of parameters with fewer layers. Inception architectures also incorporate techniques such as
batch normalization (BN) and dropout, which help reduce the risk of overfitting [24]. Because
their branches can be processed in parallel rather than sequentially, Inception architectures also
allow faster training and inference. Moreover, Inception modules help mitigate the vanishing
gradient problem, which makes deep models easier to train. Besides extracting local
characteristics of sensory data, it is equally necessary to retrieve the temporal dependencies in
the data accurately. RNNs are widely used deep networks for this purpose in the HAR domain, and
among RNNs, LSTM plays an important role in capturing temporal relationships. Although LSTMs have
various advantages in capturing temporal characteristics, they sometimes fall short on complex
activities that involve bi-directional dependencies. To overcome this problem, Bi-LSTM is used, as
it captures temporal trends from both directions, past and future [25]. Capturing bi-directional
trends is necessary for identifying activities that involve forward and backward movements. It is
also important to extract features selectively so that unwanted features do not enter model
training. For this purpose, attention mechanisms have recently been used in most cases [26].
Attention mechanisms are DL methods that assign more weight to wanted features and less weight to
unwanted ones.
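
The weighting idea behind attention can be illustrated with a few lines of NumPy. The sketch
below shows one simple form, dot-product scoring followed by a softmax over time steps, and is
not claimed to be the attention layer used in this paper. The function name `attention_pool`
and the `score_vector` parameter are hypothetical; in a trained model the scoring parameters
would be learned, whereas here they are random placeholders.

```python
import numpy as np


def attention_pool(features, score_vector):
    """Soft attention over time steps.

    features:     (time, dim) array, e.g. the per-step outputs of a recurrent layer.
    score_vector: (dim,) array used to score each time step (a placeholder for
                  parameters that would be learned in a real model).
    Returns the weighted summary (dim,) and the attention weights (time,).
    """
    scores = features @ score_vector        # one relevance score per time step
    scores = scores - scores.max()          # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax: positive, sums to 1
    context = weights @ features            # weighted average of the time steps
    return context, weights


rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))          # 5 time steps, 3 features each
score_vector = rng.normal(size=3)
context, weights = attention_pool(features, score_vector)
print(weights.round(3), context.round(3))
```

Time steps with higher scores receive larger weights and therefore dominate the summary vector,
which is the sense in which attention favours wanted features over unwanted ones.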

Motivated by the strengths of these components (Bi-LSTM, Inception, and Attention), this article
proposes an integrated model for identifying human activities. The literature reviewed in this
study also indicates that the proposed combination is novel. The proposed model involves the
following three phases:

• Inception-CBAM module
• Bi-LSTM-Attention module
• A GAP layer followed by dropout and Softmax layers
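
As a rough illustration of the first and last phases listed above, the sketch below runs
parallel 1D convolutions with different kernel sizes, concatenates their outputs along the
channel axis, and then applies global average pooling (GAP) and a softmax. It is a shape-level
sketch under stated assumptions, not the proposed model: CBAM, the Bi-LSTM-Attention phase, and
dropout are omitted, all weights are random placeholders, and the names `conv1d_same`,
`inception_like_block`, and `gap_softmax`, the kernel sizes, and the six output classes are
hypothetical.

```python
import numpy as np


def conv1d_same(x, kernel):
    """1D cross-correlation with zero 'same' padding (odd kernel sizes only).

    x: (time, in_channels), kernel: (k, in_channels, out_channels).
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))            # zero-pad along time only
    out = np.empty((x.shape[0], kernel.shape[2]))
    for t in range(x.shape[0]):
        # Sum over the k time steps and all input channels for every output channel.
        out[t] = np.tensordot(padded[t:t + k], kernel, axes=([0, 1], [0, 1]))
    return out


def inception_like_block(x, kernels):
    """Apply each branch in parallel (with ReLU) and concatenate along channels."""
    branches = [np.maximum(conv1d_same(x, w), 0.0) for w in kernels]
    return np.concatenate(branches, axis=1)


def gap_softmax(features, class_weights):
    """Global average pooling over time, then a linear layer and softmax."""
    pooled = features.mean(axis=0)
    logits = pooled @ class_weights
    logits = logits - logits.max()
    return np.exp(logits) / np.exp(logits).sum()


rng = np.random.default_rng(0)
window = rng.normal(size=(16, 6))                        # 16 time steps, 6 sensor channels
kernels = [rng.normal(size=(k, 6, 4)) for k in (1, 3, 5)]   # three branches, 4 filters each
multi_scale = inception_like_block(window, kernels)      # shape (16, 12)
probabilities = gap_softmax(multi_scale, rng.normal(size=(12, 6)))
print(multi_scale.shape, probabilities.round(3))
```

Because every branch preserves the time dimension, the branch outputs can be concatenated
directly, which is what lets a single layer mix several receptive field sizes.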

1.2. Contribution

The main contributions provided in this article are as follows:

• The paper presents and discusses several aspects of the domain of human activity
recognition.
• An extensive literature review of DL-based HAR frameworks is performed to help readers
understand the topic and to identify potential literature gaps.
• An efficient hybrid DL-based model is proposed that combines three DL components,
Attention, Inception, and Bi-LSTM, to extract the essential spatio-temporal
characteristics of smartphone-based sensor data.
• The effectiveness of the proposed system is justified through experiments, validation
techniques, performance metrics, and cross-validation. Finally, the obtained results are
compared with the existing literature.

2. Related Work
This section presents a comprehensive review of related work in the field of HAR. The review aims
to identify research gaps and possible future directions that can make this domain more
productive. HAR research generally comprises both ML and DL approaches. Earlier studies applied
classic ML algorithms to HAR and reported the recognition accuracy of the resulting models [27].
In these ML-based systems, researchers used numerous feature selection and/or extraction
procedures before feeding the collected data to classifiers that identify human behaviors.
However, ML models depend on handcrafted feature extraction, which requires domain expertise and
manual intervention and therefore results in increased time complexity [28]. To overcome these
disadvantages, researchers have focused on exploring and applying DL-based mechanisms, since
DL-based architectures extract features automatically, without human intervention. This part
therefore presents a survey of DL-based human activity recognition systems.

2.1. DL for HAR

In [29], the authors proposed a smartphone-based hybrid DL architecture for recognizing human
activities. The hybrid architecture, "ConvAE-LSTM", consists of three deep learning models: CNN,
auto-encoders (AE), and LSTM. CNNs extract useful features automatically and capture spatial
features, AEs are used to reduce dimensionality, and LSTMs are popular for capturing temporal
sequences. The unified model thus forms a complementary architecture covering spatiotemporal
characteristics and dimensionality reduction. Four standard public datasets are used in the
experiments: two are smartphone-based (WISDM, UCI) and two are based on body-worn sensors
(OPPORTUNITY, PAMAP2). The outcomes are validated using recall, F1 score, precision, and accuracy
together with leave-one-subject-out (LOSO) cross-validation. The model produced average recall,
F1 score, precision, and accuracy of 96.83%, 97.67%, 97%, and 98.14% on the UCI dataset and
98.33%, 98.17%, 98.17%, and 98.67% on WISDM, respectively.

In [30], Wang et al.
presented a deep architecture capable of learning local features and modeling time dependencies
between features automatically, without manual intervention. The authors built a hybrid model
combining a convolutional neural network (CNN) with a long short-term memory (LSTM) recurrent
network. The CNN extracts relevant features from the collected sensor data, and the LSTM captures
long-term dependencies between two activities to further improve the HAR identification rate. The
resulting CNN-LSTM model, based on wearable sensors, is proposed to detect several human
activities and the associated transitions accurately. Smartphone accelerometer and gyroscope data
are used for the experiments, which were performed on the “HAPT” dataset. The proposed model
achieved a recognition rate of approximately 95.87% overall and an identification rate higher
than 80% for transitions.

In [31],
the authors proposed a deep network combining convolutional layers with an LSTM model. LSTM is a
variant of the recurrent neural network (RNN) that is more capable of processing temporal
features or sequences, while the CNN captures local spatial dependencies in the data. This hybrid
“LSTM-CNN” model automatically extracts activity features and classifies them with fewer model
parameters. Three widely used mobile sensor-based public datasets, WISDM, UCI, and OPPORTUNITY,
are used for the experiments and analysis. The model obtained F1 scores of 95.78%, 92.63%, and
95.85% on the UCI-HAR, OPPORTUNITY, and WISDM datasets, respectively.

In [32], a DL model “GRU-INC” is
proposed for recognizing human activities. The model is an "Inception-Attention" based method
combined with the Gated Recurrent Unit (GRU). The combination is effective at capturing the
spatial and temporal information of time-series data: GRU with Attention extracts temporal
features, while Inception with the Convolutional Block Attention Module (CBAM) extracts spatial
representations. Several human activities are examined using the public datasets OPPORTUNITY,
WISDM, PAMAP2, UCI-HAR, and Daphnet, on which the experiment produced F1-scores of 90.05%, 99.12%,
90.30%, 96.27%, and 95.99%, respectively. The model thus produced better results than several
other deep models. Applying attention separately to the temporal and spatial parts makes the
GRU-INC architecture capable of identifying challenging physical activities.

In [33], researchers present a novel deep architecture-based activity
recognition model, the “Convolutional neural network-long short-term memory network” (CNN-LSTM)
architecture. It is a hybrid model combining two deep architectures, CNN and LSTM. The proposed
method is applied to two datasets, iSPL (3-activity) and UCI HAR (6-activity), to evaluate its
applicability and performance, using metrics such as accuracy and cross-entropy. The model
obtained accuracies of 99.06% and 92.13% on the iSPL and UCI HAR datasets, respectively.

In [34], the author
proposed a deep neural architecture, the InnoHAR model, which combines two deep architectures:
the Gated Recurrent Unit (GRU) and the inception network. The model accepts multi-channel sensor
waveform data end-to-end. The GRU is employed to model the time-series data effectively; among
RNNs, it is popular for its simple architecture and its ability to capture temporal relationships
between data points. To retrieve spatial features from the sensor waveform data, an Inception-like
module derived from GoogLeNet is used. The experiment was performed on three datasets,
OPPORTUNITY, PAMAP2, and SMARTPHONE, and performance was evaluated using the F-measure, which
accounts for both recall and precision. The results show that InnoHAR outperforms both CNN (9%
improvement on the OPPORTUNITY dataset and 3% on PAMAP2) and DeepConvLSTM (5% improvement on
OPPORTUNITY and 3% on PAMAP2).

In [35], the authors proposed a
hybrid DL model combining two commonly used DL networks, CNN and LSTM, to achieve better
recognition performance in indoor environments. CNNs are mainly used for extracting spatial
features, whereas LSTMs focus on learning temporal dependencies. With this in mind, the authors
presented a hybrid model combining the two in order to obtain improved performance. For
evaluation, a self-collected dataset is used, recorded with a Kinect V2 sensor capable of
tracking 25 distinct joints of the human body; it covers 12 distinct human activity classes
performed by 20 participants. The proposed model obtained an accuracy of 90.89% in a comparison
with other existing deep models.

In [36], the authors introduced a new method that
combines a convolutional neural network (CNN) with differing kernel dimensions and a
bi-directional long short-term memory (BiLSTM) network to capture features at several
resolutions. The main contribution of this work lies in the effective extraction of temporal and
spatial patterns from sensory data, and in the optimal representation of video, using a classic
CNN and BiLSTM. Two datasets (WISDM and UCI) are used in this study, with data collected from
sensors such as accelerometers and gyroscopes. BiLSTM makes it possible to model the inherent
temporal relationships in the spatial feature maps. The proposed approach scored better on the
WISDM dataset (98.53%) than on the UCI HAR dataset (97.05%).

In [37], the authors proposed a
dual-channel supervised model “ST-deepHAR” consisting of an LSTM network followed by an attention
mechanism to fuse the temporal characteristics of inertial sensor data, together with a
convolutional ResNet to extract its spatial dependencies. In addition, an adaptive
channel-squeezing operation is introduced to fine-tune the convolutional feature extraction by
exploiting multi-channel dependencies. The extracted spatial and temporal features are
concatenated and passed through a multilayer perceptron and a softmax layer for the final
classification decision. The performance of the architecture is evaluated on two well-known
public HAR datasets (WISDM, UCI HAR). The model obtained accuracies of 97.70% and 98.90% on UCI
HAR and WISDM, respectively, and F1-measures of 98.32% and 97.50% on WISDM and UCI HAR,
respectively.

In
[38], the researchers developed a hybrid DL model that can effectively recognize several human
movements captured using IMU sensors. The hybrid model consists of a CNN and Bi-LSTM units that
extract temporal sequences and spatial characteristics simultaneously from the raw sensor data.
In addition, a meta-heuristic optimization method, “Rao-3”, is adopted to identify suitable
hyper-parameter values for the hybrid architecture and thereby enhance model performance.
Classification performance is evaluated on three widely used HAR datasets: UCI HAR, MHEALTH, and
PAMAP2. The proposed architecture achieved accuracies of 97.16%, 99.25%, and 94.91% on these
datasets, respectively.

In [39], the authors proposed a deep learning-based framework for the efficient detection of
anomalous human activities. The framework combines three DL components, CNN, Bi-LSTM, and
Attention, to identify distinctive spatio-temporal trends in the data by paying extra attention
to the relevant patterns. The analysis is performed on three datasets, UCF50, UCF11, and subUCF
crime, on which the model obtained accuracy scores of 96.04%, 98.90%, and 61.04%, respectively.

In [40], the authors introduced a two-stream DL model
with reduced complexity that uses raw RGB sequences along with their “dynamic motion images
(DMIs)” to recognize complex human behaviors. The RGB frames are processed by a pre-trained
Inception-v3 network with a CNN-LSTM attached and trained end-to-end. For the dynamic image
stream, the last few layers of the pre-trained network are fine-tuned. The features extracted by
the two streams are then max-fused to increase classification accuracy. For evaluation, the
authors used the dyadic SBU Interaction dataset and the single-person MIVIA Action dataset. The
model achieved an accuracy of 98.70% on SBU Interaction and 99.41% on MIVIA Action.

In [41], the authors proposed CBiAM, a 1D-CNN and Bi-LSTM model followed by an attention
mechanism, for recognizing the states of cyclists using smartphones. The aim is to enhance safety
and promote a secure cycling experience by avoiding accidents and emergencies. A newly created
dataset, “cycling safe (CySa)”, is used for the experiments; it contains data on various actions
of cyclists while riding, collected with smartphones placed in their pockets. The CBiAM system
was trained on the CySa dataset with varying window sizes, learning rates, and batch sizes. The
robustness of the model was validated on public HAR datasets (Opportunity, UCI-HAR, WISDM,
MOTIONSENSE, and PAMAP2), on which it achieved F1-scores of 94.72%, 97.51%, 87.05%, 97.67%, and
99.82%, respectively.

In [42], the authors presented a new parallel deep
architecture, DLT, based on the idea of pipeline concatenation. In the proposed system, each
pipeline consists of two sub-pipelines: the first is a 1D-CNN that learns local features, and the
second consists of Bi-LSTM and LSTM layers that learn temporal dependencies, with the feature
maps merged and channel attention integrated. The experiment was conducted on two publicly
available HAR datasets, WISDM and PAMAP2, on which the model obtained accuracies of 97.90% and
98.52%, respectively.

In [43], the authors proposed a deep stacked model for recognizing
human activities using auto-encoders. The aim of the paper is a deep model based on auto-encoders
with orientation-invariant features for identifying complex human activities. The deep stacked
architecture uses auto-encoders to extract the key characteristics of human behavior, improving
model accuracy and reducing overfitting. The data were taken from a smartphone accelerometer. The
model combines auto-encoders, sparse auto-encoders, and a softmax classifier, among other
components, to obtain better performance. Performance is analyzed using metrics such as recall,
accuracy, specificity, and the confusion matrix. The proposed model achieved an accuracy of
97.13% in a comparison with traditional ML algorithms and a deep belief network.

In [44], the authors presented a novel deep
learning-based framework for recognizing dynamic activities, static behaviors, and transitional
activities using stacked denoising auto-encoders (SDAE). The experimental setup is designed to
acquire three types of day-to-day activities (twelve daily activities in total) using wearable
sensors. The records were collected from 10 adults in the smart lab of Ulster University. SDAE, a
deep model that extracts features automatically, is used for the experiments. Performance was
measured using precision, accuracy, recall, and F1 score. The model obtained an overall
identification accuracy of 94.88% on the three kinds of activities.
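
The works reviewed above report accuracy, precision, recall, and F1 score (F-measure). As a
reminder of how these quantities relate, the sketch below derives them from a confusion matrix.
It assumes macro-averaging, that is, an unweighted mean of the per-class values; the reviewed
papers do not all state which averaging they use, so this choice, the function name
`classification_metrics`, and the example matrix are illustrative assumptions.

```python
import numpy as np


def classification_metrics(confusion):
    """Accuracy and macro-averaged precision, recall, and F1 from a confusion matrix.

    confusion[i, j] is the number of samples of true class i predicted as class j.
    Classes with no predictions (or no samples) contribute 0 to the average.
    """
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion)
    predicted = confusion.sum(axis=0)       # column sums: how often each class was predicted
    actual = confusion.sum(axis=1)          # row sums: how many samples each class has
    zeros = np.zeros_like(predicted)
    precision = np.divide(true_pos, predicted, out=zeros.copy(), where=predicted > 0)
    recall = np.divide(true_pos, actual, out=zeros.copy(), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=zeros.copy(), where=denom > 0)
    return {
        "accuracy": true_pos.sum() / confusion.sum(),
        "precision": precision.mean(),
        "recall": recall.mean(),
        "f1": f1.mean(),
    }


# Hypothetical two-class example: 3 of 4 samples correct in each class.
example = classification_metrics([[3, 1], [1, 3]])
print(example)
```

The F1 score is the harmonic mean of precision and recall, which is why it is described above
as covering both.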

Table 1 summarizes the DL-based literature reviewed above. The table shows that WISDM and UCI are
the most widely used standard, publicly available smartphone-based datasets. It can also be
observed that several of the reviewed works use attention mechanisms to select useful features.
In terms of the research gap, almost all of the models extract both spatial and temporal
dependencies of the sensory input in order to recognize human behaviors efficiently with enhanced
performance. To address the potential literature gaps, this article proposes a hybrid model
involving the Inception module, a Bi-LSTM network, and an Attention mechanism, taking all these
factors into consideration. The proposed architecture is expected to achieve enhanced
classification accuracy, with proper justification of the model performance.

Table 1: HAR systems based on Deep Learning

Reference  Dataset           Sensor      Classifier                   Accuracy
[29]       WISDM             Smartphone  CNN + Auto-encoder + LSTM    98.67%
           UCI               Smartphone                               98.14%
           OPPORTUNITY       Body-worn                                95.69%
           PAMAP2            Body-worn                                94.33%
[30]       HAPT              Body-worn   CNN + LSTM                   95.87%
[31]       WISDM             Smartphone  LSTM + CNN                   95.85% (F1 score)
           UCI HAR           Smartphone                               95.78%
           OPPORTUNITY       Body-worn                                92.63%
[32]       OPPORTUNITY       Body-worn   GRU + Attention + Inception  90.05% (F1 score)
           WISDM             Smartphone                               99.12%
           PAMAP2            Body-worn                                90.30%
           UCI HAR           Smartphone                               96.27%
           Daphnet           Body-worn                                95.99%
[33]       iSPL              Body-worn   CNN + LSTM                   99.06%
           UCI HAR           Smartphone                               92.13%
[34]       OPPORTUNITY       Body-worn   GRU + Inception              94.60% (F-measure)
           PAMAP2            Body-worn                                93.50%
           SMARTPHONE        Body-worn                                94.50%
[35]       Self-collected    Body-worn   CNN + LSTM                   90.89%
[36]       WISDM             Smartphone  CNN + Bi-LSTM                98.53%
           UCI HAR           Smartphone                               97.05%
[37]       WISDM             Smartphone  LSTM + Attention + ResNet    98.90%
           UCI HAR           Smartphone                               97.70%
[38]       UCI HAR           Smartphone  CNN + Bi-LSTM                97.16%
           MHEALTH           Body-worn                                99.25%
           PAMAP2            Body-worn                                94.91%
[39]       UCF50             Video data  CNN + Bi-LSTM + Attention    96.04%
           UCF11             Video data                               98.90%
           subUCF crime      Video data                               61.04%
[40]       SBU Interaction   Video data  Inception-V3 + CNN + LSTM    98.70%
           MIVIA Action      Video data                               99.41%
[41]       CySa (self-made)  Body-worn   CNN + Bi-LSTM + Attention    -
           OPPORTUNITY       Body-worn                                94.72% (F1 score)
           UCI HAR           Smartphone                               97.51%
           WISDM             Smartphone                               87.05%
           MOTIONSENSE       Smartphone                               97.67%
           PAMAP2            Body-worn                                99.82%
[42]       WISDM             Smartphone  CNN + Bi-LSTM + LSTM         97.90%
           PAMAP2            Body-worn                                98.52%
[43]       Self-collected    Smartphone  Auto-encoder                 97.13%
[44]       Self-collected    Body-worn   Stacked Auto-encoder         94.88%

HAPT: “Human Activities and Postural Transitions” dataset.


3. Preliminaries
In this research work, a hybrid deep model is presented that combines three distinct deep learning
components: an attention mechanism, a Bi-LSTM network, and an Inception module. The main idea is to
extract the local (spatial) and temporal features of the input data efficiently. To interpret the
working mechanism of the model, each component and its associated parameters must first be
understood separately. The following subsections therefore briefly describe the required components.
Figure 1 shows the mechanism of the proposed hybrid DL framework.

Fig 1: Proposed Model Architecture


3.1. Inception Module

Inception is one of the most popular DL architectures and is widely used in deep learning research
for its beneficial characteristics. The Inception module was introduced by Google researchers as the
building block of the “GoogLeNet” (Inception-V1) network, to address some shortcomings of
traditional DL architectures. Inception models help to mitigate the vanishing-gradient problem and
the computational cost of deeper networks. An Inception module applies multiple convolutional
filters of varying kernel sizes, such as 1x1, 3x3, and 5x5, together with a pooling layer, in
parallel [45], and concatenates their outputs. This design is effective in capturing image features
at various scales, so spatial patterns can be drawn from images or video sequences within a single
layer. In the HAR domain, too, the Inception module plays a crucial role in extracting the spatial
characteristics of the input data accurately. Whereas conventional CNN layers are dense, Inception
modules are sparse in nature, which allows multi-scale extraction of relevant features in less time
and reduces the time complexity of the model. Moreover, HAR datasets generally involve a huge
amount of input data containing many features. Dimensionality reduction is therefore an important
concern in the HAR domain, to ensure that only the desired features are selected for model
deployment. The Inception module uses 1x1 convolutions as an efficient way to reduce dimensionality,
which lowers the computational cost and the number of parameters without affecting the network
depth. These 1x1 convolutions also help in capturing fine-grained spatial characteristics of the
data. In addition, the advanced versions of Inception (Inception-V2, Inception-V3) incorporate
techniques such as batch normalization (BN) and dropout, which mitigate the risk of overfitting and
make them convenient for HAR models [46]. Furthermore, the parallel branches of an Inception module
allow parallel processing, which can lead to faster training and inference. Hence, incorporating an
Inception module in a HAR pipeline is expected to benefit both the overall accuracy of the model
and the retrieval of spatial trends in the data.
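
The saving obtained from a 1x1 reduction can be checked with simple arithmetic. The sketch below counts the weights and biases of a single 5x5 branch with and without a preceding 1x1 convolution. The channel sizes (192 input, 16 reduced, 32 output) and the helper names are illustrative assumptions, not values taken from the proposed model.

```python
def conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus biases of one 2-D convolution with a square kernel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def branch_params(in_channels, out_channels, kernel_size, bottleneck=None):
    """Parameters of one Inception branch, optionally preceded by a 1x1 reduction."""
    if bottleneck is None:
        return conv_params(in_channels, out_channels, kernel_size)
    reduce_step = conv_params(in_channels, bottleneck, 1)          # 1x1 reduction
    main_step = conv_params(bottleneck, out_channels, kernel_size)  # 5x5 on fewer channels
    return reduce_step + main_step


# Hypothetical channel sizes, chosen only for illustration.
direct = branch_params(192, 32, 5)
reduced = branch_params(192, 32, 5, bottleneck=16)
print("5x5 branch without reduction:", direct)
print("5x5 branch with 1x1 reduction:", reduced)
```

Under these assumed sizes the reduced branch needs roughly a tenth of the parameters, which is the effect the 1x1 convolutions are meant to achieve.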

Fig 2: Inception Module [47]


3.2. Bi-LSTM

In human activity recognition, Recurrent Neural Networks (RNNs) play an important role. Among RNNs,
LSTM networks are the most widely used, and researchers apply them in HAR models to capture temporal
dependencies in the data. For effective feature selection and improved accuracy, extracting temporal
dependencies is as important as extracting the spatial characteristics of the data. Human actions
are generally recorded as time-series sensory data, so temporal trends play a crucial role in
modelling human movements. LSTMs retrieve these temporal characteristics because they can model
long-term dependencies. In HAR models, capturing short or long transitions between actions is as
important as capturing the actions themselves. Although LSTMs are good at capturing temporal
features, they have some major drawbacks, which motivates the use of Bi-LSTM models [48]. To
recognize complex human movements such as swimming, cycling, and walking, it is crucial to identify
actions that depend on both preceding and succeeding movements. An LSTM processes the sequence in
only one direction, whereas a Bi-LSTM processes the input signal in both the forward and backward
directions, allowing the model to capture contextual information from both past and future time
steps [49]. This provides a more comprehensive view of the temporal dependencies of the input and
helps to achieve better classification results. Given these advantages, Bi-LSTMs are now considered
more suitable than unidirectional LSTM networks for most analytical tasks involving complex human
activities. Figure 3 depicts the working of the Bi-LSTM.

Following [50], the working of the Bi-LSTM is expressed in terms of the information from past time
steps of the hidden states, the information from future time steps of the hidden states, the input
states and hidden states embedded in the two directions, an activation function, a sigmoid function,
and a bias.

Fig 3: Bi-LSTM Model [41]
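
The bidirectional processing described in this subsection can be sketched in plain Python. The following is a minimal illustration built on a standard LSTM cell with randomly initialised weights. It is not the formulation of [50] or the configuration of the proposed model; the input size, hidden size, window length, and all function names are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_lstm(input_size, hidden_size, rng):
    """Random weights for the input (i), forget (f), candidate (g) and output (o) gates."""
    cols = hidden_size + input_size
    return {
        gate: {
            "W": [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(hidden_size)],
            "b": [0.0] * hidden_size,
        }
        for gate in ("i", "f", "g", "o")
    }


def lstm_step(params, x, h, c):
    """One standard LSTM step applied to the concatenated vector [h, x]."""
    z = h + x  # list concatenation

    def affine(gate):
        p = params[gate]
        return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(p["W"], p["b"])]

    i = [sigmoid(v) for v in affine("i")]
    f = [sigmoid(v) for v in affine("f")]
    g = [math.tanh(v) for v in affine("g")]
    o = [sigmoid(v) for v in affine("o")]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def run_lstm(params, sequence, hidden_size):
    """Run one direction over the sequence and collect the hidden state at every step."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    outputs = []
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
        outputs.append(h)
    return outputs


def bi_lstm(fwd, bwd, sequence, hidden_size):
    """Concatenate forward (past) and backward (future) hidden states per time step."""
    forward = run_lstm(fwd, sequence, hidden_size)
    backward = run_lstm(bwd, sequence[::-1], hidden_size)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]


rng = random.Random(0)
HIDDEN = 4  # assumed hidden size per direction
fwd, bwd = make_lstm(3, HIDDEN, rng), make_lstm(3, HIDDEN, rng)
window = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(6)]  # 6 steps, 3 sensor axes
states = bi_lstm(fwd, bwd, window, HIDDEN)
print(len(states), len(states[0]))
```

Each output vector has twice the hidden size: the first half summarises the steps seen so far, and the second half summarises the steps still to come, which is what lets the model use both past and future context.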


3.3. Attention Mechanism

In human activity detection, features play the most important role: recognition accuracy and model
efficiency depend on the proper selection of essential features. Feature identification and
selection is therefore one of the most crucial parts of recognizing human behaviours efficiently,
and detecting human movements requires the selection of both temporal and spatial features. This is
where the Attention mechanism is needed, to pay extra “attention” to, and place emphasis on, the
most relevant features [51]. The attention mechanism is especially beneficial when not every piece
of input is equally meaningful or informative. In the HAR domain, many researchers currently use
attention to concentrate on particular time steps or movements that are more indicative of
particular activities. Attention highlights the time steps or features that are crucial for
distinguishing between different activities, and it can also enhance the model's ability to
recognize activities of varying duration or complexity [52]. Hence, attention is incorporated in
the proposed model to pick out the most relevant features.
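
The idea of weighting time steps by relevance can be sketched as a simple attention pooling step. The example below uses dot-product scoring against a fixed query vector followed by a softmax. This is one common form of attention chosen for illustration, not necessarily the variant used in the proposed model, and the query vector, which would normally be learned, is a hypothetical constant here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Score each time step with a dot product, then return the weighted sum."""
    scores = [sum(h_k * q_k for h_k, q_k in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, hidden_states)) for k in range(dim)]
    return weights, context


# Four time steps of a 2-dimensional hidden state; the third step is the most salient.
hidden_states = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5], [0.1, 0.2]]
query = [1.0, 1.0]  # hypothetical stand-in for a learned query vector
weights, context = attention_pool(hidden_states, query)
print([round(w, 3) for w in weights])
```

The weights sum to one, and the time step with the largest score dominates the pooled context vector.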

4. Proposed Model
Here, a hybrid deep architecture is proposed for recognizing several human actions effectively. The
proposed DL-based framework combines Bi-LSTM, Inception, and Attention mechanisms to form the
classification framework. The Bi-LSTM-Inception model, with an attention mechanism attached to both
of these components, is suggested in order to obtain better predictive outcomes in the HAR domain by
addressing gaps in the existing literature. The working mechanism of the proposed framework is
elaborated step by step in the following sections. First, data must be collected in order to deploy
the proposed algorithm and analyse the results. Multivariate time-series data are gathered from
built-in smartphone sensors: using sensors such as gyroscopes and tri-axial accelerometers,
fine-grained sensory data can be acquired. Here, for evaluation purposes, datasets collected through
smartphone sensors are used. However, data collected with mobile phone sensors are generally noisy
and not in a suitable format, and hidden patterns cannot be recognized reliably from such noisy raw
data. Unwanted noise must therefore be removed from the raw data before it is fed to the
classification model, which is the purpose of data pre-processing. After pre-processing, the
processed data are fed to the model to obtain the final classification output. The model is divided
into two parts: one extracts the spatial features of the data, and the other retrieves its temporal
dependencies. After the spatio-temporal features are retrieved, they are passed to the subsequent
layers and finally to the softmax layer to obtain the final output of the activity recognition
model. The model is then validated using cross-validation, comparison between the proposed model
and existing literature, and evaluation of the performance metrics. Figure 4 gives an overview of
the proposed model with the required components and layers.
Fig 4: Proposed Model Flow Diagram
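
The cross-validation step mentioned above can be sketched as a plain k-fold index split. The text does not state the number of folds or whether the split is made per sample or per subject, so the value k = 5, the contiguous sample-wise folds, and the function name below are assumptions made only for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical setting: 10 windows split into 5 folds.
folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train size:", len(train_idx))
```

Every sample is used for testing exactly once, and the reported metric is typically the average over the folds. For HAR data, splitting by subject rather than by sample is a common alternative that avoids the same person appearing in both the training and test sets.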

4.1. Dataset Description

To evaluate the performance of the human activity detection model, two publicly available
smartphone-based datasets, UCI HAR and WISDM, are used in this study. They are briefly described as
follows:

• UCI HAR [53]: This standard dataset comes from the “University of California Irvine
(UCI) Machine Learning” repository, which is openly accessible to the public. It is a
balanced dataset gathered from thirty individuals, aged 19 to 48, who performed six distinct
activities of daily living: “sitting”, “standing”, “walking”, “lying”, “walking upstairs”,
and “walking downstairs”. A “Samsung Galaxy S II” smartphone with an integrated gyroscope
and accelerometer, positioned on the waist, was used to gather the data under appropriate
supervision in a laboratory setting. The researchers measured the tri-axial angular velocity
and tri-axial linear acceleration at a constant sampling rate of 50 Hz. The dataset consists
of 748,406 samples; further details are given in Table 2.

Table 2: Activities involved in UCI HAR

Activity             Samples   Percentage
Walking              122,091   16.3%
Sitting              126,677   16.9%
Standing             138,105   18.5%
Laying               136,865   18.3%
Walking Upstairs     116,707   15.6%
Walking Downstairs   107,961   14.4%

• WISDM Dataset [54]: This dataset is provided by the “Wireless Sensor and Data Mining
(WISDM)” lab. It contains a total of 1,098,209 samples and is an unbalanced dataset. Six
basic activities are covered: “standing”, “sitting”, “walking”, “upstairs”, “downstairs”,
and “jogging”. The activity “walking” has the largest share, 38.6% of the samples, while
“standing” accounts for only 4.4%. The data were collected from 36 subjects, who performed
specific daily activities with an Android smartphone placed in their front leg pocket. The
smartphone's built-in accelerometer was used, with a sampling frequency of 20 Hz. Further
details are provided in Table 3.

Table 3: Activities involved in WISDM

Activity             Samples   Percentage
Walking              424,400   38.6%
Jogging              342,177   31.2%
Walking Upstairs     122,869   11.2%
Walking Downstairs   100,427   9.1%
Sitting               59,939   5.5%
Standing              48,397   4.4%
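
The class shares in Tables 2 and 3 follow directly from the sample counts, and the degree of imbalance can be summarised by the ratio of the largest to the smallest class. The short sketch below recomputes both from the tabulated counts; the function names are chosen only for this illustration.

```python
# Sample counts copied from Table 2 (UCI HAR) and Table 3 (WISDM).
UCI_HAR_COUNTS = {
    "Walking": 122_091, "Sitting": 126_677, "Standing": 138_105,
    "Laying": 136_865, "Walking Upstairs": 116_707, "Walking Downstairs": 107_961,
}
WISDM_COUNTS = {
    "Walking": 424_400, "Jogging": 342_177, "Walking Upstairs": 122_869,
    "Walking Downstairs": 100_427, "Sitting": 59_939, "Standing": 48_397,
}


def class_distribution(counts):
    """Percentage share of each activity, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


def imbalance_ratio(counts):
    """Largest class size divided by smallest class size."""
    return max(counts.values()) / min(counts.values())


for label, counts in (("UCI HAR", UCI_HAR_COUNTS), ("WISDM", WISDM_COUNTS)):
    print(label, sum(counts.values()), class_distribution(counts),
          round(imbalance_ratio(counts), 2))
```

The ratio stays close to 1 for UCI HAR and is several times larger for WISDM, which is consistent with describing the former as balanced and the latter as unbalanced.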

4.2. Pre-processing Technique

Raw signals collected from IMU, body-worn, or smartphone-based sensors generally include numerous
diverse and uninformative data dimensions, along with noise and unwanted parameters. Pre-processing
is therefore required to refine such noisy data and convert it into a cleaned form that can be
readily fed to the model for classification.

4.3. Inception-CBAM Operation

4.4. Bi-LSTM-Attention Operation

4.5. GAP, Dropout, and Softmax layers
