Human Activity Recognition using DL methods
Introduction
Human Activity Recognition (HAR) has become one of the most widely studied research areas [1]. It
has attracted considerable attention because of its numerous applications in fields including
medical care, disease prediction, robotics, sports, and video surveillance. The ability to
automatically identify and understand human actions from sensory data is a critical task in
artificial intelligence, and it has the potential to transform several industries. HAR-enabled
individualized health monitoring, behavioral analysis, and real-time activity monitoring foster a
deeper understanding of human behavior and an improved quality of life. According to a report
published by the United Nations (UN) [2], there will be an estimated 2 billion elderly people
worldwide by 2050.
Elderly individuals require extra care and attention, since the majority of them suffer from
multiple diseases. An essential component of smart healthcare is the real-time monitoring of
individuals’ physical activities, especially their daily living activities (DLAs) [3], which can
significantly improve eldercare and medical rehabilitation. Daily activities significantly affect
many serious diseases, so monitoring day-to-day physical activity provides a crucial health
indicator. It is common practice across a wide range of systems and applications to track,
evaluate, and interpret different postures by classifying and identifying human physical
activities [4].
HAR recognizes human activities such as running, walking, sleeping, sitting, and standing. The
required data can be acquired from sensors (wearable, wireless, etc.) such as accelerometers, or
from images and video frames. Existing sensor-based HAR frameworks rely on smartphone sensors,
audio/video data, or body-worn sensors [5]. Among these, body-worn sensors may be uncomfortable
for users, since they must be placed at several locations on the body. Collecting audio or video
input raises privacy issues. Moreover, signals from both body-worn devices and audio/video sources
need complex pre-processing to remove unwanted noise. Long-range audio signals are often corrupted
by white noise or background noise, so an audio input at a given moment may fail to provide
valuable insight, and differentiating between two audio signals is also difficult [6]. Audio input
alone is therefore not always sufficient for distinctly identifying even basic human activities.
Collecting video data can also be problematic, especially in crowded locations, because of
physical obstacles or low brightness [7].
Sensor data for inferring transportation modes and human activities can also be acquired from
smartphones. Smartphone-based HAR systems are motivated by their discretion, ubiquity, low cost,
ease of use, and non-invasive nature [8], [9]. With a smartphone, input can be gathered
continuously while the user performs any physical activity. In addition, the variety of built-in
mobile sensors has made tracking health-related data more accurate and convenient. These built-in
sensors are used to collect the input for HAR models, and among them gyroscopes and
accelerometers are the most widely used [10], [11]. Datasets derived from smartphone sensors are
multivariate time series, whose basic feature is local dependency. Furthermore, human activity
signals are translation-invariant and hierarchical, and they carry dynamic information about the
underlying systems [12]. The need to model these high-dimensional datasets accurately is therefore
increasing. Because physical activities have several unique characteristics, HAR involves various
methodological issues, including imbalanced datasets, interclass similarity, intraclass
variability, and the empty class problem [13].
Current HAR systems based on smartphone sensor data use either traditional machine learning (ML)
algorithms or deep learning (DL) methods to recognize human activities efficiently [14]. In
conventional ML methods, feature engineering is one of the most important phases, since it
extracts the underlying features that differentiate distinct activity patterns. The performance of
such HAR systems relies heavily on efficient feature engineering of the raw input signals. The
extracted features are then fed to classifiers that identify human activities. Without relevant,
accurately extracted features, traditional classifiers perform poorly and fail to identify human
activities accurately [15]. Complex pre-processing techniques are needed to bring the sensory data
into a proper form, and extracting hand-crafted features from it requires a high level of domain
expertise. However, several studies note that hand-crafted features do not work for all models and
can perform poorly in recognition tasks [16]. In addition, different research domains require
different hand-crafted feature vectors to handle their classification problems properly. For these
reasons, most researchers currently prefer DL algorithms.
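To make the notion of hand-crafted features concrete, the following minimal sketch computes a few per-axis statistics over one window of tri-axial sensor readings. The choice of statistics (mean, standard deviation, minimum, maximum), the function name handcrafted_features, and the sample values are illustrative assumptions, not the feature set of any cited work.

```python
import numpy as np

def handcrafted_features(window: np.ndarray) -> np.ndarray:
    """Per-axis mean, std, min, and max for one window of shape (time, axes)."""
    return np.concatenate([
        window.mean(axis=0),
        window.std(axis=0),
        window.min(axis=0),
        window.max(axis=0),
    ])

# Hypothetical 4-sample window of tri-axial accelerometer readings (x, y, z).
window = np.array([
    [0.0, 1.0, 9.0],
    [2.0, 1.0, 9.0],
    [0.0, 1.0, 11.0],
    [2.0, 1.0, 11.0],
])

# 4 statistics x 3 axes = 12 features that a conventional classifier would consume.
features = handcrafted_features(window)
```

Every statistic in such a vector has to be chosen by hand for each new domain, which is the burden that DL-based feature learning aims to remove.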
HAR has a wide range of applications. It is applied most extensively in the medical domain and in
caring for and keeping records of elderly people to help them lead better and safer lives [17].
HAR can also be applied to monitoring and controlling crime. Recognition of everyday activities
supports smart home technology, and detection of driving behavior helps promote safe
transportation. HAR can also be used to identify military operations. Other domains where HAR is
applied include entertainment, autonomous driving, surveillance and security, and human-robot
interaction [18]. The basic objective of HAR is to recognize different kinds of human actions and
activities in both controlled and uncontrolled settings.
1.1. Motivation
Many research works show that most current Human Activity Recognition models use Deep Learning
algorithms rather than traditional ML algorithms, because ML algorithms require handcrafted
features [19]. DL models, in contrast, are capable of automatic feature extraction and learning,
which gives them an extra advantage in HAR systems. DL algorithms can extract important features
efficiently without manual intervention while recognizing human actions at the same time [19],
[20]. DL methods have shown outstanding predictive performance in various domains such as
intelligent gaming systems, image and speech recognition, and natural language processing (NLP)
[21]. DL methods have also made an outstanding contribution to the HAR literature, and different
types of DL algorithms are being applied and investigated in most current research in the HAR
domain. Recognizing distinct, complex human activities requires certain steps to identify all the
relevant features accurately so that the HAR model can achieve its desired outcome. One of the
most important of these steps is the extraction of relevant features. Human activities consist of
two types of features: spatial and temporal. Identifying both is equally important for recognizing
a specific activity [22]. Extracting these spatio-temporal characteristics from smartphone-based
sensory input is therefore essential. This requires proper feature extraction procedures together
with data pre-processing techniques that convert raw, noisy input signals into clean, usable data.
Numerous DL methodologies are applied in HAR architectures to capture the spatial and temporal
characteristics of data needed to identify human actions. Among them, the Inception module is one
of the most widely used; it helps extract spatial and local trends in the data. In the HAR domain,
Inception modules offer several advantages that improve the cost and accuracy of activity
classifiers. Inception modules combine convolutional filters with different receptive field sizes
within the same layer, which allows the model to capture multi-scale features [23]. This leads to
more efficient use of parameters and requires fewer layers. In addition, Inception modules
incorporate techniques such as batch normalization (BN) and dropout, which help the model reduce
the risk of overfitting [24]. The Inception architecture also enables faster training and
inference, because its branches can be processed in parallel rather than sequentially. Moreover,
Inception modules help mitigate the vanishing gradient problem, which makes deep models easier to
train. Besides extracting the local characteristics of sensory data, it is equally necessary to
capture its temporal dependencies accurately. RNNs are widely used deep networks for this purpose
in the HAR domain, and among RNNs, LSTMs play a major role in capturing temporal relationships.
Although LSTMs have various advantages in capturing temporal characteristics, they sometimes fall
short on complex activities that involve bi-directional movements. To overcome this problem,
Bi-LSTMs are used, since they can capture temporal trends from both past and future time steps
[25]. Capturing bi-directional trends is necessary for identifying activities that involve forward
and backward movements. It is also necessary to extract these features selectively so that
unwanted features do not enter model training. To pay extra “attention” to feature selection,
attention mechanisms are now used in most cases [26]. Attention mechanisms are DL methods that
assign greater weight to relevant features and less weight to irrelevant ones.
Motivated by the strengths of these components (Bi-LSTM, Inception, and Attention), this article
proposes an integrated model for identifying human activities. A review of the related literature
indicates that the proposed combination is novel. The architecture of the proposed model consists
of the following three phases:
Inception-CBAM module
Bi-LSTM-Attention module
A GAP layer followed by dropout and Softmax layers
1.2. Contribution
The paper presents and discusses several aspects of human activity recognition. Its main
contributions are as follows:
An extensive literature review of DL-based HAR frameworks is performed to give readers an
accessible overview of the topic and to identify potential literature gaps.
An efficient hybrid DL-based model combining three DL components, namely Attention, Inception,
and Bi-LSTM, is proposed to extract the essential spatio-temporal characteristics of
smartphone-based sensor data.
The effectiveness of the proposed system is justified through experiments, performance metrics,
and cross-validation. Finally, the obtained results are compared with those in the existing
literature.
2. Related Work
This section presents a comprehensive review of related articles and evaluates prior work in the
field of HAR. The review aims to identify research gaps and possible future directions that can
make the HAR domain more productive. Research in HAR comprises both ML and DL approaches. In
earlier work, researchers applied classic ML algorithms in the HAR domain and reported the
recognition accuracy of their models [27]. In these ML-based systems, numerous feature selection
and/or extraction procedures were applied before feeding the collected data to classifiers that
identify human behaviors. However, ML models depend on handcrafted feature extraction, which
requires domain expertise and manual intervention and therefore increases time complexity [28]. To
overcome these disadvantages of ML-based HAR systems, researchers have focused on exploring and
applying DL-based mechanisms, since DL-based architectures extract features automatically without
human intervention. Hence, this section surveys DL-based human action identification systems.
Table 1 summarizes the reviewed literature based on deep learning methods. The table shows that
WISDM and UCI HAR are the most widely used standard, publicly available smartphone-based datasets.
It can also be observed that most of the works use attention mechanisms to select useful features.
Research Gap
Almost all of the models extract both the spatial and the temporal dependencies of sensory inputs
in order to recognize different human behaviors efficiently with enhanced model performance.
Taking these factors into consideration, this article suggests a hybrid model involving the
Inception module, a Bi-LSTM network, and an attention mechanism to address the potential
literature gaps. The proposed architecture is expected to achieve enhanced classification
accuracy, supported by proper justification of the model performance.
3. Preliminaries
This research work presents a hybrid deep model that combines three distinct deep learning
components. The proposed system combines the advantages of the attention mechanism, the Bi-LSTM
model, and the Inception model. The main idea is to retrieve the local and temporal features of the
input data efficiently. To interpret the working mechanism of the model, it is necessary to
understand each of the associated components and parameters separately. The remainder of this
section therefore describes the required components briefly. Figure 1 shows the mechanism of the
proposed hybrid DL framework.
Inception is one of the most popular DL models today, and researchers in the deep learning domain
use the Inception module extensively for its beneficial characteristics. The Inception module was
introduced by Google researchers as the building block of the “GoogLeNet” architecture. It was
designed to address some shortcomings of traditional DL architectures, in particular the vanishing
gradient problem and the computational cost of deeper networks. An Inception module contains
multiple convolutional layers with different kernel sizes, such as 1x1, 3x3, and 5x5, together with
pooling layers [45]. The outputs of these layers are concatenated into a single output. This design
is effective at capturing image features at various scales, so spatial patterns can be drawn
effectively from images or video sequences within a single layer. In the HAR domain, the Inception
module also plays a crucial role in extracting the spatial characteristics of input data
accurately. Whereas conventional CNNs are dense, Inception modules are sparse in nature, and this
sparsity enables multi-scale extraction of relevant features in less time, reducing the time
complexity of the model. Moreover, HAR datasets generally contain a large amount of input data with
many features. Dimensionality reduction is therefore an important concern in the HAR domain, to
ensure that only the desired features are selected for model deployment. The Inception module uses
1x1 convolutions as an efficient way to reduce dimensionality, which lowers the computational cost
and the number of parameters without affecting the network depth. These 1x1 convolutions also help
capture fine-grained spatial characteristics of the data. In addition, later versions of Inception
(Inception-V2, Inception-V3) incorporate batch normalization (BN) and dropout, which mitigate the
risk of overfitting and make the module convenient for HAR models [46]. Furthermore, the parallel
branches of Inception modules allow faster training and inference compared with other models.
Incorporating the Inception module in HAR mechanisms can therefore be expected to improve the
overall accuracy of the model and its retrieval of spatial trends in the data.
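The multi-branch structure and the effect of the 1x1 reduction can be illustrated with a minimal NumPy sketch of a one-dimensional Inception-style block for sensor time series. The branch widths, the random weights, the 128-step input with 6 channels, and all function names are assumptions made for illustration; this is not the configuration of the proposed model.

```python
import numpy as np

def conv1d_same(x, w):
    """x: (time, c_in); w: (k, c_in, c_out) with odd k; zero 'same' padding."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], w.shape[2]))
    for i in range(x.shape[0]):
        # Contract the (k, c_in) window against the kernel to get c_out values.
        out[i] = np.tensordot(xp[i:i + k], w, axes=([0, 1], [0, 1]))
    return out

def maxpool1d_same(x, k=3):
    """Stride-1 max pooling over time that keeps the sequence length."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), constant_values=-np.inf)
    return np.stack([xp[i:i + k].max(axis=0) for i in range(x.shape[0])])

def relu(z):
    return np.maximum(z, 0.0)

def inception_1d(x, p):
    """Four parallel branches whose outputs are concatenated along channels."""
    b1 = relu(conv1d_same(x, p["b1"]))                               # 1x1
    b3 = relu(conv1d_same(relu(conv1d_same(x, p["r3"])), p["b3"]))   # 1x1 then k=3
    b5 = relu(conv1d_same(relu(conv1d_same(x, p["r5"])), p["b5"]))   # 1x1 then k=5
    bp = relu(conv1d_same(maxpool1d_same(x), p["bp"]))               # pool then 1x1
    return np.concatenate([b1, b3, b5, bp], axis=1)

rng = np.random.default_rng(0)
c_in = 6  # assumed: 3 accelerometer axes + 3 gyroscope axes
shapes = {"b1": (1, c_in, 8), "r3": (1, c_in, 4), "b3": (3, 4, 8),
          "r5": (1, c_in, 4), "b5": (5, 4, 8), "bp": (1, c_in, 8)}
params = {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}
x = rng.normal(size=(128, c_in))
y = inception_1d(x, params)  # 8 channels per branch, 32 in total

def conv_weights(k, c_in, c_out):
    """Number of weights in a 1D convolution, ignoring biases."""
    return k * c_in * c_out

# Parameter saving from a 1x1 reduction (64 -> 16 channels) before a k=5 convolution.
direct = conv_weights(5, 64, 64)
reduced = conv_weights(1, 64, 16) + conv_weights(5, 16, 64)
```

In this example the direct convolution needs 20,480 weights, whereas the reduced path needs 1,024 + 5,120 = 6,144, which is the kind of saving the 1x1 convolutions are meant to provide.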
Recurrent Neural Networks (RNNs) play an important role in human activity recognition. Among RNNs,
LSTM networks are used most widely, and researchers apply them in HAR models to capture the
temporal dependencies of data. For effective feature selection and accurate results, extracting
temporal dependencies is as important as extracting spatial characteristics. Human actions are
recorded as time-series sensory data, so temporal trends play a crucial role in modelling human
movements. LSTMs retrieve temporal characteristics from sensory data because they can model
long-term dependencies. In HAR models, capturing short and long transitions between actions is as
important as capturing the actions themselves. Although LSTMs are good at capturing temporal
features, they have some major drawbacks, and Bi-LSTM models are needed to overcome them [48]. To
recognize complex human movements such as swimming, cycling, and walking, it is crucial to identify
actions that depend on both preceding and succeeding movements. LSTMs process data in only one
direction, whereas Bi-LSTMs process input signals in both the forward and backward directions,
allowing the model to capture contextual information from both past and future time steps [49].
This provides a more comprehensive view of the temporal dependencies of the input and helps achieve
better classification results. Given these advantages, Bi-LSTMs are now considered more suitable
than LSTM networks for most analytical tasks involving complex human activities. Figure 3 depicts
the working of the Bi-LSTM.
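$$h_t = \sigma\left(\left(\overrightarrow{W}\,[h_{t-1}, x_t]\right) + \left(\overleftarrow{W}\,[h_{t+1}, x_t]\right)\right)$$ [50]
where $h_{t-1}$: information from past time steps of hidden states; $\sigma$: activation function.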
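The forward and backward processing described above can be sketched in NumPy as two independent LSTM passes, one over the sequence and one over its reverse, whose hidden states are concatenated at each time step. The gate ordering, the random weights, the dimensions, and the names lstm_pass and bilstm are illustrative assumptions rather than the implementation used in this work.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(x, W, b, hidden):
    """Run one LSTM over x of shape (time, features).

    W has shape (hidden + features, 4 * hidden); gates are ordered
    input, forget, output, candidate. Returns hidden states (time, hidden).
    """
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    states = []
    for x_t in x:
        z = np.concatenate([h, x_t]) @ W + b
        i, f, o = (sigmoid(z[k * hidden:(k + 1) * hidden]) for k in range(3))
        g = np.tanh(z[3 * hidden:])
        c = f * c + i * g          # update the cell state
        h = o * np.tanh(c)         # expose part of it as the hidden state
        states.append(h)
    return np.stack(states)

def bilstm(x, fwd, bwd, hidden):
    """Concatenate a forward pass with a time-reversed backward pass."""
    h_fwd = lstm_pass(x, *fwd, hidden)
    h_bwd = lstm_pass(x[::-1], *bwd, hidden)[::-1]  # re-align to original order
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
T, F, H = 6, 3, 4  # assumed: 6 time steps, 3 sensor axes, 4 hidden units

def make_params():
    return rng.normal(scale=0.5, size=(H + F, 4 * H)), np.zeros(4 * H)

fwd, bwd = make_params(), make_params()
x = rng.normal(size=(T, F))
out = bilstm(x, fwd, bwd, H)  # shape (T, 2 * H)
```

At the first time step the forward half of the output has seen only the first sample, while the backward half already summarizes the whole remaining sequence, which is the sense in which a Bi-LSTM uses both past and future context.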
In human activity detection, features play the most important role: recognition accuracy and model
efficiency depend on the proper selection of essential features. Feature identification and
selection is one of the most crucial parts of recognizing human behaviours efficiently, and
detecting human movements requires both temporal and spatial features. It is therefore important to
identify the features that matter most for the model. This is where the attention mechanism is
needed, to pay extra “attention” to the most relevant features [51]. The attention mechanism is
especially beneficial when not every piece of input is equally meaningful or informative. In the
HAR domain, most researchers currently use attention to concentrate on the particular time steps or
movements that are most indicative of particular activities. Attention helps the model focus on the
most relevant parts of the input data by highlighting the time steps or features that are crucial
for distinguishing between different activities. This can also enhance the model’s ability to
recognize activities of varying duration or complexity [52]. Hence, attention is incorporated in
the proposed model to pick out the most relevant features.
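As a minimal sketch of how attention can emphasize informative time steps, the following NumPy example scores each hidden state, normalizes the scores with a softmax, and forms a weighted sum. The additive-style scoring with a single vector v, the toy values, and the function names are assumptions for illustration; in practice v is learned, and the attention variant used in the proposed model may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def temporal_attention(h, v):
    """h: (time, d) hidden states; v: (d,) scoring vector.

    Returns the context vector (d,) and the attention weights (time,).
    """
    scores = np.tanh(h) @ v       # one relevance score per time step
    weights = softmax(scores)     # non-negative, sums to 1
    context = weights @ h         # weighted sum of the hidden states
    return context, weights

# Toy sequence in which the second time step carries the strongest signal.
h = np.array([[0.1, 0.0],
              [2.0, 1.0],
              [0.2, 0.1]])
v = np.array([1.0, 1.0])
context, weights = temporal_attention(h, v)
```

Here the second time step receives the largest weight, so the context vector is dominated by it, while the weaker steps contribute proportionally less.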
4. Proposed Model
This section proposes a hybrid deep architecture for recognizing human actions effectively. The
proposed DL-based framework combines Bi-LSTM, Inception, and attention mechanisms to form the
classification framework. The Bi-LSTM-Inception model, with an attention mechanism attached to both
components, is suggested in order to obtain better predictive outcomes in the HAR domain by
addressing potential literature gaps. The working mechanism of the proposed framework is described
step by step below. First, data must be collected in order to deploy the proposed algorithm and
analyze the results. Multivariate time-series data are gathered from built-in smartphone sensors;
sensors such as gyroscopes and tri-axial accelerometers provide fine-grained sensory data. For
evaluation, datasets collected through smartphone sensors are used. However, data collected with
mobile phone sensors are generally noisy and not in a suitable format, and the underlying patterns
cannot be recognized from such noisy raw data. Unwanted noise must therefore be removed from the
raw data by pre-processing before the data are fed to the classification model. The pre-processed
data are then fed to the model to obtain the final classification output. The model is divided into
two parts: one extracts the spatial features of the data, while the other retrieves its temporal
dependencies. After the spatio-temporal features have been retrieved, they are passed through the
subsequent layers and finally to the softmax layer, which produces the final output of the activity
recognition model. The model is then validated using cross-validation, comparison with models from
the existing literature, and evaluation of performance metrics. Figure 4 gives an overview of the
proposed model with its components and layers.
Fig 4: Proposed Model Flow Diagram
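The final stage of the pipeline (global average pooling, dropout, and softmax) can be sketched in NumPy as follows. The feature-map size, the six output classes, the dropout rate, the random weights, and the function names are assumptions for illustration and do not reproduce the trained model.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout, applied only during training."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def classification_head(features, W, b):
    """features: (time, channels) -> class probabilities (n_classes,)."""
    pooled = features.mean(axis=0)     # global average pooling over time
    logits = pooled @ W + b            # dense layer
    logits = logits - logits.max()     # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()             # softmax

rng = np.random.default_rng(0)
features = rng.normal(size=(128, 32))  # assumed spatio-temporal feature map
W = rng.normal(scale=0.1, size=(32, 6))
b = np.zeros(6)

# At inference time dropout is the identity, so it is skipped here.
probs = classification_head(features, W, b)
predicted_class = int(probs.argmax())
```

Global average pooling reduces each channel to a single value, so the dense layer before the softmax needs only channels x classes weights regardless of the window length.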
To evaluate the performance of the human activity detection model, this study uses two publicly
available smartphone-based datasets: UCI HAR and WISDM. The two datasets are described as follows:
UCI HAR [53]: This standard dataset comes from the “University of California Irvine
(UCI) Machine Learning” repository, which is openly accessible to the public. It is a
balanced dataset. It was gathered from thirty individuals, ranging in age from 19 to 48,
who performed six distinct activities of everyday living: “sitting”, “standing”, “walking”,
“lying”, “walking upstairs”, and “walking downstairs”. A “Samsung Galaxy S II” smartphone
with an integrated gyroscope and accelerometer, positioned on the waist, was used to gather
the data. The dataset was collected under supervision in a laboratory setting. The
researchers measured the tri-axial angular velocity and tri-axial linear acceleration at a
constant sampling rate of 50 Hz. The dataset consists of 748,406 samples; further details
are given in Table 2.
Table 2: Activities involved in UCI HAR
Activities            Samples    Percentage
Walking               122,091    16.3%
Sitting               126,677    16.9%
Standing              138,105    18.5%
Laying                136,865    18.3%
Walking Upstairs      116,707    15.6%
Walking Downstairs    107,961    14.4%
WISDM Dataset [54]: This dataset is provided by the “Wireless Sensor Data Mining
(WISDM)” lab. The dataset contains a total of 1,098,209 samples. It is an unbalanced
dataset. It covers six basic activities: “standing”, “sitting”, “walking”, “upstairs”,
“downstairs”, and “jogging”. The activity “walking” accounts for the largest share of
samples (38.6%), while “standing” accounts for only 4.4%. The data were collected from 36
subjects, who performed specific daily actions with an Android smartphone placed in their
front leg pocket. The smartphone’s built-in accelerometer was used at a sampling frequency
of 20 Hz. Further information is provided in Table 3.
Table 3: Activities involved in WISDM
Activities            Samples    Percentage
Walking               424,400    38.6%
Jogging               342,177    31.2%
Walking Upstairs      122,869    11.2%
Walking Downstairs    100,427    9.1%
Sitting               59,939     5.5%
Standing              48,397     4.4%
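The class shares in Table 3 can be reproduced directly from the sample counts, and the same counts can be turned into inverse-frequency class weights, one common way to compensate for an unbalanced dataset. The weighting formula below is an illustrative assumption and is not stated to be part of the proposed method.

```python
# Sample counts per activity, taken from Table 3.
wisdm_counts = {
    "Walking": 424400,
    "Jogging": 342177,
    "Walking Upstairs": 122869,
    "Walking Downstairs": 100427,
    "Sitting": 59939,
    "Standing": 48397,
}

total = sum(wisdm_counts.values())

# Share of each activity as a percentage of all samples.
shares = {name: 100.0 * n / total for name, n in wisdm_counts.items()}

# Assumed "balanced" weighting: total / (number of classes * class count).
n_classes = len(wisdm_counts)
class_weights = {name: total / (n_classes * n) for name, n in wisdm_counts.items()}

for name in wisdm_counts:
    print(f"{name:<20}{shares[name]:6.1f}%   weight {class_weights[name]:.3f}")
```

With this weighting, rare activities such as “standing” receive a weight several times larger than frequent ones such as “walking”, so that errors on minority classes count more during training.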
Raw signals collected from sensors such as IMUs, body-worn sensors, or smartphone sensors generally
contain many diverse and uninformative data dimensions, along with noise and unwanted parameters.
Such noisy data must be refined and converted into a clean form so that it can be readily fed to
the model for classification.
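One possible pre-processing chain is sketched below: moving-average smoothing to attenuate noise, per-axis standardization, and segmentation into fixed-length overlapping windows labelled by majority vote. The specific steps, the window length of 128 samples with a step of 64, the synthetic signal, and all function names are assumptions for illustration; the pre-processing actually applied in this work may differ.

```python
import numpy as np

def moving_average(signal, k=3):
    """Smooth each axis of a (time, axes) signal with a length-k average."""
    kernel = np.ones(k) / k
    return np.stack(
        [np.convolve(signal[:, a], kernel, mode="same") for a in range(signal.shape[1])],
        axis=1,
    )

def standardize(signal):
    """Zero mean and unit variance per axis; constant axes are left at zero."""
    mean = signal.mean(axis=0)
    std = signal.std(axis=0)
    return (signal - mean) / np.where(std == 0, 1.0, std)

def sliding_windows(signal, labels, size, step):
    """Cut the signal into windows and label each one by majority vote."""
    xs, ys = [], []
    for start in range(0, len(signal) - size + 1, step):
        xs.append(signal[start:start + size])
        values, counts = np.unique(labels[start:start + size], return_counts=True)
        ys.append(values[counts.argmax()])
    return np.stack(xs), np.array(ys)

# Synthetic stand-in for a tri-axial recording with two consecutive activities.
rng = np.random.default_rng(0)
signal = rng.normal(size=(300, 3))
labels = np.array([0] * 150 + [1] * 150)

clean = standardize(moving_average(signal))
X, y = sliding_windows(clean, labels, size=128, step=64)  # 50% overlap
```

The resulting array X has shape (windows, time steps, axes), which is the layout a convolutional or recurrent model typically expects as input.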