
Received February 29, 2020, accepted March 17, 2020, date of publication March 20, 2020, date of current version March 31, 2020.

Digital Object Identifier 10.1109/ACCESS.2020.2982225

LSTM-CNN Architecture for Human Activity Recognition
KUN XIA , JIANGUANG HUANG , AND HANYU WANG
University of Shanghai for Science and Technology, Shanghai 200093, China
Corresponding author: Jianguang Huang ([email protected])
This work was supported in part by the National Natural Science Foundation of China under Grant 512070901, and in part by the Key
Project of National Natural Science Foundation of China under Grant 51637001.

ABSTRACT In recent years, traditional pattern recognition methods have made great progress. However, these methods rely heavily on manual feature extraction, which may hinder the generalization performance of the model. With the increasing popularity and success of deep learning methods, using these techniques to recognize human actions in mobile and wearable computing scenarios has attracted widespread attention. In this paper, a deep neural network that combines convolutional layers with long short-term memory (LSTM) is proposed. The model extracts activity features automatically and classifies them with few model parameters. LSTM is a variant of the recurrent neural network (RNN) that is more suitable for processing temporal sequences. In the proposed architecture, the raw data collected by mobile sensors is fed into a two-layer LSTM followed by convolutional layers. In addition, a global average pooling (GAP) layer replaces the fully connected layer after convolution to reduce model parameters. Moreover, a batch normalization (BN) layer is added after the GAP layer to speed up convergence, with a clear effect. The model performance was evaluated on three public datasets (UCI-HAR, WISDM, and OPPORTUNITY). The overall accuracy of the model is 95.78% on the UCI-HAR dataset, 95.85% on the WISDM dataset, and 92.63% on the OPPORTUNITY dataset. The results show that the proposed model is more robust and has better activity detection capability than some previously reported models. It not only extracts activity features adaptively but also has fewer parameters and higher accuracy.

INDEX TERMS Human activity recognition, convolution, long short-term memory, mobile sensors.

I. INTRODUCTION
Human activity recognition (HAR) plays an important role in people’s daily lives because it can learn high-level knowledge about human activities from raw sensor data [1]. With the development of human-computer interaction applications, HAR has become a popular research direction at home and abroad. By extracting features from daily activities, the type of human motion can be classified automatically and the information that the human body needs to convey can be obtained, which in turn provides a basis for other intelligent applications. To date, this technology has been widely used in home behavior analysis [2], video surveillance [3], gait analysis [4], and gesture recognition [5].
Due to the rapid development of sensor technology and ubiquitous computing technology, sensor-based HAR has become more and more popular, and it is widely used with privacy being well protected. Researchers have explored the role of different types of sensing technology in activity recognition to improve recognition accuracy. According to the manner in which sensors are employed in an environment, the technologies of human activity recognition can be broadly divided into two categories: approaches based on fixed sensors and approaches based on mobile sensors [6].
The methods based on fixed sensors obtain information from sensors mounted at a fixed position, including acoustic sensors [7], radars [8], static cameras [9], and other ambient sensors. Among them, camera-based methods are the most popular, in which the background subtraction method, the optical flow method, and the energy-based segmentation method are usually applied

The associate editor coordinating the review of this manuscript and approving it for publication was Yongping Pan.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020 56855
K. Xia et al.: LSTM-CNN Architecture for Human Activity Recognition

to extract features [10]–[13]. A representative example is an image processing method based on Kinect sensors, which can acquire the depth image features of moving targets; on this basis, Jun Liu et al. [10] proposed a spatio-temporal long short-term memory (ST-LSTM) network to recognize activities. Kitani et al. [11] presented a sparse optical flow algorithm to acquire the histogram of human motion features and proposed an unsupervised Dirichlet hybrid model to classify 11 human activities. Although these activity monitoring methods can provide good recognition accuracy, they are not suitable for many indoor environments, especially where privacy is a concern. Furthermore, the results of vision-based approaches are easily affected by illumination variations, ambient occlusion, and background change. This greatly limits their practical use.
The other class of activity recognition methods uses mobile sensors. In these methods, the information from different kinds of behaviors is usually collected from a set of dedicated body-worn motion sensors, such as accelerometers, gyroscopes, and magnetometers. Acceleration and angular velocity data change according to human motion, so they can be used to infer human activities. The miniaturization and flexibility of sensors allow individuals to wear or carry mobile devices embedded with various sensing units. This is different from fixed sensor-based approaches [14]. Moreover, these sensors have the characteristics of low cost, low power consumption, high capacity, miniaturization, and less dependence on surroundings [15]. Therefore, activity recognition based on mobile sensors has received widespread attention because of its portability and high acceptance in daily life. Correspondingly, a large body of research has explored the potential of mobile sensors for activity recognition in a ubiquitous and pervasive way. Margarito et al. [16] put accelerometers on the wrists of subjects to collect acceleration data and then used a template matching algorithm to classify 8 common sports activities. In [17], a smart life assistant system (SAIL) for the elderly and disabled was proposed. Zhu et al. [17] collected the features by means of a multi-sensor fusion strategy and recognized 13 kinds of daily activities.
The rest of this paper is organized as follows. Section II presents some current sensor-based activity recognition research that uses machine learning methods and deep learning methods. Section III describes three public datasets and the data pre-processing for the implemented network. Section IV gives details on the proposed LSTM-CNN architecture. Section V shows the experimental results and compares them with some previously reported works. Moreover, the impact of network structure and hyper-parameters on model performance is discussed. Finally, the last section concludes this research with a brief summary.

II. RELATED WORK
In recent years, an enormous amount of research has explored different sensing technologies, and a number of methods have been proposed for modeling and recognizing human activities [18]. Early research mainly used decision trees, support vector machines (SVM), naïve Bayes, and other traditional machine learning methods to classify the data collected by sensors [19]–[22]. In [19], a gradient histogram and a Fourier descriptor based on the centroid feature were used to extract the features of acceleration and angular velocity data. Then Jain et al. [19] used two classifiers, support vector machine and k-nearest neighbor (KNN), to recognize the activities of two public datasets. Jalloul et al. [20] used six inertial measurement units to construct a monitoring system. After performing network analysis, a number of network measures that satisfied the statistical test were selected to form a feature set, and then the authors used the random forest (RF) classifier to classify the activities. Finally, an overall accuracy of 84.6% was achieved. The paper [21] presented a wearable wireless accelerometer-based activity recognition system and its application in medical detection. Relief-F and sequential forward floating search (SFFS) were combined for feature selection. Finally, naïve Bayes and k-nearest neighbor (KNN) were used for activity classification and comparative analysis.
Machine learning methods may rely heavily on heuristic manual feature extraction in most daily human activity recognition tasks, which is usually limited by human domain knowledge [23]. To address this problem, researchers have turned to deep learning methods that can automatically extract appropriate features from raw sensor data during the training phase and represent the low-level original temporal features with high-level abstract sequences. In view of the successful application of deep learning models in image classification, voice recognition, natural language processing, and other fields, transferring them to the field of human activity recognition is a new research direction in pattern recognition [24]–[27]. In [24], the authors proposed to convert the data acquired by three-axis accelerometers into an “image” format, and then they used a CNN with three convolutional layers and one fully-connected layer to identify human activities. Ordóñez and Roggen [25] proposed an activity recognition classifier, which combined deep CNN and LSTM to classify 27 hand gestures and five movements. Finally, simulation results showed that the F1 scores of the two classifiers were 0.93 and 0.958, respectively. Lin et al. [26] presented a novel iterative CNN strategy with autocorrelation pre-processing capability, instead of traditional micro-Doppler image pre-processing, which can accurately classify seven activities or five subjects. This strategy used an iterative deep learning framework to automatically define and extract features. Finally, traditional supervised learning classifiers were used to label different activities based on the captured radar signals.
Although the above models can generally recognize human activities, the overall network structure is relatively complex. In addition, these models have a large number of parameters, which results in high computational cost. They are therefore difficult to use in settings that require high real-time performance. Many researchers have made great efforts in this regard. Agarwal et al. [28] proposed a lightweight deep


learning model for HAR and deployed it on Raspberry Pi 3. This model was developed using a shallow RNN in combination with the LSTM algorithm, and its overall accuracy on the WISDM dataset reached 95.78%. Although the proposed model has high accuracy and a compact architecture, it was only evaluated on one dataset with just six activities, which does not prove that the model generalizes well. The paper [29] proposed a deep learning model (InnoHAR) based on the combination of an inception neural network and a recurrent neural network to classify activities. The authors used separable convolution to replace the traditional convolution, which achieved the goal of reducing model parameters. The results were excellent, but the model converged with difficulty, causing a lot of time to be wasted in the training stage.
To address the shortcomings of the above methods, a novel deep neural network for human activity recognition was proposed, which we refer to as LSTM-CNN. The model can extract activity features automatically and classify them with few parameters. In addition, it was evaluated on three of the most widely used public datasets. The results show that the proposed model not only has high accuracy but also has good generalization ability and fast convergence speed.

III. DATASET DESCRIPTION
Information on the three public datasets is summarized in Table 1. It can be seen that there are some differences between them. The UCI-HAR dataset has the largest number of volunteers, which means that this dataset was constructed from the recordings of 30 subjects. The WISDM dataset consists of 6 activities, the same number as the UCI-HAR dataset, but it has the largest number of samples. It is also an unbalanced dataset, as discussed later. The OPPORTUNITY dataset consists of 17 activities. It was collected by 5 types of sensors, namely accelerometers, gyroscopes, magnetometers, object sensors, and ambient sensors.

TABLE 1. Information of three public datasets.

A. UCI-HAR
The UCI-HAR dataset [30] was built from the recordings of 30 subjects aged 19-48 years. During the recording, all subjects were instructed to follow an activity protocol, and they wore a smartphone (Samsung Galaxy S II) with embedded inertial sensors around their waist. The six activities of daily living are standing (Std), sitting (Sit), laying (Lay), walking (Walk), walking downstairs (Down), and walking upstairs (Up). In addition, this dataset also includes postural transitions that occur between the static postures: standing to sitting, sitting to standing, sitting to laying, laying to sitting, standing to laying, and laying to standing. In this paper, only the six basic activities were selected as input samples because the percentage of postural transitions is small. The experiments were video-recorded to manually label the data. The researchers captured 3-axial acceleration and 3-axial angular velocity data at a constant rate of 50 Hz. The number of samples in this dataset is 748406, and the detailed information is shown in Table 2.

TABLE 2. Activities of UCI-HAR.

B. WISDM
The WISDM dataset [31] has a total of 1098209 samples, and the percentage of the total samples associated with each activity is shown in Table 3. It can be seen that WISDM is an unbalanced dataset. Walking accounts for the largest share, reaching 38.6%, while standing only accounts for 4.4%. The data were collected from 36 subjects. These subjects performed certain daily activities with an Android phone in their front leg pockets. The sensor used is an accelerometer with a sampling frequency of 20 Hz, which is a built-in motion sensor of the smartphone. Six activities were recorded: standing (Std), sitting (Sit), walking (Walk), upstairs (Up), downstairs (Down), and jogging (Jog). The data collection was supervised by a dedicated person to ensure the quality of data. Fig. 1 shows the acceleration waveform of 2.56 seconds (128 points in total) of each activity with the aim of visualizing the characteristics of the raw data on each axis.

TABLE 3. Activities of WISDM.

C. OPPORTUNITY
The OPPORTUNITY dataset [32], [33] was collected in a sensor-rich environment, which includes 17 complex gestures and modes of locomotion. Overall, it contains recordings of


TABLE 4. Activities of OPPORTUNITY.

four subjects who perform morning activities in daily life scenes. Different modalities of sensors were integrated into the environment, into objects, and on the body. In terms of the sensor setting, the OPPORTUNITY challenge guidelines [33] were adopted. We only considered the sensors on the body, including 5 inertial measurement units on the sports jacket, 2 InertiaCube3 sensors on the feet, and 12 Bluetooth 3-axis acceleration sensors. As shown in Fig. 2, the yellow oval blocks denote 3-axis accelerometers and the red round blocks represent inertial measurement units, where “RSHOE” and “LSHOE” are the two InertiaCube3 sensors. During the recording, five activities of daily living (ADL) sessions and one drill session were conducted for each subject. Each sensor axis is considered as a separate channel, resulting in an input space of 113 channels. These sensors have a sampling rate of 30 Hz. In this paper, we focused only on the recognition of sporadic gestures. Thus, this is an 18-class (including the Null class) segmentation and classification problem. The gestures included in this dataset are summarized in Table 4, and the characters in parentheses denote the symbols of gestures.

FIGURE 1. Acceleration waveform of 2.56 seconds of each activity.

FIGURE 2. Placement of on-body sensors used in the OPPORTUNITY dataset.

D. DATA PRE-PROCESSING
In order to feed the proposed network with data of a fixed dimension and improve the accuracy of the model, the raw data collected by motion sensors need to be pre-processed as follows.

1) LINEAR INTERPOLATION
The datasets mentioned above are realistic and the sensors worn on the subjects are wireless. Therefore, some data may be lost during the collection process, and the lost data is usually indicated with NaN/0. To overcome this problem, the linear interpolation algorithm was used to fill the missing values in this paper.

2) SCALING AND NORMALIZATION
Using large values from channels directly to train models may lead to training bias, so it is necessary to normalize the input data to the range of 0 to 1, as shown in (1):

$$X_i = \frac{X_i - x_{i\min}}{x_{i\max} - x_{i\min}} \quad (i = 1, 2, \cdots, n) \tag{1}$$

where $n$ denotes the number of channels, and $x_{i\max}$, $x_{i\min}$ are the maximum and minimum values of the $i$-th channel, respectively.

3) SEGMENTATION
In this paper, an end-to-end human activity recognition model was implemented. The input to the model consists of a data sequence, which is a short time series extracted from the raw sensor data. In the process of data collection, the data were recorded continuously. In order to preserve the temporal relationship between the data points in an activity, a sliding window with an overlap rate of 50% was used to segment the data collected by motion sensors. For the WISDM and UCI-HAR datasets, the length of the sliding window is 128. For the OPPORTUNITY dataset, the recordings of each activity only last for a short period of time, and a short sliding window is needed to segment the data to obtain more samples.
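The first two pre-processing steps, linear interpolation and the min-max scaling of (1), can be sketched in a few lines. This is a minimal illustration and not the authors' code. The function names, the toy channel, and the choice to fill leading or trailing gaps with the nearest valid value are assumptions made for the example. Missing values encoded as 0 are not handled here.

```python
import math


def interpolate_missing(values):
    """Fill NaN entries of one channel by linear interpolation.

    Gaps at the start or end have only one valid neighbour, so they are
    filled with that neighbour's value (an assumption of this sketch).
    """
    filled = list(values)
    valid = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not valid:
        raise ValueError("channel has no valid samples")
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled


def min_max_scale(values):
    """Scale one channel to [0, 1] as in Eq. (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant channel carries no range; map it to 0 (assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy single-channel recording with two gaps (hypothetical data).
channel = [2.0, float("nan"), 6.0, float("nan"), float("nan"), 12.0]
filled = interpolate_missing(channel)
scaled = min_max_scale(filled)
```

In a multi-channel recording, both functions would be applied to each channel separately, since (1) uses per-channel minima and maxima.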


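The sliding-window segmentation with 50% overlap can be illustrated with a short sketch. The function name and the synthetic signal are hypothetical. The window lengths (128 and 24) and the 50% overlap come from the text. Dropping a trailing partial window is an assumption of this example.

```python
def sliding_windows(samples, window, overlap=0.5):
    """Cut a sequence of time steps into fixed-length, overlapping windows.

    `samples` is a list with one entry per time step (an entry may itself
    be a list of channel values). Incomplete trailing windows are dropped.
    """
    step = int(window * (1 - overlap))
    if step < 1:
        raise ValueError("overlap too large for this window length")
    last_start = len(samples) - window
    return [samples[start:start + window]
            for start in range(0, last_start + 1, step)]


# 320 synthetic time steps; with window 128 and 50% overlap the step is 64.
signal = list(range(320))
segments = sliding_windows(signal, window=128)

# A shorter window, as used for short-lived gestures, gives more segments
# from the same amount of data.
short_segments = sliding_windows(signal, window=24)
```

Each window starts half a window after the previous one, so every interior time step appears in exactly two segments.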

In this paper, the length of the sliding window for the OPPORTUNITY dataset is chosen to be 24. It is worth noting that our choice of the window size was made in an adaptive and empirical manner [29] to generate good segments for all the activities considered. Fig. 3 shows the details of the segmentation. The horizontal axis represents the sampling points and the vertical axis represents the sensor channels.

FIGURE 3. Segmentation of sensor data.

IV. PROPOSED ARCHITECTURE
The network structure of the LSTM-CNN model is shown in Fig. 4. It consists of eight layers. Firstly, the preprocessed data is fed into a two-layer LSTM with 64 neurons in total, which is used for the extraction of temporal features. Following the LSTMs are two convolutional layers, which are used for extracting spatial features. The first convolutional layer has 64 filters while the other has 128. Between the two convolutional layers is a max-pooling layer. At the end of the model, there is a global average pooling layer (GAP) followed by a batch normalization layer (BN). Finally, the output of the model is obtained from an Output layer (a dense layer with a Softmax classifier), yielding a probability distribution over classes.

FIGURE 4. Frame diagram of the LSTM-CNN model.

A. LSTM LAYERS
An RNN can take advantage of the chronological relationship between sensor readings. Although an RNN has the ability to capture temporal information from sequential data, it suffers from the vanishing gradient problem, which hinders the ability of the network to model the relationship between raw sensor data and human activities in a long context window. LSTM is a variant of RNN that can eliminate this limitation. Owing to its special memory cells, LSTM has great advantages over convolutional neural networks in extracting features from sequence data. In this paper, the input data first passes through two layers of LSTMs to better extract the temporal features in the sequence data. Each LSTM layer has 32 memory cells. The inputs are sent to different gates, including input gates, forget gates, and output gates, to control the behavior of each memory cell. The activation of each LSTM unit is calculated by the following formula:

$$h_t = \sigma(w_{i,h} \cdot x_t + w_{h,h} \cdot h_{t-1} + b) \tag{2}$$

where $h_t$ and $h_{t-1}$ represent the activation at time $t$ and $t-1$, respectively, $\sigma$ is a non-linear activation function, $w_{i,h}$ is the input-hidden weight matrix, $w_{h,h}$ is the hidden-hidden weight matrix, and $b$ is the hidden bias vector.
The output of the LSTM layer has three dimensions (samples, time steps, input dimension), while the input of the CNN needs four. In order to adapt to the input shape of the convolutional layer, the output of the second LSTM layer is dimensionally expanded, which can be presented as (samples, 1, time steps, input dimension).

B. CONVOLUTIONAL AND POOLING LAYERS
CNN has gained increasing popularity because of its ability to learn unique representations from images or speech [34]. The convolutional layer is the most important unit in a CNN, which uses convolution kernels to convolve the inputs. It works as a filter and is then activated by a non-linear activation function, as follows:

$$a_{i,j} = f\left(\sum_{m=1}^{M} \sum_{n=1}^{N} w_{m,n} \cdot x_{i+m,j+n} + b\right) \tag{3}$$

where $a_{i,j}$ is the corresponding activation, $w_{m,n}$ denotes the weights of the $M \times N$ convolution kernel, $x_{i+m,j+n}$ indicates the activation of the upper neurons connected to the neuron $(i, j)$, $b$ is the bias value, and $f$ is a non-linear function.
In this paper, the convolutional layers employ rectified linear units (ReLU) to calculate the feature maps, and the non-linear function is defined as:

$$\sigma(x) = \max(0, x) \tag{4}$$

Generally speaking, the more convolution kernels are used, the more hidden features can be mined from the input samples.
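As a rough illustration of the recurrent update in (2) and of the dimension expansion described for the LSTM output, the sketch below steps a tiny recurrent layer over a short sequence. It implements only the update written in (2), not the gated LSTM cell. Using tanh as the non-linearity, the weight values, and all names are assumptions for the example.

```python
import math


def recurrent_step(x_t, h_prev, w_ih, w_hh, b, sigma=math.tanh):
    """One application of Eq. (2): h_t = sigma(w_ih . x_t + w_hh . h_prev + b)."""
    hidden = len(b)
    h_t = []
    for k in range(hidden):
        z = b[k]
        z += sum(w_ih[k][j] * x_t[j] for j in range(len(x_t)))
        z += sum(w_hh[k][j] * h_prev[j] for j in range(hidden))
        h_t.append(sigma(z))
    return h_t


def run_sequence(xs, w_ih, w_hh, b):
    """Apply the recurrent step to every time step, starting from h = 0."""
    h = [0.0] * len(b)
    outputs = []
    for x_t in xs:
        h = recurrent_step(x_t, h, w_ih, w_hh, b)
        outputs.append(h)
    return outputs


# Hypothetical toy sizes: 3 time steps, 2 input channels, 2 hidden units.
xs = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]
w_ih = [[0.5, -0.3], [0.2, 0.4]]   # input-hidden weights
w_hh = [[0.1, 0.0], [0.0, 0.1]]    # hidden-hidden weights
b = [0.0, 0.0]

sequence_out = run_sequence(xs, w_ih, w_hh, b)  # (time steps, hidden)
expanded = [sequence_out]                       # (1, time steps, hidden)
```

For one sample, wrapping the (time steps, hidden) output in an extra leading axis gives the (1, time steps, input dimension) layout that the convolutional layers expect. A batch of samples adds the outermost `samples` axis.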


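Equations (3) and (4) can be made concrete with a small pure-Python convolution. This sketch uses zero-based offsets, no padding, and a single kernel. The 1 × 5 kernel with stride 2 mirrors the first convolutional layer's settings, but the input values, kernel weights, and bias are made up for illustration.

```python
def relu(x):
    """Eq. (4): rectified linear unit."""
    return max(0.0, x)


def conv2d_valid(inputs, kernel, bias=0.0, stride=(1, 1)):
    """Eq. (3) with zero-based offsets and no padding, followed by ReLU.

    `inputs` and `kernel` are lists of rows. The output has
    floor((size - kernel_size) / stride) + 1 entries per axis.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    rows, cols = len(inputs), len(inputs[0])
    s_rows, s_cols = stride
    out = []
    for i in range(0, rows - k_rows + 1, s_rows):
        out_row = []
        for j in range(0, cols - k_cols + 1, s_cols):
            total = bias
            for m in range(k_rows):
                for n in range(k_cols):
                    total += kernel[m][n] * inputs[i + m][j + n]
            out_row.append(relu(total))
        out.append(out_row)
    return out


# One row of 9 time steps (hypothetical values) and a 1 x 5 averaging kernel.
feature_row = [[1.0, -2.0, 3.0, 0.0, 1.0, 2.0, -1.0, 4.0, 0.5]]
kernel = [[0.2, 0.2, 0.2, 0.2, 0.2]]
activation = conv2d_valid(feature_row, kernel, bias=-0.8, stride=(1, 2))
```

With 9 inputs, a kernel width of 5, and a stride of 2, the output has floor((9 - 5) / 2) + 1 = 3 positions. The first window sums to a negative value after the bias, so ReLU clips it to zero.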

There are two convolutional layers in the LSTM-CNN model. In the first convolutional layer, 64 convolution kernels are used for feature extraction and the size of each convolution kernel is 1 × 5. The sliding step of the convolution window is 2. In the second, 128 convolution kernels are used to perform a deeper feature extraction operation on the features output from the upper layer. Each convolution kernel has a size of 1 × 3 and the convolution window in this layer has a sliding step size of 1. There is a max-pooling layer between the two convolutional layers for performing the downsampling operation. It serves two purposes. One is to reduce the parameters while maintaining dominant features; the other is to filter the interference noise caused by the unconscious jitter of the human body.

C. GLOBAL AVERAGE POOLING LAYER
Different from a classical CNN, the model in this paper uses a global average pooling layer (GAP) to replace the fully-connected layer behind the convolutional layer. At the end of a CNN, there are usually one or more fully-connected layers, which convert multi-dimensional feature maps into a 1D feature vector. Each node of the fully-connected layer is connected with the nodes of the upper layer, so the weight parameters of the fully-connected layer may account for most of the model parameters. For instance, in the model of Krizhevsky et al. [35], the first fully-connected layer FC1 has 4096 nodes, and the output of the upper pooling layer MaxPool3 has 9216 nodes. Thus, there are more than 37 million weight parameters between the MaxPool3 layer and the FC1 layer, which consumes a lot of memory and computation. Unlike the fully-connected layer, the GAP layer performs a global average pooling operation on each feature map. There is no parameter to optimize in the GAP layer. Thus, it achieves the goal of reducing global model parameters. Furthermore, GAP sums out the spatial information, so it is more robust to the spatial transformation of the input.

D. BATCH NORMALIZATION LAYER
During the training process, the distribution of the input data of each layer changes continuously because the weight parameters of the upper layer are constantly updated. Therefore, it is necessary to change the weight parameters to adapt to this new distribution, which makes network training difficult and slows down the convergence speed. To address this problem, a batch normalization layer (BN) is added after the GAP layer to accelerate the convergence of the model. The BN layer normalizes and reconstructs the input data on each batch of training samples to ensure the stability of the output of the previous layer, so as to improve the training speed and accuracy.

E. OUTPUT LAYER
In the LSTM-CNN model, the output layer consists of a fully-connected layer and a Softmax classifier. There is an important benefit to adding the fully-connected layer at the end of the model. Each node of the fully-connected layer is connected to the nodes of the upper layer so that the features extracted from the upper layer can be merged. It makes up for the shortcomings of the GAP layer in this regard.
Behind the fully-connected layer is the Softmax classifier, which converts the output of the upper layer into a probability vector whose values represent the probabilities of the classes to which the current sample belongs. The formula is as follows:

$$S_j = \frac{e^{a_j}}{\sum_{k=1}^{N} e^{a_k}} \tag{5}$$

where $N$ is the number of classes, $a$ is the output vector of the fully-connected layer, and $a_j$ is the $j$-th value of the output vector.

V. EXPERIMENTAL RESULTS
In this paper, three widely used public datasets were used to evaluate the generalization ability and the accuracy of the LSTM-CNN model. They were all recorded continuously, and a common method is to use a fixed-length sliding window to segment the sensor data. Here, the length of the window is 128, with a step size of 64. But for the OPPORTUNITY dataset, the length of the window is 24. To be specific, a subset of each dataset was used to construct a test set that is separated entirely from the training set to better evaluate the model performance. For the UCI-HAR dataset, the database was built from the recordings of 30 subjects who performed 6 activities. Among them, the recordings of 22 subjects were used to build the training set, and the rest were used to build the test set. The WISDM dataset consists of 6 activities performed by 36 subjects. The training set is composed of the recordings of 30 subjects and the remaining recordings of 6 subjects are used to build the test set. The two parts are completely separate. For the OPPORTUNITY dataset, the same subset employed in the OPPORTUNITY challenge [33] was used to train and test our models. The training set includes the full recordings of Subject 1, as well as three ADLs and drill sessions of Subjects 2 and 3. The test set is composed of ADL4 and ADL5 for Subjects 2 and 3. Table 5 details the number of instances of the test set and the training set obtained on each dataset after segmentation.

TABLE 5. Instances of three public datasets.

A. MODEL IMPLEMENTATION
Keras was used to build the proposed network structure, which is a high-level neural networks API written in Python capable of running on top of TensorFlow, CNTK, or Theano. In the experiments, TensorFlow was used as


TABLE 6. List of selected hyper-parameters. TP

as TP+FN , where TP, FP are the number of true and false
positives, respectively, and FN corresponds to the number
of false negatives. F1 score offsets imbalances in classes by
weighting classes based on their proportion of samples. The
formula of the F1 score is as follows:
X precisioni · recalli
F1 = 2 ∗ wi (6)
precisioni + recalli
i

where wi = ni /N is the proportion of samples of class i, with

ni being the number of samples of the i−th class and N being
the total number of samples.

C. EVALUATION ON THREE PUBLIC DATASETS

In order to comprehensively verify the performance of the
proposed model, three public datasets were used for test-
ing. Table 7, 8, and 9 show the classification confusion
matrices obtained when the model was predicted with the
test set of the UCI-HAR, WISDM, and OPPORTUNITY
the backend. The model training and classification were on datasets, respectively. For the UCI-HAR dataset, there were
a PC that has an E5-2620 Xeon CPU with 2.10 GHz, 64GB 2940 instances that have been correctly classified, and the
RAM and an NVIDIA QUADRO P5000 graphics card with overall accuracy reached 95.80%. There was relatively poor
16 GB memory. And the PC is equipped with an Ubuntu discrimination between sitting and standing. The recall and
operating system with 64 bits. precision were in the range of 92%∼93%. The main rea-
The model was trained in a fully supervised manner, and the gradient was back-propagated from the Softmax layer to the LSTM layer. The weights and biases of each layer were initialized with randomly selected values. Cross entropy evaluates the difference between the real distribution and the predicted probability distribution; in this paper, the cross-entropy loss function was used to measure the error between the predictions and the true values. Adam [36], a stochastic optimization algorithm based on first-order gradients, was selected as the optimizer. For the sake of efficiency, the batch size was set to 192 and the number of epochs to 200 in the training stage. Furthermore, a small learning rate of 0.001 was used to enhance the fitting ability, and the order of the training set was randomly shuffled to improve the robustness of the model. The selected hyper-parameters are listed in Table 6.

B. PERFORMANCE MEASURE
When collecting human activity data in natural environments, imbalances often occur [37]. The WISDM and OPPORTUNITY datasets mentioned above are both imbalanced. If a classifier predicts every instance as the majority class and the model is evaluated by overall classification accuracy, it can still achieve high accuracy. Therefore, the overall classification accuracy is not an appropriate measure of performance. The F-measure (F1 score) takes both false positives and false negatives into account. It combines two measures defined in terms of the number of correctly recognized samples, known in the information retrieval community as “precision” and “recall”. Thus, the F1 score is usually a more useful performance indicator than accuracy. Precision corresponds to $\frac{TP}{TP+FP}$, and recall is defined

son may be that the two activities are similar from the perspective of motion sensors. It is difficult to mine deeper information from acceleration and angular velocity data alone. When the trained model was exposed to the test set, which contains approximately 3036 new instances, the overall accuracy on the WISDM dataset (an unbalanced dataset) reached 95.75%. The OPPORTUNITY dataset is just as unbalanced as WISDM and contains 17 activities in the gesture recognition case. An overall accuracy of 92.63% was achieved. In addition, when the Null class is removed from the classification task (see Table 10), our method achieved an overall accuracy of 87.58% in the gesture recognition task.

In order to further verify the performance of the model, LSTM-CNN was compared with the CNN of Yang et al. [38] and DeepConvLSTM [25] under the same experimental scenario. All results were evaluated by the F1 score to ensure a fair and consistent comparison. Fig. 5 shows the evaluation results of the deep models mentioned above. Compared with the CNN model of Yang et al., LSTM-CNN shows a significant increase of about 7% on the OPPORTUNITY dataset and is superior to the DeepConvLSTM model. It can also be seen that LSTM-CNN outperforms the other two models on the UCI-HAR and WISDM datasets, with the best-reported result increasing by an average of 3%. It should be noted that the model parameters were greatly reduced by adding the GAP layer to the network. These results support the finding that using the GAP layer instead of a fully-connected layer brings significant advantages in HAR tasks. They also indicate that the proposed method has superior performance on different public datasets.
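For reference, the precision and recall measures named in the performance-measure subsection, and the F1 score built from them, have the following standard per-class definitions. This is the textbook form, written here as a reminder rather than as the exact formula used for the reported results; in particular, the weighted average over classes in the last line (with $n_i$ the number of samples of class $i$ and $N$ the total) is an assumption about how per-class scores might be combined on imbalanced data.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall}    &= \frac{TP}{TP + FN}, \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \\
F_1^{\text{weighted}} &= \sum_i \frac{n_i}{N}\, F_{1,i} \quad \text{(assumed averaging scheme)}.
\end{align}
```

Here $TP$, $FP$ and $FN$ are the numbers of true positives, false positives and false negatives for a given class.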

VOLUME 8, 2020 56861

K. Xia et al.: LSTM-CNN Architecture for Human Activity Recognition

TABLE 7. Classification confusion matrix on the UCI-HAR.

TABLE 8. Classification confusion matrix on the WISDM.

TABLE 9. Classification confusion matrix on the OPPORTUNITY.
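As a rough illustration of how a confusion matrix such as those in Tables 7 to 9 relates to the reported scores, the sketch below derives per-class precision, recall and F1 from a matrix of counts. The function names, the row/column convention (rows are true classes, columns are predicted classes) and the support-weighted average are assumptions made for this example, and the numbers are made up rather than taken from the tables.

```python
def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class.

    Assumed convention: cm[i][j] counts samples whose true class is i
    and whose predicted class is j.
    """
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted class differs
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def accuracy(cm):
    """Overall accuracy: correctly classified samples over all samples."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


def weighted_f1(cm):
    """Per-class F1 averaged with weights equal to each class's share of samples."""
    total = sum(sum(row) for row in cm)
    metrics = per_class_metrics(cm)
    return sum(f1 * sum(cm[k]) / total for k, (_, _, f1) in enumerate(metrics))


# Made-up two-class example: the classifier predicts the majority class every time.
majority_only = [[90, 0],
                 [10, 0]]
print("accuracy:", accuracy(majority_only))
print("weighted F1:", weighted_f1(majority_only))
print("minority-class F1:", per_class_metrics(majority_only)[1][2])
```

In this toy case the accuracy looks high because the majority class dominates, while the F1 of the minority class is zero and pulls the weighted F1 below the accuracy, which is the effect that motivates using the F1 score on imbalanced datasets.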

D. IMPACT OF NETWORK STRUCTURE ON MODEL PERFORMANCE
In this section, we explore the impact of several network structures on model performance. As shown in Table 11, five model architectures (A, B, C, D, and LSTM-CNN) were constructed for experimental comparison, and the classification results were evaluated by the number of model parameters and the F1 score on the test set. Furthermore, the computation speed in the forward phase is given in terms of training iterations. The experiments were carried out on the UCI-HAR dataset.

Model A follows the classical convolutional neural network structure, with 128 nodes in the fully-connected layer. In the classical CNN structure, the last convolutional layer is usually followed by a fully-connected layer to synthesize the features extracted by the previous layers. Although this could improve the accuracy of the model, it also brings a huge number of


TABLE 10. Classification confusion matrix on the OPPORTUNITY (without the null class).

TABLE 11. Experiments on different network architectures.

FIGURE 5. Performance of three models on three public datasets.

parameters. As can be seen, the F1 score of model A reaches 91.88% when the trained model is exposed to the test set of UCI-HAR. The model has more than 502 thousand parameters, most of which belong to the fully-connected layers, and training takes 1681 milliseconds per epoch. In model B, the GAP layer replaces the fully-connected layer behind the convolutional layer and performs a global average pooling operation on each feature map output by the upper layer, which structurally regularizes the entire network and reduces over-fitting. Model B has only 27462 parameters, about 94% fewer than model A, while the performance remains almost the same. This demonstrates the feasibility of replacing the fully-connected layer with the GAP layer. The computation speed also improves, to an average of 1202 milliseconds per epoch. However, the GAP layer concentrates the training pressure on the convolutional layers, which causes the model to converge slowly. Model C adds a BN layer after the GAP layer to stabilize the output of the upper layer. This speeds up the convergence of the model and improves accuracy; the F1 score of this model reaches 93.35%. The recordings of activities based on mobile sensors are temporal sequences, and LSTM has the ability to capture temporal information from sequential data. In model D, the data captured from mobile sensors are first fed into two layers of LSTMs and then passed to convolutional layers for feature extraction. Finally, it outperforms


FIGURE 6. Impact of optimizer on model performance.

FIGURE 7. Impact of increasing number of filters of the second convolutional layer on model performance.

the model C by 1% on average. In our work, we added another LSTM layer, for a total of 2 layers, on the basis of model D to further improve the model performance. Eventually, the F1 score on the test set reached 95.78%. It can be seen that the computation speed of model D and LSTM-CNN is greatly reduced: they take 5156 and 9416 milliseconds per epoch, respectively. This is due to the LSTM layers added to the model. The special network structure of LSTM allows it to extract temporal information effectively, but it comes at a cost. When training LSTM layers, the calculation of each time step depends on the output of the previous time step. As a result, the time steps cannot be computed in parallel, which slows down the computation speed of the model.
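The sequential dependency described above can be made concrete with a minimal sketch of an LSTM forward pass. To keep it short, the cell below has a single hidden unit and a scalar input; the function names, the weight layout and the weight values are assumptions for illustration and do not reflect the layer sizes of the proposed model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, weights):
    """One time step of a single-unit LSTM cell.

    `weights` maps each gate name to a (w_x, w_h, b) tuple (hypothetical layout).
    """
    def gate(name, activation):
        w_x, w_h, b = weights[name]
        return activation(w_x * x_t + w_h * h_prev + b)

    i = gate("input", sigmoid)         # how much new information to write
    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # candidate cell content
    c_t = f * c_prev + i * g
    h_t = o * math.tanh(c_t)
    return h_t, c_t


def lstm_forward(xs, weights):
    """Run the cell over a sequence, starting from a zero state."""
    h, c = 0.0, 0.0
    hidden_states = []
    for x_t in xs:
        # Step t needs h and c from step t-1, so this loop cannot be
        # split across time steps and run in parallel.
        h, c = lstm_step(x_t, h, c, weights)
        hidden_states.append(h)
    return hidden_states


# Arbitrary illustrative weights, identical for every gate.
toy_weights = {name: (0.5, 0.5, 0.0)
               for name in ("input", "forget", "output", "candidate")}
early = lstm_forward([1.0, 0.0, 0.0], toy_weights)
late = lstm_forward([0.0, 0.0, 1.0], toy_weights)
print(early[-1], late[-1])
```

The two toy sequences contain the same values in a different order and end in different hidden states, which shows both why the recurrence captures temporal order and why each step has to wait for the previous one. A convolution over the same sequence has no such dependency between output positions.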
FIGURE 8. Impact of batch size on model performance.

To sum up, the strategy of using the global average pooling layer and batch normalization layer to replace the fully-connected layer is effective. Moreover, using LSTM to extract temporal information improves model performance. The model proposed in this paper not only achieves high recognition accuracy but also greatly simplifies the model structure.
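The parameter saving obtained by replacing the fully-connected layer with GAP, as discussed in this section, can be sketched with a simple count. The feature-map shape used below (30 time steps by 128 channels) is hypothetical and is not taken from Table 11; only the 128 fully-connected nodes of model A and the six activity classes of UCI-HAR come from the setting described here, and the helper names are invented for the example.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


def fc_head_params(time_steps, channels, hidden, n_classes):
    """Flatten -> Dense(hidden) -> Dense(n_classes)."""
    flat = time_steps * channels
    return dense_params(flat, hidden) + dense_params(hidden, n_classes)


def gap_bn_head_params(channels, n_classes):
    """GAP -> BN -> Dense(n_classes).

    Global average pooling has no weights; batch normalization is counted
    with two trainable values (scale and shift) per channel.
    """
    bn = 2 * channels
    return bn + dense_params(channels, n_classes)


# Hypothetical output shape of the last convolutional layer.
TIME_STEPS, CHANNELS = 30, 128
fc = fc_head_params(TIME_STEPS, CHANNELS, hidden=128, n_classes=6)
gap = gap_bn_head_params(CHANNELS, n_classes=6)
print("fully-connected head:", fc)
print("GAP + BN head:", gap)
print("reduction: {:.1%}".format(1 - gap / fc))
```

The fully-connected head grows with the product of sequence length and channel count, while the GAP head depends only on the channel count, so the gap widens as the feature maps get longer. The absolute numbers depend entirely on the assumed shape and should not be read as the parameter counts of models A and B.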
E. IMPACT OF HYPER-PARAMETERS ON MODEL PERFORMANCE
Hyper-parameters have a great impact on the performance of a classification model. This section presents the impact of important hyper-parameters, namely the number of convolution filters, the batch size, and the type of optimizer, on model performance. The experiments were carried out on the UCI-HAR dataset, and the model performance was evaluated by varying a number of model parameters. The F1 score was used as the measurement criterion.

1) EFFECT OF OPTIMIZER
The optimizer updates the network parameters that affect model training and output so that they approximate or reach the optimal value, thereby minimizing the loss function. It is the essence of neural network training, so it is important to choose a suitable optimizer to train deep models. Several common optimizers, namely SGD, Adagrad, Adadelta, Adam, and RMSprop, were experimentally compared, as shown in Fig. 6. It can be seen that the model trained with the Adam optimizer has the best fitting effect and the most stable gradient descent curve. Here, Adam was used as the optimizer when training the LSTM-CNN model.

2) EFFECT OF NUMBER OF FILTERS
The more filters (namely convolution kernels) there are, the more complex and deeper features the model may learn. However, more filters also increase the model parameters, which may lead to overfitting. Thus, choosing the number of filters is of critical importance. Fig. 7 shows the accuracy and parameters of the LSTM-CNN model with a varying number of filters in the second convolutional layer. As the number of filters increases, the network parameters increase from 36614 to 62598, and the accuracy of the model increases correspondingly. The F1 score reaches 96.38% when the number of filters is 192, which is 4% higher than with 64 filters. However, the model parameters increase by more than 70%.

3) EFFECT OF BATCH SIZE
Mini-batch processing is a common method in deep learning when training neural networks. Optimizing the cumulative


error over the entire training set makes gradient descent slow and may also lead the model into a local optimum. If the error of only one sample is optimized in each iteration, the gradient descent can fluctuate drastically, which eventually makes training difficult. Fig. 8 presents the accuracy for 5 different batch sizes. It can be seen that the accuracy is highest when the batch size is 192.

VI. CONCLUSION
A novel deep neural network that combines convolutional layers with LSTM for human activity recognition was proposed in this paper. The weight parameters of a CNN are mainly concentrated in the fully-connected layer. In response to this characteristic, a GAP layer is used to replace the fully-connected layer behind the convolutional layer, which greatly reduces the model parameters while maintaining a high recognition rate. Moreover, a BN layer is added after the GAP layer to speed up the convergence of the model, with an obvious effect. In the proposed architecture, the raw data collected by mobile sensors is fed into a two-layer LSTM followed by convolutional layers, which makes the model capable of learning temporal dynamics on various time scales according to the learned parameters of the LSTMs, so as to obtain better accuracy. To demonstrate the generalization ability and effectiveness of the proposed model, three public datasets, UCI-HAR, WISDM, and OPPORTUNITY, were used for the experiments. Considering that accuracy is not an appropriate and comprehensive measure of performance, the F1 score was used to evaluate the model. The F1 score reached 95.78%, 95.85% and 92.63% on the UCI-HAR, WISDM and OPPORTUNITY datasets, respectively. Furthermore, we explored the impact of some hyper-parameters on model performance, such as the number of filters, the type of optimizer, and the batch size, and the optimal hyper-parameters were selected to train the final model. To sum up, compared with the methods proposed in the literature, the LSTM-CNN model shows consistently superior performance and good generalization. It not only avoids complex feature extraction but also achieves high recognition accuracy with few model parameters.

REFERENCES
[1] J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu, “Deep learning for sensor-based activity recognition: A survey,” Pattern Recognit. Lett., vol. 119, pp. 3–11, Mar. 2019.
[2] P. Vepakomma, D. De, S. K. Das, and S. Bhansali, “A-wristocracy: Deep learning on wrist-worn sensing for recognition of user complex activities,” in Proc. IEEE 12th Int. Conf. Wearable Implant. Body Sensor Netw. (BSN), Jun. 2015, pp. 1–6.
[3] J. Qin, L. Liu, Z. Zhang, Y. Wang, and L. Shao, “Compressive sequential learning for action similarity labeling,” IEEE Trans. Image Process., vol. 25, no. 2, pp. 756–769, Feb. 2016.
[4] N. Y. Hammerla, S. Halloran, and T. Ploetz, “Deep, convolutional, and recurrent models for human activity recognition using wearables,” 2016, arXiv:1604.08880. [Online]. Available: http://arxiv.org/abs/1604.08880
[5] Y. Kim and B. Toomajian, “Hand gesture recognition using micro-Doppler signatures with convolutional neural network,” IEEE Access, vol. 4, pp. 7125–7130, 2016.
[6] M. Cornacchia, K. Ozcan, Y. Zheng, and S. Velipasalar, “A survey on activity detection and classification using wearable sensors,” IEEE Sensors J., vol. 17, no. 2, pp. 386–403, Jan. 2017.
[7] K. Yatani and K. N. Truong, “BodyScope: A wearable acoustic sensor for activity recognition,” in Proc. ACM Conf. Ubiquitous Comput. (UbiComp), 2012, pp. 341–350.
[8] B. Cagliyan, C. Karabacak, and S. Z. Gurbuz, “Human activity recognition using a low cost, COTS radar network,” in Proc. IEEE Radar Conf., May 2014, pp. 1223–1228.
[9] X. Yang and Y. Tian, “Super normal vector for human activity recognition with depth cameras,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 5, pp. 1028–1039, May 2017.
[10] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang, “Skeleton-based action recognition using spatio-temporal LSTM network with trust gates,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 12, pp. 3007–3021, Dec. 2018.
[11] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, “Fast unsupervised ego-action learning for first-person sports videos,” in Proc. CVPR, Jun. 2011, pp. 3241–3248.
[12] M. R. Amer and S. Todorovic, “Sum product networks for activity recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, pp. 800–813, Apr. 2016.
[13] W. Lin, S. Xing, J. Nan, L. Wenyuan, and L. Binbin, “Concurrent recognition of cross-scale activities via sensorless sensing,” IEEE Sensors J., vol. 19, no. 2, pp. 658–669, Jan. 2019.
[14] I. H. Lopez-Nava and A. Munoz-Melendez, “Wearable inertial sensors for human motion analysis: A review,” IEEE Sensors J., vol. 16, no. 22, pp. 7821–7834, Nov. 2016.
[15] L. Chen, J. Hoey, C. D. Nugent, D. J. Cook, and Z. Yu, “Sensor-based activity recognition,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 42, no. 6, pp. 790–808, Nov. 2012.
[16] J. Margarito, R. Helaoui, and A. M. Bianchi, “User-independent recognition of sports activities from a single wrist-worn accelerometer: A template-matching-based approach,” IEEE Trans. Biomed. Eng., vol. 63, no. 4, pp. 788–796, Apr. 2016.
[17] C. Zhu and W. Sheng, “Wearable sensor-based hand gesture and daily activity recognition for robot-assisted living,” IEEE Trans. Syst., Man, Cybern. A, Syst. Humans, vol. 41, no. 3, pp. 569–573, May 2011.
[18] L. Chen, C. D. Nugent, and H. Wang, “A knowledge-driven approach to activity recognition in smart homes,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 6, pp. 961–974, Jun. 2012.
[19] A. Jain and V. Kanhangad, “Human activity classification in smartphones using accelerometer and gyroscope sensors,” IEEE Sensors J., vol. 18, no. 3, pp. 1169–1177, Feb. 2018.
[20] N. Jalloul, F. Poree, G. Viardot, P. L’Hostis, and G. Carrault, “Activity recognition using complex network analysis,” IEEE J. Biomed. Health Informat., vol. 22, no. 4, pp. 989–1000, Jul. 2018.
[21] P. Gupta and T. Dallas, “Feature selection and activity recognition system using a single triaxial accelerometer,” IEEE Trans. Biomed. Eng., vol. 61, no. 6, pp. 1780–1786, Jun. 2014.
[22] E. Fullerton, B. Heller, and M. Munoz-Organero, “Recognizing human activity in free-living using multiple body-worn accelerometers,” IEEE Sensors J., vol. 17, no. 16, pp. 5290–5297, Aug. 2017.
[23] Y. Bengio, “Deep learning of representations: Looking forward,” in Proc. Int. Conf. Stat. Lang. Speech Process. Berlin, Germany: Springer, 2013, pp. 1–37.
[24] Y. Zheng, Q. Liu, and E. Chen, “Time series classification using multi-channels deep convolutional neural networks,” in Proc. Int. Conf. Web-Age Inf. Manage. Cham, Switzerland: Springer, 2014, pp. 298–310.
[25] F. Ordóñez and D. Roggen, “Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition,” Sensors, vol. 16, no. 1, p. 115, 2016.
[26] Y. Lin, J. Le Kernec, S. Yang, F. Fioranelli, O. Romain, and Z. Zhao, “Human activity classification with radar: Optimization and noise robustness with iterative convolutional neural networks followed with random forests,” IEEE Sensors J., vol. 18, no. 23, pp. 9669–9681, Dec. 2018.


[27] M.-O. Mario, “Human activity recognition based on single sensor square HV acceleration images and convolutional neural networks,” IEEE Sensors J., vol. 19, no. 4, pp. 1487–1498, Feb. 2019.
[28] P. Agarwal and M. Alam, “A lightweight deep learning model for human activity recognition on edge devices,” 2019, arXiv:1909.12917. [Online]. Available: https://arxiv.org/abs/1909.12917
[29] C. Xu, D. Chai, J. He, X. Zhang, and S. Duan, “InnoHAR: A deep neural network for complex human activity recognition,” IEEE Access, vol. 7, pp. 9893–9902, 2019.
[30] J.-L. Reyes-Ortiz, L. Oneto, A. Samà, X. Parra, and D. Anguita, “Transition-aware human activity recognition using smartphones,” Neurocomputing, vol. 171, pp. 754–767, Jan. 2016.
[31] J. R. Kwapisz, G. M. Weiss, and S. A. Moore, “Activity recognition using cell phone accelerometers,” ACM SIGKDD Explor. Newslett., vol. 12, no. 2, pp. 74–82, Mar. 2011.
[32] D. Roggen, A. Calatroni, M. Rossi, T. Holleczek, K. Forster, G. Troster, P. Lukowicz, D. Bannach, G. Pirkl, A. Ferscha, J. Doppler, C. Holzmann, M. Kurz, G. Holl, R. Chavarriaga, H. Sagha, H. Bayati, M. Creatura, and J. D. R. Millan, “Collecting complex activity datasets in highly rich networked sensor environments,” in Proc. 7th Int. Conf. Networked Sens. Syst. (INSS), Jun. 2010, pp. 233–240.
[33] R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, G. Tröster, J. D. R. Millán, and D. Roggen, “The opportunity challenge: A benchmark database for on-body sensor-based activity recognition,” Pattern Recognit. Lett., vol. 34, no. 15, pp. 2033–2042, Nov. 2013.
[34] C. A. Ronaoo and S. B. Cho, “Evaluation of deep convolutional neural network architectures for human activity recognition with smartphone sensors,” in Proc. KIISE Korea Comput. Congr., 2015, pp. 858–860.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[37] C. A. Ronao and S.-B. Cho, “Human activity recognition with smartphone sensors using deep learning neural networks,” Expert Syst. Appl., vol. 59, pp. 235–244, Oct. 2016.
[38] J. Yang, M. N. Nguyen, X. L. Li, and P. P. San, “Deep convolutional neural networks on multichannel time series for human activity recognition,” in Proc. 24th Int. Joint Conf. Artif. Intell., Jun. 2015.

KUN XIA was born in China, in 1980. He received the B.Eng. degree in industrial automation and the Ph.D. degree in power electronics and power drives from the Hefei University of Technology (HFUT), Hefei, China, in 2002 and 2007, respectively. He was a Visiting Scholar with the Electrical and Computer Engineering Department, National University of Singapore, Singapore, in 2015. From 2007 to 2011, he was a Lecturer with the University of Shanghai for Science and Technology (USST), Shanghai, China, where he has been an Associate Professor and the Department Head of the Electrical Engineering Department, since 2011. His current research interests include motor control and deep learning.

JIANGUANG HUANG was born in China, in 1996. He received the B.Eng. degree from the Department of Electrical Engineering, University of Shanghai for Science and Technology (USST), Shanghai, China, in 2018, where he is currently pursuing the M.Eng. degree. His current research interests include motor control and deep learning.

HANYU WANG was born in China, in 1995. She received the B.Eng. degree from the Department of Electrical Engineering, University of Shanghai for Science and Technology (USST), Shanghai, China, in 2018, where she is currently pursuing the M.Eng. degree with the Electrical Engineering Department. Her current research interests include motor control and deep learning.