Human Activity Recognition Using CNN & LSTM: A. WISDM Dataset
Abstract: Recent developments in Artificial Intelligence (AI) in identifying objects, understanding the world, analyzing time series and predicting future sequences have drawn researchers towards novel research goals. There is a growing interest in Recurrent Neural Networks (RNN) among AI researchers today, with major applications in speech recognition, language modeling, video processing and time series analysis. Recognition of human behavior, or Human Activity Recognition (HAR), is one of the difficult problems in this field. As an assistive technology combined with innovations such as the Internet of Things (IoT), it can be primarily used for eldercare and childcare. HAR also covers a broad variety of real-life applications, ranging from healthcare to personal fitness, gaming, military applications and security. HAR can be achieved with sensors, images, smartphones or videos, and advances in Human Computer Interaction (HCI) technology have made it more common to capture behaviors using sensors such as accelerometers and gyroscopes. This paper introduces an approach that uses CNN and Long Short-Term Memory (LSTM) to predict human behaviors on the basis of the WISDM dataset.
Keywords: Human Activity Recognition, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM)
I. INTRODUCTION
Human Activity Recognition is the process of defining, assessing and understanding what sort of acts and objectives one or more agents or individuals will perform. Decisions are made on the basis of their past behavior. In a day-to-day routine, a typical person may perform major activities such as walking, running, sitting, standing, lying, walking upstairs, walking downstairs, etc. If HAR is combined with IoT technologies, defining and evaluating different human behaviors can yield smart solutions for childcare and eldercare [1]. For example, consider a child in a day care center whose parents have gone to work and want to verify what the child is doing right now or whether the child is healthy; HAR may be used to predict the actions of the child. Similarly, for guardians or caretakers watching over elderly people, this technology may be used to create a safer environment by helping to prevent certain risky acts that elderly people tend to attempt. Thus, HAR could offer attractive solutions for real-life human problems, ranging from personal fitness to gaming, security fields, the healthcare industry and even more.
CNN and RNN architectures have become more predominant with the recent trend of Deep Learning, and the application of Deep Learning models to time series of inertial sensor data is still under investigation by researchers [2], [3]. Deep learning models such as CNN and RNN take a data-driven approach to sequential information to learn discriminative characteristics from raw sensor data. Human activities are normally measured with sensors, either external or wearable, such as accelerometers and gyroscopes. Accelerometer data measures the acceleration of a person's movements and gyroscope data measures the angular velocity of the actions. Since these sensors produce large volumes of data, processing and analyzing the entire dataset automatically and correctly is an important task for HAR systems. A feature vector is extracted from the large volume of raw data collected, and an activity recognition model based on the feature vector is generated by the learning algorithm [1]. Therefore, it is essential to select a well-trained, efficient model to maximize the accuracy of the recognition process.
The rest of this paper is organized as follows. Section II gives an overview of the dataset used, the LSTM architecture, the CNN-LSTM architecture and the HAR paradigm. Section III describes the methodology used to implement the system. Experimental results are presented in Section IV. Finally, Section V concludes the paper and outlines future work arising from the results of the research.
II. LITERATURE REVIEW
A. WISDM Dataset
The dataset used in the experiment is the standard WISDM dataset, which is also known as the Smartphone and Smartwatch Activity and Biometrics dataset. It contains accelerometer and gyroscope time series sensor data collected from a smartphone and smartwatch as 51 test subjects performed 18 activities for 3 minutes each at a 20 Hz sampling rate. 36 users participated in the experiment. The dataset is available and can be
978-1-6654-1475-3
978-0-73 81 -4403-0/20/$31.00 ©2020 IEEE
Authorized licensed use limited to: Cornell University Library. Downloaded on May 23,2021 at 17:42:52 UTC from IEEE Xplore. Restrictions apply.
downloaded from the UCI machine-learning repository. The dataset contains 1,098,207 samples covering six activities: walking, jogging, upstairs, downstairs, sitting and standing. The columns are user, activity, timestamp, x-acceleration, y-acceleration and z-acceleration. The original dataset is unbalanced: walking accounts for 38.6% of the data, jogging 31.2%, upstairs 11.2%, and downstairs, sitting and standing 9.1%, 5.5% and 4.4%, respectively.
B. CNN Architecture
The basic structure and functionality of the visual cortex of the human brain inspired the CNN architecture.
[...] sequences of input data, such as each axis of the accelerometer and gyroscope data. The model learns to extract features from sequences of observations and how to map the internal features to different activity types.
[Figure: unrolled recurrent cell A (tanh) with inputs x_{t-1}, x_t and hidden states h_{t-1}, h_t]
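The six-column record layout described for the WISDM data (user, activity, timestamp and the three acceleration axes) can be read with a few lines of Python. The sketch below is illustrative only: it assumes each raw record is one comma-separated line that may end in a semicolon, and the `Reading` type, the `parse_line` helper and the sample values are hypothetical, not taken from the paper.

```python
from typing import NamedTuple, Optional


class Reading(NamedTuple):
    """One accelerometer record (hypothetical container type)."""
    user: int
    activity: str
    timestamp: int
    x: float
    y: float
    z: float


def parse_line(line: str) -> Optional[Reading]:
    """Parse one raw record; return None for malformed or incomplete lines."""
    # Assumption: fields are comma-separated and the line may end with ';'.
    fields = line.strip().rstrip(";").split(",")
    if len(fields) != 6:
        return None
    try:
        return Reading(
            int(fields[0]), fields[1], int(fields[2]),
            float(fields[3]), float(fields[4]), float(fields[5]),
        )
    except ValueError:
        # A missing or non-numeric value makes the record unusable.
        return None


# Made-up sample lines in the assumed format.
raw_lines = [
    "33,Jogging,49105962326000,-0.69,12.68,0.50;",
    "33,Jogging,49106062271000,5.01,11.26,;",  # missing z value
]
readings = [r for r in map(parse_line, raw_lines) if r is not None]
```

Skipping malformed lines instead of raising keeps a long import running; whether dropped records should be logged or counted depends on the application.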
Naive Bayes and Bayesian networks have shown accuracies of 98% and 90.57%, respectively. K-Nearest Neighbor has achieved average accuracies of 99.25% and 90.61%. Lastly, SVM has shown an accuracy of 97.5% [1].
HAR systems also face many challenges, such as large variability of a given action, similarity between classes, time consumption, and a high proportion of null values [9]. These challenges have led researchers to develop systematic feature representation methods and efficient recognition methods. In [10], researchers proposed a deep convolutional network utilizing CNN and LSTM. That work took advantage of LSTM to solve the sequential human activity recognition problem and achieved good precision. However, the complex network framework suffered from low efficiency and can hardly meet real-time requirements in practical applications.
III. METHODOLOGY
A balanced dataset is preferable for classification and prediction because each class is then equally likely to occur, without bias. Thus, the WISDM dataset was balanced by selecting the same number of data rows for each of the 6 activities, as shown in the pie chart (Fig. 5).
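One simple way to equalize the class sizes is to keep the same number of rows for every activity, namely the size of the smallest class. The paper does not state how the rows were selected, so the sketch below assumes the first rows of each class are kept in their original order; the `balance_by_truncation` helper and the toy rows are hypothetical.

```python
from collections import Counter


def balance_by_truncation(rows, label_of=lambda row: row[1]):
    """Keep the first n rows of every class, n being the smallest class size."""
    if not rows:
        return []
    counts = Counter(label_of(row) for row in rows)
    n = min(counts.values())
    kept = Counter()
    balanced = []
    for row in rows:  # original row order is preserved
        label = label_of(row)
        if kept[label] < n:
            balanced.append(row)
            kept[label] += 1
    return balanced


# Toy (user, activity) rows standing in for the real records.
rows = [
    (1, "Walking"), (1, "Walking"), (1, "Jogging"), (1, "Walking"),
    (2, "Sitting"), (2, "Jogging"), (2, "Sitting"), (2, "Walking"),
]
balanced = balance_by_truncation(rows)
balanced_counts = Counter(activity for _, activity in balanced)
```

Keeping rows in order, rather than sampling them at random, leaves consecutive sensor readings next to each other, which matters if the data is later cut into time windows.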
[...] and downstairs have multiple variations in their signals, while stationary activities, like sitting and standing, show only a small amount of variation in their accelerometer signals, as shown in Fig. 6.
The activity column, which is a categorical variable in the dataset, was then converted into numerical format. For this purpose, the LabelEncoder class from the Sklearn library was used for preprocessing. In feature scaling, all the features were scaled to the same range so that every feature contributes comparably and the prediction model weights features by their actual relevance rather than by their magnitude. Here, Sklearn's StandardScaler, which standardizes each feature by removing its mean and scaling it to unit variance, was used for the scaling.
Fig. 6. Variation nature of the signals for the six activities [panels (c) Upstairs, (d) Downstairs, (e) Sitting, (f) Standing].
Finally, the data needs to be prepared in the format required by the designated models. For this purpose, fixed-size frame segments were created from the raw signals. The procedure generates indexes by moving a fixed-size step over the signal. The step size was set to 20 for this study. The frame size used is 80 (step size x 4), which equals 4 seconds of data at the 20 Hz sampling rate, i.e., each elementary sample covers 4 seconds. The label (activity) for each segment is the most frequent class label (the mode) in that window. The resulting dataset is then split in an 8:2 ratio into training and testing datasets, respectively.
B. CNN Architecture
The CNN model was defined with two CNN hidden layers. These are followed by two dropout layers with a rate of 0.5 in order to reduce overfitting of the model to the training data. Then a dense fully connected layer is used to interpret the features extracted by the CNN hidden layers. Finally, a dense layer with the softmax activation function was added as the final layer to make predictions (Table I).
The sparse categorical cross-entropy loss function was used, and the efficient Adam version of stochastic gradient descent was used to optimize the network with a learning rate of 0.001. The CNN model was trained for 50 epochs with a batch size of 64 samples. After the model was fit, it was evaluated on the test dataset and the accuracy of the CNN model was obtained.
C. LSTM Architecture
The LSTM model was defined with a single LSTM hidden layer, followed by a dropout layer with a rate of 0.5. Then a dense fully connected layer is used to interpret the features extracted by the LSTM hidden layer. Finally, a dense layer was added as the final layer to make predictions (Table II).
To compile and train the LSTM model, the same loss function, optimizer, batch size and number of epochs were used as for the CNN model. After the model was fit, it was evaluated on the test dataset and the accuracy was obtained.
TABLE II. THE DIMENSIONAL STRUCTURE OF THE ADOPTED LSTM MODEL.
Layer      Output Shape    Param #
LSTM       None, 100       41600
Dropout    None, 100       0
Dense      None, 100       10100
Dense      None, 6         606
Total params: 52,306
Trainable params: 52,306
Non-trainable params: 0
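The frame segmentation described in Section III (frames of 80 samples, a step of 20 samples, and the most frequent label in each frame) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `make_frames` helper and the synthetic recording are hypothetical, and frames that would run past the end of the signal are simply dropped.

```python
from collections import Counter


def make_frames(samples, labels, frame_size=80, step=20):
    """Cut a signal into overlapping frames, each labelled by majority vote."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    frames, frame_labels = [], []
    # Only full frames are produced; a trailing partial frame is discarded.
    for start in range(0, len(samples) - frame_size + 1, step):
        end = start + frame_size
        frames.append(samples[start:end])
        # Most frequent activity label inside the window (the mode).
        frame_labels.append(Counter(labels[start:end]).most_common(1)[0][0])
    return frames, frame_labels


# Synthetic recording: 200 (x, y, z) samples, activity changes at index 130.
samples = [(0.1 * i, 0.0, 9.8) for i in range(200)]
labels = ["Walking"] * 130 + ["Sitting"] * 70
frames, frame_labels = make_frames(samples, labels)
```

With a step of 20 and a frame size of 80, consecutive frames share 60 samples (75% overlap), so a frame that straddles an activity change takes the label of whichever activity fills most of it.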
IV. EXPERIMENTAL RESULTS
A. Results from CNN and LSTM Models
The implementation was carried out in Python in a Jupyter notebook environment on Google Colaboratory. With the two model architectures described in the previous section, both models were compiled with the sparse categorical cross-entropy loss function and the Adam optimizer with a learning rate of 0.001. Both models were fitted using the training data and the test data with a batch size of 64 and run for 50 epochs. The training accuracy was then plotted together with the validation accuracy against the iterations to evaluate the performance of the two models (Fig. 7 and Fig. 8).
With the CNN model, a training accuracy of 99.53% and a validation accuracy of 93.46% were achieved, as shown in Fig. 7.
Fig. 8. Training and validation accuracies with the LSTM model.
Alongside the training and validation accuracies, the training and validation losses calculated during the procedure were also plotted against the number of iterations for both models. Both the training and validation losses gradually decrease with the iterations and converge in both models. The relatively low validation loss in the two models indicates that the converged models did not overfit. Figures 9 and 10 show the training and validation losses of the CNN model and the LSTM model, respectively.
Fig. 9. Losses calculated during the iterative procedure for both training and validation with the CNN model.
Fig. 10. Losses calculated during the iterative procedure for both training and validation with the LSTM model.
In addition to the accuracies, confusion matrices, which help to compare the true labels with the predicted labels graphically, were also constructed (Figs. 11 and 12). A confusion matrix contains information about the actual and predicted classifications made by a classification system, and the performance of such systems is commonly evaluated using the data in the matrix. Using the test samples, the confusion matrices were charted as follows for the two models. Here the encoded values are 0 for walking, 1 for jogging, 2 for upstairs, 3 for downstairs, 4 for sitting and 5 for standing.
15  0  0  0  3  0
 0 17  0  0  0  1
 0  0 18  0  0  0
 0  0  0 18  0  0
 6  0  0  0 12  0
 0  0  0  1  0 16
Fig. 11. Confusion matrix with CNN model.
12  0  0  0  5  1
 0 17  0  0  0  1
 0  0 18  0  0  0
 1  0  0 17  0  0
 4  0  0  0 14  0
 0  0  0  0  0 17
Fig. 12. Confusion matrix with LSTM model.
B. Comparison of Results from CNN and LSTM Models
We then constructed a classification report for the two models from the classification results. From it, we concluded that the CNN model achieves considerably better accuracy than the LSTM model (Table III).
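The quantities in a classification report (precision, recall and F1-score per class, plus overall accuracy) follow directly from a confusion matrix. The following is a minimal sketch, assuming rows hold the true labels and columns the predicted labels, an orientation the text does not state; the `per_class_metrics` helper is hypothetical, and the matrix is the one printed for Fig. 11. It illustrates the arithmetic only and does not attempt to reproduce Table III.

```python
def per_class_metrics(cm):
    """Return (accuracy, [(precision, recall, f1), ...]) for a square matrix.

    Assumption: cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    report = []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                          # row sum: true class k
        predicted = sum(cm[i][k] for i in range(n))  # column sum: predicted k
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report.append((precision, recall, f1))
    return accuracy, report


# Matrix as printed for Fig. 11 (classes 0..5: walking, jogging, upstairs,
# downstairs, sitting, standing).
cnn_cm = [
    [15, 0, 0, 0, 3, 0],
    [0, 17, 0, 0, 0, 1],
    [0, 0, 18, 0, 0, 0],
    [0, 0, 0, 18, 0, 0],
    [6, 0, 0, 0, 12, 0],
    [0, 0, 0, 1, 0, 16],
]
accuracy, report = per_class_metrics(cnn_cm)
```

Averaging the per-class tuples gives macro-averaged precision, recall and F1-score; if the matrix orientation is the reverse of the one assumed here, precision and recall swap while accuracy and F1-score are unchanged.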
TABLE III. RESULTS COMPARISON BETWEEN CNN AND LSTM MODELS.
Model               CNN Model    LSTM Model
Accuracy            99.53%       84.71%
Average Precision   94%          77%
Average Recall      93%          79%
Average F1-score    93%          76%
V. CONCLUSION
In this paper, we have presented a CNN model and an LSTM model with 99.53% and 84.71% accuracy, respectively, for 6 daily life activities with the WISDM dataset. The use of Conv2D layers for the CNN, dropout regularization and well-chosen hyperparameters in the two networks has made them fast and robust in terms of speed and accuracy. As further work, the authors propose using the presented Human Activity Recognition framework as a solution for a smart childcare or eldercare monitoring system based on IoT technologies. It would also be worthwhile to generate our own dataset using appropriate sensors and applications for a defined number of frequent activities that people perform in their day-to-day lives. This research area appears to have multiple advanced applications with Deep Learning in the near future. In addition, as future work the authors suggest applying the reinforcement learning paradigm to the domain of activity recognition and classification.
ACKNOWLEDGMENT
The authors gratefully acknowledge the support from the Ministry of Science and Technology of Taiwan through its grant 108-2221-E-305-012, the National Taipei University through its grant 109-NTPU_0RDA-F-006 and the University System of Taipei Joint Research Program through its grant USTP-NTPU-TMU-109-01.
REFERENCES
[1] L. Alpoim, A. F. da Silva, and C. P. Santos, "Human Activity Recognition Systems: State of Art," in 2019 IEEE 6th Portuguese Meeting on Bioengineering (ENBENG), Lisbon, Portugal, Feb. 2019, pp. 1-4, doi: 10.1109/ENBENG.2019.8692468.
[2] S. Oniga and J. Suto, "Human activity recognition using neural networks," in Proceedings of the 2014 15th International Carpathian Control Conference (ICCC), Velke Karlovice, Czech Republic, May 2014, pp. 403-406, doi: 10.1109/CarpathianCC.2014.6843636.
[3] J. R. Kwapisz, G. M. Weiss, and S. A. Moore, "Activity recognition using cell phone accelerometers," SIGKDD Explor. Newsl., vol. 12, no. 2, pp. 74-82, Mar. 2011, doi: 10.1145/1964897.1964918.
[4] A. Murad and J.-Y. Pyun, "Deep Recurrent Neural Networks for Human Activity Recognition," Sensors, vol. 17, no. 11, p. 2556, Nov. 2017, doi: 10.3390/s17112556.
[5] C. Jobanputra, J. Bavishi, and N. Doshi, "Human Activity Recognition: A Survey," Procedia Computer Science, vol. 155, pp. 698-703, 2019, doi: 10.1016/j.procs.2019.08.100.
[6] P. Kuppusamy and C. Harika, "Human Action Recognition using CNN and LSTM-RNN with Attention Model," International Journal of Innovative Technology and Exploring Engineering (IJITEE), vol. 8, no. 8, pp. 1639-1643, 2019.
[7] Y. Chen, K. Zhong, J. Zhang, Q. Sun, and X. Zhao, "LSTM Networks for Mobile Human Activity Recognition," presented at the 2016 International Conference on Artificial Intelligence: Technologies and Applications, Bangkok, Thailand, 2016, doi: 10.2991/icaita-16.2016.13.
[8] C. Hofmann, C. Patschkowski, B. Haefner, and G. Lanza, "Machine Learning Based Activity Recognition To Identify Wasteful Activities In Production," Procedia Manufacturing, vol. 45, pp. 171-176, 2020, doi: 10.1016/j.promfg.2020.04.090.
[9] L. B. Marinho, A. H. de Souza Junior, and P. P. Rebouças Filho, "A New Approach to Human Activity Recognition Using Machine Learning Techniques," in Intelligent Systems Design and Applications, vol. 557, A. M. Madureira, A. Abraham, D. Gamboa, and P. Novais, Eds. Cham: Springer International Publishing, 2017, pp. 529-538.
[10] T. Zebin, M. Sperrin, N. Peek, and A. J. Casson, "Human activity recognition from inertial sensor time-series using batch normalized deep LSTM recurrent networks," in 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI, Jul. 2018, pp. 1-4, doi: 10.1109/EMBC.2018.8513115.
[11] Wikipedia, "List of Python software," 2020. [Online]. Available: https://en.wikipedia.org/wiki/List_of_Python_software. [Accessed: 20-Sep-2020].